AI and the future of information literacy and information ethics
A few thoughts in advance of a panel talk being hosted on 21 February 2023 by ALIA: AI, libraries, and the changing face of information literacy. See you there.
Chatbots: Pity the chatbot. The derided “computer says no” tool, known mainly for blocking direct human interaction with shops, banks, airlines, and insurance companies, has finally found affection, and an extraordinary amount of free publicity, via OpenAI’s ChatGPT.
Released in late 2022, ChatGPT (currently running on GPT-3.5) is billed as a free research preview. How long it remains freely available is anyone’s guess. The tool is said to have triggered a new AI race among competitors including Google, Microsoft, and Baidu (more a coincidence of timing than by design, I would say), renewed reflection on AI regulation, and a fair amount of existential angst about truth, ethics, and what makes writing innately human.
We probably won’t be talking about ChatGPT a few months from now. But we will almost certainly be talking about how generative text and creative AI tools are embedded, largely invisibly, in everyday activities and applications. If we think we have a problem with ‘black-boxing’ and non-transparent algorithms now, things are about to get a whole lot more complicated. Some of these innovations will save time and effort and be genuinely fun and interesting to use. Others will further entrench the inequitable labour required to remove abhorrent content from training data, bias, and a lack of respect for creators’ rights.
AI ethics: or why this stuff keeps people up at night
AI ethics captures a vast range of concerns. Even within the library and information profession, whose very foundation is built on the value of access to quality information and patron privacy, there is insufficient debate about what new technologies and methodologies like AI and ML mean for services and the people that libraries serve. Personal data is harvested for training sets, bias is common, and algorithmic decision-making that goes wrong can have major consequences.
A growing ethical debate concerns an apparent lack of respect for creator rights when ingesting content and creating training models. ChatGPT has apparently ingested the contents of Archive of Our Own, a prominent fanfic archive, without the consent of its authors. Reddit and Wikipedia are also thought to be major sources that have helped create ChatGPT’s generic English writing style. It seems any open content on the web is fair game for these models, regardless of whether authors have meaningfully consented to their work being used in this way. “Just because you can, doesn’t mean you should” (cf. Jurassic Park), of course. Even more controversial is the use of images for generative creative works and art, or the use of AI to replace artists working on the backgrounds of an anime series. Expect to see more ethical and legal issues in this space, and a lot more lawsuits, as generative creative AI tools begin to be integrated directly into consumer software.
As for whether a tool that does not respect creator rights can itself be an author, the answer already seems clear-cut: no.
For libraries, ethical issues in AI and ML ought to trigger a resurgence in information literacy. I have said frequently over the years that I cannot understand why a media and information literacy course is not the most popular and oversubscribed course at every university (and school and public library). I mean media and information literacy skills that deconstruct sources of knowledge and vested interests, and allow people to critically appraise the veracity of what they are reading, viewing, and hearing. I often reflect on conversations around 2015 with colleagues from Ukraine, Romania, Finland, and Denmark who were ringing alarm bells about a lack of appreciation for debate, democratic values, and truth. As we have seen in Europe, the US, and elsewhere in the years since, these values have been sorely tested. Finland’s forethought in investing to ensure young people have these skills in the face of endemic misinformation has proved visionary.
A technique used by autocrats and others to drown out genuine information is called “flooding”. With the rise of more widely available generative AI text tools, I worry that the bots are about to get a lot more ammunition. Each technological development reduces the cost and time required to create an awful lot of havoc. It is trivial to create a bot army on any given issue and drown out legitimate voices on social media platforms and in search engine results. Such tactics are not unique to autocracies. Steve Bannon’s “flood the zone”, anyone? Search engines are already struggling to filter out low-quality, SEO-generated content. Add a barrage of sort-of-correct-sounding AI-generated text interspersed with errors, fake references, deepfake images, and deepfake video, and it will be even harder to find reliable information, especially in a crisis.
AI hallucinations are a longstanding phenomenon, but they are said to be accelerating with the development of large language models. What ChatGPT and other LLMs don’t ‘know’ from their training data, they may simply make up, as is evident when we see fake citations and other invented information. Basic errors of fact are one thing. But what will the response be when these errors are about people, and the incorrect facts could be seen as defamatory? Years on from the Google Spain case and the right to be forgotten, how do we ask a language model to ‘forget’ what it doesn’t really know?
Garbage in, garbage out, as the saying goes. No wonder some say ChatGPT is just a mansplainer. These risks might sound far-fetched. But consider how OpenAI’s own caution was described a couple of years ago, in the context of EU regulation of upload filters and the potential to alter or censor content for European audiences:
“…the designers of the futuristic OpenAI system, which can create limitless deepfakes for text including negative and positive customer reviews, spam and fake news that are sufficiently persuasive to be plausible as human creations, decided to raise the alarm. Indeed, perhaps not surprisingly, the public has already been alerted to the fact that the technology is too dangerous to release for fear of its potential abuse (The Guardian News, February 14, 2019).” (Romero Moreno, 2020, p. 174)
Too dangerous to release. And yet, here we are.
Like content moderation before it, ChatGPT has also revealed troubling realities about how tools are trained so that we don’t see the most hateful, violent, and illegal content. This labour is cheap for the companies that use it, but comes at a very high psychological cost to the people who undertake the work.
To address questions of bias, among other aims, there has been a groundswell of regulatory activity in China and the EU to require companies to open up their algorithms. Sina Weibo’s hot search is one algorithm that has been heavily scrutinized. While many results are genuine and organic, the list can be gamed. Members of fan clubs, for instance, spend a great deal of effort trying to influence hot searches about their favourite entertainers, while “antis” try to damage their idol’s perceived competitors. Successive crackdowns on toxic fan culture, campaigns to “clean up” the internet, an overriding emphasis on security, and regulations on data protection all provide context for the introduction, in mid-2021, of requirements for companies to disclose their algorithms. Early reviews of the approach in China suggest that while regulation has helped push for some transparency, algorithms are so complex, and what companies report at times so vague, that it is unclear what the regulations can achieve.
This could be a preview of what happens in other regions that go down the same regulatory road. For different reasons, the EU is rolling out its own approach as part of the Digital Services Act, motivated by concerns about the influence of largely US-based multinational platforms used by EU citizens, and tapping into the EU’s longstanding orientation as a regulatory leader.
Is it all doom and gloom? No
Yes, there’s a lot to worry about, and a lot to get right to ensure that the future of generative, large language model AI is ethical and trustworthy. These are keywords in the plans and minds of many regulators, and of people having a go at using OpenAI’s ChatGPT, Microsoft’s Bing, Google’s “Bard”, the forthcoming Baidu “Ernie Bot”, and whatever tools come next.
The more likely scenario in the short term is that we are going to see a lot more AI, whether real or just hype. Be extra critical of apps ending in the domain name .ai. We will see new tools embedded into search products, bibliographic tools, word processors, spreadsheets, and the like. Some of this is already happening: subject headings in some bibliographic tools are applied using machine learning, and many recommender tools rely on ML. If you are familiar with the Editor function in Microsoft Word or tools like Grammarly, expect to see more like that. Much of this will be invisible. It will make it easier to correct your spelling and get help on how to use a particular function. I’d honestly be delighted if a company put all its attention into a fully functional, automated meeting scheduler that can account for internal and external people, hybrid and online formats, and time zones. Just an idea, Microsoft.
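To make the bibliographic example a little more concrete, here is a minimal, hypothetical sketch (in Python, using scikit-learn) of what “subjects applied using machine learning” can look like under the hood: a simple classifier is trained on records that already carry human-assigned subject headings, then suggests headings for new, uncatalogued records. The titles and subject labels below are invented for illustration; real systems are far more sophisticated than this.

```python
# A minimal sketch (not any vendor's actual implementation) of ML-assisted
# subject assignment: learn from records with human-assigned subjects,
# then suggest a subject for a new record. Data below is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, invented training set: record text -> human-assigned subject heading
records = [
    ("Introduction to machine learning with Python", "Computer science"),
    ("Neural networks and deep learning", "Computer science"),
    ("A history of the Roman Empire", "History"),
    ("Medieval Europe: society and culture", "History"),
    ("Principles of macroeconomics", "Economics"),
    ("Monetary policy after the financial crisis", "Economics"),
]
texts, subjects = zip(*records)

# TF-IDF features plus logistic regression: about as simple as it gets
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, subjects)

# Suggest a subject for a new record; in practice a cataloguer would review it
new_record = "Deep learning approaches to image recognition"
print(model.predict([new_record])[0])           # suggested subject heading
print(model.predict_proba([new_record]).max())  # confidence of the suggestion
```

The point is not the particular model but the pattern: the suggestions are only as good as the catalogued examples the model learned from, which is exactly where the questions of bias and transparency raised above come back in.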
We are far from being put out of a job, or losing our humanity. Not in 2023. I am convinced there is more to do than ever, both for those who research these issues and for those working in libraries: to ensure the next wave of media and information literacy is fit for the algorithmic age, that research data is correctly described so that it can be ethically reused, and that we collaborate closely with researchers and research integrity staff so that matters of authorship, attribution, and accuracy in scholarly publishing are well understood in the face of new technologies.