Generative artificial intelligence (AI) stretches copyright law in unforeseen and uncomfortable ways. The US Copyright Office has issued guidance stating that the output of image-generating AI is not copyrightable unless human creativity went into the prompts that generated it.
Yet that leaves many questions: How much creativity is needed, and is it the same kind of creativity that an artist exercises with a paintbrush?
Another group of cases deals with text — typically novels and novelists — in which authors argue that training a model on copyrighted material is itself copyright infringement, even if the model never reproduces those texts in its output.
However, reading texts has been part of the human learning process for as long as written language has existed. While people pay to buy books, they do not pay to learn from them.
How does one make sense of this? What should copyright law mean in the age of AI? Technologist Jaron Lanier offers one answer with his idea of data dignity, which implicitly distinguishes between training, or “teaching,” a model and generating output using a model.
The former should be a protected activity, whereas output might indeed infringe on someone’s copyright, Lanier says.
This distinction is attractive for several reasons. First, copyright law protects “transformative uses ... that add something new,” and it is quite obvious that this is what AI models are doing.
Moreover, it is not as though large language models (LLMs) such as ChatGPT contain the full text of, say, George R. R. Martin’s fantasy novels, from which they are brazenly copying and pasting.
Rather, the model is an enormous set of parameters — based on all the content ingested during training — that represent the probability that one word will follow another. When these probability engines emit a Shakespearean sonnet that Shakespeare never wrote, that is transformative, even if the new sonnet is not remotely good.
Lanier sees the creation of a better model as a public good that serves everyone — even the authors whose works are used to train it. That makes it transformative and worthy of protection.
However, there is a problem with his concept of data dignity, which he fully acknowledges: It is impossible to distinguish meaningfully between “training” current AI models and “generating output” in the style of, say, novelist Jesmyn Ward.
AI developers train models by feeding them small chunks of text and asking them to predict the next word, billions of times over, tweaking the parameters slightly each time to improve the predictions.
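What might that loop look like in code? The following is a deliberately toy PyTorch sketch: the vocabulary size and model are hypothetical, and real LLMs put transformer layers between the embedding and the output head, but the loop itself has this shape.

```python
import torch
import torch.nn as nn

VOCAB = 50_000  # hypothetical vocabulary size
DIM = 128       # hypothetical embedding width

class ToyLM(nn.Module):
    """A toy next-word predictor; real LLMs insert transformer
    layers between the embedding and the output head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # tokens: (batch, length)
        return self.head(self.embed(tokens))  # logits over the next word

model = ToyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def training_step(batch):
    """One step on a (batch, length) tensor of word ids: predict each
    next word, then nudge the parameters to improve the prediction."""
    inputs, targets = batch[:, :-1], batch[:, 1:]  # shift by one word
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()    # "tweaking the parameters slightly"
    optimizer.step()
    return loss.item()
```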
Yet the same process is then used to generate output, and therein lies the problem from a copyright standpoint.
A model prompted to write like Shakespeare might start with the word “To,” which makes it slightly more probable that it would follow that with “be,” which makes it slightly more probable that the next word would be “or,” and so forth. Even so, it remains impossible to connect that output back to the training data.
Where did the word “or” come from? While it happens to be the next word in Hamlet’s famous soliloquy, the model was not copying Hamlet. It simply picked “or” out of the hundreds of thousands of words it could have chosen, all based on statistics. This is not what we humans would recognize as creativity. The model is simply maximizing the probability that we humans would find its output intelligible.
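Generation, continuing the same toy sketch, is just that forward pass run in a loop, with each next word drawn from the probabilities the model emits. This illustrates the general technique, not any vendor's actual decoder:

```python
import torch

def generate(model, prompt_tokens, n_words):
    """Sample n_words, one at a time, from the model's probabilities."""
    tokens = list(prompt_tokens)
    with torch.no_grad():
        for _ in range(n_words):
            logits = model(torch.tensor([tokens]))        # same forward pass as training
            probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next word
            tokens.append(torch.multinomial(probs, 1).item())  # a statistical pick, not a lookup
    return tokens
```

Nothing in that loop records which training documents made "or" the likely choice; the provenance is dissolved into the parameters.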
How, then, can authors be compensated for their work when appropriate? While it might not be possible to trace provenance with the current generative AI chatbots, that is not the end of the story. In the year or so since ChatGPT’s release, developers have been building applications on top of the existing foundation models.
Many use retrieval-augmented generation (RAG) to allow an AI to “know about” content that is not in its training data. If a person needs to generate text for their product catalog, they can upload their company’s data and send it to the AI model with the instruction: “Only use the data included with this prompt in the response.”
Although RAG was conceived as a way to use proprietary information without going through the labor- and compute-intensive process of training, it also incidentally creates a connection between the model’s response and the documents from which that response was created.
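In code, that connection is easy to see. Below is a minimal RAG sketch; the `index.search` retriever and `llm.complete` client are hypothetical stand-ins, not real APIs. The point is that every passage handed to the model carries the identity of its source document:

```python
def answer_with_sources(question, index, llm, k=3):
    """Retrieve passages, generate from them, and keep the provenance."""
    passages = index.search(question, top_k=k)       # hypothetical retriever
    context = "\n\n".join(
        f"[source: {p.doc_id}]\n{p.text}" for p in passages
    )
    prompt = (
        "Only use the data included with this prompt in the response.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = llm.complete(prompt)                    # hypothetical LLM call
    return answer, [p.doc_id for p in passages]      # the answer and its sources
```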
That means we now have provenance, which brings us much closer to realizing Lanier’s vision of data dignity.
If a human programmer’s currency-conversion software is published in a book, and a language model reproduces it in response to a question, that can be attributed to the original source and royalties can be allocated appropriately. The same would apply to an AI-generated novel written in the style of Ward’s Sing, Unburied, Sing.
Google’s “AI-powered overview” feature is a good example of what can be expected with RAG. As Google already has the world’s best search engine, its summarization engine should be able to respond to a prompt by running a search and feeding the top results into an LLM, which generates the overview the user asked for. The model would provide the language and grammar, but it would derive the content from the documents included in the prompt. Again, this would provide the missing provenance.
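The same pattern, with a search engine standing in as the retriever, might be sketched as follows; `search_engine.top_results` and `llm.complete` are again hypothetical stand-ins:

```python
def ai_overview(query, search_engine, llm, k=5):
    """Generate an overview grounded in, and attributable to, top search results."""
    results = search_engine.top_results(query, k)    # hypothetical search call
    snippets = "\n".join(
        f"({i + 1}) {r.url}: {r.snippet}" for i, r in enumerate(results)
    )
    prompt = (
        f"Summarize an answer to '{query}' using only these search results, "
        f"citing them by number:\n{snippets}"
    )
    return llm.complete(prompt), [r.url for r in results]  # overview plus provenance
```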
Now that we know it is possible to produce output that respects copyright and compensates authors, regulators need to step up to hold companies accountable for failing to do so, just as they are held accountable for hate speech and other forms of inappropriate content.
No one should accept leading LLM providers’ claim that the task is technically impossible; it is just another of the many business-model and ethical challenges that they can and must overcome.
Moreover, RAG offers at least a partial solution to the present AI “hallucination” problem. If an application such as Google search supplies a model with the data needed to construct a response, the probability of it generating something totally false is much lower than when it draws solely on its training data. An AI’s output thus could be made more accurate by limiting it to sources that are known to be reliable.
Humanity is only just beginning to see what is possible with this approach. RAG applications will undoubtedly become more layered and complex, but now that the tools to trace provenance exist, tech companies no longer have an excuse for copyright unaccountability.
Mike Loukides, vice president of Content Strategy for O’Reilly Media Inc, is the author of System Performance Tuning, and a co-author of Unix Power Tools and Ethics and Data Science. Tim O’Reilly, founder and CEO of O’Reilly Media Inc, is a visiting professor at University College London’s Institute for Innovation and Public Purpose and the author of WTF? What’s the Future and Why It’s Up to Us.
Copyright: Project Syndicate