Conversations about AI and copyright seem ubiquitous in the news. Lawsuits continue to proliferate, leaving most researchers and the public confused about what is and isn’t legal in new AI tools and training for AI models.
Researchers engaged with open access may have specific questions about the intersection of copyright, AI, and open access publishing. Will my open access publications be used to train AI? Can I use open access publications to train an LLM tool in my research group?
Luckily, there is now more guidance on these intersections and a better understanding of how to navigate them within our own work.
Training AI Tools on Open Access Content
Open access scholarship published by academic journals or hosted in preprint and Green Open Access archives can be useful for training some AI tools. It’s important to remember that just because this work is openly available to read, it doesn’t mean it’s “free” to use; on the contrary, it is likely copyrighted material.
Many researchers have questions about using copyrighted material to train AI tools. The Library Copyright Alliance (LCA), which represents the Association of Research Libraries (ARL) and the American Library Association (ALA), released a set of principles in 2024 affirming that using copyrighted material to train AI is generally a fair use permitted by copyright law. While many copyright infringement lawsuits over AI training are still making their way through the courts, a recent decision in Bartz v. Anthropic agrees with the LCA principles on fair use.
However, when open access material is mixed with subscription material that the library provides to Johns Hopkins students and researchers, access may be governed by additional and important database licensing restrictions. To aid researchers, our website has a list of databases that permit text and data mining, like the kind used to train AI tools.
Will My Open Access Scholarship be Used to Train AI?
Scholarship you post to a repository such as JScholarship or PubMed will likely be used to train AI tools. While publishers are, in some cases, making licensing arrangements with technology companies for content hosted on their platforms, content accessible on the open internet is likely also being used for training. In these scenarios, tech firms or researchers believe that training on open access materials is a fair use. There is no way to prevent your open access scholarship, or any information you share on the open web, from being used to train AI models.
That doesn’t mean that all training leads to negative outcomes. Particularly for scholarship that is breaking new ground or exploring topics that have not been centered in your discipline, having the perspectives of that work represented in the statistical models that generate AI outputs can be important. We know that AI tools can replicate societal biases through insufficient training, limitations of the models themselves, and challenges in AI design. When your open access scholarship is included in training data, it’s one way to passively challenge these biases.
Many publishers, including those that publish AI, are now adding information about AI training in author agreements. However, provisions that determine whether your work will be licensed to a technology company for AI use are significantly more prevalent in scholarly publishing of monograph-length works.
It’s also important to understand that your publication agreements can determine who decides whether to license your scholarly work to AI companies. If you transferred copyright in your original scholarship to a publisher, whether it’s a journal article or a monograph, the publisher could make all decisions about AI licensing and how (or if) any royalties are paid to authors.
While it may seem like authors can do little to control how AI tools use their work in training, it’s important to understand that outputs are still governed by copyright law in the United States. Generative AI tools that allow copyrighted open access works to be reproduced may not meet the legal standards for fair use. Scholars have an important role in shaping policy and ensuring that AI tools follow copyright laws. Interested authors may want to consider following or joining the Authors Alliance, an advocacy organization for authors.
Do you have additional questions about AI, open access, and copyright? Email us at copyright@jhu.edu