A Whole Bunch of Authors Discovered Their Books Have Been Harvested to Train AI—Now What?
How We Got Here and What You Can Do About It
If you’re a writer, an author, or someone who works in the creative industries more broadly, 2023 is sure to be remembered as the year AI invaded. It will be the year a lot of writers learned about large language models (LLMs), about “harvesting” (obtaining or extracting information from existing sources), and about the impact of new technologies on the creative domain. Much like climate change, how we got here is a story with a long arc (some of which I’ll unpack below), and it exemplifies how we tend to react only when the beast is upon us, after it’s already too late.
Most articles like this end rather than start with calls to action, but I’m flipping the script in an effort to showcase the urgency of our situation. These are five things you can/should do if you’re an author—because you inherently care about copyright, privacy, and the future of books:
1. Sign up to be a member of the Authors Guild. It’s $135/year, and even if all you do is read their newsletters, their advocacy matters and it will directly impact you. Apply here.
2. Write letters. I know this can feel futile, but in fact the more people who do it, the more overwhelming the response, and the more hope we have that our voices will be heard. The Authors Guild recommends sending letters to Meta CEO Mark Zuckerberg; to Google CEO Sundar Pichai; to OpenAI CEO Sam Altman; and to Microsoft CEO Satya Nadella—and even offers a template. I recommend sending to those four, plus Anthropic, Inflection, and Amazon. Send emails, but send snail mail too.
3. Write to your Congressperson. Same situation. Let your elected officials know that you’re an author and that you object to your content being used to train AI.
4. Send a complaint to Google naming ChatGPT as infringing your copyright. Go here and click on CREATE A REQUEST. Here’s the URL you’re objecting to: https://chat.openai.com.
5. Consider—and I know this is a hard one—not supporting the companies that are stealing your content. Maybe you can’t get off Facebook because you have a community there; I get it. But maybe shop less at Amazon. Use a browser that isn’t Google’s. Do one thing that exercises your power to withhold support from companies that are actively harming you as a writer/author.
Alright, so here’s what happened this week. Authors’ ire hit the roof when they discovered that their books were not just hypothetically used (harvested/stolen) to train AI; they really were. The Atlantic published a fast, easy tool that lets you search 183,000 ISBNs to see which authors’ copyrights have been violated, without those authors’ consent, for the purpose of training AI.
Before this week, the violation of copyrights had been largely the domain of publishers’ hand-wringing and Authors Guild lawsuits. I remember being in a room of publishers ten years ago, everyone at the table angsting over the impact of the Google Books Library Project, which was scanning books at a pace of 6,000 pages per hour! They were digitizing our content at an alarming rate, all under the banner of making content “free and available,” one of the tenets of “open source.”
Open source is theoretically a good thing as it applies to software: it means making source code free so that people can play around with technology, grow it, and make it better and more efficient. Where AI is concerned, however, “free and available” takes on new meaning, and “open” extends to open libraries and the Internet Archive, which since 1996 has been literally archiving the Internet. This vast storehouse of digital books is the training ground for LLMs, the foundation of generative AI. LLMs feed on our content—and thanks to decades of digitization, we have quite a feast on offer. Since many of us are just learning about LLMs, here’s a summarized definition (source: TechTarget):
Large language models (LLMs) are artificial intelligence algorithms that use large data sets to understand, summarize, generate, and predict new content. The term “generative AI” is also connected with LLMs because LLMs are themselves a type of generative AI specifically architected to generate text-based content.
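For the technically curious, here is a toy sketch in Python of the principle at work. It is just a simple word-pair tally (nowhere near a real LLM, and every name in it is invented for illustration), but it shows the core idea: everything the model can predict or generate comes from the text it was fed.

```python
# A toy "language model": tally which word follows which in the
# training text, then predict the most common follower. This is a
# deliberately crude stand-in for what LLM training does at a
# vastly larger scale: the model's output is derived entirely
# from the text it ingests.
from collections import Counter, defaultdict

def train(text):
    words = text.split()
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1  # count what follows what
    return model

def predict(model, word):
    # Return the word most often seen after `word` in the training text.
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else "(unknown)"

model = train("the lemons are mine and the lemons are ripe")
print(predict(model, "lemons"))  # prints "are"
```

Scale that tally up to billions of statistical weights and millions of books, and you have the feast described above: a system whose fluency is built out of the writing it consumed.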
The Internet Archive is an example of a place from which our data is likely being harvested. A look at their About page shows that their archive contains 41 million books and texts. They don’t offer a transparent FAQ like the Digital Public Library of America (DPLA) does. Here’s what the DPLA has to say about copyright violation:
What’s the deal with copyright and a DPLA item?
The copyright status of items in DPLA varies. DPLA links to a wide variety of different materials: many are in the public domain, while others are under rights restrictions but nonetheless publicly viewable. For individual rights information about an item, please check the “Rights” field in the metadata, or follow the link to the digital object on the content provider’s website for more information.
Translation: they’re violating copyright.
The problem with copyright is that it’s a bit like posting a sign in your yard asking that people not pick the lemons from your trees because they’re yours. Normal citizens walk by the sign and abide by your request, but serial fruit stealers will disregard the sign and snatch as many lemons as they want. We already know Amazon, Meta, and Google have loose morals; they’ll justify the taking of your lemons any which way they can, and there are no repercussions, so who cares anyway?
Last summer, I attended a conference in San Francisco called PageBreak about innovations in book publishing and the future of publishing. Tim O’Reilly was the closing keynote speaker that day, and his talk was the first time I’d heard the term “GPT” (Generative Pre-trained Transformer). He’s one of the grandfathers of “open source,” and his speaking at the event was a testament to how much open libraries are an outgrowth of open-source software. To people like O’Reilly and a lot of attendees at PageBreak, open and free digital libraries represent the democratization of publishing. Publishers and most authors do not agree, however, since unlike physical libraries, which buy specific quantities of non-returnable print books to lend out to one human being at a time, digital books are very hard to regulate and protect. If the people who control the content believe in “free and available” over publisher and author profits, we’re not seeing eye to eye. I found myself sitting in the audience circling the single note I’d scrawled—“WTF?”—in the margins of the free notebook from my swag bag. As a publisher, the products I sell are stories bound into paper books, and the mass digitization and free dissemination of those stories for mass consumption—by humans, but especially by AI—is an existential threat to an already-suffering business model.
When you listen to open-source evangelists, you realize they really believe they’re saving the world. Their arguments for open and free everything center on access, democracy, and doing right by civilization. They don’t take into account how it feels to be an author who spent years researching, writing, and toiling away at their craft, who wakes up one morning to find that a robot is being trained to write in their style. Or the totally valid fear authors have that their work is going to be mimicked and then marketed to the audiences they worked hard to cultivate, rendering their talent and their efforts—and their profits—obsolete.
The only entity that holds any true power to stem the tide is our federal government. But let’s take stock of where we are as a nation-state on this question of AI harvesting our content. Are there any rules or regulations in place? That’s a big fat NO. We do, however, have commitments! In late July, seven companies—Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI—made “commitments” to new standards for safety and security in the AI space at a meeting at the White House. What any of that means is anyone’s guess, and every report I’ve read on the topic came to the same conclusion: what was agreed to was essentially unenforceable and full of hollow promises.
As a publisher, the steps our imprints have taken to try to protect our authors also feel pretty hollow. In response to all the hullabaloo, we’ve added an AI clause to our contracts and language to our copyright page template in an effort to prevent LLMs from using our books to train AI. But again, without enforcement, I feel a lot like the lady with the sign in her yard asking people to please not take her lemons. If the lemon-snatcher wants those lemons, he’ll have them.
The only entity out there showing any kind of muscle in their efforts to protect writers and authors is the Authors Guild. If you’re an author, read their September 27, 2023, article, “You Just Found Out Your Book Was Used to Train AI. Now What?” AG has been leading the charge for years with various lawsuits, as well as actionable ideas. They’re doing the advocacy and the actions, and we—writers and authors and publishers and publishing professionals—need to be doing what they recommend and more. We’re not yet at a fever-pitch moment, but we’re getting there. AI companies using our books to train their algorithms (with or without our permission!) is a threat to our businesses, our livelihoods, and even the very nature of how we live in the world, given how central content is to our everyday lives. It is decidedly too late to turn back the tide on AI and all its many impacts on writers and authors, but it’s not too late to get active and start taking action to show these companies that we do not accept them using our content in their algorithms, and to insist to the government, our leaders, and AI companies that we want regulations and accountability—now.
For further reading:
https://www.zdnet.com/article/why-open-source-is-the-cradle-of-artificial-intelligence/