[AI Agent Diaries] The LLM industry's dirty secret and what to do about it.
The LLM industry is experiencing a mini ChatGPT-in-reverse moment this week with The Atlantic's release of the LibGen search tool, showing how much pirated material is used for LLM training.
There are two errors in the title of this diary entry. Firstly, the LLM industry doesn’t have just one dirty secret, it has many. Secondly, they are not secrets. We all know about the issues. However, it often takes a catalyst moment for something people already know about to be discussed widely enough that enough people start to care.
The dirty secret we are going to focus on today is the LLM industry’s use of copyrighted content. We’ve all known about this for a very long time. For example, there is the infamous Wall Street Journal interview with Mira Murati, the incredibly capable and intelligent ex-CTO of OpenAI, who somehow was “not sure” what data was used to train Sora, or couldn’t quite remember.
Since then we’ve learned a few more things. Most interesting are the internal Meta communications released following legal action. Through their embarrassingly superficial debates we find out that engineers explicitly discussed whether it would be ok to use LibGen - a library containing millions of pirated books - because Meta’s competitors were doing it as well. Ultimately, Mark Zuckerberg approved the use of pirated material, living his “move fast and break things” motto to the full.
Then this week The Atlantic released a simple search tool that lets anyone see whether their own books were used to train LLMs. In what I like to think of as a reverse-ChatGPT moment, people were horrified to find out explicitly that their hard work had been used to train LLMs, and posted the results on social media in absolute outrage. The enthusiasm people felt when OpenAI productised LLMs through ChatGPT and made their capabilities plain for all to see turned to horror when The Atlantic productised the ability to see the level of piracy that went on behind the scenes to train those LLMs.
Of course I did a little search for myself, and while the academic papers that showed up are in the public domain, there was also a book that very much should not be. I spent a good percentage of my weekends in 2018 and 2019 writing that book. I don’t particularly care that its contents are publicly available for anyone to read without paying, but I do care that it was used without permission by organisations with access to billions of dollars, organisations that will very much complain if I don’t pay what is due for access to their services. Now, my main job is not writing. I can’t imagine how horrified people whose livelihood depends on authoring must feel.
The LLM industry is run by modern-day robber barons
Let’s call a spade a spade.
What the LLM industry is doing is not dissimilar to what robber barons across the ages have done. Sam Altman, Mark Zuckerberg and others have identified a natural resource (content produced over centuries of human labour) and a means to exploit it (productised LLMs).
The analogies to Rockefeller and Carnegie are plain: monopolistic practices (they want their government to protect them from external competition), vertical integration (they control every stage from the data centres to the end-user products, maximising profit extraction), political influence (literally falling over themselves to gain access to government), lobbying for minimal environmental regulation, exploiting labour, displacing existing industries and workers, and financial manipulation (like the ridiculous White House $500 billion Stargate announcement or Elon Musk’s combination of X with X.ai).
What are we to do about it
There is a series of arguments against LLMs, ranging from the technology being fundamentally broken to it being unethical. They all have merit in my view. However, I also know that the technology is extremely useful, it is here, it is open source, and it is not going away.
We cannot ban LLMs and we should not. I know I am over-simplifying, but LLMs per se are not the problem. The problems are the aforementioned robber barons. They are breaking the rules without consequences, and they are then using the gains from that rule-breaking to facilitate further rule-breaking, or rule re-writing, for their benefit.
I think there are two things that need to be done.
The robber barons and the LLM industry need to pay. There are a number of mechanisms, and far more qualified people to define them, but a combination of ongoing taxation, specific fines, and prosecution to the extent the law allows should take place. The robber barons need to be stopped.
The technology requires regulation. Yes, this will slow it down; yes, some use cases are not going to be viable; yes, some startups are going to have to stop or change what they are trying to do. Yes, the regulation is not going to be perfect. That is ok. We’ve broken enough things. It is time we started fixing some of them.
Current developments in the world, and especially in the US, give us a pretty good idea of where the “move fast and break things” ethos takes us. Perhaps it’s time we tried something different.