Who Did the AI Learn From?

Elena Daehnhardt


Midjourney 7.0: Rembrandt's workshop, where students make their paint strokes, HD

Picture this: you walk into Rembrandt’s painting school in 17th century Amsterdam. Students sit hunched over their canvases, copying the master’s brushstrokes over and over again.

The students are not trying to create fake Rembrandts, obviously. They want to understand how light works, how texture emerges, how composition breathes life into a painting. Through endless imitation, they slowly develop their own artistic voice.

This apprentice-style imitation is exactly how AI models learn today. Instead of studying brushstrokes, they devour text, images, music — anything digital they can get their virtual hands on.

These AI “students” consume massive amounts of existing work to understand patterns. From this, they learn to generate something that looks new.

But here’s where it gets messy: Rembrandt’s students had permission. They were invited into his workshop.

AI models? They often learn from whatever they can scrape from the internet — public content, copyrighted material, things shared freely, and things definitely not meant for machine consumption.

So here’s my question: Should AI need permission to learn, just like those old art students needed permission to enter the master’s studio?

Copyright Law and AI Training: Where the Legal Framework Breaks Down

Copyright law was never designed with machine learning in mind — nobody saw this coming. In the old days, copying a painting for private study might be fine, but selling it without permission was trouble. With AI, the “studying” happens at an industrial scale, and the outputs can look market-ready immediately.

Some people argue that training AI on copyrighted works falls under fair use (in the United States) or text and data mining exceptions (in Europe). Fair use is a US legal doctrine that permits limited use of copyrighted material without permission. Courts weigh four factors case by case, including how the use affects the market for the original, and they give particular weight to whether the use is transformative, as with commentary, research, or (arguably) statistical pattern analysis. The idea is that analysing data for patterns is different from copying it wholesale ¹ ². High-profile disputes such as The New York Times v. OpenAI and Microsoft and Getty Images v. Stability AI are now testing exactly where that line sits in court, rather than in blog-post opinion.

Critics disagree entirely. They say creators should have control over whether their work gets used at all. After all, students had to knock on Rembrandt’s door for permission — shouldn’t AI do the same?

Both sides have valid points, and frankly, the legal system is still figuring this out.

AI Training Data Sources: Which Datasets Power Large Language Models

A large language model (LLM) is a neural network trained on massive text corpora that generates human-like language by predicting statistically likely word sequences ⁶. In Rembrandt’s workshop, if you asked a student “who taught you?”, they could point to specific canvases and say: “from here, from that painting, from the master himself.”

With today’s LLMs, good luck getting a straight answer. These digital students also learn from masters — novelists, journalists, programmers, musicians — but on a ridiculously massive scale.

We’re talking trillions of words and billions of images from datasets like Common Crawl ¹, Wikipedia ², or collections like LAION ³.

But when you ask “Who did you learn from?”, you get corporate speak: “a mixture of publicly available and licensed data.”

That corporate non-answer is like asking Rembrandt’s student about their influences and getting: “various artistic materials from multiple sources.” Useless, right?
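To make the contrast concrete, here is a toy sketch of a next-word predictor that, unlike production LLMs, remembers which source taught it each pattern. It is a simple word-pair counter, not a neural network, and the two-document corpus, the source names, and the function names are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: source names and texts are invented for this sketch.
CORPUS = {
    "novel": "the light falls on the canvas",
    "news": "the light rail opens on monday",
}


def train(corpus):
    """Count word-to-next-word transitions and record which sources showed each one."""
    counts = defaultdict(Counter)   # word -> Counter of the words that followed it
    provenance = defaultdict(set)   # (word, next_word) -> names of contributing sources
    for source, text in corpus.items():
        words = text.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
            provenance[(current, following)].add(source)
    return counts, provenance


def predict(counts, provenance, word):
    """Return the most frequent next word and the sources it was learned from."""
    if word not in counts:
        return None, set()
    following, _ = counts[word].most_common(1)[0]
    return following, provenance[(word, following)]


counts, provenance = train(CORPUS)
next_word, sources = predict(counts, provenance, "the")
print(next_word, sorted(sources))
```

Asked what follows "the", this apprentice answers "light" and can name both documents it learned that from. Keeping that kind of ledger is easy at toy scale; the open question in this post is what a useful equivalent looks like at trillions of words.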

AI Training Data Transparency: Why Disclosure of Origins Matters

Transparency isn’t just a regulatory nice-to-have — it is structurally essential for three core reasons:

Consequence of Opacity	The Practical Impact
Ingrained Biases	Training data fundamentally shapes an AI’s “personality.” An LLM trained primarily on unfiltered forum comments behaves radically differently than one trained on academic journals.
Copyright & Fair Use	Ingesting protected works en masse without permission raises severe legal questions. While tech companies argue this is “fair use”, creators rightly point out the commercial exploitation of their intellectual property.
Erosion of Trust	Users deploying these tools in production deserve to know where a model’s behaviour comes from. Are you relying on an algorithm taught by professional publications, or by scraped social media?

Without transparency, we treat AI models like mysterious, infallible geniuses instead of apprentices whose learning we can trace, debug, and understand.

AI Training Data Disclosure Framework: Three Structural Transparency Measures

Listing every single document in a multi-terabyte training dataset is computationally impractical, and laboratories will fiercely protect their proprietary blends for competitive advantage. But we can bridge the gap with a standardised disclosure framework:

Transparency Measure	Implementation Strategy
Categorical Proportions	LLM model cards must share explicit taxonomy breakdowns: e.g., “30% validated news, 20% Wikipedia, 25% licensed books, 15% scraped forums, 10% academic papers.”
Dataset Registries	Mandate the publication of registries confirming the use of massive public pools like LAION or Common Crawl.
Standardised Opt-Outs	Implement cryptographic or embedded tag protocols allowing creators to universally signal that their work is excluded from machine consumption.

This disclosure framework is like Rembrandt’s students saying: “I learned mostly in the master’s studio, sometimes in the library, occasionally in the marketplace.” It isn’t perfect, granular documentation, but it establishes honesty and context.
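The first two measures could be captured in a small machine-readable record. The sketch below is one possible shape, not an existing standard: the class name, field names, and model name are assumptions made up for illustration, and the percentages are the example figures from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingDataDisclosure:
    """Hypothetical disclosure record; the field names are illustrative, not a standard."""
    model_name: str
    category_percent: dict[str, float]   # category label -> share of training data, in percent
    public_datasets: list[str] = field(default_factory=list)  # registry of public pools used

    def validate(self):
        """Raise ValueError if any share is negative or the shares do not total 100%."""
        if any(share < 0 for share in self.category_percent.values()):
            raise ValueError("shares must be non-negative")
        total = sum(self.category_percent.values())
        if abs(total - 100) > 0.01:
            raise ValueError(f"shares add up to {total}%, expected 100%")

    def summary(self):
        """Render a one-line breakdown, largest share first."""
        ranked = sorted(self.category_percent.items(), key=lambda item: -item[1])
        return ", ".join(f"{share}% {label}" for label, share in ranked)


disclosure = TrainingDataDisclosure(
    model_name="example-model",  # placeholder, not a real model
    category_percent={
        "validated news": 30,
        "Wikipedia": 20,
        "licensed books": 25,
        "scraped forums": 15,
        "academic papers": 10,
    },
    public_datasets=["Common Crawl", "LAION"],
)
disclosure.validate()
print(disclosure.summary())
```

The validation step is the point: a breakdown that does not add up to 100% hides something, and a fixed format lets anyone check that without access to the underlying data.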

Attribution and Compensation for Human Creators in AI Training Data

Listing training data sources in every LLM’s description would matter for more than transparency: it is a matter of respect for human creators. Programmers who shared their code publicly on GitHub are now watching AI master coding skills and compete for programming jobs. Artists and writers face the same pressure.

Acknowledging human contributions would, at minimum, make AI more respected and hopefully more respectful of human society, just as Rembrandt’s students respected their master. Acknowledging sources is basic courtesy, really.

AI Training Transparency: Learning Freely While Respecting Provenance

Structured AI training disclosure represents a middle-ground transparency standard: it lets creators verify how their work was used without forcing labs to publish proprietary dataset recipes wholesale. That is the practical core of the Rembrandt-workshop analogy. Students could learn from the master, but only by stepping inside with permission and acknowledgement. AI training could work with the same spirit: learn freely where permission is granted, respect the private studios of others.
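One existing way to knock before entering is robots.txt, the long-standing convention by which a site tells crawlers what they may fetch. It is voluntary and it is not the cryptographic or embedded-tag protocol proposed earlier, but it shows the spirit. The sketch below uses Python's standard urllib.robotparser module. GPTBot and CCBot are user-agent names used by real crawlers, while the site, the paths, ResearchBot, and the may_collect helper are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt a creator might publish. The rules and paths are invented.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /portfolio/

User-agent: *
Allow: /
"""


def may_collect(robots_txt, crawler, url):
    """Return True if the given robots.txt text lets this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler, url)


essay = "https://example.com/essays/rembrandt.html"
sketch = "https://example.com/portfolio/sketch.png"

print(may_collect(ROBOTS_TXT, "GPTBot", essay))       # blocked everywhere
print(may_collect(ROBOTS_TXT, "CCBot", sketch))       # blocked in /portfolio/
print(may_collect(ROBOTS_TXT, "CCBot", essay))        # allowed outside /portfolio/
print(may_collect(ROBOTS_TXT, "ResearchBot", essay))  # falls back to the * rule
```

A crawler that runs a check like this before collecting a page is the digital version of waiting to be invited in. The weakness is that nothing forces a crawler to ask, which is why a standardised, enforceable opt-out still matters.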

The legal debate isn’t over — courts, lawmakers, and communities are still working this out. But the guiding principle stays simple: learning is valuable, but respect is essential.

If art students once acknowledged their teachers, AI should too — not because it makes the AI less impressive, but because it makes the learning process transparent and ethical. Good learning, whether with brushes or algorithms, gets stronger when it honours its sources.

This is just the beginning of a much larger conversation about how humans and AI will coexist. The sooner fair and respectful data-use norms get established, the better for everyone involved.