Elena's AI Blog

Who Did the AI Learn From?

Elena Daehnhardt

Midjourney 7.0: Rembrandt's workshop, where students make their paint strokes, HD

Picture this: you walk into Rembrandt’s painting school in 17th-century Amsterdam. Students sit hunched over their canvases, copying the master’s brushstrokes over and over again.

They are not trying to create fake Rembrandts, obviously. They want to understand how light works, how texture emerges, how composition breathes life into a painting. Through endless imitation, they slowly develop their own artistic voice.

This is exactly how AI models learn today. Instead of studying brushstrokes, they devour text, images, music — anything digital they can get their virtual hands on.

These AI “students” consume massive amounts of existing work to understand patterns. From this, they learn to generate something that looks new.

But here’s where it gets messy: Rembrandt’s students had permission. They were invited into his workshop.

AI models? They often learn from whatever they can scrape from the internet — public content, copyrighted material, things shared freely, and things definitely not meant for machine consumption.

So here’s my question: Should AI need permission to learn, just like those old art students needed permission to enter the master’s studio?

Copyright and the Digital Mess

Let me be honest — copyright law was never designed with machine learning in mind. Nobody saw this coming.

In the old days, copying a painting for private study might be fine, but selling it without permission? That’s trouble.

With AI, the “studying” happens at an industrial scale, and the outputs can look market-ready immediately.

Some people argue that training AI on copyrighted works falls under fair use (in the United States) or text and data mining exceptions (in Europe). The idea is that analysing data for patterns is different from copying it wholesale ¹ ².

Others completely disagree. They say creators should have control over whether their work gets used at all. After all, students had to knock on Rembrandt’s door for permission — shouldn’t AI do the same?

Both sides have valid points, and frankly, the legal system is still figuring this out.

Who Did the AI Learn From? 🎨🤖

In Rembrandt’s workshop, if you asked a student “who taught you?”, they could point to specific canvases and say: “from here, from that painting, from the master himself.”

With today’s large language models (LLMs), good luck getting a straight answer. These digital students also learn from masters — novelists, journalists, programmers, musicians — but on a ridiculously massive scale.

We’re talking trillions of words and images from datasets like Common Crawl ³, Wikipedia ⁴, or collections like LAION ⁵.

But when you ask “Who did you learn from?”, you get corporate speak: “a mixture of publicly available and licensed data.”

That’s like asking Rembrandt’s student about their influences and getting: “various artistic materials from multiple sources.” Useless, right?

The Transparency Deficit: Why Origins Matter

Look, I’m not being difficult here. Transparency isn’t just a regulatory nice-to-have — it is structurally essential for three core reasons:

| Consequence of Opacity | The Practical Impact |
| --- | --- |
| Ingrained Biases | Training data fundamentally shapes an AI’s “personality.” An LLM trained primarily on unfiltered forum comments behaves radically differently than one trained on academic journals. |
| Copyright & Fair Use | Ingesting protected works en masse without permission raises severe legal questions. While tech companies argue this is “fair use”, creators rightly point out the commercial exploitation of their intellectual property. |
| Erosion of Trust | Users deploying these tools in production deserve to know the provenance of the logic. Are you relying on an algorithm taught by professional publications, or by scraped social media? |

Without transparency, we treat AI models like mysterious, infallible geniuses instead of apprentices whose learning we can trace, debug, and understand.

A Proposed Disclosure Framework

Listing every single document in a multi-terabyte training dataset is impractical, and labs will fiercely protect their proprietary blends for competitive advantage. But we can bridge the gap with a standardised disclosure framework:

| Transparency Measure | Implementation Strategy |
| --- | --- |
| Categorical Proportions | LLM model cards must share explicit taxonomy breakdowns: e.g., “30% validated news, 20% Wikipedia, 25% licensed books, 15% scraped forums, 10% academic papers.” |
| Dataset Registries | Mandate the publication of registries confirming the use of massive public pools like LAION or Common Crawl. |
| Standardised Opt-Outs | Implement cryptographic or embedded tag protocols allowing creators to universally signal that their work is excluded from machine consumption. |
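
To make the opt-out idea a little more concrete, here is a rough sketch built on something that already exists: robots.txt. To be clear, this is not the cryptographic or embedded-tag protocol proposed above. It is a voluntary signal that a crawler may simply ignore. The sketch assumes a creator's site (example.com is a placeholder) that disallows GPTBot and CCBot, the crawler tokens published by OpenAI and Common Crawl, and the helper name `may_crawl` is my own invention for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a creator's site: two AI-related crawlers
# are asked to stay out, everyone else is allowed in.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/gallery/portrait.html"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
        print(f"{agent}: {may_crawl(ROBOTS_TXT, agent, page)}")
```

A well-behaved crawler would run a check like this before fetching a page. The weakness is obvious: nothing forces it to, which is exactly why a standardised, verifiable opt-out would be an improvement.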

It’s like Rembrandt’s students saying: “I learned mostly in the master’s studio, sometimes in the library, occasionally in the marketplace.” It isn’t perfect, granular documentation, but it establishes honesty and context.
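
A categorical breakdown like this is also easy to make machine-readable. The sketch below is only an illustration: the `DataSource` class and the `validate_mix` and `render_disclosure` functions are hypothetical names of mine, not part of any existing model card standard, and the numbers are the made-up example percentages from the framework above, not real figures from any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One category of training data and its share of the whole mix."""
    category: str
    share_percent: float


def validate_mix(sources: list[DataSource]) -> None:
    """Raise ValueError if shares are negative or do not add up to 100%."""
    if any(s.share_percent < 0 for s in sources):
        raise ValueError("shares must not be negative")
    total = sum(s.share_percent for s in sources)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"shares add up to {total}%, expected 100%")


def render_disclosure(sources: list[DataSource]) -> str:
    """Return a human-readable disclosure, largest category first."""
    validate_mix(sources)
    ordered = sorted(sources, key=lambda s: s.share_percent, reverse=True)
    return "\n".join(f"{s.share_percent:.0f}% {s.category}" for s in ordered)


# Illustrative numbers only, taken from the example breakdown in this post.
EXAMPLE_MIX = [
    DataSource("validated news", 30),
    DataSource("Wikipedia", 20),
    DataSource("licensed books", 25),
    DataSource("scraped forums", 15),
    DataSource("academic papers", 10),
]

if __name__ == "__main__":
    print(render_disclosure(EXAMPLE_MIX))
```

The validation step is the point: a disclosure whose categories do not add up to 100% is hiding something, even if only an "other" bucket.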

Respecting Human Creativity

Here’s what really gets to me: I would love to see training data sources listed in every LLM’s description. Not just for transparency, but out of respect for human creators.

We all know people are losing jobs to AI advances. The same programmers who shared their code publicly on GitHub are now watching AI master coding skills and compete with them for programming jobs. Same with artists and writers.

But at minimum, acknowledging human contributions would make AI more respected and hopefully more respectful of human society. Just like Rembrandt’s students respected their master.

It’s basic courtesy, really.

My Take

I keep coming back to that image of Rembrandt’s workshop. Students could learn from the master, but only by stepping inside with permission and acknowledgement. Maybe AI should work in the same spirit: learn freely where permission is granted, and respect the private studios of others.

The legal debate isn’t over — courts, lawmakers, and communities are still working this out. But the guiding principle seems simple to me: learning is valuable, but respect is essential.

If art students once acknowledged their teachers, AI should too. Not because it makes the AI less impressive, but because it makes the learning process transparent and ethical.

Good learning — whether with brushes or algorithms — gets stronger when it honours its sources.

Actually, let me be completely honest: I think this is just the beginning of a much larger conversation about how humans and AI will coexist. The sooner we figure out fair and respectful ways to handle this, the better for everyone involved.

References

  1. Fair use
  2. Text mining
  3. Common Crawl
  4. Wikipedia: Database download
  5. LAION: Large-scale Artificial Intelligence Open Network

Did you like this post? Please let me know if you have any comments or suggestions.
