Who Did the AI Learn From?

Elena Daehnhardt

Midjourney 7.0: Rembrandt's workshop, where students make their paint strokes, HD

Picture this: you walk into Rembrandt’s painting school in 17th-century Amsterdam. Students sit hunched over their canvases, copying the master’s brushstrokes over and over again.

They are not trying to create fake Rembrandts, obviously. They want to understand how light works, how texture emerges, how composition breathes life into a painting. Through endless imitation, they slowly develop their own artistic voice.

This is exactly how AI models learn today. Instead of studying brushstrokes, they devour text, images, music — anything digital they can get their virtual hands on.

These AI “students” consume massive amounts of existing work to understand patterns. From this, they learn to generate something that looks new.

But here’s where it gets messy: Rembrandt’s students had permission. They were invited into his workshop.

AI models? They often learn from whatever they can scrape from the internet — public content, copyrighted material, things shared freely, and things definitely not meant for machine consumption.

So here’s my question: Should AI need permission to learn, just like those old art students needed permission to enter the master’s studio?

Copyright and the Digital Mess

Let me be honest — copyright law was never designed with machine learning in mind. Nobody saw this coming.

In the old days, copying a painting for private study might be fine, but selling it without permission? That’s trouble.

With AI, the “studying” happens at an industrial scale, and the outputs can look market-ready immediately.

Some people argue that training AI on copyrighted works falls under fair use (in the United States) or text and data mining exceptions (in Europe). The idea is that analysing data for patterns is different from copying it wholesale ¹ ².

Others completely disagree. They say creators should have control over whether their work gets used at all. After all, students had to knock on Rembrandt’s door for permission — shouldn’t AI do the same?

Both sides have valid points, and frankly, the legal system is still figuring this out.

Who Did the AI Learn From? 🎨🤖

In Rembrandt’s workshop, if you asked a student “who taught you?”, they could point to specific canvases and say: “from here, from that painting, from the master himself.”

With today’s large language models (LLMs), good luck getting a straight answer. These digital students also learn from masters — novelists, journalists, programmers, musicians — but on a ridiculously massive scale.

We’re talking trillions of words and images from datasets like Common Crawl ¹, Wikipedia ², or collections like LAION ³.

But when you ask “Who did you learn from?”, you get corporate speak: “a mixture of publicly available and licensed data.”

That’s like asking Rembrandt’s student about their influences and getting: “various artistic materials from multiple sources.” Useless, right?

Why This Actually Matters

Look, I’m not being difficult here. Transparency isn’t just nice to have — it’s essential:

Biases: The training data shapes the AI’s “personality.” A model trained mostly on Reddit comments will sound very different from one trained on academic papers or children’s books.
Copyright Issues: Using protected works without permission raises serious ethical and legal questions. Some call it “fair use”, others call it theft.
Trust: Users deserve to know if they’re talking to a student of libraries, social media, or professional publications.

Without transparency, we treat AI models like mysterious geniuses instead of apprentices whose learning we can trace and understand.

A Practical Solution: Adding Training Sources to LLM Descriptions

Listen, listing every single document in a training dataset is impossible — the scale is massive, and companies keep some data secret for competitive reasons. But we can do better:

Share categories and proportions: “30% news articles, 20% Wikipedia, 25% books, 15% forums, 10% academic papers.”
Publish dataset registries for major public sources (LAION, Common Crawl).
Implement opt-out systems so creators can decide whether their work gets used.
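To make the first idea concrete, here is a minimal sketch of what a machine-readable source breakdown could look like. The names `SourceShare` and `format_disclosure` are my own inventions for illustration, not part of any existing standard or library, and the numbers are simply the example mix from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceShare:
    """One category of training data and its share of the whole (hypothetical type)."""
    category: str
    percent: float


def format_disclosure(shares):
    """Return a human-readable breakdown, largest category first.

    Raises ValueError when the shares do not add up to 100%,
    so an incomplete disclosure is caught before it is published.
    """
    total = sum(s.percent for s in shares)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"shares add up to {total}%, expected 100%")
    ordered = sorted(shares, key=lambda s: s.percent, reverse=True)
    return "\n".join(f"{s.percent:g}% {s.category}" for s in ordered)


# Illustrative numbers only: the example mix from the text, not real model data.
EXAMPLE_MIX = [
    SourceShare("news articles", 30),
    SourceShare("Wikipedia", 20),
    SourceShare("books", 25),
    SourceShare("forums", 15),
    SourceShare("academic papers", 10),
]

print(format_disclosure(EXAMPLE_MIX))
```

The check that everything sums to 100% is the point of the exercise: a breakdown with a large unexplained remainder would tell readers where the gaps in the disclosure are.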

It’s like Rembrandt’s students saying: “I learned mostly in the master’s studio, sometimes in the library, occasionally in the marketplace.” Not perfect documentation, but honest and helpful.
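The opt-out idea can be sketched too. One convention that some crawlers follow is a site's robots.txt file, and Python's standard library can read it. In the sketch below, the crawler name `ExampleTrainingBot`, the helper `may_train_on`, and the robots.txt content are all made up for illustration. Respecting the file is voluntary on the crawler's side, so treat this as a picture of the principle rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser


def may_train_on(robots_txt: str, crawler_name: str, url: str) -> bool:
    """Return True if the site's robots.txt lets this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler_name, url)


# A creator's site that welcomes ordinary crawlers
# but opts out of one (hypothetical) training crawler.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

page = "https://example.com/gallery/"
print(may_train_on(ROBOTS_TXT, "ExampleTrainingBot", page))
print(may_train_on(ROBOTS_TXT, "SomeSearchBot", page))
```

With this file, the training crawler is turned away while other visitors are still let in, which is the digital version of a studio door that opens only for invited students.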

Respecting Human Creativity

Here’s what really gets to me: I would love to see training data sources listed in every LLM’s description. Not just for transparency, but out of respect for human creators.

We all know people are losing jobs to AI advances. The same programmers who shared their code publicly on GitHub are now watching AI master coding skills and compete with them for programming jobs. Same with artists and writers.

But at minimum, acknowledging human contributions would make AI more respected and hopefully more respectful of human society. Just like Rembrandt’s students respected their master.

It’s basic courtesy, really.

My Take

I keep coming back to that image of Rembrandt’s workshop. Students could learn from the master, but only by stepping inside with permission and acknowledgement. Maybe AI should work with the same spirit: learn freely where permission is granted, respect the private studios of others.

The legal debate isn’t over — courts, lawmakers, and communities are still working this out. But the guiding principle seems simple to me: learning is valuable, but respect is essential.

If art students once acknowledged their teachers, AI should too. Not because it makes the AI less impressive, but because it makes the learning process transparent and ethical.

Good learning — whether with brushes or algorithms — gets stronger when it honours its sources.

Actually, let me be completely honest: I think this is just the beginning of a much larger conversation about how humans and AI will coexist. The sooner we figure out fair and respectful ways to handle this, the better for everyone involved.