Data Harvesting and GDPR

What is it?

AI, and especially machine learning and large language models, requires vast amounts of data both to be trained and to function effectively. But this constant need for “more data, faster” often clashes with the fundamental principles of GDPR. In the EU, personal data may only be collected when it is necessary for a specific purpose, a principle that stands in sharp contrast to AI developers’ tendency to “harvest everything,” since more data typically leads to better models.

Many AI systems are trained on massive amounts of web content, documents, and text data, often without explicit consent from those who created or own the data. This means that personal information and sensitive data may, in the worst cases, be used without knowledge or approval, violating GDPR’s core principles of lawfulness, transparency, and data minimization.

But the issue doesn’t stop at training. Even when ordinary users interact with AI in practice, new risks emerge. Many people find that AI models like ChatGPT or Claude deliver better results when they share reports, surveys, emails, or raw data with the model, but this practice raises serious data protection challenges.

Examples:
  • An employee uploads an internal report to an AI model to get a summary or suggestions for next steps. The report contains personal data and business-critical information, and that data is suddenly exposed to an external AI service where it is unclear how the information is stored or used. (A minimal redaction sketch follows these examples.)

  • Even if the model doesn’t store data for training purposes, the information may still be processed under conditions, or in jurisdictions outside the EU/EEA, that do not meet GDPR’s requirements for data protection, purpose limitation, and data security. In some cases, information may unintentionally surface in later interactions or be leaked through system errors.
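To make the first example concrete, here is a minimal sketch of a redaction pass that strips obvious identifiers from text before it leaves the organization. Everything in it is an illustrative assumption rather than any specific product’s API: the redact helper and the regex patterns are invented for this sketch, and pattern matching only catches structured identifiers such as emails, phone numbers, and Danish CPR numbers.

    import re

    # Illustrative patterns only: regex catches structured identifiers,
    # while names, addresses, and free-text details need NER models and
    # human review. CPR is matched before the broader phone pattern so
    # ID numbers are not mislabeled as phone numbers.
    PATTERNS = {
        "CPR":   re.compile(r"\b\d{6}-?\d{4}\b"),   # Danish personal ID
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\+?\d[\d \-]{6,}\d"),
    }

    def redact(text: str) -> str:
        """Replace each match with a typed placeholder such as [EMAIL]."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    report = "Contact Jens, jens@example.com, +45 11 22 33 44, CPR 010190-1234."
    print(redact(report))
    # -> Contact Jens, [EMAIL], [PHONE], CPR [CPR].

Note that the name “Jens” survives the pass: pattern-based redaction is a first line of defense, not an anonymization guarantee, which is why human judgment about what to upload remains essential.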

What to consider?
  • If you are training your own AI models, ensure that your data sources are lawfully obtained and that you comply with GDPR’s data minimization requirements. Use only the information you actually need, and be aware that even “anonymous” data can become re-identifiable when combined at scale (see the k-anonymity sketch after this list).

  • If you use generative AI in daily work, avoid uploading personal data unless you have explicit consent and know how the data is processed. The same applies to sensitive business information and copyright-protected material.

  • Consider establishing internal guidelines for AI use, not only to ensure GDPR compliance but also to protect your organization against data leaks and legal risk. Transparency, consent, and security should be the foundation of responsible AI use.
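The re-identification point above can be made concrete with a k-anonymity check, a standard way to measure how anonymous a dataset really is: count how many records share each combination of quasi-identifiers. Here the quasi-identifiers are postcode, birth year, and gender, and the field names and records are invented for illustration; a group of size 1 means a single person is uniquely identifiable even though names have been removed.

    from collections import Counter

    # Toy "anonymized" records: names removed, quasi-identifiers kept.
    records = [
        {"postcode": "2200", "birth_year": 1978, "gender": "F"},
        {"postcode": "2200", "birth_year": 1978, "gender": "F"},
        {"postcode": "2200", "birth_year": 1991, "gender": "M"},
    ]

    QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")

    def k_anonymity(rows):
        """Smallest group size across all quasi-identifier combinations."""
        groups = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in rows)
        return min(groups.values())

    # k = 1: at least one combination matches exactly one record, so
    # that person can be re-identified by anyone who knows those facts.
    print(k_anonymity(records))  # -> 1

Raising k, by generalizing postcodes or bucketing birth years, is what data minimization looks like at the dataset level: a dataset that seems harmless row by row can still single people out in combination.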

