Like the monster plant in Little Shop of Horrors, Generative AI has an insatiable appetite, though for data rather than food. Generative AI applications, like ChatGPT and Midjourney, need a constant supply of data to train (and improve) their output algorithms. In the early days of AI development, this data came from public sources, especially the internet. However, this “data scraping” was not without legal obstacles.
Where personal data is used to train AI models, GDPR applies. The transparency provisions and the requirement for a legal basis are of particular importance. In 2022, the Information Commissioner’s Office (ICO) fined Clearview AI more than £7.5 million for GDPR breaches in the way it compiled its online database, which contained 20 billion images of people’s faces and data scraped from the internet. The company successfully appealed the fine, but the ICO and other GDPR regulators in the EU have issued clear warnings to AI companies to ensure they comply with GDPR.
To satisfy Generative AI’s demand for more data, AI developers have been striking deals with tech companies for access to the latter’s user data. This includes data generated by users whilst using popular websites and apps. In February, it was reported that Tumblr and WordPress.com were preparing to sell user data to Midjourney and OpenAI. And (surprise, surprise) Meta and Alexa have, in the past, exploited user data to train their AI models.
Elon Musk’s X (formerly Twitter) came under fire recently after it started collecting and using its users’ data, including their posts, to train X’s Grok AI model. This was allegedly done without notifying X users or asking for their consent. In June, the Irish Data Protection Commission (DPC), X’s Lead Supervisory Authority, made an urgent application under Section 134 of the Irish Data Protection Act 2018. Where the DPC considers there is an urgent need to act to protect the rights and freedoms of data subjects, this section allows it to apply to the High Court for an order requiring the data controller to suspend, restrict or prohibit the processing of personal data.
This was the first time that any Lead Supervisory Authority has taken such action, and the first time that the DPC has sought to utilise its powers under Section 134. The DPC said the application was made to protect the rights and freedoms of X’s EU/EEA users, and came after extensive engagement between the DPC and X regarding its AI model training. Last week, the DPC announced that X had agreed to suspend its processing of the personal data contained in the public posts of X’s EU/EEA users which it processed between 7 May 2024 and 1 August 2024, for the purpose of training its AI model.
But this agreement is not the end of X’s privacy woes. Noyb, a privacy advocacy group headed by Max Schrems, has filed nine more GDPR complaints with regulators across Europe alleging that X appears to have breached a number of other GDPR provisions including the GDPR principles and the transparency rules. Several other major tech firms have also faced regulatory setbacks in Europe over privacy issues raised by their AI plans. In June Meta announced that it was pausing its plan to process user posts and images on Facebook and Instagram to train its AI tools after a number of GDPR complaints. LinkedIn was also the subject of a similar complaint by consumer organisations.
AI is a priority for the ICO. Its existing guidance on AI explains how to apply the concepts of data protection law when developing or deploying AI, and the AI toolkit helps organisations identify and mitigate risks during the AI lifecycle. The ICO consultation series on generative AI and data protection closed in June.
The training of Generative AI does not just pose GDPR compliance issues. In December last year, the New York Times announced it was suing OpenAI and Microsoft for copyright infringement. The lawsuit claimed the “unlawful use” of the paper’s “copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more” to create AI products “threatens The Times’s ability to provide that service”.

