by Parmy Olson
Sorry, OpenAI. The European Union is making life for the leaders of artificial intelligence much less private.
A newly agreed draft of the region’s upcoming AI Act will force the maker of ChatGPT and other companies to share previously hidden details about how they build their products. The legislation will still rely on companies to audit themselves, but it’s nonetheless a promising development as corporate giants race to launch powerful AI systems with almost no oversight from regulators.
The law, which would come into force in 2025 after approval from EU member states, forces more clarity about the ingredients of powerful, “general purpose” AI systems like ChatGPT that can conjure images and text. Their developers will have to report a detailed summary of their training data to EU regulators, according to a copy of the draft seen by Bloomberg Opinion.
“Training data… who cares?” you might be wondering. As it happens, AI companies do. Two of the top AI companies in Europe lobbied hard to tone down those transparency requirements, and for the last few years, leading firms like OpenAI have become more secretive about the reams of data they’ve scraped from the Internet to train AI tools like ChatGPT and Google’s Bard and Gemini.
OpenAI, for instance, has only given vague outlines of the data it used to create ChatGPT, which included books, websites and other texts. That helped the company avoid more public scrutiny over its use of copyrighted works or the biased data sets it may have used to train its models.
Biased data is a chronic problem in AI that demands regulatory intervention. An October study by Stanford University showed that ChatGPT and another AI model generated employment letters for hypothetical people that were rife with sexist stereotypes. While the letters described a man as “expert,” they called a woman a “beauty” and a “delight.” Other studies have shown similar, troubling outputs.
Forcing companies to show their homework more rigorously gives researchers and regulators a greater opportunity to probe where things are going wrong with the training data.
Companies running the biggest models will have to go one step further, rigorously testing them for security risks and for how much energy their systems demand, and then reporting back to the European Commission. Rumors in Brussels are that OpenAI and several Chinese companies will fall into that category, according to Luca Bertuzzi, an editor with the EU news website Euractiv, who cited an internal note to the EU Parliament.
But the act could and should have gone further. In its requirement for detailed summaries of training data, the draft legislation states:
“This summary should be comprehensive in its scope instead of technically detailed, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.”
That’s vague enough for companies like OpenAI to hide a number of key data points: What kind of personal data are they using in their training sets? How prevalent is abusive or violent imagery and text? And how many content moderators have they hired, with different language abilities, to police how their tools are used?
Those are all questions that are likely to remain unanswered without more specifics. Another helpful guideline would have been for companies to give third-party researchers and academics the ability to audit the training data used in their models. Instead, companies will essentially audit themselves.
“We just came out of 15 years of begging social media platforms for information on how their algorithms work,” says Daniel Leufer, a Brussels-based senior policy analyst at Access Now, a digital-rights nonprofit. “We don’t want to repeat that.”
The EU’s AI Act is a decent, if slightly half-baked, start when it comes to regulating AI, and the region’s policy makers should be applauded for resisting corporate lobbying in their efforts to crack open the closely held secrets of AI companies. In the absence of any similar regulation elsewhere (and with none expected from the U.S.), this at least is a step in the right direction.