Free GPT-4o costs users data, copyright or not

TechTalk / Viewpoint

OpenAI's newest artificial intelligence tool, GPT-4o, leverages a large and growing user base – drawn in by the promise that the service is free – to crowdsource massive amounts of multimodal data and use it to train its AI model. But much of that data is not owned by its users, and copyright holders will have little recourse

Angela Huyue Zhang and S. Alex Yang 7 Jun 2024

With the launch of GPT-4o, OpenAI has once again shown itself to be the world’s most innovative artificial-intelligence company. This new multimodal AI tool – which seamlessly integrates text, voice, and visual capabilities – is significantly faster than previous models, greatly enhancing the user experience. But perhaps the most attractive feature of GPT-4o is that it is free – or so it seems.

One does not have to pay a subscription fee to use GPT-4o. Instead, users pay with their data. Like a black hole, GPT-4o increases in mass by sucking up any and all material that gets too close, accumulating every piece of information that users enter, whether in the form of text, audio files or images.

GPT-4o gobbles up not only users’ own information but also third-party data that are revealed during interactions with the AI service. Let’s assume you are seeking a summary of a New York Times article’s content. You take a screenshot and share it with GPT-4o, which reads the screenshot and generates the requested summary within seconds. For you, the interaction is over. But OpenAI is now in possession of all the copyrighted material from the screenshot you provided, and it can use that information to train and enhance its model.

OpenAI is not alone. In the past year, many firms – including Microsoft, Meta, Google, and X (formerly Twitter) – have quietly updated their privacy policies in ways that potentially allow them to collect user data and apply it to train generative AI models. Though leading AI companies have already faced numerous lawsuits in the United States over their unauthorized use of copyrighted content for this purpose, their appetite for data remains as voracious as ever. After all, the more they obtain, the better they can make their models.

The problem for leading AI firms is that high-quality training data has become increasingly scarce. In late 2021, OpenAI was so desperate for more data that it reportedly transcribed over a million hours of YouTube videos, violating the platform’s rules. (Google, YouTube’s parent company, has not pursued legal action against OpenAI, possibly to avoid accountability for its own harvesting of YouTube videos, the copyrights for which are owned by their creators.)

With GPT-4o, OpenAI is trying a different approach, leveraging a large and growing user base – drawn in by the promise of free service – to crowdsource massive amounts of multimodal data. This approach mirrors a well-known tech-platform business model: charge users nothing for services, from search engines to social media, while profiting from app tracking and data harvesting – what Harvard professor Shoshana Zuboff famously called “surveillance capitalism”.

To be sure, users can prohibit OpenAI from using their “chats” with GPT-4o for model training. But the obvious way to do this – on ChatGPT’s settings page – automatically turns off the user’s chat history, causing users to lose access to their past conversations. There is no discernible reason why these two functions should be linked, other than to discourage users from opting out of model training.

If users want to opt out of model training without losing their chat history, they must, first, figure out that there is another way, as OpenAI highlights only the first option. They must then navigate through OpenAI’s privacy portal – a multi-step process. Simply put, OpenAI has made sure that opting out carries significant transaction costs, in the hopes that users will not do it.

Even if users consent to the use of their data for AI training, consent alone would not guard against copyright infringement, because users are providing data that they may not actually own. Their interactions with GPT-4o thus have spillover effects on the creators of the content being shared – what economists call “externalities”. In this sense, consent means little.

While OpenAI’s crowdsourcing activities could lead to copyright violations, holding the company – or others like it – accountable will be no easy feat. AI-generated output rarely looks like the data that informed it, which makes it difficult for copyright holders to know for certain whether their content was used in model training. Moreover, a firm might be able to claim ignorance: users provided the content during interactions with its services, so how can the company know where they got it from?

Creators and publishers have employed a number of methods to keep their content from being sucked into the AI-training black hole. Some have introduced technological solutions to block data scraping. Others have updated their terms of service to prohibit the use of their content for AI training. Last month, Sony Music – one of the world’s largest record labels – sent letters to more than 700 generative-AI companies and streaming platforms, warning them not to use its content without explicit authorization.

But as long as OpenAI can exploit the “user-provided” loophole, such efforts will be in vain. The only credible way to address GPT-4o’s externality problem is for regulators to limit AI firms’ ability to collect and use the data their users share.

Angela Huyue Zhang is an associate professor of law and director of the Philip K.H. Wong Center for Chinese Law at the University of Hong Kong, and S. Alex Yang is a professor of Management Science and Operations at London Business School.
