An Only Slightly Modest Proposal: If AI Companies Want More Content, They Should Fund Reporters, And Lots Of Them

by Mike Masnick, from Techdirt

In Jonathan Swift's "A Modest Proposal," he satirized politicians who were out of touch and were treating the poor as an inconvenience, rather than a sign of human suffering and misery. So, he took what seemed like two big problems, according to those politicians, and came up with an obviously barbaric solution to both: letting the poor sell their kids as food. This was really only designed to highlight the barbaric framing of the "problem" by the Irish elite.

But, sometimes, there really are scenarios where two very real problems (not of a Swiftian nature) can be combined such that both actually get solved. And thus I present a non-Swiftian modest proposal: that AI companies desperate for high-quality content should create funds to pay for journalists to create high-quality content that the AI companies can use for training.

Lately, there have been multiple news articles about how desperate the AI companies are for fresh data to feed the voracious and insatiable training machine. The Wall Street Journal noted that "the internet is too small" for AI companies.

Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans.

Ever more powerful systems developed by OpenAI, Google and others require larger oceans of information to learn from. That demand is straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies.

Some executives and researchers say the industry's need for high-quality text data could outstrip supply within two years, potentially slowing AI's development.

The problem is not just data, but high-quality data, as that report notes. You need the AI systems trained on well-written, useful content:

Most of the data available online is useless for AI training because it contains flaws such as sentence fragments or doesn't add to a model's knowledge. Villalobos estimated that only a sliver of the internet is useful for such training - perhaps just one-tenth of the information gathered by the nonprofit Common Crawl, whose web archive is widely used by AI developers.

The NY Times also published a similar-ish story, though it framed it in a much more nefarious light. It argued that the AI companies were "cutting corners to harvest data for AI" systems. However, what the Times actually means is that AI companies believe (correctly, in my opinion) that they have a very strong fair use argument for training on whatever data they can find.

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

I've discussed the copyright arguments repeatedly, including why I think the AI companies are correct that training on copyright-covered works shouldn't be infringing. I also think the rush to rely on copyright as a solution here is problematic. Doing so would only enrich big tech, since smaller companies and open source systems wouldn't be able to keep up. Also, requiring all training to be licensed would effectively break the open internet, by creating a new "license to read." This would be bad.

But, all of this is coming at the same time that journalism is in peril. We're hearing stories of news orgs laying off tons of journalists. Or publications shutting down entirely. There are stories of "news deserts" and how corruption is increasing as news orgs continue to fail.

The proposed solutions to this very real problem have been very, very bad. Link taxes are even more destructive to the open web and don't actually appear to work very well.

But... that doesn't mean there isn't a better solution. If the tech companies need good, well-written content to fill their training systems, and the world needs good, high-quality journalism, why don't the big AI companies agree to start funding journalists and solve both problems in one move?

This may sound similar to the demands of licensing works, but I'm not talking about past works. Those works are out there. I'm talking about paying for the creation of future works. It's not about licensing or copyright. It's about paying for the creation of new, high-quality journalism. And then letting those works exist freely on the internet for everyone.

It was already mentioned above that Meta considered buying a book publisher. Why not news publishers as well? But ownership of the journalists shouldn't even be the focus, as it could raise some other challenges. Instead, they can just set up a fund where anyone can apply. There can be a pretty clear set of benefits to all parties.

Journalists who join the programs (and they should be allowed to join multiple programs from multiple companies) agree to publish new, well-written articles on a regular basis, in exchange for some level of financial support. It should be abundantly clear that the AI companies have no say over the type of journalism being done, nor do they have any say in editorial beyond the ability to review the quality of the writing to make sure it's actually useful in training new systems.

The journalists only need to promise that anything they publish that receives funding from this program is made available to the training systems of the companies doing the funding.

In exchange, beyond just some funding, the AI companies could make a variety of AI tools available to the journalists as well, to help them improve the quality of their writing (I have a story coming up soon about how I've been using AI as a supplemental editor, but never to write any content).

This really feels like something that could solve at least some of the problems at both ends of this market. There are some potential limits here, of course. The AI companies need so much new content that it's unclear if this would create enough to matter. But it would create something. And it could be lots of somethings. And not only that, but it should be pretty damn up-to-date somethings (which can be useful).

There could be reasonable concerns about conflicts of interest, but as it stands today, most journalism is funded by billionaires already. I don't see how this is any worse. And, as suggested, it could be structured such that the journalists aren't employees, and it could (should?) have explicit promises about a lack of editorial control or interference.

The AI companies might also claim that it's too expensive to create a large enough pool, but if they're so desperate for good, high-quality content, to the point of potentially buying up famous publishers, then, um, it seems clear that they are willing to spend, and it's worth it to them.

It's not a perfect solution, but it sure seems like one that solves two big problems in one shot, without fucking up the open web or relying on copyright as a crutch. Instead, it funds the future production of high-quality journalism in a manner that is helpful both for the public at large and the AI companies that could contribute to the funding. It also doesn't require any big new government law. The companies can just... see the benefit themselves and set up the program.

The public gets a lot more high-quality journalism, and journalists get sustainable revenue sources to continue to do good reporting. It's not quite a Swiftian modest proposal, in that... it actually could make sense.
