Stack Overflow Will Charge AI Giants For Training Data
An anonymous reader quotes a report from Wired: Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has more than 20 million registered users. Stack Overflow's decision to seek compensation from companies tapping its data, part of a broader generative AI strategy, has not been previously reported. It follows an announcement by Reddit this week that it will begin charging some AI developers to access its own content starting in June. "Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow's Chandrasekar says. "We're very supportive of Reddit's approach." Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need "to be trained on something that's progressing knowledge forward. They need new knowledge to be created." But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs. Chandrasekar says that LLM developers are violating Stack Overflow's terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. 
When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says. Neither Stack Overflow nor Reddit has released pricing information. "Both Stack Overflow and Reddit will continue to license data for free to some people and companies," notes Wired. "Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes." "When people start charging for products that are built on community-built sites like ours, that's where it's not fair use," he says.