ChatGPT Gets Code Questions Wrong 52% of the Time
upstart writes:
But its suggestions are so annoyingly plausible:
ChatGPT, OpenAI's fabulating chatbot, produces wrong answers to software programming questions more than half the time, according to a study from Purdue University. That said, the bot was convincing enough to fool a third of participants.
The Purdue team analyzed ChatGPT's answers to 517 Stack Overflow questions, assessing their correctness, consistency, comprehensiveness, and conciseness. The US academics also conducted linguistic and sentiment analysis of the answers, and questioned a dozen volunteer participants about the results generated by the model.
"Our analysis shows that 52 percent of ChatGPT answers are incorrect and 77 percent are verbose," the team's paper concluded. "Nonetheless, ChatGPT answers are still preferred 39.34 percent of the time due to their comprehensiveness and well-articulated language style." Among the set of preferred ChatGPT answers, 77 percent were wrong.
On the ChatGPT website, OpenAI acknowledges that its software "may produce inaccurate information about people, places, or facts." We've asked the lab whether it has any comment on the Purdue study.
The pre-print paper is titled "Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions." It was written by researchers Samia Kabir, David Udo-Imeh, Bonan Kou, and assistant professor Tianyi Zhang.
"During our study, we observed that only when the error in the ChatGPT answer is obvious, users can identify the error," their paper stated. "However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer."
Even when the answer contained a glaring error, the paper stated, two of the 12 participants still marked the response as preferred. The paper attributes this to ChatGPT's pleasant, authoritative style.
"From semi-structured interviews, it is apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct," the paper explained.
Journal Reference:
Kabir, Samia; Udo-Imeh, David N.; Kou, Bonan; et al. Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions. arXiv (DOI: 10.48550/arXiv.2308.02312)