Why You Shouldn't Trust ChatGPT's Answers To Software Engineering Questions
For years now, if you're a software developer stuck on a hard problem and want to ask a stranger for help, the place to go has been a website called Stack Overflow. There's so much helpful content on there, and so much of it is indexed so well by Google, that there are memes about professional developers who don't really know anything about coding at all; they just copy and paste example code from Stack Overflow.
More recently, those types have been finding their way to ChatGPT instead. The reason is that it gives answers to coding questions much like Stack Overflow's, complete with code examples, but instantaneously. Instead of waiting for some guru longbeard to come by and deign to respond to your query, you simply get an instant response from a bot that appears to give a correct result.
We say "appears to give" because, though it's already well known that ChatGPT will readily produce non-functional code, nobody had done a proper study on it until now. Researchers at Purdue University just put out a paper titled "Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions." It's the result of a significant research project involving manual analysis of responses from both sources as well as semi-structured interviews with users.
You read the headline, so we probably don't need to tell you the outcome, but it was pretty bad. Out of 512 questions, some 52% of ChatGPT's answers were incorrect, containing factual errors or non-functional code. Despite that, 65% of the AI answers were "comprehensive," meaning they took into account all aspects of the query, or prompt. Also, 77% of the answers were "verbose," which in this context means they're wordy and articulate, even more so than strictly necessary to answer the question.
This well-articulated quality, along with the comprehensive nature of the responses, could be why interviewees failed to notice factual errors and incorrect information in ChatGPT's responses some 40% of the time. That's the real terrifying statistic: it's no surprise that ChatGPT hallucinates, but it is a little shocking that roughly 40% of the time, humans don't notice, because the response "looks" correct.
That's ultimately the problem with all current generative AI: the things generated by these black-box machines are, more often than not, incorrect by any human reckoning. They just look correct at a glance; they're "close enough" without thorough inspection. You can see it with images, with audio, and with text. While the tech is still improving, for now you'd better stick to other humans when trying to learn a new skill.