Developers Say GPT-5 Is a Mixed Bag

Last week, when OpenAI launched GPT-5, it told software engineers the model was designed to be a “true coding collaborator” that excels at generating high-quality code and performing agentic, or automated, software tasks. While the company didn’t say so explicitly, OpenAI appeared to be taking direct aim at Anthropic’s Claude Code, which has quickly become many developers’ favored tool for AI-assisted coding.
But developers tell WIRED that GPT-5 has been a mixed bag so far. It shines at technical reasoning and at planning coding tasks, but some say that Anthropic's newest Opus and Sonnet reasoning models still produce better code. Depending on which setting developers use (low, medium, or high verbosity), the model can be more long-winded, which sometimes leads it to generate unnecessary or redundant lines of code.
Some software engineers have also criticized how OpenAI evaluated GPT-5’s performance at coding, arguing that the benchmarks it used are misleading. One research firm called a graphic that OpenAI published boasting about GPT-5’s capabilities a “chart crime.”
GPT-5 does stand out in at least one way: Several people noted that, in comparison to competing models, it is a much more cost-effective option. “GPT-5 is mostly outperformed by other AI models in our tests, but it’s really cheap,” says Sayash Kapoor, a computer science doctoral student and researcher at Princeton University who cowrote the book AI Snake Oil.
Kapoor says he and his team have been running benchmark tests to evaluate GPT-5’s capabilities since the model was released to the public last week. He notes that the standard test his team uses—measuring how well a language model can write code that will reproduce the results of 45 scientific papers—costs $30 to run with GPT-5 set to medium, or mid-range verbosity. The same test using Anthropic’s Opus 4.1 costs $400. In total, Kapoor says his team has spent around $20,000 testing GPT-5 so far.
Although GPT-5 is cheap, Kapoor’s tests indicate the model is also less accurate than some of its competitors. Claude’s premium model achieved a 51 percent accuracy rating, measured by how many of the scientific papers it accurately reproduced. The medium version of GPT-5 received a 27 percent accuracy rating. (Kapoor has not yet run the same test using GPT-5 high, so it’s an indirect comparison, given that Opus 4.1 is Anthropic’s most powerful model.)
OpenAI spokesperson Lindsay McCallum referred WIRED to the company's blog, which says it trained GPT-5 on "real-world coding tasks in collaboration with early testers across startups and enterprises." The company also highlighted some of its internal accuracy measurements for GPT-5, which showed that the GPT-5 "thinking" model, which does more deliberate reasoning, scored highest on accuracy among all of OpenAI's models. GPT-5 "main," however, still fell short of previously released models on OpenAI's own accuracy scale.
Anthropic spokesperson Amie Rotherham said in a statement that “performance claims and pricing models often look different once developers start using them in production environments. Since reasoning models can quickly use a lot of tokens while thinking, the industry is moving to a world where price per outcome matters more than price per token.”
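The figures from Kapoor's benchmark make it possible to sketch what "price per outcome" looks like in practice. The short calculation below uses only the numbers reported above (a 45-paper test run, $30 per GPT-5 medium run at 27 percent accuracy, $400 per Opus 4.1 run at 51 percent accuracy) and treats each accurately reproduced paper as one outcome; that framing is an illustration, not a figure from Anthropic or Kapoor's team.

```python
# Illustrative "price per outcome" arithmetic using the benchmark
# figures cited in this article. One "outcome" here is one accurately
# reproduced paper; this framing is an illustration, not either
# company's own accounting.
PAPERS = 45  # papers per benchmark run

runs = {
    # model: (cost per full run in USD, accuracy as a fraction)
    "GPT-5 medium": (30, 0.27),
    "Opus 4.1": (400, 0.51),
}

for model, (cost, accuracy) in runs.items():
    outcomes = accuracy * PAPERS  # expected papers reproduced per run
    print(f"{model}: ${cost / outcomes:.2f} per reproduced paper")
```

By that rough measure, GPT-5 medium works out to about $2.47 per reproduced paper versus roughly $17.43 for Opus 4.1, which is consistent with Kapoor's point that the model's main draw is cost.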
Some developers say they've had largely positive experiences with GPT-5 so far. Jenny Wang, an engineer, investor, and creator of the personal styling agent Alta, told WIRED the model appears to be better at completing complex coding tasks in one shot than other models. She compared it favorably to OpenAI's o3 and 4o, which she uses frequently for code generation and straightforward fixes, "like formatting, or if I want to create an API endpoint similar to what I already have."
In her tests of GPT-5, Wang says she asked the model to generate code for a press page for her company’s website, including specific design elements that would match the rest of the site’s aesthetic. GPT-5 completed the task in one take, whereas in the past, Wang would have had to revise her prompts during the process. There was one significant error, though: “It hallucinated the URLs,” Wang says.
Another developer, who spoke on the condition of anonymity because their employer didn’t authorize them to speak to the press, says GPT-5 excels at solving deep technical problems.
The developer’s current hobby project is writing a programmatic network analysis tool, one that would require code isolation for security purposes. “I basically presented my project and some paths I was considering, and GPT-5 took it all in and gave back a few recommendations along with a realistic timeline,” the developer explains. “I’m impressed.”
A handful of OpenAI’s enterprise partners and customers, including Cursor, Windsurf, and Notion, have publicly vouched for GPT-5’s coding and reasoning skills. (OpenAI included many of these remarks in its own blog post announcing the new model.) Notion also shared on X that it’s “fast, thorough, and handles complex work 15 percent better than other models we’ve tested.”
But within days of GPT-5’s release, some developers were weighing in online with complaints. Many said that GPT-5’s coding abilities seemed behind the curve for what was supposed to be a state-of-the-art, ultra-capable model from the world’s buzziest AI company.
“OpenAI’s GPT-5 is very good, but it seems like something that would have been released a year ago,” says Kieran Klassen, a developer who has been building an AI assistant for email inboxes. “Its coding capabilities remind me of Sonnet 3.5,” he adds, referring to an Anthropic model that launched in June 2024.
Amir Salihefendić, founder of the startup Doist, said in a social media post that he's been using GPT-5 in Cursor and has found it "pretty underwhelming" and that "it's especially bad at coding." He said the release of GPT-5 felt like a "Llama 4 moment," referring to Meta's AI model, which had also disappointed some people in the AI community.
On X, developer Mckay Wrigley wrote that GPT-5 is a “phenomenal everyday chat model,” but when it comes to coding, “I will still be using Claude Code + Opus.”
Other developers describe GPT-5 as "exhaustive": at times helpful, but often irritating in its long-windedness. Wang, who was pleased overall with the frontend coding project she assigned to GPT-5, says she did notice that the model was "more redundant. It clearly could have come up with a cleaner or shorter solution." (Kapoor points out that GPT-5's verbosity can be adjusted, so users can ask it to be less chatty, or even to do less reasoning in exchange for faster responses or cheaper pricing.)
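For developers who want to dial this down, OpenAI's launch materials describe verbosity and reasoning-effort controls in its API. The sketch below is a minimal example assuming those documented parameters on the Responses API; the prompt is invented, and the exact parameter shapes should be checked against OpenAI's current docs.

```python
# Minimal sketch of reining in GPT-5's chattiness via OpenAI's
# Responses API, assuming the verbosity and reasoning-effort
# controls described at launch. The prompt is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},  # less deliberation: faster and cheaper
    text={"verbosity": "low"},        # terser output, fewer redundant lines
    input="Refactor this function and keep the diff as small as possible: ...",
)

print(response.output_text)
```

Lower verbosity trims the explanation and boilerplate around an answer; lower reasoning effort trades some quality for speed and cost, which mirrors the trade-off Kapoor describes.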
Itamar Friedman, the cofounder and CEO of the AI-coding platform Qodo, believes that some of the critiques of GPT-5 stem from evolving expectations around AI model releases. “I think a lot of people thought that GPT-5 would be another moment when everything about AI improved, because of this march towards AGI. When actually, the model improved on a few key sub-tasks,” he says.
Friedman refers to before 2022 as “BCE”—Before ChatGPT Era—when AI models improved holistically. In the post-ChatGPT era, new AI models are often better at certain things. “Claude Sonnet 3.5, for example, was the one model to rule them all on coding. And Google Gemini got really good at code review, to check if code is high quality,” Friedman says.
OpenAI has also gotten some heat for the methodology it used to run its benchmark tests and make performance claims about GPT-5, although benchmarking practices vary considerably across the industry. SemiAnalysis, a research firm focused on the semiconductor and AI sectors, noted that OpenAI ran only 477 of the 500 tests typically included in SWE-bench, a relatively new AI industry framework for testing large language models. (This was for overall performance of the model, not just coding.)
OpenAI says that it always tests its AI models on a fixed subset of 477 tasks rather than the full 500 in the SWE-bench test, because those 477 tests are the ones the company has validated on its internal infrastructure. McCallum also pointed to GPT-5’s system card, which noted that changes in the model’s verbosity setting can “lead to variation in eval performance.”
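The 477-versus-500 distinction matters because a score reported on a subset shrinks once the skipped tasks are counted against the full set. The calculation below is purely illustrative; the pass count is hypothetical, not OpenAI's reported result.

```python
# Hypothetical illustration of subset-versus-full-set scoring on
# SWE-bench. The pass count is invented for the example; it is not
# OpenAI's reported result.
SUBSET_TASKS = 477
FULL_TASKS = 500
passed = 358  # hypothetical number of tasks solved

subset_score = passed / SUBSET_TASKS   # the score as reported on the subset
full_set_floor = passed / FULL_TASKS   # skipped tasks counted as failures

print(f"subset score:   {subset_score:.1%}")   # 75.1%
print(f"full-set floor: {full_set_floor:.1%}") # 71.6%
```

A gap of a few percentage points may sound small, but it can reorder a leaderboard, which is why the subset choice drew scrutiny.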
Kapoor says that frontier AI companies are ultimately facing difficult trade-offs. “When model developers train new models, they’re introducing new constraints, too, and have to consider many factors: how users expect the AI to behave and how it performs at certain tasks like agentic coding, all while managing the cost,” he says. “In some sense, I believe OpenAI knew it wouldn’t break all of those benchmarks, so it made something that would generally please a wide range of people.”