OpenAI introduced GPT-5.5 on April 23, describing it as a new class of intelligence designed for real-world work and powering autonomous agents. The company states that this model represents its most capable agentic AI system to date, built from the ground up to plan tasks, use external tools, verify its own output, and complete work independently.
GPT-5.5 is the first fully retrained base model since GPT-4.5. It was co-designed with NVIDIA on the chipmaker's GB200 and GB300 NVL72 rack-scale computing systems. According to OpenAI, the practical improvement is that tasks which previously required multiple human prompts and course corrections can now be handed off more completely to the model.
The rollout began on ChatGPT and Codex platforms for Plus, Pro, Business, and Enterprise users. API access became available on April 24.
Benchmark Performance
OpenAI’s strongest performance claim comes from Terminal-Bench 2.0, a benchmark that tests command-line workflows requiring planning and tool coordination in a sandboxed environment. GPT-5.5 scored 82.7%, compared to GPT-5.4’s 75.1% and Claude Opus 4.7’s 69.4%.
On SWE-Bench Pro, which evaluates GitHub issue resolution, GPT-5.5 achieved 58.6%, solving more issues in a single pass than previous versions. OpenAI also introduced an internal benchmark called Expert-SWE, where tasks carry a median estimated human completion time of 20 hours. GPT-5.5 scored 73.1%, up from GPT-5.4’s 68.5%.
In long-context reasoning, GPT-5.5 scored 74.0% on MRCR v2 at one million tokens, a retrieval benchmark that measures a model's ability to locate a specific answer buried in a large document. That is a significant jump from GPT-5.4's 36.6%.
Tool-Use Benchmark Gap
On MCP Atlas, Scale AI’s Model Context Protocol tool-use benchmark, Claude Opus 4.7 leads at 79.1%; no score was recorded for GPT-5.5. Notably, OpenAI included this absence in its own benchmark table, which suggests the company is confident in the overall performance picture despite the gap.
Token Efficiency and Pricing
API access is priced at $5 per million input tokens and $30 per million output tokens, exactly twice the rates for GPT-5.4. OpenAI defends this pricing by stating that GPT-5.5 completes the same Codex tasks with fewer tokens than GPT-5.4. The company claims effective costs are roughly 20% higher once efficiency is factored in, a finding validated by independent testing lab Artificial Analysis.
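A quick back-of-the-envelope check shows what OpenAI's efficiency claim implies. The GPT-5.4 rates below are inferred from the article's statement that GPT-5.5 costs exactly twice as much per token; the token-reduction factor is derived from the claim, not a figure OpenAI has published.

```python
# Figures from the article: GPT-5.5 output at $30/M tokens, double GPT-5.4's rate.
GPT54_OUTPUT = 15.00  # $/M output tokens (inferred: half of GPT-5.5's rate)
GPT55_OUTPUT = 30.00  # $/M output tokens

price_ratio = GPT55_OUTPUT / GPT54_OUTPUT  # 2.0x per-token price

# OpenAI's claim: effective cost is only ~20% higher once efficiency is factored in.
effective_cost_ratio = 1.20

# effective_cost_ratio = price_ratio * token_ratio, so the implied token usage is:
token_ratio = effective_cost_ratio / price_ratio
print(f"GPT-5.5 would need ~{token_ratio:.0%} of GPT-5.4's tokens "
      f"(~{1 - token_ratio:.0%} fewer) for the claim to hold")
```

In other words, the claim only holds if GPT-5.5 completes the same tasks in roughly 40% fewer tokens, which is the figure worth verifying against Artificial Analysis's independent results.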
A higher tier, GPT-5.5 Pro, is available to Pro, Business, and Enterprise users at $30 per million input tokens and $180 per million output tokens. It applies additional parallel test-time compute on harder problems and leads the list of publicly available models on BrowseComp, OpenAI’s agentic web-browsing benchmark, at 90.1%.
Businesses evaluating the model should stress-test token efficiency against actual workloads before committing. At 10 million output tokens per month, GPT-5.5 standard costs $300 compared to Claude Opus 4.7’s $250, a 20% premium that only pays off if the model’s superior agentic performance reduces task iterations and retries.
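The stress-test above can be sketched as a simple break-even calculation. Claude Opus 4.7's $25 per million output tokens is inferred from the article's $250-at-10M figure; the input-token side is ignored here for simplicity, and the retry reduction is the variable a team would measure on its own workloads.

```python
def monthly_cost(output_mtok: float, rate_per_mtok: float) -> float:
    """Output-token cost for a month, ignoring input tokens for simplicity."""
    return output_mtok * rate_per_mtok

gpt55 = monthly_cost(10, 30.0)   # $300 at 10M output tokens, as in the article
claude = monthly_cost(10, 25.0)  # $250 (rate inferred from the article's figure)

# Break-even: the fraction of Claude's token volume GPT-5.5 must stay under
# (via fewer iterations and retries) before the per-token premium pays off.
break_even = claude / gpt55
print(f"GPT-5.5: ${gpt55:.0f}/mo, Claude: ${claude:.0f}/mo")
print(f"Break-even at {break_even:.1%} of the token volume "
      f"(i.e., more than {1 - break_even:.1%} fewer output tokens)")
```

If GPT-5.5's agentic gains cut total output tokens by more than about 17%, the premium pays for itself at these rates; below that threshold, the cheaper model wins on cost.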
Internal Usage and Development Context
OpenAI reports that more than 85% of its employees now use Codex weekly across departments, including engineering and marketing. In one example, the communications team used GPT-5.5 to process six months of speaking request data, building a scoring and risk framework that helped automate low-risk approvals.
Greg Brockman described the release as a real step forward toward the kind of computing expected in the future. Chief scientist Jakub Pachocki noted that the last two years of model progress had felt surprisingly slow.
OpenAI says GPT-5.5 matches GPT-5.4’s per-token latency in production serving despite its higher capability. Larger, more capable models are typically slower to serve, but OpenAI states that this trade-off was avoided.
Industry Implications
Whether benchmark leads translate into production gains for teams running real agentic pipelines remains to be seen. The Terminal-Bench score is promising for unattended terminal agents and DevOps automation. The MCP Atlas gap is worth monitoring for organizations building heavily on tool-use orchestration.
Over the coming weeks, independent evaluations and enterprise case studies will clarify whether GPT-5.5 delivers on its efficiency and autonomy claims in real-world workloads. OpenAI has not announced a timeline for broader availability or future model iterations.