GPT-5.1 Codex

A developer's guide to GPT-5.1 Codex: long-horizon coding tasks, API setup, published benchmarks, and how it compares to Claude Code and Verdent for agentic workflows.

GPT-5.1 Codex reflects a shift in how coding models handle long-running agent work. Earlier sessions stalled once the context window filled. Compaction instead carries important task state forward across extended coding runs. In internal evaluations, OpenAI observed tasks continuing for more than 24 hours under this pattern.

The practical change is not simply a bigger answer window. It is a longer engineering loop: inspect the codebase, make changes, run tools, respond to failures, preserve progress, and continue without restarting the task from scratch.

For developers, that makes GPT-5.1 Codex most relevant when the work spans multiple files, test cycles, dependency checks, and review passes. The model can drive implementation, but successful use still depends on clear task boundaries, checkpointing, tool access, and validation.

Verdent fits into that workflow as the orchestration layer around current Codex models. It helps plan the work, split tasks safely, run them in isolated workspaces, and review completed changes so long-horizon coding work is easier to manage in real engineering environments.

What Is GPT-5.1 Codex

OpenAI released GPT-5.1-Codex-Max on November 19, 2025.

It was designed for agentic coding. It was not positioned as a general chat model.

The model could:

  • Inspect repositories
  • Edit multiple files
  • Run commands
  • Create pull requests
  • Review code
  • Debug test failures
  • Build frontend interfaces
  • Continue through long execution loops

| Specification | GPT-5.1-Codex-Max |
| --- | --- |
| Model ID | gpt-5.1-codex-max |
| Current status | Superseded |
| Context window | 400,000 tokens |
| Maximum output | 128,000 tokens |
| Knowledge cutoff | September 30, 2024 |
| Inputs | Text and images |
| Output | Text |
| Function calling | Supported |
| Structured output | Supported |
| Primary API | Responses API |

The model is still documented by OpenAI. However, it has been superseded for new work.

Existing access may vary. New integrations should select a supported model from the current OpenAI model catalog.

Treat GPT-5.1 Codex as a model choice inside a larger coding workflow. It can reason over code, call tools, and produce edits, but the quality of the result still depends on the surrounding agent interface, repository permissions, dependency setup, test coverage, and review process.

For a developer team, the practical question is not only whether the model can write code. The larger question is whether the workflow can define the task, constrain the workspace, run the right commands, preserve useful state, and catch unsafe changes before they reach the main branch.

Long-Task Coding Capabilities

GPT-5.1-Codex-Max used compaction to continue beyond one context window.

Compaction removes less important history. It preserves key decisions, touched files, errors, commands, test results, and implementation direction. The model can then continue inside a fresh context window. This process can repeat during the task.
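
The idea can be illustrated with a toy sketch. This is not OpenAI's implementation, whose internals are not described here. The `Event` type, the `KEEP_KINDS` set, and the character budget are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical event kinds treated as "key state" worth carrying forward.
KEEP_KINDS = {"decision", "file_touched", "error", "command", "test_result"}


@dataclass
class Event:
    kind: str  # e.g. "decision", "chatter", "test_result"
    text: str


def compact(history: list[Event], budget_chars: int) -> list[Event]:
    """Return history unchanged if it fits the budget; otherwise keep only key events."""
    used = sum(len(e.text) for e in history)
    if used <= budget_chars:
        return history
    kept = [e for e in history if e.kind in KEEP_KINDS]
    dropped = len(history) - len(kept)
    # A marker event records that older detail was discarded.
    return [Event("summary", f"{dropped} low-priority events dropped")] + kept


history = [
    Event("chatter", "x" * 500),
    Event("decision", "Use an adapter for the payment client"),
    Event("test_result", "tests/test_pay.py: 2 failed"),
]
print([e.kind for e in compact(history, budget_chars=200)])
# ['summary', 'decision', 'test_result']
```

A real system would summarize rather than filter by label, but the shape is the same: once the budget is exceeded, low-value history is dropped and task state survives into the next window.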

This was useful for:

  • Large repository refactors
  • Long debugging sessions
  • Repeated test-and-fix loops
  • Cross-file migrations
  • Frontend implementation
  • Pull request preparation
  • Repository-level questions

OpenAI observed the model working for more than 24 hours in internal evaluations. That is an observation, not a guaranteed runtime or success rate.

The model still needed clear requirements. It also needed tool access, tests, and review.

Compaction improved continuity. It did not remove the need for engineering controls.

Long-Horizon Does Not Mean Unsupervised

A 24-hour agent can accumulate 24 hours of wrong assumptions. Verdent's Plan-First Intelligence is meant to surface those assumptions before edits begin, while worktree isolation keeps each agent's changes contained.

Verdent reported 76.1% on SWE-bench Verified. Its Production-Ready Quality adds tests and review around long-running execution.

For long sessions, the safest pattern is to break work into checkpoints:

  1. Define the target behavior and the files or modules likely to change.
  2. Ask for a short plan before edits begin.
  3. Run a narrow implementation pass instead of changing the whole repository at once.
  4. Execute the relevant tests, type checks, linters, or build commands.
  5. Review the diff and error output before the next pass.
  6. Continue only after the plan still matches the observed results.

This pattern makes compaction more useful. The model has clearer milestones to preserve, and reviewers get smaller, safer units of work to inspect.
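
The checkpoint steps above can be sketched as a simple loop. The three callables are hypothetical placeholders: `run_pass` stands for one narrow agent pass that returns a diff, `run_checks` for the tests, type checks, and linters, and `approve` for the human or reviewer gate. None of them corresponds to a specific Codex or Verdent API.

```python
from typing import Callable


def run_checkpointed(
    steps: list[str],
    run_pass: Callable[[str], str],        # hypothetical: one scoped agent pass, returns a diff
    run_checks: Callable[[], bool],        # hypothetical: tests, type checks, linters, build
    approve: Callable[[str, str], bool],   # hypothetical: review gate on (step, diff)
) -> list[str]:
    """Run steps one at a time; stop at the first failed check or rejected diff."""
    completed: list[str] = []
    for step in steps:
        diff = run_pass(step)
        if not run_checks():
            break  # fix or re-plan before continuing
        if not approve(step, diff):
            break  # the plan no longer matches the observed results
        completed.append(step)
    return completed


# Stubbed usage: every check passes, but the reviewer rejects the third step.
done = run_checkpointed(
    ["add adapter", "migrate callers", "remove old client"],
    run_pass=lambda step: f"diff for {step}",
    run_checks=lambda: True,
    approve=lambda step, diff: step != "remove old client",
)
print(done)
# ['add adapter', 'migrate callers']
```

Stopping on the first failure is a deliberate choice in this sketch: it keeps each unit of work small enough to review before the next pass builds on it.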

GPT-5.1 Codex vs Claude Code vs Verdent

GPT-5.1 Codex, Claude Code, and Verdent operate at different layers.

GPT-5.1 Codex is a model. Claude Code is a coding agent. Verdent is a multi-agent development platform.

| Area | GPT-5.1 Codex | Claude Code | Verdent |
| --- | --- | --- | --- |
| Product type | Coding model | Coding agent | Multi-agent development platform |
| Main workflow | Agentic coding through Codex surfaces | Claude-based terminal and IDE work | Planning, parallel execution, isolation, and review |
| Model choice | OpenAI Codex model | Claude and supported provider models | Built-in models, BYOK, and BYOA |
| Parallel work | Depends on the Codex agent environment | Parallel sessions and agent workflows | Manager dispatches independent workers |
| Isolation | Depends on the agent environment | Depends on local configuration | Git worktree-based workspace isolation |
| Review | Codex code-review workflows | Claude-based code review | Built-in Reviewer and multi-model review |
| Best fit | Legacy Codex compatibility | Direct Claude coding workflow | Coordinating complex work across agents |

Claude Code is a strong fit when Claude is the primary coding agent and the developer wants a direct terminal or IDE workflow.

GPT-5.1 Codex is best understood as the model layer. It provides coding capability, reasoning over repository context, and tool-use behavior when the surrounding environment exposes files, commands, and review surfaces.

Verdent is a stronger fit when coordination is the main problem. It can divide one goal into tasks, assign independent work to separate workers, isolate changes in Git worktrees, and review the output before integration.
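
Git worktree isolation in general means each task gets its own checkout and branch, so parallel edits do not collide. The sketch below only builds the standard `git worktree add` commands without running them. It does not show how Verdent manages worktrees internally, and the `agent/` branch prefix and `../worktrees` directory are hypothetical naming choices.

```python
import shlex
from pathlib import PurePosixPath


def worktree_commands(tasks: list[str], base: str = "main",
                      root: str = "../worktrees") -> list[str]:
    """Return one `git worktree add` command per task, each on its own branch."""
    commands = []
    for task in tasks:
        branch = f"agent/{task}"           # hypothetical branch naming scheme
        path = PurePosixPath(root) / task  # one directory per task
        commands.append(
            shlex.join(["git", "worktree", "add", "-b", branch, str(path), base])
        )
    return commands


for cmd in worktree_commands(["fix-auth", "migrate-api"]):
    print(cmd)
# git worktree add -b agent/fix-auth ../worktrees/fix-auth main
# git worktree add -b agent/migrate-api ../worktrees/migrate-api main
```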

Verdent also supports Codex through Bring Your Own Agent. This lets Codex remain the execution agent while Verdent manages the broader workflow.

The tools can therefore be complementary. Codex handles implementation. Verdent handles planning, task management, workspace isolation, parallel work, and review. For teams managing larger changes, that separation reduces the risk that one long-running agent session turns into an unreviewed set of broad repository edits.

Teams comparing model-layer options can use the Claude Opus 4.5 guide to see how a neighboring coding workflow behaves before choosing an agent stack.

To validate the details described here against the source, check the OpenAI documentation.

API Setup & Pricing

GPT-5.1-Codex-Max was available through the OpenAI Responses API.

Its published token prices were:

| Token type | Price per 1M tokens (USD) |
| --- | --- |
| Input | $1.25 |
| Cached input | $0.125 |
| Output | $10.00 |
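
As a rough budgeting aid, the published prices above translate into a per-request estimate as follows. The sketch assumes the three token counts are already separated, with `input_tokens` meaning uncached input only. How a given account reports cached versus uncached usage should be checked against the provider documentation.

```python
# Published GPT-5.1-Codex-Max prices, in USD per 1M tokens.
PRICE_PER_M = {"input": 1.25, "cached_input": 0.125, "output": 10.00}


def estimate_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD from its token counts."""
    total = (
        input_tokens * PRICE_PER_M["input"]
        + cached_input_tokens * PRICE_PER_M["cached_input"]
        + output_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


# 200k uncached input + 100k cached input + 20k output tokens
print(round(estimate_cost(200_000, 100_000, 20_000), 4))
# 0.4625
```

At these prices output tokens dominate: the 20,000 output tokens in the example cost $0.20, almost as much as the 200,000 uncached input tokens at $0.25.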

A basic Python request used the following structure:

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1-codex-max",
    input=(
        "Inspect this repository. Find the cause of the failing tests. "
        "Implement the smallest safe fix. Run the relevant tests. "
        "Report the files changed and any remaining risks."
    ),
)

print(response.output_text)

This request may fail for accounts without legacy access. For new applications, replace gpt-5.1-codex-max with a current supported Codex model.

Use medium reasoning for routine development. Use higher reasoning only when the task clearly needs it.
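
In the Responses API, reasoning effort is set through the `reasoning` parameter. The helper below is a minimal sketch that defaults to medium effort and raises it only on request. The `needs_deep_reasoning` flag and the example task text are hypothetical, and the set of effort values a given model accepts should be confirmed in the current documentation.

```python
def build_request(task: str, needs_deep_reasoning: bool = False,
                  model: str = "gpt-5.1-codex-max") -> dict:
    """Build keyword arguments for client.responses.create()."""
    effort = "high" if needs_deep_reasoning else "medium"
    return {"model": model, "input": task, "reasoning": {"effort": effort}}


request = build_request("Fix the failing date-parsing test and rerun the suite.")
print(request["reasoning"])
# {'effort': 'medium'}

# With the OpenAI SDK installed and an API key configured:
#   from openai import OpenAI
#   response = OpenAI().responses.create(**request)
```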

OpenAI reported that GPT-5.1-Codex-Max used 30% fewer thinking tokens than GPT-5.1 Codex on SWE-bench Verified at the same medium reasoning level.

Before wiring this model into production tooling, confirm the current model ID, account access, rate limits, and pricing in the provider documentation.

For cost control, avoid sending an entire repository by default. Start with the issue description, the relevant files, failing test output, dependency versions, and the commands the agent may run. Add more context only when the model asks for it or when the first pass shows that the missing context matters.
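
One way to apply this is to assemble the prompt from explicit parts under a size budget. The function below is a hypothetical sketch: the section labels and the character-based budget are illustrative choices, and a production version would more likely count tokens and include dependency versions as well.

```python
def build_prompt(issue: str, files: dict[str, str], test_output: str,
                 allowed_commands: list[str], max_chars: int = 20_000) -> str:
    """Assemble a focused prompt; skip any file that would exceed the budget."""
    parts = [
        f"Issue:\n{issue}",
        f"Failing test output:\n{test_output}",
        "Allowed commands:\n" + "\n".join(allowed_commands),
    ]
    used = sum(len(p) for p in parts)
    for path, content in files.items():
        block = f"File: {path}\n{content}"
        if used + len(block) > max_chars:
            continue  # leave it out; add it later only if it proves necessary
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


prompt = build_prompt(
    issue="parse_date rejects ISO week dates",
    files={"dates.py": "def parse_date(s): ...", "vendor_bundle.js": "x" * 50_000},
    test_output="1 failed, 41 passed",
    allowed_commands=["pytest -q"],
)
print("vendor_bundle.js" in prompt)
# False
```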

A production setup should also log the model ID, reasoning level, prompt size, output size, commands executed, files changed, and test results. Those records make it easier to compare runs, control cost, and review failures.
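
A minimal sketch of such a record, written as one JSON line per run so it can be appended to a log file. The field names are hypothetical and simply mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    model_id: str
    reasoning_level: str
    prompt_tokens: int
    output_tokens: int
    commands: list[str] = field(default_factory=list)
    files_changed: list[str] = field(default_factory=list)
    tests_passed: bool = False


def to_jsonl(record: RunRecord) -> str:
    """Serialize one run as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = RunRecord(
    model_id="gpt-5.1-codex-max",
    reasoning_level="medium",
    prompt_tokens=18_400,
    output_tokens=2_150,
    commands=["pytest -q"],
    files_changed=["dates.py"],
    tests_passed=True,
)
print(to_jsonl(record))
```

Keeping one line per run makes it straightforward to filter by model ID or reasoning level when comparing cost and failure rates later.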

When latency, reasoning depth, and tool use affect the deployment choice, cost and access planning can also compare GPT-5.1 Codex with Grok 4.

Community threads on Reddit can surface recent reports about limits or setup steps, but treat them as leads to confirm against the provider documentation.

Real-World Coding Benchmarks

OpenAI published the following results.

| Benchmark | GPT-5.1 Codex | GPT-5.1-Codex-Max |
| --- | --- | --- |
| SWE-bench Verified | 73.7% | 77.9% |
| SWE-Lancer IC SWE | 66.3% | 79.9% |
| Terminal-Bench 2.0 | 52.8% | 58.1% |

The Codex-Max results used Extra High reasoning effort. Compaction was enabled.
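
The gaps between the two columns are differences in percentage points, not relative improvements. A short calculation over the published figures makes them explicit:

```python
# Published scores in percent: (GPT-5.1 Codex, GPT-5.1-Codex-Max).
SCORES = {
    "SWE-bench Verified": (73.7, 77.9),
    "SWE-Lancer IC SWE": (66.3, 79.9),
    "Terminal-Bench 2.0": (52.8, 58.1),
}


def point_gains(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the Codex-Max gain over GPT-5.1 Codex in percentage points."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


for name, gain in point_gains(SCORES).items():
    print(f"{name}: +{gain} points")
# SWE-bench Verified: +4.2 points
# SWE-Lancer IC SWE: +13.6 points
# Terminal-Bench 2.0: +5.3 points
```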

SWE-bench Verified tests whether an agent can resolve real repository issues.

SWE-Lancer focuses on economically relevant software tasks.

Terminal-Bench 2.0 measures terminal-based agent work.

These results show strong coding performance. They do not guarantee the same result on every repository.

Real performance depends on:

  • Repository structure
  • Prompt quality
  • Tool permissions
  • Available dependencies
  • Test coverage
  • Retry behavior
  • Review quality

OpenAI also reported that its engineers produced about 70% more pull requests after adopting Codex.

That was an internal productivity observation. It does not prove that the model alone caused the increase.

A reliable evaluation should use the same repository, task, tests, and review criteria for every model.

For internal engineering use, benchmark the workflow on tasks that resemble real work: fixing a failing test, migrating an API call, adding a guarded feature, refactoring a module, or preparing a pull request with tests. Track whether the agent changed the right files, ran the right commands, preserved existing behavior, and explained remaining risks.

The most useful benchmark is not only pass or fail. It should also measure review effort, number of retries, time to usable diff, test coverage added, and whether the final change is small enough for a human engineer to approve.
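
A sketch of how those measurements might be aggregated per model or workflow. The `TaskResult` fields are hypothetical names for the metrics listed above, and the numbers in the usage example are made up purely to show the shape of the output.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    task: str
    passed: bool
    retries: int
    minutes_to_usable_diff: float
    review_minutes: float
    lines_changed: int


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate pass rate alongside effort metrics; expects at least one result."""
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in results),
        "mean_retries": mean(r.retries for r in results),
        "mean_minutes_to_diff": mean(r.minutes_to_usable_diff for r in results),
        "mean_review_minutes": mean(r.review_minutes for r in results),
        "mean_lines_changed": mean(r.lines_changed for r in results),
    }


# Illustrative values only, not measured results.
results = [
    TaskResult("fix failing test", True, 1, 12.0, 5.0, 18),
    TaskResult("migrate API call", False, 2, 30.0, 15.0, 140),
]
print(summarize(results))
```

Running the same task list through each model and comparing these summaries side by side keeps the evaluation tied to review effort and diff size, not just pass or fail.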

To reproduce these tasks from a terminal workflow, use Codex CLI and record the commands, test results, retries, and review notes alongside each run.

Before you budget a real project around GPT-5.1 Codex, compare the claims here with OpenAI documentation.

Legacy GPT-5.1 Codex Use in Verdent

GPT-5.1-Codex-Max should not be the default choice for a new integration. It has been superseded by GPT-5.2, GPT-5.3, and later Codex generations.

Use it when you need to reproduce a legacy workflow. Use a current Codex model for new development.

Verdent officially supports Codex through BYOA.

A practical setup is:

  1. Install and authenticate Codex.
  2. Open Verdent.
  3. Go to Settings → Models → CLI Agents.
  4. Enable Codex.
  5. Refresh the available models.
  6. Select a current Codex model for Worker tasks.
  7. Use Plan Mode to define the task.
  8. Review completed changes before integration.

This keeps Codex as the execution agent. Verdent adds planning, parallel work, workspace isolation, and review.

For legacy reproduction, keep the environment as stable as possible. Record the model ID, package versions, test command, repository commit, and agent configuration. That makes it easier to understand whether a difference comes from the model, the codebase, the dependencies, or the surrounding workflow.
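
A reproduction record of this kind could be captured in a small manifest. The sketch below is one hypothetical layout: the commit hash is passed in as a string (for example from `git rev-parse HEAD`), and package versions are read from the local environment with the standard library.

```python
import json
import platform
from importlib import metadata


def build_manifest(model_id: str, repo_commit: str, test_command: str,
                   packages: list[str], agent_config: dict) -> dict:
    """Collect the details needed to rerun a legacy workflow later."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "model_id": model_id,
        "repo_commit": repo_commit,
        "test_command": test_command,
        "python": platform.python_version(),
        "package_versions": versions,
        "agent_config": agent_config,
    }


manifest = build_manifest(
    model_id="gpt-5.1-codex-max",
    repo_commit="<commit hash>",  # placeholder; fill in from the repository
    test_command="pytest -q",
    packages=["openai", "pytest"],
    agent_config={"reasoning": "medium"},
)
print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the run output lets a later comparison rule out dependency or configuration drift before attributing a difference to the model.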

For new work, use Verdent to keep the long-horizon pattern while upgrading the model. Plan Mode defines the objective, Workers execute scoped tasks, Git worktrees isolate changes, and review catches issues before merge.

Frequently Asked Questions

Is GPT-5.1 Codex still available?

GPT-5.1-Codex-Max has been superseded by newer Codex generations. Existing access may vary by account, product surface, and provider availability. New integrations should use a current supported Codex model.

Is GPT-5.1 Codex the same as GPT-5.1-Codex-Max?

No. GPT-5.1 Codex refers to the Codex model generation, while GPT-5.1-Codex-Max was the upgraded long-task version. Codex-Max produced the stronger published benchmark results and used compaction for longer coding sessions.

Can GPT-5.1-Codex-Max really work for 24 hours?

OpenAI observed tasks lasting more than 24 hours in internal evaluations. This was not a guaranteed runtime, completion rate, or quality level. Long sessions still need clear requirements, tool limits, test runs, checkpoints, and human review.

Is GPT-5.1 Codex the same as Codex CLI?

No. GPT-5.1 Codex is a model family. Codex CLI is a coding agent interface that can use supported Codex models. The model supplies coding capability; the CLI supplies the local workflow, file access, command execution, and developer interaction.

Does Verdent support Codex?

Yes. Verdent supports Codex through BYOA. Codex must be installed and authenticated on the local system, then enabled as a CLI agent in Verdent so it can be used for Worker tasks.

Should I start a new project with GPT-5.1-Codex-Max?

No. Use a current supported Codex model for new development. Keep GPT-5.1-Codex-Max only for legacy compatibility, historical comparison, or reproducing an older workflow.

Keep the Long-Horizon Pattern, Upgrade the Model

GPT-5.1 Codex has been superseded by newer Codex generations. Preserve the compaction-based workflow, checkpointed execution, and review discipline, but run new development on a current supported model.
