Gemini Flash-Lite: Thinking Levels

By Hanks Engineer
How to Use Thinking Levels in Gemini 3.1 Flash-Lite

I'll be honest — when Google launched Gemini 3.1 Flash-Lite on March 3, 2026, the speed numbers were what grabbed everyone's attention. 381 tokens per second. 2.5x faster time-to-first-token than 2.5 Flash. Fair enough. But the feature I keep coming back to is thinking levels — and almost nobody is talking about it seriously.

Here's why it matters: thinking levels let you run the same model across your entire workload, from a real-time translation pipeline to complex UI generation, and tune the reasoning depth per request. No model switching. No routing logic. Just one API parameter. For developers building at scale, that's a bigger deal than the raw speed number. This guide walks through exactly how it works and how to set it up.

What Are Thinking Levels?

Thinking levels control how deeply Gemini 3.1 Flash-Lite reasons internally before generating a response. Per the official Gemini 3 API documentation, you set this via the thinking_level parameter, which controls the maximum depth of the model's internal reasoning process. Critically, Google treats these levels as relative allowances for thinking, not strict token guarantees — actual token usage varies by task complexity.

Four levels are available: Minimal, Low, Medium, and High.

Level   | Reasoning Depth              | Latency  | Token Cost | Default?
--------|------------------------------|----------|------------|---------
Minimal | Near-zero internal reasoning | Fastest  | Lowest     | No
Low     | Basic reasoning pass         | Fast     | Low        | No
Medium  | Moderate reasoning chain     | Moderate | Moderate   | No
High    | Full internal deliberation   | Slowest  | Highest    | Yes

One thing to internalize immediately: if you don't set thinking_level, the model defaults to High. That means every API call you make without explicitly setting this parameter will use maximum reasoning tokens — and maximum cost. Set it deliberately.

Why Google Added This Feature

Before Gemini 3, controlling reasoning depth meant managing a numeric thinking_budget (a token count). It was functional but imprecise — developers had to guess how many tokens corresponded to "enough" reasoning for a given task type. The new thinking_level system replaces this with semantic labels that the model interprets as relative guidance.

The practical upshot: you can now architect a single-model pipeline where your simple classification calls use minimal, your content moderation uses low, and your UI generation uses high — all against the same gemini-3.1-flash-lite-preview endpoint. One model, one billing line, granular control. Google calls this "intelligence at scale," and the architecture actually backs that claim up.

Migration note: If you used thinking_budget: 0 with Gemini 2.5 Flash to disable thinking, use thinking_level: "minimal" for equivalent behavior in Flash-Lite. Do not send both parameters in the same request — it will return a 400 error. Per the Vertex AI documentation, thought signatures must still be handled even at minimal thinking level.
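The migration can be sketched as two request bodies side by side. These are plain dicts mirroring the REST API's camelCase generationConfig fields; treat the exact field names as assumptions to verify against the current docs before relying on them:

```python
import json

# Gemini 2.5 Flash era: disable thinking with a zero token budget.
legacy_payload = {
    "generationConfig": {"thinkingConfig": {"thinkingBudget": 0}}
}

# Gemini 3.1 Flash-Lite: the equivalent request uses a thinking level.
new_payload = {
    "generationConfig": {"thinkingConfig": {"thinkingLevel": "minimal"}}
}

# A single request must carry one of these, never both: mixing a budget
# and a level in the same request returns a 400 error.
print(json.dumps(new_payload, indent=2))
```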

How to Set Thinking Levels in AI Studio

In Google AI Studio, thinking level appears as a dropdown in the model settings panel when gemini-3.1-flash-lite-preview is selected. Select your level before running a prompt — it applies to the current session.

For API calls, set it in the thinking config of your request (shown here with the google-genai Python SDK, which nests it under thinking_config):

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Classify this support ticket: 'My payment didn't go through'",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low")
    ),
)

print(response.text)

For REST API calls:

curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-lite-preview:generateContent?key=$GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Translate to French: Hello, how are you?"}]}],
    "generationConfig": {
      "thinking_level": "minimal"
    }
  }'

Low Thinking — When to Use It

Low thinking applies a basic reasoning pass before responding. It's faster and cheaper than medium or high, while still catching nuances that pure pattern-matching would miss.

Best for:

  • Sentiment analysis and tone classification
  • Simple data extraction from structured inputs
  • FAQ matching and intent detection
  • Translation of standard content (not highly technical or idiomatic)
  • Content moderation on clear-cut cases

A real-world data point: one early-access developer reported sub-2-second completions at minimal thinking for customer support ticket classification, versus 5 seconds at medium — with medium catching nuances the quick pass missed. Low sits between those two points and is the right default for most high-volume, moderate-complexity pipelines.

# Low thinking — good for classification at scale
config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_level="low")
)

High Thinking — When to Use It

High thinking enables full internal deliberation. The model reasons through the problem before generating output, which meaningfully improves accuracy on tasks with ambiguity, multi-step logic, or hierarchical structure.

Best for:

  • UI and dashboard generation (HTML, React, structured JSON)
  • Complex instruction-following with multiple constraints
  • Code generation where correctness matters more than speed
  • System simulations or multi-step agentic tasks
  • Structured output that must maintain logical consistency across long sequences

Early testers at Latitude reported a 20% higher success rate and 60% faster inference compared to their previous model when using Flash-Lite with high thinking for complex storytelling tasks. Whering reported 100% consistency in item tagging — a task requiring precise multi-attribute classification — using the model in their production pipeline.

# High thinking — for complex generation tasks
config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_level="high")
)

How Thinking Levels Affect Cost

Thinking tokens are billed at the same rate as output tokens. Since thinking_level controls how many thinking tokens the model generates internally, your choice directly affects your API bill.
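To make that concrete, a per-request cost can be estimated from a response's usage metadata. The UsageMetadata dataclass below is a stand-in for the SDK's response.usage_metadata object; the field names mirror its attributes but should be treated as assumptions, and the rates are the Flash-Lite prices quoted below:

```python
from dataclasses import dataclass

IN_RATE = 0.25 / 1_000_000    # $ per input token
OUT_RATE = 1.50 / 1_000_000   # $ per output token (thinking tokens included)

@dataclass
class UsageMetadata:
    prompt_token_count: int      # input tokens
    candidates_token_count: int  # visible output tokens
    thoughts_token_count: int    # internal thinking tokens

def request_cost(u: UsageMetadata) -> float:
    # Thinking tokens bill at the output rate, so they join the output term.
    return (u.prompt_token_count * IN_RATE
            + (u.candidates_token_count + u.thoughts_token_count) * OUT_RATE)

# A high-thinking call: 200 tokens in, 400 visible out, 300 thinking
print(f"${request_cost(UsageMetadata(200, 400, 300)):.6f}")  # $0.001100
```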

Pricing baseline for Gemini 3.1 Flash-Lite (as of March 2026, per Google's official pricing):

Token Type                         | Price per 1M tokens
-----------------------------------|--------------------
Input                              | $0.25
Output (including thinking tokens) | $1.50

The cost difference between levels compounds at scale. Here's what that looks like for a pipeline processing 1,000 requests per day with ~400-token average outputs:

Thinking Level | Avg. Output Tokens | Daily Cost (est.) | Monthly Cost (est.)
---------------|--------------------|-------------------|--------------------
Minimal        | ~420               | ~$0.63            | ~$18.90
Low            | ~470               | ~$0.71            | ~$21.30
Medium         | ~560               | ~$0.84            | ~$25.20
High           | ~700+              | ~$1.05+           | ~$31.50+

Estimates based on output-only cost at $1.50/1M tokens. Actual thinking token usage varies by task complexity.

The practical guidance is straightforward: route 80% of daily tasks to low or minimal and reserve high for the 20% that genuinely require deep reasoning. Against defaulting to high on every call, that cuts thinking-token spend by well over half; total output-side spend falls by roughly a third at the volumes above.
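The table's estimates make this easy to sanity-check. The sketch below prices an 80/20 split between minimal and high against an all-high default; since the ~400 base output tokens are billed either way, total output-side savings land near a third, while the thinking tokens themselves drop by far more:

```python
PRICE_PER_M = 1.50            # $ per 1M output + thinking tokens
REQUESTS_PER_DAY = 1_000

def daily_cost(avg_tokens_per_request: float) -> float:
    # Output-side spend for one day's traffic at a given per-request average.
    return REQUESTS_PER_DAY * avg_tokens_per_request * PRICE_PER_M / 1_000_000

all_high = daily_cost(700)                               # every call at high
routed = 0.8 * daily_cost(420) + 0.2 * daily_cost(700)   # 80/20 split

print(f"all high: ${all_high:.2f}/day")           # $1.05/day
print(f"routed:   ${routed:.2f}/day")             # $0.71/day
print(f"saved:    {1 - routed / all_high:.0%}")   # 32%
```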

One architectural pattern worth noting: because Flash-Lite supports thinking level control per request, you can use it as a routing layer. Flash-Lite at minimal reads incoming requests, determines complexity, and routes accordingly — simple tasks stay at low, complex tasks escalate to high or to a larger model. It's fast and cheap enough that the routing overhead doesn't meaningfully increase latency.
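One way to sketch that routing pattern is below. classify_complexity is a hypothetical placeholder: in production it would itself be a Flash-Lite call at minimal thinking that returns one of the three labels.

```python
# Map a triage label to the thinking level used for the real call.
LEVEL_BY_COMPLEXITY = {
    "simple": "low",
    "moderate": "medium",
    "complex": "high",
}

def classify_complexity(request_text: str) -> str:
    # Hypothetical stand-in heuristic. In production this would be a
    # Flash-Lite call at minimal thinking level returning a label.
    text = request_text.lower()
    if any(kw in text for kw in ("generate", "build", "simulate")):
        return "complex"
    if len(text) < 120:
        return "simple"
    return "moderate"

def pick_thinking_level(request_text: str) -> str:
    return LEVEL_BY_COMPLEXITY[classify_complexity(request_text)]

print(pick_thinking_level("Translate to French: Hello"))                    # low
print(pick_thinking_level("Generate a React dashboard with three charts"))  # high
```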

Recommended Thinking Levels by Use Case

Based on the task taxonomy from Google DeepMind's Flash-Lite model page and real-world production data from early-access developers:

Use Case                              | Recommended Level | Rationale
--------------------------------------|-------------------|-----------------------------------------------
Real-time translation                 | Minimal           | Speed-critical; low ambiguity
Content moderation (clear violations) | Low               | Fast pass; clear signal
Sentiment & intent classification     | Low               | Pattern-rich; low reasoning depth needed
Data extraction from forms/docs       | Low               | Structured input; predictable output
Content moderation (edge cases)       | Medium            | Ambiguous cases need reasoning
Changelog / release note generation   | Medium            | Needs summarization logic across long inputs
FAQ drafting and response generation  | Medium            | Tone + accuracy balance
UI component generation (HTML/React)  | High              | Hierarchical structure; correctness matters
Complex code generation               | High              | Multi-constraint; logical consistency required
Agentic / multi-step task execution   | High              | State tracking across steps
Simulation generation                 | High              | Long-range logical consistency

Quick decision rule:

  • Is the task high-volume with clear, structured input? → Minimal or Low
  • Does the task involve ambiguous input or moderate reasoning? → Medium
  • Does the task require generating hierarchical structures or following complex multi-part instructions? → High
  • Are you unsure and API cost isn't a primary constraint? → Leave unset (defaults to High)

The model code for API access is gemini-3.1-flash-lite-preview — available now in Google AI Studio and Vertex AI. Preview status means no SLA and potential API changes before GA, so plan accordingly for production deployments.

Written by Hanks Engineer

As an engineer and AI workflow researcher, I have over a decade of experience in automation, AI tools, and SaaS systems. I specialize in testing, benchmarking, and analyzing AI tools, transforming hands-on experimentation into actionable insights. My work bridges cutting-edge AI research and real-world applications, helping developers integrate intelligent workflows effectively.