Optimized for long‑form planning and robust agentic behavior, Grok 4 features a 256k context window and excels at step‑by‑step problem solving, math, logic, and instruction alignment. While multimodal capabilities are limited, Grok 4 dominates in text‑only domains and outperforms previous models across multiple SOTA evaluations.
Grok 4 is designed for advanced reasoning and complex tool‑use workflows. Built on the Grok 3 architecture with 10× more reinforcement learning compute, it sets state‑of‑the‑art scores on tasks like ARC-AGI‑2, AIME25, and Humanity’s Last Exam (HLE).
xAI Grok 4 Description
Grok 4 is the latest large language model from xAI, designed for high-level reasoning, agentic behavior, and real-world task automation. It builds upon Grok 3’s architecture, but trains reasoning with 10× more compute and integrates tool use directly into its RLHF pipeline.
Technical Specification
Performance Benchmarks
Context Window: 256,000 tokens
Max Output: ~4,096 tokens
Training Regime: 10× more RL compute than Grok 3
Tool Use: Native, with strong multi-step support
Performance Metrics
SOTA on ARC-AGI-2: 15.9%
AIME 2025: 76.9% accuracy
Humanity’s Last Exam (HLE):
With tools: 44.4% overall, 50.7% on text-only section
Without tools: 25.4% (vs 21.6% Gemini 2.5 Pro)
Metrics
Key Capabilities
Multi-step reasoning across long contexts
Native tool-use through real/synthetic environments
Deterministic outputs (non-streamed)
Planning with API execution
Robust performance on AGI-style benchmarks
Optimal Use Cases
Autonomous Agents – Tool-executing systems with embedded planning
Advanced QA Systems – Multi-document inference with 256K context
Research & Evaluation – Long-horizon tasks with strong logic
Strategic Analysis – Business/research planning using structured inputs
Code Agents – Multi-step reasoning over toolchains and environments
Code Samples
Comparison with Other Models
vs. GPT‑4o: GPT‑4o leads in multimodality and web browsing. Grok 4 offers better reasoning performance and tool integration in AGI-style tasks.
vs. Claude 4 Opus: Claude 4 excels in language safety and alignment. Grok 4 outperforms on ARC-AGI-2 (15.9% vs 8.6%) and HLE, especially in tool-enabled setups.
vs. Gemini 2.5 Pro: Gemini is strong in speed and instruction following. Grok 4 surpasses in zero-shot reasoning and planning (HLE 25.4% vs 21.6% without tools).
vs. Grok 3: Grok 4 is a major upgrade over Grok 3, trained with 10× more RL compute and native tool-use instruction. It achieves 25.4% on Humanity’s Last Exam without tools (vs. Grok 3’s ~14.7%), and delivers better multi-step reasoning and factual recall.