Nemotron 3 Super 120B-A12B

With 120 billion total weights but only 12 billion active during any single inference call, you get the reasoning quality associated with large-scale models at a fraction of the compute cost per token.

A hybrid Mixture-of-Experts reasoning model that punches far above its active parameter count — running on just 12 billion active weights while drawing on the depth of 120 billion total parameters. Built for the realities of production-grade agentic systems.

What Is Nemotron 3 Super?

NVIDIA's Nemotron 3 Super 120B-A12B is part of the third generation of the Nemotron open model family — a series engineered specifically for building specialized, reliable AI agents rather than serving as a general-purpose chatbot. The "Super" designation marks a meaningful architectural step up from the lighter Nano variant, introducing several capabilities that simply weren't present before.

NVIDIA achieved this through its LatentMoE approach, in which expert routing happens in a compressed latent space rather than at full model dimensionality, making the system smarter about which experts to engage, not just how many.

Architecture

LatentMoE

Tokens are first projected into a compressed latent space, where both expert routing and expert computation take place. Because the experts operate on this smaller representation, engaging four experts costs roughly the compute of a single expert at full model dimensionality. That is what makes the 12B active parameter count feel much larger than it is.
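The routing idea can be sketched in a few lines. Everything below is illustrative: the dimensions, expert count, and top-k value are placeholders, not the model's published configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- not the model's actual configuration.
d_model, d_latent = 4096, 512
n_experts, top_k = 64, 4

W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # projection into latent space
router = rng.standard_normal((d_latent, n_experts)) * 0.02  # routing weights live in latent dim

def route(token_vec):
    """Score experts in the compressed latent space and pick the top-k."""
    z = token_vec @ W_down             # (d_latent,) -- cheaper than routing at d_model
    logits = z @ router                # (n_experts,)
    top = np.argsort(logits)[-top_k:]  # indices of the k best-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()            # softmax weights over the chosen experts

experts, weights = route(rng.standard_normal(d_model))
```

The point of the sketch is the cost asymmetry: the routing matmul is sized by d_latent, not d_model, so choosing among many experts stays cheap.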

Mamba-2 + Attention Hybrid

Rather than relying purely on attention, the model interleaves Mamba-2 state-space layers with selective attention blocks. Mamba-2 handles long-range context efficiently; attention handles local precision. The combination is faster than a dense transformer at long contexts.
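The interleaving can be pictured as a simple layer schedule. The 1-in-4 attention ratio below is purely illustrative, not the model's published layout:

```python
def layer_schedule(n_layers, attn_every=4):
    """Toy schedule: mostly Mamba-2 blocks with a periodic attention block.

    The attn_every=4 ratio is an illustrative placeholder."""
    return ["attention" if (i + 1) % attn_every == 0 else "mamba2"
            for i in range(n_layers)]

print(layer_schedule(8))
# ['mamba2', 'mamba2', 'mamba2', 'attention',
#  'mamba2', 'mamba2', 'mamba2', 'attention']
```

Because most layers carry constant-size state-space memory rather than a growing attention cache, long-context inference cost grows far more slowly than in a dense transformer.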

Multi-Token Prediction (MTP)

Unlike models that predict one token at a time, Nemotron 3 Super uses MTP layers as a native speculative decoding mechanism. This drives the reported 167+ tokens/sec output speed: it is not just hardware, but a capability baked into the model weights.
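The accept/reject loop at the heart of speculative decoding can be sketched as follows. The `verify` callable is a stand-in for the full model's check, not a real API; in practice all positions are verified in a single forward pass.

```python
def speculative_step(draft_tokens, verify):
    """Accept the longest prefix of drafted tokens the full model agrees with.

    draft_tokens: tokens proposed cheaply (e.g. by MTP heads) in one pass.
    verify: stand-in callable returning the full model's token at a position.
    """
    accepted = []
    for pos, tok in enumerate(draft_tokens):
        expected = verify(pos)
        if tok != expected:
            accepted.append(expected)  # take the model's token and stop drafting
            break
        accepted.append(tok)           # draft matched: this token came almost free
    return accepted

# Toy run: the draft gets 3 of 4 tokens right, so one verification
# pass yields 4 committed tokens instead of 1.
draft = [10, 11, 12, 99]
truth = [10, 11, 12, 13, 14]
print(speculative_step(draft, lambda pos: truth[pos]))  # [10, 11, 12, 13]
```

Output quality is unchanged, because every committed token is one the full model would have produced; the draft only decides how many tokens each pass commits.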

NVFP4 Pretraining

The Super model is the first in its family pretrained at NVFP4 precision rather than using it only for post-training quantization. This allowed training on the full 25T+ token corpus more efficiently without sacrificing the quality typically expected of BF16-trained models.

1 Million Token Context

Not just a theoretical maximum — the model outperforms both GPT-OSS-120B and Qwen3.5-122B on the RULER benchmark at the full 1M token setting. This matters for agent workflows where conversation and tool-use history must stay in context across hundreds of steps.

Configurable Reasoning Mode

Reasoning behavior is toggled via a flag in the chat template. When enabled, the model generates an internal reasoning trace before its final response — useful for complex multi-step tasks. When disabled, it responds directly, reducing latency for simpler queries.
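In practice the toggle is set when the chat messages are built. The flag string below ("detailed thinking on/off") follows the convention of earlier Nemotron releases and is an assumption; check the model card for the exact wording in this release.

```python
# Sketch of toggling reasoning via the chat template.
# The "detailed thinking on/off" flag is an assumed convention -- verify
# against the model card for your release.
def build_messages(user_prompt, reasoning=True):
    flag = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": flag},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Plan a three-step data migration.", reasoning=True)
print(msgs[0]["content"])  # detailed thinking on
```

Disabling the flag for simple queries skips the reasoning trace entirely, which is where the latency savings come from.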

Nemotron 3 Super API Pricing

  • $0.117 input
  • $0.585 output
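Assuming these prices are per million tokens (the usual convention, but confirm on the provider's pricing page), a per-call cost estimate is straightforward:

```python
INPUT_PER_M = 0.117   # assumed USD per 1M input tokens
OUTPUT_PER_M = 0.585  # assumed USD per 1M output tokens

def estimate_cost(input_tokens, output_tokens):
    """Dollar cost of one call, assuming per-million-token pricing."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# A long-context agent call: 200k tokens in, 4k tokens out.
print(f"${estimate_cost(200_000, 4_000):.4f}")  # $0.0257
```

Note how the arithmetic favors long-input workloads: a 200k-token prompt costs about 2.3 cents, so keeping a large agent history in context stays cheap relative to output generation.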

Benchmark Results Worth Knowing

Benchmark scores are only meaningful when you know what they're measuring. Below are the evaluations most relevant to real-world developer use cases — not cherry-picked academic tasks, but coding agents, long-context retrieval, math reasoning, and instruction-following under load.

  • AIME 2025 (competitive math reasoning): leading open score. Multi-environment RL training cited as the key driver.
  • SWE-Bench Verified (real GitHub issue resolution): top open-model tier. Tested as a full coding agent, not just code completion.
  • TerminalBench (CLI agent task completion): leading open model. Requires multi-step tool use in terminal environments.
  • PinchBench (coding agent accuracy): 85.6%, the best open-model result at time of release.
  • RULER @ 1M context (long-range retrieval accuracy): outperforms both GPT-OSS-120B and Qwen3.5-122B at the full 1M token length.
  • AA Intelligence Index v4 (composite reasoning): 36 against an average of 15, well above comparable open-weight models.
  • Throughput vs. GPT-OSS-120B: 2.2× faster at the 8k input / 64k output setting.
  • Throughput vs. Qwen3.5-122B: 7.5× faster at the same setting.

Where Nemotron 3 Super Fits Best

This isn't a model you deploy for simple question-and-answer flows. It's built for sustained, multi-step, tool-using workflows where context accumulates and reasoning chains span dozens of decisions. Here's where it genuinely earns its keep.

Multi-Agent Orchestration

Designed from the ground up for collaborative agent pipelines. The million-token context lets it track complete state across planner, researcher, and executor sub-agents without truncation.

Autonomous Coding Agents

SWE-Bench Verified and PinchBench scores reflect performance on actual repositories: navigating codebases, producing fixes, and executing commands in a terminal. Not just code generation.

Enterprise Chatbots & RAG

High-volume workloads like IT ticket triage are a primary design target. The MoE architecture keeps per-call compute costs low even when request volume is sustained.

Long-Document Analysis

Cross-document aggregation and multi-document reasoning were explicitly part of the fine-tuning data. Useful for legal review, technical due diligence, and research summarization.

Structured Output Generation

The model was explicitly fine-tuned on structured output tasks. JSON schema adherence, tool-call formatting, and instruction-following under complex constraints hold up reliably in production settings.
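Even with a reliable model, production code should validate structured output before acting on it. This stdlib-only sketch checks a hypothetical tool-call response; the tool name and fields are invented for illustration.

```python
import json

def parse_tool_call(raw, required_fields):
    """Parse a model's JSON tool call and verify required fields are present."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"model output is not valid JSON: {e}")
    missing = [f for f in required_fields if f not in obj]
    if missing:
        raise ValueError(f"tool call missing fields: {missing}")
    return obj

# Hypothetical model output for an IT-triage tool call.
raw = '{"tool": "create_ticket", "priority": "high", "summary": "VPN outage"}'
call = parse_tool_call(raw, ["tool", "priority", "summary"])
print(call["tool"])  # create_ticket
```

Failing loudly on malformed output lets the agent loop retry the call instead of silently executing a half-formed action.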

Math & Scientific Reasoning

Pre-training included substantial synthetic math and science data. Reinforcement learning across 10+ environments — including formal reasoning — drove the AIME 2025 benchmark results.
