Qwen 3.6 27B

Whether you are building complex agentic workflows, high-volume customer support bots, or sophisticated code assistants, Qwen 3.6-27B provides the optimal infrastructure for next-generation AI products.
Qwen3.6-27B is the first fully dense open-weight model in the Qwen3.6 series, and it beats a 397-billion-parameter model on agentic coding benchmarks.

Engineered for Speed. Validated for Accuracy

Benchmark-tested across coding (HumanEval), math (GSM8K), and reasoning (MMLU) tasks, this 27B-parameter model from Alibaba Cloud's Tongyi Lab delivers performance rivaling 70B-class systems, with 3–5× faster token throughput. Integrated via a low-latency API endpoint, it's the go-to choice for applications where every millisecond and every token budget matters — from real-time translation engines to autonomous decision agents.

API Pricing
  • Input: $0.78
  • Output: $4.68
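
The rates above can be turned into a per-request estimate. A minimal sketch, assuming the listed prices are per million tokens (the page does not state the unit, so verify against the provider's pricing docs):

```python
# Back-of-envelope request cost. ASSUMPTION: the listed rates
# ($0.78 input / $4.68 output) are per million tokens.
INPUT_PER_M = 0.78
OUTPUT_PER_M = 4.68

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# e.g. a 20K-token prompt with a 2K-token completion:
print(round(estimate_cost(20_000, 2_000), 5))
```

Note how output tokens dominate the bill at these rates, which is why thinking mode (which lengthens outputs) costs noticeably more per request.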

Numbers That Actually Matter

Benchmark results are only useful when they correlate with real tasks. Qwen's team shaped this release around direct community feedback, targeting agentic coding workflows rather than chasing leaderboard novelty. Here's how it stacks up.

  • Coding: Outperforms legacy 70B-class models in Python and JavaScript generation tasks.
  • Mathematics: Achieves state-of-the-art scores in GSM8K and MATH benchmarks for its size category.
  • Instruction Following: Demonstrates superior adherence to complex, multi-constraint prompts compared to previous generations.

A Hybrid Architecture Built for Long Contexts

Most models at this size either sacrifice depth for speed or pay the quadratic cost of full self-attention at every layer. Qwen3.6-27B takes a different approach — a layered hybrid that keeps computation linear for most of the stack, reserving full attention for where it counts.

Gated DeltaNet Linear Attention

Three out of every four sublayers use Gated DeltaNet — a linear attention mechanism with O(n) complexity instead of the O(n²) scaling of standard self-attention. This is what makes 262K-token contexts tractable without ballooning memory requirements.
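
To see why this is O(n), here is a toy gated linear attention pass in NumPy: instead of an n×n score matrix, a fixed-size state is updated once per token. This uses a scalar forget gate as a simplification; it illustrates the complexity argument, not the model's actual Gated DeltaNet kernel.

```python
import numpy as np

def gated_linear_attention(q, k, v, gate):
    """Single pass over the sequence: O(n) time, O(d_k * d_v) memory."""
    n, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))          # constant-size recurrent state
    out = np.empty((n, d_v))
    for t in range(n):
        # decay the old state, then write the new key/value association
        state = gate[t] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state             # read-out for this token
    return out

rng = np.random.default_rng(0)
n, d = 8, 4
y = gated_linear_attention(rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           np.full(n, 0.9))
print(y.shape)
```

The state never grows with sequence length, which is exactly why 262K-token contexts stay tractable in the linear layers.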

Repeating 16-Block Hybrid Pattern

The model's hidden layout repeats a specific pattern across 64 layers: three Gated DeltaNet → FFN blocks followed by one Gated Attention → FFN block. This 3:1 ratio is deliberate — it balances efficiency with expressiveness throughout the depth of the network.
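
The layout described above can be generated programmatically: 16 repetitions of a four-block unit (three linear-attention blocks, one full-attention block) cover all 64 layers.

```python
# The 3:1 hybrid pattern from the text: 16 repeats of
# [deltanet, deltanet, deltanet, attention] give 64 layers.
PATTERN = ["deltanet"] * 3 + ["attention"]
layers = PATTERN * 16

print(len(layers))                # 64
print(layers.count("attention"))  # 16
```

Only 16 of the 64 layers pay the quadratic attention cost; the other 48 run in linear time.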

Asymmetric Q/KV Head Counts

The attention layers use asymmetric query and key/value head configurations, reducing memory bandwidth at inference time without losing the representational capacity that matters for complex reasoning tasks.
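
The memory-bandwidth win comes from the KV cache scaling with key/value heads, not query heads. A sketch of the arithmetic, using illustrative head counts (the page does not publish the actual Q/KV configuration):

```python
# KV-cache size per token across the 16 full-attention layers.
# Head counts here are ILLUSTRATIVE, not the model's real config.
def kv_cache_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_el=2):
    # factor of 2 covers both keys and values; fp16 = 2 bytes/element
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_el

full = kv_cache_bytes_per_token(n_kv_heads=32, head_dim=128, n_layers=16)
grouped = kv_cache_bytes_per_token(n_kv_heads=4, head_dim=128, n_layers=16)
print(full // grouped)  # grouped KV heads shrink the cache 8x
```

At long contexts the KV cache dominates inference memory, so shrinking it directly raises the batch size a given GPU can serve.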

Integrated Vision Encoder

Qwen3.6-27B is natively multimodal — it processes text, images, and video through a built-in vision encoder trained end-to-end alongside the language model, not bolted on afterward.

Specification Details
Parameters 27 Billion
Architecture Dense Causal LM + Vision
Layers 64
Hidden Dim 5,120
FFN Intermediate 17,408
Token Vocab 248,320 (padded)
Context (native) 262,144 tokens
Context (extended) 1,010,000 tokens
Quantized Variant FP8 (block size 128)
Local Footprint ~18 GB (GGUF quant)
Modalities Text · Image · Video
License Apache 2.0
Frameworks vLLM · SGLang · HF · KTransformers

What's Actually New Here

Every major model release claims to be better. Here are the specific things Qwen3.6-27B does differently and why they matter for people building real systems.

Thinking Preservation

Qwen3.6-27B introduces a mechanism that retains reasoning traces across conversation turns. In multi-step agent workflows, this means the model doesn't re-derive conclusions it already worked through, reducing redundant token generation and improving KV cache efficiency significantly.
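
Conceptually, the difference from the usual pattern is whether reasoning traces are stripped from or kept in the conversation history. A sketch, with a hypothetical `reasoning_content` field standing in for whatever the real API returns:

```python
# Illustrative only: "reasoning_content" is a HYPOTHETICAL field name,
# not a documented API. The point is the keep-vs-strip distinction.
def build_history(turns, preserve_thinking):
    history = []
    for turn in turns:
        msg = {"role": turn["role"], "content": turn["content"]}
        if preserve_thinking and "reasoning_content" in turn:
            msg["reasoning_content"] = turn["reasoning_content"]
        history.append(msg)
    return history

turns = [
    {"role": "user", "content": "Plan the refactor."},
    {"role": "assistant", "content": "Step 1: split the module.",
     "reasoning_content": "The dependency graph shows a cycle, so..."},
]
stripped = build_history(turns, preserve_thinking=False)
kept = build_history(turns, preserve_thinking=True)
print("reasoning_content" in kept[1], "reasoning_content" in stripped[1])
```

With the trace kept, the model's next turn can build on the earlier derivation instead of regenerating it.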

Agentic Coding at Scale

The model is specifically tuned for frontend workflows and repository-level reasoning, navigating large codebases, editing across multiple files, and producing runnable output. Its QwenWebBench score of 1,487 is a 39% jump over Qwen3.5-27B, spanning web design, SVG, data visualization, 3D, and animation tasks.

Natively Multimodal

Text, images, and video are first-class inputs. The vision encoder is trained alongside the language model rather than glued on, which matters for tasks that require genuine cross-modal reasoning rather than just image captioning.

Hybrid Thinking Modes

The model supports both standard and extended reasoning modes. In thinking mode, reasoning traces can be preserved across turns. Developers can toggle enable_thinking at the API level to control when deep chain-of-thought is activated versus when fast inference is preferred.
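
In an OpenAI-compatible chat payload, the toggle looks something like the following. The model id and the exact placement of the flag are assumptions based on the `enable_thinking` name above; check the provider's API reference for the real shape.

```python
# Sketch of the two request shapes. ASSUMPTIONS: model id and
# top-level placement of "enable_thinking" (vendors sometimes nest
# such flags under an extra-body/chat-template-kwargs field).
def make_request(prompt: str, thinking: bool) -> dict:
    return {
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": thinking,  # True: chain-of-thought; False: fast path
    }

fast = make_request("Summarize this diff.", thinking=False)
deep = make_request("Prove that 17 is prime.", thinking=True)
print(fast["enable_thinking"], deep["enable_thinking"])
```

A practical pattern is to default to `False` and flip the flag only for requests your application classifies as multi-step.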

Tool Calling & MCP Support

Qwen3.6-27B supports structured tool calling out of the box, compatible with the OpenAI tool-call format. It integrates natively with Model Context Protocol (MCP) servers for filesystem access, web browsing, and custom tool orchestration in agentic pipelines.
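
A tool definition in the OpenAI tool-call format looks like this; the `read_file` function itself is a placeholder for whatever your agent exposes.

```python
import json

# Standard OpenAI-format tool schema; "read_file" is a placeholder tool.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the agent's workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# The list is passed as the "tools" field of a chat completion request;
# the model replies with tool_calls naming the function and arguments.
print(json.dumps(tools[0]["function"]["name"]))
```

Because the format matches OpenAI's, existing agent frameworks that emit this schema should work without adaptation.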

Applications

Autonomous Coding Agents

With SWE-bench Verified at 77.2%, this is one of the strongest open models available for building agents that write, edit, and debug code across real repositories. Terminal-Bench performance at 59.3% means it can operate in shell environments reliably over long sessions.

Frontend Code Generation

QwenWebBench covers web design, games, SVGs, data visualizations, and 3D. A score of 1,487, up from 1,068 for the previous 27B model, represents a meaningful jump in the quality of generated UI code, particularly for complex interactive layouts.

On-Premises / Privacy-First AI

An 18GB GGUF footprint on a single GPU with Apache 2.0 licensing means organizations that can't send data to external APIs now have a frontier-adjacent option they can host entirely inside their own infrastructure.

Long-Context Document Analysis

The 262K native context window, extensible to over one million tokens, makes it practical for tasks like codebase-wide reasoning, legal document review, long-form research synthesis, and processing large structured datasets in a single pass.

Multimodal Document Intelligence

The built-in vision encoder means the model can process PDFs with embedded images, annotated diagrams, screenshots, and video frames alongside text — useful for technical documentation parsing and visual QA workflows without needing a separate vision model.

Fine-Tuning & Research

Apache 2.0 means no commercial restrictions, and the dense architecture (no MoE routing complexity) is significantly easier to fine-tune than sparse models of similar capability. Pre-training on ~36 trillion tokens gives a strong general knowledge foundation to build on.

Common Questions

How does a 27B model beat a 397B model?

The short answer is that raw parameter count stopped being the main predictor of performance some time ago. Qwen3.6-27B benefits from a newer training recipe — higher-quality data, stronger reinforcement learning from human feedback, and an architecture optimized specifically for agentic coding tasks. The 397B model it outperforms is a mixture-of-experts design from an earlier generation, which means both the training and the evaluation criteria have evolved. That said, the 27B model only outperforms on the specific coding benchmarks listed — for broader general knowledge or other capability areas, larger models still tend to perform better.

How does the million-token context actually work?

The native context window is 262,144 tokens (~200K words). Beyond that, the model can be configured to extend to approximately 1,010,000 tokens using position interpolation techniques — but this extended range requires additional setup and performance can degrade on tasks requiring very precise retrieval from the far ends of a very long context. For most practical use cases, the native 262K window is sufficient and more reliable.
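
For Hugging Face-style deployments, context extension of this kind is typically enabled via a `rope_scaling` entry in `config.json`. The fragment below is a sketch in the YaRN convention used by recent Qwen releases; the exact keys and values for this model are assumptions, so confirm them against the model card before use.

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}
```

A factor of 4 over the 262,144-token native window is roughly what the ~1,010,000-token extended figure implies.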

What's the difference between standard and thinking modes?

In standard mode, the model generates responses directly without explicit chain-of-thought reasoning traces. This is faster and cheaper, suitable for most conversational or straightforward coding tasks. In thinking mode (enabled via enable_thinking: True), the model works through a reasoning process before producing its final answer. This increases output token count and latency, but substantially improves accuracy on complex multi-step problems. Thinking Preservation extends this by caching those reasoning traces across turns.
