JetBrains Mellum2: The 12B MoE Model That Makes Local AI Coding Viable

What Changed Today

JetBrains released Mellum2, a 12B-parameter Mixture-of-Experts (MoE) model specialized for software engineering tasks. It is open-source (Apache 2.0) and designed to be a fast, local "focal model" inside larger AI pipelines, not a replacement for frontier models like GPT-4 or Claude.

Key specs:

12B total params, 2.5B active per token: all 12B must fit in memory, but per-token compute is closer to a 2.5B model, so it runs on a single GPU or even CPU with quantization
MoE architecture: 64 experts, 8 activated per token
128K context window — handles large codebases
Multi-Token Prediction (MTP) head for speculative decoding speedups
Trained on ~10.6T tokens with a three-phase curriculum shifting from web → code → math
Six checkpoints released covering the full training run
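
To make "64 experts, 8 activated per token" concrete, here is a minimal sketch of generic top-k expert routing. It is an illustration of the general technique, not Mellum2's actual router: the random logits stand in for a learned router's output, and real implementations may normalize differently or add load-balancing terms.

```python
import math
import random

NUM_EXPERTS = 64  # from the spec list above
TOP_K = 8         # experts activated per token


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected experts (subtract the max for numerical stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
# Stand-in for the router's output for one token (assumption: random values).
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)
print(len(chosen), "of", NUM_EXPERTS, "experts active")
print("weights sum to", round(sum(w for _, w in chosen), 6))
```

Only the chosen experts run for that token, which is why per-token compute tracks the active parameter count rather than the total.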

Source: MarkTechPost, The New Stack

Why This Matters for Developers and Business Owners

1. Local AI Becomes Actually Usable

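Many earlier local coding models forced a trade-off: small ones (2B-4B) were often too weak to be useful, and large ones (70B+) needed expensive hardware. Mellum2 sits in the sweet spot: 12B total parameters, but only 2.5B active per token thanks to MoE routing. The full 12B still has to fit in memory, while per-token compute is closer to that of a 2.5B model. That means: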

Run on a single RTX 3090/4090 (24GB VRAM), most comfortably with an 8-bit or smaller quantized build
Or even CPU with 4-bit quantization for lighter tasks
No API costs, no data leaving your machine

For teams handling proprietary code or working under compliance constraints, this removes a common blocker: the code never has to leave the machine.
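
As a sanity check on the hardware claims above, the weight footprint can be estimated from the total parameter count. This is a rough sketch: the bits-per-weight figures are the nominal values for those llama.cpp formats, and the estimate ignores the KV cache, activations, and any tensors a real build keeps at higher precision.

```python
TOTAL_PARAMS = 12e9    # total parameters, from the spec list
ACTIVE_PARAMS = 2.5e9  # parameters used per token

# Nominal bits per weight for each format (treat as approximate).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_0": 4.5}


def weight_gib(params, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: {weight_gib(TOTAL_PARAMS, bits):5.1f} GiB of weights")

# Active parameters reduce compute per token, not the memory footprint.
print(f"Active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

FP16 weights alone come to roughly 22 GiB, which leaves almost nothing of a 24GB card for the KV cache, so a quantized build is the realistic option on consumer GPUs.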

2. It's a "Focal Model" — Not Trying to Be Everything

JetBrains is honest about positioning: Mellum2 is a specialized component, not a general-purpose replacement. Use it for:

Code completion and inline suggestions
Debugging assistance
Refactoring and code review
Function calling and tool use

Use a frontier model for architecture decisions and Mellum2 for the roughly 80% of daily coding tasks that are routine.
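
One way to picture the "focal model" split is a small routing rule in front of two backends. The sketch below is hypothetical: the task names mirror the list above, and the backend labels and the `contains_proprietary_code` flag are placeholders rather than part of any JetBrains API.

```python
from dataclasses import dataclass

# Task categories taken from the list above (names are illustrative).
LOCAL_TASKS = {"completion", "debugging", "refactoring", "code_review", "tool_call"}


@dataclass(frozen=True)
class Route:
    backend: str  # "local" (Mellum2) or "frontier" (hosted model)
    reason: str


def choose_backend(task_type, contains_proprietary_code=False):
    """Pick a backend for one request using simple, explicit rules."""
    if contains_proprietary_code:
        return Route("local", "code must stay on the machine")
    if task_type in LOCAL_TASKS:
        return Route("local", "routine coding task")
    return Route("frontier", "needs broader reasoning")


print(choose_backend("completion"))
print(choose_backend("architecture"))
print(choose_backend("architecture", contains_proprietary_code=True))
```

Keeping the rules explicit makes it easy to audit which requests are ever allowed to leave the machine.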

3. Cost Math for Business Owners

Approach	Monthly cost (10 devs)	Data control
GitHub Copilot Business	$190/mo ($19 per user)	Microsoft-hosted
Claude Code API	$500-2000/mo (variable)	Anthropic-hosted
Mellum2 (local)	$0 in fees (hardware and power not included)	Fully local
Mellum2 + small cloud GPU	~$50-100/mo	You control the box

For bootstrapped teams or agencies with tight margins, local models cut a real monthly expense.
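
"Hardware amortized" hides a break-even calculation. A minimal sketch follows, using the $190/mo subscription figure from the table. The hardware price and monthly running cost are hypothetical placeholders, so substitute your own numbers.

```python
import math


def months_to_break_even(hardware_cost, monthly_subscription, monthly_running_cost=0.0):
    """Months until a one-off hardware purchase costs less than subscribing.

    Returns None when running the hardware costs as much as the subscription.
    """
    monthly_saving = monthly_subscription - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


HARDWARE_COST = 2000.0  # hypothetical price of a 24GB GPU workstation upgrade
SUBSCRIPTION = 190.0    # 10-developer subscription row from the table
RUNNING_COST = 20.0     # hypothetical monthly electricity and upkeep

print(months_to_break_even(HARDWARE_COST, SUBSCRIPTION, RUNNING_COST), "months to break even")
```

The calculation assumes one shared machine serves the whole team; per-developer hardware multiplies the upfront cost accordingly.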

How to Use It

Quick Start with Ollama

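# Pull the model (once it is published in the Ollama library; expected within days)
ollama pull mellum2:12b

# Or run from Hugging Face with llama.cpp (needs GGUF-format weights)
# Download weights from https://huggingface.co/jetbrains

# Start an interactive session. The Ollama server itself (port 11434) is
# started by the desktop app or with `ollama serve`.
ollama run mellum2:12b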

Integration with JetBrains IDEs

JetBrains will likely integrate Mellum2 into AI Assistant as a local model option. Until then:

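# Configure a local model in JetBrains AI Assistant (menu names vary by IDE version)
# Settings → Tools → AI Assistant → Models: enable the Ollama provider, URL http://localhost:11434
# OpenAI-compatible clients can instead use http://localhost:11434/v1/chat/completions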

VS Code / Continue.dev Setup

// .continue/config.json
{
  "models": [
    {
      "title": "Mellum2 Local",
      "provider": "ollama",
      "model": "mellum2:12b",
      "apiBase": "http://localhost:11434"
    }
  ]
}

API Example (Python)

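import requests

# Ollama's native generate endpoint; "stream": False returns one JSON object.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mellum2:12b",
        "prompt": "Refactor this Python function to use list comprehensions:\n\ndef get_even(nums):\n    result = []\n    for n in nums:\n        if n % 2 == 0:\n            result.append(n)\n    return result",
        "stream": False,
    },
    timeout=120,  # the first request can be slow while the model loads
)
response.raise_for_status()  # surface HTTP errors, e.g. an unknown model name

print(response.json()["response"])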
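
For interactive use, streaming shows tokens as they arrive. With streaming enabled, Ollama's `/api/generate` returns newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. The sketch below uses only the standard library; the `mellum2:12b` tag is the same assumed model name as above, and the helper names are illustrative.

```python
import json
import urllib.request


def iter_fragments(lines):
    """Yield the "response" text from each newline-delimited JSON chunk."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            return


def stream_generate(prompt, model="mellum2:12b", host="http://localhost:11434"):
    """POST to /api/generate with streaming on and yield text as it arrives."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # The HTTP response object iterates line by line.
        yield from iter_fragments(response)


# Offline check of the parser with two hand-written chunks (no server needed).
sample = [
    b'{"response": "def ", "done": false}\n',
    b'{"response": "f(): ...", "done": true}\n',
]
print("".join(iter_fragments(sample)))
```

With a server running, `for piece in stream_generate("Explain this stack trace"): print(piece, end="", flush=True)` prints the answer incrementally.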

Production Notes / Gotchas

MoE models need careful quantization. Aggressive 4-bit quantization may degrade output quality, including expert routing. Use Q6_K or Q8_0 for critical work.
Context window is 128K, but memory scales with it. The KV cache grows linearly with the number of tokens in context, so long files eat VRAM. Split large files into chunks for review tasks.
Speculative decoding with the MTP head. The built-in draft head can speed up decoding by up to about 2x, but it requires a compatible inference engine (llama.cpp dev branch or vLLM).
Not multimodal. No image input. Stick to text and code tasks.
Apache 2.0 = commercial use OK. If you redistribute the model, include a copy of the license and keep any copyright and NOTICE attributions; check with your legal team.
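
The advice to split large files into chunks can be sketched as a line-based splitter with a small overlap so that context is not lost at chunk boundaries. The sizes below are arbitrary assumptions; a production version would count tokens rather than lines and would ideally split on function or class boundaries.

```python
def chunk_lines(text, max_lines=200, overlap=20):
    """Split text into chunks of at most max_lines lines.

    Consecutive chunks share `overlap` lines so boundary context is kept.
    """
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")
    lines = text.splitlines()
    step = max_lines - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break  # this chunk already reached the end of the file
    return chunks


# A 450-line stand-in for a large source file.
source = "\n".join(f"line {i}" for i in range(1, 451))
pieces = chunk_lines(source)
print(len(pieces), "chunks")
```

Each chunk can then be sent as its own review request, which keeps the per-request context, and therefore KV-cache memory, bounded.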

Bottom Line

Mellum2 is a practical, open-source coding model that doesn't require enterprise hardware or enterprise budgets. For teams already using JetBrains tools, it's a natural upgrade path. For everyone else, it's evidence that the "local AI" promise is becoming real.

Try it if: You pay for Copilot/Claude Code and wonder if there's a cheaper way.
Skip it if: You need frontier-level reasoning for architecture or complex debugging. Use a frontier model such as GPT-4 for that work.

Published on Build With Abdallah
Questions? Email: buildwithabdallah@gmail.com