Available for Q3 2026 projects — Laravel, AI agents & automation
Build With Abdallah logo Build With Abdallah Software · AI · Automation
AI Agents 4 min read Jun 02, 2026

JetBrains Mellum2: The 12B MoE Model That Makes Local AI Coding Viable

JetBrains just open-sourced Mellum2, a 12B Mixture-of-Experts coding model that runs fast on consumer hardware. Here's what changed, why it matters for your team's budget, and how to plug it into your workflow.

A
Abdallah Mohamed
Senior Full-Stack Engineer

What Changed Today

JetBrains released Mellum2 — a 12B parameter Mixture-of-Experts (MoE) model specialized for software engineering tasks. It's open-source (Apache 2.0) and designed to be a fast, local "focal model" inside larger AI pipelines, not a replacement for frontier models like GPT-4 or Claude.

Key specs:

  • 12B total params, 2.5B active per token — runs on a single GPU or even CPU with quantization
  • MoE architecture: 64 experts, 8 activated per token
  • 128K context window — handles large codebases
  • Multi-Token Prediction (MTP) head for speculative decoding speedups
  • Trained on ~10.6T tokens with a three-phase curriculum shifting from web → code → math
  • Six checkpoints released covering the full training run

Source: MarkTechPost, The New Stack


Why This Matters for Developers and Business Owners

1. Local AI Becomes Actually Usable

Previous local coding models were either too small (2B-4B) to be useful, or too large (70B+) to run without expensive hardware. Mellum2 sits in the sweet spot: 12B total, but only 2.5B active per token due to MoE routing. That means:

  • Run on a single RTX 3090/4090 (24GB VRAM)
  • Or even CPU with 4-bit quantization for lighter tasks
  • No API costs, no data leaving your machine

For teams handling proprietary code or working under compliance constraints, this is a game-changer.

2. It's a "Focal Model" — Not Trying to Be Everything

JetBrains is honest about positioning: Mellum2 is a specialized component, not a general-purpose replacement. Use it for:

  • Code completion and inline suggestions
  • Debugging assistance
  • Refactoring and code review
  • Function calling and tool use

Pair it with a frontier model for architecture decisions, and Mellum2 for the 80% of daily coding tasks.

3. Cost Math for Business Owners

Approach Monthly Cost (10 devs) Data Control
GitHub Copilot Pro $190/mo Microsoft-hosted
Claude Code API $500-2000/mo (variable) Anthropic-hosted
Mellum2 (local) $0 (hardware amortized) Fully local
Mellum2 + small cloud GPU ~$50-100/mo You control the box

For bootstrapped teams or agencies with tight margins, local models cut a real monthly expense.


How to Use It

Quick Start with Ollama

# Pull the model (when available on Ollama Hub — expected within days)
ollama pull mellum2:12b

# Or run from HuggingFace with llama.cpp
# Download weights from https://huggingface.co/jetbrains

# Start the server
ollama run mellum2:12b

Integration with JetBrains IDEs

JetBrains will likely integrate Mellum2 into AI Assistant as a local model option. Until then:

# Configure custom model in JetBrains AI Assistant
# Settings → AI Assistant → Custom Model → http://localhost:11434/v1/chat/completions

VS Code / Continue.dev Setup

// .continue/config.json
{
  "models": [
    {
      "title": "Mellum2 Local",
      "provider": "ollama",
      "model": "mellum2:12b",
      "apiBase": "http://localhost:11434"
    }
  ]
}

API Example (Python)

import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "mellum2:12b",
    "prompt": "Refactor this Python function to use list comprehensions:\n\ndef get_even(nums):\n    result = []\n    for n in nums:\n        if n % 2 == 0:\n            result.append(n)\n    return result",
    "stream": False
})

print(response.json()["response"])

Production Notes / Gotchas

  1. MoE models need careful quantization — Standard 4-bit quantization may degrade expert routing. Use Q6_K or Q8_0 for critical work.
  2. Context window is 128K but memory scales with it — Long files eat VRAM. Split large files into chunks for review tasks.
  3. Speculative decoding with MTP head — The built-in draft model can 2x speed, but requires compatible inference engine (llama.cpp dev branch or vLLM).
  4. Not multimodal — No image input. Stick to text/code tasks only.
  5. Apache 2.0 = commercial use OK — No attribution required beyond license file, but check your legal team if redistributing.

Bottom Line

Mellum2 is the first practical, open-source coding model that doesn't require enterprise hardware or enterprise budgets. For teams already using JetBrains tools, it's a natural upgrade path. For everyone else, it's proof that the "local AI" promise is finally becoming real.

Try it if: You pay for Copilot/Claude Code and wonder if there's a cheaper way.
Skip it if: You need frontier-level reasoning for architecture or complex debugging — pair it with GPT-4 instead.


Published on Build With Abdallah
Questions? Email: buildwithabdallah@gmail.com