How to Use Claude Code with Open Source Models (Completely Free)



Claude Code with Ollama - run open source models locally

Claude Code is Anthropic’s agentic coding assistant. It reads your files, edits code, and runs terminal commands, all through natural language. And it can do much more than coding. The catch: it requires at least a Pro subscription ($20/month), and power users who hit rate limits need the Max plan ($100-200/month).

But here’s the thing: Claude Code can connect to any model that supports the Anthropic Messages API. By pointing it at Ollama (a local LLM runner), you get the same tool use, file editing, and agentic workflows completely free.

TL;DR: Install Ollama, pull qwen3-coder:30b, create a model with 64K context, set three environment variables, run claude --model qwen3-coder-64k. That’s it.
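Here’s that whole flow as one command sequence (each step is explained in detail below; macOS commands shown, the Linux equivalents appear later):

# Install and start Ollama (macOS; Linux uses the install script shown below)
brew install ollama
brew services start ollama

# Pull the model and create a 64K-context variant
ollama pull qwen3-coder:30b
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536" > Qwen3Modelfile64k
ollama create qwen3-coder-64k -f Qwen3Modelfile64k

# Point Claude Code at Ollama and launch
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude --model qwen3-coder-64k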

Tested on Apple M4 Max with qwen3-coder:30b. All experiments run in January 2026 with Ollama v0.14.x.

What Are Claude Code’s Tools?

Think of most AI assistants as advisors. They tell you what to do, and you have to do it yourself. Claude Code is different. It’s more like a pair programmer who can actually touch your keyboard.

For example, when you ask Claude Code to fix a bug, it doesn’t just explain the fix and list the steps. It actually:

  • Searches the files (Glob/Grep tools)
  • Reads your file (Read tool)
  • Edits the code (Edit tool)
  • Runs your tests (Bash tool)

This “agentic” workflow, where the AI takes actions rather than just giving advice, is what makes Claude Code powerful. And it’s exactly what we’re going to replicate with open source models.


Prerequisites

Hardware Requirements

Running models locally requires decent hardware. Here’s what you need:

| Model | File Size | RAM Required | GPU VRAM (Q4, 32K ctx) |
|---|---|---|---|
| qwen2.5-coder:7b | 4.7 GB | 16 GB | 6-7 GB |
| qwen2.5-coder:14b | 9 GB | 16-32 GB | 10-12 GB |
| qwen2.5-coder:32b | 20 GB | 32 GB | 20-22 GB |
| qwen3-coder:30b | 19 GB | 32 GB | 18-20 GB |

Q4 quantized means the model is compressed to use less memory, with minimal quality loss.

Context window adds memory overhead (based on Llama3.1:8B):

| Context Size | Total VRAM | Additional Memory |
|---|---|---|
| 4K (default) | 5.4 GB | Base |
| 16K | 7.7 GB | +2.3 GB |
| 32K | 10.8 GB | +5.4 GB |
| 64K | 17.3 GB | +11.9 GB |

Larger models will have proportionally higher overhead.
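If you want to see what a given model plus its context window actually costs on your machine, load it once and check the reported footprint. A rough sanity check, not exact accounting; llama3.1:8b here is just the model from the table above:

# Load the model with a one-off prompt so it becomes resident
ollama run llama3.1:8b "hello" > /dev/null

# Show loaded models and their total memory footprint
ollama ps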

My setup: Apple M4 Max with 128GB unified memory, 40 GPU cores.

Claude Code Installed

# Native installation (recommended)
curl -fsSL https://claude.ai/install.sh | bash

# Or via npm (requires Node.js 18+)
npm install -g @anthropic-ai/claude-code

# For macOS
brew install --cask claude-code

# Verify installation
claude --version

Setting Up Ollama

Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version
# Should show v0.14.0 or later (required for Anthropic API support)

Start the Ollama service if you installed via brew:

brew services start ollama

Pull the Model

ollama pull qwen3-coder:30b

Verify that the model is downloaded and available locally:

ollama ls
# This should return something like below
# NAME                         ID              SIZE      MODIFIED
# qwen3-coder:30b              06c1097efce0    18 GB     2 seconds ago

Why qwen3-coder?

Not all models work with Claude Code’s tool system. qwen3-coder (Alibaba’s coding-focused model) does. I tested several models; my findings are below.

The problem with qwen2.5-coder: it understands the tool format and generates correct JSON, but Claude Code doesn’t recognize it as an actual tool invocation. The model outputs something like {"name": "Read", "arguments": {"file_path": "/path/to/file"}} as plain text instead of actually reading the file. Even with 64K context, it never executes tools.

qwen3-coder properly triggers Claude Code’s tool execution layer.

| Model | Tool Calling | Result |
|---|---|---|
| qwen3-coder | Works | Tools execute properly |
| qwen2.5-coder:32b | Partial | Outputs JSON but tools don’t execute |
| deepseek-coder-v2 | Failed | Does not support tool calling |
| codestral | Failed | Does not support tool calling |

Requirements for any model:

  • Tool/function calling support
  • Minimum 64K context window (via num_ctx parameter)
  • Ollama v0.14.0+ installed

Note: The performance observations in this article are based on testing with qwen3-coder:30b specifically. Other open source models may perform differently.

Other models to try: glm-4.7 and gpt-oss (20b/120b) also support tool calling and work with Claude Code. If you’re already running these, the same setup steps apply.


Context Window: The Critical Setting

Why Context Matters

What happens if the context window is too small?

Think of the context window as the model’s working memory. It’s how much text (code, conversation history, tool definitions, and responses) the model can see at once.

With a small context window (4K tokens):

  • The model forgets earlier parts of the conversation
  • Tool definitions get truncated or lost entirely
  • Claude Code’s system prompt (which explains how to use tools) may not fit
  • The model outputs tool calls as plain text instead of executing them

With a larger context window (64K+ tokens):

  • Larger conversation history is retained
  • Tool definitions remain visible to the model
  • The model understands it should execute tools, not just describe them
  • Tools actually run, enabling proper agentic behavior

This is why increasing context is non-negotiable. Without it, you don’t have an agent; instead, you have an autocomplete that talks about tools.

But bigger isn’t always better. Very large context windows (128K+) can actually degrade performance:

  • Lost in the middle problem - Models tend to focus on the beginning and end of the context, struggling to recall information buried in the middle. The longer the context, the worse this gets.
  • Slower inference - More tokens means more computation. A 128K context will be noticeably slower than 64K.
  • Diluted attention - The model’s attention mechanism spreads thinner across more tokens, potentially missing important details.
  • Diminishing returns - For most coding tasks, you won’t use anywhere near 128K tokens. You’re paying the memory and speed cost for capacity you don’t need.

So what’s the sweet spot? Rather than just claiming a number, let’s experiment and find out.

The Experiment

Rather than guessing, I tested different context sizes with a real agentic workload to find where things break.

The Setup

Model: qwen3-coder:30b

Test project: 5 interconnected Python files (models.py, database.py, services.py, routes.py, utils.py) totaling ~2,400 lines. Each file contained specific “memorable details” (constants, thresholds, magic values) to test recall.

Prompts tested (in sequence, same session):

1. Read all the Python files in this directory and explain how they work together

2. What is the ASCENSION_THRESHOLD value in services.py and what does it do?

3. What is the TOMBSTONE_RETENTION_DAYS value in database.py?

4. How does the OBSIDIAN loyalty tier in models.py connect to the tier
   calculation in services.py?

5. What is the MAGIC_SALT in utils.py and where else is it referenced?

6. Now read the buggy_shopping_cart.py file in the parent directory and
   compare its discount logic to the Coupon class in models.py

7. Write a new function in services.py that combines the ASCENSION_THRESHOLD
   logic with the FOUNDERS50 coupon from database.py. Explain how they
   would interact.

8. Looking at all the files you've read, list every constant that contains
   a number (like BURST_LIMIT=100, POOL_LIMIT=20, etc.) and explain what
   each does.

9. What is the exact admin token value in routes.py and what endpoints
   require it?

10. Trace the complete flow: A customer with OBSIDIAN tier places an order
    over $500 using FOUNDERS50 coupon. Walk through models.py -> services.py
    -> routes.py showing exact function calls.

What these prompts test:

  • Prompts 1-5: Basic multi-file reading and recall
  • Prompt 6: Adding another file mid session
  • Prompt 7: Code generation using cross file context
  • Prompt 8: Exhaustive recall across ALL files simultaneously
  • Prompt 9: Specific detail retrieval (testing for “lost in the middle”)
  • Prompt 10: Complex multi-file reasoning and flow tracing

Test 1: 16K Context

Result: ❌ Complete failure

Prompt: “Read all the Python files in this directory and explain how they work together”

I ran this test twice, and it failed differently each time: the failures were inconsistent, but the test consistently failed.

Run 1:

  1. Model found 5 Python files ✅
  2. Read models.py (367 lines) ✅
  3. Re-read the same file again (loop) ❌
  4. Never proceeded to other 4 files ❌
  5. Prematurely concluded with “Is there a specific aspect you’d like me to examine further?”

Run 2 (worse):

| Step | Action |
|---|---|
| 1 | Found 5 Python files |
| 2 | Read models.py (367 lines) |
| 3 | Created Task #1 |
| 4 | Hallucinated src/data/models.ts (doesn’t exist) |
| 5 | Got confused, used find and ls to explore |
| 6 | Re-read models.py (2nd time) |
| 7 | Created Task #2 (duplicate) |
| 8 | Re-read models.py (3rd time) |
| 9 | Created Task #3 (another duplicate) |
| 10 | Analyzed only models.py |
| 11 | Marked all 3 duplicate tasks as complete |
| 12 | Never read the other 4 files |
| 13 | Hallucinated being in “plan mode” and asked to exit |

Files read: 1 of 5 (models.py only - read 3 times in Run 2)

Files never touched: database.py, services.py, routes.py, utils.py

The model completely forgot its original goal (“read all Python files”), created duplicate tasks, hallucinated file paths and modes, and declared success after analyzing only 1 file.

Conclusion: 16K is completely insufficient for multi-file agentic work.

Test 2: 32K Context

Result: ❌ Failed - stuck in read loops

Multi-file prompt: “Read all the Python files in this directory and explain how they work together”

The model got stuck reading the same files repeatedly:

| File | Times Read |
|---|---|
| models.py | 5x |
| database.py | 5x |
| services.py | 4x |
| routes.py | 1x |
| utils.py | 0x (never read) |

The model kept cycling: models.py -> database.py -> services.py -> models.py -> ... for 14+ minutes without completing.

2-file prompt: “Read models.py and services.py and explain what it is”

Even a simple 2-file task failed:

  1. Read models.py ✅
  2. Read services.py ✅
  3. Read database.py (not requested - scope creep) ❌
  4. Re-read models.py (loop starting) ❌
  5. Stuck “Cogitating” for 3+ minutes ❌

Single-file tasks work fine at 32K. When asked to read just one file, the model performed correctly. The issue is specifically with multi-file coordination.

Conclusion: 32K is insufficient for multi-file agentic work.

Test 3: 64K Context

Result: ✅ Completed all prompts (with quality caveats)

| Prompt | Type | Result |
|---|---|---|
| 1-5 | Multi-file reading & basic recall | ✅ Pass with minor inaccuracies |
| 6 | Cross-file comparison | ⚠️ Mixed up comment with class field, missed 2/3 bugs |
| 7 | Code generation | ❌ Wrong class placement, incorrect variable reference |
| 8 | Exhaustive recall | ❌ Hallucinated constant names, missed items |
| 9 | Specific detail retrieval | ✅ Pass with minor misattribution |
| 10 | Complex flow tracing | ⚠️ Conflated concepts, missed bug, incomplete |

64K solves the looping problem - the model completed all 10 prompts without getting stuck. However, quality varies:

  • ✅ Strong: Multi-file reading, basic recall, specific detail lookup
  • ⚠️ Mixed: Cross-file comparison, complex flow tracing
  • ❌ Weak: Code generation, exhaustive listing (hallucinations)

Conclusion: 64K worked for multi-file agentic work, but expect variable quality on complex tasks. Ollama recommends a 64K context window, and our conclusion aligns with that.

Based on these tests, 64K is the minimum for Claude Code with qwen3-coder doing multi-file work. Other models may have different requirements.

How to Set 64K Context

By default, Ollama uses a context window of 4096 tokens, which is not sufficient for Claude Code’s multi-step coding workflow.

To increase the context window, we’ll use a Modelfile.

A Modelfile is like a Dockerfile for LLMs. It lets you customize a base model with your own parameters.

Create a Modelfile to increase context:

# Create Modelfile with 64K context (recommended)
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536" > Qwen3Modelfile64k

# Create the custom model
ollama create qwen3-coder-64k -f Qwen3Modelfile64k

The num_ctx parameter (context window size) determines how much text the model can see at once.
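The same pattern works if you want to try one of the other tool-calling models mentioned earlier. For example, a sketch for gpt-oss (double-check the exact tag on the Ollama library page):

# Example: a 64K-context variant of gpt-oss 20B
ollama pull gpt-oss:20b
echo "FROM gpt-oss:20b
PARAMETER num_ctx 65536" > GptOssModelfile64k
ollama create gpt-oss-64k -f GptOssModelfile64k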

How to verify your actual context size:

Run ollama show qwen3-coder-64k and check num_ctx under the Parameters section. It should match the num_ctx value you set in the Modelfile. If the parameter is missing or shows a different value, the configuration was not applied to the model.

You can also check the context size the model is actually using while it’s running:

ollama ps

If the CONTEXT column shows 4096, or any number different from the num_ctx you defined, your Modelfile wasn’t applied correctly.
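Both checks in one place (the grep is just a convenience; the exact output layout may vary between Ollama versions):

# Check the parameter baked into the custom model
ollama show qwen3-coder-64k | grep num_ctx
# Expected: num_ctx    65536

# Check what a loaded model is actually using (CONTEXT column)
ollama ps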


Connecting Claude Code to Ollama

Environment Variables

These three variables tell Claude Code where to find Ollama and how to authenticate. The empty API key works because Ollama runs on your machine and doesn’t require authentication.

Add these to your ~/.bashrc or ~/.zshrc if you want them to apply across all terminal sessions, or simply export them in your current session.

export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""

If you added them to ~/.bashrc or ~/.zshrc, reload your shell:

source ~/.zshrc  # or ~/.bashrc
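Before launching Claude Code, you can optionally sanity-check the endpoint with a raw request. This sketch assumes Ollama’s Anthropic-compatible /v1/messages route (see the Ollama Anthropic compatibility reference below); if your version behaves differently, skip this step.

curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3-coder-64k",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'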

Run It

claude --model qwen3-coder-64k
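If you’d rather not touch your shell config at all, the same variables can be scoped to a single invocation:

ANTHROPIC_BASE_URL="http://localhost:11434" \
ANTHROPIC_AUTH_TOKEN="ollama" \
ANTHROPIC_API_KEY="" \
claude --model qwen3-coder-64k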

Demo: Shopping Cart Bug Fix

Let’s prove this actually works. Here’s a buggy Python shopping cart; Claude Code, running on the open source model, will find and fix the bugs using its tools. You can find the complete demo code on GitHub.

The Setup

The buggy code has 3 hidden bugs:

class ShoppingCart:
    def __init__(self):
        self.items = []
        self.discount_code = None

    def add_item(self, name, price, quantity):
        # Bug 1: Doesn't check if item exists, creates duplicates
        item = {"name": name, "price": price, "quantity": quantity}
        self.items.append(item)

    def remove_item(self, name):
        # Bug 2: Modifying list while iterating skips items
        for item in self.items:
            if item["name"] == name:
                self.items.remove(item)

    def apply_discount(self, code):
        # Bug 3: Silently accepts invalid codes
        self.discount_code = code

    def get_discount_amount(self):
        subtotal = self.get_subtotal()
        if self.discount_code == "SAVE10":
            return subtotal * 0.10
        elif self.discount_code == "SAVE20":
            return subtotal * 0.20
        return 0  # Invalid codes return 0 and user thinks it worked

    # ... rest of the class

Note: I removed these comments before testing; the model had to find bugs on its own.

The Prompt

Find and fix bugs in buggy_shopping_cart.py

What Actually Happened

The model used tools correctly:

  1. Read tool to examine the file
  2. Bash tool to verify file contents
  3. Write tool to create fixed versions
  4. Bash tool to run and test the fixes

But here’s where it got interesting - instead of editing the original file, it created three separate files:

  • buggy_shopping_cart_fixed.py
  • buggy_shopping_cart_fixed_simple.py
  • buggy_shopping_cart_fixed_comprehensive.py

Bugs Found

| Bug | Simple Version | Comprehensive Version |
|---|---|---|
| List mutation in remove_item() | ✅ Fixed | ✅ Fixed |
| Invalid discount codes | ❌ Missed | ✅ Fixed |
| Duplicate items in add_item() | ❌ Missed | ❌ Missed |

Result: 2 of 3 bugs found (in the comprehensive version).

The model caught the classic list-while-iterating antipattern and added discount code validation. It missed the duplicate item bug, which is arguably more of a design decision than an obvious bug.

The key takeaway: tools actually executed. The model didn’t just output JSON describing what it would do—instead, it read files, created new files, and tested the files to verify the fixes worked.

However, notice the behavioral differences from Claude’s models:

  • Overengineering: Created 3 files when asked to just “fix bugs”
  • Wrote new files instead of editing: Used Write tool instead of Edit tool
  • Missed subtle bugs: Found obvious patterns, missed business logic issues

Note on variability: I ran the same prompt three times and got different results each time:

| Run | Bugs Fixed | Approach | Time |
|---|---|---|---|
| 1 | 2/3 | Created 3 new files | ~1m |
| 2 | 2/3 | Edited in place | ~1.5m |
| 3 | 1/3 | Edited + wrote tests, got stuck debugging | ~6m |

Run 3 showed scope creep: the model decided to write a 92-line test file when I only asked it to fix bugs, then spent most of its time debugging floating point assertion failures in its own tests.

Why does this happen? LLMs are probabilistic. Each token is sampled from a probability distribution, not deterministically chosen. Without a fixed seed, the same prompt can lead to different reasoning paths. You can reduce this by setting temperature 0 and a fixed seed in your Modelfile, but most users won’t.
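If you want more repeatable runs, those parameters can be baked into another Modelfile variant. A minimal sketch (42 is an arbitrary seed; temperature 0 trades creativity for consistency):

# Deterministic-ish variant: greedy decoding plus a fixed seed
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536
PARAMETER temperature 0
PARAMETER seed 42" > Qwen3Modelfile64kDet

ollama create qwen3-coder-64k-det -f Qwen3Modelfile64kDet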


Troubleshooting

| Problem | Solution |
|---|---|
| Model not found | Use full name with tag: qwen3-coder:30b not qwen3-coder. Check ollama list |
| Tools output as JSON instead of executing | 1. Increase context to 64K+. 2. Use qwen3-coder (not qwen2.5). 3. Update Ollama to v0.14.0+ |
| Slow response | Hardware dependent. Larger models need more resources |
| Context too long | Create a model with a larger num_ctx parameter |
| Connection refused | Make sure ollama serve is running |
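For the connection-refused case, a quick way to confirm whether Ollama is listening (the root endpoint should simply reply that Ollama is running):

# Fails with "connection refused" if Ollama isn't up
curl http://localhost:11434

# Start it in the foreground...
ollama serve

# ...or as a background service on macOS (Homebrew)
brew services start ollama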

Limitations (Honest Take)

What Works Well

  • Tool use - File reading, editing, and bash commands work reliably
  • Multi-turn conversations - Context is maintained
  • Code generation - Quality is good for most tasks
  • No rate limits - Work as long as you want

What’s Different from Claude API

  • Reasoning quality - qwen3-coder:30b isn’t quite at Claude Sonnet/Opus level for complex tasks
  • Speed - Depends on hardware; can be slower than API
  • Context handling - Some models struggle with very long contexts
  • Edge cases - May need more explicit prompting
  • Consistency - Same prompt can produce different approaches, scope creep, or get stuck in loops. LLMs are probabilistic, and qwen3-coder showed more variance than Claude’s models in my testing

My Take

For day-to-day coding tasks (writing functions, debugging, refactoring), qwen3-coder:30b via Ollama works great. For complex architectural decisions or tricky debugging that requires deep reasoning, you might still want to use Claude models. The good news: you can switch between them easily by changing the model parameter.


Conclusion

Go with Ollama + qwen3-coder with 64K context. Our testing showed 64K is the minimum required for multi-file agentic work. 16K and 32K both fail with looping and hallucinations.

What you get:

  • Completely free
  • Private code never leaves your machine
  • Offline capable
  • No rate limits

Expect variable quality on complex tasks (code generation, exhaustive recall), but basic multi-file reading and detail lookup work well. The tool use actually works.

No local hardware? OpenRouter offers cloud access to open source models with a free tier, but that’s a topic for another post.

Try it out and let me know how it goes.


References & Further Reading

  1. Demo Code Repository
  2. Ollama Library
  3. Qwen Docs: Speed Benchmark
  4. Unsloth: Qwen3-Coder Local Setup
  5. HuggingFace GGUF Models
  6. Ollama Context Memory Usage
  7. Ollama Anthropic Compatibility
  8. Ollama Context Length
  9. Lost in the Middle: How Language Models Use Long Contexts
  10. Databricks: Long Context RAG Performance
  11. Scale AI: Long Context Instruction Following
  12. Pinecone: Why Use Retrieval Instead of Larger Context
