How to Use Claude Code with Open Source Models (Completely Free)



Claude Code with Ollama - run open source models locally

Claude Code is Anthropic’s agentic coding assistant. It reads your files, edits code, and runs terminal commands, all through natural language. And it can do much more than coding. The catch: it requires at least a Pro subscription ($20/month), and power users who hit rate limits need the Max plan ($100-200/month).

But here’s the thing: Claude Code can connect to any model that supports the Anthropic Messages API. By pointing it at Ollama (a local LLM runner), you get the same tool use, file editing, and agentic workflows completely free.

TL;DR: Install Ollama, pull qwen3-coder:30b, create a model with 64K context, set three environment variables, run claude --model qwen3-coder-64k. That’s it.
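Here’s that whole flow as one command sequence (each step is explained in detail below; macOS commands shown, the Linux equivalents appear later):

# Install and start Ollama (macOS; Linux uses the install script shown below)
brew install ollama
brew services start ollama

# Pull the model and create a 64K-context variant
ollama pull qwen3-coder:30b
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536" > Qwen3Modelfile64k
ollama create qwen3-coder-64k -f Qwen3Modelfile64k

# Point Claude Code at Ollama and launch
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude --model qwen3-coder-64k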

Tested on Apple M4 Max with qwen3-coder:30b. All experiments run in January 2026 with Ollama v0.14.x.

What Are Claude Code’s Tools?

Think of most AI assistants as advisors. They tell you what to do, and you have to do it yourself. Claude Code is different. It’s more like a pair programmer who can actually touch your keyboard.

For example, when you ask Claude Code to fix a bug, it doesn’t just explain the fix and list the steps. It actually:

  • Searches the files (Glob/Grep tools)
  • Reads your file (Read tool)
  • Edits the code (Edit tool)
  • Runs your tests (Bash tool)

This “agentic” workflow, where the AI takes actions rather than just giving advice, is what makes Claude Code powerful. And it’s exactly what we’re going to replicate with open source models.


Prerequisites

Hardware Requirements

Running models locally requires decent hardware. Here’s what you need:

| Model | File Size | RAM Required | GPU VRAM (Q4, 32K ctx) |
|---|---|---|---|
| qwen2.5-coder:7b | 4.7 GB | 16 GB | 6-7 GB |
| qwen2.5-coder:14b | 9 GB | 16-32 GB | 10-12 GB |
| qwen2.5-coder:32b | 20 GB | 32 GB | 20-22 GB |
| qwen3-coder:30b | 19 GB | 32 GB | 18-20 GB |

Q4 quantized means the model is compressed to use less memory, with minimal quality loss.

Context window adds memory overhead (based on Llama3.1:8B):

| Context Size | Total VRAM | Additional Memory |
|---|---|---|
| 4K (default) | 5.4 GB | Base |
| 16K | 7.7 GB | +2.3 GB |
| 32K | 10.8 GB | +5.4 GB |
| 64K | 17.3 GB | +11.9 GB |

Larger models will have proportionally higher overhead.
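If you want to see what a given model plus its context window actually costs on your machine, load it once and check the reported footprint. A rough sanity check, not exact accounting; llama3.1:8b here is just the model from the table above:

# Load the model with a one-off prompt so it becomes resident
ollama run llama3.1:8b "hello" > /dev/null

# Show loaded models and their total memory footprint
ollama ps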

My setup: Apple M4 Max with 128GB unified memory, 40 GPU cores.

Claude Code Installed

# Native installation (recommended)
curl -fsSL https://claude.ai/install.sh | bash

# Or via npm (requires Node.js 18+)
npm install -g @anthropic-ai/claude-code

# For macOS
brew install --cask claude-code

# Verify installation
claude --version

Setting Up Ollama

Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version
# Should show v0.14.0 or later (required for Anthropic API support)

Start the Ollama service if you installed via brew:

brew services start ollama

Pull the Model

ollama pull qwen3-coder:30b

Verify that the model is downloaded and available locally:

ollama ls
# This should return something like below
# NAME                         ID              SIZE      MODIFIED
# qwen3-coder:30b              06c1097efce0    18 GB     2 seconds ago

Why qwen3-coder?

Not all models work with Claude Code’s tool system. qwen3-coder (Alibaba’s coding-focused model) does. I tested several models; my findings are below.

The problem with qwen2.5-coder: it understands the tool format and generates correct JSON, but Claude Code doesn’t recognize it as an actual tool invocation. The model outputs something like {"name": "Read", "arguments": {"file_path": "/path/to/file"}} as plain text instead of actually reading the file. Even with 64K context, it never executes tools.

qwen3-coder properly triggers Claude Code’s tool execution layer.

| Model | Tool Calling | Result |
|---|---|---|
| qwen3-coder | Works | Tools execute properly |
| qwen2.5-coder:32b | Partial | Outputs JSON but tools don’t execute |
| deepseek-coder-v2 | Failed | Does not support tool calling |
| codestral | Failed | Does not support tool calling |

Requirements for any model:

  • Tool/function calling support
  • Minimum 64K context window (via num_ctx parameter)
  • Ollama v0.14.0+ installed

Note: The performance observations in this article are based on testing with qwen3-coder:30b specifically. Other open source models may perform differently.

Other models to try: glm-4.7 and gpt-oss (20b/120b) also support tool calling and work with Claude Code. If you’re already running these, the same setup steps apply.


Context Window: The Critical Setting

Why Context Matters

What happens if the context window is too small?

Think of the context window as the model’s working memory. It’s how much text (code, conversation history, tool definitions, and responses) the model can see at once.

With a small context window (4K tokens):

  • The model forgets earlier parts of the conversation
  • Tool definitions get truncated or lost entirely
  • Claude Code’s system prompt (which explains how to use tools) may not fit
  • The model outputs tool calls as plain text instead of executing them

With a larger context window (64K+ tokens):

  • Larger conversation history is retained
  • Tool definitions remain visible to the model
  • The model understands it should execute tools, not just describe them
  • Tools actually run, enabling proper agentic behavior

This is why increasing context is non-negotiable. Without it, you don’t have an agent; instead, you have an autocomplete that talks about tools.

But bigger isn’t always better. Very large context windows (128K+) can actually degrade performance:

  • Lost in the middle problem - Models tend to focus on the beginning and end of the context, struggling to recall information buried in the middle. The longer the context, the worse this gets.
  • Slower inference - More tokens means more computation. A 128K context will be noticeably slower than 64K.
  • Diluted attention - The model’s attention mechanism spreads thinner across more tokens, potentially missing important details.
  • Diminishing returns - For most coding tasks, you won’t use anywhere near 128K tokens. You’re paying the memory and speed cost for capacity you don’t need.

So what’s the sweet spot? Rather than just claiming a number, let’s experiment and find out.

The Experiment

Rather than guessing, I tested different context sizes with a real agentic workload to find where things break.

The Setup

Model: qwen3-coder:30b

Test project: 5 interconnected Python files (models.py, database.py, services.py, routes.py, utils.py) totaling ~2,400 lines. Each file contained specific “memorable details” (constants, thresholds, magic values) to test recall.

Prompts tested (in sequence, same session):

1. Read all the Python files in this directory and explain how they work together

2. What is the ASCENSION_THRESHOLD value in services.py and what does it do?

3. What is the TOMBSTONE_RETENTION_DAYS value in database.py?

4. How does the OBSIDIAN loyalty tier in models.py connect to the tier
   calculation in services.py?

5. What is the MAGIC_SALT in utils.py and where else is it referenced?

6. Now read the buggy_shopping_cart.py file in the parent directory and
   compare its discount logic to the Coupon class in models.py

7. Write a new function in services.py that combines the ASCENSION_THRESHOLD
   logic with the FOUNDERS50 coupon from database.py. Explain how they
   would interact.

8. Looking at all the files you've read, list every constant that contains
   a number (like BURST_LIMIT=100, POOL_LIMIT=20, etc.) and explain what
   each does.

9. What is the exact admin token value in routes.py and what endpoints
   require it?

10. Trace the complete flow: A customer with OBSIDIAN tier places an order
    over $500 using FOUNDERS50 coupon. Walk through models.py -> services.py
    -> routes.py showing exact function calls.

What these prompts test:

  • Prompts 1-5: Basic multi-file reading and recall
  • Prompt 6: Adding another file mid session
  • Prompt 7: Code generation using cross file context
  • Prompt 8: Exhaustive recall across ALL files simultaneously
  • Prompt 9: Specific detail retrieval (testing for “lost in the middle”)
  • Prompt 10: Complex multi-file reasoning and flow tracing

Test 1: 16K Context

Result: ❌ Complete failure

Prompt: “Read all the Python files in this directory and explain how they work together”

I ran this test twice, and it failed differently each time: the failures were inconsistent, but the test consistently failed.

Run 1:

  1. Model found 5 Python files ✅
  2. Read models.py (367 lines) ✅
  3. Re-read the same file again (loop) ❌
  4. Never proceeded to other 4 files ❌
  5. Prematurely concluded with “Is there a specific aspect you’d like me to examine further?”

Run 2 (worse):

| Step | Action |
|---|---|
| 1 | Found 5 Python files |
| 2 | Read models.py (367 lines) |
| 3 | Created Task #1 |
| 4 | Hallucinated src/data/models.ts (doesn’t exist) |
| 5 | Got confused, used find and ls to explore |
| 6 | Re-read models.py (2nd time) |
| 7 | Created Task #2 (duplicate) |
| 8 | Re-read models.py (3rd time) |
| 9 | Created Task #3 (another duplicate) |
| 10 | Analyzed only models.py |
| 11 | Marked all 3 duplicate tasks as complete |
| 12 | Never read the other 4 files |
| 13 | Hallucinated being in “plan mode” and asked to exit |

Files read: 1 of 5 (models.py only - read 3 times in Run 2)

Files never touched: database.py, services.py, routes.py, utils.py

The model completely forgot its original goal (“read all Python files”), created duplicate tasks, hallucinated file paths and modes, and declared success after analyzing only 1 file.

Conclusion: 16K is completely insufficient for multi-file agentic work.

Test 2: 32K Context

Result: ❌ Failed - stuck in read loops

Multi-file prompt: “Read all the Python files in this directory and explain how they work together”

The model got stuck reading the same files repeatedly:

| File | Times Read |
|---|---|
| models.py | 5x |
| database.py | 5x |
| services.py | 4x |
| routes.py | 1x |
| utils.py | 0x (never read) |

The model kept cycling: models.py -> database.py -> services.py -> models.py -> ... for 14+ minutes without completing.

2-file prompt: “Read models.py and services.py and explain what it is”

Even a simple 2-file task failed:

  1. Read models.py ✅
  2. Read services.py ✅
  3. Read database.py (not requested - scope creep) ❌
  4. Re-read models.py (loop starting) ❌
  5. Stuck “Cogitating” for 3+ minutes ❌

Single-file tasks work fine at 32K. When asked to read just one file, the model performed correctly. The issue is specifically with multi-file coordination.

Conclusion: 32K is insufficient for multi-file agentic work.

Test 3: 64K Context

Result: ✅ Completed all prompts (with quality caveats)

| Prompt | Type | Result |
|---|---|---|
| 1-5 | Multi-file reading & basic recall | ✅ Pass with minor inaccuracies |
| 6 | Cross-file comparison | ⚠️ Mixed up comment with class field, missed 2/3 bugs |
| 7 | Code generation | ❌ Wrong class placement, incorrect variable reference |
| 8 | Exhaustive recall | ❌ Hallucinated constant names, missed items |
| 9 | Specific detail retrieval | ✅ Pass with minor misattribution |
| 10 | Complex flow tracing | ⚠️ Conflated concepts, missed bug, incomplete |

64K solves the looping problem - the model completed all 10 prompts without getting stuck. However, quality varies:

  • ✅ Strong: Multi-file reading, basic recall, specific detail lookup
  • ⚠️ Mixed: Cross-file comparison, complex flow tracing
  • ❌ Weak: Code generation, exhaustive listing (hallucinations)

Conclusion: 64K worked for multi-file agentic work, but expect variable quality on complex tasks. Ollama recommends a 64K context window, and our conclusion aligns with that.

Based on these tests, 64K is the minimum for Claude Code with qwen3-coder doing multi-file work. Other models may have different requirements.

How to Set 64K Context

By default, Ollama uses a context window of 4096 tokens, which is not sufficient for Claude Code’s multi-step coding workflow.

To increase the context window, we’ll use a Modelfile.

A Modelfile is like a Dockerfile for LLMs. It lets you customize a base model with your own parameters.

Create a Modelfile to increase context:

# Create Modelfile with 64K context (recommended)
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536" > Qwen3Modelfile64k

# Create the custom model
ollama create qwen3-coder-64k -f Qwen3Modelfile64k

The num_ctx parameter (context window size) determines how much text the model can see at once.
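The same pattern works if you want to try one of the other tool-calling models mentioned earlier. For example, a sketch for gpt-oss (double-check the exact tag on the Ollama library page):

# Example: a 64K-context variant of gpt-oss 20B
ollama pull gpt-oss:20b
echo "FROM gpt-oss:20b
PARAMETER num_ctx 65536" > GptOssModelfile64k
ollama create gpt-oss-64k -f GptOssModelfile64k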

How to verify your actual context size:

Run ollama show qwen3-coder-64k and check num_ctx under the Parameters section. It should match the num_ctx value you set in the Modelfile. If the parameter is missing or shows a different value, the configuration was not applied to the model.

You can also check the context size the model is actually using while it’s running:

ollama ps

If the CONTEXT column shows 4096, or any number different from the num_ctx you defined, your Modelfile wasn’t applied correctly.
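Both checks in one place (the grep is just a convenience; the exact output layout may vary between Ollama versions):

# Check the parameter baked into the custom model
ollama show qwen3-coder-64k | grep num_ctx
# Expected: num_ctx    65536

# Check what a loaded model is actually using (CONTEXT column)
ollama ps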


Connecting Claude Code to Ollama

Environment Variables

These three variables tell Claude Code where to find Ollama and how to authenticate. The empty API key works because Ollama runs on your machine and doesn’t require authentication.

Add these to your ~/.bashrc or ~/.zshrc if you want them to apply across all terminal sessions, or simply export them in your current session.

export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""

If you added them to ~/.bashrc or ~/.zshrc, reload your shell:

source ~/.zshrc  # or ~/.bashrc
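Before launching Claude Code, you can optionally sanity-check the endpoint with a raw request. This sketch assumes Ollama’s Anthropic-compatible /v1/messages route (see the Ollama Anthropic compatibility reference below); if your version behaves differently, skip this step.

curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3-coder-64k",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'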

Run It

claude --model qwen3-coder-64k
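If you’d rather not touch your shell config at all, the same variables can be scoped to a single invocation:

ANTHROPIC_BASE_URL="http://localhost:11434" \
ANTHROPIC_AUTH_TOKEN="ollama" \
ANTHROPIC_API_KEY="" \
claude --model qwen3-coder-64k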

Demo: Shopping Cart Bug Fix

Let’s prove this actually works. Here’s a buggy Python shopping cart; Claude Code, running on the open source model, will find and fix the bugs using its tools. You can find the complete demo code on GitHub.

The Setup

The buggy code has 3 hidden bugs:

class ShoppingCart:
    def __init__(self):
        self.items = []
        self.discount_code = None

    def add_item(self, name, price, quantity):
        # Bug 1: Doesn't check if item exists, creates duplicates
        item = {"name": name, "price": price, "quantity": quantity}
        self.items.append(item)

    def remove_item(self, name):
        # Bug 2: Modifying list while iterating skips items
        for item in self.items:
            if item["name"] == name:
                self.items.remove(item)

    def apply_discount(self, code):
        # Bug 3: Silently accepts invalid codes
        self.discount_code = code

    def get_discount_amount(self):
        subtotal = self.get_subtotal()
        if self.discount_code == "SAVE10":
            return subtotal * 0.10
        elif self.discount_code == "SAVE20":
            return subtotal * 0.20
        return 0  # Invalid codes return 0 and user thinks it worked

    # ... rest of the class

Note: I removed these comments before testing; the model had to find bugs on its own.

The Prompt

Find and fix bugs in buggy_shopping_cart.py

What Actually Happened

The model used tools correctly:

  1. Read tool to examine the file
  2. Bash tool to verify file contents
  3. Write tool to create fixed versions
  4. Bash tool to run and test the fixes

But here’s where it got interesting - instead of editing the original file, it created three separate files:

  • buggy_shopping_cart_fixed.py
  • buggy_shopping_cart_fixed_simple.py
  • buggy_shopping_cart_fixed_comprehensive.py

Bugs Found

| Bug | Simple Version | Comprehensive Version |
|---|---|---|
| List mutation in remove_item() | ✅ Fixed | ✅ Fixed |
| Invalid discount codes | ❌ Missed | ✅ Fixed |
| Duplicate items in add_item() | ❌ Missed | ❌ Missed |

Result: 2 of 3 bugs found (in the comprehensive version).

The model caught the classic list-while-iterating antipattern and added discount code validation. It missed the duplicate item bug, which is arguably more of a design decision than an obvious bug.

The key takeaway: tools actually executed. The model didn’t just output JSON describing what it would do—instead, it read files, created new files, and tested the files to verify the fixes worked.

However, notice the behavioral differences from Claude’s models:

  • Overengineering: Created 3 files when asked to just “fix bugs”
  • Wrote new files instead of editing: Used Write tool instead of Edit tool
  • Missed subtle bugs: Found obvious patterns, missed business logic issues

Note on variability: I ran the same prompt three times and got different results each time:

| Run | Bugs Fixed | Approach | Time |
|---|---|---|---|
| 1 | 2/3 | Created 3 new files | ~1m |
| 2 | 2/3 | Edited in place | ~1.5m |
| 3 | 1/3 | Edited + wrote tests, got stuck debugging | ~6m |

Run 3 showed scope creep: the model decided to write a 92-line test file when I only asked it to fix bugs, then spent most of its time debugging floating point assertion failures in its own tests.

Why does this happen? LLMs are probabilistic. Each token is sampled from a probability distribution, not deterministically chosen. Without a fixed seed, the same prompt can lead to different reasoning paths. You can reduce this by setting temperature 0 and a fixed seed in your Modelfile, but most users won’t.
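If you want more repeatable runs, those parameters can be baked into another Modelfile variant. A minimal sketch (42 is an arbitrary seed; temperature 0 trades creativity for consistency):

# Deterministic-ish variant: greedy decoding plus a fixed seed
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536
PARAMETER temperature 0
PARAMETER seed 42" > Qwen3Modelfile64kDet

ollama create qwen3-coder-64k-det -f Qwen3Modelfile64kDet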


Troubleshooting

| Problem | Solution |
|---|---|
| Model not found | Use full name with tag: qwen3-coder:30b not qwen3-coder. Check ollama list |
| Tools output as JSON instead of executing | 1. Increase context to 64K+. 2. Use qwen3-coder (not qwen2.5). 3. Update Ollama to v0.14.0+ |
| Slow response | Hardware dependent. Larger models need more resources |
| Context too long | Create a model with a larger num_ctx parameter |
| Connection refused | Make sure ollama serve is running |
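For the connection-refused case, a quick way to confirm whether Ollama is listening (the root endpoint should simply reply that Ollama is running):

# Fails with "connection refused" if Ollama isn't up
curl http://localhost:11434

# Start it in the foreground...
ollama serve

# ...or as a background service on macOS (Homebrew)
brew services start ollama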

Limitations (Honest Take)

What Works Well

  • Tool use - File reading, editing, and bash commands work reliably
  • Multi-turn conversations - Context is maintained
  • Code generation - Quality is good for most tasks
  • No rate limits - Work as long as you want

What’s Different from Claude API

  • Reasoning quality - qwen3-coder:30b isn’t quite at Claude Sonnet/Opus level for complex tasks
  • Speed - Depends on hardware; can be slower than API
  • Context handling - Some models struggle with very long contexts
  • Edge cases - May need more explicit prompting
  • Consistency - Same prompt can produce different approaches, scope creep, or get stuck in loops. LLMs are probabilistic, and qwen3-coder showed more variance than Claude’s models in my testing

My Take

For day-to-day coding tasks (writing functions, debugging, refactoring), qwen3-coder:30b via Ollama works great. For complex architectural decisions or tricky debugging that requires deep reasoning, you might still want to use Claude models. The good news: you can switch between them easily by changing the model parameter.


Conclusion

Go with Ollama + qwen3-coder with 64K context. Our testing showed 64K is the minimum required for multi-file agentic work. 16K and 32K both fail with looping and hallucinations.

What you get:

  • Completely free
  • Private code never leaves your machine
  • Offline capable
  • No rate limits

Expect variable quality on complex tasks (code generation, exhaustive recall), but basic multi-file reading and detail lookup work well. The tool use actually works.

No local hardware? OpenRouter offers cloud access to open source models with a free tier, but that’s a topic for another post.

Try it out and let me know how it goes.


References & Further Reading

  1. Demo Code Repository
  2. Ollama Library
  3. Qwen Docs: Speed Benchmark
  4. Unsloth: Qwen3-Coder Local Setup
  5. HuggingFace GGUF Models
  6. Ollama Context Memory Usage
  7. Ollama Anthropic Compatibility
  8. Ollama Context Length
  9. Lost in the Middle: How Language Models Use Long Contexts
  10. Databricks: Long Context RAG Performance
  11. Scale AI: Long Context Instruction Following
  12. Pinecone: Why Use Retrieval Instead of Larger Context
