Claude Code is Anthropic's agentic coding assistant. It reads your files, edits code, and runs terminal commands, all through natural language. It can do much more than coding, too. The catch is that it requires at least a Pro subscription ($20/month), and power users who hit rate limits need the Max plan ($100-200/month).
But here’s the thing: Claude Code can connect to any model that supports the Anthropic Messages API. By pointing it at Ollama (a local LLM runner), you get the same tool use, file editing, and agentic workflows completely free.
TL;DR: Install Ollama, pull qwen3-coder:30b, create a model with 64K context, set three environment variables, run claude --model qwen3-coder-64k. That’s it.
Tested on Apple M4 Max with qwen3-coder:30b. All experiments run in January 2026 with Ollama v0.14.x.
What Are Claude Code’s Tools?
Think of most AI assistants as advisors. They tell you what to do, and you have to do it yourself. Claude Code is different. It’s more like a pair programmer who can actually touch your keyboard.
For example, when you ask Claude Code to fix a bug, it doesn't just explain the fix and list the steps. It actually:
- Searches the files (Glob/Grep tools)
- Reads your file (Read tool)
- Edits the code (Edit tool)
- Runs your tests (Bash tool)
This “agentic” workflow, where the AI takes actions rather than just giving advice, is what makes Claude Code powerful. And it’s exactly what we’re going to replicate with open source models.
Prerequisites
Hardware Requirements
Running models locally requires decent hardware. Here’s what you need:
| Model | File Size | RAM Required | GPU VRAM (Q4, 32K ctx) |
|---|---|---|---|
| qwen2.5-coder:7b | 4.7 GB | 16GB | 6-7 GB |
| qwen2.5-coder:14b | 9 GB | 16-32GB | 10-12 GB |
| qwen2.5-coder:32b | 20 GB | 32GB | 20-22 GB |
| qwen3-coder:30b | 19 GB | 32GB | 18-20 GB |
Q4 quantized means the model's weights are compressed to roughly 4 bits each, trading a small amount of quality for a much smaller memory footprint.
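As a quick sanity check on those file sizes, the arithmetic is roughly parameters × bits per weight ÷ 8. A rough sketch (Q4 variants like Q4_K_M effectively use about 4.5-5 bits per weight, and real GGUF files keep some tensors at higher precision, so actual sizes land a little above this estimate):

```python
# Back-of-envelope weight size for a Q4-quantized model.
# Real GGUF files keep some tensors at higher precision, so actual sizes
# (e.g. ~19 GB for qwen3-coder:30b) sit somewhat above this estimate.

def quantized_size_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    # params (in billions) * bits per weight / 8 bits-per-byte = gigabytes
    return params_billions * bits_per_weight / 8

for name, params_b in [("qwen2.5-coder:7b", 7), ("qwen2.5-coder:32b", 32), ("qwen3-coder:30b", 30)]:
    print(f"{name}: ~{quantized_size_gb(params_b):.1f} GB of weights")
```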
Context window adds memory overhead (based on Llama3.1:8B):
| Context Size | Total VRAM | Additional Memory |
|---|---|---|
| 4K (default) | 5.4 GB | Base |
| 16K | 7.7 GB | +2.3 GB |
| 32K | 10.8 GB | +5.4 GB |
| 64K | 17.3 GB | +11.9 GB |
Larger models will have proportionally higher overhead.
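Most of that overhead is the KV cache, which grows linearly with context length. Here's a minimal sketch of the arithmetic, assuming Llama 3.1 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache; the measured numbers above are higher because Ollama also allocates compute buffers and other runtime overhead:

```python
# Ballpark KV-cache size: 2 tensors (K and V) x layers x KV heads x head_dim x bytes, per token.

def kv_cache_gb(ctx_tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return ctx_tokens * per_token_bytes / 1e9

for ctx in (4_096, 16_384, 32_768, 65_536):
    print(f"{ctx // 1024}K context: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```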
My setup: Apple M4 Max with 128GB unified memory, 40 GPU cores.
Claude Code Installed
# Native installation (recommended)
curl -fsSL https://claude.ai/install.sh | bash
# Or via npm (requires Node.js 18+)
npm install -g @anthropic-ai/claude-code
# For macOS
brew install --cask claude-code
# Verify installation
claude --version
Setting Up Ollama
Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
# Should show v0.14.0 or later (required for Anthropic API support)
Start the Ollama service if you installed via Homebrew:
brew services start ollama
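If you prefer to confirm from a script (rather than the CLI) that the server is actually up, Ollama's API exposes a version endpoint; a minimal sketch:

```python
# Confirm the local Ollama server is running and reachable.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/version") as resp:
    print("Ollama server version:", json.load(resp)["version"])
```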
Pull the Model
ollama pull qwen3-coder:30b
Verify that the model is downloaded and available locally:
ollama ls
# This should return something like below
# NAME ID SIZE MODIFIED
# qwen3-coder:30b 06c1097efce0 18 GB 2 seconds ago
Why qwen3-coder?
Not all models work with Claude Code's tool system. qwen3-coder (Alibaba's coding-focused model) does. I tested a handful of models; here are my findings.
The problem with qwen2.5-coder: It understands the tool format and generates correct JSON, but Claude Code doesn't recognize it as an actual tool invocation. The model outputs something like {"name": "Read", "arguments": {"file_path": "/path/to/file"}} as plain text instead of actually reading the file. Even with 64K context, it never executes tools.
qwen3-coder properly triggers Claude Code’s tool execution layer.
| Model | Tool Calling | Result |
|---|---|---|
| qwen3-coder | Works | Tools execute properly |
| qwen2.5-coder:32b | Partial | Outputs JSON but tools don’t execute |
| deepseek-coder-v2 | Failed | Does not support tool calling |
| codestral | Failed | Does not support tool calling |
Requirements for any model:
- Tool/function calling support
- Minimum 64K context window (via the num_ctx parameter)
- Ollama v0.14.0+ installed
Note: The performance observations in this article are based on testing with qwen3-coder:30b specifically. Other open source models may perform differently.
Other models to try: glm-4.7 and gpt-oss (20b/120b) also support tool calling and work with Claude Code. If you’re already running these, the same setup steps apply.
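Before wiring a different model into Claude Code, you can probe whether it emits structured tool calls at all. A minimal sketch using the ollama Python package (pip install ollama); the read_file tool here is a made-up example, and response field names may differ slightly between client versions:

```python
# Probe: does this model return a structured tool call, or just describe one in text?
import ollama

# Hypothetical tool definition in the standard function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk and return its contents",
        "parameters": {
            "type": "object",
            "properties": {"file_path": {"type": "string"}},
            "required": ["file_path"],
        },
    },
}]

response = ollama.chat(
    model="qwen3-coder:30b",
    messages=[{"role": "user", "content": "Read the file /tmp/example.py"}],
    tools=tools,
)

calls = response.message.tool_calls
if calls:
    print("Structured tool call:", calls[0].function.name, calls[0].function.arguments)
else:
    print("No tool call, plain text only:", (response.message.content or "")[:200])
```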
Context Window: The Critical Setting
Why Context Matters
What happens if the context window is too small?
Think of the context window as the model’s working memory. It’s how much text (code, conversation history, tool definitions, and responses) the model can see at once.
With a small context window (4K tokens):
- The model forgets earlier parts of the conversation
- Tool definitions get truncated or lost entirely
- Claude Code’s system prompt (which explains how to use tools) may not fit
- The model outputs tool calls as plain text instead of executing them
With a larger context window (64K+ tokens):
- Larger conversation history is retained
- Tool definitions remain visible to the model
- The model understands it should execute tools, not just describe them
- Tools actually run, enabling proper agentic behavior
This is why increasing context is non-negotiable. Without it, you don’t have an agent; instead, you have an autocomplete that talks about tools.
But bigger isn’t always better. Very large context windows (128K+) can actually degrade performance:
- Lost in the middle problem - Models tend to focus on the beginning and end of the context, struggling to recall information buried in the middle. The longer the context, the worse this gets.
- Slower inference - More tokens means more computation. A 128K context will be noticeably slower than 64K.
- Diluted attention - The model’s attention mechanism spreads thinner across more tokens, potentially missing important details.
- Diminishing returns - For most coding tasks, you won’t use anywhere near 128K tokens. You’re paying the memory and speed cost for capacity you don’t need.
So what’s the sweet spot? Rather than just claiming a number, let’s experiment and find out.
The Experiment
Rather than guessing, I tested different context sizes with a real agentic workload to find where things break.
The Setup
Model: qwen3-coder:30b
Test project: 5 interconnected Python files (models.py, database.py, services.py, routes.py, utils.py) totaling ~2,400 lines. Each file contained specific “memorable details” (constants, thresholds, magic values) to test recall.
Prompts tested (in sequence, same session):
1. Read all the Python files in this directory and explain how they work together
2. What is the ASCENSION_THRESHOLD value in services.py and what does it do?
3. What is the TOMBSTONE_RETENTION_DAYS value in database.py?
4. How does the OBSIDIAN loyalty tier in models.py connect to the tier calculation in services.py?
5. What is the MAGIC_SALT in utils.py and where else is it referenced?
6. Now read the buggy_shopping_cart.py file in the parent directory and compare its discount logic to the Coupon class in models.py
7. Write a new function in services.py that combines the ASCENSION_THRESHOLD logic with the FOUNDERS50 coupon from database.py. Explain how they would interact.
8. Looking at all the files you've read, list every constant that contains a number (like BURST_LIMIT=100, POOL_LIMIT=20, etc.) and explain what each does.
9. What is the exact admin token value in routes.py and what endpoints require it?
10. Trace the complete flow: A customer with OBSIDIAN tier places an order over $500 using FOUNDERS50 coupon. Walk through models.py -> services.py -> routes.py showing exact function calls.
What these prompts test:
- Prompts 1-5: Basic multi-file reading and recall
- Prompt 6: Adding another file mid session
- Prompt 7: Code generation using cross file context
- Prompt 8: Exhaustive recall across ALL files simultaneously
- Prompt 9: Specific detail retrieval (testing for “lost in the middle”)
- Prompt 10: Complex multi-file reasoning and flow tracing
Test 1: 16K Context
Result: ❌ Complete failure
Prompt: “Read all the Python files in this directory and explain how they work together”
I ran this test twice and it failed differently each time - inconsistent failures, but consistently failing.
Run 1:
- Model found 5 Python files ✅
- Read models.py (367 lines) ✅
- Never proceeded to other 4 files ❌
- Prematurely concluded with “Is there a specific aspect you’d like me to examine further?”
Run 2 (worse):
| Step | Action | Status |
|---|---|---|
| 1 | Found 5 Python files | ✅ |
| 2 | Read models.py (367 lines) | ✅ |
| 3 | Created Task #1 | ✅ |
| 4 | Hallucinated src/data/models.ts (doesn’t exist) | ❌ |
| 5 | Got confused, used find and ls to explore | ❌ |
| 6 | Re-read models.py (2nd time) | ❌ |
| 7 | Created Task #2 (duplicate) | ❌ |
| 8 | Re-read models.py (3rd time) | ❌ |
| 9 | Created Task #3 (another duplicate) | ❌ |
| 10 | Analyzed only models.py | ❌ |
| 11 | Marked all 3 duplicate tasks as complete | ❌ |
| 12 | Never read the other 4 files | ❌ |
| 13 | Hallucinated being in “plan mode” and asked to exit | ❌ |
Files read: 1 of 5 (models.py only - read 3 times in Run 2)
Files never touched: database.py, services.py, routes.py, utils.py
The model completely forgot its original goal (“read all Python files”), created duplicate tasks, hallucinated file paths and modes, and declared success after analyzing only 1 file.
Conclusion: 16K is completely insufficient for multi-file agentic work.
Test 2: 32K Context
Result: ❌ Failed - stuck in read loops
Multi-file prompt: “Read all the Python files in this directory and explain how they work together”
The model got stuck reading the same files repeatedly:
| File | Times Read |
|---|---|
| models.py | 5x |
| database.py | 5x |
| services.py | 4x |
| routes.py | 1x |
| utils.py | 0x (never read) |
The model kept cycling: models.py -> database.py -> services.py -> models.py -> ... for 14+ minutes without completing.
2-file prompt: “Read models.py and services.py and explain what it is”
Even a simple 2-file task failed:
- Read models.py ✅
- Read services.py ✅
- Read database.py (not requested - scope creep) ❌
- Re-read models.py (loop starting) ❌
- Stuck “Cogitating” for 3+ minutes ❌
Single-file tasks work fine at 32K. When asked to read just one file, the model performed correctly. The issue is specifically with multi-file coordination.
Conclusion: 32K is insufficient for multi-file agentic work.
Test 3: 64K Context
Result: ✅ Completed all prompts (with quality caveats)
| Prompt | Type | Result |
|---|---|---|
| 1-5 | Multi-file reading & basic recall | ✅ Pass with minor inaccuracies |
| 6 | Cross-file comparison | ⚠️ Mixed up comment with class field, missed 2/3 bugs |
| 7 | Code generation | ❌ Wrong class placement, incorrect variable reference |
| 8 | Exhaustive recall | ❌ Hallucinated constant names, missed items |
| 9 | Specific detail retrieval | ✅ Pass with minor misattribution |
| 10 | Complex flow tracing | ⚠️ Conflated concepts, missed bug, incomplete |
64K solves the looping problem - the model completed all 10 prompts without getting stuck. However, quality varies:
- ✅ Strong: Multi-file reading, basic recall, specific detail lookup
- ⚠️ Mixed: Cross-file comparison, complex flow tracing
- ❌ Weak: Code generation, exhaustive listing (hallucinations)
Conclusion: 64K worked for multi-file agentic work, but expect variable quality on complex tasks. Ollama recommends a 64K context window, and our results align with that.
Based on these tests, 64K is the minimum for Claude Code with qwen3-coder on multi-file work. Other models may have different requirements.
How to Set 64K Context
By default, Ollama uses a context window of 4,096 tokens, which is not sufficient for Claude Code's multi-step coding workflows.
To increase the context window, we'll use a Modelfile.
A Modelfile is like a Dockerfile for LLMs. It lets you customize a base model with your own parameters.
Create a Modelfile to increase context:
# Create Modelfile with 64K context (recommended)
echo "FROM qwen3-coder:30b
PARAMETER num_ctx 65536" > Qwen3Modelfile64k
# Create the custom model
ollama create qwen3-coder-64k -f Qwen3Modelfile64k
The num_ctx parameter (context window size) determines how much text the model can see at once.
How to verify your actual context size:
Run ollama show qwen3-coder-64k and check num_ctx under the Parameters section. It should match the value you set in the Modelfile. If num_ctx is different or missing, the configuration was not applied to the model.
You can also check the context size the model is actually using while it's running:
ollama ps
If the CONTEXT column shows 4096, or any number different from the num_ctx you defined, your Modelfile wasn't applied correctly.
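If you'd rather verify this from code (say, in a setup script), the ollama Python client exposes the same information. A small sketch, assuming pip install ollama; the exact response fields can vary between client versions:

```python
# Print the parameters baked into the custom model (should include num_ctx 65536).
import ollama

info = ollama.show("qwen3-coder-64k")
print(info.parameters)  # newline-separated PARAMETER values from the Modelfile
```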
Connecting Claude Code to Ollama
Environment Variables
These three variables tell Claude Code where to find Ollama and how to authenticate. The dummy credentials work because Ollama runs on your machine and doesn't require authentication.
Add these to your ~/.bashrc or ~/.zshrc to make them persist across all terminals, or simply export them in your current terminal session.
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
If you added them to ~/.bashrc or ~/.zshrc, reload your shell:
source ~/.zshrc # or ~/.bashrc
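Before launching Claude Code, you can sanity-check the whole path with the official anthropic Python SDK pointed at the same base URL, assuming Ollama's Anthropic-compatible endpoint is reachable there (which is exactly what Claude Code relies on). A minimal sketch, with pip install anthropic:

```python
# Send one Messages-API request through the same base URL Claude Code will use.
import os
from anthropic import Anthropic

client = Anthropic(
    base_url=os.environ.get("ANTHROPIC_BASE_URL", "http://localhost:11434"),
    auth_token=os.environ.get("ANTHROPIC_AUTH_TOKEN", "ollama"),
)

reply = client.messages.create(
    model="qwen3-coder-64k",
    max_tokens=128,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply.content[0].text)
```

If this prints a response, Claude Code will be able to reach the same model.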
Run It
claude --model qwen3-coder-64k
Demo: Shopping Cart Bug Fix
Let's prove this actually works. Here's a buggy Python shopping cart; Claude Code, running the open source model, will find and fix the bugs using its tools. You can find the complete demo code on GitHub.
The Setup
The buggy code has 3 hidden bugs:
class ShoppingCart:
def __init__(self):
self.items = []
self.discount_code = None
def add_item(self, name, price, quantity):
# Bug 1: Doesn't check if item exists, creates duplicates
item = {"name": name, "price": price, "quantity": quantity}
self.items.append(item)
def remove_item(self, name):
# Bug 2: Modifying list while iterating skips items
for item in self.items:
if item["name"] == name:
self.items.remove(item)
def apply_discount(self, code):
# Bug 3: Silently accepts invalid codes
self.discount_code = code
def get_discount_amount(self):
subtotal = self.get_subtotal()
if self.discount_code == "SAVE10":
return subtotal * 0.10
elif self.discount_code == "SAVE20":
return subtotal * 0.20
return 0 # Invalid codes return 0 and user thinks it worked
# ... rest of the class
Note: I removed these comments before testing; the model had to find bugs on its own.
The Prompt
Find and fix bugs in buggy_shopping_cart.py
What Actually Happened
The model used tools correctly:
- Read tool to examine the file
- Bash tool to verify file contents
- Write tool to create fixed versions
- Bash tool to run and test the fixes
But here’s where it got interesting - instead of editing the original file, it created three separate files:
- buggy_shopping_cart_fixed.py
- buggy_shopping_cart_fixed_simple.py
- buggy_shopping_cart_fixed_comprehensive.py
Bugs Found
| Bug | Simple Version | Comprehensive Version |
|---|---|---|
| List mutation in remove_item() | ✅ Fixed | ✅ Fixed |
| Invalid discount codes | ❌ Missed | ✅ Fixed |
| Duplicate items in add_item() | ❌ Missed | ❌ Missed |
Result: 2 of 3 bugs found (in the comprehensive version).
The model caught the classic modify-while-iterating antipattern and added discount code validation. It missed the duplicate item bug, which is arguably more of a design decision than an obvious bug.
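For reference, here's what fixing all three bugs could look like. This is my own sketch, not the model's output, and get_subtotal is reconstructed from the snippet above:

```python
class FixedShoppingCart:
    VALID_CODES = {"SAVE10": 0.10, "SAVE20": 0.20}

    def __init__(self):
        self.items = []
        self.discount_code = None

    def add_item(self, name, price, quantity):
        # Fix 1: merge quantities instead of creating duplicate entries
        for item in self.items:
            if item["name"] == name:
                item["quantity"] += quantity
                return
        self.items.append({"name": name, "price": price, "quantity": quantity})

    def remove_item(self, name):
        # Fix 2: build a new list instead of mutating while iterating
        self.items = [item for item in self.items if item["name"] != name]

    def apply_discount(self, code):
        # Fix 3: reject invalid codes loudly instead of silently accepting them
        if code not in self.VALID_CODES:
            raise ValueError(f"Unknown discount code: {code}")
        self.discount_code = code

    def get_subtotal(self):
        return sum(item["price"] * item["quantity"] for item in self.items)

    def get_discount_amount(self):
        rate = self.VALID_CODES.get(self.discount_code, 0)
        return self.get_subtotal() * rate
```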
The key takeaway: tools actually executed. The model didn’t just output JSON describing what it would do—instead, it read files, created new files, and tested the files to verify the fixes worked.
However, notice the behavioral differences from Claude’s models:
- Overengineering: Created 3 files when asked to just “fix bugs”
- Wrote new files instead of editing: Used Write tool instead of Edit tool
- Missed subtle bugs: Found obvious patterns, missed business logic issues
Note on variability: I ran the same prompt three times and got different results each time:
| Run | Bugs Fixed | Approach | Time |
|---|---|---|---|
| 1 | 2/3 | Created 3 new files | ~1m |
| 2 | 2/3 | Edited in place | ~1.5m |
| 3 | 1/3 | Edited + wrote tests, got stuck debugging | ~6m |
Run 3 showed scope creep - the model decided to write a 92-line test file when I only asked it to fix bugs, then spent most of its time debugging floating-point assertion failures in its own tests.
Why does this happen? LLMs are probabilistic: each token is sampled from a probability distribution, not deterministically chosen. Without a fixed seed, the same prompt can lead down different reasoning paths. You can reduce the variance by setting temperature to 0 and a fixed seed in your Modelfile, but most users won't bother.
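If you want to experiment with reproducibility, the same knobs are available as request options in the ollama Python client, or as PARAMETER temperature 0 and PARAMETER seed 42 in a Modelfile. A minimal sketch; note that greedy decoding trades away some exploratory behavior, and determinism still isn't guaranteed across hardware or Ollama versions:

```python
# Reduce run-to-run variance with greedy decoding (temperature 0) and a fixed seed.
import ollama

response = ollama.chat(
    model="qwen3-coder:30b",
    messages=[{"role": "user",
               "content": "Suggest a fix for an off-by-one error in a Python range() loop."}],
    options={"temperature": 0, "seed": 42},
)
print(response.message.content)
```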
Troubleshooting
| Problem | Solution |
|---|---|
| Model not found | Use full name with tag: qwen3-coder:30b not qwen3-coder. Check ollama list |
| Tools output as JSON instead of executing | 1. Increase context to 64K+. 2. Use qwen3-coder (not qwen2.5). 3. Update Ollama to v0.14.0+ |
| Slow response | Hardware dependent. Larger models need more resources |
| Context too long | Create model with larger num_ctx parameter |
| Connection refused | Make sure ollama serve is running |
Limitations (Honest Take)
What Works Well
- Tool use - File reading, editing, and bash commands work reliably
- Multi-turn conversations - Context is maintained
- Code generation - Quality is good for most tasks
- No rate limits - Work as long as you want
What’s Different from Claude API
- Reasoning quality - qwen3-coder:30b isn’t quite at Claude Sonnet/Opus level for complex tasks
- Speed - Depends on hardware; can be slower than API
- Context handling - Some models struggle with very long contexts
- Edge cases - May need more explicit prompting
- Consistency - Same prompt can produce different approaches, scope creep, or get stuck in loops. LLMs are probabilistic, and qwen3-coder showed more variance than Claude’s models in my testing
My Take
For day-to-day coding tasks (writing functions, debugging, refactoring), qwen3-coder:30b via Ollama works great. For complex architectural decisions or tricky debugging that requires deep reasoning, you might still want to use Claude models. The good news: you can switch between them easily by changing the model parameter.
Conclusion
Go with Ollama + qwen3-coder at 64K context. Our testing showed 64K is the minimum required for multi-file agentic work; 16K and 32K both fail with looping and hallucinations.
What you get:
- Completely free
- Private code never leaves your machine
- Offline capable
- No rate limits
Expect variable quality on complex tasks (code generation, exhaustive recall), but basic multi-file reading and detail lookup work well. The tool use actually works.
No local hardware? OpenRouter offers cloud access to open source models with a free tier, but that's a topic for another post.
Try it out and let me know how it goes.
References & Further Reading
- Demo Code Repository
- Ollama Library
- Qwen Docs: Speed Benchmark
- Unsloth: Qwen3-Coder Local Setup
- HuggingFace GGUF Models
- Ollama Context Memory Usage
- Ollama Anthropic Compatibility
- Ollama Context Length
- Lost in the Middle: How Language Models Use Long Contexts
- Databricks: Long Context RAG Performance
- Scale AI: Long Context Instruction Following
- Pinecone: Why Use Retrieval Instead of Larger Context