I built dev_search without having built a search system before. The moment I knew it worked was not a benchmark. It was watching Claude debug.
I was on a Google Hangout for a demo party, the Lytics team sharing what we'd all been building with AI that week. It's a tight-knit group. Some of us have been working together for almost a decade. My palms were a little sweaty. I'd been listening to everyone else demo their work and was eager to show mine.
I wanted to show the difference dev-agent made, so I ran the same debugging task twice: find why search returns duplicates.
Without dev-agent, Claude did what it usually does in a repo it doesn't understand yet. It guessed paths. It ran bash commands. It read whole files trying to triangulate the right function. It proposed writing a debug script just to figure out where the bug lived. Thirteen minutes. Ninety-nine cents.
With dev-agent, same question. Claude used dev_search. Five minutes later it had the root cause: no deduplication in the search pipeline, plus four reasonable fixes. Fifty-seven cents.
Sixty-two percent less time. Forty-two percent cheaper.
The model weights didn't change. The question didn't change. What changed was the shape of the retrieval.
what actually made the difference
I came into this thinking the embedding model would be the hard part.
It wasn't.
The thing that mattered most was much simpler: return snippets, not file paths.
When an assistant gets a file path, it still has to go hunting. It opens the file, scrolls, reads, guesses, backtracks, opens another one. If the file is big, you pay for all of that in time and tokens before it has even started solving the problem.
When it gets a snippet with the function body, the name, the line numbers, and a little metadata, the hunting mostly stops.
That was the real design decision.
Here is the shape of the result:
[
  {
    "path": "src/auth/service.ts",
    "name": "generateToken",
    "startLine": 88,
    "endLine": 112,
    "snippet": "export function generateToken(user: User) { ... }",
    "score": 0.93
  }
]
That is enough to answer a lot of codebase questions immediately. Or at least enough to get the assistant pointed in the right place without making it read 18,000 tokens of surrounding file first.
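A minimal sketch of how that shape could be typed and rendered for an assistant. The field names mirror the JSON above; `SearchResult` and `formatForAssistant` are hypothetical names chosen for illustration, not necessarily what dev-agent uses.

```typescript
// Illustrative types only: SearchResult and formatForAssistant are assumed names.
interface SearchResult {
  path: string;
  name: string;
  startLine: number;
  endLine: number;
  snippet: string;
  score: number; // similarity, higher is better
}

// Render one result as a compact block the assistant can read directly,
// without opening the file.
function formatForAssistant(r: SearchResult): string {
  const header = `${r.path}:${r.startLine}-${r.endLine} ${r.name} (score ${r.score.toFixed(2)})`;
  return `${header}\n${r.snippet}`;
}

const example: SearchResult = {
  path: "src/auth/service.ts",
  name: "generateToken",
  startLine: 88,
  endLine: 112,
  snippet: "export function generateToken(user: User) { ... }",
  score: 0.93,
};

console.log(formatForAssistant(example));
```

The header line carries the location and the name, so the assistant can cite or jump to the code without a separate lookup.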
Once I saw that clearly, the rest of the architecture got easier to reason about.
the architecture is smaller than it sounds
The system has four moving parts:
- scan the repo into semantic units
- embed those units
- store the vectors
- format results for AI consumption
That's it.
What made it useful was not novelty. It was choosing the right unit at each step.
The scanner chunks by meaning instead of token count, so functions and classes stay intact.
The embedder uses a local MiniLM model because I wanted it cheap, local, and good enough.
The vector store is LanceDB because it was the easiest embedded option that did not ask me to run more infrastructure than the project deserved.
And the formatter returns actual code snippets rather than developer-friendly-but-AI-hostile output.
That last part is where most of the gain came from.
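The four parts can be sketched as a handful of small interfaces. Everything below is an assumption made for illustration: the interface names are hypothetical, `toyEmbedder` is a keyword counter standing in for MiniLM, `MemoryStore` is a brute-force array standing in for LanceDB, and `src/search/run.ts` is a made-up second file.

```typescript
// Hypothetical seams for the four stages; not the actual dev-agent types.
interface CodeUnit { path: string; name: string; startLine: number; endLine: number; text: string; }
interface Embedder { embed(text: string): number[]; }
interface Hit { unit: CodeUnit; score: number; }
interface VectorStore {
  add(unit: CodeUnit, vector: number[]): void;
  query(vector: number[], limit: number): Hit[];
}

// Toy stand-in for a real embedding model: counts a few keywords.
// It exists only so the sketch runs without downloading a model.
const KEYWORDS = ["token", "auth", "search", "user"];
const toyEmbedder: Embedder = {
  embed: (text) => KEYWORDS.map((k) => text.toLowerCase().split(k).length - 1),
};

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Brute-force in-memory store standing in for an embedded vector database.
class MemoryStore implements VectorStore {
  private rows: { unit: CodeUnit; vector: number[] }[] = [];
  add(unit: CodeUnit, vector: number[]): void {
    this.rows.push({ unit, vector });
  }
  query(vector: number[], limit: number): Hit[] {
    return this.rows
      .map((r) => ({ unit: r.unit, score: cosine(vector, r.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, limit);
  }
}

// Index: scanned units -> embed -> store.
function indexUnits(units: CodeUnit[], embedder: Embedder, store: VectorStore): void {
  for (const u of units) store.add(u, embedder.embed(`${u.name} ${u.text}`));
}

// Search: embed the query -> nearest units -> snippet-first output.
function search(query: string, embedder: Embedder, store: VectorStore, limit = 3): string[] {
  return store
    .query(embedder.embed(query), limit)
    .map((h) => `${h.unit.path}:${h.unit.startLine}-${h.unit.endLine} ${h.unit.name}\n${h.unit.text}`);
}

const store = new MemoryStore();
indexUnits(
  [
    { path: "src/auth/service.ts", name: "generateToken", startLine: 88, endLine: 112, text: "export function generateToken(user: User) { ... }" },
    { path: "src/search/run.ts", name: "runSearch", startLine: 10, endLine: 40, text: "export function runSearch(q: string) { ... }" },
  ],
  toyEmbedder,
  store,
);
const results = search("where do we generate auth tokens?", toyEmbedder, store, 1);
console.log(results[0]);
```

Because each stage sits behind its own interface, replacing the toy embedder or the in-memory store with a real one should not touch `indexUnits` or `search`.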
what mattered more than I expected
Three things mattered more than the embedding model itself.
semantic chunking
Most RAG systems chunk by token count. That gives you arbitrary boundaries and half-functions.
I chunked by semantic units using the TypeScript compiler API, so a function stayed a function and a class stayed a class.
That sounds like an implementation detail until you imagine the assistant trying to answer "where do we generate auth tokens?" from a fragment that starts halfway through one function and ends halfway through another.
Meaningful chunks gave the search results enough integrity to be useful.
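A rough sketch of chunking by semantic unit with the TypeScript compiler API, assuming the `typescript` package is installed. It only handles top-level function and class declarations; the `Chunk` type and `chunkSource` function are hypothetical names, and a real scanner would likely also cover methods, arrow functions, and other constructs.

```typescript
import * as ts from "typescript";

// One semantic unit: a whole top-level function or class, never a fragment.
interface Chunk {
  kind: "function" | "class";
  name: string;
  startLine: number; // 1-based
  endLine: number; // 1-based
  text: string;
}

// Walk the top-level statements of a file and keep each named function
// and class declaration intact as its own chunk.
function chunkSource(fileName: string, source: string): Chunk[] {
  const sf = ts.createSourceFile(fileName, source, ts.ScriptTarget.Latest, true);
  const chunks: Chunk[] = [];
  ts.forEachChild(sf, (node) => {
    if ((ts.isFunctionDeclaration(node) || ts.isClassDeclaration(node)) && node.name) {
      const start = node.getStart(sf); // skips leading comments and whitespace
      const end = node.getEnd();
      chunks.push({
        kind: ts.isFunctionDeclaration(node) ? "function" : "class",
        name: node.name.text,
        startLine: sf.getLineAndCharacterOfPosition(start).line + 1,
        endLine: sf.getLineAndCharacterOfPosition(end).line + 1,
        text: source.slice(start, end),
      });
    }
  });
  return chunks;
}

// A made-up sample file, used only to exercise the sketch.
const sample = [
  "export function generateToken(userId: string) {",
  "  return 'token-' + userId;",
  "}",
  "",
  "class SessionStore {",
  "  private sessions = new Map<string, string>();",
  "}",
].join("\n");

const chunks = chunkSource("sample.ts", sample);
for (const c of chunks) console.log(`${c.kind} ${c.name} lines ${c.startLine}-${c.endLine}`);
```

Each chunk carries its own line range, which is what later lets a search result point at exact lines instead of a whole file.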
metadata
I kept more than just the snippet:
- file path
- line numbers
- function name
- signature
- callees
- export status
That metadata is what makes the results actionable instead of merely relevant.
It is also what lets the tools chain together cleanly. dev_search finds the likely code. dev_refs can follow relationships from there. dev_history can look at the file once you actually know it matters.
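One way to picture that chaining, as a sketch under stated assumptions: the tool names `dev_refs` and `dev_history` come from the project, but the `UnitMetadata` type, the `FollowUp` argument shapes, and the callee names `signJwt` and `loadSecret` are all invented here for illustration.

```typescript
// Hypothetical metadata record; fields follow the list above.
interface UnitMetadata {
  path: string;
  startLine: number;
  endLine: number;
  name: string;
  signature: string;
  callees: string[];
  exported: boolean;
}

// A follow-up step an assistant might take after a search hit.
// The argument shapes are assumptions, not the real tool schemas.
type FollowUp =
  | { tool: "dev_refs"; symbol: string }
  | { tool: "dev_history"; path: string };

// Turn one hit's metadata into concrete next steps: follow each callee,
// then look at the history of the file that contains the hit.
function suggestFollowUps(meta: UnitMetadata): FollowUp[] {
  const refs: FollowUp[] = meta.callees.map((symbol) => ({ tool: "dev_refs" as const, symbol }));
  return [...refs, { tool: "dev_history", path: meta.path }];
}

const meta: UnitMetadata = {
  path: "src/auth/service.ts",
  startLine: 88,
  endLine: 112,
  name: "generateToken",
  signature: "generateToken(user: User)",
  callees: ["signJwt", "loadSecret"], // made-up callees
  exported: true,
};

const followUps = suggestFollowUps(meta);
console.log(JSON.stringify(followUps, null, 2));
```

Without the callee list and the path in the result, each of those next steps would start with another round of file reading.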
result formatting
This is the part I keep coming back to because it still feels underrated.
Developer tools are often designed for humans who are comfortable opening files and orienting themselves. AI assistants are different. They do better when the retrieval result is already close to the answer surface.
That changed how I thought about tool design more generally. The question stopped being "what would a developer want printed to the terminal?" and became "what is the smallest meaningful unit that lets the model stop guessing?"
what I chose and why
I made a lot of choices in this project without pretending they were optimal.
I asked Claude to compare embedding models. I asked about vector stores. I asked how to turn L2 distance into a similarity score that felt sane. Then I picked something reasonable and moved on.
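For the L2 question, here are two common conversions, shown as a sketch rather than as the formula dev-agent ships. The first is exact only when the embeddings are normalized to unit length, because then d² = 2 − 2·cos. The second works for any vectors but its scale is arbitrary. Some vector stores report squared L2 distance; if yours does, skip the squaring step.

```typescript
// Illustrative conversions from L2 (Euclidean) distance to a similarity score.
// Function names are hypothetical.

// Exact for unit-length vectors: d^2 = 2 - 2*cos, so cos = 1 - d^2 / 2.
// Identical vectors give 1, orthogonal vectors give 0, opposite vectors give -1.
function cosineFromL2(distance: number): number {
  return 1 - (distance * distance) / 2;
}

// Works for any vectors: maps distances in [0, infinity) onto (0, 1].
// Monotonic, so ranking is preserved, but the numbers have no geometric meaning.
function inverseDistanceScore(distance: number): number {
  return 1 / (1 + distance);
}

for (const d of [0, 0.5, 1, 2]) {
  console.log(d, cosineFromL2(d), inverseDistanceScore(d));
}
```

Both functions preserve the ranking order of the raw distances; the choice mostly affects how the scores read to a human and where a relevance threshold would sit.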
MiniLM fit the constraints:
- local
- cheap
- documented
- good enough
LanceDB fit the constraints too:
- embedded
- persistent
- zero real setup
That phrase, good enough, mattered more here than perfection. I did not need the best embedding model in the world. I needed something that let me test whether the architecture around it was actually useful.
That turned out to be enough.
In practice, the architecture mattered more than the model.
what broke once I used it for real
The commit history tells the truth faster than the architecture diagram.
A few things broke quickly:
- I had an early version matching filenames too eagerly instead of code content.
- My first similarity scoring formula produced rankings that felt obviously wrong when I looked at real results.
- Cursor integration surfaced problems the tests did not: zombie processes, stdin weirdness, shutdown behavior.
- Event listeners leaked until I cleaned up the lifecycle more carefully.
None of that surprised me much after the fact. The part I care about is that the shape of the system made the fixes local.
That is another thing I learned here. When the architecture is modular for real, being wrong is cheaper.
I could swap scoring. Swap stores. Improve formatting. Rework scanner behavior. None of those changes required rethinking the whole tool.
when this tool wins and when it doesn't
dev_search is good at conceptual questions in unfamiliar codebases.
Questions like:
- where do we handle authentication?
- where do we generate tokens?
- what code touches this workflow?
It is not the best tool for exact string matching. That is still grep.
That distinction matters because semantic search gets oversold very easily. It is not a replacement for every other search mode. It is a better first move when the question is about meaning rather than spelling.
That was exactly the shape of the debugging task in the demo.
Claude did not need another path to more file reads. It needed a faster way to understand where the logic probably lived.
what I'm keeping from this
Tool design changes when the user is an assistant.
AI tooling gets much better when you design for the assistant's consumption model instead of for your own habits.
That means:
- return snippets, not references
- keep semantic units intact
- include the metadata that makes the result actionable
- make it cheap to swap parts once you learn more
I built this in a week without having built search before. Claude helped compress the exploration phase. I still had to choose. I still had to ship. I still had to look at the results and decide whether they made sense.
That is the part of working with AI I trust most right now. Let it compress the search space. Then make the decision. Then leave yourself room to be wrong.
FAQ
What is dev_search and how does it help AI code assistants?
dev_search is a semantic code search tool that returns ranked code snippets for natural-language questions. Instead of making the assistant read whole files just to find one function, it returns the likely code directly, along with metadata that makes the result actionable.
How does semantic code search differ from grep?
grep is for exact string matching. Semantic code search is for meaning. If the question is "where do we handle authentication?" and the code does not literally say authentication, semantic search still has a shot. grep does not.
Why does returning snippets matter so much?
Because it changes the retrieval unit. If the assistant already has the code it needs, plus line numbers and metadata, it stops spending time and tokens rediscovering that code by opening large files.
What mattered more: the embedding model or the architecture around it?
The architecture around it. Semantic chunking, useful metadata, and snippet-first formatting mattered more than squeezing marginal quality out of the embedding model.
When should you use dev_search?
Use it for conceptual questions in unfamiliar codebases: where something happens, what code likely owns a workflow, or how a concept is implemented. For exact string lookup, use grep.