I am not an AI evangelist. If you would like to read a piece about how the latest models are going to change the world or destabilize entire industries, this is probably not meant for you. What I am is a mid-level software developer who has spent the last several months with an intense interest in tooling, and LLMs are the area where developer tooling has seen the loudest and fastest advancements in that timeframe. I would also like to preface this piece with something obvious: the landscape around LLM tooling changes very quickly, and what is true today may not be true for long.

On LLM Usage as a Developer

Before I dive into explaining and comparing tools, I want to go over various examples of LLM tooling I observe in both my own and others’ workflows, as well as my opinions on each use case, so that readers may understand where I’m coming from in terms of how I evaluate these tools.

Autocomplete

I think most developers are familiar with LLM autocomplete at this point, and for good reason: done well, it’s really nice. It isn’t a transformative, earth-shattering change (indeed, we regularly hear advocates of agentic coding disparage it, mostly for not being any kind of paradigm shift), but unlike some other LLM use cases, there are virtually no downsides to using it. Even better, with a good implementation (Cursor) it is a genuine joy to use, something that is often ignored in the boardroom-focused, ad-in-disguise material that tends to dominate coverage of LLM tooling. This was the first LLM feature a lot of devs were exposed to, and I’m glad, because if we had used early agents more, we’d all be soured on LLMs beyond repair. By no means essential, but a definite plus in my book.

Planning

I didn’t use LLMs for planning for a long time, and I will admit that I’m kicking myself for that now. I have had friends and coworkers argue to me that quality plans can be produced without LLM assistance, and they are right - (see what I did there with the emdash?) with enough time, you can almost certainly produce better plans without an LLM than with one. However, you would also spend an enormously greater amount of time doing so, to the point that, even as someone who did not use LLM planning at all a few months ago, I would go so far as to call some version of this the “correct” way of doing things in 2026. My personal workflow is to use the website version of an LLM for loose, 100,000-foot plans that don’t require code context, and the planning mode of an LLM harness when I am ready to create something more actionable. I also find it particularly helpful to ask the LLM to identify any pitfalls in the plan. Because they are oh so amiable, they always do identify issues when pressed, and while more than half of the things they fret about are nonsense, the ~30% of identified issues that are substantial have saved me a great deal of time. This has actually been the most transformative use of LLMs for me, because while they can sometimes get implementation wrong, working with an LLM you can consistently arrive at a quality, thorough plan in a fraction of the time previously possible.

As a “Super Search” Tool

This doesn’t get talked about enough, but it is actually a prime use case for LLMs in my opinion. Sending your robot butler to find examples of something in the codebase for you can work wonders when you want to make project-wide (or even system-wide) searches that can’t be easily translated into a query string/regexp. The amount of time/headache this can save is considerable, and not to be underestimated. It can also “understand” unfamiliar swathes of code at high speed and summarize them, which can be another great time-saver (although I do encourage a healthy amount of skepticism toward the bot’s conclusions). An underrated feature for sure, and something that is truly unique to LLMs rather than just an enhanced version of an existing non-AI feature like code completion.

As StackOverflow 2.0

This is a common LLM use case, and that is probably because it (mostly) Just Works. It is very fast and convenient to copy and paste a slice of code and ask an LLM about it: why it is behaving a certain way, why it doesn’t work, etc. It is also the area where I most value inline, in-editor solutions as opposed to separate CLI/web tools, since switching contexts often breaks flow; this is also where it is superior to StackOverflow, as having to leave the editor to search breaks flow massively. Still, this is the first use case I have any ambivalence towards, simply because the bot sometimes gets it wrong, and there is something uniquely frustrating about being lied to by a computer that just isn’t the same as when a StackOverflow user is incorrect. We are (or at least I am) used to machines being deterministic, and one of the most important lessons to learn when working with LLMs is that they are not.

As a Coding Agent

This is certainly the most discussed and most controversial use case for LLMs, and it is easy to see why: the results can be all over the place. The LLM can generate working code at speeds far greater than any human, massively increasing developer velocity. At the same time, the output can be low-quality spaghetti even when it works, even with the latest models and good harnesses. And because that code “works”, it often gets committed and PR’d. This leads to, if not an “enshittification” of the codebase, at least an “ensloppification”. With that in mind, I still think there is a lot of value to be gleaned here if the engineer develops a feel for what sort and scope of issues LLMs are likely to do a passable job on, and has the discipline and pride of craftsmanship to “desloppify” the output when necessary. I am reminded of Kurt Vonnegut’s old adage dividing writers into “swoopers” and “bashers”. Swoopers vomit out a ton of text and then carefully edit the mass down into a quality finished product, while bashers fret over every line, wanting every pen stroke to be perfect the first time. I view the LLM as the ultimate swooper, and while the unfiltered output can be deleterious to a codebase, with some chiseling into shape it can become a quality product significantly faster than if done by a solo human basher.

As a File System/Install Manager

This is the main popular use case that I am mostly negative on. Any time I try to allow it to do this with restrictions/sandboxing, it ends up constantly prompting me for permissions, to the point that I’m tempted to just give up control and let it do its thing. The problem is, any time I’ve let it go without supervision, it almost invariably screws something up and potentially even introduces a security risk. Not to mention, even when it does work, it often has to try so many things in the process that it becomes difficult to parse what actually fixed the problem and why. The only useful mileage I’ve gotten out of pointing LLMs at the file system has been asking them to diagnose issues, and even then they often try to proceed to “solve” the issue by taking a woodchipper to the problem. Overall, anything beyond asking a CLI tool “why isn’t this working?” is generally a no-go in my book, and God forbid you let them install anything.

Explaining LLM Tooling/Setup

Next, I would like to give a brief overview of the things that influence outcomes when developers use LLMs today: the “harness”, added instructions/context, and added tools/MCP servers. Of course, the model used has arguably the biggest impact of all, but I think everyone already knows that different models have different abilities/strengths, and I do not wish to waste anyone’s time.

Harness

The “harness” for an LLM generally refers to the environment, tools, and set of constraints under which the model operates. For example, one might run the same model (GPT 5.3, for instance) in Cursor, in Codex, in OpenCode, and on ChatGPT’s website, and receive a different outcome every time, because the model is operating in a different context under different constraints. A dead-simple example of this is that the model on ChatGPT’s website cannot view your filesystem or run commands on your computer, but the Codex CLI tool can. Before I started experimenting more with these tools, I underestimated the impact of the harness on outcomes. Now, with more experience, I can safely say that, between models of the same generation/general level of ability, changing harnesses can often have an even greater impact than switching models entirely.
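To make that concrete, here is a minimal sketch in Python of the difference a harness makes, using the OpenAI SDK. The model name and the run_shell tool are placeholders I invented for illustration; the point is that a “bare” call can only answer from the model’s weights, while a harness advertises tools, executes whatever the model requests, and feeds the results back in a loop.

```python
# A minimal sketch of why the harness matters: the same model, called twice.
# Assumes the `openai` Python SDK; the model name and `run_shell` tool are
# placeholders. Real harnesses (Codex, Claude Code, OpenCode) wire up many
# such tools and run the request/execute/respond loop for you.
from openai import OpenAI

client = OpenAI()
question = [{"role": "user", "content": "What does my Makefile do?"}]

# "Bare" call: like a chat website, the model has no access to your machine
# and can only guess from its training data.
bare = client.chat.completions.create(model="gpt-5", messages=question)

# "Harnessed" call: the same model, but the caller advertises a tool the
# model may request. The harness is what actually executes the tool and
# returns the output -- that loop, plus its sandboxing/permission rules,
# is the harness.
harnessed = client.chat.completions.create(
    model="gpt-5",
    messages=question,
    tools=[{
        "type": "function",
        "function": {
            "name": "run_shell",  # hypothetical tool a harness might expose
            "description": "Run a shell command in the project, return stdout",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }],
)
```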

Context

The added instructions/context here refer to additional project/application-level instructions given to an LLM. This is decidedly the least cool part of all this, as it is essentially just additional prompt context the model reads/“knows” before it responds to the user, but it can still be quite impactful. The most well-known examples are keeping an AGENTS.md or CLAUDE.md in the project root, though Cursor users may be more familiar with adding “cursor rules”. More recently, this has taken the form of “skills” users can install on various harness/agent applications, which in reality are usually just text files that “teach” the LLM how to do something, often by linking to other text files, which gives skills the illusion of being much more concise and “magical” than they really are. They have also recently become a major security risk/attack surface, because users are unlikely to read through the entire web of files that constitutes a skill. This also brings us to the first instance of LLM companies desperately attempting to build walled gardens to maintain customers/corner market share, as seemingly every tool has its unique way of doing what should be a very simple, universal thing; e.g. Anthropic wants you to use skills exclusive to Claude Code and include instructions in the aforementioned CLAUDE.md, versus the more universal AGENTS.md. Still, the impact of these additions should not be underestimated, and while they might not be as exciting as an MCP server that allows an agent to do something previously impossible, simply maintaining a well-thought-out AGENTS.md in the project can have a big impact on outcomes.
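For the curious, an AGENTS.md doesn’t need to be fancy. Here is a hypothetical sketch for an Elixir project like mine; the specific commands and rules are invented for illustration, but the shape (build/test commands, conventions, things the agent must not touch) is typical:

```markdown
# AGENTS.md (hypothetical example)

## Commands
- Run tests: `mix test`
- Format: `mix format`
- Lint: `mix credo --strict`

## Conventions
- Prefer pattern matching over conditionals where idiomatic.
- Never edit files under `priv/generated/`; they are machine-generated.
- Every new public function needs a corresponding test.
```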

MCP/Tools

Lastly, we have MCP servers/additional tools and programs that LLMs can use to expand their capabilities. There are far too many of these to list here, but the only one I’ve found truly indispensable has been the ability to use some kind of search engine, which I think is included by default in every major harness now. Something everyone “knows” about MCP servers but that is underdiscussed is that they tend to massively increase context/token usage, which can grow expensive and/or risk making the model “dumber” as its context window is exhausted. In light of this, I tend to be more conservative with their use than some people might be. In terms of specific examples, one of the favorites at my workplace is Playwright, which is a neat way to give the LLM access to the browser so it can audit its client-side work. Recently, I have been experimenting with TideWave as an interesting option. It is created in part by Jose Valim (the creator of Elixir, the primary language I use for work) and has a number of interesting abilities, including the capacity to connect to both the local db and the local server instance and test queries/code against them. There is a separate (paid) in-browser agent that can interact directly with the application end to end, but the terminal-side abilities I mentioned have been fairly useful on their own thus far. Additionally, unlike many MCPs, I’ve found that the added capacity actually causes the models to use fewer tokens rather than more on average, a welcome change.
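As an illustration, registering an MCP server is usually just a small JSON entry. The `mcpServers` shape below is a common convention across several harnesses, but the config file’s name and exact schema vary by tool, so treat this as a sketch rather than gospel:

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```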

Models

I am going to write here primarily about Gemini, ChatGPT (Codex), and Claude. Grok, Mistral, and the various open source (or just open weight in Moonshot’s case) Chinese models surely have some use cases (recently I have been trying out Kimi K2.5, which is quite impressive), but I have found them to be generally behind those three front-runners. This section has by far the greatest potential to be rapidly out of date, so please bear that in mind as you read.

First we have Claude, the talk of the dev town. For whatever reason, Claude tends to be the model that gets pointed to whenever someone claims that software engineering is “done” (it has been done “within the next six months” for close to two years now). Does it live up to that hype? I’d say…absolutely not, but you shouldn’t let that fool you into thinking it’s anything less than excellent. Claude Opus 4.5/4.6 (the difference is minuscule in practice) are definite candidates for “best” model for agentic coding, and the difference between what they output and how they “think” compared to models from a year ago is genuinely vast, but I also maintain that the code they produce is of routinely mixed quality. Claude to me embodies “LLM code”: performant, defensive, impressive in what it can achieve at superhuman speeds…but also sometimes does not compile and is frequently unmaintainable. Furthermore, Anthropic’s Opus models are by far the most token/context hungry of any I have come across. Using Google and OpenAI’s mid-level “pro” plans that cost $20–25 a month, you would have to be a heavy user, into the realm of nonsense “vibe coding”, to exhaust the limits they give you (assuming you work comparatively normal hours and spend a good amount of time doing things that aren’t coding, i.e. every dev job). With Claude Code, if you use primarily Opus, it is very realistic to reach/exceed the allowed limits, making it the least economical option of the three. Still, none of this is to say Claude isn’t good, because it is. Shockingly so at times. Just not nearly so much that I put stock in the rumors of our impending demise. As for the cheaper Sonnet model, it is fairly capable and significantly faster, but I find that for anything of significant complexity it is worth going to Opus.

Next I will bring up ChatGPT’s Codex, specifically the recent 5.2/5.3 models. I won’t beat around the bush: these are impressive. They come in four different “thinking” levels (low/medium/high/xhigh), and I have mostly used them on high, which seems to be the standard recommendation. They have similar perceived “intelligence” and code quality to Claude Opus, but are considerably faster and less context/token hungry. I have had things that Claude can’t seem to get right that Codex nails on the first try, but the reverse is also true. I cannot confidently say that one is “better” than the other, but if I had to pick one right now, it would actually be Codex, for compatibility reasons (more on that later), price, and the speed advantage over Claude Opus.

For Gemini, I find myself in an interesting place: the fast 3.0 “Flash” model is, for my money, the best thing on the market for trivial/menial tasks, but I find the “Pro” model (even the new 3.1, which was released as I initially wrote this) lags behind the equivalent flagship Claude/Codex models in terms of code generation. However, used as a StackOverflow replacement to explain code, it is my top pick, as I simply find its explanations to be consistently thorough, digestible, and generally accurate (although I emphasize: never fully trust an LLM; they all generate nonsense on occasion).

Tools/Applications

Covering this topic was the main reason I started this article, and it is also the topic on which I suspect I have the most useful insight for experienced developers. I do not mean to suggest that I am a great expert in this space, merely that I have spent a lot of time tinkering with various apps and evaluating their performance. Moreover, I have noticed that most people with equivalent or greater experience than mine tend to skew non-dev/“vibe-coder”, and it is my intent to provide the perspective of someone who isn’t afraid to dig in and handle code themselves. For the sake of keeping an already long piece from running until the end of time, I am only going to include a paragraph or two for each tool I have used, but I encourage readers to download and try them for themselves.

CLI Apps

Most of the following CLI apps also have a GUI version, but I have largely stuck with the TUI implementations. I fail to see any increased value in most of their GUI versions, and find the terminal preferable in most cases.

Aider

Aider is an open source CLI LLM agent, and I wanted to love it. It has all the characteristics of a project I should like: open source, free, a bring-your-own-keys model to avoid vendor lock-in, and community driven. Unfortunately, I found that it performed very poorly for me. One of its hooks is something called “architect mode”, where an agent “plans” the code for a feature and a separate agent implements it. A feature of this mode is to allow the user to select a different model for each role, ostensibly as a cost-saving measure. However, the “architect” does not so much help the user plan as prepare an entire diff’s worth of code that the user then accepts/refuses, which the implementing agent then applies as a patch. I struggle to understand the utility of this. It isn’t really “planning” in any meaningful sense, and while it is nice enough not to have files written to without permission, it’s also easy to roll back changes with git or other GUI tools in other harnesses. Add to that the fact that I found the resultant code quality to be easily among the worst of the harnesses I’ve used, and I cannot recommend it.

Droid

Droid is a paid harness by Factory.ai whose intended audience I genuinely cannot determine. I was initially enticed to try it by impressive results on TerminalBench, but the experience did little besides reinforce to me that benchmarks are of extremely limited utility for LLMs in the real world. The output I got from it was…fine, but the developer experience was downright abysmal compared to others, and it required payment information to use at all, even when I had keys from an actual model provider. I am generally a proponent of open source solutions over closed ones, but I also understand that not everyone shares my zeal, and that sometimes the closed source product is simply better, to the point that there is no use grinding an ideological axe in denial. That is not the case here. From where I stand, it is a paid, closed source product that is massively worse than free and open source alternatives. Needless to say, I do not recommend it at all.

Gemini CLI

Not bad, but mediocre. Gemini CLI tends to get better output out of Gemini than Cursor/Zed do (both of which Gemini is oddly finicky with), but I get better Gemini results through OpenCode, and it lags behind Codex, OpenCode, and Claude Code in virtually every way. It very distinctly gives the impression that Google is giving it far less love than its competitors are giving their tools, and while the future of Gemini in general looks bright, I cannot say the same for Gemini CLI. The fact that it is “free” and allows a few free requests per day is its primary redeeming feature.

Codex

This is the first CLI tool that I’ll cover that I would call “good”. It isn’t quite as feature rich as Claude Code or OpenCode, but the output it gets out of Codex models is substantially better than what I receive through most other harnesses. It also utilizes tool calls well. If there is a criticism to be had of it, it is that it mostly feels like a Claude Code ripoff, and lags behind it slightly in terms of both UI/UX and features. It isn’t my favorite, but I wouldn’t be eager to urge someone to try something else if they enjoyed Codex, which is not what I would say for Aider, Droid, or Gemini CLI.

Claude Code

This is what most people think of when they discuss CLI coding harnesses. Once again, Anthropic’s offering is perennially painted as a lurking demon waiting to take all of our jobs…when in reality it is just an extremely solid LLM harness. While the idea that “Claude Code” can replace a solid developer may be misguided, that doesn’t mean it should be dismissed as a tool. Firstly, the planning mode is fantastic, and showed me just how good and useful planning with LLMs can be. Beyond that, the UX is good, easily the best of any tools I have covered in this piece thus far. The code output is arguably the best it is possible to get out of an LLM at this time, and Claude with Claude Code is one of only two combinations (GPT Codex + OpenCode being the other) where I don’t absolutely cringe when thinking of using it as an agent to interact with the filesystem rather than just ask code questions/as a code generation bot. Indeed, it is a good product, as much as I hate to admit it (all of the major Western LLM providers are evil, but Anthropic seems uniquely committed to seeking a walled garden and even regulatory capture).

OpenCode

This has quickly become my favorite LLM harness. It has everything I listed wanting to love about Aider, but unlike Aider, it all actually works how I want it to. It uses separate “agents” for a planning mode and a build mode, and they actually do what their names suggest. It also comes bundled with two subagents, “general” and “explore”, which users can invoke either while the main agents work or independently. The website also provides excellent documentation on writing your own agents/subagents, something other harness providers tend to obscure at least somewhat. The aforementioned planning mode tends to output slightly higher-level plans than Claude Code’s, but I find them to be of similar quality. Indeed, OpenCode having a solid planning feature, unlike Codex, is the biggest reason I don’t touch Codex anymore and just use OpenAI models through OpenCode when I want them. Above all, whatever their team has done with the harness to direct it, they have done great work, as the code quality when using OpenAI models is arguably better than via OpenAI’s own Codex harness, and Gemini’s is absolutely better than with Gemini CLI. As a bonus, it’s also extremely easy to switch models, and the project is large enough that it sometimes gets to offer temporarily free models for promotion. The only negative I’ve found is that Anthropic and Google have started banning people from using their Claude/Gemini subscriptions with it, another frustrating example of LLM providers attempting to establish walled gardens. It is to OpenAI’s credit that they have not gone down that path, allowing use of ChatGPT Pro plans with third-party apps like OpenCode, and Sam Altman has even expressed publicly (for whatever good that does) that they do not intend to. If not for Anthropic and Google’s obnoxious attempts at vendor lock-in, I wouldn’t even consider another CLI LLM tool at this time, but as it stands, Anthropic’s models and harness are good enough that Claude Code must also be in the conversation. Overall though, OpenCode has an attractive design, incredibly snappy response time, and an excellent developer experience. Furthermore, Codex models through OpenCode have given me both the most enjoyable time and the best results I’ve had working with LLMs to date.
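To give a flavor of the agent-authoring docs mentioned above: as of writing, a custom OpenCode subagent is roughly a markdown file with YAML frontmatter dropped into `.opencode/agent/`. The sketch below is from memory and the reviewer role is invented, so check their documentation for the current schema before copying anything:

```markdown
---
description: Read-only reviewer that flags risky changes (hypothetical)
mode: subagent
tools:
  write: false
  edit: false
---
You are a code reviewer. Read the changed files and flag missing tests,
injection risks, and anything that looks like a committed secret. Never edit.
```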

GUI Apps

Cursor

I question how much time to spend on Cursor because so many people are already familiar with it. Regardless, I feel that I must highlight its signature “tab” autocomplete feature. Simply put, it is best in class, and if that is your primary LLM use case, I recommend sticking with Cursor. I don’t use Cursor anymore, but it is the one thing I miss. Zed’s version is solid, and open source tools integrating fast models like Gemini Flash or Mistral’s Codestral for autocomplete in other editors are…serviceable, but frankly Cursor stands head and shoulders above the rest in this area. In other ways, I don’t think Cursor is particularly good; it mostly coasts on the strength of the above functionality and VS Code’s large install base of familiar users. Its harness/agent I find to be on the weaker side, and while Cursor offers more model flexibility with a subscription than others, it also offers less bang for your buck in terms of usage of the more powerful models compared to Google, Anthropic, or OpenAI’s premium plans. Cursor also has a number of features, like its “Bug Bot” PR review system and native planning mode, that, while not my favorite ways of accomplishing what they do, make it an attractive option if you absolutely must use one tool and no others.

Antigravity

All I have to say about Antigravity is thus: whose idea was this? It is the picked-over guts of Windsurf reanimated as a Gemini advert vehicle by Google’s most questionable product directors. Are we meant to find that appealing? The only appreciable utility I’ve seen for this application is using it to finesse free trial credits for LLMs, usually Gemini or (funnily enough) Claude. It is a soulless VS Code fork with an unjustified existence, and I cannot fathom why one might choose it over Cursor.

VSCode + Extensions (Copilot, Roo Code, Cline)

My experiences with Copilot haven’t been terribly many, but they have been terribly negative. Almost uselessly bad autocomplete, and maybe the worst agent harness I’ve encountered in terms of results. I cannot imagine how/why this has any traction beyond Microsoft forcing it through sheer willingness to burn goodwill with their massive install base, as I certainly don’t see it as having any real merit compared to other options. There are also Cline and Roo Code, which I have not used, but which seem to have a solid handful of fans (Cline particularly, which can also be added to other editors). They are frequently compared to Aider, however, and given how relatively poor my experience was with Aider, that does not make me overly eager to try them. Still, they may be worth considering if Cursor ever alters its business model towards something unpalatable while users still crave the familiar VS Code environment.

Zed

Zed is a newer, Rust-based text editor that positions itself as a Cursor/VS Code alternative. I am generally a fan of it as a text editor over VS Code/Cursor (the speed gap is immediately noticeable), but this piece is about LLM features, so I’ll stick to that. The Zed team worked with the Claude and Gemini teams to pioneer the Agent Client Protocol (ACP), which lets users run a version of a model’s “native” harness (e.g. Claude Code for Claude models) within the confines of the editor. While it is possible to use models in Zed without ACP through Zed’s own agent, the results for me have not been particularly good (worse than Cursor), while with ACP the results are very nearly as good as they are through the corresponding CLI harness (although external tools/additional MCP servers are more painful to set up). It also has the second-best tab/predictive complete, next to Cursor. Altogether, a very worthy consideration, especially since the ACP integration allows you to use external provider subscriptions from within your text editor (if that is preferred).

Goose

Goose is another free and open source tool, this time from Block, who were in the news somewhat recently for mass layoffs (never a good sign for a tool moving forward). While Goose does have a CLI version, I elected to use the GUI tool here, as its features warrant a GUI more than most in my opinion, and the GUI version can be managed by brew rather than curl on my macbook. Goose’s killer feature is something called “recipes”, wherein users define preexisting sets of LLM instructions complete with optional params, effectively making them somewhere between redeployable agents and scripts for LLMs. Recipes can be created from scratch, or the user can have the agent build a recipe based on the current session, if the session turns out to be something likely to need repeating in the future. If this sounds suspiciously similar to “skills” in other agents/harnesses, that is because it is. Still, they aren’t quite the same thing, as recipes follow a predefined formula and can do helpful things like generate a form based on the recipe and take a defined set of parameters, rather than just sending everything to the model in a free-text prompt. You can also schedule recipes to be executed at regular intervals like cron jobs, which could be useful but which I worry could also be catastrophically dangerous depending on what the agent is given access to. Under the hood, these recipes are just YAML files the agent has some special instructions on how to handle, but the idea and the UI for interacting with it are interesting. It is also one of the more unique tools I’ve covered here. With most of the CLI apps, once you’ve used one, you can very quickly learn any of the others. Likewise, the gap between VSCode, Cursor, and even Zed is pretty easily bridgeable. Goose, on the other hand, marches to the beat of its own drum, and while I can’t immediately fault any of its choices, I did stumble when using it for the first time. It is also a little less polished than some other choices; the available models for configuration are not as quick to update as OpenCode or the first-party apps, and there are a few small imperfections in the UI. Worse, the agent failed to create a recipe based on my session during my first use, and I had to write it by hand, partially defeating the purpose. Most damning of all, in three tries I never got the program to save and execute my recipe successfully, even after referencing the docs multiple times. There are good ideas here, but the current state of the execution and the parent company do not leave me optimistic about the product’s future.
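For a sense of what “just YAML files” means in practice, here is a hypothetical recipe sketch. I am reconstructing the field names from memory of the docs, and the task and parameter are invented for illustration, so treat the exact schema as approximate:

```yaml
# Hypothetical Goose recipe; field names approximate the documented schema.
version: 1.0.0
title: changelog-draft
description: Draft a changelog entry from recent commits
instructions: |
  Read the git log since the given tag and draft a changelog entry
  grouped into Added/Changed/Fixed sections. Do not modify any files.
parameters:
  - key: since_tag
    input_type: string
    requirement: required
    description: Git tag to diff against (e.g. v1.4.0)
prompt: Draft the changelog since {{ since_tag }}
```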

Closing

Even after all of that, the preceding section was by no means an exhaustive list. I didn’t touch on any of the myriad Neovim plugins (mainly because I am not a Neovim user), nor did I cover any of the existing Emacs packages (because there just aren’t that many of us weirdos). With that in mind, it is worth noting that in the face of all of these options, there still exists a desire within companies (spoken or otherwise) to simply pay one monolithic subscription and be done with it. I get it: managers and finance people don’t want to answer 200 emails from developers asking for permission/funds to try yet another new LLM tool. But I question the wisdom of that approach in regard to LLM tooling. If devs simply start up Cursor/Claude Code/Copilot/whatever every day, they’ll be missing out on a lot of power and innovation that is out there and changing fast. Conversely, this space moves at such a speed that you could easily spend all of your time exploring it rather than getting any work done, which is an obvious pitfall to avoid as well. I am not sure how best to strike a balance between these two poles, but I do feel confident that a balance should be struck for best results.