Stream the Latest Episode
Listen and watch now on YouTube, Spotify and Apple. See the episode transcript at the top of this page, and timestamps for the episode at the bottom.
Brought to You By
CodeRabbit — Cut code review time and bugs in half. Use the code PRAGMATIC to get one month free.
—
In This Episode
What happens when LLMs meet real-world codebases? In this episode of The Pragmatic Engineer, I am joined by Varun Mohan, CEO and Co-Founder of Windsurf. Varun talks me through the technical challenges of building an AI-native IDE (Windsurf) and how these tools are changing the way software gets built.
We discuss:
What building self-driving cars taught the Windsurf team about evaluating LLMs
How LLMs built for text lack coding capabilities like “fill in the middle”
How Windsurf optimizes for latency
Windsurf’s culture of taking bets and learning from failure
Breakthroughs that led to Cascade (agentic capabilities)
Why the Windsurf team builds its own LLMs
How non-dev employees at Windsurf build custom SaaS apps – with Windsurf!
How Windsurf empowers engineers to focus on more interesting problems
The skills that will remain valuable as AI takes over more of the codebase
And much more!
Takeaways
Some of the most interesting topics discussed in the conversation were these:
1. Having a robust “eval suite” is a must-have for LLM products like Windsurf. Every time Windsurf considers integrating a new model and releasing it to customers, they need to answer the question: “is this model good enough?”
To do so, they’ve built an eval suite to “score” these models. This is a pretty involved task. At the same time, any team building products on LLMs would be wise to take inspiration. “Eval testing” within AI product development feels like the equivalent of “unit testing” or “integration testing” in more classic software development.
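To make the unit-test analogy concrete, here is a minimal sketch of what such an eval gate could look like. This is not Windsurf’s actual harness: the task format, the generate callables, and the pass/fail checks are illustrative assumptions.

```python
# A minimal eval-gate sketch: NOT Windsurf's actual harness.
# The task format and `generate` callables are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str                    # input handed to the model
    check: Callable[[str], bool]   # True if the completion is acceptable

def score(generate: Callable[[str], str], tasks: list[EvalTask]) -> float:
    """Fraction of tasks whose completion passes its check."""
    return sum(t.check(generate(t.prompt)) for t in tasks) / len(tasks)

def good_enough(candidate, incumbent, tasks, margin=0.01) -> bool:
    """Release gate: the candidate model must beat the current one by a margin."""
    return score(candidate, tasks) >= score(incumbent, tasks) + margin

# Example task: given code ending mid-keyword, the model should finish it.
tasks = [EvalTask(prompt="def add(a, b):\n    retu",
                  check=lambda out: out.startswith("rn"))]
```

The hard part, of course, is not the gate itself but curating thousands of representative tasks and checks, which is what makes building the suite such an involved task.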
2. AI-powered IDEs make engineers more “fearless” and could reduce mental load. I asked Varun how using Windsurf changed the workload and output of engineers, especially given that most of the team were software engineers well before LLM coding assistants were a thing. A few of Varun’s observations:
Engineers are more “fearless” in jumping into unfamiliar parts of the codebase, where in the past they would have waited to talk to people more familiar with the code.
Devs increasingly turn to AI for help first, before pinging someone else (and thus interrupting that person).
Mental fatigue is down, because tedious tasks can be handed off to prompts or AI agents.
Varun stressed that he doesn’t see tools like Windsurf eliminating the need for skilled engineers: it simply changes the nature of the work, and can increase potential output.
3. Forking VS Code the “right” way means doing a lot of invisible work. While VS Code itself is open source and can be forked, the VS Code Marketplace and lots of extensions are proprietary. For example, a fork is not allowed to use extensions like the Python language server, Remote SSH, and Dev Containers. The Windsurf team had to build custom extensions from scratch, which took a lot of time, and users probably did not even notice the difference!
However, if Windsurf had not done this and had violated the licenses of these extensions, it could have found itself in legal hot water. So forking VS Code “properly” is not as simple as most devs would expect.
4. Could we see more non-developers create “work software”? Maybe. One of the most surprising stories was how Windsurf’s partnership lead (a non-developer) created a quoting tool by prompting Windsurf. This tool replaced a bespoke, stateless tool the company had been paying for.
Varun and I agreed that a complex SaaS with lots of state and other features is not really a target to be “replaced internally.” However, simple pieces of software can now be “prompted” into existence by business users. I have my doubts about how maintainable these will be in the long run: even Big Tech struggles with internal tools built by a single dev, because when that dev leaves, no one wants to take them over.
Interesting quotes
On optimizing GPU usage and latency at scale:
Gergely: “How do you deal with inference? You're serving systems that handle, as you just said, probably hundreds of billions of tokens per day with low latency. What smart approaches do you take to do this? What kinds of optimizations have you looked into?”
Varun: “Latency matters a ton, in a way that's very different from some of these API providers. For the API providers, time to first token is important, but it doesn't matter to them whether it's 100 milliseconds. For us, that's the bar we're aiming for. Can we get it to sub a couple hundred milliseconds, and then hundreds of tokens a second?
That output is much faster than what all of the providers offer in terms of throughput. And you can imagine there are a lot of other things we want to do. How do we do things like speculative decoding or model parallelism? How do we make sure we can batch requests properly to get the maximum utilization out of the GPU, all while not hurting latency?
GPUs are amazing. They have a lot of compute. To draw an analogy to CPUs: GPUs have roughly two orders of magnitude more compute than a CPU. It might be even more on the most recent GPUs, but keep that in mind.
But GPUs only have an order of magnitude more memory bandwidth than a CPU. What that actually means is that if you do things that are not compute-intensive, you will be memory-bound. So to get the most out of your processor's compute, you necessarily need to be doing a lot of things in parallel. But if you have to wait in order to do a lot of things in parallel, you're going to hurt latency. So there are all of these different trade-offs that we need to make.”
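Varun's compute-versus-bandwidth point can be made concrete with back-of-the-envelope numbers. The figures below are rough assumptions for a hypothetical GPU and model, not real specs or Windsurf's setup:

```python
# Back-of-envelope sketch of why small-batch LLM decoding is memory-bound.
# All hardware and model numbers below are rough illustrative assumptions.
FLOPS = 1e15         # ~1 PFLOP/s of GPU compute (assumed)
BANDWIDTH = 2e12     # ~2 TB/s of GPU memory bandwidth (assumed)

params = 10e9        # a hypothetical 10B-parameter model
bytes_per_param = 2  # fp16 weights

# Decoding one token streams every weight through memory once.
time_memory = params * bytes_per_param / BANDWIDTH  # seconds per decode step

for batch in (1, 64, 1024):
    flops_needed = 2 * params * batch     # ~2 FLOPs per weight per sequence
    time_compute = flops_needed / FLOPS
    bound = "memory" if time_memory > time_compute else "compute"
    print(f"batch={batch:4d}: compute {time_compute*1e3:6.2f} ms, "
          f"memory {time_memory*1e3:5.2f} ms -> {bound}-bound")
```

Under these assumed numbers, a batch of one spends roughly 10 ms reading weights but well under a millisecond computing, so batching many requests together is nearly free for throughput. The catch, as Varun notes, is that waiting to assemble a large batch adds latency for every individual request.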
On the fill-in-the-middle model capability, and why Windsurf had to build its own models:
Gergely: “What is fill-in-the-middle?”
Varun: “The idea of fill-in-the-middle is this: if you look at the task of writing software, it's very different from a chat application. In a chat, you're always appending something to the very end, maybe adding an instruction. But when writing code, you're writing in the middle of a line, in the middle of a code snippet.
When these models consume files, they actually tokenize them. This means they don't consume files byte by byte; they consume them token by token. But when you are writing code, a half-finished snippet often doesn't tokenize into something that looks in-distribution.
I'll give you an example: how many times do you think the training data set for these models contains, instead of “return”, just “retu” without the “rn”? Probably never. It probably never sees that.
However, when you type “retu”, we need to predict “rn”. It sounds like a very small detail, but it is very important if you want to build the product.
Fill-in-the-middle is a capability that cannot simply be post-trained into the models. You need to do a non-trivial amount of training on top of a model, or pre-train, to get this capability. It was table stakes for us to provide fill-in-the-middle for our users. So this forced us very early on to build out our own models and figure out training recipes.”
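For readers who have not seen fill-in-the-middle training: open models trained for it (StarCoder is a well-documented example) typically express the task with sentinel tokens in a prefix-suffix-middle layout. Windsurf's own format is not public; the sketch below uses StarCoder's token names purely for illustration:

```python
# Prefix-suffix-middle (PSM) prompt layout used by FIM-trained open models
# such as StarCoder. Windsurf's own format is not public; the sentinel
# token names below are StarCoder's, shown here purely for illustration.
def fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between the two spans."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The cursor sits right after "retu"; the model must learn to emit "rn ...".
before_cursor = "def add(a, b):\n    retu"
after_cursor = "\n"
print(fim_prompt(before_cursor, after_cursor))
```

Training on documents rearranged this way is what teaches a model to handle completions like “retu” → “rn”, which essentially never appear at the end of a context in ordinary left-to-right training data.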
The Pragmatic Engineer deepdives relevant for this episode
Timestamps
(00:00) Intro
(01:37) How Windsurf tests new models
(08:25) Windsurf’s origin story
(13:03) The current size and scope of Windsurf
(16:04) The missing capabilities Windsurf uncovered in LLMs when used for coding
(20:40) Windsurf’s work with fine-tuning inside companies
(24:00) Challenges developers face with Windsurf and similar tools as codebases scale
(27:06) Windsurf’s stack and an explanation of FedRAMP compliance
(29:22) How Windsurf protects latency and the problems with local data that remain unsolved
(33:40) Windsurf’s processes for indexing code
(37:50) How Windsurf manages data
(40:00) The pros and cons of embedding databases
(42:15) “The split brain situation”—how Windsurf balances present and long-term
(44:10) Why Windsurf embraces failure and the learnings that come from it
(46:30) Breakthroughs that fueled Cascade
(48:43) The insider’s developer mode that allows Windsurf to dogfood easily
(50:00) Windsurf’s non-developer power user who routinely builds apps in Windsurf
(52:40) Which SaaS products won’t likely be replaced
(56:20) How engineering processes have changed at Windsurf
(1:00:01) The fatigue that goes along with being a software engineer, and how AI tools can help
(1:02:58) Why Windsurf chose to fork VS Code and built a plugin for JetBrains
(1:07:15) Windsurf’s language server
(1:08:30) The current use of MCP and its shortcomings
(1:12:50) How coding used to work in C#, and how MCP may evolve
(1:14:05) Varun’s thoughts on vibe coding and the problems non-developers encounter
(1:19:10) The types of engineers who will remain in demand
(1:21:10) How AI will impact the future of software development jobs and the software industry
(1:24:52) Rapid fire round
References
Where to find Varun Mohan:
Mentions during the episode:
Windsurf: https://windsurf.com/
Show Stopper!: The Breakneck Race to Create Windows NT and the Next Generation: https://www.amazon.com/Show-Stopper-Breakneck-Generation-Microsoft/dp/0029356717
Salesforce Codegen: https://www.salesforceairesearch.com/projects/CodeGen
Sourcegraph: https://sourcegraph.com/
FedRAMP: https://www.fedramp.gov/
What is SOC 2 compliance?: https://www.a-lign.com/articles/what-is-soc-2-complete-guide-audits-and-compliance
PostgreSQL: https://www.postgresql.org/
Nicholas Moy on LinkedIn: https://www.linkedin.com/in/nicholas-moy/
Anshul Ramachandran on LinkedIn: https://www.linkedin.com/in/anshul-ramachandran/
Cascade: https://windsurf.com/cascade
Workday: https://www.workday.com/
Visual Studio Code: https://code.visualstudio.com/
JetBrains: https://www.jetbrains.com/
Model Context Protocol: https://github.com/modelcontextprotocol
Dario Amodei on X: https://x.com/darioamodei
Zwift: https://www.zwift.com/
The Idea Factory: Bell Labs and the Great Age of American Innovation: https://www.amazon.com/Idea-Factory-Great-American-Innovation/dp/0143122797
—
Production and marketing by Pen Name.