The Pulse #90: Devin reversing ambitious claims
The “world’s first AI developer” tones down expectations and has been outperformed by an open source tool. Also: hiring upticks at Big Tech; a very realistic AI video generator from Microsoft; and more.
The Pulse is a series covering insights, patterns, and trends within Big Tech and startups. Notice an interesting event or trend? Send me a message.
Today, we cover:
Devin: Reversing ambitious claims. A month ago, Devin launched with fanfare as “the world’s first AI developer,” claiming that it “even completed real jobs on Upwork.” Upon closer inspection, this claim did not hold up. The company behind Devin has since toned down expectations. Also: the open source tool AutoCodeRover offers even better performance than Devin, which remains closed source and not yet publicly available. This space is commoditizing rapidly.
Industry pulse. Fintech valuations rising again; pre-earnings layoffs at Tesla and Google; Google fires staff trying to interfere with business; Rippling offering a secondary to its employees, and more.
Microsoft’s disturbingly realistic AI video generator. Microsoft Research showcased a tool that generates very realistic videos from a single image. The #1 use case will surely be fraudulent deepfake generation. This development could well speed up AI regulation in several countries.
Hiring upticks at Meta, Netflix and Amazon? Data from interview preparation website interviewing.io suggests hiring is back at full speed at Meta, Netflix and – possibly – Amazon.
1. Devin: Reversing ambitious claims, and an open source threat
Devin is called the “world’s first AI developer” by the team that created it. Released as a waitlist-only tool a month ago, it grabbed attention with a series of impressive-looking videos. At the time, I reported that I suspected this was a marketing stunt for an AI assistant, rather than an actual “AI developer.”
Turns out, those who were skeptical of claims by Cognition Labs, the creator of Devin, were justified. An experienced software engineer called Carl – an AI enthusiast and user of AI coding tools – has analyzed one of the Devin marketing videos, and says the video contains straight-up lies. He looked at a 2-minute video called “Devin’s Upwork Side Hustle,” which claimed that Devin completed Upwork tasks. However, on closer inspection:
Devin did not complete the task listed on Upwork, as the video implied it had.
The video claimed that Devin fixed bugs, but did not state that Devin itself had hallucinated those bugs; they were not part of the repository, as suggested.
The task took Devin somewhere between 6 and 26 hours to complete.
Many of the commands Devin executed made no sense, and made things harder to debug later.
The developer at Cognition AI who created the original recording came back to defend the video, and struck a much less ambitious tone, writing: “we’ve also had Devin complete other Upwork jobs, but it definitely makes mistakes or often fails.”
In an update, the company now states “Devin is new and far from perfect, and we welcome feedback, questions, and constructive criticism.” This is a big change from last month’s announcement that began:
“Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork.”
We now have reason to believe the “completed real jobs on Upwork” claim is untrue. It feels like Cognition Labs overhyped its tool to the point of falsehood. If there’s one audience you don’t want to mislead, it’s software engineers, who look closer!
Will Devin be commoditized by open source software?
As loud as Devin’s launch was, the tool has no general availability. Those interested in what it can actually do need to sign up on a waiting list and… wait. This changed two weeks ago, when two “open source Devin alternatives” were released that can be used immediately. One even beat Devin on the SWE-Bench coding benchmark – making the claim that Devin is “state-of-the-art on the SWE-Bench coding benchmark” obsolete in just a few weeks!
SWE-agent: almost as good. A group at Princeton University has open sourced SWE-agent, which looks to have capabilities equivalent to Devin’s. Devin’s team heavily advertised that it scored 13.86% on the SWE-bench benchmark: 3x better than models like Claude 2 or GPT-4. On the same benchmark, SWE-agent scores 12.29% – nearly as good!
AutoCodeRover: better than Devin? The new “state-of-the-art” agent for the SWE-bench testing suite is AutoCodeRover, an open source tool built by a team at the National University of Singapore.

AutoCodeRover combines LLMs with code analysis and debugging capabilities. It works in two stages (a rough sketch in code follows the list):
Retrieve context: search the codebase and its APIs to collect the relevant context
Generate patch: produce a fix using the retrieved context
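To make the two stages concrete, here is a minimal Python sketch of this retrieve-then-patch loop, based on the team’s description. All names here (retrieve_context, generate_patch, llm_complete) are my own illustrative stand-ins, not AutoCodeRover’s actual API, and the naive keyword search stands in for the tool’s real AST-aware code search:

```python
# Illustrative sketch only: AutoCodeRover's real implementation uses
# AST-aware search (e.g. by class or method name), not the naive
# keyword matching used here.
from dataclasses import dataclass

@dataclass
class Context:
    file_path: str
    snippet: str

def retrieve_context(issue: str, codebase: dict[str, str]) -> list[Context]:
    """Stage 1: collect code that looks relevant to the issue text."""
    keywords = {word.strip(",.()") for word in issue.split() if len(word) > 4}
    hits = []
    for path, source in codebase.items():
        for line in source.splitlines():
            if any(k in line for k in keywords):
                hits.append(Context(path, line.strip()))
    return hits

def generate_patch(issue: str, contexts: list[Context]) -> str:
    """Stage 2: ask an LLM for a fix, given only the retrieved context."""
    prompt = f"Issue:\n{issue}\n\nRelevant code:\n"
    prompt += "\n".join(f"{c.file_path}: {c.snippet}" for c in contexts)
    prompt += "\n\nWrite a unified diff that fixes the issue."
    return llm_complete(prompt)

def llm_complete(prompt: str) -> str:
    # Stub: a real implementation would call a model such as GPT-4 here.
    return "--- a/example.py\n+++ b/example.py\n..."
```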
The team also released a paper detailing how AutoCodeRover works.
While Devin is in closed beta, both AutoCodeRover and SWE-agent can be used immediately. Devin is also closed source, so we won’t know how it does what it does. The open source tools are the opposite: we can inspect how they combine new approaches to create a better AI coding agent.
The promise of AI coding agents
I expect lots of innovation in AI agent interfaces, like SWE-agent’s Agent-Computer Interface (ACI) and the “AI agent” approach used by AutoCodeRover. The big innovation of SWE-agent (and also of Devin) is this agent interface. As the Princeton group points out:
“Just like how typical language models require good prompt engineering, good ACI design leads to much better results when using agents. As we show in our paper, a baseline agent without a well-tuned ACI does much worse than SWE-agent.”
Features that SWE-agent introduced to achieve these results (two of them are sketched in code after this list):
Linting. If a proposed edit doesn’t pass the linter, the edit is not applied.
An “LLM-friendly” file viewer. Instead of sharing full files with the LLM, files are shown in 100-line increments, with the LLM having the option to scroll up or down.
Transforming CLI responses for the LLM. “When commands have an empty output we return a message saying, ‘Your command ran successfully and did not produce any output.’”
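To illustrate how simple some of these ACI ideas are, here is a toy Python sketch of the 100-line file viewer and the empty-output transform. The class and function names are mine, not SWE-agent’s actual code:

```python
# Toy sketch of two SWE-agent ACI ideas; not SWE-agent's real implementation.

class FileViewer:
    """Shows a file to the LLM in 100-line windows instead of all at once."""
    WINDOW = 100

    def __init__(self, path: str):
        with open(path) as f:
            self.lines = f.readlines()
        self.top = 0  # index of the first visible line

    def render(self) -> str:
        """Return only the current window, with a header giving the LLM
        enough context to decide whether to scroll."""
        end = min(self.top + self.WINDOW, len(self.lines))
        header = f"[File: {len(self.lines)} lines total, showing {self.top + 1}-{end}]\n"
        return header + "".join(self.lines[self.top:end])

    def scroll_down(self) -> None:
        self.top = min(self.top + self.WINDOW, max(len(self.lines) - 1, 0))

    def scroll_up(self) -> None:
        self.top = max(self.top - self.WINDOW, 0)

def transform_cli_output(output: str) -> str:
    """Replace silent success with an explicit message, per the quote above,
    so the LLM does not misinterpret empty output as a failure."""
    if output.strip() == "":
        return "Your command ran successfully and did not produce any output."
    return output
```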
I’m glad to see open source implementations with similar capabilities to Devin. Thanks to these, we can stop speculating about whether an “AI developer” can replace devs. Instead, we can focus on what it can do okay, what it does really well, and in which areas it performs poorly. Just as importantly: how long does it take to do these things, and how expensive is it to operate? As we get closer to answers, we also keep exploring what efficient AI agent interfaces could look like.
Browse and play with SWE-agent here and with AutoCodeRover here.
Is the SWE-Bench test dataset any good?
Since the Devin team benchmarked against the SWE-Bench dataset, other AI coding tools are doing the same. So, what exactly is this dataset, and how representative is it of real-world coding work? Software engineer and AI startup CTO Harry Tormey has dug in to analyze it:
The dataset contains almost 2,300 GitHub issues
The median task’s codebase has about 2,000 files and 400,000 lines of code (!)
The median reference solution (the accepted patch) is 15 lines
Django (a Python web framework) tasks make up the majority of this dataset: 75% bugs, and 25% feature requests
As Harry summarized:
“Looking through the tasks and their original PRs shows they're quite small, something a skilled engineer could handle quickly, in at most a couple of days.”
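For those who want to poke at the tasks themselves, SWE-bench is published on Hugging Face. Here is a quick way to inspect it; the field names below follow my reading of the public dataset card, so verify them against the actual schema:

```python
# Load and inspect the SWE-bench dataset from Hugging Face.
# Field names (repo, problem_statement, patch) are my reading of the
# public dataset card; double-check against the actual schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(ds))  # roughly 2,300 task instances

task = ds[0]
print(task["repo"])                     # source repository, e.g. "django/django"
print(task["problem_statement"][:500])  # the GitHub issue text the agent must solve
print(len(task["patch"].splitlines()))  # reference fixes are often just a few lines
```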
This SWE-Bench dataset seems like a good testing ground to gauge the progress of AI coding tools. And what does Harry make of Devin, SWE-agent, and other AI coding agents completing about 1 in 10 tasks in this dataset? He writes:
“It’s still early days for AI Software Engineers. I think within the next 2-3 years, AI will be able to handle debugging and multi-line code changes effectively across large codebases. This advancement will transform software engineers' roles from focusing on detailed coding tasks to prioritizing oversight and orchestration.
I believe this transition will be more significant than others I have made in my career such as from low-level to high-level programming languages. I don’t believe it will eliminate the need for human software expertise.”