Lumos POD
Varun Jindal, Engineer @ Lumos / Andrew Wang, Engineer @ Lumos

Cognition’s Devin: Your New Junior Engineer?

We wanted to share our experience using DevinAI, the core use cases we found for mature companies after experimentation, and our thoughts on when each of the latest AI tools should be used.


After Varun left a comment on one of DevinAI’s first PRs…

As Lumos grows, we see an increasing number of small tasks and repetitive processes burdening our roadmap. Lumos is building a unified cybersecurity platform that’s easy to deploy and maintain. This platform strategy means that we constantly update core systems, like workflows, to support new products. We saw potential in DevinAI to help us with some of these refactors and cleanup tasks.

Onboarding and Feature Set

Our First Attempt

When we first got access to DevinAI, we thought a lot of it would work out of the box since it had access to the whole codebase. We quickly realized that, as with a newly onboarded developer who has access to the whole codebase, there’s a lot you need to teach it.

We experimented with fairly greenfield tasks at the very beginning. Varun asked it to try to run the frontend. It made changes across 28 of our Docker files and pushed a PR. It received this response from one of our engineers:

That was the first moment we felt strongly that we should continue to invest in using DevinAI.

Understanding The Keys for DevinAI

Quickly we found that DevinAI requires an assortment of tools, both for DevinAI and for human-in-the-loop feedback, to work. There were two main keys for us:

  1. Ensuring that DevinAI could run tests on the code it writes
  2. Populating “Knowledge” so that DevinAI knew how to operate in the Lumos codebase

After DevinAI was able to run tests, we found that it would explore the codebase more to find the correct classes when writing tests and ensure that the code it wrote was functional via a unit test feedback loop. This led to PRs that made more sense and didn’t have to be fully rewritten all of the time.

For local development, we use AWS Secrets Manager. This requires AWS SSO login, Okta, and, for DevinAI, Google TOTP authentication. We thought this would be a blocker, but DevinAI has its own secrets manager, where we stored our TOTP setup URL and had DevinAI generate OTP codes from it.
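To make the TOTP step concrete, here is a minimal sketch of how a one-time code can be derived from a setup URL, following RFC 6238 with only the Python standard library. It is not what DevinAI actually ran (it installed a TOTP Python package of its own choosing); the function names and the example URL format are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import struct
import time
from urllib.parse import parse_qs, urlparse


def secret_from_otpauth_url(url: str) -> str:
    """Extract the base32 secret from an otpauth:// setup URL."""
    query = parse_qs(urlparse(url).query)
    return query["secret"][0]


def totp(secret_b32: str, for_time=None, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if for_time is None:
        for_time = time.time()
    # Base32 input must be padded to a multiple of 8 characters.
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    # The moving factor is the number of whole time steps since the epoch.
    counter = int(for_time) // step
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte is the offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

In practice the setup URL would be read from the secrets store rather than hardcoded, and `totp(secret_from_otpauth_url(url))` would be called right before the login step, since each code is only valid for one 30-second window.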

However, by far the biggest unlock for us was DevinAI’s Knowledge feature. This is where you can add context that DevinAI can use in certain situations.

Here’s where we store Lumos-specific knowledge that an onboarding engineer might learn over the course of their career at Lumos. Each piece of knowledge contains a description of when to use it and the prompt that is injected when DevinAI chooses to use it. Based on the description, DevinAI generates a “DevinAI uses this when…” statement to cover a wide range of scenarios where the knowledge can be applied, more than what the user’s description and prompt initially cover.

We’ve been able to populate this with knowledge such as which Alembic commands to run to create a migration in our codebase.

On reflection, we have a lot of these “statements” in our brains. For example, “when I’m working on notification related features, I should construct a template for Slack, email, and Teams since they have varying syntaxes.” It’s time-consuming to encode all of this knowledge in DevinAI, but the payoff is that the tasks you give DevinAI can become less prescriptive.
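As a rough illustration of the shape of a knowledge entry (a description of when to use it, plus the text that gets injected), here is a small Python sketch. The `KnowledgeEntry` class and `render` helper are hypothetical and are not DevinAI’s API or storage format; the Alembic commands are the generic ones, not necessarily the ones used in the Lumos codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEntry:
    """Hypothetical model of one piece of knowledge."""
    trigger: str  # description of when the agent should use this entry
    content: str  # text injected into the prompt when the entry is used


ENTRIES = [
    KnowledgeEntry(
        trigger="Creating a database migration",
        # Generic Alembic commands, assumed for illustration.
        content=(
            'Run `alembic revision --autogenerate -m "<message>"`, '
            "review the generated file, then run `alembic upgrade head`."
        ),
    ),
    KnowledgeEntry(
        trigger="Working on notification related features",
        content=(
            "Construct a template for Slack, email, and Teams, "
            "since they have varying syntaxes."
        ),
    ),
]


def render(entry: KnowledgeEntry) -> str:
    """Format an entry the way a reviewer might read it."""
    return f"When: {entry.trigger}\nDo: {entry.content}"
```

Writing entries down in this two-part form is a useful check on its own: if the trigger is hard to state, the knowledge is probably too broad to be applied reliably.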

Developer Tooling

One thing is undeniable: compared to the tooling of other AI agents we’ve seen, DevinAI’s is the best. It’s incredible to be able to hop into DevinAI’s Ubuntu machine, configure the VM however you want, then snapshot it to use later.

DevinAI itself can navigate web pages, input credentials, and come up with commands (e.g. installing a TOTP Python package to generate OTP codes). DevinAI biases towards human-in-the-loop design: developers can pause a DevinAI session to fix code that DevinAI may have hallucinated, run commands via an in-browser or SSH VS Code session, or modify the URL of DevinAI’s current browser activity.

Additionally, you can follow along with what DevinAI is doing in the preview and make live comments to change its behavior anywhere, from a PR review to a Slack thread reply.

Example Use Case 1: Routine Tasks

This is by far DevinAI’s strongest use case. Asking DevinAI to delete an unused component or function, or to migrate from Material UI to Tailwind, tends to produce accurate results.

With a simple prompt like so:

Can you create a PR that deletes DomainAppOffboardingWorkflow/index.tsx and it's related code? It's no longer used anywhere

tag vjindal0112 as a reviewer on the PR

It was able to push a PR that was quickly stamped and merged by our team.

Similarly, DevinAI performed quite well when converting React MUI components to Tailwind using our styles. With this conversion, it didn’t create a stampable PR, but it’s possible it could have if we had DevinAI lint the code and added knowledge to use our pre-commit hooks prior to pushing to GitHub.

The code for both PRs is very simple and could certainly be revised to completion by an experienced developer in minutes. But being able to hand work off during downtime, such as while waiting for tests to run, turns otherwise idle time into progress.

Example Use Case 2: From Scratch Prototype

We also wanted to test DevinAI with a much more greenfield task. We had an idea for a very performant NPM package for currency conversions, so we let DevinAI give it a shot.

Prompting DevinAI

We supplied DevinAI with a GitHub PAT (personal access token) to commit to the repository and see CI failures, and credentials to NPM to release the package.

We wanted to create a package that would hardcode currency conversion values in a JSON file and completely avoid an API call. This would work because on every deployment the package would be able to update to the next version. The currency numbers would be as old as your last deployment. We gave it a website from which to source the currency numbers: www.oanda.com
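The core idea, a rates snapshot shipped inside the package so that conversion never makes a network call, can be sketched as follows. This is written in Python for brevity even though the real package targeted NPM; the `convert` function, the snapshot layout, and the rate values are all placeholders for illustration, not real exchange rates or the package’s actual code.

```python
import json

# Snapshot bundled with the package at publish time.
# Placeholder values only, NOT real exchange rates.
RATES_JSON = """
{
  "base": "USD",
  "rates": {"USD": 1.0, "EUR": 0.5, "JPY": 100.0}
}
"""

_RATES = json.loads(RATES_JSON)["rates"]


def convert(amount: float, source: str, target: str) -> float:
    """Convert between two currencies using only the bundled snapshot."""
    try:
        source_rate = _RATES[source]
        target_rate = _RATES[target]
    except KeyError as exc:
        raise ValueError(f"Unsupported currency: {exc.args[0]}") from None
    # Go through the base currency: source -> base -> target.
    return amount / source_rate * target_rate
```

The trade-off is the one described above: lookups are a dictionary access with no I/O, but the numbers are only as fresh as the package version installed at the last deployment.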

We were thoroughly surprised to see that DevinAI coded this mostly flawlessly on its first go. It first created a plan that we were able to approve with a button click. Once it had an initial prototype, it ran the code until it succeeded. It decided to use Playwright to enter different numbers into the website and generate the JSON values for currency conversions.

However, we didn’t want it to do this. We wanted the JSON values to be generated more efficiently with an API, so we gave it this feedback on the GitHub repo.

It reacted with an “eyes” emoji and then proceeded to make the change!

With all of its PRs, we still had to go in and fix many issues before we could merge them, but they were solid enough starting points that we could actually take them over.

How Good Was The Output Code?

DevinAI is currently very high variance when things don’t go as planned. But the draft code worked quite well.

Here’s an example where DevinAI got stuck on npm package dependency errors and committed a bunch of hallucinated, non-incrementally-beneficial code:

However, there are also times when it does work out! During its implementation DevinAI once encountered a timeout error, so it updated the script to use an additional plugin to get around the timeout/captcha issues. 

While DevinAI was able to get things going initially, it would likely have trouble implementing incremental features on top of this without good linting and testing setups. This seems great for bootstrapped siloed prototypes within a company. 

When DevinAI was faced with more complex issues, it would get stuck in a loop of trying similar things and testing again and again. In reality, it needed to use Google to search for newer forum posts that were added after the model’s training cutoff.

Additionally, when it comes to code errors, DevinAI sometimes isn’t able to deduce the root cause of a bug because it tends to stay within the codebase and not look into the installed dependencies. It doesn’t go and read the function signatures and docs that would give it more insight into the bugs it’s facing. We are eager for more DevinAI tooling, such as a search engine capability, to help the AI agent accomplish its tasks.

Example Use Case 3: Prototype for Context Sharing

Another use case we’ve found for DevinAI has been context sharing. When you’re onboarding an engineer to a new part of the codebase to implement a feature, transferring knowledge can be tricky and usually requires many docs and meetings. 

Now, if you have an idea of how to implement a feature that you want to hand off, you can simply prompt DevinAI to make an initial prototype:

We were able to easily hand off the PR it created, and with a few comments on that PR, the engineer receiving the PR didn’t even need a synchronous meeting. They were able to pick up the context and push out a PR the very same day with the correct changes. 

Comparison: DevinAI / Replit Agent / Cursor

Lumos Usage Stats of DevinAI

We’ve had access to DevinAI for 48 days. In this time, we’ve had DevinAI push 44 PRs, of which 34 are closed and 19 are merged.

As you can tell, we’re still not at a point where we’re using DevinAI extensively. There’s more knowledge to be added to DevinAI and more adoption yet to happen. The majority of PRs that DevinAI has merged have been recent merges. We expect this usage to go up. 

Breakdown

DevinAI
  Pros:
    - Fully autonomous for simple tasks that go to plan
    - Excels at routine tasks that are tedious to open PRs for
    - Is able to do wide-reaching refactors, creating multiple commits for easier reviews
  Cons:
    - Is poor at understanding business-context-heavy tasks
    - It’s slow to act on feedback, recreating plans
    - Doesn’t always know when it’s stuck

Cursor
  Pros:
    - Can generate full files’ worth of code
    - Excels with business-context-heavy tasks because it can assist rather than do
    - It’s fast with generation
  Cons:
    - Cannot interact with GitHub or the terminal
    - Not as adept (yet) at wide-reaching refactors that require many files

Replit Agent
  Pros:
    - Super fast at correcting course
    - Fast 0-1 deployment and integrates with their cloud services like database
  Cons:
    - Cannot integrate with an existing codebase, rendering it not useful for us yet

We tested various commercial AI tools, from AI agents like Replit Agent, Lovable.dev, and GitHub Copilot Workspace to companion tools like Cursor and GitHub Copilot. The two that we now use are DevinAI and Cursor.

With Cursor Composer becoming more developed, we might see Cursor begin to close the gap on wide-reaching refactors, but for now DevinAI is definitely the best tool we’ve used for that purpose. DevinAI is also able to fill in the gaps in these refactors or tasks more effectively; it even once drafted a migration of our SQLAlchemy ORM code by referencing the migration guide and our ORM model code paths. On the other hand, using DevinAI for more precise edits proved more difficult than having Cursor assist with them.

We’re excited to see both develop, but we definitely will be using both for the foreseeable future!

Pricing: How Much Does It Cost?

DevinAI credits, called ACUs (Agent Compute Units), roughly equated to USD during our trial; the standard rate is $500 for 250 ACUs, or $2 per ACU. For us, that means a PR like the one shown above costs about $5 to make.

DevinAI has default limits so that you can avoid spending too much. Each command that you give DevinAI has a 10 ACU limit, so at the standard rate any single command within a session can cost you at most $20.
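The arithmetic behind these figures is simple enough to write down. The sketch below uses only the numbers quoted in this post (the standard rate and the per-command limit); the variable and function names are our own, and the effective rate during a trial may differ from the standard one.

```python
# Figures quoted in this post.
STANDARD_PRICE_USD = 500
STANDARD_ACUS = 250
PER_COMMAND_ACU_LIMIT = 10

# Standard rate: dollars per ACU.
usd_per_acu = STANDARD_PRICE_USD / STANDARD_ACUS

# Worst-case spend for a single command under the default limit.
max_cost_per_command_usd = PER_COMMAND_ACU_LIMIT * usd_per_acu


def cost_usd(acus: float) -> float:
    """Cost of a given number of ACUs at the standard rate."""
    return acus * usd_per_acu
```

Here `usd_per_acu` works out to 2.0 and `max_cost_per_command_usd` to 20.0, which matches the $20 per-command ceiling described above.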

On average the PRs we pushed using DevinAI tended to cost between $5 and $40 to create with an average of roughly $20. Given our learnings, we expect that number to drop because we’ll be using it for simpler tasks.

The Future of DevinAI at Lumos

At Lumos, we aim to make low-cost, reversible decisions quickly. As these AI tools continue to develop, we hope that more development decisions become this type of decision.

We’re excited because some of the most costly roadmapping decisions we have to make at Lumos are migrations. As these decisions become lower cost, we expect to see fundamental changes in how we roadmap and decide to build. 

Is Devin as Productive as an Engineer on Your Team?

For now, the answer to the question is no… not yet.

For DevinAI’s main use cases right now, point solutions are likely to be cheaper and more deterministic, especially for tasks like removing code (take Knip, for example). However, we’re investing in DevinAI because we think the use cases we discovered have large implications and room to grow. Onboarding to this platform is difficult and takes a while, but the more information DevinAI has about your codebase, the better it performs. Our bet is that as OpenAI and Anthropic release their next models and DevinAI’s tooling for its own AI agents improves, DevinAI will be “leveled” higher and higher.