Irrational Exuberance

By Will Larson. CTO at Carta, writes about software engineering and has authored several books including "An Elegant Puzzle."

Building an internal agent: Context window compaction

2025-12-27 00:00:00

Although my model of choice for most internal workflows remains GPT-4.1, thanks to its predictable speed and high adherence to instructions, even its 1,047,576-token context window can run out of space. When that happens, your agent either needs to give up, or it needs to compact that large context window into a smaller one. Here are our notes on implementing compaction.

This is part of the Building an internal agent series.

Why compaction matters

Long-running workflows with many tool calls or user messages, along with any workflow dealing with large files, often run out of space in their context window. Context window exhaustion isn’t an issue in most cases you’ll find for internal agents, but ultimately it’s not possible to build a robust, reliable agent without solving for this problem, and compaction is a straightforward solution.

How we implemented it

Initially, in the beautiful moment when we assumed compaction wouldn’t be a relevant concern for our internal workflows, we implemented an extremely naive solution: if we ever ran out of tokens, we discarded older tool responses until we had more space, then continued. Because we rarely triggered compaction, the fact that this worked poorly wasn’t a major issue, but eventually the inelegance began to weigh on me as we started dealing with more workflows involving large files.

When we brainstormed our second iteration of compaction, I got anchored on the beautiful idea that compaction should be sequenced after implementing support for sub-agents, but I was never able to ground that intuition in a concrete reason why it was necessary, so we implemented compaction without sub-agent support.

The gist of our approach to compaction is as follows (a rough sketch in code follows the list):

  1. After every user message (including tool responses), add a system message reporting the consumed and available tokens in the context window. In that system message, we also include the updated list of available files that can be read

  2. User messages and tool responses greater than 10,000 tokens are exposed as a new “virtual file”, with only their first 1,000 tokens included in the context window. The agent must use file manipulation tools to read more than those first 1,000 tokens (both 1k and 10k are configurable values)

  3. Add a set of “base tools” that are always available to agents, specifically including the virtual file manipulation tools, as we’d finally reached a point where most agents simply could not operate without a large number of mostly invisible internal tools. These tools were file_read, which can read entire files, line ranges within a file, or byte ranges within a file, and file_regex, which is similar but performs a regex scan against a file up to a certain number of matches.

    Every use of a file is recorded in the files data, so the agent knows what has and hasn’t been read into the context window (particularly relevant for preloaded files), along the lines of:

    <files>
      <file id='a' name='image.png' size='32kb'>
        <file_read />
      </file>
      <file id='a' name='image.png' size='32kb'>
        <file_read start_line=10 end_line=20 />
      </file>
    </files>
    

    This was surprisingly annoying to implement cleanly, mostly because I came onto this idea after iteratively building the agent as a part-time project for several months. If I could start over, I would start with files as a core internal construct, rather than adding it on later.

  4. If a message pushed us over 80% (configurable value) of the model’s available context window, use the compaction prompt that Reddit claims Claude Code uses. The prompt isn’t particularly special; it just already exists and seems pretty good

  5. After compacting, add the prior context window as a virtual file to allow the agent to retrieve pieces of context that it might have lost
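To make those steps concrete, here’s a minimal sketch of how the loop might fit together. This is illustrative rather than our production code: the thresholds mirror the defaults above, count_tokens is a crude stand-in for a real tokenizer, and run_compaction_prompt is a placeholder for the model call that produces the summary.

# Illustrative sketch of the compaction flow described above.
from dataclasses import dataclass, field

LARGE_MESSAGE_TOKENS = 10_000   # messages above this become virtual files
PREVIEW_TOKENS = 1_000          # how much of a large message stays inline
COMPACT_THRESHOLD = 0.80        # compact at 80% of the context window

def count_tokens(text: str) -> int:
    return len(text) // 4  # rough approximation, good enough for a sketch

def run_compaction_prompt(prior: str) -> str:
    """Placeholder: the real system sends the compaction prompt to the model."""
    return "[summary of prior conversation]"

@dataclass
class AgentContext:
    window_limit: int
    messages: list[str] = field(default_factory=list)
    files: dict[str, str] = field(default_factory=dict)   # virtual files by id

    def used_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def status_message(self) -> str:
        # Step 1: after every message, report window usage and available files.
        files = ", ".join(self.files) or "none"
        return f"[context: {self.used_tokens()}/{self.window_limit} tokens used; files: {files}]"

    def add_virtual_file(self, content: str) -> str:
        file_id = f"f_{len(self.files)}"
        self.files[file_id] = content
        return file_id

def add_message(ctx: AgentContext, message: str) -> None:
    if count_tokens(message) > LARGE_MESSAGE_TOKENS:
        # Step 2: expose the full message as a virtual file, keep only a preview inline.
        file_id = ctx.add_virtual_file(message)
        message = message[: PREVIEW_TOKENS * 4] + f"\n[truncated; read file {file_id} for the rest]"
    ctx.messages.append(message)
    ctx.messages.append(ctx.status_message())

    if ctx.used_tokens() > COMPACT_THRESHOLD * ctx.window_limit:
        # Steps 4 and 5: preserve the prior window as a virtual file, then summarize it.
        prior = "\n".join(ctx.messages)
        file_id = ctx.add_virtual_file(prior)
        ctx.messages = [run_compaction_prompt(prior),
                        f"[prior context preserved as file {file_id}]",
                        ctx.status_message()]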

Each of these steps is quite simple, but in combination they really do provide a fair amount of power for handling complex, prolonged workflows. Admittedly, we still have a configurable cap on the number of tools that can be called in a workflow (to avoid agents spinning out), but this means that agents dealing with large or complex data are much more likely to succeed usefully.

How is it working? / What’s next?

Whereas for most of our new internal agent features, there are obvious problems or iterations, this one feels like it’s good enough to forget for a long, long time. There are two reasons for this: first, most of our workflows don’t require large context windows, and, second, honestly this seems to work quite well.

If context windows get significantly larger in the future, which I don’t see much evidence of happening at the moment, then we will simply increase some of the default values to use more tokens, but the core algorithm here seems good enough.

Building an internal agent: Progressive disclosure and handling large files

2025-12-26 23:00:00

One of the most useful initial extensions I made to our workflows was injecting associated images into the context window automatically, to improve the quality of responses to tickets and messages that relied heavily on screenshots. This was quick and made the workflows significantly more powerful.

More recently, a number of workflows have attempted to operate on large, complex files like PDFs or DOCXs, and the naive approach of shoving them into the context window hasn’t worked particularly well. This post explains how we’ve adapted the principle of progressive disclosure to allow our internal agents to work with large files.

This is part of the Building an internal agent series.

Large files and progressive disclosure

Progressive disclosure is the practice of limiting what is added to the context window to the minimum necessary amount, and adding more detail over time as necessary.

A good example of progressive disclosure is how agent skills are implemented:

  1. Initially, you only add the description of each available skill into the context window
  2. You then load the SKILL.md on demand
  3. The SKILL.md can specify other files to be further loaded as helpful

In our internal use-case, we have skills for JIRA formatting, Slack formatting, and Notion formatting. Some workflows require all three, but the vast majority of workflows require at most one of these skills, and it’s straightforward for the agent to determine which are relevant to a given task.

File management is a particularly interesting progressive disclosure problem, because files are so helpful in many scenarios, but are also so very large. For example, requests for help in Slack are often along the lines of “I need help with this login issue [screenshot]”, which is impossible to solve without including that image in the context window. In other workflows, you might want to analyze a daily data export that is 5-10MB as a PDF, but only 10-20KB of tables and text once extracted. This gets even messier when the goal is to compare across multiple PDFs, each of which is quite large.

Our approach

Our high-level approach to the large-file problem is as follows (the file tools are sketched in code after the list):

  1. Always include metadata about available files in the prompt, similar to the list of available skills. This will look something like:

    Files:
      - id: f_a1
        name: my_image.png
        size: 500,000
        preloaded: false
      - id: f_b3
        name: ...
    

    The key thing is that each id is a reference that the agent is able to pass to tools. This allows it to operate on files without loading their contents into the context window.

  2. Automatically preload the first N kb of files into the context window, as long as they are appropriate mimetypes for loading (png, pdf, etc). This is per-workflow configurable, and could be set as low as 0 if a given workflow didn’t want to preload any files.

    I’m still of two minds about whether preloading is worth doing, as it takes some control away from the agent.

  3. Provide three tools for operating on files:

    • load_file(id) loads an entire file into the context window
    • peek_file(id, start, stop) loads a section of a file into the context window
    • extract_file(id) transforms PDFs, PPTs, DOCX and so on into simplified textual versions
  4. Provide a large_files skill which explains how and when to use the above tools to work with large files. Generally, it encourages using extract_file on any PDF, DOCX or PPT file that it wants to work with, and otherwise loading or peeking depending on the available space in the context window
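As a rough sketch of what those three tools might look like against an in-memory file registry (the registry shape and the use of pypdf for extraction are assumptions rather than our actual implementation):

# Sketch of the three file tools above, backed by a simple in-memory registry.
# DOCX and PPT extraction would follow the same pattern as the PDF branch.
from io import BytesIO

FILES: dict[str, dict] = {}   # file id -> {"name": ..., "bytes": ...}

def load_file(file_id: str) -> bytes:
    """Load an entire file so it can be placed into the context window."""
    return FILES[file_id]["bytes"]

def peek_file(file_id: str, start: int, stop: int) -> bytes:
    """Load only a slice of a file into the context window."""
    return FILES[file_id]["bytes"][start:stop]

def extract_file(file_id: str) -> str:
    """Convert a PDF into a simplified textual version."""
    name = FILES[file_id]["name"].lower()
    if not name.endswith(".pdf"):
        raise ValueError(f"no extractor for {name}")
    from pypdf import PdfReader   # pure-Python, so it runs inside a Lambda
    reader = PdfReader(BytesIO(FILES[file_id]["bytes"]))
    return "\n".join(page.extract_text() or "" for page in reader.pages)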

This approach was quick to implement, and provides significantly more control to the agent to navigate a wide variety of scenarios involving large files. It’s also a good example of how the “glue layer” between LLMs and tools is actually a complex, sophisticated application layer rather than merely glue.

How is this working?

This has worked well. In particular, one of our internal workflows is oriented around giving feedback on documents attached to a ticket, comparing them to other similar, existing documents. That workflow simply did not work at all prior to this approach, and now works fairly well without workflow-specific support for handling these sorts of large files, because the large_files skill handles that in a reusable fashion without workflow authors needing to be aware of it.

What next?

Generally, this feels like a stand-alone set of functionality that doesn’t require significant future investment, but there are three places where we will need to continue building:

  1. Until we add sub-agent support, our capabilities are constrained. In many cases, the ideal scenario of dealing with a large file is opening it in a sub-agent with a large context window, asking that sub-agent to summarize its contents, and then taking that summary into the primary agent’s context window.
  2. It seems likely that extract_file should be modified to return a referencable, virtual file_id that is used with peek_file and load_file rather than returning contents directly. This would make for a more robust tool even when extracting from very large files. In practice, extracted content has always been quite compact.
  3. Finally, operating within an AWS Lambda requires pure Python packages, and pure Python is not very fast at parsing complex XML-derived document formats like DOCX. We could solve this by adding a layer to our Lambda with the lxml dependencies in it, and at some point we might.

Altogether, a very helpful extension for our internal workflows.

Building an internal agent: Adding support for Agent Skills

2025-12-26 22:00:00

When Anthropic introduced Agent Skills, I was initially a bit skeptical of the problem they solved–can we just use prompts and tools?–but I’ve subsequently come to appreciate them, and have explicitly implemented skills in our internal agent framework. This post talks about the problem skills solve, how the engineering team at Imprint implemented them, how well they’ve worked for us, and where we might take them next.

This is part of the Building an internal agent series.

What problem do Agent Skills solve?

Agent Skills are a series of techniques that solve three important workflow problems:

  1. use progressive disclosure to more effectively utilize the constrained context windows
  2. minimize conflicting or unnecessary context in the context window
  3. provide reusable snippets for recurring problems, so individual workflow creators don’t each have to solve things like Slack formatting or dealing with large files

All three of these problems initially seemed very insignificant when we started building out our internal workflows, but once the number of internal workflows reached into the dozens, they became difficult to manage. Without reusable snippets, I lost the leverage to improve all workflows at once, and without progressive disclosure the agents would get a vast amount of irrelevant content that could confuse them, particularly when it came to things like inconsistencies between Markdown and Slack’s mrkdwn formatting language, both of which matter to different tools used by our workflows.

How we implemented Agent Skills

As a disclaimer, I recognize that it’s not necessary to implement agent skills yourself, as you can integrate with e.g. Claude’s API support for Agent Skills. However, one of our design decisions is to remain largely platform agnostic, such that we can switch across model providers, and consequently we decided to implement skills within our framework.

With that out of the way, we started implementing by reviewing the Agent Skills documentation at agentskills.io, and cloning their Python reference implementation skills-ref into our repository to make it accessible to Claude Code.

The resulting implementation has these core features (skill loading is sketched in code after the list):

  1. Skills live in a skills/ directory in the repository, with each skill in its own sub-directory containing a SKILL.md

  2. Each skill is a Markdown file with metadata along these lines:

    ---
    name: pdf-processing
    description: Extract text and tables...
    metadata:
     author: example-org
     version: "1.0"
    ---
    
  3. The list of available skills–including their description from metadata–is injected into the system prompt at the beginning of each workflow, and the load_skills tool is available to the agent to load the entire file into the context window.

  4. Updated workflow configuration to optionally specify required, allowed, and prohibited skills to modify the list of exposed skills injected into the system prompt.

    My guess is that requiring specific skills for a given workflow is a bit of an anti-pattern, “just let the agent decide!”, but it was trivial to implement and the sort of thing that I could imagine is useful in the future.

  5. Used the Notion MCP to retrieve all the existing prompts in our prompt repository, identify existing implicit skills in the prompts we had created, write those initial skills, and identify which Notion prompts to edit to eliminate the now redundant sections of their prompts.
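As a minimal sketch of the skill-loading piece, under the assumption that the layout matches the list above; the helper names and the use of PyYAML are illustrative, not our actual code:

# Scan skills/*/SKILL.md, parse the YAML frontmatter, and build both the short
# system-prompt listing and the lookup behind a load_skills tool.
from pathlib import Path
import yaml  # PyYAML

def discover_skills(root: str = "skills") -> dict[str, dict]:
    skills = {}
    for skill_md in Path(root).glob("*/SKILL.md"):
        text = skill_md.read_text()
        # Frontmatter sits between the first two '---' markers.
        _, frontmatter, body = text.split("---", 2)
        meta = yaml.safe_load(frontmatter)
        skills[meta["name"]] = {"description": meta["description"], "body": body.strip()}
    return skills

def render_skill_list(skills: dict[str, dict]) -> str:
    """The short listing injected into every system prompt."""
    lines = ["Available skills (call load_skills to read one in full):"]
    lines += [f"- {name}: {skill['description']}" for name, skill in skills.items()]
    return "\n".join(lines)

def load_skills(skills: dict[str, dict], names: list[str]) -> str:
    """Tool handler: return the full SKILL.md bodies for the requested skills."""
    return "\n\n".join(skills[name]["body"] for name in names if name in skills)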

Then we shipped it into production.

How they’ve worked

Humans make mistakes all the time. For example, I’ve seen many dozens of JIRA tickets from humans that don’t explain the actual problem they are having. People are used to that, and when a human makes a mistake, they blame the human. However, when agents make a mistake, a surprising percentage of people view it as a fundamental limitation of agents as a category, rather than thinking that, “Oh, I should go update that prompt.”

Skills have been extremely helpful as the tool to continue refining down these edge cases where we’ve relied on implicit behavior because specifying the exact behavior was simply overwhelming. As one example, we ask that every Slack message end with a link to the prompt that drove the response. That always worked, but the details of the formatting would vary in an annoying, distracting way: sometimes it would be the equivalent of [title](link), sometimes link, sometimes [link](link). With skills, it is now (almost always) consistent, without anyone thinking to include those instructions in their workflow prompts.

Similarly, handling large files requires a series of different tools that benefit from In-Context Learning (aka ICL, which is a fancy term for including a handful of examples of correct and incorrect usage), which absolutely no one is going to add to their workflow prompt but is extremely effective at improving how the workflow uses those tools.

For something that I was initially deeply skeptical about, I now wish I had implemented skills much earlier.

Where we might go next

While our skills implementation is working well today, there are a few opportunities I’d like to take advantage of in the future:

  1. Add a load_subskill tool to support files in skills/{skill}/* beyond the SKILL.md. So far, this hasn’t been a major blocker, but as some skills get more sophisticated, the ability to split varied use-cases into distinct files would improve our ability to use skills for progressive disclosure

  2. One significant advantage that Anthropic has over us is their sandboxed Python interpreter, which allows skills to include entire Python scripts that tools can run. For example, a script for parsing PDFs might be included in a skill, which is extremely handy. We don’t currently have a sandboxed interpreter handy for our agents, but this could, in theory anyway, significantly cut down on the number of custom skills we need to implement.

    At a minimum, it would do a much better job at operations that require reliable math versus relying on the LLM to do its best at performing math-y operations.

I think both of these are actually pretty straightforward to implement. The first is a simple feature that Claude could implement in a few minutes. The second feels annoying to implement, but could also be done in less than an hour by running a second Lambda running Node.js with Pyodide, and exposing access to that Lambda as a tool. It’s just so inelegant for a Python process to call a Node.js process to run sandboxed Python that I haven’t done it quite yet.

2025 in review.

2025-12-19 01:00:00

Yet another edition of my annual recap! This year brought my son to kindergarten, me to forty and to a new job at Imprint, my fourth book to bookstores, and a lot more time in the weeds of developing software.


Previously: 2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017

Goals

Evaluating my goals for this year and decade:

  • [Completed] Write at least four good blog posts each year.

    Moving from an orchestration-heavy to leadership-heavy management role, Good engineering management is a fad, What is the competitive advantage of authors in the age of LLMs?, Facilitating AI adoption at Imprint

  • [Completed] Write three books about engineering or leadership in 2020s.

    This year I finished Crafting Engineering Strategy with O’Reilly. This is my third engineering book in the 2020s. More about this in the Writing section below.

  • [Completed] Do something substantial and new every year that provides new perspective or deeper practice.

    After almost a decade of not submitting a substantial pull request at work, I’ve been back in the mix since joining Imprint. I’ve submitted a solid handful of real pull requests that implement production features, and have used Claude Code widely in their creation. I’ve missed this a lot, and have learned a bunch about developing software with LLMs.

  • [In progress] 20+ folks who I’ve managed or meaningfully supported move into VPE or CTO roles at 50+ person or $100M+ valuation companies.

    This is a decade goal ending in 2029. I previously increased the goal in 2022 from 3-5 to 20. In 2024, the count was at 10. Things haven’t moved too much since then, but I’ll refresh next year.

    I think that I’m on track, but I will say that I think getting into these roles is markedly harder than it was three years ago. There are just fewer of these roles available recently, and they tend to be both more demanding and more difficult than the standard VPE/CTO role a few years ago.

For backstory on these goals: I originally set them in 2019, and then revised them in 2022. I’ve come to believe that I should be revising these every year, but also that it’s not that interesting to revise them every year. I’ll revise them again in a few years.

Writing

I finished my fourth book, Crafting Engineering Strategy, and wrote some notes on writing it. I’m really excited for this book to be done, because I think it’s been a missing book in the industry, and I hope it will change how the industry thinks about “engineering strategy.” In particular, I hope it’ll pull us away from the frequent gripe that “we have no engineering strategy!” You do have an engineering strategy, it’s just not written down yet.

As part of finishing this book, I’ve also recognized that if I write another book, it will be far into the future. After publishing four books in six years, I’m booked out, and I’m pretty sure I’ve tapped out my last decade’s path of writing books to advance the industry. I’ll definitely keep writing, but it’ll be posts focused on the stuff I’m concretely working on, without trying to map them into a larger book structure.

(Last year I mentioned adding The High-Context Triad to a second edition of Staff Engineer, which I still plan to do, but I’m not quite sure when. Probably in a few years.)

Work

I left Carta in May after two years there, and joined Imprint. Imprint has just been a lot of fun for me. I’ve written a small number of real pull requests that implement meaningful things. That’s something I haven’t done since working at Uber, and it aligns with my desire to be working in the details again. There’s nothing more energizing to me than getting to solve real, concrete problems, and that’s exactly the sort of job Imprint has been for me. I just haven’t been spending time on stuff like implementing internal workflow agents or automatically merging Dependabot pull requests in a long time, and I missed it.

It’s also, after some years spent on making teams more efficient, been an opportunity to really hire again, which I haven’t gotten to do since my first couple years at Calm. It’s never easy working at a fast growing company, but you do learn a lot, and quite quickly.

Family

My son entered kindergarten this year. I turned 40. My wife is starting to explore the world of fractional software development, and she’s figuring out its rules. We’ve had a fair amount of health issues in the immediate and extended family, but altogether everything is going well.

Speaking

I didn’t do much public speaking, although I spoke on Book Overflow about Staff Engineer, which was a fun discussion.

I also spoke at several private events, and recorded practice runs on YouTube of Good engineering management is a fad and CTOs must earn the right to specialize. Those are very similar talks, where I’ve been iterating on the core idea of how engineering managers need to adapt to the current era.

Reading

In 2024, I read 27 profession-adjacent books. In 2023, I read 11. I’m not quite sure how many I read in 2022, because I put together a 2019-2022 professional reading recap, but it was about 50 over four years. This year I didn’t do much professional reading, mostly because I was too busy with the new job and polishing my most recent book.

What I did read was:

  1. AI Engineering by Chip Huyen
  2. Recoding America by Jennifer Pahlka
  3. Facilitating Software Architecture by Andrew Harmel-Law
  4. Turning the Flywheel by Jim Collins

It’s interesting to note the drop in volume, but I feel fine about it. I don’t read to hit a goal, I read to learn or understand a particular problem, and found myself mostly working on topics that didn’t align well with that approach this year.


If you’ve written something about your year, send it my way!

Automatically merging dependabot PRs

2025-12-18 23:00:00

One of the recurring themes of software development is patching security issues. Most repository hosting services have fairly good issue reporting at this point, but many organizations still struggle to apply those fixes in a timely fashion. This past week we were discussing how to reduce the overhead of this process, and I was curious: can you just auto-merge GitHub Dependabot pull requests?

It turns out, the answer is yes, and it works pretty well. You get control over which types of updates (patches, minor updates, major updates, etc) you want to auto-merge, and it will also respect your automated checks. If you have great CI/CD that runs blocking linting, typing and tests, then this works particularly well. If you don’t, then, well, this will be an effective mechanism to get you to having good linting, typing, and tests after traversing a small ocean of tears.

I got this running for about a dozen repositories at work over the past few days, but I’ll show an example of setting up the same mechanism for my blog.

First, add a .github/workflows/dependabot-auto-merge.yml file to your repository that looks like this:

# Automatically approve and merge Dependabot PRs for minor and patch updates
name: Dependabot auto-merge
on: pull_request

permissions:
  contents: write
  pull-requests: write

jobs:
  dependabot:
    runs-on: ubuntu-latest
    if: github.event.pull_request.user.login == 'dependabot[bot]' && github.repository == 'lethain/irrational_hugo'
    steps:
      - name: Dependabot metadata
        id: metadata
        uses: dependabot/fetch-metadata@v2
        with:
          github-token: "${{ secrets.GITHUB_TOKEN }}"
      - name: Approve Dependabot PR
        run: gh pr review --approve "$PR_URL"
        env:
          PR_URL: ${{ github.event.pull_request.html_url }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Enable auto-merge for Dependabot PRs
        if: steps.metadata.outputs.update-type == 'version-update:semver-patch' || steps.metadata.outputs.update-type == 'version-update:semver-minor'
        run: gh pr merge --auto --squash "$PR_URL"
        env:
          PR_URL: ${{ github.event.pull_request.html_url }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Then go to your repository settings (something like https://github.com/lethain/irrational_hugo/settings), and enable auto-merging for your repository. This still respects all required branch rules, like required test passes or approvals, etc.

Then make sure you have appropriate status checks for whatever linting, typing and tests you have in your repository.

Then enable Dependabot (something like https://github.com/lethain/irrational_hugo/settings/security_analysis). Even the default settings are just fine.
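If you’d rather configure Dependabot explicitly than rely on the settings UI, a .github/dependabot.yml along these lines does the same job; the package ecosystems listed here are just examples, so adjust them to whatever your repository actually uses:

# Hypothetical .github/dependabot.yml; the UI defaults are equivalent.
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "npm"        # swap for your repository's ecosystem
    directory: "/"
    schedule:
      interval: "weekly"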

Then you’re done. The PRs from Dependabot will automatically merge going forward. There are lots of nuances here–I already found one PR that automatically merged despite an issue, because a test was missing–but ultimately I think that’s valuable pressure to improve testing quality, rather than a reason to avoid or backtrack on the approach.

Facilitating AI adoption at Imprint

2025-12-07 23:00:00

I’ve been working on internal “AI” adoption, which is really LLM-tooling and agent adoption, for the past 18 months or so. This is a problem that I think is, at minimum, a side-quest for every engineering leader in the current era. Given the sheer number of folks working on this problem within their own company, I wanted to write up my “working notes” of what I’ve learned.

This isn’t a recommendation about what you should do, merely a recap of how I’ve approached the problem thus far, and what I’ve learned through ongoing iteration. I hope the thinking here will be useful to you, or at least validates some of what you’re experiencing in your rollout. The further you read, the more specific this will get, ending with cheap-turpentine-esque topics like getting agents to reliably translate human-readable text representations of Slack entities into mrkdwn formatting of the correct underlying entity.

I am hiring: If you’re interested in working together with me on internal agent and AI adoption at Imprint, we are hiring our founding Senior Software Engineer, AI. The ideal candidate is a product engineer who’s spent some time experimenting with agents, and wants to spend the next year or two digging into this space.

Prework: building my intuition

As technologists, I think one of the basics we owe our teams is spending time working directly with new tools to develop an intuition for how they do, and don’t work. AI adoption is no different.

Towards that end, I started with a bit of reading, especially Chip Huyen’s AI Engineering, and then dove into a handful of bounded projects: building my own rudimentary agent platform using Claude Code for implementation, creating a trivial MCP for searching my blog posts, and an agent to comment on Notion documents.

Each of these projects took two to ten hours, and was extremely clarifying. Tool use, in particular, is something that seemed like magic until I implemented a simple tool-using agent, at which point it became something extremely non-magical that I could reason about and understand.

Our AI adoption strategy

Imprint’s general approach to refining AI adoption is strategy testing: identify a few goals, pick an initial approach, and then iterate rapidly in the details until the approach genuinely works. In an era of crushing optics, senior leaders immersing themselves in the details is one of our few defenses.

First draft of Imprint’s strategy for AI adoption

Shortly after joining, I partnered with the executive team to draft the above strategy for AI adoption. After a modest amount of debate, the pillars we landed on were:

  1. Pave the path to adoption by removing obstacles to adoption, especially things like having to explicitly request access to tooling. There’s significant internal and industry excitement for AI adoption, and we should believe in our teams. If they aren’t adopting tooling, we predominantly focus on making it easier rather than spending time being skeptical or dismissive of their efforts towards adoption.
  2. Opportunity for adoption is everywhere, rather than being isolated to engineering, customer service, or what not. To become a company that widely benefits from AI, we need to be solving the problem of adoption across all teams. It’s not that I believe we should take the same approach everywhere, but we need some applicable approach for each team.
  3. Senior leadership leads from the front to ensure what we’re doing is genuinely useful, rather than getting caught up in what we’re measuring.

As you can see from those principles, and my earlier comment, my biggest fear for AI adoption is that teams focus on creating the impression of adopting AI, rather than on creating additional productivity. Optics are a core part of any work, but almost all interesting work occurs where optics and reality intersect, which these pillars aimed to support.


As an aside, in terms of the components of strategy in Crafting Engineering Strategy, this is really just the strategy’s policy. In addition, we used strategy testing to refine our approach, defined a concrete set of initial actions to operationalize it (they’re a bit too specific to share externally), and did some brief exploration to make sure I wasn’t overfitting on my prior work at Carta.

Documenting tips & tricks

My first step towards adoption was collecting as many internal examples of tips and tricks as possible into a single Notion database. I took a very broad view on what qualified, with the belief that showing many different examples of using tools–especially across different functions–is both useful and inspiring.

[Image: a table listing AI tips and trainings, with columns for the name of the tip and the relevant team, covering topics like using Claude Code, adding Slack bots, and employing ChatGPT for marketing.]

I’ve continued extending this, with contributions from across the company, and it’s become a useful resource for both humans and bots alike to provide suggestions on approaching problems with AI tooling.

Centralizing our prompts

One of my core beliefs in our approach is that making prompts discoverable within the company is extremely valuable. Discoverability solves five distinct problems:

  1. Creating visibility into what prompts can do (so others can be inspired to use them in similar scenarios). For example, that you can use our agents to comment on a Notion doc when it’s created, respond in Slack channels effectively, triage Jira tickets, etc
  2. Showing what a good prompt looks like (so others can improve their prompts). For example, you can start moving complex configuration into tables and out of lists which are harder to read and accurately modify
  3. Serving as a repository of copy-able sections to reuse across prompts. For example, you can copy one of our existing “Jira-issue triaging prompts” to start triaging a new Jira project
  4. Prompts are joint property of a team or function, not the immutable construct of one person. For example, anyone on our Helpdesk team can improve the prompt responding to Helpdesk requests, not just one person with access to the prompt, and it’s not locked behind being comfortable with Git or Github (although I do imagine we’ll end up with more restrictions around editing our most important internal agents over time)
  5. Identifying repeating prompt sub-components that imply missing or hard to use tools. For example, earlier versions of our prompts had a lot of confusion around how to specify Slack users and channels, which I got comfortable working around, but others did not

My core approach is that every agent’s prompt is stored in a single Notion database which is readable by everyone in the company. Most prompts are editable by everyone, but some have editing restrictions.

Here’s an example of a prompt we use for routing incoming Jira issues from Customer Support to the correct engineering team.

[Image: a prompt with instructions for triaging Jira tickets, detailing steps for retrieving comments, updating labels, and determining responsible teams, along with guidelines for using Slack for communication and references, and a list of teams with their on-call aliases and areas of responsibility.]

Here’s a second example, this time of responding to requests in our Infrastructure Engineering team’s request channel.

[Image: a prompt with detailed instructions for a service desk agent handling Slack messages about access requests for tools such as AWS, VPN, and NPM, including step-by-step guidance for retrieving user IDs, handling specific request types, and directing users to the appropriate resources or teams.]

Pretty much all prompts end with an instruction to include a link to the prompt in the generated message. This ensures it’s easy to go from a mediocre response to the prompt driving that response, so that you can fix it.

Adopting a standard platform

In addition to collecting tips and prompts, the next obvious step for AI adoption is identifying a standard AI platform to be used within the company, e.g. ChatGPT, Claude, Gemini or what not.

We’ve gone with OpenAI for everyone. In addition to standardizing on a platform, we made sure account provisioning was automatic and in place on day one. To the surprise of no one who’s worked in or adjacent to IT, a lot of revolutionary general AI adoption is… really just account provisioning and access controls. These are the little details that can so easily derail the broader plan if you don’t dive into them.

Within Engineering, we also provide both Cursor and Claude. That said, the vast majority of our Claude usage is done via AWS Bedrock, which we use to power Claude Code… and we use Claude Code quite a bit.

Other AI tooling

While there’s a general industry push towards adopting more AI tooling, I find that a significant majority of “AI tools” are just SaaS vendors that talk about AI in their marketing pitches. We have continued to adopt vendors, but have worked internally to help teams evaluate which “AI tools” are meaningful.

We’ve spent a fair amount of time going deep on integrating with AI tooling for chat and IVR tooling, but that’s a different post entirely.

Metrics

Measuring AI adoption is, like all measurement topics, fraught. Altogether, I’ve found measuring tool adoption very useful for identifying the right questions to ask. Why haven’t you used Cursor? Or Claude Code? Or whatever? These are fascinating questions to dig into. I try to look at usage data at least once a month, with a particular focus on two questions:

  1. For power adopters, what are they actually doing? Why do they find it useful?
  2. For low or non-adopters, why aren’t they using the tooling? How could we help solve that for them?

At the core, I believe folks who aren’t adopting tools are rational non-adopters, and spending some time understanding the (appearance of) resistance goes further than a top-down mandate. I think it’s often an education gap that is bridged easily enough. Conceivably, at some point I’ll discover a point of diminishing returns, where progress is stymied by folks who reject AI tooling–or because the AI tooling isn’t genuinely useful–but I haven’t found that point yet.

Building internal agents

The next few sections are about building internal agents. The core implementation is a single stateless lambda which handles a wide variety of HTTP requests, similar-ish to Zapier. This is currently implemented in Python, and is roughly 3,000 lines of code, much of it dedicated to oddities like formatting Slack messages, etc.

For the record, I did originally attempt to do this within Zapier, but I found that Zapier simply doesn’t facilitate the precision I believe is necessary to do this effectively. I also think that Zapier isn’t particularly approachable for a non-engineering audience.
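For a sense of shape, here’s a hypothetical outline of what a single stateless Lambda entry point routing inbound webhooks might look like; the route paths and the run_agent placeholder are illustrative, not the actual Imprint implementation.

# Hypothetical outline of one stateless Lambda handling several webhook routes.
import json

def run_agent(source: str, payload: dict) -> dict:
    """Placeholder for the agent loop: load config, build the prompt, call the model, run tools."""
    return {"source": source, "handled": True}

def handler(event, context):
    path = event.get("rawPath", "")
    body = json.loads(event.get("body") or "{}")

    if path == "/slack/events":
        # Slack's URL verification handshake has to be answered directly.
        if body.get("type") == "url_verification":
            return {"statusCode": 200, "body": body["challenge"]}
        return {"statusCode": 200, "body": json.dumps(run_agent("slack", body))}

    if path == "/jira/webhook":
        return {"statusCode": 200, "body": json.dumps(run_agent("jira", body))}

    return {"statusCode": 404, "body": "unknown route"}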

What has fueled adoption (especially for agents)

As someone who spent a long time working in platform engineering, I still want to believe that you can build a platform, and users will come. Indeed, I think it’s true that a small number of early adopters will come, if the problem is sufficiently painful for them, as was the case for Uber’s service migration (2014).

However, what we’ve found effective for driving adoption is basically the opposite of that. What’s really worked is the intersection of platform engineering and old-fashioned product engineering:

  1. (product eng) find a workflow with a lot of challenges or potential impact
  2. (product eng) work closely with domain experts to get the first version working
  3. (platform eng) ensure that working solution is extensible by the team using it
  4. (both) monitor adoption as indicator of problem-solution fit, or lack thereof

Some examples of the projects where we’ve gotten traction internally:

  • Writing software with effective AGENTS.md files guiding use of tests, typechecking and linting
  • Powering initial customer questions through chat and IVR
  • Routing chat bots to steer questions to solve the problem, provide the answer, or notify the correct responder
  • Issue triaging for incoming tickets: tagging them, and assigning them to the appropriate teams
  • Providing real-time initial feedback on routine compliance and legal questions (e.g. questions which occur frequently and with little deviation)
  • Writing weekly priorities updates after pulling a wide range of resources (Git commits, Slack messages, etc)

For all of these projects that have worked, the formula has been the opposite of “build a platform and they will come.” Instead it’s required deep partnership from folks with experience building AI agents and using AI tooling to make progress. The learning curve for effective AI adoption in important or production-like workflows remains meaningfully high.

Configuring agents

Agents that use powerful tools represent a complex configuration problem. First, exposing too many tools–especially tools that the prompt author doesn’t effectively understand–makes it very difficult to create reliable workflows. For example, we have an exit_early command that allows terminating the agent early: this is very effective in many cases, but also makes it easy to break your bot. Similarly, we have a slack_chat command that allows posting across channels, which can support a variety of useful workflows (e.g. warm handoffs of a question in one channel into a more appropriate alternative), but can also spam folks. Second, as tools get more powerful, they can introduce complex security scenarios.

To address both of these, we currently store configuration in a code-reviewed Git repository. Here’s an example configuration for a JIRA project.

[Image: a configuration snippet for a Jira agent, specifying project keys, a prompt ID, a list of allowed tools such as “notion_search” and “slack_chat”, the model “gpt-4.1”, and “respond_to_issue” set to False.]

Here’s another for specifying a Slack responder bot.

[Image: a configuration snippet for a Slack responder bot in “eng-new-hires”, specifying Slack channel IDs, a Notion prompt ID, a list of allowed tools like “notion_search” and “jira_search_jql”, and the model “gpt-4.1”.]

Compared to a JSON file, we can statically type the configuration, and it’s easy to extend over time. For example, we might want to extend slack_chat to restrict which channels a given bot is allowed to publish into, which would be easy enough. For most agents today, the one thing not under Git-version control is the prompts themselves, which are versioned by Notion. However, we can easily require specific agents to use prompts within the Git-managed repository for sensitive scenarios.
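As a hypothetical sketch of what that statically-typed configuration might look like (the field names, project keys, and channel IDs below are inferred from the screenshots and are assumptions, not the actual code):

# Hypothetical sketch of typed agent configuration; values are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class JiraAgentConfig:
    project_keys: tuple[str, ...]
    prompt_id: str                     # Notion page id holding the prompt
    allowed_tools: tuple[str, ...]
    model: str = "gpt-4.1"
    respond_to_issue: bool = False

@dataclass(frozen=True)
class SlackResponderConfig:
    channel_ids: tuple[str, ...]
    prompt_id: str
    allowed_tools: tuple[str, ...]
    model: str = "gpt-4.1"

AGENTS = [
    JiraAgentConfig(
        project_keys=("SUP",),                       # hypothetical project key
        prompt_id="notion-prompt-id",
        allowed_tools=("notion_search", "slack_chat"),
    ),
    SlackResponderConfig(
        channel_ids=("C0123456789",),                # e.g. #eng-new-hires
        prompt_id="notion-prompt-id",
        allowed_tools=("notion_search", "jira_search_jql"),
    ),
]

Because configuration like this is plain typed code, a typechecker and code review can both catch a misspelled tool name before it ships.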

After passing tests, linting and typechecking, the configurations are automatically deployed.

Resolving foreign keys

It’s sort of funny to mention, but one thing that has really interfered with easily writing effective prompts in practice is making it easy to write things like @Will Larson and have it translated into <@U12345>, or whatever the appropriate Slack identifier is for a given user, channel, or user group. The same problem exists for Jira groups, Notion pages and databases, and so on.

This is a good example of where centralizing prompts is useful. I got comfortable pulling the unique identifiers myself, but it became evident that most others were not. This eventually ended with three tools for Slack resolution: slack_lookup which takes a list of references to lookup, slack_lookup_prefix which finds all Slack entities that start with a given prefix (useful to pull all channels or groups starting with @oncall-, for example, rather than having to hard-code the list in your prompt), and slack_search_name which uses string-distance to find potential matches (again, useful for dealing with typos).
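As a minimal sketch of the string-distance piece, using difflib from the standard library; the cache shape (display name to Slack ID) is an assumption:

# Sketch of a slack_search_name-style lookup over a locally maintained cache.
import difflib

def slack_search_name(name: str, cache: dict[str, str], n: int = 5) -> list[tuple[str, str]]:
    """Return up to n (name, slack_id) pairs whose names are closest to `name`."""
    by_lower = {k.lower(): k for k in cache}
    matches = difflib.get_close_matches(name.lower(), list(by_lower), n=n, cutoff=0.6)
    return [(by_lower[m], cache[by_lower[m]]) for m in matches]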

If this sounds bewildering, it’s largely the result of Slack not exposing relevant APIs for this sort of lookup. Slack’s APIs want to use IDs to retrieve users, groups and channels, so you have to maintain your own cache of these items to perform a lookup. Performing the lookups, especially for users, is itself messy. Slack users have a minimum of three ways they might be referenced: user.profile.display_name, user.name, and user.real_name, only a subset of which are set for any given user. The correct logic here is, as best I can tell, to find a match against user.profile.display_name, then use that if it exists. Then do the same for user.name and finally user.real_name. If you take the first user that matches one of those three, you’ll use the wrong user in some scenarios.
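In code, that precedence looks roughly like the following; users is assumed to be a locally cached list of Slack user objects (e.g. from users.list), and this is a sketch of the matching order rather than our actual implementation.

# Check every user against display_name first, then name, then real_name.
# Interleaving the fields per-user (taking the first user that matches any
# field) picks the wrong person in some scenarios.
from typing import Optional

def resolve_user(users: list[dict], reference: str) -> Optional[dict]:
    wanted = reference.lstrip("@").strip().lower()
    for fieldname in ("display_name", "name", "real_name"):
        for user in users:
            if fieldname == "display_name":
                value = user.get("profile", {}).get("display_name") or ""
            else:
                value = user.get(fieldname) or ""
            if value and value.lower() == wanted:
                return user
    return None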

In addition to providing tools to LLMs for resolving names, I also have a final mandatory check for each response to ensure the returned references refer to real items. If not, I inject which ones are invalid into the context window and perform an additional agent loop with only entity-resolution tools available. This feels absurd, but it was only at this point that things really started working consistently.
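The check itself can be quite small. Here’s a sketch that assumes Slack mrkdwn references like <@U123ABC> and <#C123ABC>; the known-ID set and the corrective agent loop are placeholders for the real framework pieces.

# Find Slack entity references in a drafted response that don't resolve to a
# known id. If any come back, the framework injects the list into the context
# window and runs one more agent loop with only entity-resolution tools.
import re

REFERENCE_PATTERN = re.compile(r"<[@#]([A-Z0-9]+)(?:\|[^>]*)?>")

def find_invalid_references(response: str, known_ids: set[str]) -> list[str]:
    return [ref for ref in REFERENCE_PATTERN.findall(response) if ref not in known_ids]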


As an aside, I was embarrassed by these screenshots, and earlier today I made the same changes for Notion pages and databases as I had previously made for Slack.

Formatting

Similar to foreign entity resolution, there’s a formatting problem with Slack’s mrkdwn variant of Markdown and JIRA’s Atlassian Document Format: they’re both strict.

The tools that call into those APIs now have strict instructions on formatting. These had been contained in individual prompts, but they started showing up in every prompt, so I knew I needed to bring them into the agent framework itself rather than forcing every prompt-author to understand the problem.

My guess is that I need to add a validation step similar to the one I added for entity resolution, and that until I do so, I’ll continue to have a small number of very infrequent but annoying rendering issues. To be honest, I personally don’t mind the rendering issues, but they create a lot of uncertainty for others using agents, so I think solving them is a requirement.

Logging and debugging

Today, all logs, especially tool usage, are fed into two places. First, they go into Datadog for full logging visibility. Second, and perhaps more usefully for non-engineers, they feed into a Slack channel, #ai-logs, which creates visibility into which tools are used and with which (potentially truncated) parameters.

Longer term, I imagine this will be exposed via a dedicated internal web UX, but generally speaking I’ve found that the subset of folks who are actively developing agents are pretty willing to deal with a bit of cruft. Similarly, the folks who aren’t developing agents directly don’t really care: they want it to work perfectly every time, and aren’t spending time looking at logs.

Biggest remaining gap: universal platform for accessing user-scope MCP servers

The biggest internal opportunity that I see today is figuring out how to get non-engineers an experience equivalent to running Claude Code locally with all their favorite MCP servers plugged in. I’ve wanted ChatGPT or Claude.ai to provide this, but they don’t quite get there. Claude Desktop is close, but it’s somewhat messy to configure as we think about finding a tool that we can easily allow everyone internally to customize and use on a daily basis.

I’m still looking for what the right tool is here. If anyone has any great suggestions that we can be somewhat confident will still exist in two years, and don’t require sending a bunch of internal data to a very early stage company, then I’m curious to hear!

What’s next?

You’re supposed to start a good conclusion with some sort of punchy anecdote that illuminates your overall thesis in a new way. I’m not sure if I can quite meet that bar, but the four most important ideas for me are:

  1. We are still very early on AI adoption, so focusing on rate of learning is more valuable than anything else
  2. If you want to lead an internal AI initiative, you simply must be using the tools, and not just ChatGPT, but building your own tool-using agent using only an LLM API
  3. My experience is that real AI adoption on real problems is a complex blend of: domain context on the problem, domain experience with AI tooling, and old-fashioned IT issues. I’m deeply skeptical of any initiative for internal AI adoption that doesn’t anchor on all three of those. This is an advantage of earlier stage companies, because you can often find aspects of all three in a single person, or at least across two people. In larger companies, you need three different organizations doing this work together, which is just objectively hard
  4. I think model selection matters a lot, but there are only 2-3 models you need at any given moment in time, and someone can just tell you what those 2-3 models are at any given moment. For example, GPT-4.1 is just exceptionally good at following rules quickly. It’s a great model for most latency-sensitive agents

I’m curious what other folks are finding!