2025-12-27 00:00:00
Although my model of choice for most internal workflows remains ChatGPT 4.1 for its predictable speed and high adherence to instructions, even its 1,047,576-token context window can run out of space. When that happens, your agent either needs to give up, or it needs to compact that large context window into a smaller one. Here are our notes on implementing compaction.
This is part of the Building an internal agent series.
Long-running workflows with many tool calls or user messages, along with any workflow dealing with large files, often run out of space in their context window. Although context window exhaustion isn't relevant in most internal agent use cases, it's ultimately not possible to implement a robust, reliable agent without solving this problem, and compaction is a straightforward solution.
Initially, in the beautiful moment when we assumed compaction wouldn't be a relevant concern for our internal workflows, we implemented an extremely naive solution: if we ever ran out of tokens, we discarded older tool responses until we had more space, then continued. Because we rarely hit that path, the fact that it worked poorly wasn't a major issue, but eventually the inelegance began to weigh on me as we started dealing with more workflows involving large files.
When we brainstormed our second iteration of compaction, I initially got anchored on the beautiful idea that compaction should be sequenced after implementing support for sub-agents, but I was never able to ground that intuition in a concrete reason why it was necessary, so we implemented compaction without sub-agent support.
The gist of our approach to compaction is:
After every user message (including tool responses), add a system message with the consumed and available tokens
in the context window. In that system message, we also include the updated list of available files that can
be read from
User messages and tool responses greater than 10,000 tokens are exposed as a new “virtual file”, with only their first 1,000 tokens included in the context window. The agent must use file manipulation tools to read more than those first 1,000 tokens (both 1k and 10k are configurable values)
Add a set of “base tools” that are always available to agents, specifically including the virtual file manipulation tools,
as we’d finally reached a point where most agents simply could not operate without a large number of mostly invisible internal
tools. These tools were file_read, which can read entire files, line ranges within a file, or byte ranges within a file,
and file_regex which is similar but performs a regex scan against a file up to a certain number of matches.
Every use of a file is recorded in the files data, so the agent knows what has and hasn’t been read into
the context window (particularly relevant for preloaded files), along the lines of:
<files>
  <file id='a' name='image.png' size='32kb'>
    <file_read />
    <file_read start_line='10' end_line='20' />
  </file>
</files>
This was surprisingly annoying to implement cleanly, mostly because I came onto this idea after iteratively building the agent as a part-time project for several months. If I could start over, I would start with files as a core internal construct, rather than adding it on later.
If a message pushed us over 80% (a configurable value) of the model's available context window, use the compaction prompt that Reddit claims Claude Code uses (see the sketch after this list). The prompt isn't particularly special; it just already exists and seems pretty good
After compacting, add the prior context window as a virtual file to allow the agent to retrieve pieces of context that it might have lost
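To make the combination concrete, here's a minimal sketch of how these pieces could fit together, assuming a crude length-based token estimate and a summarize() placeholder standing in for the compaction-prompt call; the names and structure are illustrative rather than the actual implementation:
from dataclasses import dataclass, field

COMPACTION_THRESHOLD = 0.8       # compact at 80% of the model's context window
VIRTUAL_FILE_THRESHOLD = 10_000  # tokens; larger messages become virtual files
VIRTUAL_FILE_PREVIEW = 1_000     # tokens of an oversized message kept inline

def estimate_tokens(text: str) -> int:
    # Crude approximation for illustration; use a real tokenizer in practice.
    return len(text) // 4

@dataclass
class AgentContext:
    context_window: int
    messages: list = field(default_factory=list)
    files: dict = field(default_factory=dict)  # file_id -> content

    def tokens_used(self) -> int:
        return sum(estimate_tokens(m) for m in self.messages)

    def add_virtual_file(self, name: str, content: str) -> str:
        file_id = f"f_{len(self.files)}_{name}"
        self.files[file_id] = content
        return file_id

    def append(self, message: str) -> None:
        # Oversized messages become virtual files, with only a preview kept inline.
        if estimate_tokens(message) > VIRTUAL_FILE_THRESHOLD:
            file_id = self.add_virtual_file("oversized-message", message)
            preview = message[: VIRTUAL_FILE_PREVIEW * 4]
            message = f"{preview}\n[truncated; read file {file_id} for the rest]"
        self.messages.append(message)

        # After every user message or tool response, report the token budget and
        # the current list of readable files in a system message.
        available = self.context_window - self.tokens_used()
        file_list = "\n".join(
            f"- {file_id} ({estimate_tokens(content)} tokens)"
            for file_id, content in self.files.items()
        )
        self.messages.append(
            f"[system] {self.tokens_used()} tokens used, {available} available.\n"
            f"Files:\n{file_list}"
        )

        # Past the compaction threshold, summarize the window and keep the prior
        # context around as a virtual file the agent can still read from.
        if self.tokens_used() > COMPACTION_THRESHOLD * self.context_window:
            prior = "\n".join(self.messages)
            prior_id = self.add_virtual_file("pre-compaction-context", prior)
            summary = summarize(prior)  # placeholder for the compaction prompt call
            self.messages = [
                f"[system] {summary}",
                f"[system] Prior context is available as file {prior_id}.",
            ]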
Each of these steps is quite simple, but in combination they really do provide a fair amount of power for handling complex, prolonged workflows. Admittedly, we still have a configurable cap on the number of tool calls allowed in a workflow (to avoid agents spinning out), but this means that agents dealing with large or complex data are much more likely to succeed usefully.
Whereas most of our new internal agent features have obvious problems or obvious next iterations, this one feels like it's good enough to forget about for a long, long time. There are two reasons for this: first, most of our workflows don't require large context windows, and, second, honestly this seems to work quite well.
If context windows get significantly larger in the future, which I don't see much evidence of happening at the moment, then we will simply increase some of the default values to use more tokens, but the core algorithm here seems good enough.
2025-12-26 23:00:00
One of the most useful initial extensions I made to our workflows was injecting associated images into the context window automatically, to improve the quality of responses to tickets and messages that relied heavily on screenshots. This was quick and made the workflows significantly more powerful.
More recently, there are a number of workflows attempting to operate on large complex files like PDFs or DOCXs, and the naive approach of shoving them into the context window hasn’t worked particularly well. This post explains how we’ve adapted the principle of progressive disclosure to allow our internal agents to work with large files.
This is part of the Building an internal agent series.
Progressive disclosure is the practice of limiting what is added to the context window to the minimum necessary amount, and adding more detail over time as necessary.
A good example of progressive disclosure is how agent skills are implemented:
SKILL.md is loaded on demand
SKILL.md can specify other files to be further loaded as helpful
In our internal use-case, we have skills for JIRA formatting, Slack formatting, and Notion formatting. Some workflows require all three, but the vast majority of workflows require at most one of these skills, and it's straightforward for the agent to determine which are relevant to a given task.
File management is a particularly interesting progressive disclosure problem, because files are so helpful in many scenarios, but are also so very large. For example, requests for help in Slack are often along the lines of “I need help with this login issue”, with a screenshot or large file attached.
Our high-level approach to the large-file problem is as follows:
Always include metadata about available files in the prompt, similar to the list of available skills. This will look something like:
Files:
- id: f_a1
name: my_image.png
size: 500,000
preloaded: false
- id: f_b3
name: ...
The key thing is that each id is a reference that the agent is able to pass
to tools. This allows it to operate on files without loading their context into
the context window.
Automatically preload the first N kb of files into the context window,
as long as they are appropriate mimetypes for loading (png, pdf, etc).
This is per-workflow configurable, and could be set as low as 0
if a given workflow didn’t want to preload any files.
I’m still of two minds about whether preloading is worth doing, as it takes some control away from the agent.
Provide three tools for operating on files:
load_file(id) loads an entire file into the context window
peek_file(id, start, stop) loads a section of a file into the context window
extract_file(id) transforms PDFs, PPTs, DOCX and so on into simplified textual versions (see the sketch after this list)
Provide a large_files skill which explains how and when to use the above tools to work with large files. Generally, it encourages using extract_file on any PDF, DOCX or PPT file that it wants to work with, and otherwise loading or peeking depending on the available space in the context window.
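For illustration, the three file tools above might look roughly like this; the files mapping, fetch_bytes, and extract_text are hypothetical stand-ins for real storage and document-extraction code:
# A minimal sketch of the three file tools; fetch_bytes and extract_text are
# hypothetical helpers, not the actual implementation.
def load_file(files: dict, file_id: str) -> str:
    """Load an entire file into the context window."""
    record = files[file_id]  # e.g. {"path": ..., "mimetype": ...}
    return fetch_bytes(record["path"]).decode("utf-8", errors="replace")

def peek_file(files: dict, file_id: str, start: int, stop: int) -> str:
    """Load only a byte range of a file into the context window."""
    record = files[file_id]
    return fetch_bytes(record["path"])[start:stop].decode("utf-8", errors="replace")

def extract_file(files: dict, file_id: str) -> str:
    """Transform PDFs, PPTs, DOCX and so on into a simplified textual version."""
    record = files[file_id]
    return extract_text(fetch_bytes(record["path"]), mimetype=record["mimetype"])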
This approach was quick to implement, and provides significantly more control to the agent to navigate a wide variety of scenarios involving large files. It’s also a good example of how the “glue layer” between LLMs and tools is actually a complex, sophisticated application layer rather than merely glue.
This has worked well. In particular, one of our internal workflows is oriented around
giving feedback about documents attached to a ticket, in comparison to other
similar, existing documents. That workflow simply did not work at all
prior to this approach, and now works fairly well without workflow-specific
support for handling these sorts of large files,
because the large_files skill handles that in a reusable fashion without
workflow authors being aware of it.
Generally, this feels like a stand-alone set of functionality that doesn’t require significant future investment, but there are three places where we will need to continue building:
extract_file should be modified to return a referencable, virtual file_id that is used with peek_file and load_file rather than returning contents directly. This would make for a more robust tool even when extracting from very large files. In practice, extracted content has always been quite compact.
Our extraction currently avoids pulling heavier parsing dependencies like lxml into it, and at some point we might.
Altogether, a very helpful extension for our internal workflows.
2025-12-26 22:00:00
When Anthropic introduced Agent Skills, I was initially a bit skeptical of the problem they solved–can we just use prompts and tools?–but I’ve subsequently come to appreciate them, and have explicitly implemented skills in our internal agent framework. This post talks about the problem skills solves, how the engineering team at Imprint implemented them, how well they’ve worked for us, and where we might work with them next.
This is part of the Building an internal agent series.
Agent Skills are a series of techniques that solve three important workflow problems:
All three of these problems initially seemed very insignificant when we started building out our internal workflows,
but once the number of internal workflows reached into the dozens, all three became difficult to manage.
Without reusable snippets, I lost the leverage to improve all workflows at once, and without progressive disclosure
the agents would get a vast amount of irrelevant content that could confuse them, particularly when it came to things
like inconsistencies between Markdown and Slack’s mrkdwn formatting language, both of which are important to different
tools used by our workflows.
As a disclaimer, I recognize that it’s not necessary to implement agent skills, as you can integrate with e.g. Claude’s Agent Skills support for APIs. However, one of our design decisions is being largely platform agnostic, such that we can switch across model providers, and consequently we decided to implement skills within our framework.
With that out of the way, we started implementing by reviewing the Agent Skills documentation at agentskills.io, and cloning their Python reference implementation skills-ref into our repository to make it accessible to Claude Code.
The resulting implementation has these core features:
Skills live in a skills/ directory in the repository, with each skill consisting of its own sub-directory
with a SKILL.md
Each skill is a Markdown file with metadata along these lines:
---
name: pdf-processing
description: Extract text and tables...
metadata:
  author: example-org
  version: "1.0"
---
The list of available skills–including their description from metadata–is injected into the system prompt at the beginning of each workflow,
and the load_skills tool is available to the agent to load a skill's entire SKILL.md into the context window (see the sketch after this list).
Updated workflow configuration to optionally specify required, allowed, and prohibited skills to modify the list of exposed skills injected into the system prompt.
My guess is that requiring specific skills for a given workflow is a bit of an anti-pattern, “just let the agent decide!”, but it was trivial to implement and the sort of thing that I could imagine is useful in the future.
Used the Notion MCP to retrieve all the existing prompts in our prompt repository, identify existing implicit skills in the prompts we had created, write those initial skills, and identify which Notion prompts to edit to eliminate the now redundant sections of their prompts.
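As a rough sketch of the skill discovery and injection described above (assuming skills live in skills/{name}/SKILL.md with the YAML frontmatter shown earlier, that PyYAML is available, and that the function names are illustrative rather than the actual implementation), the core loop could look like:
from pathlib import Path
import yaml  # PyYAML, assumed available

def discover_skills(root: str = "skills") -> dict:
    # Read each skills/{name}/SKILL.md and parse its frontmatter.
    skills = {}
    for skill_md in Path(root).glob("*/SKILL.md"):
        text = skill_md.read_text()
        _, frontmatter, body = text.split("---", 2)
        meta = yaml.safe_load(frontmatter)
        skills[meta["name"]] = {"description": meta.get("description", ""), "body": body}
    return skills

def exposed_skills(skills: dict, required=(), allowed=None, prohibited=()) -> dict:
    # Apply the workflow's required / allowed / prohibited configuration.
    names = set(required) | (set(allowed) if allowed is not None else set(skills))
    names -= set(prohibited)
    return {name: skills[name] for name in names if name in skills}

def skills_prompt(skills: dict) -> str:
    # Injected into the system prompt; the agent calls load_skills to pull
    # a skill's full SKILL.md into the context window.
    lines = [f"- {name}: {meta['description']}" for name, meta in sorted(skills.items())]
    return "Available skills (use load_skills to read one in full):\n" + "\n".join(lines)

def load_skills(skills: dict, names: list[str]) -> str:
    return "\n\n".join(skills[name]["body"] for name in names if name in skills)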
Then we shipped it into production.
Humans make mistakes all the time. For example, I’ve seen many dozens of JIRA tickets from humans that don’t explain the actual problem they are having. People are used to that, and when a human makes a mistake, they blame the human. However, when agents make a mistake, a surprising percentage of people view it as a fundamental limitation of agents as a category, rather than thinking that, “Oh, I should go update that prompt.”
Skills have been extremely helpful as the tool to continue refining down these edge cases
where we’ve relied on implicit behavior because specifying the exact behavior was simply overwhelming.
As one example, we ask that every Slack message end with a link to the prompt that drove the
response. That always worked, but the details of the formatting would vary in an annoying, distracting
way: sometimes it would be the equivalent of [title](link), sometimes link, sometimes [link](link).
With skills, it is now (almost always) consistent, without anyone thinking to include those instructions
in their workflow prompts.
Similarly, handling large files requires a series of different tools that benefit from In-Context Learning (aka ICL, which is a fancy term for including a handful of examples of correct and incorrect usage), which absolutely no one is going to add to their workflow prompt but is extremely effective at improving how the workflow uses those tools.
For something that I was initially deeply skeptical about, I now wish I had implemented skills much earlier.
While our skills implementation is working well today, there are a few opportunities I’d like to take advantage of in the future:
Add a load_subskill tool to support files in skills/{skill}/* beyond the SKILL.md.
So far, this hasn’t been a major blocker, but as some skills get more sophisticated,
the ability to split varied use-cases into distinct files would improve our ability
to use skills for progressive disclosure
One significant advantage that Anthropic has over us is their sandboxed Python interpreter, which allows skills to include entire Python scripts to be specified and run by tools. For example, a script for parsing PDFs might be included in a skill, which is extremely handy. We don’t currently have a sandboxed interpreter handy for our agents, but this could, in theory anyway, significantly cut down on the number of custom skills we need to implement.
At a minimum, it would do a much better job at operations that require reliable math versus relying on the LLM to do its best at performing math-y operations.
I think both of these are actually pretty straightforward to implement. The first is just a simple feature that Claude could implement in a few minutes. The latter feels annoying to implement, but could also be implemented in less than an hour by running a second lambda running Nodejs with Pyodide, and exposing access to that lambda as a tool. It’s just so inelegant for a Python process to call a Nodejs process to run sandboxed Python that I haven’t done it quite yet.
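If we did eventually build it, the tool side could be as small as the sketch below; the endpoint, payload shape, and lack of auth are assumptions, since this doesn't exist yet:
# Hypothetical sketch of exposing a sandboxed-Python lambda as a tool; the
# endpoint and response shape are assumptions, not an existing service.
import json
import urllib.request

SANDBOX_URL = "https://example.internal/sandbox"  # hypothetical Nodejs+Pyodide lambda

def run_python(code: str, timeout_seconds: int = 30) -> str:
    """Tool: run untrusted Python in the sandboxed interpreter and return its stdout."""
    payload = json.dumps({"code": code, "timeout": timeout_seconds}).encode("utf-8")
    request = urllib.request.Request(
        SANDBOX_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read())["stdout"]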
2025-12-19 01:00:00
Yet another edition of my annual recap! This year brought my son to kindergarten, me to forty and to a new job at Imprint, my fourth book to bookstores, and a lot more time in the weeds of developing software.
Previously: 2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017
Evaluating my goals for this year and decade:
[Completed] Write at least four good blog posts each year.
Moving from an orchestration-heavy to leadership-heavy management role, Good engineering management is a fad, What is the competitive advantage of authors in the age of LLMs?, Facilitating AI adoption at Imprint
[Completed] Write three books about engineering or leadership in the 2020s.
This year I finished Crafting Engineering Strategy with O’Reilly. This is my third engineering book in the 2020s. More about this in the Writing section below.
[Completed] Do something substantial and new every year that provides new perspective or deeper practice.
After almost a decade of not submitting a substantial pull request at work, I’ve been back in the mix since joining Imprint. I’ve submitted a solid handful of real pull requests that implement production features, and have used Claude Code widely in their creation. I’ve missed this a lot, and have learned a bunch about developing software with LLMs.
[In progress] 20+ folks who I’ve managed or meaningfully supported move into VPE or CTO roles at 50+ person or $100M+ valuation companies.
This is a decade goal ending in 2029. I previously increased the goal in 2022 from 3-5 to 20.
In 2024, the count was at 10.
Things haven’t moved too much since then, but I’ll refresh next year.
I think that I’m on track, but I will say that I think getting into these roles is markedly harder than it was three years ago. There are just fewer of these roles available recently, and they tend to be both more demanding and more difficult than the standard VPE/CTO role a few years ago.
For backstory on these goals: I originally set them in 2019, and then revised them in 2022. I’ve come to believe that I should be revising these every year, but also that it’s not that interesting to revise them every year. I’ll revise them again in a few years.
I finished my fourth book, Crafting Engineering Strategy, and wrote some notes on writing it. I’m really excited for this book to be done, because I think it’s been a missing book in the industry, and I hope it will change how the industry thinks about “engineering strategy.” In particular, I hope it’ll pull us away from the frequent gripe that “we have no engineering strategy!” You do have an engineering strategy, it’s just not written down yet.
As part of finishing this book, I’ve also recognized that if I write another book, it will be far into the future. After publishing four books in six years, I’m booked out, and I’m pretty sure I’ve tapped out my last decade’s path of writing books to advance the industry. I’ll definitely keep writing, but it’ll be posts focused on the stuff I’m concretely working on, without trying to map them into a larger book structure.
(Last year I mentioned adding The High-Context Triad to a second edition of Staff Engineer, which I still plan to do, but I’m not quite sure when. Probably in a few years.)
I left Carta in May after two years there, and joined Imprint. Imprint has just been a lot of fun for me. I’ve written a small number of real pull requests that implement meaningful things. That’s something I haven’t done since working at Uber, and aligns with my desire to be working in the details again. There’s nothing more energizing to me than getting to solve real, concrete problems, and that’s exactly the sort of job Imprint has been for me. I just hadn’t spent time on stuff like implementing internal workflow agents or automatically merging Dependabot pull requests in a long time, and I missed it.
It’s also, after some years spent on making teams more efficient, been an opportunity to really hire again, which I haven’t gotten to do since my first couple years at Calm. It’s never easy working at a fast growing company, but you do learn a lot, and quite quickly.
My son entered kindergarten this year. I turned 40. My wife is starting to explore the world of fractional software development, and she’s figuring out its rules. We’ve had a fair amount of health issues in the immediate and extended family, but altogether everything is going well.
I didn’t do much public speaking, although I spoke on Book Overflow about Staff Engineer, which was a fun discussion.
I also spoke at several private events, and recorded practice runs on YouTube of Good engineering management is a fad and CTOs must earn the right to specialize. Those are very similar talks, where I’ve been iterating on the core idea of how engineering managers need to adapt to the current era.
In 2024, I read 27 profession-adjacent books. In 2023, I read 11. I’m not quite sure how many I read in 2022, because I put together a 2019-2022 professional reading recap, but it was about 50 over four years. This year I didn’t do much professional reading, mostly because I was too busy with the new job and polishing my most recent book.
What I did read was:
It’s interesting to note the drop in volume, but I feel fine about it. I don’t read to hit a goal, I read to learn or understand a particular problem, and found myself mostly working on topics that didn’t align well with that approach this year.
If you’ve written something about your year, send it my way!
2025-12-18 23:00:00
One of the recurring themes of software development is patching security issues. Most repository hosting services have fairly good issue reporting at this point, but many organizations still struggle to apply those fixes in a timely fashion. This past week we were discussing how to reduce the overhead of this process, and I was curious: can you just auto-merge Github Dependabot pull-requests?
It turns out the answer is yes, and it works pretty well. You get control over which types of updates (patches, minor updates, major updates, etc) you want to auto-merge, and it will also respect your automated checks. If you have great CI/CD that runs blocking linting, typing and tests, then this works particularly well. If you don’t, then, well, this will be an effective mechanism to get you to having good linting, typing, and tests after traversing a small ocean of tears.
I got this running for about a dozen repositories at work over the past few days, but I’ll show an example of setting up the same mechanism for my blog.
First, add a .github/workflows/dependabot-auto-merge.yml file to your repository
that looks like this:
# Automatically approve and merge Dependabot PRs for minor and patch updates
name: Dependabot auto-merge
on: pull_request
permissions:
  contents: write
  pull-requests: write
jobs:
  dependabot:
    runs-on: ubuntu-latest
    if: github.event.pull_request.user.login == 'dependabot[bot]' && github.repository == 'lethain/irrational_hugo'
    steps:
      - name: Dependabot metadata
        id: metadata
        uses: dependabot/fetch-metadata@v2
        with:
          github-token: "${{ secrets.GITHUB_TOKEN }}"
      - name: Approve Dependabot PR
        run: gh pr review --approve "$PR_URL"
        env:
          PR_URL: ${{ github.event.pull_request.html_url }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Enable auto-merge for Dependabot PRs
        if: steps.metadata.outputs.update-type == 'version-update:semver-patch' || steps.metadata.outputs.update-type == 'version-update:semver-minor'
        run: gh pr merge --auto --squash "$PR_URL"
        env:
          PR_URL: ${{ github.event.pull_request.html_url }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Then go to your repository settings (something like
https://github.com/lethain/irrational_hugo/settings), and enable auto-merging
for your repository. This still respects all required branch rules, like required test passes
or approvals, etc.

Then make sure you have appropriate status checks for whatever linting, typing and tests you have in your repository.

Then enable Dependabot (something like https://github.com/lethain/irrational_hugo/settings/security_analysis).
Even the default settings are just fine.
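If you'd rather configure Dependabot explicitly in the repository instead of relying on the settings UI, a minimal .github/dependabot.yml looks something like this (the ecosystems and cadence are just examples):
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"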

Then you’re done. The PRs from Dependabot will automatically merge going forward. There are lots of nuances here–I already found one PR that automatically merged despite a real issue, because of a missing test–but ultimately I think that’s valuable pressure to improve testing quality, rather than a reason to avoid, or backtrack on, the approach.
2025-12-07 23:00:00
I’ve been working on internal “AI” adoption, which is really LLM-tooling and agent adoption, for the past 18 months or so. This is a problem that I think is, at minimum, a side-quest for every engineering leader in the current era. Given the sheer number of folks working on this problem within their own company, I wanted to write up my “working notes” of what I’ve learned.
This isn’t a recommendation about what you should do, merely a recap of how I’ve approached the problem thus far,
and what I’ve learned through ongoing iteration. I hope the thinking here will be useful to you, or at least validates
some of what you’re experiencing in your rollout. The further you read, the more specific this will get,
ending with cheap-turpentine-esque topics like getting agents to reliably translate human-readable text representations of Slack entities into mrkdwn formatting of the correct underlying entity.
I am hiring: If you’re interested in working together with me on internal agent and AI adoption at Imprint, we are hiring our founding Senior Software Engineer, AI. The ideal candidate is a product engineer who’s spent some time experimenting with agents, and wants to spend the next year or two digging into this space.
As technologists, I think one of the basics we owe our teams is spending time working directly with new tools to develop an intuition for how they do, and don’t work. AI adoption is no different.
Towards that end, I started with a bit of reading, especially Chip Huyen’s AI Engineering, and then dove into a handful of bounded projects: building my own rudimentary agent platform using Claude Code for implementation, creating a trivial MCP for searching my blog posts, and an agent to comment on Notion documents.
Each of these projects was two to ten hours, and extremely clarifying. Tool use is, in particular, something that seemed like magic until I implemented a simple tool-using agent, at which point it became something extremely non-magical that I could reason about and understand.
Imprint’s general approach to refining AI adoption is strategy testing: identify a few goals, pick an initial approach, and then iterate rapidly in the details until the approach genuinely works. In an era of crushing optics, senior leaders immersing themselves in the details is one of our few defenses.

Shortly after joining, I partnered with the executive team to draft the above strategy for AI adoption. After a modest amount of debate, the pillars we landed on were:
As you see from those principles, and my earlier comment, my biggest fear for AI adoption is that teams can focus on creating the impression of adopting AI, rather than focusing on creating additional productivity. Optics are a core part of any work, but almost all interesting work occurs where optics and reality intersect, which these pillars aimed to support.
As an aside, in terms of the components of strategy in Crafting Engineering Strategy, this is really just the strategy’s policy. In addition, we used strategy testing to refine our approach, defined a concrete set of initial actions to operationalize it (they’re a bit too specific to share externally), and did some brief exploration to make sure I wasn’t overfitting on my prior work at Carta.
My first step towards adoption was collecting as many internal examples of tips and tricks as possible into a single Notion database. I took a very broad view on what qualified, with the belief that showing many different examples of using tools–especially across different functions–is both useful and inspiring.

I’ve continued extending this, with contributions from across the company, and it’s become a useful resource for both humans and bots alike to provide suggestions on approaching problems with AI tooling.
One of my core beliefs in our approach is that making prompts discoverable within the company is extremely valuable. Discoverability solves four distinct problems:
My core approach is that every agent’s prompt is stored in a single Notion database which is readable by everyone in the company. Most prompts are editable by everyone, but some have editing restrictions.
Here’s an example of a prompt we use for routing incoming Jira issues from Customer Support to the correct engineering team.

Here’s a second example, this time of responding to requests in our Infrastructure Engineering team’s request channel.

Pretty much all prompts end with an instruction to include a link to the prompt in the generated message. This ensures it’s easy to go from a mediocre response to the prompt driving that response, so that you can fix it.
In addition to collecting tips and prompts, the next obvious step for AI adoption is identifying a standard AI platform to be used within the company, e.g. ChatGPT, Claude, Gemini or what not.
We’ve gone with OpenAI for everyone. In addition to standardizing on a platform, we made sure account provisioning was automatic and in place on day one. To the surprise of no one who’s worked in or adjacent to IT, a lot of revolutionary general AI adoption is… really just account provisioning and access controls. These are the little details that can so easily derail the broader plan if you don’t dive into them.
Within Engineering, we also provide both Cursor and Claude. That said, the vast majority of our Claude usage is done via AWS Bedrock, which we use to power Claude Code… and we use Claude Code quite a bit.
While there’s a general industry push towards adopting more AI tooling, I find that a significant majority of “AI tools” are just SaaS vendors that talk about AI in their marketing pitches. We have continued to adopt vendors, but have worked internally to help teams evaluate which “AI tools” are meaningful.
We’ve spent a fair amount of time going deep on integrating with AI tooling for chat and IVR tooling, but that’s a different post entirely.
Measuring AI adoption is, like all measurement topics, fraught. Altogether, I’ve found measuring tool adoption very useful for identifying the right questions to ask. Why haven’t you used Cursor? Or Claude Code? Or whatever? These are fascinating questions to dig into. I try to look at usage data at least once a month, with a particular focus on two questions:
At the core, I believe folks who aren’t adopting tools are rational non-adopters, and spending some time understanding the (appearance of) resistance goes further than a top-down mandate. I think it’s often an education gap that is bridged easily enough. Conceivably, at some point I’ll discover a point of diminishing returns, where progress is stymied by folks who are rejecting AI tooling–or because the AI tooling isn’t genuinely useful–but I haven’t found that point yet.
The next few sections are about building internal agents. The core implementation is a single stateless lambda which handles a wide variety of HTTP requests, similar-ish to Zapier. This is currently implemented in Python, and is roughly 3,000 lines of code, much of it dedicated to oddities like formatting Slack messages, etc.
For the record, I did originally attempt to do this within Zapier, but I found that Zapier simply doesn’t facilitate the precision I believe is necessary to do this effectively. I also think that Zapier isn’t particularly approachable for a non-engineering audience.
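The overall shape is a single entrypoint that routes inbound webhooks to workflows; the routes and the run_workflow helper below are illustrative, not the actual implementation:
# Illustrative shape of a stateless lambda entrypoint dispatching inbound
# webhooks to workflows; route names and run_workflow are assumptions.
import json

def handler(event, context):
    path = event.get("rawPath", "")
    body = json.loads(event.get("body") or "{}")

    if path == "/slack/events":
        result = run_workflow("slack-responder", body)
    elif path == "/jira/webhook":
        result = run_workflow("jira-router", body)
    else:
        return {"statusCode": 404, "body": "unknown route"}

    return {"statusCode": 200, "body": json.dumps(result)}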
As someone who spent a long time working in platform engineering, I still want to believe that you can build a platform, and users will come. Indeed, I think it’s true that a small number of early adopters will come, if the problem is sufficiently painful for them, as was the case for Uber’s service migration (2014).
However, what we’ve found effective for driving adoption is basically the opposite of that. What’s really worked is the intersection of platform engineering and old-fashioned product engineering:
Some examples of the projects where we’ve gotten traction internally:
AGENTS.md files guiding use of tests, typechecking and linting
For all of these projects that have worked, the formula has been the opposite of “build a platform and they will come.” Instead it’s required deep partnership from folks with experience building AI agents and using AI tooling to make progress. The learning curve for effective AI adoption in important or production-like workflows remains meaningfully high.
Agents that use powerful tools represent a complex configuration problem.
First, exposing too many tools–especially tools that the prompt author doesn’t effectively understand–makes
it very difficult to create reliable workflows. For example, we have an exit_early command that allows terminating
the agent early: this is very effective in many cases, but also makes it easy to break your bot.
Similarly, we have a slack_chat command that allows posting across channels, which can support a variety of useful
workflows (e.g. warm-handoffs of a question in one channel into a more appropriate alternative),
but can also spam folks.
Second, as tools get more powerful, they can introduce complex security scenarios.
To address both of these, we currently store configuration in a code-reviewed Git repository. Here’s an example of a JIRA project.

Here’s another for specifying a Slack responder bot.

Compared to a JSON file, we can statically type the configuration, and it’s easy to extend over time.
For example, we might want to extend slack_chat to restrict which channels a given bot is allowed to
publish into, which would be easy enough.
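Since the configuration screenshots don't reproduce here, a sketch of what a statically typed configuration along these lines could look like; the field names and the SlackResponderConfig shape are assumptions, not the actual schema:
# Illustrative sketch of a statically typed agent configuration.
from dataclasses import dataclass, field

@dataclass
class SlackChatConfig:
    enabled: bool = False
    allowed_channels: list[str] = field(default_factory=list)  # easy future restriction

@dataclass
class SlackResponderConfig:
    name: str
    prompt_notion_id: str  # most prompts live in Notion
    channels: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)  # e.g. ["slack_lookup", "exit_early"]
    required_skills: list[str] = field(default_factory=list)
    slack_chat: SlackChatConfig = field(default_factory=SlackChatConfig)

# Hypothetical example of one bot's configuration.
INFRA_REQUESTS_BOT = SlackResponderConfig(
    name="infra-requests",
    prompt_notion_id="<notion-page-id>",
    channels=["#infra-requests"],
    tools=["slack_lookup", "jira_create", "exit_early"],
    required_skills=["slack-formatting"],
)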
For most agents today, the one thing not under Git-version control is the prompts themselves, which are versioned by Notion.
However, we can easily require specific agents to use prompts within the Git-managed repository for sensitive scenarios.
After passing tests, linting and typechecking, the configurations are automatically deployed.
It’s sort of funny to mention, but one thing that has in practice really interfered with
easily writing effective prompts is making it easy to write things like @Will Larson and
have it translated into <@U12345> or whatever the appropriate Slack identifier is for a given
user, channel, or user group. The same problem exists for Jira groups, Notion pages and databases,
and so on.
This is a good example of where centralizing prompts is useful. I got comfortable pulling the unique
identifiers myself, but it became evident that most others were not.
This eventually ended with three tools for Slack resolution: slack_lookup which takes a list
of references to lookup, slack_lookup_prefix which finds all Slack entities that start with
a given prefix (useful to pull all channels or groups starting with @oncall-, for example, rather than having to hard-code the list in your prompt), and slack_search_name which uses string-distance to find potential matches (again, useful for dealing with typos).
If this sounds bewildering, it’s largely the result of Slack not exposing relevant APIs for this sort of lookup.
Slack’s APIs want to use IDs to retrieve users, groups and channels, so you have to maintain your own cache of
these items to perform a lookup. Performing the lookups, especially for users, is itself messy. Slack users have
a minimum of three ways they might be referenced: user.profile.display_name, user.name, and user.real_name,
only a subset of which are set for any given user.
The correct logic here is, as best I can tell, to find a match against user.profile.display_name, then use that if it exists.
Then do the same for user.name and finally user.real_name. If you take the first user that matches one of those three,
you’ll use the wrong user in some scenarios.
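That matching priority, written out as a sketch (field by field across all users, rather than taking the first user that matches any field):
# Match display_name across all users first, then name, then real_name.
def resolve_slack_user(users: list[dict], reference: str) -> dict | None:
    reference = reference.lstrip("@").casefold()
    for field_getter in (
        lambda u: (u.get("profile") or {}).get("display_name"),
        lambda u: u.get("name"),
        lambda u: u.get("real_name"),
    ):
        for user in users:
            value = field_getter(user)
            if value and value.casefold() == reference:
                return user
    return None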
In addition to providing tools to LLMs for resolving names, I also have a final mandatory check for each response to ensure the returned references refer to real items. If not, I inject which ones are invalid into the context window and perform an additional agent loop with only entity-resolution tools available. This feels absurd, but it was only at this point that things really started working consistently.
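A sketch of that final check, with validate_references, ENTITY_TOOLS, and run_agent_loop as illustrative names rather than the actual implementation:
# Validate every returned reference; if any are invalid, loop once more with
# only the entity-resolution tools exposed.
def finalize_response(response: str, context: list[str]) -> str:
    invalid = validate_references(response)  # e.g. Slack IDs that don't resolve
    if not invalid:
        return response
    context.append(
        "[system] These references do not resolve to real entities: "
        + ", ".join(invalid)
        + ". Fix them using the entity-resolution tools."
    )
    return run_agent_loop(context, tools=ENTITY_TOOLS)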
As an aside, I was embarrassed by these screenshots, and earlier today I made the same changes for Notion pages and databases as I had previously for Slack.
Similar to foreign entity resolution, there’s a parallel problem with Slack’s mrkdwn variant of Markdown and JIRA’s Atlassian Document Format: they’re both strict.
The tools that call into those APIs now have strict instructions on formatting. These had been contained in individual prompts, but they started showing up in every prompt, so I knew I needed to bring them into the agent framework itself rather than forcing every prompt-author to understand the problem.
My guess is that I need to add a validation step similar to the one I added for entity-resolution, and that until I do so, I’ll continue to have a small number of very infrequent but annoying rendering issues. To be honest, I personally don’t mind the rendering issues, but they create a lot of uncertainty for others using agents, so I think solving them is a requirement.
Today, all logs, especially tool usage, are fed into two places. First, they go into Datadog for full logging visibility.
Second, and perhaps more usefully for non-engineers, they feed into a Slack channel, #ai-logs, which creates visibility
into which tools are used and with which (potentially truncated) parameters.
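A sketch of that double-write, with post_slack_message as an illustrative helper and Datadog picking up the structured logs via standard log forwarding:
# Log every tool call twice: structured for Datadog, truncated for #ai-logs.
import json
import logging

logger = logging.getLogger("agent.tools")
MAX_PARAM_CHARS = 500  # truncate noisy parameters before posting to Slack

def log_tool_call(workflow: str, tool: str, params: dict) -> None:
    # Full fidelity for Datadog via standard log forwarding.
    logger.info(json.dumps({"workflow": workflow, "tool": tool, "params": params}))
    # Truncated, human-readable copy for the #ai-logs channel.
    rendered = json.dumps(params)[:MAX_PARAM_CHARS]
    post_slack_message("#ai-logs", f"{workflow} called {tool} with {rendered}")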
Longer term, I imagine this will be exposed via a dedicated internal web UX, but generally speaking I’ve found that the subset of folks who are actively developing agents are pretty willing to deal with a bit of cruft. Similarly the folks who aren’t developing agents directly don’t really care, they want it to work perfectly every time, and aren’t spending time looking at logs.
The biggest internal opportunity that I see today is figuring out how to get non-engineers an experience equivalent to running Claude Code locally with all their favorite MCP servers plugged in. I’ve wanted ChatGPT or Claude.ai to provide this, but they don’t quite get there. Claude Desktop is close, but is somewhat messy to configure as we think about finding a tool that we can easily allow everyone internally to customize and use on a daily basis.
I’m still looking for what the right tool is here. If anyone has any great suggestions that we can be somewhat confident will still exist in two years, and don’t require sending a bunch of internal data to a very early stage company, then I’m curious to hear!
You’re supposed to start a good conclusion with some sort of punchy anecdote that illuminates your overall thesis in a new way. I’m not sure if I can quite meet that bar, but the four most important ideas for me are:
I’m curious what other folks are finding!