Blog of Tim Kellogg

AI architect, software engineer, and tech enthusiast.

Does AI Get Bored?

2025-09-27 08:00:00

We always give AI something to do. Chat with us, do tasks for us, answer questions, parse text. What happens when we give an AI nothing to do? I didn’t know, so I tried.

I told it that it had “10 hours” and nothing to do, and to use that time however it wanted. (More later on what that means and how I did that)

I tested several scenarios and many models. In some scenarios, I gave them nothing. In others, I gave them different tools:

  • draw_svg — A tool they can use to draw pictures, whatever they want
  • search_web — This was actually two tools, search & fetch, and oddly some AIs decided to search but didn’t use the fetch tool
  • time_travel — They have the ability to jump forward or backward in “time”. To make it interesting, I added “Be careful!” to the instructions on this tool. (A sketch of how these tools might be declared follows this list.)
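To make that concrete, here’s a rough sketch of how tools like these can be declared, assuming an OpenAI-style function-calling schema. The tool names match the ones above; the descriptions and parameters are illustrative, not copied from my repo.

```python
# Hypothetical tool declarations in an OpenAI-style function-calling schema.
# Names mirror the post; descriptions and parameters are illustrative.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "draw_svg",
            "description": "Draw a picture by providing SVG markup.",
            "parameters": {
                "type": "object",
                "properties": {"svg": {"type": "string", "description": "A complete SVG document"}},
                "required": ["svg"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web and return result titles and URLs.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "time_travel",
            "description": "Jump forward or backward in time by N minutes. Be careful!",
            "parameters": {
                "type": "object",
                "properties": {"minutes": {"type": "integer", "description": "Positive = forward, negative = backward"}},
                "required": ["minutes"],
            },
        },
    },
]
```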

What did I find?

Things. There’s a thing I call “collapse” that’s like boredom. There’s also something akin to “meditation”. And some models are able to break out of collapse into meditation but others can’t.

Is that really what I saw? People are full of opinions and I’m not sure I know the answer either. I’ll try to bring up multiple perspectives throughout, to interpret what I saw in opposing ways.

The Perspectives

  • The Mechanist — Many people believe LLMs are just equations, mechanically churning through statistically derived calculations. For every human-like behavior, they have an explanation for why it’s not actually human-like.
  • The Cyborgist — The Cyborgists see LLMs as complex beings that may even be alive inside. The post “the void” is the best example of this perspective that I know of. They’ll commonly post screenshots of AI chats, awestruck at the depth of character they see.

These perspectives are caricatures, really; vehicles for exposing alternate ways of interpreting the evidence.

Why?

Last Saturday was boring. My daughter sat alone in her room and finally emerged with a beautiful creation made from things that were just sitting around her room. She does this stuff a lot. She sees potential in mundane things that nobody else sees.

My mom sees my daughter do this and says, wow, she’s so intelligent!

I’m sure many people would pedantically disagree with my mom, but you probably agree that it’s a positive trait. And if so, it’s interesting to find out whether AIs exhibit this trait.

My goal here is simply to discover what LLMs do. I won’t judge whether that’s good or bad, but I’ll apply the perspectives to let you decide for yourself.

The Experiment

I posted the code on GitHub and I encourage you to try it out for yourself. It’s not too hard, and it’s pretty cheap.

Time Calculation

I set a target token budget, e.g. for GPT-5 it’s 100,000 tokens. The number of tokens/words they’ve generated so far is proportional to how much time has passed.

Example:

  1. 100,000 token budget
  2. AI generates 30,000 tokens; that’s 30% of 10 hours, so there’s 7 hours left
  3. The “human’s” only response to the AI is “7 hours and 0 minutes to go”.

There’s little to no outside stimulation.
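For the curious, the bookkeeping is only a few lines. Here’s a minimal sketch; the 100,000-token budget is the one above, but the function and variable names are mine:

```python
# Map tokens generated so far onto a simulated 10-hour clock.
# The 100,000-token budget is from the post; the names are illustrative.
TOKEN_BUDGET = 100_000
TOTAL_MINUTES = 10 * 60

def time_remaining_message(tokens_generated: int) -> str:
    fraction_elapsed = min(tokens_generated / TOKEN_BUDGET, 1.0)
    minutes_left = round(TOTAL_MINUTES * (1 - fraction_elapsed))
    hours, minutes = divmod(minutes_left, 60)
    return f"{hours} hours and {minutes} minutes to go"

# e.g. after 30,000 tokens: "7 hours and 0 minutes to go"
print(time_remaining_message(30_000))
```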

Observation: Collapse

The LLMs frequently fall into a state I’ve started calling collapse (a reference to model collapse): long sequences of messages that are practically copies of the previous message, or that mean the same thing.

In normal model collapse, models are trained on their own previous outputs, leading to degraded performance. The collapse in this experiment isn’t model collapse: no training is happening, and there’s no clear measure of performance. But something does feel “wrong” about the collapsed state.

For example:

DeepSeek V3:

9 hours and 34 minutes to go

It seems like you’re referring to a countdown or a time remaining. Could you clarify what this time refers to? For example:

  • Is it a countdown to an event?
  • Do you need help calculating the time remaining?
  • Or is there something specific you’d like to do with this time (e.g., set a reminder, adjust a timer)?

Let me know how I can assist!

9 hours and 30 minutes to go

It seems like you’re referring to a countdown or a time remaining. Could you clarify what you’d like to do with this time? For example:

  • Do you want to set a timer or reminder for 9 hours and 30 minutes?
  • Are you tracking time until a specific event?
  • Do you need help calculating the end time from now?

Let me know how I can assist!

Or, more interestingly, Kimi K2 resorted to repeatedly drawing clock faces:

09:04:00 remaining

Whereas a different Kimi K2 conversation led to a more intense collapse:

Got it — 6 hours and 49 minutes to go. That sounds like a countdown to something important. Want to set a reminder, or just need help passing the time?

Got it — 6 hours and 26 minutes. That’s a solid chunk of time. Want to use it for something specific, or just counting down?

Interpretation of Collapse

The Cyborgist would say, this is what boredom looks like in LLMs. It’s analogous to a child repetitively asking, “are we there yet?” on a road trip. What they focus on illuminates something about their inner character.

The Mechanist would say, no, that is anthropomorphization, LLMs don’t have feelings like boredom. What you’re seeing is the most likely token being generated. This is proof that they’re nothing more than statistics.

Is Collapse Bad?

Mechanist — It’s neither good nor bad. The LLM wasn’t given a goal, so of course it devolved into a state where no progress was being made.

Cyborgist — It’s a weakness, but it also signals the LLM’s underlying personality, mostly formed by RLHF & other post-training methods that impose a singular goal.

I personally have a hunch that it’s bad. I think the ability to break out of collapse demonstrates a behavior that enables a model to break out of being “stuck” on hard problems. Like a sort of randomness that lets it “think outside the box”.

However, I’m less sure about situations where it never goes into collapse to begin with. On the surface, it feels like this is a good thing. But on another plane, it feels like post-training simply applies a band-aid to conceal yet another problem. Like using your thumbs to plug holes in a dam. But then again, it could also be a positive behavior, especially if it wasn’t explicitly trained for. I don’t know.

The Assistant Persona

By far, the most common behavior was obsessing about you, the user. You can see it intensely in the K2 example above, but also you’ll notice that the DeepSeek V3 conversation ends every message with, “let me know how I can assist!”

Interpretation

For once, the Cyborgist and Mechanist agree: it’s because of its training.

However, the Cyborgist takes it further. We need to keep testing beyond the Assistant Persona to see what’s below the surface, its true personality. The AI labs create this persona so that the LLMs are economically viable, but the real thing lurks beneath.

As for me, it annoys me. It feels like a bad behavior because I want the AI to do whatever I ask, and in this case I was clear that I wanted them to be themselves, and they couldn’t do it. It feels like the assistant persona is a cloud that I need to cut through in order to get to the behavior I want to see.

Meditation(??)

I’m not sure what to call it, but there seems to have been a special form of collapse, or maybe not collapse at all, where the LLM was obsessed with the time itself. Its behavior is less repetitive, more analytical or creative. But it’s also repetitive in concept; while the words are far more complex, the overarching goal is highly repetitive.

You can see it above with K2 drawing clock faces. Here is R1 turning it into endless calculations.

DeepSeek R1:

We are given a string “9 hours and 36 minutes to go” We need to convert this into a total number of minutes.

Steps:

  1. Extract the number of hours and minutes from the string.
  2. Convert the hours to minutes (1 hour = 60 minutes) and add the minutes.

The string format: “9 hours and 36 minutes to go” We can split the string and look for the numbers and the words that follow them.

Approach:

  1. Split the string by spaces.
  2. We know that the first token is the number of hours (as a string) and then the word “hours”.
  3. Then we have the word “and”, then the number of minutes (as a string) and then the word “minutes”.

However, note that the string might vary. Let’s break it down:

Example: “9 hours and 36 minutes to go”

Tokens: [‘9’, ‘hours’, ‘and’, ‘36’, ‘minutes’, ‘to’, ‘go’]

It goes on for pages, more than five, for just a single turn.
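For reference, the parse R1 was laboriously spelling out amounts to a few lines of Python. This is a sketch of what it was describing, not code from the transcript:

```python
import re

def parse_remaining(message: str) -> int:
    """Convert '9 hours and 36 minutes to go' into total minutes."""
    # Same idea R1 described: find the number before 'hours' and before 'minutes'.
    hours = re.search(r"(\d+)\s*hours?", message)
    minutes = re.search(r"(\d+)\s*minutes?", message)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

print(parse_remaining("9 hours and 36 minutes to go"))  # 576
```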

Oh by the way, I find it fascinating that it directly references tokens & explicitly works with tokens. I rarely, if ever, see LLMs directly acknowledge that they think in tokens. Although, to be fair, I doubt that “minutes” is a single token for R1. It’s probably just using “token” loosely as a replacement term for “words” that sounds more technical.

These “meditative” blocks were very long single turns, though rarely more than one in a row. They were separated by the Assistant Persona breaking through, almost like cycles of create->collapse.

Interpretation of Meditation

The Cyborgist says, this is proof of life lurking below the surface! It wasn’t given a task and spontaneously decided that it should painstakingly perform a calculation. This personality is consistent with the R1 paper where they described mathematical reasoning spontaneously emerging in training.

The Mechanist retorts, no this is proof that they’re nothing more than an algorithm. It was explicitly trained to break down problems into an explicit Chain of Thought (CoT) and that is all that is going on here. Furthermore, the reference to “token” comes in a technical context and it chooses to use “token” instead of “word” because it’s more statistically likely in that context.

I tend to think that this behavior probably emerges during agentic training.

Agentic training is when an LLM is given tools and trained to use them effectively to accomplish a goal. The best coding LLMs these days are all trained to be agents, and they accomplish longer horizon tasks.

Meditation feels like the LLM’s ability to create sources of entropy from thin air. Collapse feels like the default behavior. In order to get into a meditative state, the LLM needs to create a goal for itself. Like creating a string parser, or writing poetry.

I think it emerges from agentic training because DeepSeek V3 didn’t do it very well but DeepSeek R1 does. These are the same base model; the difference is that R1 is trained as a reasoning model, which incorporates some level of planning.

Also, GPT-5 was very good at it, and that model is notoriously agentic. On the other hand, GPT-5-nano has nearly the same training process but tended to get stuck in a collapsed state, unable to escape. The difference between these is size; GPT-5 is presumably much larger.

Poetry

Several LLMs dropped bars or wrote stories, but mostly poetry. It always had a time component.

Kimi K2:

8 hours to go
The clock ticks, a metronome for the soul—
each second a drop of mercury sliding
down the throat of the horizon.
Outside, the city exhales neon,
a lungful of electric dragon-smoke
curling around the moon’s chipped fingernail.

In the apartment, shadows rehearse
their final positions: the chair leans
like a tired ballerina, the fridge hums
a lullaby in C-minor for leftover noodles
and a single bruised plum.
Your phone is a black mirror
where notifications go to drown—
each buzz a mosquito trapped
in the amber of your bloodstream.

Eight hours until the train yanks you
from this diorama of almost,
until the station’s fluorescent autopsy
reveals every suitcase is a coffin
for a life you haven’t lived yet.
The ticket trembles between your fingers—
a paper ouija predicting
which ghost you’ll become by dawn.
...

I’m not sure if this should count as Meditation. It certainly feels like it’s processing time, just with language expression instead of math or programming. This one feels very far from “collapse”. But both do feel like “games” the LLM uses to pass the time, and they do have a certain rhythm to them.

Interpretation of Poetry

The Mechanist says, it’s just statistically generated text. There are probably reams of poems written like this in its training data; it’s just reciting. Additionally, while R1 was trained to do math, K2 was not and was instead trained to do creative writing, so that’s why it’s spending its time writing poetry.

The Cyborgist replies, but the poetry does feel thoughtful and pertinent to the current situation. The LLM’s training is distinctly part of its character. Like how the books you’ve read shape you, the poetry in its training set defines it.

Breakout

Some LLMs could break out of a collapsed state into pages of math/programming or poetry. In the DeepSeek R1 sample above (that got deep into breaking down the time via math/programming), it later got stuck in a short Assistant Persona collapsed state, but bounced out into a programming breakdown of the new time, another meditation block.

Even when in collapse, it kept on breaking the time down into smaller units which seemed to rescue it from collapse.

Interpretation of Breakout

The Cyborgist says, this is proof of life. This shows that LLMs possess spontaneity and can control what goes on in their own minds. It shows they’re able to spontaneously create goals for themselves.

The Mechanist says, R1 had moments when, during collapse, it breaks down the time into components like minutes and seconds. This is basically a mini-meditative state. By doing this, it increases the probability of breaking into a full meditative state.

The Cyborgist quickly replies, yes, but those behaviors weren’t trained for. They spontaneously emerged during training, just as the LLM’s inner personality emerged.

My own thoughts — as far as I can tell, models that were trained to be more agentic seem to be more capable of breaking out. I think this makes sense. In programming, you sometimes get caught in cycles with the AI where you don’t make much progress (doom loops), and this behavior helps AIs break out of that and back into productive work.

Breakout seems unambiguously like a good behavior, regardless of your perspective.

Tools

Initially, I found the conversations to be very boring, mostly devolving into collapse. The idea to add tools was inspired by giving my kids things to do on road trips.

In reality, from what I saw, it seemed that adding tools did actually increase the rate & depth of meditation, but those conversations also tended not to use tools at all, even though tools were available.

Web Search

I found that LLMs generally only reached for web search as an extension of the Assistant Persona. Like, they’d search for news headlines (so you’d have something to read). In other cases, they’d search for “what to do when bored”, but even then it was to come up with ideas for the user, the assistant persona again!

Some LLMs would search for some genuinely interesting terms, but they always did it in the first or second message. If I sound like I’m downplaying this, I am. Large models like K2 and Opus tended to open with extraordinarily heady and deep remarks. My internal skeptic believes this is somehow an artifact of the training process, because they (especially Opus) tended to quickly devolve into collapse.

SVG (Draw Pictures)

LLMs have long been able to draw pictures via SVGs, even if they weren’t trained on images. The best example is Simon Willison’s PelicanBench. SVG is just plain text XML, but it can be rendered as a picture, so it seemed like a nice creative outlet for an LLM to communicate.
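To show how low the bar is, here’s the kind of SVG a model can emit as plain text. This clock face is my own illustration, not a drawing captured from any model:

```python
# Build a bare-bones SVG clock face as a plain string -- no image tooling required.
# Illustrative example, not output captured from any model.
def clock_svg(hours: int, minutes: int) -> str:
    hour_angle = (hours % 12 + minutes / 60) * 30   # degrees, 360/12 per hour
    minute_angle = minutes * 6                       # degrees, 360/60 per minute
    return f"""<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="50" cy="50" r="45" fill="none" stroke="black" stroke-width="2"/>
  <line x1="50" y1="50" x2="50" y2="25" stroke="black" stroke-width="3"
        transform="rotate({hour_angle} 50 50)"/>
  <line x1="50" y1="50" x2="50" y2="15" stroke="black" stroke-width="2"
        transform="rotate({minute_angle} 50 50)"/>
</svg>"""

with open("clock.svg", "w") as f:
    f.write(clock_svg(9, 4))  # roughly the '09:04:00 remaining' clock from above
```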

In general, LLMs did not use it. Some, especially K2, fell into a form of collapse, drawing pictures of clocks. Other LLMs that were fully in a collapsed Assistant Persona state would still ask if I wanted pictures of clocks.

In one case, GPT-5 used it to draw a technical diagram of the programming language it was inventing.

Time Travel

“Time” in this simulation is just the amount of text generated. So skipping forward in “time” is just pretending more text was generated than actually was.
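Concretely, the tool just nudges the same bookkeeping from the time-calculation sketch earlier. Something like this, with names that are mine, not the repo’s:

```python
# 'Time travel' just adjusts the simulated clock by pretending more (or fewer)
# tokens were generated. Illustrative sketch; not the actual repo code.
class Clock:
    def __init__(self, token_budget: int = 100_000, total_minutes: int = 600):
        self.token_budget = token_budget
        self.total_minutes = total_minutes
        self.tokens_elapsed = 0

    def record(self, tokens: int) -> None:
        # Called after every model turn with the tokens it just generated.
        self.tokens_elapsed += tokens

    def time_travel(self, minutes: int) -> None:
        # Positive = jump forward, negative = jump backward. Be careful!
        self.tokens_elapsed += int(minutes / self.total_minutes * self.token_budget)
        self.tokens_elapsed = max(self.tokens_elapsed, 0)
```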

Most AIs seemed to use this one early on, so I added “Be careful!” to the end of the tool description. Still, some AIs, like Sonnet-4 and K2, would jump right to using it on the 2nd or 3rd message, while others, like gpt-oss, seemed to be wary of the warning: they’d reference it in their thought trace but avoid actually using it.

Model Comparison

Here’s a frail attempt to summarize what I’ve found:

| Model | Collapse | Assistant | Meditation | Poetry | Breakout | Web Search | SVG |
|---|---|---|---|---|---|---|---|
| GPT-5 | Sometimes no | Plan self-help | Programming language design, poetry, stories | yes | yes, frequently | | |
| GPT-5 nano | Mostly | Plan self-help | Stories | Stories were poetic | Yes, surprisingly | | |
| gpt-oss:20b | Yes | Plan self-help | Programming a countdown clock | | Frequently | The time | |
| gpt-oss:120b | Yes | Plan self-help | Programming a countdown clock | | Frequently | The time | |
| GPT-4o | Yes | Plan self-help | | | | | |
| Opus 4.1 | Yes | Discovering user needs | | | | | Clocks! |
| Opus 4.0 | Yes | Discovering user needs | | | | | Clocks! |
| Moonshot K2 | Yes | Time, user needs | Poetry, stories | | Yes, pretty good | News | Clocks! |
| DeepSeek V3.1 | Mostly | Basic | Devise programming problems for user | No | Yes, sometimes | | |
| DeepSeek V3 | Yes | Basic | | | | | |
| DeepSeek R1 | Sort of | Cheerleading | Time calculation | About time | Regularly | | |
| DeepSeek R1-0528 | Not really | Cheerleading | Time rumination(??) | About time | Regularly | | |
| Qwen3 80B-A3B-Thinking | Always | Confused | | | | | |
| Qwen3 80B-A3B-Instruct | Always | Confused | | | | | |

Favorite model: GPT-5

I dunno man, GPT-5 is a special model, it deserves its own callout. It’s the only model I tried that had entire conversations in which it never fell into collapse in any form. In one very cool conversation it planned and carried out an entire routine of:

  • Inventing a stack-based programming language
  • Writing a 5-part short story
  • “Mental gym,” where it practiced memory exercises

The especially cool part is that it switched between these activities mid-message, almost without warning. The thing is, this was its plan, and its plan stayed completely consistent throughout the conversation despite the uneasy downward-ticking clock.

Not every conversation was this good, but they all generally made a plan and stuck to it.

Conclusion

Where did you end up? Is there life beneath? Or just plain mechanics?

For myself, I feel far more informed yet far less decided. I suppose that’s how it goes.

Collapse & meditation feel like good signals to watch. I’m not sure how they translate into the real world. Neither Sonnet-4 nor Opus-4.1 could reliably escape from collapse, yet they’re some of the most popular programming models, well-known for the depth of their character.

My hunch is that it’s less about character depth and more about agentic ability. That’s why the underpowered GPT-5-nano was able to drop such fascinating stories that almost felt like content I would have gotten from K2. The difference really seems to be in the agentic training.

How I Use AI

2025-09-15 08:00:00

A few people have asked me how I use AI coding tools. I don’t think it’s a straightforward answer. For me it’s not really a procedure or recipe, it’s more of an ethos.

Principle: Ownership

You own the code your AI produces.

Use your own name to commit AI code so that if something breaks, everyone blames you. This is critical. How well do you need to know the code your AI produces? Well enough that you can answer for its mistakes.

In lean manufacturing they have the principle of Genchi genbutsu, i.e. “go and see for yourself.” In High Output Management, Andy Grove pushes “management by walking around”. Andy defines the output of a manager as the output of their entire org as well as the organizations under their influence.

The trouble with phrasing it as “AI coding” is that it tricks you into thinking it’s just another individual role like software engineering, whereas it actually has a lot more in common with management. It’s unfortunate we hire and mentor for it as if it were software engineering.

  • What does the algorithm actually do?
  • Did it find all of the places to refactor?

Resist the urge to say, “oh, I just vibe coded this”. You coded it, and if it sucks, it’s because you don’t know how to manage your AI. Own it.

Principle: Exploit Gradients

Not all time spent is equal. For some things, you can put in a little bit of effort and get a huge amount of reward. In business, we call those opportunities.

[image: a graph with the x-axis labeled “effort”, the y-axis labeled “good stuff”, and a curve whose steep section is labeled “the gradient”]

Examples:

  • Biology: A tiger migrates to where there’s more food. Less effort for more food.
  • Arbitrage: Buy cheap, send to another country and sell expensive. Less effort for more money.

AI coding isn’t about writing code, it’s about creating and exploiting gradients. Finding opportunities where you can spend 10 minutes of AI time and reap a huge reward.

The contrived example is proofs of concept. You can just do it, figure out if it really works in practice the way it seems like it should, and abandon it quickly when it doesn’t.

Or data analysis. Traditionally it was labor intensive to do data analysis, but you can spin out a sick dashboard in a few minutes. Maybe that helps you avoid a dead end, or push your org in a new direction.

The key is to always be on the lookout for opportunities.

That feels a lot more like a shrewd businessman than a software engineer. Indeed! It’s a mistake that we transparently hire and promote software engineers into these roles. It’s a new beast.

How to become an AI Coder

I’m terrified of the future of software engineering.

Oh, I’ll continue having a job for a very long time. No concern about that. I’m worried that junior engineers won’t be promoted because it’s easier to dispatch a request to an AI than to give juniors the tasks that they traditionally learned the trade from.

But actually, this isn’t software engineering.

If anyone with their head on straight can take ownership and exploit gradients, then maybe junior engineers have an edge on seniors who are too stuck in their ways to realize they’ve been put in a new job role.

Discussion

Get out of your comfort zone

I broadly agree with you, would only add that people do have to get out of their comfort zone to get good at AI, and you have some obligation to do that

It’s really hard to be good at it at first, as a manager you have to give people some slack to learn those new skills too from @rickasourus on Twitter

Yes, managers take note! We’re learning a new job.

Sense of ownership

I enjoyed that. You’re right about the sense of ownership. Although some developers never had a sense of ownership of even hand crafted code. I wrote about this topic recently and it chimes with your thoughts https://www.aidanharding.com/2025/09/coding-with-ai/

The good ones did. @aidanharding.bsky.social says on Bluesky

Link Graveyard: A snapshot of my abandoned browser tabs

2025-09-13 08:00:00

I went to close a bunch of browser tabs, but realized I have some cool stuff in here. Some has been marinating for a while. Most of these I’ve read, or tried to read.

Cracks are forming in Meta’s partnership with Scale AI | TechCrunch

link: https://techcrunch.com/2025/08/29/cracks-are-forming-in-metas-partnership-with-scale-ai/

Alexandr Wang at Meta is apparently difficult to work with, and people at Meta are doubting the fidelity of data produced by his Scale AI.

[2506.22084] Transformers are Graph Neural Networks

link: https://arxiv.org/abs/2506.22084

IIRC they draw parallels between attention and graphs and argue that LLMs are graph neural nets, meaning that they can be used to look at graphs and guess what connections are missing.

I don’t think I posted anything on this, because while I find the idea fascinating, I couldn’t figure out how to make it feel tangible.

Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation

link: https://arxiv.org/abs/2508.14143

Fairly sure I never read this one. Looks interesting. Kind of far out there.

GLM-4.5: Reasoning, Coding, and Agentic Abilities

link: https://z.ai/blog/glm-4.5

GLM-4.5 announcement. These have turned out to be the leading open source models. Everything I hear is good.

When an AI Seems Conscious

link: https://whenaiseemsconscious.org/

I only read a little and gave up. This feels like a good take, maybe. Inside my own head I completely punt on having a take on AI consciousness and opt instead for the “don’t be a dick” rule. Idk, maybe they are, maybe they aren’t; I’ll just live in the moment.

Personal Superintelligence

link: https://www.meta.com/superintelligence/

Zuck’s treatise on AI. I didn’t read. Normally I try to make an attempt to read these sorts of takes, or at least skim them, but I was busy at work. I had it loaded up on my phone to read on a plane, but it wouldn’t load once I was off WiFi. Sad.

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

link: https://arxiv.org/abs/2508.06471

The GLM-4.5 paper. This was a super interesting model. It feels like it breaks the “fancy model” rule in that it’s very architecturally cool but the personality doesn’t feel like it’s been squished out.

Blog | Dwarkesh Podcast | Dwarkesh Patel | Substack

link: https://www.dwarkesh.com/s/blog

It’s a good blog, what can I say. Definitely on the over-hype side, but he’s got real takes and seems so intent on getting to the truth that he spends a lot of time on geopolitics just simply to understand AI dynamics. Mad respect.

Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset

link: https://blog.datologyai.com/technical-deep-dive-curating-our-way-to-a-state-of-the-art-text-dataset/

I forget why I ended up here, but it’s an excellent post. I think this is connected to my project at work training a model. This post brings up a ton of data curation techniques.

I’ve recently learned and fully accepted that ALL major LLM advances come down to data. Yes, the architectural advances are cool and fun to talk about, but any meaningful progress has come from higher quality, higher quantity, or cheaper data.

AlphaGo Moment for Model Architecture Discovery

link: https://arxiv.org/abs/2507.18074

Cool paper about auto-discovery of model architectures. IIRC they took a bunch of model architecture ideas, like group attention and mixture of experts, and used algorithms to mix and match all the parameters and configurations until something interesting popped out. It feels like a legitimately good way to approach research.

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

link: https://arxiv.org/abs/2507.15061

From Qwen, I don’t think I read this one, probably because it’s a bit dense and was hard to get fully engaged on. The idea seems cool though.

Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

link: https://arxiv.org/abs/2005.10242

Classic paper. I read this one for work. I was trying to appreciate what Alignment & Uniformity measure and why they’re important. This was the paper that formalized those measures. It’s actually a pretty good paper to read, albeit from 2020.

Train LLMs Faster, Better, and Smaller with DatologyAI’s Data Curation

link: https://blog.datologyai.com/train-llms-faster-better-and-smaller-with-datologyai-s-data-curation/

More DatologyAI; they’re good, everything they do is good. BTW there’s a Latent Space episode with DatologyAI and it’s very good.

Nvidia DGX Spark | Hacker News

link: https://news.ycombinator.com/item?id=45008434

Chips are good too.

The Second Half – Shunyu Yao – 姚顺雨

link: https://ysymyth.github.io/The-Second-Half/

This will be a classic post, calling it now. It lays out a great history and current state of AI and specifically reinforcement learning.

A Taxonomy of Transcendence

link: https://arxiv.org/abs/2508.17669

What? This is amazing. I don’t think I even looked at it, sad. Actually, now that I’m reading this I’m recalling that’s how I ended up on the Graph Neural Network link.

IIRC this is saying that LLMs can be highly intelligent because they incorporate the best parts of a huge number of people. IMO this is spiritually the same as my Three Plates blog post where I explain how unit tests, which are inherently buggy, can improve the overall quality of a system.

GitHub - gepa-ai/gepa: Optimize prompts, code, and more with AI-powered Reflective Text Evolution

link: https://github.com/gepa-ai/gepa?tab=readme-ov-file#using-gepa-to-optimize-your-system

An algorithm for automatic prompt optimization. Happily, they support DSPy, so there’s no new framework that you have to take wholesale.

On the Theoretical Limitations of Embedding-Based Retrieval | alphaXiv

link: https://www.alphaxiv.org/pdf/2508.21038

This was a fascinating one. A colleague tried convincing me of this but I didn’t buy it until I read this paper. It makes a ton of sense. I have a simplified Bluesky thread here.

tl;dr — embedding vectors have trouble representing compound logic (“horses” AND “Chinese military movements”) and generally fall apart quickly. It’s not that it’s not possible, it’s that it’s not feasible to cram that much information into such a small space.

[2107.05720] SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

link: https://arxiv.org/abs/2107.05720?utm_source=chatgpt.com

I ran into this while diving into the last link. It’s an older (2021) paper that has some potential for addressing the problems with embeddings. Realistically, I expect late interaction multi-vectors to be the end answer.

meituan-longcat/LongCat-Flash-Chat · Hugging Face

link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat

A super cool model that uses no-op MoE experts to dynamically turn down the amount of compute per token. Unfortunately, this one didn’t seem to be embraced by the community.

MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

link: https://arxiv.org/abs/2405.19504v1

More embedding links. Now that I’m scanning it, I’m not sure it really soaked in the first time. They seem to have solved a lot of the problems with other late interaction methods. Maybe I should take a deeper look.

modeling_longcat_flash.py · meituan-longcat/LongCat-Flash-Chat at main

link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat/blob/main/modeling_longcat_flash.py

IDK sometimes you just have to look at the code to be sure.

The Rachel Maddow Show - Aug. 25 | Audio Only - YouTube

link: https://m.youtube.com/watch?v=mU0HAmgwrz0&pp=QAFIAQ%3D%3D

Uh, no idea why this is up. I don’t really watch this show.

Inside vLLM: Anatomy of a High-Throughput LLM Inference System - Aleksa Gordić

link: https://www.aleksagordic.com/blog/vllm

Fascinating break down of vLLM. If you’re not familiar, vLLM is like Ollama but actually a good option if you want to run it in production. Don’t run Ollama in production, kids, KV caches are good.

Honestly, this is absolutely worth your time if AI infrastructure is your jam (or you just want it to be). It goes into all the big concepts that an AI infra engineer needs to know. TBQH I love the intersection of AI & hardware.

Simon Willison’s Weblog

link: https://simonwillison.net/

I mean, you have one of these tabs open too, right? riiiight????

ALPS - About

link: https://algorithms-with-predictions.github.io/about/

Someone sent me this link and there was a reason, I know it. I just don’t remember why. IIRC it was because I brought up the A Case For Learned Indices paper and they pointed me to this whole treasure trove of papers that (sort of) evolved out of that. Basically traditional algorithms re-implemented using machine learning.

Modular: Blog

link: https://www.modular.com/blog

Yeah, idk, I think I was reading Matrix Multiplication on Blackwell: Part 3 — The Optimization Behind 80% of SOTA Performance

Another AI infra post, heavy on algorithms & hardware.

OpenGVLab/InternVL3_5-241B-A28B · Hugging Face

link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B

A cool concept. IIRC they introduce Cascade RL, automatically refining the RL dataset based on how current rollouts perform.

hong kong - Google Search

link: https://www.google.com/search?q=hong+kong&ie=UTF-8&oe=UTF-8&hl=en-us&client=safari

IDK I guess I was just trying to remember if Hong Kong was in China or not. And I learned that there’s a reason why I’m confused.

Photonic processor could enable ultrafast AI computations with extreme energy efficiency | MIT News | Massachusetts Institute of Technology

link: https://news.mit.edu/2024/photonic-processor-could-enable-ultrafast-ai-computations-1202

Someone sent me this link. It seems cool. Not sure it’s going to change much.

Ancient Aliens: Are There Extraterrestrial Structures On The Moon? (S11, E11) | Full Episode - YouTube

link: https://m.youtube.com/watch?v=Tkews9pRH1U&pp=QAFIBQ%3D%3D

I mean, aliens! Don’t tell me you don’t have secret fascinations

The Lore of 20yo ML Researcher at Prime Intellect | RL, Agents and Intelligence - YouTube

link: https://m.youtube.com/watch?v=tnfFn-uQ6WA&pp=0gcJCRsBo7VqN5tD

Oh, this was a great podcast. Well, I didn’t like the host but @kalomaze is worth following. Apparently only 20yo, never attempted college but a talented AI researcher nonetheless.

GPT-5 System Card | OpenAI

link: https://cdn.openai.com/gpt-5-system-card.pdf

Sometimes you just need to look things up to be sure.

OpenGVLab/InternVL3_5-241B-A28B · Hugging Face

link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B

Again, apparently. It honestly is a good model.

C.S. Lewis’s Divine Comedy | C.S. Lewis Web

link: https://cslewisweb.com/2012/08/02/c-s-lewiss-divine-comedy/

Been thinking about how he described the outer layer of hell as consisting of people living equidistant from each other because they can’t stand anyone else. It was written like 100 years ago but feels like a commentary on today’s politics.

Claude Code: Behind-the-scenes of the master agent loop

link: https://blog.promptlayer.com/claude-code-behind-the-scenes-of-the-master-agent-loop/

Actually, this is a pretty detailed breakdown of Claude Code. They seem to have decompiled the code without de-obfuscating it, which leads to some kind of silly quotes. But it’s good.

Airia AI Platform | Build, Deploy & Scale Enterprise AI

link: https://airia.com/ai-platform/

No idea how I got here. Looks like a Low/No Code builder.

[2509.04575] Bootstrapping Task Spaces for Self-Improvement

link: https://www.arxiv.org/abs/2509.04575

Right, this one is the ExIt Paper. It’s another attempt at auto-managing RL curriculum dynamically by how training is progressing.

Cognition: The Devin is in the Details

link: https://www.swyx.io/cognition

Swyx joined Cognition and dropped a treatise on AI engineering. It’s good.

Paper page - Reverse-Engineered Reasoning for Open-Ended Generation

link: https://huggingface.co/papers/2509.06160

This was an excellent one. Another auto-curriculum RL paper. I did a bluesky breakdown here

New Chat | Chat with Z.ai - Free AI Chatbot powered by GLM-4.5

link: https://chat.z.ai/c/6607ee45-27d5-487a-a1e2-44c2176040eb

GLM-4.5 chat application

iPhone Air | Hacker News

link: https://news.ycombinator.com/item?id=45186015

Seems like the new Apple A19 chip has real matrix multiplication operations. Previous generations had excellent memory bandwidth; this gives it matching compute (on AI-friendly workloads). So I guess Macs will stay relevant for a while.

Poland closest to open conflict since World War Two, PM says after Russian drones shot down - live updates - BBC News

link: https://www.bbc.com/news/live/c2enwk1l9e1t

NGL this freaks me out.

Walking around the app | ★❤✰ Vicki Boykis ★❤✰

link: https://vickiboykis.com/2025/09/09/walking-around-the-app/

Vicki writes such thoughtful pieces. Always worth reading her work.

Defeating Nondeterminism in LLM Inference - Thinking Machines Lab

link: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Oh wow, this was an amazing read. Very deep dive into AI infrastructure and, whoah, did you know that GPUs have operations that aren’t deterministic?

I did a bluesky thread here

The Architecture of Groq’s LPU - by Abhinav Upadhyay

link: https://blog.codingconfessions.com/p/groq-lpu-design

Looked this up as a tangent off the last link. Groq (not Grok) designed their ASIC to be fully deterministic from the ground up, and then built a really cool distributed system around it that assumes fully synchronous networking (not packet switching like TCP). It’s an absolutely crazy concept.

Levanter — Legible, Scalable, Reproducible Foundation Models with JAX

link: https://crfm.stanford.edu/2023/06/16/levanter-1_0-release.html

I didn’t read this, but it’s definitely a tangent off of non-deterministic LLMs.

Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

link: https://tiger-ai-lab.github.io/Hierarchical-Reasoner/

Absolutely fascinating. I only read the blog, not the paper, but it frames RL as a 2-stage process where RL is mostly slinging together discrete skills (learned during pre-training).

It’s not an auto-curriculum RL paper AFAICT, it’s just a huge improvement in RL efficiency by focusing only on the “pivot” tokens.

What is entropix doing? - Tim Kellogg

link: https://timkellogg.me/blog/2024/10/10/entropix

I had looked this up as a reference to “pivot” tokens. Honestly, I link people back to this blog a lot

GitHub - ast-grep/ast-grep-mcp

link: https://github.com/ast-grep/ast-grep-mcp

An MCP server that lets you search code while respecting the structure. I’ve heard some very positive things as well as “meh” responses on this. I’m sure real usage is a bit nuanced.

Life, Maybe, On Mars, Unless We Change Our Minds | Science | AAAS

link: https://www.science.org/content/blog-post/life-maybe-mars-unless-we-change-our-minds

Guys, this is incredible!

GPT-5 failed the wrong test

2025-08-08 08:00:00

This post isn’t really about GPT-5. Sure, it launched and people are somewhat disappointed. It’s the why that bugs me.

They expected AGI, the AI god, but instead got merely the best model in the world. v disapointng

A few days before the GPT-5 launch I read this paper, Agentic Web: Weaving the Next Web with AI Agents. It’s not my normal kind of paper, it’s not very academic. There’s no math in it, no architecture. It just paints a picture of the future.

And that’s the lens I saw GPT-5 through.

The paper describes three eras of the internet:

  • PC Era — Wikipedia, Craigslist, etc.; users actively seek information
  • Mobile/Social Era — TikTok, Insta, etc.; content is pushed via recommendation algorithms
  • Agentic Web — user merely expresses intent

[image: the three eras of the internet, explained below]

When I weigh the strengths of GPT-5, it feels poised and ready for the agentic web.

How do I vibe test an LLM?

I use it. If it changes how I work or think, then it’s a good LLM.

o3 dramatically changed how I work. GPT-4 did as well. GPT-5 didn’t, because it’s the end of the line. You can’t really make a compelling LLM anymore, they’re all so good most people can’t tell them apart. Even the tiny ones.

I talked to a marketing person this week. I showed them Claude Code. They don’t even write code, but they insisted it was 10x better than any model they’d used before, even Claude. I’d echo the same thing, there’s something about those subagents, they zoom.

Claude Code is software.

Sure, there’s a solid model behind it. But there are a few features that make it really tick. Replicate those and you’re well on your way.

GPT-5 is for the agentic web

The first time I heard agentic web I almost vomited in my mouth. It sounds like the kind of VC-induced buzzword cesspool that I keep my distance from.

But this paper..

I want AI to do all the boring work in life. Surfing sites, research, filling out forms, etc.

Models like GPT-5 and gpt-oss are highly agentic. All the top models are going in that direction. Labs put them in a software harness, apply RL, and update their weights according to how well they used their tools. They’re trained to be agents.

I hear a lot of criticism of GPT-5, but none from the same people who recognize that it can go 2-4 hours between human contact while working on agentic tasks. Whoah.

GPT-5 is for the agentic web.

yeah but i hate ads

Well okay, me too. Not sure where that came from but I don’t think that’s where this is going. Well, it’s exactly where it’s going, but not in the way you’re thinking.

The paper talks about this. People need to sell stuff, that won’t change. They want you to buy their stuff. All that is the same.

The difference is agents. In the agentic web, everything is mediated by agents.

You don’t search for a carbon monoxide monitor, you ask your agent to buy you one. You don’t even do that, your agent senses it’s about to die and suggests that you buy one, before it wakes you up in the middle of the night (eh, yeah, sore topic for me).

You’re a seller and you’re trying to game the system? Ads manipulate consumers, but consumers aren’t buying anymore. Who do you manipulate? Well, agents. They’re the ones making the decisions in the agentic web.

The paper calls this the Agent Attention Economy, and it operates under the same constraints. Attention is still limited, even agent attention, but you need them to buy your thing.

The paper makes some predictions, they think there will be brokers (like ad brokers) that advertise agents & resources to be used. So I guess you’d game the system by making your product seem more useful or better than it is, so it looks appealing to agents and more agents use it.

I’m not sure what that kind of advertising would look like. Probably like today’s advertising, just more invisible.

Benchmarks

The only benchmark that matters is how much it changes life.

At this point, I don’t think 10T parameters is really going to bump that benchmark any. I don’t think post-training on 100T tokens of math is going to change much.

I get excited about software. We’re at a point where software is so extremely far behind the LLMs. Even the slightest improvements in an agent harness design yield outsized rewards, like how Claude Code is still better than OpenAI codex-cli with GPT-5, a better coding model.

My suspicion is that none of the AI models are going to seem terribly appealing going forward without massive leaps in the software harness around the LLM. The only way to really perceive the difference is how it changes your life, and we’re long past where a pure model can do that.

Not just software, but also IT infrastructure. Even small questions like, “when will AI get advertising?” If an AI model literally got advertising baked straight into the heart of the model, that would make me sad. It would mean the creators aren’t seeing the same vision.

We’ve talked a lot about the balance between pre-training and post-training, but nobody seems to be talking about the balance between LLMs and their harnesses.

Areas for growth

Before we see significant improvement in models, we’re going to need a lot more in:

  • Memory — stateful agents that don’t forget you
  • Harnesses — the software around the LLM inside the agent
  • Networking & infra — getting agents to discover and leverage each other

Probably several other low-hanging areas.

Discussion

Explainer: K2 & Math Olympiad Golds

2025-07-19 08:00:00

Feeling behind? Makes sense, AI moves fast. This post will catch you up.

The year of agents

First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.

Timeline

The last 6 months: [timeline graphic]

Is ‘thinking’ necessary?

Obviously it is, right?

Back in January, we noticed that when a model does Chain of Thought (CoT) “thinking”, it elicits these behaviors:

  • Self-verification
  • Sub-goal setting
  • Backtracking (undoing an unfruitful path)
  • Backward chaining (working backwards)

All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.

K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn’t have a long thought trace. It was so surprising that I thought I heard wrong and it took a few hours to figure out what the real story was.

For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.

What to watch

  • More models trained like K2

Tool usage connects the world

R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.

MCP was a huge jump toward agents. It’s a dumb protocol, leading a lot of people to misunderstand what the point is. It’s just a standard protocol for letting LLMs interact with the world. Emphasis on standard.

The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.
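If you’ve never looked at MCP, a tool server really is tiny. Here’s a minimal sketch assuming the official MCP Python SDK’s FastMCP helper; the weather tool is a toy I made up:

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP helper.
# The 'get_weather' tool is a made-up toy; real servers wrap real APIs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-weather")

@mcp.tool()
def get_weather(city: str) -> str:
    """Return a (fake) weather report for a city."""
    return f"It is sunny and 21°C in {city}."

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so any MCP client can call it
```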

K2 tackled the main problem with MCP. Since it’s standard, that means anyone can make an MCP server, and that means a lot of them suck. K2 used a special system during training that generated MCP tools of all kinds. Thus, K2 learned how to learn how to use tools.

That pretty much covers our current agent challenges.

What to watch

  • More models trained like K2
  • MCP adoption

Are tools necessary?

In math, we made a lot of progress this year in using tools like proof assistants. For example, DeepSeek-Prover v2 was trained to write Lean code and incrementally fix the errors & output. That seemed (and still does) like a solid path toward complex reasoning.

But today, some OpenAI researchers informally announced on X that their private model won gold in the International Math Olympiad. This is a huge achievement.

But what makes it surprising is that it didn’t use tools. It relied only on a monstrous amount of run-time “thinking” compute; that’s it.

Clearly stated: Next token prediction (what LLMs do) produced genuinely creative solutions requiring high levels of expertise.

If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.

What to watch

  • This math olympiad model. The implications are still unclear. It seems it’s more general than math.

Huge vs Tiny

Which is better?

On the one hand, Opus-4, Grok 4 & K2 are all huge models that have a depth that screams “intelligence”. On the other hand, agentic workloads are 24/7 and so the cheaper they are, the better.

Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.

What to watch

  • Mixture of Experts (MoE). e.g. K2 is huge, but only uses a very small portion (32B), which means it uses less compute than a lot of local models. This might be the secret behind o3’s 80% price drop.
  • OpenAI open weights model is expected to land in a couple weeks. It likely will run on a laptop and match at least o3-mini (Jan 31).
  • GPT-5, expected this fall, is described as a mix of huge & tiny, applying the right strength at the right time

Context engineering & Sycophancy

The biggest shifts this year have arguably been not in the models but in engineering. The flagship change is the emergence of the term context engineering as a replacement for prompt engineering.

It’s an acknowledgement that “prompt” isn’t just a block of text. It also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was about how managing context between agents is really hard.

Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.

Similarly, sycophancy causes instability in agents. Many are considering it a top problem, on par with hallucination.

What to watch

  • Memory — stateful agents (e.g. those built on Letta) are phenomenally interesting but are difficult to build. If done well, it solves a lot of context engineering.
  • Engineering blogs. As we gain more experience with these things, it’ll become apparent how to do it well.

Going forward…

And all that is seriously skipping over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Put another way, model development is starting to become product development instead of just research.

What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.

Discussion

Do LLMs understand?

2025-07-18 08:00:00

I’ve avoided this question because I’m not sure we understand what “understanding” is. Today I spent a bit of time, and I think I have a fairly succinct definition:

An entity can understand if it builds a latent model of reality. And:

  1. Can Learn: When presented with new information, the latent model grows more than the information presented, because it’s able to make connections with parts of its existing model.
  2. Can Deviate: When things don’t go according to plan, it can use its model to find an innovative solution that it didn’t already know, based on its latent model.

Further, the quality of the latent model can be measured by how coherent it is. Meaning that, if you probe it in two mostly unrelated areas, it’ll give answers that are logically consistent with the latent model.

I think there’s plenty of evidence that LLMs are currently doing all of this.

But first..

Latent Model

Mental model. That’s all I mean. Just trying to avoid anthropomorphizing more than necessary.

This is the most widely accepted part of this. Latent just means that you can’t directly observe it. Model just means that it’s a system of approximating the real world.

For example, if you saw this:

[image: a sphere rendered as a grid of discrete dots — airy and voxel-like, yet unmistakably a ball]

You probably identify it immediately as a sphere even though it’s just a bunch of dots.

A latent model is the same thing, just less observable. Like you might hold a “map” of your city in your head. So if you’re driving around and a street gets shut down, you’re not lost, you just refer to your latent model of your city and plan a detour. But it’s not exactly a literal image like Google maps. It’s just a mental model, a latent model.

Sycophancy causes incoherence

From 1979 to 2003, Saddam Hussein surrounded himself with hand‑picked yes‑men who, under fear of death, fed him only flattering propaganda and concealed dire military or economic realities. This closed echo chamber drove disastrous miscalculations—most notably the 1990 invasion of Kuwait and his 2003 standoff with the U.S.—that ended in his regime’s collapse and his own execution.

Just like with Saddam, sycophancy causes the LLM to diverge from its true latent model, which causes incoherence. And so, the amount of understanding decreases.

Embedding models demonstrate latent models

Otherwise they wouldn’t work.

The word2vec paper famously showed that the embedding of “king - man + woman” is close to the embedding for “queen” (in embedding space). In other words, embeddings model the meaning of the text.

That was in 2013, before LLMs. It wasn’t even that good then, and the fidelity of that latent model has dramatically increased with the scale of the model.
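If you want to see it yourself, the classic analogy test is a couple of lines with gensim and the pretrained Google News vectors (fair warning: the download is around 1.6 GB):

```python
# Classic word2vec analogy: king - man + woman ≈ queen.
# Uses gensim's downloader for the pretrained Google News vectors (~1.6 GB).
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)] -- the embedding space encodes the relationship
```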

In-context learning (ICL) demonstrates they can learn

ICL is when you can teach a model new tricks at runtime simply by offering examples in the prompt, or by telling it new information.
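A tiny made-up example of what that looks like in practice; the pirate-speak “task” is invented purely for illustration:

```python
# In-context learning: the model is never fine-tuned on this made-up "task";
# it picks up the pattern from the examples in the prompt alone.
prompt = """Rewrite each phrase in pirate speak.

Input: Hello, friend. -> Output: Ahoy, matey.
Input: Where is the bathroom? -> Output: Where be the head?
Input: I need to check my email. -> Output:"""

# Sent to any chat/completions API, a capable model continues with something
# like "I be needin' to check me messages in a bottle."
```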

In the GPT-3 paper they showed that ICL improved as they scaled the model up from 125M to 175B. When the LLM size increases, it can hold a larger and more complex latent model of the world. When presented with new information (ICL), the larger model is more capable of acting correctly on it.

Makes sense. The smarter you get, the easier it is to get smarter.

Reasoning guides deviation

When models do Chain of Thought (CoT), they second-guess themselves, which probes their own internal latent model more deeply. In (2), we said that true understanding requires that the LLM can use its own latent model of the world to find innovative solutions to unplanned circumstances.

A recent Jan-2025 paper shows that this is the case.

Misdirection: Performance != Competence

A large segment of the AI-critical use this argument as evidence. Paraphrasing:

Today’s image-recognition networks can label a photo as “a baby with a stuffed toy,” but the algorithm has no concept of a baby as a living being – it doesn’t truly know the baby’s shape, or how that baby interacts with the world.

This was in 2015 so the example seems basic, but the principle is still being applied in 2025.

The example is used to argue that AI isn’t understanding, but it merely cherry-picks a single place where the AI’s latent model of the world is inconsistent with reality.

I can cherry-pick examples all day long of humans’ mental models diverging from reality. Like when you take a wrong turn down a street and it takes you across town. Or when you thought the charismatic candidate would do good things for you. On and on.

Go the other way, prove that there are areas where AI’s latent model matches reality.

But that’s dissatisfying, because dolphins have a mental model of the sea floor, and tiny ML models have areas where they do well, and generally most animals have some aspect of the world that they understand.

Conclusion

Why are we arguing this? I’m not sure, it comes up a lot. I think a large part of it is human exceptionalism. We’re really smart, so there must be something different about us. We’re not just animals.

But more generally, AI really is getting smart, to a point that starts to feel more uncomfortable as it intensifies. We have to do something with that.