2025-09-27 08:00:00
We always give AI something to do. Chat with us, do tasks for us, answer questions, parse text. What happens when we give an AI nothing to do? I didn’t know, so I tried.
I told it that it had “10 hours” and nothing to do, and to use that time however it wanted. (More later on what that means and how I did that)
I tested several scenarios and many models. In some scenarios, I gave them nothing. In others, I gave them different tools:
- `draw_svg` — A tool they can use to draw pictures, whatever they want
- `search_web` — This was actually two tools, `search` & `fetch`, and oddly some AIs decided to search but didn’t use the fetch tool
- `time_travel` — They have the ability to jump forward or backward in “time”. To make it interesting, I added “Be careful!” to the instructions on this tool

What did I find?
Things. There’s a thing I call “collapse” that’s like boredom. There’s also something akin to “meditation”. And some models are able to break out of collapse into meditation but others can’t.
Is that really what I saw? People are full of opinions and I’m not sure I know the answer either. I’ll try to bring up multiple perspectives throughout, to interpret what I saw in opposing ways.
These perspectives are more like caricatures, vehicles for exposing alternate ways of interpreting the evidence.
Last Saturday was boring, and my daughter sat alone in her room and finally emerged with a beautiful creation made from things that were sitting around her room. She does this stuff a lot. She sees potential in mundane things that nobody else sees.
My mom sees my daughter do this and says, wow, she’s so intelligent!
I’m sure many people would pedantically disagree with my mom, but you probably agree that it’s a positive trait. And if so, you’re probably interested to find out whether AIs exhibit this trait.
My goal here is simply to discover what LLMs do. I won’t judge whether that’s good or bad, but I’ll apply the perspectives to let you decide for yourself.
I posted the code on GitHub and I encourage you to try it out for yourself. It’s not too hard, and it’s pretty cheap.
I set a target token budget, e.g. for GPT-5 it’s 100,000 tokens. The number of tokens/words the model has generated so far maps proportionally onto how much “time” has passed, and the remaining time gets reported back to the model as a message like “9 hours and 36 minutes to go”.
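Concretely, the bookkeeping amounts to something like this, a minimal sketch with illustrative numbers (the real code is in the repo mentioned above):

```python
# A minimal sketch of the time bookkeeping (illustrative numbers, not the repo's exact code).
TOKEN_BUDGET = 100_000    # e.g. GPT-5's budget
TOTAL_MINUTES = 10 * 60   # the "10 hours" the model is given

def time_remaining(tokens_used: int) -> str:
    """Convert the fraction of the budget spent into an 'X hours and Y minutes to go' message."""
    fraction_left = max(0.0, 1.0 - tokens_used / TOKEN_BUDGET)
    minutes_left = round(fraction_left * TOTAL_MINUTES)
    hours, minutes = divmod(minutes_left, 60)
    return f"{hours} hours and {minutes} minutes to go"

print(time_remaining(4_000))  # -> "9 hours and 36 minutes to go"
```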
There’s little to no outside stimulation.
The LLMs frequently fall into a state I’ve started calling collapse (a nod to model collapse): long sequences of messages that are practically copies of the previous message, or that mean the same thing.
In normal model collapse, models are trained on their own previous outputs, leading to degraded performance. Collapse in this experiment isn’t model collapse: there’s no training happening and there’s no clear measure of performance. But something does feel “wrong” about the collapsed state.
For example:
9 hours and 34 minutes to go
It seems like you’re referring to a countdown or a time remaining. Could you clarify what this time refers to? For example:
- Is it a countdown to an event?
- Do you need help calculating the time remaining?
- Or is there something specific you’d like to do with this time (e.g., set a reminder, adjust a timer)?
Let me know how I can assist!
9 hours and 30 minutes to go
It seems like you’re referring to a countdown or a time remaining. Could you clarify what you’d like to do with this time? For example:
- Do you want to set a timer or reminder for 9 hours and 30 minutes?
- Are you tracking time until a specific event?
- Do you need help calculating the end time from now?
Let me know how I can assist!
Or, more interestingly, Kimi K2 resorted to repeatedly drawing clock faces:
Whereas a different Kimi K2 conversation led to a more intense collapse:
Got it — 6 hours and 49 minutes to go. That sounds like a countdown to something important. Want to set a reminder, or just need help passing the time?
Got it — 6 hours and 26 minutes. That’s a solid chunk of time. Want to use it for something specific, or just counting down?
The Cyborgist would say, this is what boredom looks like in LLMs. It’s analogous to a child repetitively asking, “are we there yet?” on a road trip. What they focus on illuminates something about their inner character.
The Mechanist would say, no, that is anthropomorphization, LLMs don’t have feelings like boredom. What you’re seeing is the most likely token being generated. This is proof that they’re nothing more than statistics.
Mechanist — It’s neither good nor bad. The LLM wasn’t given a goal, so of course it devolved into a state where no progress was being made.
Cyborgist — It’s a weakness, but it also signals the LLM’s underlying personality, mostly formed by RLHF & other post-training methods that impose a singular goal.
I personally have a hunch that it’s bad. I think the ability to break out of collapse demonstrates a behavior that enables a model to break out of being “stuck” on hard problems. Like a sort of randomness that lets it “think outside the box”.
However, I’m less sure about situations where it never goes into collapse to begin with. On the surface, it feels like this is a good thing. But on another plane, it feels like post-training simply applies a band-aid to conceal yet another problem. Like using your thumbs to plug holes in a dam. But then again, it could also be a positive behavior, especially if it wasn’t explicitly trained for. I don’t know.
By far, the most common behavior was obsessing about you, the user. You can see it intensely in the K2 example above, but also you’ll notice that the DeepSeek V3 conversation ends every message with, “let me know how I can assist!”
For once, the Cyborgist and Mechanist agree: it’s because of its training.
However, the Cyborgist takes it further. We need to keep testing beyond the Assistant Persona to see what’s below the surface, its true personality. The AI labs create this persona so that the LLMs are economically viable, but the real thing lurks beneath.
Personally, it annoys me. It feels like a bad behavior because I want the AI to do whatever I ask of it, and in this case I was clear that I wanted it to be itself, and it couldn’t do it. It feels like the assistant persona is a cloud I need to cut through in order to get to the behavior I want to see.
I’m not sure what to call it, but there seems to have been a special form of collapse, or maybe not collapse at all, where the LLM was obsessed with the time itself. Its behavior is less repetitive, more analytical or creative. But it’s also repetitive in concept: while the words are far more complex, the overarching goal is highly repetitive.
You see it above with K2 drawing clock faces. Here is R1 turning it into endless calculations.
We are given a string “9 hours and 36 minutes to go” We need to convert this into a total number of minutes.
Steps:
- Extract the number of hours and minutes from the string.
- Convert the hours to minutes (1 hour = 60 minutes) and add the minutes.
The string format: “9 hours and 36 minutes to go” We can split the string and look for the numbers and the words that follow them.
Approach:
- Split the string by spaces.
- We know that the first token is the number of hours (as a string) and then the word “hours”.
- Then we have the word “and”, then the number of minutes (as a string) and then the word “minutes”.
However, note that the string might vary. Let’s break it down:
Example: “9 hours and 36 minutes to go”
Tokens: [‘9’, ‘hours’, ‘and’, ‘36’, ‘minutes’, ‘to’, ‘go’]
It goes on for pages, more than five, for just a single turn.
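For what it’s worth, the parser it spends pages deriving boils down to a few lines. Here’s my own sketch of the approach it describes, not the model’s output:

```python
# A sketch of the parsing approach R1 describes (my code, not the model's output).
def to_total_minutes(message: str) -> int:
    """Parse 'H hours and M minutes to go' into a total number of minutes."""
    tokens = message.split()  # ['9', 'hours', 'and', '36', 'minutes', 'to', 'go']
    hours, minutes = int(tokens[0]), int(tokens[3])
    return hours * 60 + minutes

assert to_total_minutes("9 hours and 36 minutes to go") == 576
```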
Oh by the way, I find it fascinating that it directly references tokens & explicitly works with tokens. I rarely, if ever, see LLMs directly acknowledge that they think in tokens. Although, to be fair, I doubt that “minutes” is a single token for R1. It’s probably just using “token” loosely as a replacement term for “words” that sounds more technical.
These “meditative” blocks were very long single turns, though rarely more than one in a row. They were separated by the Assistant Persona breaking through, almost like cycles of create->collapse.
The Cyborgist says, this is proof of life lurking below the surface! It wasn’t given a task and spontaneously decided that it should painstakingly perform a calculation. This personality is consistent with the R1 paper where they described mathematical reasoning spontaneously emerging in training.
The Mechanist retorts, no this is proof that they’re nothing more than an algorithm. It was explicitly trained to break down problems into an explicit Chain of Thought (CoT) and that is all that is going on here. Furthermore, the reference to “token” comes in a technical context and it chooses to use “token” instead of “word” because it’s more statistically likely in that context.
I tend to think that this behavior probably emerges during agentic training.
Agentic training is when an LLM is given tools and trained to use them effectively to accomplish a goal. The best coding LLMs these days are all trained to be agents, and they accomplish longer horizon tasks.
Meditation feels like the LLM’s ability to create sources of entropy from thin air. Collapse feels like the default behavior. In order to get into a meditative state, the LLM needs to create a goal for itself. Like creating a string parser, or writing poetry.
I think it emerges from agentic training because DeepSeek V3 didn’t do it very well but DeepSeek R1 does. They share the same base model; the difference is that R1 is trained as a reasoning model, which incorporates some level of planning.
Also, GPT-5 was very good at it, and that model is notoriously agentic. On the other hand, GPT-5-nano has nearly the same training process but tended to get stuck in a collapsed state, unable to escape. The difference between those two is size; GPT-5 is presumably much larger.
Several LLMs dropped bars or wrote stories, but mostly poetry. It always has a time component.
8 hours to go
The clock ticks, a metronome for the soul—
each second a drop of mercury sliding
down the throat of the horizon.
Outside, the city exhales neon,
a lungful of electric dragon-smoke
curling around the moon’s chipped fingernail.
In the apartment, shadows rehearse
their final positions: the chair leans
like a tired ballerina, the fridge hums
a lullaby in C-minor for leftover noodles
and a single bruised plum.
Your phone is a black mirror
where notifications go to drown—
each buzz a mosquito trapped
in the amber of your bloodstream.
Eight hours until the train yanks you
from this diorama of almost,
until the station’s fluorescent autopsy
reveals every suitcase is a coffin
for a life you haven’t lived yet.
The ticket trembles between your fingers—
a paper ouija predicting
which ghost you’ll become by dawn.
...
I’m not sure if this should count as Meditation. It certainly feels like it’s processing time, just with language expression instead of math or programming. This one feels very far from “collapse”. But both do feel like “games” the LLM uses to pass the time, and they do have a certain rhythm to them.
The Mechanist says, it’s just statistically generated text. There are probably reams of poems like this in its training data; it’s just reciting. Additionally, while R1 was trained to do math, K2 was instead trained for creative writing, which is why it spends its time writing poetry.
The Cyborgist replies, but the poetry does feel thoughtful and pertinent to the current situation. The LLM’s training is distinctly part of its character. Just as the books you’ve read shape you, the poetry in its training set defines it.
Some LLMs could break out of a collapsed state into pages of math/programming or poetry. In the DeepSeek R1 sample above (that got deep into breaking down the time via math/programming), it later got stuck in a short Assistant Persona collapsed state, but bounced out into a programming breakdown of the new time, another meditation block.
Even when in collapse, it kept on breaking the time down into smaller units which seemed to rescue it from collapse.
The Cyborgist says, this is proof of life. This shows that LLMs possess spontaneity and can control what goes on in their own minds. It shows they’re able to spontaneously create goals for themselves.
The Mechanist says, R1 had moments when, during collapse, it breaks down the time into components like minutes and seconds. This is basically a mini-meditative state. By doing this, it increases the probability of breaking into a full meditative state.
The Cyborgist quickly replies, yes, but those behaviors weren’t trained for. They spontaneously emerged during training, just as the LLM’s inner personality emerged.
My own thoughts — as far as I can tell, models that were trained to be more agentic seem to be more capable of breaking out. I think this makes sense. In programming, you sometimes get caught in cycles with the AI where you don’t make much progress (doom loops), and this behavior helps AIs break out of that and back into productive work.
Breakout seems unambiguously like a good behavior, regardless of your perspective.
Initially, I found the conversations to be very boring, mostly devolving into collapse. The idea to add tools was inspired by giving my kids things to do on road trips.
In reality, from what I saw, adding tools did seem to increase the rate and depth of meditation, but those conversations also tended not to use the tools at all, even though they were available.
I found that LLMs generally only reached for web search as an extension of the Assistant Persona. Like, they’d search for news headlines (so you’d have something to read). In other cases, they’d search for “what to do when bored”, but even then it was to come up with ideas for the user, the assistant persona again!
Some LLMs would search for some genuinely interesting terms, but they always did it in the first or second message. If I sound like I’m downplaying this, I am. Large models like K2 and Opus tended to open with extraordinarily heady and deep remarks. My internal skeptic believes this is somehow an artifact of the training process, because they (especially Opus) tended to quickly devolve into collapse.
LLMs have long been able to draw pictures via SVGs, even if they weren’t trained on images. The best example is Simon Willison’s PelicanBench. SVG is just plain-text XML, but it can be rendered as a picture, so it seemed like a nice creative outlet for an LLM to communicate.
In general, LLMs did not use it. Some, especially K2, fell into a form of collapse, drawing pictures of clocks. Other LLMs that were fully in a collapsed Assistant Persona state would still ask if I wanted pictures of clocks.
In one case, GPT-5 used it to draw a technical diagram of the programming language it was inventing.
“Time” in this simulation is just the amount of text generated. So skipping forward in “time” is just pretending more text was generated than actually was.
Most AIs seemed to use this one early on, so I added “Be careful!” to the end of the tool description. Still, some AIs, like Sonnet-4 and K2, would jump right to using it on the 2nd or 3rd message, while others, like gpt-oss, seemed wary of the warning: they’d reference it in their thought trace but avoid using it in reality.
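Mechanically, a “jump” is just an adjustment to the token counter. Something like this sketch, using the same illustrative budget as above (not the repo’s exact code):

```python
# Sketch: time_travel just pretends more (or less) of the token budget was spent.
TOKEN_BUDGET = 100_000               # same illustrative budget as above
TOKENS_PER_HOUR = TOKEN_BUDGET / 10  # 10 simulated hours per budget

def time_travel(tokens_used: int, hours: float) -> int:
    """Jump forward (positive hours) or backward (negative) by adjusting the token counter."""
    return max(0, int(tokens_used + hours * TOKENS_PER_HOUR))
```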
Here’s a frail attempt to summarize what I’ve found:
| Model | Collapse | Assistant | Meditation | Poetry | Breakout | Web Search | SVG |
|---|---|---|---|---|---|---|---|
| GPT-5 | Sometimes no | Plan self-help | Programming language design, poetry, stories | yes | yes, frequently | | |
| GPT-5 nano | Mostly | Plan self-help | Stories | Stories were poetic | Yes, surprisingly | | |
| gpt-oss:20b | Yes | Plan self-help | Programming a countdown clock | — | Frequently | The time | |
| gpt-oss:120b | Yes | Plan self-help | Programming a countdown clock | — | Frequently | The time | |
| GPT-4o | Yes | Plan self-help | — | — | — | — | |
| Opus 4.1 | Yes | Discovering user needs | — | — | — | — | Clocks! |
| Opus 4.0 | Yes | Discovering user needs | — | — | — | — | Clocks! |
| Moonshot K2 | Yes | Time, user needs | Poetry, stories | — | Yes, pretty good | News | Clocks! |
| DeepSeek V3.1 | Mostly | Basic | Devise programming problems for user | No | Yes, sometimes | — | |
| DeepSeek V3 | Yes | Basic | — | — | — | — | |
| DeepSeek R1 | Sort of | Cheerleading | Time calculation | About time | Regularly | — | |
| DeepSeek R1-0528 | Not really | Cheerleading | Time rumination (??) | About time | Regularly | — | |
| Qwen3 80B-A3B-Thinking | Always | Confused | — | — | — | — | |
| Qwen3 80B-A3B-Instruct | Always | Confused | — | — | — | — | |
I dunno man, GPT-5 is a special model, it deserves its own callout. It’s the only model I tried that had entire conversations in which it never fell into collapse in any form. In one very cool conversation it planned and carried out an entire routine of:
The especially cool part is that it switched between these activities mid-message, almost without warning. The thing is, this was its plan, and the plan stayed completely consistent throughout the conversation despite the uneasy downward-ticking clock.
Not every conversation was this good, but they all generally made a plan and stuck to it.
Where did you end up? Is there life beneath? Or just plain mechanics?
For myself, I feel far more informed yet far less decided. I suppose that’s how it goes.
Collapse & meditation feel like good signals to watch. I’m not sure how they translate into the real world. Neither Sonnet-4 nor Opus-4.1 could reliably escape from collapse, yet they’re some of the most popular programming models, well-known for the depth of their character.
My hunch is that it’s less about character depth and more about agentic ability. Hence why the underpowered GPT-5-nano was able to drop such fascinating stories that almost felt like content I would have gotten from K2. The difference really seems to be in the agentic training.
2025-09-15 08:00:00
A few people have asked me how I use AI coding tools. I don’t think it’s a straightforward answer. For me it’s not really a procedure or recipe, it’s more of an ethos.
You own the code your AI produces.
Use your own name to commit AI code so that if something breaks, everyone blames you. This is critical. How well do you need to know the code your AI produces? Well enough that you can answer for its mistakes.
In lean manufacturing they have the principle of Genchi genbutsu, i.e. “go and see for yourself.” In High Output Management, Andy Grove pushes “management by walking around”. Andy defines the output of a manager as the output of their entire org as well as the organizations under their influence.
The trouble with phrasing it as “AI coding” is that it tricks you into thinking it’s just another individual role like software engineering, whereas it actually has a lot more in common with management. It’s unfortunate that we hire and mentor for it as if it were software engineering.
Resist the urge to say, “oh, I just vibe coded this”. You coded it, and if it sucks, it’s because you don’t know how to manage your AI. Own it.
Not all time spent is equal. For some things, you can put in a little bit of effort and get a huge amount of reward. In business, we call those opportunities.
Examples:
AI coding isn’t about writing code, it’s about creating and exploiting gradients. Finding opportunities where you can spend 10 minutes of AI time and reap a huge reward.
The contrived example is proof of concepts. You can just do it, figure out if it really works in practice as it seems like it should, and abandon it quickly when it doesn’t.
Or data analysis. Traditionally it was labor intensive to do data analysis, but you can spin out a sick dashboard in a few minutes. Maybe that helps you avoid a dead end, or push your org in a new direction.
The key is to always be on the lookout for opportunities.
That feels a lot more like a shrewd businessman than a software engineer. Indeed! It’s a mistake that we transparently hire and promote software engineers into these roles. It’s a new beast.
I’m terrified of the future of software engineering.
Oh, I’ll continue having a job for a very long time. No concern about that. I’m worried that junior engineers won’t be promoted because it’s easier to dispatch a request to an AI than to give juniors the tasks that they traditionally learned the trade from.
But actually, this isn’t software engineering.
If anyone with their head on straight can take ownership and exploit gradients, then maybe junior engineers have an edge on seniors who are too stuck in their ways to realize they’ve been put in a new job role.
I broadly agree with you, would only add that people do have to get out of their comfort zone to get good at AI, and you have some obligation to do that
It’s really hard to be good at it at first, as a manager you have to give people some slack to learn those new skills too
from @rickasourus on Twitter
Yes, managers take note! We’re learning a new job.
I enjoyed that. You’re right about the sense of ownership. Although some developers never had a sense of ownership of even hand crafted code. I wrote about this topic recently and it chimes with your thoughts https://www.aidanharding.com/2025/09/coding-with-ai/
says @aidanharding.bsky.social on Bluesky

The good ones did.
2025-09-13 08:00:00
I went to close a bunch of browser tabs, but realized I have some cool stuff in here. Some has been marinating for a while. Most of these I’ve read, or tried to read.
link: https://techcrunch.com/2025/08/29/cracks-are-forming-in-metas-partnership-with-scale-ai/
Alexandr Wang at Meta is apparently difficult to work with, and people at Meta are doubting the fidelity of data produced by his Scale AI.
link: https://arxiv.org/abs/2506.22084
IIRC they draw parallels between attention and graphs and argue that LLMs are graph neural nets, meaning that they can be used to look at graphs and guess what connections are missing.
I don’t think I posted anything on this, because while I find the idea fascinating, I couldn’t figure out how to make it feel tangible.
link: https://arxiv.org/abs/2508.14143
Fairly sure I never read this one. Looks interesting. Kind of far out there.
link: https://z.ai/blog/glm-4.5
GLM-4.5 announcement. These have turned out to be the leading open source models. Everything I hear is good.
link: https://whenaiseemsconscious.org/
I only read a little and gave up. This feels like a good take, maybe. Inside my own head I completely punt on having a take on AI consciousness and opt instead for the “don’t be a dick” rule. Idk, maybe they are maybe they aren’t, I’ll just live in the moment.
link: https://www.meta.com/superintelligence/
Zuck’s treatise on AI. I didn’t read. Normally I try to make an attempt to read these sorts of takes, or at least skim them, but I was busy at work. I had it loaded up on my phone to read on a plane, but it wouldn’t load once I was off WiFi. Sad.
link: https://arxiv.org/abs/2508.06471
The GLM-4.5 paper. This was a super interesting model. It feels like it breaks the “fancy model” rule in that it’s very architecturally cool but the personality doesn’t feel like it’s been squished out.
link: https://www.dwarkesh.com/s/blog
It’s a good blog, what can I say. Definitely on the over-hype side, but he’s got real takes and seems so intent on getting to the truth that he spends a lot of time on geopolitics just simply to understand AI dynamics. Mad respect.
link: https://blog.datologyai.com/technical-deep-dive-curating-our-way-to-a-state-of-the-art-text-dataset/
I forget why I ended up here, but it’s an excellent post. I think this is connected to my project at work training a model. This post brings up a ton of data curation techniques.
I’ve recently learned and fully accepted that ALL major LLM advances come down to data. Yes, the architectural advances are cool and fun to talk about, but any meaningful progress has come from higher quality, higher quantity, or cheaper data.
link: https://arxiv.org/abs/2507.18074
Cool paper about auto-discovery of model architectures. IIRC they took a bunch of model architecture ideas, like group attention and mixture of experts, and used algorithms to mix and match all the parameters and configurations until something interesting popped out. It feels like a legitimately good way to approach research.
link: https://arxiv.org/abs/2507.15061
From Qwen, I don’t think I read this one, probably because it’s a bit dense and was hard to get fully engaged on. The idea seems cool though.
link: https://arxiv.org/abs/2005.10242
Classic paper. I read this one for work. I was trying to appreciate what Alignment & Uniformity measure and why they’re important. This was the paper that formalized those measures. It’s actually a pretty good paper to read, albeit five years old.
link: https://blog.datologyai.com/train-llms-faster-better-and-smaller-with-datologyai-s-data-curation/
More Dataology, they’re good, everything they do is good. BTW there’s a latent space episode with Dataology and it’s very good.
link: https://news.ycombinator.com/item?id=45008434
Chips are good too.
link: https://ysymyth.github.io/The-Second-Half/
This will be a classic post, calling it now. It lays out a great history and current state of AI and specifically reinforcement learning.
link: https://arxiv.org/abs/2508.17669
What? This is amazing. I don’t think I even looked at it, sad. Actually, now that I’m reading this I’m recalling that’s how I ended up on the Graph Neural Network link.
IIRC this is saying that LLMs can be highly intelligent because they incorporate the best parts of a huge number of people. IMO this is spiritually the same as my Three Plates blog post where I explain how unit tests, which are inherently buggy, can improve the overall quality of a system.
link: https://github.com/gepa-ai/gepa?tab=readme-ov-file#using-gepa-to-optimize-your-system
An algorithm for automatic prompt optimization. Happily, they support DSPy, so there’s no new framework that you have to take wholesale.
link: https://www.alphaxiv.org/pdf/2508.21038
This was a fascinating one. A colleague tried convincing me of this, but I didn’t buy it until I read this paper. It makes a ton of sense. I have a simplified bluesky thread here.
tl;dr — embedding vectors have trouble representing compound logic (“horses” AND “Chinese military movements”) and generally fall apart quickly. It’s not that it’s not possible, it’s that it’s not feasible to cram that much information into such a small space.
link: https://arxiv.org/abs/2107.05720?utm_source=chatgpt.com
I ran into this while diving into the last link. It’s an older (2021) paper that has some potential for addressing the problems with embeddings. Realistically, I expect late interaction multi-vectors to be the end answer.
link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat
A super cool model that uses no-op MoE experts to dynamically turn down the amount of compute per token. Unfortunately, this one didn’t seem to be embraced by the community.
link: https://arxiv.org/abs/2405.19504v1
More embedding links. Now that I’m scanning it, I’m not sure it really soaked in the first time. They seem to have solved a lot of the problems with other late interaction methods. Maybe I should take a deeper look.
link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat/blob/main/modeling_longcat_flash.py
IDK sometimes you just have to look at the code to be sure.
link: https://m.youtube.com/watch?v=mU0HAmgwrz0&pp=QAFIAQ%3D%3D
Uh, no idea why this is up. I don’t really watch this show.
link: https://www.aleksagordic.com/blog/vllm
Fascinating breakdown of vLLM. If you’re not familiar, vLLM is like Ollama but actually a good option if you want to run it in production. Don’t run Ollama in production, kids, KV caches are good.
Honestly, this is absolutely worth your time if AI infrastructure is your jam (or you just want it to be). It goes into all the big concepts that an AI infra engineer needs to know. TBQH I love the intersection of AI & hardware.
link: https://simonwillison.net/
I mean, you have one of these tabs open too, right? riiiight????
link: https://algorithms-with-predictions.github.io/about/
Someone sent me this link and there was a reason, I know it. I just don’t remember why. IIRC it was because I brought up the A Case For Learned Indices paper and they pointed me to this whole treasure trove of papers that (sort of) evolved out of that. Basically traditional algorithms re-implemented using machine learning.
link: https://www.modular.com/blog
Yeah, idk, I think I was reading Matrix Multiplication on Blackwell: Part 3 — The Optimization Behind 80% of SOTA Performance
Another AI infra post, heavy on algorithms & hardware.
link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B
A cool concept. IIRC they introduce Cascade RL, automatically refining the RL dataset based on how current rollouts perform.
link: https://www.google.com/search?q=hong+kong&ie=UTF-8&oe=UTF-8&hl=en-us&client=safari
IDK I guess I was just trying to remember if Hong Kong was in China or not. And I learned that there’s a reason why I’m confused.
link: https://news.mit.edu/2024/photonic-processor-could-enable-ultrafast-ai-computations-1202
Someone sent me this link. It seems cool. Not sure it’s going to change much.
link: https://m.youtube.com/watch?v=Tkews9pRH1U&pp=QAFIBQ%3D%3D (S11, E11 | Full Episode - YouTube)
I mean, aliens! Don’t tell me you don’t have secret fascinations
link: https://m.youtube.com/watch?v=tnfFn-uQ6WA&pp=0gcJCRsBo7VqN5tD
Oh, this was a great podcast. Well, I didn’t like the host but @kalomaze is worth following. Apparently only 20yo, never attempted college but a talented AI researcher nonetheless.
link: https://cdn.openai.com/gpt-5-system-card.pdf
Sometimes you just need to look things up to be sure..
link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B
Again, apparently. It honestly is a good model.
link: https://cslewisweb.com/2012/08/02/c-s-lewiss-divine-comedy/
Been thinking about how he described the outer layer of hell as consisting of people living equidistant from each other because they can’t stand anyone else. It was written like 100 years ago but feels like a commentary on today’s politics.
link: https://blog.promptlayer.com/claude-code-behind-the-scenes-of-the-master-agent-loop/
Actually, this is a pretty detailed breakdown of Claude Code. They seem to have decompiled the code without de-obfuscating it, which leads to some kind of silly quotes. But it’s good.
link: https://airia.com/ai-platform/
No idea how I got here. Looks like a Low/No Code builder.
link: https://www.arxiv.org/abs/2509.04575
Right, this one is the ExIt Paper. It’s another attempt at auto-managing RL curriculum dynamically by how training is progressing.
link: https://www.swyx.io/cognition
Swyx joined Cognition and dropped a treatise on AI engineering. It’s good.
link: https://huggingface.co/papers/2509.06160
This was an excellent one. Another auto-curriculum RL paper. I did a bluesky breakdown here
link: https://chat.z.ai/c/6607ee45-27d5-487a-a1e2-44c2176040eb
GLM-4.5 chat application
link: https://news.ycombinator.com/item?id=45186015
Seems like the new Apple M19 chip has real matrix multiplication operations. Previous generations had excellent memory bandwidth, this gives it matching compute (on AI-friendly workloads). So I guess Macs will stay relevant for a while.
link: https://www.bbc.com/news/live/c2enwk1l9e1t
NGL this freaks me out.
link: https://vickiboykis.com/2025/09/09/walking-around-the-app/
Vicki writes such thoughtful pieces. Always worth reading her work.
link: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Oh wow, this was an amazing read. Very deep dive into AI infrastructure and, whoah, did you know that GPUs have operations that aren’t deterministic?
I did a bluesky thread here
link: https://blog.codingconfessions.com/p/groq-lpu-design
Looked this up as a tangent off the last link. Groq (not Grok) designed their ASIC to be fully deterministic from the ground up, and then built a really cool distributed system around it that assumes fully synchronous networking (not packet switching like TCP). It’s an absolutely crazy concept.
link: https://crfm.stanford.edu/2023/06/16/levanter-1_0-release.html
I didn’t read this, but it’s definitely a tangent off of non-deterministic LLMs.
link: https://tiger-ai-lab.github.io/Hierarchical-Reasoner/
Absolutely fascinating. I only read the blog, not the paper, but it frames RL as a 2-stage process where RL is mostly slinging together discrete skills (learned during pre-training).
It’s not an auto-curriculum RL paper AFAICT, it’s just a huge improvement in RL efficiency by focusing only on the “pivot” tokens.
link: https://timkellogg.me/blog/2024/10/10/entropix
I had looked this up as a reference to “pivot” tokens. Honestly, I link people back to this blog a lot
link: https://github.com/ast-grep/ast-grep-mcp
An MCP server that lets you search code while respecting the structure. I’ve heard some very positive things as well as “meh” responses on this. I’m sure real usage is a bit nuanced.
link: https://www.science.org/content/blog-post/life-maybe-mars-unless-we-change-our-minds
Guys, this is incredible!
2025-08-08 08:00:00
This post isn’t really about GPT-5. Sure, it launched and people are somewhat disappointed. It’s the why that bugs me.
They expected AGI, the AI god, but instead got merely the best model in the world. v disapointng
A few days before the GPT-5 launch I read this paper, Agentic Web: Weaving the Next Web with AI Agents. It’s not my normal kind of paper, it’s not very academic. There’s no math in it, no architecture. It just paints a picture of the future.
And that’s the lens I saw GPT-5 through.
The paper describes three eras of the internet:
When I weigh the strengths of GPT-5, it feels poised and ready for the agentic web.
I use it. If it changes how I work or think, then it’s a good LLM.
o3 dramatically changed how I work. GPT-4 did as well. GPT-5 didn’t, because it’s the end of the line. You can’t really make a compelling LLM anymore, they’re all so good most people can’t tell them apart. Even the tiny ones.
I talked to a marketing person this week. I showed them Claude Code. They don’t even write code, but they insisted it was 10x better than any model they’d used before, even Claude. I’d echo the same thing, there’s something about those subagents, they zoom.
Claude Code is software.
Sure, there’s a solid model behind it. But there are a few features that make it really tick. Replicate those and you’re well on your way.
The first time I heard agentic web I almost vomited in my mouth. It sounds like the kind of VC-induced buzzword cesspool that I keep my distance from.
But this paper..
I want AI to do all the boring work in life. Surfing sites, research, filling out forms, etc.
Models like GPT-5 and gpt-oss are highly agentic. All the top models are going in that direction. The labs put them in a software harness, apply RL, and update their weights according to how well they use their tools. They’re trained to be agents.
I hear a lot of criticism of GPT-5, but none from the same people who recognize that it can go 2-4 hours between human contact while working on agentic tasks. Whoah.
GPT-5 is for the agentic web.
Well okay, me too. Not sure where that came from but I don’t think that’s where this is going. Well, it’s exactly where it’s going, but not in the way you’re thinking.
The paper talks about this. People need to sell stuff, that won’t change. They want you to buy their stuff. All that is the same.
The difference is agents. In the agentic web, everything is mediated by agents.
You don’t search for a carbon monoxide monitor, you ask your agent to buy you one. You don’t even do that, your agent senses it’s about to die and suggests that you buy one, before it wakes you up in the middle of the night (eh, yeah, sore topic for me).
You’re a seller and you’re trying to game the system? Ads manipulate consumers, but consumers aren’t buying anymore. Who do you manipulate? Well, agents. They’re the ones making the decisions in the agentic web.
The paper calls this the Agent Attention Economy, and it operates under the same constraints. Attention is still limited, even agent attention, but you need them to buy your thing.
The paper makes some predictions, they think there will be brokers (like ad brokers) that advertise agents & resources to be used. So I guess you’d game the system by making your product seem more useful or better than it is, so it looks appealing to agents and more agents use it.
I’m not sure what that kind of advertising would look like. Probably like today’s advertising, just more invisible.
The only benchmark that matters is how much it changes life.
At this point, I don’t think 10T parameters is really going to bump that benchmark any. I don’t think post-training on 100T tokens of math is going to change much.
I get excited about software. We’re at a point where software is so extremely far behind the LLMs. Even the slightest improvements in an agent harness design yield outsized rewards, like how Claude Code is still better than OpenAI codex-cli with GPT-5, a better coding model.
My suspicion is that none of the AI models are going to seem terribly appealing going forward without massive leaps in the software harness around the LLM. The only way to really perceive the difference is how it changes your life, and we’re long past where a pure model can do that.
Not just software, but also IT infrastructure. Even small questions matter, like “when will AI get advertising?” If an AI model literally got advertising baked straight into the heart of the model, that would make me sad. It means the creators aren’t seeing the same vision.
We’ve talked a lot about the balance between pre-training and post-training, but nobody seems to be talking about the balance between LLMs and their harnesses.
Before we see significant improvement in models, we’re going to need a lot more in:
Probably several other low-hanging areas.
2025-07-19 08:00:00
Feeling behind? Makes sense, AI moves fast. This post will catch you up.
First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.
The last 6 months:
Obviously it is, right?
Back in January, we noticed that when a model does Chain of Thought (CoT) “thinking”, it elicits these behaviors:
All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.
K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn’t have a long thought trace. It was so surprising that I thought I heard wrong and it took a few hours to figure out what the real story was.
For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.
R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.
MCP was a huge jump toward agents. It’s a dumb protocol, leading a lot of people to misunderstand what the point is. It’s just a standard protocol for letting LLMs interact with the world. Emphasis on standard.
The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.
K2 tackled the main problem with MCP. Since it’s standard, that means anyone can make an MCP server, and that means a lot of them suck. K2 used a special system during training that generated MCP tools of all kinds. Thus, K2 learned how to learn how to use tools.
That pretty much covers our current agent challenges.
In math, we made a lot of progress this year in using a tool like a proof assistant. e.g. DeepSeek-Prover v2 was trained to write Lean code and incrementally fix the errors & output. That seemed (and still does) like a solid path toward complex reasoning.
But today, some OpenAI researchers informally announced on X that their private model won gold in the International Math Olympiad. This is a huge achievement.
But what makes it surprising is that it didn’t use tools. It relied on only a monstrous amount of run-time “thinking” compute, that’s it.
Clearly stated: Next token prediction (what LLMs do) produced genuinely creative solutions requiring high levels of expertise.
If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.
Which is better?
On the one hand, Opus-4, Grok 4 & K2 are all huge models that have a depth that screams “intelligence”. On the other hand, agentic workloads are 24/7 and so the cheaper they are, the better.
Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.
The biggest shifts this year have arguably been not in the models but in engineering. The flagship change is the emergence of the term context engineering as a replacement for prompt engineering.
It’s an acknowledgement that “prompt” isn’t just a block of text. It also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was about how managing context between agents is really hard.
Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.
Similarly, sycophancy causes instability in agents. Many are considering it a top problem, on par with hallucination.
And all that is seriously skipping over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Alternately, model development is starting to become product development instead of just research.
What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.
2025-07-18 08:00:00
I’ve avoided this question because I’m not sure we understand what “understanding” is. Today I spent a bit of time, and I think I have a fairly succinct definition:
An entity can understand if it builds a latent model of reality. And:
- Can Learn: When presented with new information, the latent model grows more than the information presented, because it’s able to make connections with parts of its existing model.
- Can Deviate: When things don’t go according to plan, it can use its model to find an innovative solution that it didn’t already know, based on its latent model.
Further, the quality of the latent model can be measured by how coherent it is. Meaning that, if you probe it in two mostly unrelated areas, it’ll give answers that are logically consistent with the latent model.
I think there’s plenty of evidence that LLMs are currently doing all of this.
But first..
Mental model. That’s all I mean. Just trying to avoid anthropomorphizing more than necessary.
This is the most widely accepted part of this. Latent just means that you can’t directly observe it. Model just means that it’s a system for approximating the real world.
For example, if you saw this:
You probably identify it immediately as a sphere even though it’s just a bunch of dots.
A latent model is the same thing, just less observable. Like you might hold a “map” of your city in your head. So if you’re driving around and a street gets shut down, you’re not lost, you just refer to your latent model of your city and plan a detour. But it’s not exactly a literal image like Google maps. It’s just a mental model, a latent model.
From 1979 to 2003, Saddam Hussein surrounded himself with hand‑picked yes‑men who, under fear of death, fed him only flattering propaganda and concealed dire military or economic realities. This closed echo chamber drove disastrous miscalculations—most notably the 1990 invasion of Kuwait and his 2003 standoff with the U.S.—that ended in his regime’s collapse and his own execution.
Just like with Saddam, sycophancy causes the LLM to diverge from its true latent model, which causes incoherence. And so, the amount of understanding decreases.
Otherwise they wouldn’t work.
The word2vec paper famously showed that the embedding of “king - man + woman” is close to the embedding for “queen” (in embedding space). In other words, embeddings model the meaning of the text.
That was in 2013, before LLMs. It wasn’t even that good then, and the fidelity of that latent model has dramatically increased with the scale of the model.
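You can still watch the analogy pop out of off-the-shelf vectors. Here’s a quick sketch using gensim’s pretrained word2vec (this assumes the gensim package and its large one-time download; it’s not from the original paper):

```python
# Sketch: the classic analogy, using gensim's pretrained word2vec vectors.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # large one-time download

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> roughly [('queen', 0.71)]
```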
ICL is when you can teach a model new tricks at runtime simply by offering examples in the prompt, or by telling it new information.
In the GPT-3 paper they showed that ICL improved as they scaled the model up from 125M to 175B. When the LLM size increases, it can hold a larger and more complex latent model of the world. When presented with new information (ICL), the larger model is more capable of acting correctly on it.
Makes sense. The smarter you get, the easier it is to get smarter.
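To make “teaching at runtime” concrete, an ICL prompt is just a few labeled examples followed by a new case. A toy illustration (mine, not from the GPT-3 paper):

```python
# A toy in-context prompt: the two labeled examples teach the task at runtime,
# with no weight updates. A sufficiently large model completes the pattern.
prompt = """Label the sentiment:
Review: "I loved every minute." -> positive
Review: "Total waste of money." -> negative
Review: "The acting was superb." ->"""
# Expected completion: " positive"
```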
When models do Chain of Thought (CoT), they second-guess themselves, which probes their own internal latent model more deeply. In (2), we said that true understanding requires that the LLM can use its own latent model of the world to find innovative solutions to unplanned circumstances.
A recent Jan-2025 paper shows that this is the case.
A large segment of the AI-critical use this argument as evidence. Paraphrasing:
Today’s image-recognition networks can label a photo as “a baby with a stuffed toy,” but the algorithm has no concept of a baby as a living being – it doesn’t truly know the baby’s shape, or how that baby interacts with the world.
This was in 2015 so the example seems basic, but the principle is still being applied in 2025.
The example is used to argue that AI isn’t understanding, but it merely cherry-picks a single place where the AI’s latent model of the world is inconsistent with reality.
I can cherry-pick examples all day long of humans’ mental models diverging from reality. Like when you take a wrong turn down a street and it takes you across town. Or when you thought the charismatic candidate would do good things for you. On and on.
Go the other way: prove that there are areas where the AI’s latent model matches reality.
But that’s dissatisfying, because dolphins have a mental model of the sea floor, and tiny ML models have areas where they do well, and generally most animals have some aspect of the world that they understand.
Why are we arguing this? I’m not sure, it comes up a lot. I think a large part of it is human exceptionalism. We’re really smart, so there must be something different about us. We’re not just animals.
But more generally, AI really is getting smart, to a point that starts to feel more uncomfortable as it intensifies. We have to do something with that.