2025-09-28 20:57:25
I sometimes think about people whose careers started in the ‘90s. They had a roaring decade of economic growth. And even if they did not participate in the dot com boom they still had the opportunity to invest in Google, Amazon or Microsoft at low valuations. They had the potential to generate extraordinary wealth purely by dint of public market investments or buying a house in Palo Alto.
We can contrast that with the 2010s. The decade was roaring again; the stock market actually did quite well. But the truly outsized returns were almost entirely stuck within the private markets. Much of venture capital over the last decade has been privatizing the previously public gains, of a company going from 1 billion, 5 billion, 20 billion to 10, 50, 100 billion market caps or more. In fact the last big IPO was Facebook in 2012, and even that was already outsized, valued at five times what Google was by the time the public could get their hands on it. One of the best trades that has perhaps ever existed was buying its stock when the market cap fell to 300 billion or so a few years ago.
Or, looked at another way, in 1980 the median age of a listed U.S. company was 6 years; today it is 20.
Meanwhile every other major company remains private seemingly endlessly. Even now Stripe remains private, so does Databricks, so does SpaceX … They give their employees liquidity, provide some high fee methods for others to invest via SPVs or futures, even report the occasional metric. And if you want any exposure you better be prepared to pay 5% fees and then probably 2 and 20 on top of it for the SPV.
Now, the number of people investing in the market has gone up, so maybe it’s just alpha erasure. And it’s not to say there are no alpha-generating investments at all. There absolutely have been 10-baggers or more in the public markets; Palantir shot up like crazy. But they’re as few as they are speculative. All the while the number of public companies has itself fallen off a cliff.
But it does tell us why meme stocks became a thing. Right? Speculative mania by itself is nothing new, from tulips to Cisco in 2000, but Tesla is a different animal. As was (is) GameStop! It also explains why crypto is a thing, and why smart 20-year-olds are yoloing their bonus checks into alt-coins or short-expiry options.
It’s because there’s a clear sense of now or never. This was the entire crypto ethos. Don’t build a Telco, create a Telco token! Even the rise of AI heightens this! If you managed to join OpenAI in 2020 you’re a multi-multi-millionaire, you won the lottery. If you didn’t, it’s over. If you combine the workforces of the largest labs they still wouldn’t show up in any aggregate measures.
Back in the days of yore, if you did not manage to get a job at Google in 2005 you could still buy its stock. You had at least the option of gaining from its appreciation, assuming you thought it inevitable. Over the last decade and a half there have been multiple generations who succeeded by getting a job at one of these giants and working their way up, and equally if not more who succeeded by investing in those giants. That’s what brought about the belief that the arc of history trended upwards.
Today, there exists no such option. There are only short-term manic rises, even for the longer-term theses. The closest anyone can get to the AI boom is Nvidia, an old stock, which has shot up as the preferred seller of shovels in this gold rush. The closest anyone can get even at an institutional scale is Situational Awareness, which bought calls on Intel Capital and has also rightfully shot up. These are in effect synthetic lottery tickets the public market was forced to invent because the real lottery, OpenAI equity, is locked. The claim is not that returns vanished, but that access to the tails shifted.
But from the perspective of most people on the street: either you work for one of the large labs, in which case you are paid extraordinarily well, enough to almost single-handedly prop up the US economy, or you are at best treading water. And by the way, the broader solutions to try and fix it, like adding private equity to 401k portfolios, are as risky as they are expensive. Not to mention opaque. The roaring parts of the economy are linked, sure, to the public markets, and the broader economy benefits, but at a distance.
I wrote once about Zeitgeist Farming, a way that seemed to be developing to get rich by betting on the zeitgeist and doing no real work, a seemingly emergent phenomenon in the markets, and it seems to have continued its dominance. And we see the results. It’s the Great Polarisation.
I’m obviously not saying that life sucks or that the folks who missed out are destitute, this is not a science fiction dystopia, far from it, but it is very clear that the fruits of our progress seem fewer and more coarsely distributed. And even when they’re not, the feeling of there being haves and have-nots gets stronger. It might well be that the haves are only a tiny, tiny minority who are doing exceedingly well, while the majority are doing just fine, great even historically speaking, but the “there but for the flip of a coin go I” feeling remains strong.
This is what’s different from the ages before. Physics PhDs went into Wall Street and made billions, but it didn’t feel like they hit a lottery so much as that they were at the top of their profession, a profession that was different, even priestly, in its insularity. AI, rightly or wrongly, doesn’t feel like that.
It doesn’t help that the rhetoric from all the labs is that the end is nigh. The end of all humanity, if you believe some, but at least the end of jobs according to even the more level-headed prognosticators. Leaving aside how right they might end up being, that’s a scary place to be.
While this particular rhetoric is new, it taps into a fear that has existed, latent, inside many over the entire past decade and a half. We all know folks who joined so-and-so company at the right time and rode the valuation up. We also know incredibly smart folks who didn’t, and who didn’t “get their bag”.
The crypto alt-coin bubble might have seemed a cause of the societal sickness, but it’s not. It’s a symptom. A symptom of the fact that to get ahead it feels, viscerally, like you have to gamble.
After all, when life resembles a lottery, what’s left but to play the odds?
2025-09-17 07:59:21
All right, so there's been a major boom in people using AI and also in people trying to figure out what AI is good for. One would imagine the two go hand in hand, but alas. About 10% of the world is already using it. Almost every company has people using it. It’s pretty much all people can talk about on conference calls. You can hardly find an email or a document these days that is not written by ChatGPT. Considering that is the case, there is a question about how good these models actually are. Any yardstick that we have used, whether it's the ability to do math or word problems or logic puzzles or, I don't know, buying a plane ticket online or researching a concert ticket, it has kind of beaten all those tasks, and more.
So, considering that, what is a good way to figure out what they're ultimately capable of? One where the models are actually doing reasonably well, can be mapped on some kind of a curve, and which doesn’t suffer from the “teaching to the test” problem.
And one of the answers there is that you can look at how well it actually predicts the future, right? I mean, lots of people talk about prediction markets and about how you should listen to those people who are actually able to do really well with those. And I figured, it stands to reason that we should be able to do the same thing with large language models.
So the obvious next step was to take a bunch of news items and then ask the model what will happen next. Which is what I did. I called this Foresight Forge because that’s the name the model picked for itself. (It publishes daily predictions with GPT-5; it used to be o3.) I thought I would let it take all the decisions, from choosing the sources to making the predictions to ranking them with probabilities afterwards and doing regular post mortems.
Like an entirely automated research engine.
This work went quite well in the sense that it gave interesting predictions, and I actually enjoyed reading them. It was insightful! Though, like, a bit biased toward positive outcomes. Anyway, still useful, and a herald of what’s to come.
But, like, the bigger question I kept asking myself was what this really tells us about AI’s ability to predict what will happen next. Seeing the predictions is, after all, only a portion of the eval; the rest is understanding them, learning from them, and scoring them.
The key thing that differentiates us is the fact that we are able to learn. If you have a trader who gets better at making predictions, they do that because he or she is able to read about what they did before, use that as a springboard to learn something else, and use that as a springboard to learn something else again, and so on and so forth. There is an actual process whereby you get better over time; it's not that you are some perfect being. It's not even that you predict for a month straight or two months straight and then use all of that together to make yourself smarter or better instantaneously. Learning is a constant process.
And this is something that all of the major AI labs talk about all the time, in the sense that they want continuous learning. They want to get to a point where you can see the models actually get better in real time. That's fairly complicated, but that's the goal, because that's how humans learn.
A short aside on training. One of the biggest thoughts I have about RL, probably about all model training, is that it is basically trying to find workarounds for evolution, because we can’t replay the complexity of the actual natural environment. That environment is super hard to recreate, because it involves not just unthinking rubrics about whether you got your math question right, but also interacting with all the other complex elements of the world, which in its infinite variety teaches us all sorts of things.
So I thought, okay, we should be able to figure this out, because what you need to do is the exact same thing that we do, or that model training does, but do it on a regular basis. Every single day you get the headlines of the day and some articles, you ask the model to predict what's going to happen next, and, keeping things on policy, the very next day you use the information you now have to update the model.
Because I wanted to run this whole thing on my laptop, a personal constraint I imposed so I don’t burn thousands on GPUs every week, I decided to start with a tiny model and see how far I could push it. The interesting part about running with tiny models is that there's only a certain amount of stuff they are going to be able to do. I used Qwen/Qwen3-0.6B on MLX.
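For concreteness, here is roughly what that setup looks like with mlx-lm; a minimal sketch, assuming a recent mlx-lm, with the headline and prompt as placeholders rather than the exact ones Varro uses:

```python
# Minimal sketch: load a tiny model on Apple Silicon with mlx-lm and ask it for a forecast.
# Illustrative only; the real Varro loop wraps this in an RL trainer.
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3-0.6B")

headline = "Placeholder headline: central bank signals a pause on rate hikes"
prompt = (
    "Here is today's headline:\n"
    f"{headline}\n"
    "Write a short paragraph predicting what happens next and how we could verify it."
)

forecast = generate(model, tokenizer, prompt=prompt, max_tokens=180)
print(forecast)
```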
(I also chose the name Varro. Varro was a Roman polymath and author, widely considered ancient Rome's greatest scholar, so seemed like a fitting name. Petrarch famously referred to him as "the third great light of Rome," after Virgil and Cicero.)
For instance, the best way to do this would be to make a bunch of predictions, then the next day look back, see how close you got to some of those predictions, and update your views. Basically the setup for a reward function, if you want to do reinforcement learning.
But there's a problem in doing this, which is that there are only so many ways in which you can check whether you were right or not. You could just use some types of predictions as a yardstick if you'd like; for instance you could go with only financial market predictions and check the next day whether you were accurate or not. This felt too limiting. After all, the types of predictions people make, if they turn out to understand the world a lot better, are not limited to what the price of Nvidia is likely to be tomorrow morning.
Not to mention that also has a lot of noise. See CNBC. You should be able to predict all sorts of things: what would happen in Congress in terms of a vote, or what might happen in terms of corporate behavior in response to a regulation, or what might happen macroeconomically in response to an announcement. So while I set some restrictions on the types of things it could possibly predict, I wanted to leave it open-ended. Especially because leaving it open-ended seemed like the best way to teach a proper world model to even smaller LLMs.
I thought the best way to check the answer was to use the same type of LLM to look at what happened next and figure out whether the prediction got close. Rather obviously in hindsight, I ran into a problem, which is that small models are not very good at acting as an LLM-as-judge. They get things way too wrong. I could’ve used a bigger model, but that felt like cheating (because it would be teaching the smaller model about the world, rather than letting it learn purely from the environment).
So I said okay, I can first teach it the format, and then find some other way to figure out whether it came close to what actually happened the next day. What I thought I could do was use the same method I used with Walter, the RLNVR paper, and see whether semantic similarity might actually push us a long way. Obviously this is a double-edged sword, because you might get something semantically fairly close while having the opposite meaning, or just low quality.
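A minimal sketch of that semantic-similarity reward, using off-the-shelf sentence embeddings; the embedder choice and the clipping are my assumptions, not necessarily what the actual run used:

```python
# Sketch: score a prediction by its cosine similarity to a summary of what actually happened
# the next day. Assumes sentence-transformers; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_reward(prediction: str, next_day_summary: str) -> float:
    """Cosine similarity between prediction and outcome, clipped to [0, 1]."""
    vectors = embedder.encode([prediction, next_day_summary], convert_to_tensor=True)
    similarity = util.cos_sim(vectors[0], vectors[1]).item()
    return max(0.0, similarity)

# The double-edged sword from above: "rates will rise" and "rates will fall" can still
# embed close together, so this reward is noisy and needs format/quality terms beside it.
```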
But since we are working with smaller models, and since the objective is to figure out whether this method will work in the first place, I thought this might be an okay way to start. And that's what we did. The hardest part was figuring out the exact combination of rewards that would actually make the model do what I wanted, and not whatever it wanted to do to maximise the reward by doing weird stuff. For example, you could not ask it for bullet points, because it started echoing instructions, so to teach it thinking and responding you had to have it think in paragraphs.
Long story short, it works (as always, ish). The key question I set out to answer was whether we could have a regularly running RL experiment on a model, such that you can use sparse, noisy rewards coming in from the external world and keep updating it while it still does one piece of work relatively well. While I chose one of the harder ways to do this, by predicting the whole world, I was super surprised that even a small model did learn to get better at predicting the next day's headlines.
I wouldn't have expected it, because there is no logical reason to believe that tiny models can learn enough world-model-type information to do this. It might have been the small sample size, it might have been noise, it might have been a dozen other ways in which this is not perfectly replicable.
But that's not the point. The point is that if this method works even somewhat well, as it did for a tiny, tiny model, then for larger models, where the rewards are better understood, you can probably do on-policy RL pretty easily.
This is a huge unlock. Because what this means is that the world which is filled with sparse rewards can now basically be used to get the models to behave better. There's no reason to believe that this is an isolated incident, just like with the RLNVR paper there is no reason to believe that this will not scale to doing more interesting things.
And since I did the work, I learned that Cursor, the AI IDE, does something similar for their autocomplete model. They take a much stronger reward signal, whether humans accept or reject the suggestions it makes, use it to update the policy, and roll out a new model every couple of hours. Which is huge!
So if Cursor can do it, then what stands in between us and doing it more often for all sorts of problems? Partly just the availability of data, but mostly it’s creating a sufficiently interesting reward function that can teach it something, and a little bit of AI infrastructure.
I'm going to contribute the Varro environment to the Prime Intellect RL hub in case somebody wants to play, and also maybe make it a repo or a paper. But it's pretty cool to see that even for something as amorphous as predicting the next day's headlines, something that is extraordinarily hard even for humans because it is a fundamentally adversarial task, we're able to make strides forward if we manage to convert the task into something that an LLM can understand, learn from, and hill climb. The future is totally going to look like a video game.
In academic work, please cite this essay as: Krishnan, R. (2025, September 16). Prediction is hard, especially about the future. Strange Loop Canon. https://www.strangeloopcanon.com/p/prediction-is-hard-especially-about
See if you can spot which day it changed
Anyway, the way we do it is to create a forecast that is a short paragraph with five beats: object, direction plus a small magnitude, a tight timeframe, named drivers, and a concrete verification sketch. That house style gives us a loss function we can compute. Each day: ingest headlines → generate 8 candidates per headline → score (structure + semantics; truth later) → update the policy via GSPO.
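Written out, the daily loop is something like the sketch below; every function here is a hypothetical stub standing in for the real component (news ingestion, the MLX policy, the structure-plus-semantics scorer, the GSPO trainer):

```python
# Sketch of the daily cadence: ingest -> 8 candidates per headline -> score -> GSPO update.
# All of these are stand-in stubs; swap in the real components to run it for real.
import random

N_CANDIDATES = 8  # rollouts per headline

def fetch_headlines(day):                  # stub: real version pulls the day's sources
    return [f"headline {i} on {day}" for i in range(3)]

def generate_forecast(policy, headline):   # stub: real version samples from the policy model
    return f"forecast for '{headline}'"

def score_forecast(forecast, headline):    # stub: real version mixes format + semantic rewards
    return random.random()

def gspo_update(policy, batch):            # stub: real version is a group-relative update with a KL leash
    return policy

def daily_step(policy, day):
    batch = []
    for headline in fetch_headlines(day):
        candidates = [generate_forecast(policy, headline) for _ in range(N_CANDIDATES)]
        rewards = [score_forecast(c, headline) for c in candidates]
        batch.append((headline, candidates, rewards))
    return gspo_update(policy, batch)
```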
Across runs the numbers tell a simple story.
COMPOSITERUN (one-line schema): quality 0.000, zeros 1.000, leak 0.132, words 28.9. The template starved learning.
NEWCOMPOSITERUN (paragraphs, looser): quality 0.462, zeros 0.100, leak 0.693, words 124.5. Gains unlocked, hygiene worsened.
NEWCOMPOSITERUN2 (very low KL): quality 0.242, zeros 0.432, leak 0.708, words 120.8. Under-explored and under-performed.
SEMANTICRUN (moderate settings): quality 0.441, zeros 0.116, leak 0.708, words 123.8. Steady but echo-prone.
SEMANTICRUN_TIGHT_Q25 (tight decoding + Q≈0.25): quality 0.643, zeros 0.013, leak 0.200, words 129.2. Best trade-off.
The daily cadence was modest but legible. I ran a small Qwen-0.6B on MLX with GSPO, 8 rollouts per headline, typically ~200–280 rollouts/day (e.g., 32×8, 31×8). The tight run trained for 2,136 steps with average reward around 0.044; KL floated in the 7–9 range on the best days, balancing stability with exploration. Entropy control really matters. The working recipe: paragraphs with five beats; LLM=0; Semantic≈0.75; Format(Q)≈0.25; sampler=tight; ~160–180 tokens; positive 3–5 sentence prompt; align scorer and detector. If ramble creeps in, nudge Q toward 0.30; if outputs get too generic, pull Q back.
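The same recipe, written down as a config for easy tweaking; the key names are mine and the values approximate:

```python
# The working recipe from the tight run, as an illustrative config.
RECIPE = {
    "reward_weights": {"llm_judge": 0.0, "semantic": 0.75, "format_q": 0.25},
    "rollouts_per_headline": 8,
    "max_new_tokens": 170,     # ~160-180 worked best
    "sampler": "tight",        # low temperature / top-p to curb rambling
    "prompt": "positive, 3-5 sentences, paragraph with five beats",
}
# Rule of thumb: if outputs ramble, nudge format_q toward 0.30; if they get too generic, pull it back.
```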
2025-08-25 21:55:27
I usually work with three monitors. A few days ago, as I was looking across the usual combination of open documents, Slack, WhatsApp, and assorted Chrome windows, I noticed something.
Somehow, over the past few weeks (months, maybe) portions of my screens had gotten taken over by multiple Terminals. It’s not because I do a lot of development, it’s because every project I have or work on is now linked with AI agents in some way, shape, or form. Even when I want to write a report, or analyse a bunch of documents, or do some wonky math, or search my folders to find out the exact date I bought my previous home for some administrative reason.
A part of this is that people ask occasionally how I use AI and I struggle to answer because it’s integrated with roughly everything that I do. Almost anything I do on the computer now involves LLMs somewhere in the chain.
I was thinking about this again over the weekend because there’s a lot of discussion about what the future will look like.
As agents are getting better at doing long duration tasks it's also becoming more important to see what they're doing, respond to their requests and questions, and where needed, intervene.
This has implications for what work looks like in the future. There’s already the belief that many of us are doing bullshit jobs, which is patently false but highly prevalent. That belief exists because many of our tasks are not of an “I can easily link the output to a metric I care about” variety. It’s a statement of our ignorance, not about reality.
But it is true that many jobs we do today would seem incomprehensible to people a couple decades ago. And we can extrapolate that trend going forward.
What this means is that most jobs are going to become, externally, individual contributor roles where the person is actually acting as a manager. I wrote recently:
The next few years are going to see an absolute “managerial explosion” where we try to figure out better rubrics and rating systems, including using the smartest models to rate themselves, as we train models to do all sorts of tasks. This whole project is about the limits of current approaches and smaller models.
This is true, but it’s too anodyne. So I wanted to visualise it for myself, just to make things more real. What does it “feel” like, to be in command of a large number of agents? The agents would constantly be doing things that you want them to and you’d have to be on top of them, and the other humans you interact with, to make sure things got done properly.
So I made a dashboard to try and visualise what this might look like.
This is a fundamentally different view of work. It is closer to videogames. Constant vigilance! A large number of balls in the air at all times. Ability to juggle context, respond to idiosyncratic errors, misunderstandings. And able to respond quickly.
These are normally managerial tasks. And that too only if you’re a very good manager! I’m sure you are, or you’ve seen, the people with a phone in hand, furiously typing while at the park or walking to their car. People who deal with multiple emails and messages and Slack pings and phone calls and Zooms on a regular basis, often alt-tabbing from one to the next.
Some of this alt-tabbing will involve what we might call “real work”. To help intervene in things that the AI gets wrong. To answer questions from other employees or customers. To provide more context, to figure out where to pay attention, to get things unstuck.
To help do this there will be logs of what was done before, the KPIs that you’d set up, edit, adjust, update and monitor continuously. The reporting of those will also be done by AI agents. You’d watch them as your Fleet.
You might change the throttling up top to speed up or slow down particular parts of the organisation, like a conductor, both to manage resources and to manage smooth delivery. Everything runs as a web of interactions and you’re in the middle, orchestrating it all.
You’d of course be interacting with plenty of other orchestrators too. Maybe in your own organisation, or maybe in others. There will be many layers and subnetworks to consider.
This also has some downstream effects. It means all jobs will have an expiration date. You might get hired to do things, but as soon as what you do gets “learnt” by an AI agent it can get systematised and automated. It means every job becomes a project.
This can be seen as dystopian, I can just imagine the Teamsters reacting to this, but it’s the same dance every white collar job has gone through in the last two decades, just sped up.
What this future shows is that the future of work will look a lot more like rapid fire management. Ingest new information, summarise, compare things to policy, request more docs where needed, reconcile ledgers, sync feeds, chase POs, quote to cash, so on and on. Each of those and hundreds more would be replaced, or at least massively augmented, by agents.
This isn’t a seamless transition. The world of engineering is filled with people who somehow hate having been promoted from coder to manager. The requirement to split attention, constant vigilance, the intellectual burden of being “always on”, these are all added skillsets that aren’t being taxed today for almost anyone2.
This is already the case. Claude Code spawns sub agents. Codex and Cursor have background tasks. People routinely run many of these in parallel and run projects by alt-tabbing in their mind and surfing twitter in their down times. While these are for coding, that will change with time. Any job that can be sufficiently sliced into workstreams will suffer the same fate. We’re all about to be videogame players.
Note that I’m not making any claims about superintelligence, only about the intelligence required to automate “quote to cash”.
I have a friend who is highly successful in the valley but doesn’t answer Slack messages. If anything is truly urgent people phone him, or he checks emails at specific hours and responds. He has a system, in other words, to deal with the chaos that management brings with it. Others have other systems, where whether they’re at Costco or Disney World they can’t help but answer when the phone pings. We all will have to figure out our own equilibria.
2025-08-24 01:59:47
So, LLMs suck at Twitter. It’s kind of poetic, because Twitter is full of bots. But despite sometimes trying to be naughty and sometimes trying to be nice, they mostly still suck. AI does remarkably well at some tasks and terribly at others. And writing is one of the hardest.
My friend and I were joking about this, considering words are at the very core of these miraculous machines, and thought hey wouldn’t it be nice if we could train a model to get better? We were first wondering if one could create an AI journalist that could actually write actual articles with actual facts and arguments and everything. Since we were thinking about an AI that could write, we called it Walter. Because of Bagehot. And Cronkite. We thought it had to be plausible, at least at a small scale. Which is why we tried the experiment (paper here)1.
This is particularly hard in a different way from math or coding, because how do you even know what the right answer is? Is there one? To get to a place where the training is easier and the rewards are richer, we thought of trying to write tweet sized takes on articles. So, Walter became a small, cranky, surprisingly competent engine that ingests social media data about articles, sees how people reacted, and trained itself via reinforcement learning to write better2.
As Eliot once said, “Between the idea / And the reality / … falls the Shadow.” This was us trying to light a small lamp in there using RLNVR: our cheeky acronym for “reinforcement learning from non-verified rewards”.
Now, why small models? Well, a big reason, beyond being GPU poor, is that big models are resilient. They're like cars with particularly powerful shock absorbers, they are forgiving if you make silly assumptions. Small models are not. They are dumb. And precisely because they are dumb, you are forced to be smart.
What I mean is that if you really want to understand something, the best way is to try and explain it to someone else. That forces you to sort it out in your own mind. And the more slow and dim-witted your pupil, the more you have to break things down into more and more simple ideas. And that’s really the essence of programming. By the time you’ve sorted out a complicated idea into little steps that even a stupid machine can deal with, you’ve certainly learned something about it yourself. The teacher usually learns more than the pupil3.
This also makes reward modelling particularly interesting. Because anytime you think you have come up with a good reward model, if there is any weakness or flaw in how you measure your reward, a small model will find it and exploit it ruthlessly. Goodhart’s Law is not just for management.
This is not to say that only small models do that; of course we have seen large models reward hack and learn lessons they were not meant to. But it is fascinating to see a 500 million parameter model learn that it can trick your carefully designed evaluation rubric just by outputting tokens just so. It drives home just how powerful transformers actually are, because it doesn't matter how complicated a balanced scorecard you create; they will find a way to hack it. Tweaking specific weights given to different elements, fighting with a sampling bias towards articles with enough skeets, penalties and thresholds for similarities … all grist for their mill.
We should also say, social media engagement data is magnificently broken as a training signal. It’s sort of “well known”, but it’s hard to imagine exactly how bad until you try and use it. We first ingested Bluesky skeets plus their engagement signals (likes, reposts, replies). Since we wanted actual signal, we decided to use the URL as the organizer: we group all the skeets that point at the same URL, then ask the model to produce a fresh skeet for that article. For the reward, we use embeddings to calculate the most similar historic posts (this worked best), then sanity check, and then rank based on how well those posts did.
The outside world in this instance, as in many, has its problems. For instance:
Bias. Big accounts seem “better,” in that they get more, and more interesting, reactions than small accounts that post very similar things. The Matthew Effect holds true in social media. To solve that, we had to do baseline normalization: score a post relative to its author’s usual. Raw engagement minus the author’s baseline turns “how big is your account?” into “was this unusually good for you?”.
Sparsity. You get one post and one outcome, not ten A/B variants. And for that we tried max-based semantic transfer: For a new post, find the single most similar historical post about the same article and reward the similarity to that top performer. The max transfer mattered more than we expected. In this domain, the right teacher is a specific great prior, not the average of pretty‑good priors.
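Sketched in code, the two fixes combine into a single reward; the function names, the embedder, and the rescaling are illustrative rather than the paper's exact implementation:

```python
# Sketch: baseline-normalize engagement per author, then transfer via the single most
# similar historical post about the same URL (max-based semantic transfer).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def normalized_score(post_engagement: float, author_baseline: float) -> float:
    """'Was this unusually good for this author?' rather than 'how big is the account?'"""
    return max(0.0, post_engagement - author_baseline)

def rlnvr_reward(candidate: str, history: list[tuple[str, float]]) -> float:
    """history: (historical post text, normalized score rescaled to [0, 1]) for the same URL."""
    cand = embedder.encode(candidate)
    best = 0.0
    for text, weight in history:
        hist = embedder.encode(text)
        sim = float(np.dot(cand, hist) / (np.linalg.norm(cand) * np.linalg.norm(hist) + 1e-8))
        best = max(best, sim * weight)   # imitate the single best prior, not the average one
    return best
```

The worked example in the footnotes (similarity 0.82 × weight 0.9 → reward ≈ 0.74) is exactly this computation.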
But this messy, biased, sparse signal is the only feedback that exists. The world doesn't hand out clean training labels. It hands you whatever people actually do, and you have to figure out how to learn from that.
Together, this turned a one-shot, messy outcome into a dense signal. We used GRPO first to train, though later we upgraded to train with GSPO with clipping and a KL leash to keep voice anchored4. We also added UED (Unsupervised Environment Design) so the curriculum self-organizes: to pick link targets where the policy shows regret/variance, and push there5.
Before training, the model usually hedged and link-dumped and added a comical number of hashtags. After training it was clearly much better. It proposed stakes, hinted at novelty, and tagged sparingly. When we A/B tested the same URL, the trained outcome is the one you’d actually post. Example:
Before (the base model): 🚀 SpaceX's Starship successfully landed at Cape Canaveral! 🚀 #SpaceX #Starship #CapeCanaveral Landing 🚀 #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX #SpaceX
After (the trained model): 🚀 SpaceX's Starship has successfully landed at Cape Canaveral, marking a key milestone toward future missions. #SpaceX #Starship #landing #Mars
LLMs really love adding hashtags to tweets. In the short runs those didn’t entirely disappear, but they did reduce a lot, and got better. Still, I admit I do have a soft spot for the first one for its sheer enthusiasm! Similarly, just for fun, here’s one about tariffs:
Before: A major retro handheld maker has stopped all U.S. shipments over tariffs… #retrohandheld #retrohandheld #retrohandheld #tariffs #trade
After: 🎮 A top retro handheld brand just paused U.S. shipments due to tariffs. Big ripple for imports, modders, and collectors. What’s your go-to alternative? #retrogaming #tariffs
But the most interesting part for us was that the pattern extends anywhere you have weak, messy signals, which is, well, most of real life. So the ideas here should theoretically also extend to other fields:
Creative writing: optimize for completion/saves; transfer from prior hits.
Education: optimize for retention/time-on-task; transfer from explanations that helped.
Product docs/UX: optimize for task completion/helpfulness; baseline by product area and release.
Research comms: optimize for expert engagement/citations; baseline by venue/community.
Take the raw data; normalize away obvious bias; transfer what worked via similarity however you want to calculate or analyse that; keep the loop numerically stable; and add small, legible penalties to deter degenerate strategies. And be extremely, extremely, vigilant about the model reward hacking. In subtle and obvious ways this will happen, it’s closer to crafting a story than writing a program. It also gives you a visceral appreciation of the bitter lesson, and makes you aware of the voracious appetite of these models to learn anything that you throw at them by any means necessary.
The next few years are going to see an absolute “managerial explosion” where we try to figure out better rubrics and rating systems, including using the smartest models to rate themselves, as we train models to do all sorts of tasks. This whole project is about the limits of current approaches and smaller models. When GPT-5 writes good social posts6, you can't tell if it learned general principles or just memorized patterns.
When a 500M model succeeds at a tiny task, all offline on your laptop where you mostly surf Twitter, it feels kind of amazing. Do check out the paper. It feels like intelligence truly can be unbounded, and we will soon have a cyberpunk world where models are run anywhere and everywhere for tasks both mundane and magnificent.
After writing this we came across the recent Gemini 2.5 report, echoing the same instinct at a very different scale: tight loops that let models learn from imperfect, real interactions. Which was cool!
Note that “better” here does not only mean “optimize engagement at all costs.” Instead it’s the far more subtle “learn the latent rubric of what reads well and travels in this odd little medium.”
“It would be hard to learn much less than my pupils without undergoing a prefrontal lobotomy.”
Maybe an example can help. Ten people posted the same article about SpaceX. Normalize each author’s engagement by their baseline (e.g., 45 vs 20 → +25; 210 vs 200 → +10; 12 vs 5 → +7). Embed all posts. For a new candidate, compute cosine similarity to each and take max(similarity × normalized weight). If the best match has sim 0.82 and weight 0.9, reward ≈ 0.74. No live A/B; the signal comes from “be like the best thing that worked.”
Early training followed the classic arc: diverse exploration → partial convergence → collapse risk. With GSPO-style normalization, a small KL guardrail, and light penalties, the loop stays open and outputs nudge toward historical winners.
2025-07-28 22:46:04
We are going to get ads in our AI. It is inevitable. It’s also okay.
OpenAI, Anthropic and Gemini are in the lead for the AI race. Anything they produce also seems to get copied (and made open source) by Bytedance, Alibaba and Deepseek, not to mention Llama and Mistral. While the leaders have carved out niches (OpenAI is a consumer company with the most popular website, Claude is the developer’s darling and wins the CLI coding assistant), the models themselves are becoming more interchangeable amongst them.
Well, not quite interchangeable yet. Consumer preferences matter. People prefer using one vs the other, but these are nuanced points. Most people are using the default LLMs available to them. For someone who isn’t steeped in the LLM world and watching every move, the model selection is confusing and the differences between the models sound like so much gobbledegook.
One solution is to go deeper and create product variations that others don’t, such that people are attracted to your offering. OpenAI is trying with Operator and Codex, though I’m unclear if that’s a net draw, rather than a cross sell for usage.
Gemini is also trying, by introducing new little widgets that you might want to use. Storybook in particular is really nice here, and I prefer it to their previous knockout success, which was NotebookLM.
But this is also going to get commoditised, as every large lab and many startups are going to be able to copy it. This isn’t a fundamental difference in the model capabilities after all, it’s a difference in how well you can create an orchestration. That doesn’t seem defensible from a capability point of view, though of course it is from a brand point of view.
Another option is to introduce new capabilities that will attract users. OpenAI has Agent and Deep Research. Claude has Artefacts, which are fantastic. Gemini is great here too, despite their reputation: it also has Deep Research, but more importantly it has the ability to talk directly to Gemini Live, show yourself on a webcam, and share your screen. It even has Veo 3, which can generate videos with sound today.
I imagine much of this will also get copied by other providers if and when these get successful. Grok already has voice and video that you can show to the outside world. I think ChatGPT also has it but I honestly can’t recall while writing this sentence without looking it up, which is certainly an answer. Once again these are also product design and execution questions about building software around the models, and that seems less defensible than even the model building in the first place.
Now, if the orchestration layers will compete as SaaS companies did over consumer attraction and design and UX and ease and so on, the main action remains the models themselves. We briefly mentioned they’re running neck and neck in terms of the functionality. I didn’t mention Grok, who have billions and have good models too, or Meta who have many more billions and are investing it with the explicit aim of creating superintelligence.
Here the situation is more complicated. The models are decreasing in price extremely rapidly. They’ve fallen by anywhere from 95 to 99% or more over the last couple years. This hasn’t hit the revenues of the larger providers because they’re releasing new models rapidly at higher-ish prices and also extraordinary growth in usage.
This, along with the fact that we’re getting Deepseek R1 and Kimi-K2 and Qwen3 type open source models indicates that the model training by itself is unlikely to provide sufficiently large enduring advantage. Unless the barrier simply is investment (which is possible).
What could happen is that the training gets expensive enough that these half dozen (or a dozen) providers decide enough is enough and say we are not going to give these models out for free anymore.
So the rise in usage will continue but if you’re losing a bit of money on models you can’t make it up in volume. So it’ll tend down, at least until some equilibrium.
Now, by itself this is fine. Because instead of it being a saas-like high margin business making tens of billions of dollars it’ll be an Amazon like low margin business making hundreds of billions of dollars and growing fast. A Costco for intelligence.
But this isn’t enough for owning the lightcone. Not if you want to be a trillion-dollar company. So there have to be better options. They could try to build new niches and succeed, like a personal device, or a car, or computers, all hardware-like things which can get you higher margins if the software itself is being competed away. Even cars! Definitely huge and definitely being worked on.
And they’re already working on that. It will have uncertain payoffs, big investments, and strong competition. Whether it will be a true new thing or just another layer built on top of existing models remains to be seen.
There’s another option, which is to bring the best business model we have ever invented into the AI world. That is advertising.
It solves the problem of differential pricing, which is the hardest problem for all technologies but especially for AI, which will see a few providers who are all fighting it out to be the cheapest in order to get the most market share while they’re trying to get more people to use it. And AI has a unique challenge in that it is a strict catalyst for anything you might want to do!
For instance, imagine Elon Musk using Claude to have a conversation, the answer to which might well be worth trillions of dollars to his new company. If he only paid $20 for the monthly subscription, or even $200, he would be grossly underpaying for the privilege of being provided with that conversation. It’s presumably worth 100 or 1000x that price.
Or if you're using it to just randomly create stories for your kids, or to learn languages, or if you're using it to write an investment memo, those are widely varying activities in terms of economic value, and surely shouldn't be priced the same. But how do you get one person to pay $20k per month and another to pay $0.20? The only way we know how to do this is via ads.
And doing it helps in another way - it lets you open up even your best models, even if rate limited, to a much wider group of people. Subscription businesses are a flat edge that only captures part of the pyramid.
We can even calculate its economic inevitability. Ads have an industry mean CPC (cost per click) of $0.63. Display ads have click-through rates of 0.46%. If tokens cost $20/1M for completion, and average conversations have 150 counted messages of 400 tokens each, that means you’d have to make $1.9 or thereabouts in CPC just to cover the API cost. Now, the API cost isn’t the cost to OpenAI, but it means that for the same margins or better they’d have to roughly triple the industry CPC.
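The back-of-the-envelope version of that arithmetic, assuming one ad impression per message (my assumption, purely to make the numbers computable):

```python
# Break-even CPC for chat ads under the numbers in the text.
price_per_million_tokens = 20.0     # $ per 1M completion tokens
tokens_per_message = 400
messages_per_conversation = 150
ctr = 0.0046                         # 0.46% click-through rate
industry_cpc = 0.63                  # $ per click, industry mean

api_cost = messages_per_conversation * tokens_per_message / 1e6 * price_per_million_tokens
expected_clicks = messages_per_conversation * ctr   # assumes one ad shown per message
breakeven_cpc = api_cost / expected_clicks

print(f"API cost per conversation: ${api_cost:.2f}")                          # ~$1.20
print(f"Break-even CPC: ${breakeven_cpc:.2f}")                                 # ~$1.7-1.9 depending on assumptions
print(f"Multiple of the industry mean: {breakeven_cpc / industry_cpc:.1f}x")   # roughly 3x
```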
Is it feasible for token costs to fall by another 75%? Or for ads via chat to have higher conversion than a Google display ad? Both seem plausible. Long‑term cost curves (Hopper to Blackwell, speculative decoding) suggest another 3× drop in cash cost per token by 2027. And this holds not just for product sales, but also for news recommendations or even service links.
And what would it look like? Here’s an example. The ads themselves are AI generated (4.1 mini) but you can see how it could get so much more intricate! It could:
Have better recommendations
Contain expositions from products or services or even content engines
Direct purchase links to products or links to services
Upsell own products
Have a second simultaneous chat about the existing chat
A large part of purchasing already happens via ChatGPT or at least starts on there. And even if you’re not directly purchasing pots or cars or houses or travel there’s books and blogs and even instagram style impulse purchases one might make. The conversion rates are likely to be much (much!) higher than even social media, since this is content, and it’s happening in an extremely targeted fashion. Plus, since conversations have a lag from AI inference anyway, you can have other AIs helping figure out which ads make sense and it won’t even be tiresome (see above!).
I predict this will work best for OpenAI and Gemini. They have the customer mindshare. And an interface where you can see it, unlike Claude via its CLI. Will Grok be able to do it? Maybe, they already have an ad business via X (formerly Twitter). Will it matter? Unlikely.
And since we'll be using AI agents to do increasingly large chunks of work we will even see an ad industry built and focused on them. Ads made by AI to entice other AIs to use them.
Put all these together and I feel ads are inevitable. I also think this is a good thing. I know this pits me against much of the prevailing wisdom, which thinks of ads as a sloptimised hyper evil that will lead us all into temptation and beyond. But honestly, whether it’s ads or not, every company wants you to use their product as much as possible. That’s what they’re selling! I don’t particularly think of Slack optimising the sound of its pings, or games A/B testing the right upskill level for a newbie, as immune to the pull of optimisation just because they don’t have ads.
Now, a caveat. If the model providers start being able to change the model output according to the discussion, that would be bad. But I honestly don't think this is feasible. We're still in the realm where we can't tell the model to not be sycophantic successfully for long enough periods of time. People are legitimately worried, whether with cause or not, about the risk of LLMs causing psychosis in the vulnerable.
So if we somehow created the ability to perfectly target the output of a model to make it such that we can produce tailored outputs that would a) not corrupt the output quality much (because that’ll kill the golden goose), and b) guide people towards the products and services they might want to advertise, that would constitute a breakthrough in LLM steerability!
Instead what’s more likely is that the models will try to remain ones people would love to use for everything, both helpful and likeable. And unlike serving tokens at cost, this is one where economies of scale can really help cement an advantage and build an enduring moat. The future, whether we want it or not, is going to be like the past, which means there’s no escaping ads.
Being the first name someone recommends for something has enduring consumer value, even if a close substitute exists. Also the reason most LLM discourse revolves around 4o, the default model, even though the much more capable o3 model exists right in the drop down.
Also, Claude going enterprise and ChatGPT going consumer wasn’t something I’d have predicted a year and half ago.
2025-07-16 22:30:32
I.
I’m not a population expert, but there’s a ticking time bomb. Almost everywhere in the world, pretty much without exception, has lower birth rates than it used to. In fact, most of the world is below replacement (a TFR of 2.2 or 2.1, depending on where you live). This is true in the US. In Europe. In Australia. Singapore. Japan. Korea. It’s falling even in India, South East Asia, Latin America. It’s quite possible that despite the heroic efforts from Africa, we might be at replacement TFR insofar as the world is concerned right now.
And this is likely to continue. Today’s <15 cohort guarantees rising absolute births through ~2040 even if TFR = 1.7, but the trend is rather clear, just looking at the above numbers. Depending on which numbers you believe, the global population will peak at 9-10 billion in the 2050s, then start dropping.
The reason this is a problem is that people, young working age people, are the lifeblood of the economy. A few repercussions of this population pyramid inversion:
The IMF’s medium projection, assuming a Cobb-Douglas world, cuts both the level and the growth rate of aggregate GDP - maybe a 1% hit to global GDP growth annually (a toy version of the arithmetic is sketched after this list)
In the OECD the retiree:worker ratio roughly doubles by 2050 - this will necessitate a 5% fiscal tightening, or more debt
With fewer workers and more retirees we will see savings decumulate, because retirees spend more and save less1, and this will hit interest rates
And per Jones’ idea-production thesis, fewer young workers and researchers mean slower idea generation. The OECD estimate is around 0.3% off annual TFP growth.
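To make the Cobb-Douglas point concrete, a toy growth-accounting sketch; every parameter value below is illustrative, not the IMF’s:

```python
# Toy Cobb-Douglas growth accounting: Y = A * K^alpha * L^(1-alpha)
# so g_Y ≈ g_A + alpha * g_K + (1 - alpha) * g_L. Watch what a swing in labour growth does.
alpha = 0.35     # capital share (illustrative)
g_A = 0.010      # TFP growth
g_K = 0.020      # capital growth

for g_L in (0.010, 0.000, -0.005):   # labour force growing, flat, shrinking
    g_Y = g_A + alpha * g_K + (1 - alpha) * g_L
    print(f"labour growth {g_L:+.1%} -> GDP growth {g_Y:.1%}")
```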
This is obviously scary for multiple reasons.
Lower economic growth and asset reallocation of that nature bring with them a rather uncomfortable shift in how people live
Per capita GDP might be less affected in the aggregate, since capital deepening might offset
And if this continues for a long while, there’s the doomer scenario of “voluntary extinction”
(For example, it makes sense that as population declines we will hit a breaking point for the economy. If demand reduces, which is literally what will happen if there are fewer people, then that will affect prices. If labour growth is negative, then overall output growth will also be negative. And these fewer working-age adults will need to take care of us old fogeys in much larger proportion when we are older.
The OECD will see pension cashflows turn negative by 2030. The global labour force will peak maybe a decade after that? The long-term healthcare bill for senior citizens will explode another decade after that.)
Some worry that this trend is even more apocalyptic. That soon, through the inexorable rules of mathematics, a below-replacement fertility rate will result in fewer and fewer people until we’re effectively depopulated.
It’s bad enough that people, smart successful people, are actually contemplating ideas like “let’s not send people to college” in a Handmaid’s Tale-esque chain of thought. Just as 1980-2020 saw a demographic dividend, 2020-2050 will see a demographic drag.
II.
There are lots of reasons people bandy about. Childcare is more and more expensive. Hell, life is more and more expensive. Healthcare is expensive. Housing is expensive. Education is expensive. Opportunity cost of taking your kids strawberry picking on a Sunday is expensive. Etc.
All of which is also true.
The reasons why TFR is trending lower seem stubborn. No matter what we seem to do it doesn’t seem to reverse. But the economist in me looks at this unbounded curve and asks, “where’s the equilibrium”. Or rather, what are the conditions under which we will likely see the TFR tick back up, to 2.1 or 2.2, and get us to a stable population.
From a review that was published on the fertility question (bold mine):
Our read of the evidence leads us to conclude that the decline in fertility across the industrialized world – including both the rise in childlessness and the reduction in completed fertility – is less a reflection of specific economic costs or policies, but rather, a widespread re-prioritization of the role of parenthood in people’s adult lives. It likely reflects a complex combination of factors leading to “shifting priorities” about how people choose to spend their time, money, and energy. Such factors potentially include evolving opportunities and constraints, changing norms and expectations about work, parenting, and gender roles, and the hard-to-quantify influences of social and cultural factors.
So, at a glance, we’ll need four conditions as I see it:
Cost of having kids has to collapse
Work and family stop being competitive
Cultural status has to shift
Women face less risk from having kids
Now, having kids is basically equivalent to spending $20k a year or something like that through their childhood, if you’re trying for private schools or nannies and vacations and whatnot. The U.S. USDA estimate is $310-340k lifetime for a middle-class family, ages 0-17. Yes, an undeniably privileged view, but that’s the reality of why many are not having kids in the first place. When the median cost of raising a couple of kids is half a million or more, that shows up!
The first question that gets asked is, can government subsidies help? We can sort of see from the data. Korea, Hungary, France and Singapore already burn 3-6 % of GDP on baby bonuses, tax breaks and housing perks. They buy at most +0.1–0.2 births, sometimes after an initial bump. That’s not a big boost.
Hungary spends ~5% of its GDP on incentives yet slipped back to a 1.38 TFR once the novelty wore off, because status never shifted and the underlying costs stayed high. I’m going to just assume the shift will happen at a global, or at least a largely regional, scale however, because the alternative feels too much like the earth turning into those clubs I never went to when I was in my 20s.
Italy introduced a universal child allowance in 2022 and it had no real impact on its falling TFR.
What about other costs? Housing has to get cheaper, so you can afford the 4-bedroom house to raise your little ones in. As demand reduces, so should prices. Instructively, Japan hit the “housing turns negative” wall in 1991, and house prices dropped 55% over the following 15 years. China, arguably, entered the same zone in 2022. Also, at some point we will surely make it legal to build more things, if only because the richer, older building magnates die out and the NIMBY movement gets starved of oxygen. This should help reduce the burden of bringing another child into the world.
And despite Japan, a $10 k fall in prices lifts fertility for renters by ~2.4%.
As labour gets more scarce, will work also get more flexible for parents? I’d imagine so. Full-wage parental leave or shorter work weeks for parents seem like they will make a difference at the margin. Success stories remain microscopic today: a few French civil-service tracks, some Nordic municipalities. But if we can scale that globally, maybe TFR moves +0.3? Seems plausible.
Third, culture. This is my blind spot. I can’t quite conceive of people who seem to not think of having children as a “good thing”. I’m assured they exist. But despite this, if the pronatalist movement can push anything at the margins, how can it not come back! Surely the “child-free to save the planet” idiots have to lose status.
France is the success case here, in a “one-eyed man is king in the land of the blind” sense, because they have the highest TFR in Europe, seemingly mostly through culture. And at least anecdotally the French don’t seem to think of having kids as a burden, and are far more in favour of free-range parenting than anywhere else I’ve been. They added roughly +0.3 to TFR compared to the European average. Seems good!
Culture is an incredibly important point, because without it you have to contend with data like this, where Latin American countries fell from above US TFR to below seemingly in less than a decade!
Then there’s the biotech world. Artificial wombs, super cheap IVF, partial ectogenesis, other things that are incredible to think of and difficult to bank on, but plausible.
If the last two exist, that can easily add +0.5 to the TFR. (Assume some adoption of ectogenesis and some increase in the probability of a birth, along with a general push higher due to culture, and 0.5 is feasible. Israel, for instance, did +0.8 pretty much purely through culture.)
To recap, we said 4 factors:
Slash cost of having kids - say +0.3
Make housing (etc) affordable - say +0.2
Cultural pro natalist shift - say +0.2
Biotech - say +0.3
Which means that adding all four can get us back to a 2.1-ish stage. At this point I thought it would be nice to wow you with an equation, so here it is if you’d like to play with it yourself. It doesn’t matter that much either way, but it is nice to model things out if you want to.
Where C is the cost collapse, W the work-family détente, S the status flip and B biotech. If we use those parameters, then the TFR bottoms out near 1.65 in the early 2040s and crosses back over replacement about a decade later. If you drop the biotech lever to something like 0.1, or even delay its arrival till 2090, then the year we hit replacement TFR slips to the 2060s. (If you run the model naively thereafter it also pops back up to 2.6 and stays there, but I don’t trust it that much.)
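A minimal sketch of the kind of model this implies, with the four levers phasing in over time; the baseline drift, start years and phase-in windows below are my illustrative guesses, there purely so you can play with the shape:

```python
# Toy TFR model: a declining baseline plus four levers, each phasing in linearly over a window.
def phase_in(year, start, duration):
    return min(max((year - start) / duration, 0.0), 1.0)

def tfr(year, C=0.3, W=0.2, S=0.2, B=0.3, biotech_start=2045):
    baseline = 1.75 - 0.012 * (year - 2025)            # secular drift downward absent any levers
    levers = (C * phase_in(year, 2035, 15) +            # cost of kids collapses
              W * phase_in(year, 2038, 15) +            # work/family detente
              S * phase_in(year, 2040, 15) +            # cultural status shift
              B * phase_in(year, biotech_start, 10))    # biotech: cheap IVF, ectogenesis
    return baseline + levers

for y in range(2025, 2061, 5):
    print(y, round(tfr(y), 2))
# Shrinking B to 0.1, or pushing biotech_start out by decades, is the "replacement slips" scenario.
```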
Yeah we’ll need to get enough automation to push the labour productivity up enough to make up for labour shortage. We’ll need real housing construction to drop a lot! And we might need to double again the spending that even the bigger governments are doing to encourage their families to have more kids. All of which seem plausible?
III.
This is all very well to say, but how will we fund it? We can break the 4 components down into a few actual policies that I’ve seen floating around. Starting naively:
Kids get a massive allowance - like $1k per child per month.
We also give that to stay-at-home spouses. We give the same to the head of the family as a tax credit or something, and double the tax on singles over the age of 25 to compensate.
Make pro-natalism cool (i.e., well-intentioned govt propaganda, say 2x what we spend on anti-drug PSAs)
Let’s even take away pensions from folks with <2 kids, that’s about 76% of family households and/or about 35% of US adults who are single
Be YIMBY
Doing the maths for the US, that’s basically a cost of around (rounding for ease of math) $1 trillion for the child allowance and $0.5 trillion for the spousal and head-of-family tax allowance, so a total cost of $1.5 trillion.
If you add the new revenue you’d get from denying social security to the childless and doubling the tax on singles, that’ll get you around $1.2 trillion (roughly).
This means we have to spend around, on average, $300-400 billion net annually. Assuming a $150-250k PV net gain from each additional child, you’d need to get 2.5-4m extra births a year. For context, the US is currently at around 3.6m births a year, so that would have to roughly double.
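As a rough ledger, using the same rounded figures as above:

```python
# Back-of-envelope ledger for the US package above, in $ trillions per year.
child_allowance = 1.0      # ~$1k/month across roughly 70-75M children
spousal_and_head = 0.5     # stay-at-home spouse and head-of-family credits
gross_cost = child_allowance + spousal_and_head          # ~$1.5T

new_revenue = 1.2          # singles tax + withheld pensions for <2-kid households (rough)

net_cost = gross_cost - new_revenue
print(f"Gross: ${gross_cost:.1f}T, offsets: ${new_revenue:.1f}T, net: ~${net_cost * 1000:.0f}B a year")
```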
Not to mention, both these numbers will obviously move as people shift to a new equilibrium, with some choosing to have kids, which increases the spend and decreases the revenues.
You could move the numbers around and somehow make it work on a spreadsheet. You could focus only on marginal births (2nd, 3rd etc). Swap more money for universal pre-k, since that raises payroll and income tax. DC’s universal pre-k led to a 10% jump. Subsidize public IVF (Denmark saw a 14x ROI with this), and go very very deeply YIMBY to lower house prices.
If you did this, we could halve the spend and therefore the PV, while doubling the gains from extra births, meaning the ROI could at least be positive, maybe as much as 2x in the best case scenario.
These are very large, even if not insane numbers, though they sound like it. Social security in the US is around $1.5 trillion a year. Net interest on debt itself is $900 billion. Medicare and defense are also the same. What I found most instructive was to get a sense of proportion, a sense of scale as to what will be required if this were to become an economic necessity. And we can probably do it, which when we’re amidst a sea of people discussing Handmaid’s Tale policies or talking about the destruction of the human race, is good to know!
As the retiree share swells and prime-age savers shrink, the demand for short-duration assets rises just when governments must lengthen debt to cover swollen pension and health bills. Labour markets tighten, pushing wages and headline inflation up; term premia widen because retirees dump equities and long bonds while treasuries sell more of the latter to finance deficits. The net effect is persistent, mild inflation and a steeper yield curve, with risk-asset valuations pressured by slower growth and accelerating dissaving.
More workers didn't translate into more output because the effective labour input and its productivity both deteriorated. OECD annual hours worked are down a tenth since 1980, capital deepening flatlined after 2008, and total-factor productivity growth has halved relative to the 1990s.
So it’s about the fact that labour to raise kids is scarce, or expensive. Which should mean we see many dual income households become single income households when the single income is large enough? I don’t know if this is a widespread trend, but there at least anecdotally seems to be some notion of “enough” and beyond that you can optimise other variables. It’s not like we even need to do that much housework anymore!
Someone once asked me whether I always knew I wanted kids. To me the question didn’t make sense, it wasn’t a question I had ever considered. It wasn’t a spreadsheet question, to tally up the pros and cons of having kids - do I value the fifteen utilons I get from being able to hop off to Kenya when I wanted to against the ten I get from hugging my two year old when he asks me for one? Are these even commensurable?
People make the mistake of thinking of having kids as a utilitarian calculus. It’s not. It’s a stage of life. It is unfiltered joy; ask a parent, they’ll tell you. It’s not Stockholm syndrome, I remember the life before. It was fine. But while it had plenty of diversions and even more freedom, I used it so little. Your Instagram posts about going to the Maldives will not give you succour in a year or ten, but kids will. Sometimes you can’t know what you’re missing until you try it.
The day I had my first son I told my wife that my world had expanded. That expansion is not something I can plug into a Benthamite equation. Maybe a being smarter than me will be able to, but until then, if nothing else believe in the fact that we have evolved to have kids, to love them, be loved by them, and it is a joy at which one should leap joyously, not with trepidation at the fact that you do not have a perfect model of what life would be like afterwards.