2025-08-14 08:00:00
One of the hardest mental models to break is just how disposable AI-generated content is.
If you're asking an AI to generate one blog post, why not ask it to generate three, pick the best, use that as a prompt to generate three more, and repeat until you have a polished piece of content?
This is the core idea behind EvoBlog, an evolutionary AI content generation system that leverages multiple large language models (LLMs) to produce high-quality blog posts in a fraction of the time it would take using traditional methods.
The post below was generated using EvoBlog, with the system explaining how it works.
– Imagine a world where generating a polished, insightful blog post takes less time than brewing a cup of coffee. This isn’t science fiction. We’re building that future today with EvoBlog.
Our approach leverages an evolutionary, multi-model system for blog post generation, inspired by frameworks like EvoGit, which demonstrates how AI agents can collaborate autonomously through version control to evolve code. EvoBlog applies similar principles to content creation, treating blog post development as an evolutionary process with multiple AI agents competing to produce the best content.
The process begins by prompting multiple large language models (LLMs) in parallel. We currently use Claude Sonnet 4, GPT-4.1, and Gemini 2.5 Pro - the latest generation of frontier models. Each model receives the same core prompt but generates distinct variations of the blog post. This parallel approach offers several key benefits.
First, it drastically reduces generation time. Instead of waiting for a single model to iterate, we receive multiple drafts simultaneously. We’ve observed sub-3-minute generation times in our tests, compared to traditional sequential approaches that can take 15-20 minutes.
Second, parallel generation fosters diversity. Each LLM has its own strengths and biases. Claude Sonnet 4 excels at structured reasoning and technical analysis. GPT-4.1 brings exceptional coding capabilities and instruction following. Gemini 2.5 Pro offers advanced thinking and long-context understanding. This inherent variety leads to a broader range of perspectives and writing styles in the initial drafts.
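A minimal sketch of that fan-out step, assuming a generic generate() wrapper around each provider's API (the model identifiers and wrapper below are illustrative placeholders, not EvoBlog's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative model identifiers; the real system calls each vendor's API.
MODELS = ["claude-sonnet-4", "gpt-4.1", "gemini-2.5-pro"]

def generate(model: str, prompt: str) -> str:
    # Placeholder: swap in the provider-specific SDK call for each model.
    return f"[draft from {model}]"

def generate_drafts(prompt: str) -> list[str]:
    # Fan the same prompt out to every model in parallel and collect the drafts.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = [pool.submit(generate, m, prompt) for m in MODELS]
        return [f.result() for f in futures]
```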
Next comes the evaluation phase. We employ a unique approach here, using guidelines similar to those used by AP English teachers. This ensures the quality of the writing is held to a high standard, focusing on clarity, grammar, and argumentation. Our evaluation system scores posts on four dimensions: grammatical correctness (25%), argument strength (35%), style matching (25%), and cliché absence (15%).
The system automatically flags posts scoring B+ or better (87%+) as “ready to ship,” mimicking real editorial standards. This evaluation process draws inspiration from how human editors assess content quality, but operates at machine speed across all generated variations.
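As a rough sketch, the weighted rubric reduces to a few lines. The per-dimension scores would come from an LLM grader; the snippet below only shows the weighting and the 87% ship threshold described above.

```python
# Rubric weights from the evaluation phase: grammar 25%, argument 35%,
# style 25%, cliché absence 15%.
WEIGHTS = {"grammar": 0.25, "argument": 0.35, "style": 0.25, "cliche_absence": 0.15}

SHIP_THRESHOLD = 0.87  # B+ or better is flagged "ready to ship"

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into a weighted total."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

def ready_to_ship(dimension_scores: dict[str, float]) -> bool:
    return overall_score(dimension_scores) >= SHIP_THRESHOLD
```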
The highest-scoring draft then enters a refinement cycle. The chosen LLM further iterates on its output, incorporating feedback and addressing any weaknesses identified during evaluation. This iterative process is reminiscent of how startups themselves operate - rapid prototyping, feedback loops, and constant improvement are all key to success in both blog post generation and building a company.
A critical innovation is our data verification layer. Unlike traditional AI content generators that often hallucinate statistics, EvoBlog includes explicit instructions against fabricating data points. When models need supporting data, they indicate “[NEEDS DATA: description]” markers that trigger fact-checking workflows. This addresses one of the biggest reliability issues in AI-generated content.
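Surfacing those markers for a fact-checking pass can be as simple as a regular expression; a sketch (the exact marker handling in EvoBlog may differ):

```python
import re

# Matches markers like "[NEEDS DATA: median seed round size in 2024]".
NEEDS_DATA = re.compile(r"\[NEEDS DATA:\s*(.+?)\]")

def extract_data_requests(draft: str) -> list[str]:
    """Return the description of every data point the model declined to invent."""
    return NEEDS_DATA.findall(draft)
```

Each extracted description can then be routed to a human reviewer or a retrieval step before the post ships.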
This multi-model approach introduces interesting cost trade-offs. While leveraging multiple LLMs increases upfront costs (typically $0.10-0.15 per complete generation), the time savings and quality improvements lead to substantial long-term efficiency gains. Consider the opportunity cost of a founder spending hours writing a single blog post versus focusing on product development or fundraising.
The architecture draws from evolutionary computation principles, where multiple “mutations” (model variations) compete in a fitness landscape (evaluation scores), with successful adaptations (high-scoring posts) surviving to the next generation (refinement cycle). This mirrors natural selection but operates in content space rather than biological systems.
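Tying the pieces together, the loop reads as repeated rounds of generate, score, and select. This sketch reuses generate_drafts() and SHIP_THRESHOLD from the snippets above and stubs out the grader; it illustrates the idea rather than the production pipeline.

```python
def score_draft(draft: str) -> float:
    # Stand-in for the rubric grading step sketched earlier: an LLM grader
    # would produce per-dimension scores for overall_score() to combine.
    return 0.0

def evolve_post(prompt: str, max_generations: int = 3) -> str:
    best_score, best_draft = 0.0, ""
    for _ in range(max_generations):
        # Mutation: several models produce competing drafts from the same prompt.
        drafts = generate_drafts(prompt)
        # Fitness: score every draft against the rubric.
        top_score, top_draft = max((score_draft(d), d) for d in drafts)
        if top_score > best_score:
            best_score, best_draft = top_score, top_draft
        # Survival: stop once a draft clears the ship threshold...
        if best_score >= SHIP_THRESHOLD:
            break
        # ...otherwise the winner seeds the next generation of variations.
        prompt = f"Improve this draft, fixing its weakest sections:\n\n{best_draft}"
    return best_draft
```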
Our evolutionary, multi-model approach takes this concept further, optimizing for both speed and quality while maintaining reliability through systematic verification.
Looking forward, this evolutionary framework could extend beyond blog posts to other content types - marketing copy, technical documentation, research synthesis, or even code generation as demonstrated by EvoGit’s autonomous programming agents. The core principles of parallel generation, systematic evaluation, and iterative refinement apply broadly to any creative or analytical task.
2025-08-13 08:00:00
GPT-5 achieves 94.6% accuracy on AIME 2025, suggesting near-human mathematical reasoning.
Yet ask it to query your database, and success rates plummet to the teens.
The Spider 2.0 benchmarks reveal a yawning gap in AI capabilities. Spider 2.0 is a comprehensive text-to-SQL benchmark that tests AI models’ ability to generate accurate SQL queries from natural language questions across real-world databases.
While large language models have conquered knowledge work in mathematics, coding, and reasoning, text-to-SQL remains stubbornly difficult.
The three Spider 2.0 benchmarks test real-world database querying across different environments. Spider 2.0-Snow uses Snowflake databases with 547 test examples, peaking at 59.05% accuracy.
Spider 2.0-Lite spans BigQuery, Snowflake, and SQLite with another 547 examples, reaching only 37.84%. Spider 2.0-DBT tests code generation against DuckDB with 68 examples, topping out at 39.71%.
This performance gap isn’t for lack of trying. Since November 2024, 56 submissions from 12 model families have competed on these benchmarks.
Anthropic, OpenAI, DeepSeek, and others have all pushed their models against these tests. Progress has been steady, climbing from roughly 2% to about 60% over the last nine months.
The puzzle deepens when you consider SQL's constraints. SQL has a limited vocabulary compared to English, with its 600,000 words, or to general-purpose programming languages with their far broader syntax and libraries. Plus, there's plenty of SQL out there to train on.
If anything, this should be easier than the open-ended reasoning tasks where models now excel.
Yet even perfect SQL generation wouldn’t solve the real business challenge. Every company defines “revenue” differently.
Marketing measures customer acquisition cost by campaign spend, sales calculates it using account executive costs, and finance includes fully-loaded employee expenses. These semantic differences create confusion that technical accuracy can’t resolve.
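A toy example makes the point concrete. Both queries below are syntactically valid SQL against the same hypothetical tables, each implementing a different team's definition of customer acquisition cost, and each returning a different number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaign_spend (month TEXT, spend REAL);
    CREATE TABLE sales_costs    (month TEXT, ae_comp REAL, overhead REAL);
    CREATE TABLE new_customers  (month TEXT, n_new INTEGER);

    INSERT INTO campaign_spend VALUES ('2025-07', 120000);
    INSERT INTO sales_costs    VALUES ('2025-07', 200000, 80000);
    INSERT INTO new_customers  VALUES ('2025-07', 400);
""")

# Marketing's CAC: campaign spend per new customer.
marketing_cac = conn.execute("""
    SELECT c.spend / n.n_new
    FROM campaign_spend c JOIN new_customers n USING (month)
""").fetchone()[0]

# Finance's CAC: fully loaded costs (campaigns + AE comp + overhead) per new customer.
finance_cac = conn.execute("""
    SELECT (c.spend + s.ae_comp + s.overhead) / n.n_new
    FROM campaign_spend c
    JOIN sales_costs s   USING (month)
    JOIN new_customers n USING (month)
""").fetchone()[0]

print(marketing_cac, finance_cac)  # 300.0 vs. 1000.0 -- both queries are "correct"
```

Perfect SQL generation can't tell you which of these the CEO means by CAC.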
The Spider 2.0 results point to a fundamental truth about data work. Technical proficiency in SQL syntax is just the entry point.
The real challenge lies in business context: understanding what the data means, how different teams define metrics, and when edge cases matter. As I wrote about in Semantic Cultivators, the bridge between raw data and business meaning requires human judgment that current AI can't replicate.
2025-08-12 08:00:00
Perplexity AI just made a $34.5b unsolicited offer for Google’s Chrome browser, attempting to capitalize on the pending antitrust ruling that could force Google to divest its browser business.
Comparing Chrome’s economics to Google’s existing Safari deal reveals why $34.5b undervalues the browser.
Google pays Apple $18-20b annually to remain Safari’s default search engine¹, serving approximately 850m users². This translates to $21 per user per year.
The Perplexity offer values Chrome at $34.5b, or roughly $10 per user per year for its 3.5b users³.
If Chrome users commanded the same terms as the Google/Apple Safari deal, the browser’s annual revenue potential would exceed $73b.
Browser | Users (m) | Annual Revenue ($b) | Revenue per User ($) | Market Cap 5x ($b) | Market Cap 6x ($b) |
---|---|---|---|---|---|
Safari | 850 | 18 | 21 | 90 | 108 |
Chrome (Perplexity Offer) | 3,500 | 34.5 | 10 | 172 | 207 |
Chrome (Safari Parity) | 3,500 | 73 | 21 | 367 | 441 |
Chrome (Premium Scenario) | 3,500 | 105 | 30 | 525 | 630 |
This data is based on public estimates and should be treated as an approximation.
This assumes that Google would pay a new owner of Chrome a similarly scaled fee for default search placement. Given a 5x to 6x market cap-to-revenue multiple, Chrome is worth somewhere between $172b and $630b, a far cry from the $34.5b offer.
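The arithmetic behind the table is simple enough to check; a quick sketch using the per-user figures above:

```python
users_b = 3.5  # Chrome users, in billions

# Revenue per user per year under each scenario from the table above.
scenarios = {"Safari parity": 21, "Premium": 30}

for name, rev_per_user in scenarios.items():
    revenue_b = users_b * rev_per_user        # implied annual revenue, $b
    low, high = 5 * revenue_b, 6 * revenue_b  # 5x-6x revenue multiple
    print(f"{name}: ~${revenue_b:.0f}b revenue -> ${low:.0f}b-${high:.0f}b valuation")
```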
Chrome dominates the market with 65% share⁴, compared to Safari's 18%. A divestment would throw the search ads market into upheaval. The value of keeping those advertiser budgets is hard to overstate for Google's market cap & its position in the ads ecosystem.
If forced to sell Chrome, Google would face an existential choice. Pay whatever it takes to remain the default search engine, or watch competitors turn its most valuable distribution channel into a cudgel against it.
How much is that worth? A significant premium to a simple revenue multiple.
¹ Bloomberg: Google’s Payments to Apple Reached $20 Billion in 2022
² ZipDo: Essential Apple Safari Statistics In 2024
2025-08-11 08:00:00
In 1999, the dotcoms were valued on traffic. IPO metrics revolved around eyeballs.
Then Google launched AdWords, an ad model predicated on clicks, & built a $273b business in 2024.
But that might all be about to change: Pew Research's July 2025 study reveals users click just 8% of search results with AI summaries, versus 15% without - a 47% reduction. Only 1% click through from within AI summaries.
Cloudflare data shows AI platforms crawl content far more than they refer traffic back: Anthropic crawls 32,400 pages for every 1 referral, while traditional search engines scan content just a couple of times per visitor sent.
The expense of serving content to the AI crawlers may not be huge if it’s mostly text.
The bigger point is AI systems disintermediate the user & publisher relationship. Users prefer aggregated AI answers over clicking through websites to find their answers.
It’s logical that most websites should expect less traffic. How will your website & your business handle it?
2025-08-08 08:00:00
GPT-5 launched yesterday. 94.6% on AIME 2025. 74.9% on SWE-bench.
As we approach the upper bounds of these benchmarks, they die.
What makes GPT-5 and the next generation of models revolutionary isn’t their knowledge. It’s knowing how to act. For GPT-5 this happens at two levels. First, deciding which model to use. But second, and more importantly, through tool calling.
We’ve been living in an era where LLMs mastered knowledge retrieval & reassembly. Consumer search & coding, the initial killer applications, are fundamentally knowledge retrieval challenges. Both organize existing information in new ways.
We have climbed those hills and as a result competition is more intense than ever. Anthropic, OpenAI, and Google’s models are converging on similar capabilities. Chinese models and open source alternatives are continuing to push ever closer to state-of-the-art. Everyone can retrieve information. Everyone can generate text.
The new axis of competition? Tool-calling.
Tool-calling transforms LLMs from advisors to actors. It compensates for two critical model weaknesses that pure language models can’t overcome.
First, workflow orchestration. Models excel at single-shot responses but struggle with multi-step, stateful processes. Tools enable them to manage long workflows, tracking progress, handling errors, maintaining context across dozens of operations.
Second, system integration. LLMs live in a text-only world. Tools let them interface predictably with external systems like databases, APIs, and enterprise software, turning natural language into executable actions.
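Concretely, a tool is just a described function the model can elect to call. A representative definition in the JSON-schema style most providers accept (field names vary slightly by vendor, and this CRM lookup is a made-up example):

```python
# The model never touches the CRM directly: it emits a call to this tool,
# the surrounding code executes it, and the result is fed back as text.
query_crm_tool = {
    "name": "query_crm",
    "description": "Look up a company in the CRM by name and return its record, "
                   "or an empty result if the company is not present.",
    "parameters": {
        "type": "object",
        "properties": {
            "company_name": {
                "type": "string",
                "description": "Exact or fuzzy company name to search for",
            },
        },
        "required": ["company_name"],
    },
}
```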
In the last month I’ve built 58 different AI tools.
Email processors. CRM integrators. Notion updaters. Research assistants. Each tool extends the model’s capabilities into a new domain.
The most important capability for AI is selecting the right tool quickly and correctly. Every misrouted step kills the entire workflow.
When I say “read this email from Y Combinator & find all the startups that are not in the CRM,” modern LLMs execute a complex sequence.
One command in English replaces an entire workflow. And this is just a simple one.
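Under the hood, that one sentence expands into a chain of tool calls the model selects and sequences itself. A hypothetical trace, with the tool names invented for illustration, might look like this:

```python
# Hypothetical trace for: "read this email from Y Combinator & find all the
# startups that are not in the CRM". Tool names and arguments are illustrative.
workflow = [
    ("search_email",     {"query": "from:Y Combinator", "limit": 1}),
    ("extract_entities", {"text": "<email body>", "entity_type": "startup"}),
    ("query_crm",        {"company_name": "<each extracted startup>"}),
    ("summarize",        {"items": "<startups with no CRM match>"}),
]

for tool_name, arguments in workflow:
    print(f"call {tool_name}({arguments})")
```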
Even better, a model set up with the right tools can verify its own work, confirming that tasks were actually completed. This self-verification loop creates reliability in workflows that is hard to achieve otherwise.
Multiply this across hundreds of employees. Thousands of workflows. The productivity gains compound exponentially.
The winners in the future AI world will be the ones who are most sophisticated at orchestrating tools and routing the right queries. Every time. Once those workflows are predictable, that’s when we will all become agent managers.
2025-08-04 08:00:00
2025 is the year of agents, & the key capability of agents is calling tools.
When using Claude Code, I can tell the AI to sift through a newsletter, find all the links to startups, & verify they exist in our CRM, all with a single command. This might involve two or three different tool calls.
But here’s the problem: using a large foundation model for this is expensive, often rate-limited, & overpowered for a selection task.
What is the best way to build an agentic system with tool calling?
The answer lies in small action models. NVIDIA released a compelling paper arguing that “Small language models (SLMs) are sufficiently powerful, inherently more suitable, & necessarily more economical for many invocations in agentic systems.”
I’ve been testing different local models to validate a cost reduction exercise. I started with Qwen3:30b, a 30-billion-parameter mixture-of-experts model. It works, but it can be quite slow because it’s such a big model, even though only 3 billion of those 30 billion parameters are active at any one time.
The NVIDIA paper recommends the Salesforce xLAM model – a different architecture called a large action model specifically designed for tool selection.
So, I ran a test of my own, each model calling a tool to list my Asana tasks.
Model | Success Rate | Avg Response Time | Avg Tool Time | Avg Total Time |
---|---|---|---|---|
xLAM | 100% (25/25) | 1.48s | 1.14s | 2.61s ± 0.47s |
Qwen | 92% (23/25) | 8.75s | 1.07s | 9.82s ± 1.53s |
The results were striking: xLAM completed tasks in 2.61 seconds with 100% success, while Qwen took 9.82 seconds with 92% success – nearly four times as long.
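For context, the harness behind numbers like these can be small. A simplified sketch, with the model invocation and the Asana tool call stubbed out as parameters (names are illustrative, not the exact script I ran):

```python
import statistics
import time

def run_trial(call_model, call_tool, prompt: str) -> tuple[bool, float, float]:
    """Time one trial: the model picks a tool call, then the tool executes."""
    start = time.perf_counter()
    tool_request = call_model(prompt)   # e.g. xLAM or Qwen3 behind a local server
    model_time = time.perf_counter() - start

    start = time.perf_counter()
    ok = call_tool(tool_request)        # e.g. the Asana task-listing tool
    tool_time = time.perf_counter() - start
    return ok, model_time, tool_time

def benchmark(call_model, call_tool, prompt: str, trials: int = 25):
    results = [run_trial(call_model, call_tool, prompt) for _ in range(trials)]
    success_rate = sum(ok for ok, _, _ in results) / trials
    totals = [m + t for _, m, t in results]
    return success_rate, statistics.mean(totals), statistics.stdev(totals)
```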
This experiment shows the speed gain, but there’s a trade-off: how much intelligence should live in the model versus in the tools themselves.
With larger models like Qwen, tools can be simpler because the model has better error tolerance & can work around poorly designed interfaces. The model compensates for tool limitations through brute-force reasoning.
With smaller models, the model has less capacity to recover from mistakes, so the tools must be more robust & the selection logic more precise. This might seem like a limitation, but it’s actually a feature.
This constraint tames the compounding error rate of chained LLM tool calls. When models string together many sequential tool calls, errors compound with every step.
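The arithmetic is unforgiving: at a 95% per-step success rate, a ten-step chain completes end-to-end only about 60% of the time.

```python
# End-to-end success of a chained workflow decays geometrically with its length.
per_step_success = 0.95
for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps -> {per_step_success ** steps:.0%} end-to-end success")
```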
Small action models force better system design, keeping the best of LLMs and combining it with specialized models.
This architecture is more efficient, faster, & more predictable.