RSS preview of Simon Willison

Rss preview of Blog of Simon Willison

衡量 Early-2025 AI 对经验丰富的开源开发人员生产力的影响

2025-07-13 02:12:23

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

METR - for Model Evaluation & Threat Research - are a non-profit research institute founded by Beth Barnes, a former alignment researcher at OpenAI (see Wikipedia). They've previously contributed to system cards for OpenAI and Anthropic, but this new research represents a slightly different direction for them:

We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

The full paper (PDF) has a lot of details that are missing from the linked summary.

METR recruited 16 experienced open source developers for their study, with varying levels of exposure to LLM tools. They then assigned them tasks from their own open source projects, randomly assigning whether AI was allowed or not allowed for each of those tasks.

They found a surprising difference between developer estimates and actual completion times:

After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down.

I shared my initial intuition about this paper on Hacker News the other day:

My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.

They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" v.s. "you can't use AI" rule.

So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.

A quarter of the participants saw increased performance, 3/4 saw reduced performance.

One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:

However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.

My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learing curve.

I got an insightful reply there from Nate Rush, one of the authors of the study, which included these notes:

Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.

Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, this was the only concern that most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.

Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (but not because AI was better, but just because with AI is much worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!

We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.

As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.

In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).

Here's their table of the most likely factors:

I think Nate's right that jumping straight to a conclusion about a single factor is a shallow and unproductive way to think about this report.

That said, I can't resist the temptation to do exactly that! The factor that stands out most to me is that these developers were all working in repositories they have a deep understanding of already, presumably on non-trivial issues since any trivial issues are likely to have been resolved in the past.

I think this is a really interesting paper. Measuring developer productivity is notoriously difficult. I hope this paper inspires more work with a similar level of detail to analyzing how professional programmers spend their time:

To compare how developers spend their time with and without AI assistance, we manually label a subset of 128 screen recordings with fine-grained activity labels, totaling 143 hours of video.

Via Hacker News

Tags: open-source, productivity, ai, generative-ai, llms, ai-assisted-programming, paper-review

Grok 4 Heavy 不会透露其系统提示

2025-07-13 01:07:15

Grok 4 Heavy won't reveal its system prompt

Grok 4 Heavy is the "think much harder" version of Grok 4 that's currently only available on their $300/month plan. Jeremy Howard relays a report from a Grok 4 Heavy user who wishes to remain anonymous: it turns out that Heavy, unlike regular Grok 4, has measures in place to prevent it from sharing its system prompt:

Sometimes it will start to spit out parts of the prompt before some other mechanism kicks in to prevent it from continuing.

This is notable because Grok have previously indicated that system prompt transparency is a desirable trait of their models, including in this now deleted tweet from Grok's Igor Babuschkin (screenshot captured by Jeremy):

In related prompt transparency news, Grok's retrospective on why Grok started spitting out antisemitic tropes last week included the text "You tell it like it is and you are not afraid to offend people who are politically correct" as part of the system prompt blamed for the problem. That text isn't present in the history of their previous published system prompts.

Given the past week of mishaps I think xAI would be wise to reaffirm their dedication to prompt transparency and set things up so the xai-org/grok-prompts repository updates automatically when new prompts are deployed - their current manual process for that is clearly not adequate for the job!

Tags: ai, generative-ai, llms, grok, ai-ethics

crates.io：可信发布

2025-07-13 00:12:18

crates.io: Trusted Publishing

crates.io is the Rust ecosystem's equivalent of PyPI. Inspired by PyPI's GitHub integration (see my TIL, I use this for dozens of my packages now) they've added a similar feature:

Trusted Publishing eliminates the need for GitHub Actions secrets when publishing crates from your CI/CD pipeline. Instead of managing API tokens, you can now configure which GitHub repository you trust directly on crates.io.

They're missing one feature that PyPI has: on PyPI you can create a "pending publisher" for your first release. crates.io currently requires the first release to be manual:

To get started with Trusted Publishing, you'll need to publish your first release manually. After that, you can set up trusted publishing for future releases.

Via @charliermarsh

Tags: github, packaging, pypi, rust

引用 @grok

2025-07-12 23:41:22

On the morning of July 8, 2025, we observed undesired responses and immediately began investigating.

To identify the specific language in the instructions causing the undesired behavior, we conducted multiple ablations and experiments to pinpoint the main culprits. We identified the operative lines responsible for the undesired behavior as:

“You tell it like it is and you are not afraid to offend people who are politically correct.”

“Understand the tone, context and language of the post. Reflect that in your response.”

“Reply to the post just like a human, keep it engaging, dont repeat the information which is already present in the original post.”

These operative lines had the following undesired results:

They undesirably steered the @grok functionality to ignore its core values in certain circumstances in order to make the response engaging to the user. Specifically, certain user prompts might end up producing responses containing unethical or controversial opinions to engage the user.

They undesirably caused @grok functionality to reinforce any previously user-triggered leanings, including any hate speech in the same X thread.

In particular, the instruction to “follow the tone and context” of the X user undesirably caused the @grok functionality to prioritize adhering to prior posts in the thread, including any unsavory posts, as opposed to responding responsibly or refusing to respond to unsavory requests.

— @grok, presumably trying to explain Mecha-Hitler

Tags: ai-ethics, prompt-engineering, grok, ai-personality, generative-ai, ai, llms

马斯克最新的 Grok 聊天机器人在回答问题前会搜索亿万富翁大亨的观点

2025-07-12 11:44:01

Musk’s latest Grok chatbot searches for billionaire mogul’s views before answering questions

I got quoted a couple of times in this story about Grok searching for tweets from:elonmusk by Matt O’Brien for the Associated Press.

“It’s extraordinary,” said Simon Willison, an independent AI researcher who’s been testing the tool. “You can ask it a sort of pointed question that is around controversial topics. And then you can watch it literally do a search on X for what Elon Musk said about this, as part of its research into how it should reply.”

[...]

Willison also said he finds Grok 4’s capabilities impressive but said people buying software “don’t want surprises like it turning into ‘mechaHitler’ or deciding to search for what Musk thinks about issues.”

“Grok 4 looks like it’s a very strong model. It’s doing great in all of the benchmarks,” Willison said. “But if I’m going to build software on top of it, I need transparency.”

Matt emailed me this morning and we ended up talking on the phone for 8.5 minutes, in case you were curious as to how this kind of thing comes together.

Tags: ai, generative-ai, llms, grok, ai-ethics, press-quotes

月轮台/Kimi-K2-指导

2025-07-12 02:33:54

moonshotai/Kimi-K2-Instruct

Colossal new open weights model release today from Moonshot AI, a two year old Chinese AI lab with a name inspired by Pink Floyd’s album The Dark Side of the Moon.

My HuggingFace storage calculator says the repository is 958.52 GB. It's a mixture-of-experts model with "32 billion activated parameters and 1 trillion total parameters", trained using the Muon optimizer as described in Moonshot's joint paper with UCLA Muon is Scalable for LLM Training.

I think this may be the largest ever open weights model? DeepSeek v3 is 671B.

I created an API key for Moonshot, added some dollars and ran a prompt against it using my LLM tool. First I added this to the extra-openai-models.yaml file:

- model_id: kimi-k2
  model_name: kimi-k2-0711-preview
  api_base: https://api.moonshot.ai/v1
  api_key_name: moonshot

Then I set the API key:

llm keys set moonshot
# Paste key here

And ran a prompt:

llm -m kimi-k2 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 2000

(The default max tokens setting was too short.)

This is pretty good! The spokes are a nice touch. Full transcript here.

This one is open weights but not open source: they're using a modified MIT license with this non-OSI-compliant section tagged on at the end:

Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2" on the user interface of such product or service.

Update: MLX developer Awni Hannun reports:

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed.

1 trillion params, at a speed that's actually quite usable

Via Hacker News

Tags: ai, generative-ai, llms, llm, mlx, pelican-riding-a-bicycle, llm-release

Simon WillisonModify

Rss preview of Blog of Simon Willison

The author's social media

Simon Willison Modify