
Can you take an ox to Oxford?

2025-11-21 08:02:28

At the end of last month, Oxford introduced a £5 congestion charge for passenger cars driving through the city centre. Some vehicles are exempt, like taxis, emergency vehicles, and delivery vans – but what about animals?

i need one of the city councillors following me to advise whether an oxen-driven cart would be chargeable under the congestion charge

madeline odent @oldenoughtosay.com

i mean they are technically zero emissions i guess so you could probably make an argument for driving one down the cornmarket

I’m not a city councillor or a lawyer, but I tried to work out the answer anyway.

The first thing I found was a “Charging order” on the Oxfordshire County Council website, which looks like the legal definition of the congestion charge. Paragraph 3 says “A relevant vehicle is a Class M1 vehicle that is not [non-chargeable or permitted]” – where “non-chargeable” and “permitted” describe the various exemptions.

The order defines Class M1 vehicles as “those falling within class M1(a) and class M1(b) as specified in Schedule 1 of the Vehicle Classes Regulations”, which refers to another bit of UK legislation.

That legislation is the Road User Charging and Workplace Parking Levy (Classes of Motor Vehicles) (England) Regulations 2001 (really rolls off the tongue), the full text of which is available online. Here’s the section we’re interested in:

Category M: Motor vehicles with at least four wheels used for the carriage of passengers

In this Part, references to a “motor vehicle” are to a motor vehicle with or without a semi-trailer which—

(a) has at least four wheels;
(b) has an unladen mass exceeding 400kg or an engine with net power exceeding 15kW;
(c) is used for the carriage of passengers; and
(d) is not a motor caravan within Class L(a) or Class L(b) specified in Part III above.

Class M1(a)
A motor vehicle which comprises no more than eight seats in addition to the driver’s seat.

Class M1(b)
A motor vehicle as defined in Class M1(a) which is drawing a trailer.

At first glance, this definition could almost include an ox-drawn carriage: four wheels, hefty enough to exceed 400kg, carrying passengers, and clearly not a motor caravan.

But I don’t think an ox counts as a “motor vehicle”, which is defined by the Road Traffic Act 1988 as “a mechanically propelled vehicle intended or adapted for use on roads”. If your ox is only mechanically propelled in the sense of kicking and walking, it’s off the hook.

Surprisingly, UK legislation doesn’t define “mechanically propelled”. Lawyers usually define everything, even words that seem obvious. Has anybody ever tried arguing that “mechanically propelled” is ambiguous?

I did find a few mentions of one case, Chief Constable of Avon & Somerset v Fleming [1987], where somebody argued that modifying their motorbike for off-roading meant it was no longer a motor vehicle. The judge was unconvinced. (What is it with Flemings and wacky vehicles?)

Returning to the Oxford congestion charge, Category M is interesting because it excludes several other vehicles you might think of as passenger cars.

As Alisdair notes, a three-wheeled Reliant Robin doesn’t count, nor does a small, underpowered Peel P50. Based on a light reading of the regulations, I believe the Robin is a Class C(a) “motor tricycle”, while a P50 is a Class D(a) “light quadricycle”. I look forward to an owners’ club holding their annual meeting in Oxford city centre.

Practically speaking, Oxford’s congestion charge is almost certainly enforced by cameras that scan your number plate. An ox-drawn cart doesn’t have a number plate, so it won’t be charged. Other vehicles like a Renault Twizy or Reliant Robin do have number plates, so they’ll be charged even though they’re technically exempt.

To wrap up: an oxen-driven cart isn’t mechanically propelled, it’s not a motor vehicle, and the Oxford congestion charge doesn’t apply. Your ox can ride in and out of the city as much as you like.

But like all the best ideas, somebody in Cambridge thought of it first.


As a closing thought, I think it’s good that so much of the UK’s legislation is publicly available. I could find all of these documents from my phone, without going to a courtroom or a library or a government building. I don’t think about it often, but this kind of openness isn’t a given. Being able to check the law directly, even for extremely silly questions, helps keep the whole system honest.


Cleaning up messy dates in JSON

2025-11-17 16:52:48

I’ve been cleaning up some messy data, and it includes timestamps written by a variety of humans and machines, which don’t use a consistent format.

Here are a few examples:

2025-10-14T05:34:07+0000
2015-03-01 23:34:39 +00:00
2013-9-21 13:43:00Z
2024-08-30

All of these timestamps are machine-readable, but it would be easier for the downstream code if there weren’t as many different formats. For example, the downstream code uses the JavaScript Date() constructor, which rejects some of the timestamps as invalid.

I wrote a Python script to help me find, validate, and normalise all my timestamps, so the rest of my code can deal with a more consistent set.


Finding all the data strings

All the messy data is in JSON, and the structure is quite inconsistent – a lot of heavily nested objects, differently-named fields, varying models and schemas. This project is about tidying it up.

One saving grace is that the timestamps are named fairly consistently – they’re all stored inside JSON objects, with keys that start with date_, and values which are strings. Here’s an example:

{
  "doc1": {"id": "1", "date_created": "2025-10-14T05:34:07+0000"},
  "shapes": [
    {"color": "blue", "date_saved": "2015-03-01 23:34:39 +00:00"},
    {"color": "yellow", "date_saved": "2013-9-21 13:43:00Z", "is_square": true},
    {"color": "green", "date_saved": null}
  ],
  "date_verified": "2024-08-30"
}

The first thing I want to do is find all the key-value pairs that combine a date_ key with a string value.

I wrote a Python function to recursively walk the JSON and pull out matching pairs. I’m sure there are libraries for this, but JSON is simple enough that I can write it by hand. It only has a few types, and even fewer that matter here:

  • If it’s a JSON object: inspect its keys, then recurse into each value
  • If it’s a JSON array: recurse into each element
  • If it’s a string, number, bool, or null: ignore it

Here’s my code:

from collections.abc import Iterator
from typing import Any


def find_all_dates(json_value: Any) -> Iterator[tuple[dict[str, Any], str, str]]:
    """
    Find all the timestamps in a heavily nested JSON object.

    This function looks for any JSON objects with a key-value pair
    where the key starts with `date_` and the value is a string, and
    emits a 3-tuple:

    *   the JSON object
    *   the key
    *   the value

    """
    # Case 1: JSON objects
    if isinstance(json_value, dict):
        for key, value in json_value.items():
            if (
                isinstance(key, str)
                and key.startswith("date_")
                and isinstance(value, str)
            ):
                yield json_value, key, value
            else:
                yield from find_all_dates(value)

    # Case 2: JSON arrays
    elif isinstance(json_value, list):
        for value in json_value:
            yield from find_all_dates(value)

    # Case 3: other JSON types
    elif isinstance(json_value, (str, int, float, bool)) or json_value is None:
        return

    # Case 4: handle unexpected types
    else:
        raise TypeError(f"Unexpected type: {type(json_value)}")

There are branches for all the builtin JSON types, then a catch-all branch for anything else.

I added a catch-all TypeError branch to catch list- or dict-like inputs that aren’t actually JSON types – things like dict.values() or custom container classes. Without this check, they’d fall through to the “ignore” case and quietly drop nested data. An explicit error makes the failure obvious, and the fix is easy: wrap the input in list() or dict().
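Here’s a quick demonstration in the REPL, using the function above:

>>> data = {"doc1": {"date_created": "2024-08-30"}}
>>> list(find_all_dates(data.values()))
TypeError: Unexpected type: <class 'dict_values'>
>>> list(find_all_dates(list(data.values())))
[({'date_created': '2024-08-30'}, 'date_created', '2024-08-30')]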

For each timestamp it finds, the function emits a tuple with the nested object, the key and the value. For the example above, here’s the first tuple it returns:

({"id": "1", "date_created": "2025-10-14T05:34:07+0000"},
 "date_created",
 "2025-10-14T05:34:07+0000")

This return type allows me to both read and modify the JSON with the same function:

# Reading the timestamps
for _, _, date_string in find_all_dates(json_value):
    print(date_string)

# Modifying the timestamps
for json_obj, key, date_string in find_all_dates(json_value):
    json_obj[key] = run_fixup(date_string)

The latter works because json_obj points to the actual dictionary from the nested JSON, not a copy, so when we assign json_obj[key] = …, we modify the original JSON structure in-place.
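Here’s a tiny demonstration of that aliasing, with a hard-coded replacement standing in for run_fixup:

>>> data = {"date_verified": "2024-08-30"}
>>> for json_obj, key, _ in find_all_dates(data):
...     json_obj[key] = "2024-08-30T00:00:00+00:00"
...
>>> data
{'date_verified': '2024-08-30T00:00:00+00:00'}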

Now that we can find all the timestamps, we need to check whether they use a consistent datetime format.

Checking if a date string matches a given format

I normally parse timestamps in Python with the datetime.strptime function. This is quite a strict function, because you have to pass it a format string that describes exactly how you expect the timestamp to be formatted.

If you want a more flexible approach, you can use the python-dateutil module, which has a generic parser that guesses how to read the timestamp, rather than asking you to specify.

I prefer strict parsing, because I (usually!) know exactly how timestamps will be formatted, and inconsistent formats can hide bugs. There’s no room for ambiguity, and no risk of a timestamp being guessed incorrectly.

The strptime function takes two arguments: the string you want to parse, and the format string. Here’s an example:

>>> from datetime import datetime
>>> datetime.strptime("2001-02-03T04:05:06+00:00", "%Y-%m-%dT%H:%M:%S%z")
datetime.datetime(2001, 2, 3, 4, 5, 6, tzinfo=datetime.timezone.utc)

If you pass a timestamp that doesn’t match the format string, it throws a ValueError:

>>> datetime.strptime("2001-02-03", "%Y-%m-%dT%H:%M:%S%z")
ValueError: time data '2001-02-03' does not match format '%Y-%m-%dT%H:%M:%S%z'

It also checks that the whole string is parsed, and throws a ValueError if it’s an incomplete match:

>>> datetime.strptime("2001-02-03T04:05:06+00:00", "%Y-%m-%d")
ValueError: unconverted data remains: T04:05:06+00:00

This allows us to write a function that checks if a timestamp matches a given format:

from datetime import datetime


def date_matches_format(date_string: str, format: str) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    using `format`, False otherwise.
    """
    try:
        datetime.strptime(date_string, format)
        return True
    except ValueError:
        return False

The format can be any format code supported by strptime(). The Python docs have a list of all the accepted format codes. That list includes a couple of non-standard format codes which only have partial support, which is part of why I want to clean up these date strings.

If we want to allow multiple formats, we can wrap this function using any():

def date_matches_any_format(date_string: str, formats: tuple[str, ...]) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    with any of the `formats`, False otherwise.
    """
    return any(
        date_matches_format(date_string, fmt)
        for fmt in formats
    )

Here’s how we can use this function to find any timestamps that don’t match our allowed formats:

allowed_formats = (
    # 2001-02-03T04:05:06+07:00
    "%Y-%m-%dT%H:%M:%S%z",
    #
    # 2001-02-03
    "%Y-%m-%d",
)

for _, _, date_string in find_all_dates(json_value):
    if not date_matches_any_format(date_string, allowed_formats):
        print(date_string)

Testing that all of my timestamps use consistent formats

With these two functions in hand, I wrote a test I can run with pytest to tell me about any timestamps that don’t match my allowed formats:

def test_all_timestamps_are_consistent():
    """
    All the timestamps in my JSON use a consistent format.

    See https://alexwlchan.net/2025/messy-dates-in-json/
    """
    allowed_formats = (
        # 2001-02-03T04:05:06+07:00
        "%Y-%m-%dT%H:%M:%S%z",
        #
        # 2001-02-03
        "%Y-%m-%d",
    )

    bad_date_strings = {
        date_string
        for _, _, date_string in find_all_dates(JSON_DATA)
        if not date_matches_any_format(date_string, allowed_formats)
    }

    assert bad_date_strings == set()

If you only allow a single format, you could simplify this slightly by using date_matches_format.

If the test passes, all your timestamps match the allowed formats. If the test fails, pytest prints the ones that don’t. Running it on our original example shows two disallowed timestamps:

AssertionError: assert {'2013-9-21 1...34:39 +00:00'} == set()

  Extra items in the left set:
  '2015-03-01 23:34:39 +00:00'
  '2013-9-21 13:43:00Z'

As with the test I wrote in my last post, I like to report on all the failing values, not just the first one. This allows me to see the scale of the problem, and see patterns in the failing output – if I see a bad timestamp, is it a one-off issue I should fix by hand, or does it affect thousands of values that need an automatic cleanup?

When I first ran this test, it failed with thousands of errors. I cleaned up the data and re-ran the test until it passed, and now I can keep running it to ensure no unexpected values sneak back in.

Changing the format of date strings in bulk

This test revealed thousands of errors, and I didn’t want to fix them all by hand. That would be slow, tedious, and prone to manual errors. But among the timestamps I wanted to change, there were pockets of consistency – each tool that contributed to this data would write timestamps in a single format, and I could convert all the timestamps from that tool in one go.

I wrote one-off fixer scripts to perform these conversions. Each script would read the JSON file, look for date strings in a given format, convert them to my preferred format, then write the result back to the JSON file. Here’s one example:

import json
from datetime import datetime


# e.g. 2001-02-03 04:05:06 +07:00
old_format = "%Y-%m-%d %H:%M:%S %z"

# e.g. 2001-02-03T04:05:06+07:00 (datetime.isoformat())
new_format = "%Y-%m-%dT%H:%M:%S%z"


with open("my_data.json") as in_file:
    json_data = json.load(in_file)

for json_obj, key, date_string in find_all_dates(json_data):
    if date_matches_format(date_string, old_format):
        d = datetime.strptime(date_string, old_format)
        json_obj[key] = d.strftime(new_format)

with open("my_data.json", "w") as out_file:
    out_file.write(json.dumps(json_data, indent=2))

I keep my JSON data files in Git, and I committed every time I ran a successful script. That made it easy to see the changes from each fix-up, and to revert them if I made a mistake.

The nice thing about this approach is that each script is quite small and simple, because it’s only trying to fix one thing at a time. But the scripts add up, and this salami slicing eventually produces dramatically cleaner data. I didn’t fix everything this way (some typos were quicker to fix by hand), but scripting fixed the majority of issues.

Putting it all together

Here’s what we’ve done in this post:

  • Written a recursive function to find all the timestamps in a JSON value
  • Chosen the timestamp formats we allow, and added helpers to check them
  • Added a test to find and prevent unexpected formats
  • Written one-off migration scripts to clean up old timestamps

Here’s the final test which ties this all together. I’ve saved it as test_date_formats.py:

from collections.abc import Iterator
from datetime import datetime
from typing import Any


def find_all_dates(json_value: Any) -> Iterator[tuple[dict[str, Any], str, str]]:
    """
    Find all the timestamps in a heavily nested JSON object.

    This function looks for any JSON objects with a key-value pair
    where the key starts with `date_` and the value is a string, and
    emits a 3-tuple:

    *   the JSON object
    *   the key
    *   the value

    """
    # Case 1: JSON objects
    if isinstance(json_value, dict):
        for key, value in json_value.items():
            if (
                isinstance(key, str)
                and key.startswith("date_")
                and isinstance(value, str)
            ):
                yield json_value, key, value
            else:
                yield from find_all_dates(value)

    # Case 2: JSON arrays
    elif isinstance(json_value, list):
        for value in json_value:
            yield from find_all_dates(value)

    # Case 3: other JSON types
    elif isinstance(json_value, (str, int, float, bool)) or json_value is None:
        return

    # Case 4: handle unexpected types
    else:
        raise TypeError(f"Unexpected type: {type(json_value)}")


def date_matches_format(date_string: str, format: str) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    using `format`, False otherwise.
    """
    try:
        datetime.strptime(date_string, format)
        return True
    except ValueError:
        return False


def date_matches_any_format(date_string: str, formats: tuple[str, ...]) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    with any of the `formats`, False otherwise.
    """
    return any(
        date_matches_format(date_string, fmt)
        for fmt in formats
    )


def test_all_timestamps_are_consistent():
    """
    All the timestamps in my JSON use a consistent format.

    See https://alexwlchan.net/2025/messy-dates-in-json/
    """
    # TODO: Write code for reading JSON_DATA

    allowed_formats = (
        # 2001-02-03T04:05:06+07:00
        "%Y-%m-%dT%H:%M:%S%z",
        #
        # 2001-02-03
        "%Y-%m-%d",
    )

    bad_date_strings = {
        date_string
        for _, _, date_string in find_all_dates(JSON_DATA)
        if not date_matches_any_format(date_string, allowed_formats)
    }

    assert bad_date_strings == set()

If you want to use this test, you’ll need to modify it to read your JSON data, and to specify the list of timestamp formats you accept.

Although I’ve since cleaned up the data so it has a more consistent structure, this test has remained as-is. I could write a more specific test that knows about my JSON schema and looks for timestamps in specific fields, but I quite like having a generic test that doesn’t need to change when I change my data model. The flexibility also means it’s easy to copy across projects, and I’ve reused this test almost unmodified in several projects already.


Detecting AV1-encoded videos with Python

2025-11-09 07:13:56

In my previous post, I wrote about how I’ve saved some AV1-encoded videos that I can’t play on my iPhone. Eventually, I’ll upgrade to a new iPhone which supports AV1, but in the meantime, I want to convert all of those videos to an older codec. The problem is finding all the affected videos – I don’t want to wait until I want to watch a video before discovering it won’t play.

I already use pytest to run some checks on my media library: are all the files in the right place, is the metadata in the correct format, do I have any misspelt tags, and so on. I wanted to write a new test that would check for AV1-encoded videos, so I could find and convert them in bulk.

In this post, I’ll show you two ways to check if a video is encoded using AV1, and a test I wrote to find any such videos inside a given folder.


Getting the video codec with ffprobe

In my last post, I wrote an ffprobe command that prints some information about a video, including the codec. (ffprobe is a companion tool to the popular video converter FFmpeg.)

$ ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name,profile,level,bits_per_raw_sample \
    -of default=noprint_wrappers=1 "input.mp4"
codec_name=av1
profile=Main
level=8
bits_per_raw_sample=N/A

I can tweak this command to print just the codec name:

$ ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name \
    -of csv=print_section=0 "input.mp4"
av1

To run this command from Python, I call the check_output function from the subprocess module. This checks the command completes successfully, then returns the output as a string. I can check if the output is the string av1:

import subprocess


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    output = subprocess.check_output([
        "ffprobe",
        #
        # Set the logging level
        "-loglevel", "error",
        #
        # Select the first video stream
        "-select_streams", "v:0",
        #
        # Print the codec_name (e.g. av1)
        "-show_entries", "stream=codec_name",
        #
        # Print just the value
        "-output_format", "csv=print_section=0",
        #
        # Name of the video to check
        path
    ], text=True)

    return output.strip() == "av1"

Most of this function is defining the ffprobe command, which takes quite a few flags. Whenever I embed a shell command in another program, I always replace any flags/arguments with the long versions, and explain their purpose in a comment – for example, I’ve replaced -of with -output_format. Short flags are convenient when I’m typing something by hand, but long flags are more readable when I return to this code later.

This function works, but the ffprobe command is quite long, and it requires spawning a new process for each video I want to check. Is there a faster way?

Getting the video codec with MediaInfo

While working at the Flickr Foundation, I discovered MediaInfo, another tool for analysing video files. It’s used in Data Lifeboat to get the dimensions and duration of videos.

You can run MediaInfo as a command-line program to get the video codec:

$ mediainfo --Inform="Video;%Format%" "input.mp4"
AV1

This is a simpler command than ffprobe, but I’d still be spawning a new process if I called this from subprocess.
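For comparison, here’s a rough sketch of what that subprocess version might look like – it’s not code I actually run, just the mediainfo command above wrapped in check_output:

import subprocess


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    output = subprocess.check_output(
        ["mediainfo", "--Inform=Video;%Format%", path],
        text=True,
    )

    return output.strip() == "AV1"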

Fortunately, MediaInfo is also available as a library, and it has a Python wrapper. Install the wrapper with pip install pymediainfo, and we can use the functionality of MediaInfo inside our Python process:

>>> from pymediainfo import MediaInfo
>>> media_info = MediaInfo.parse("input.mp4")
>>> media_info.video_tracks[0].codec_id
'av01'

This code could throw an IndexError if there’s no video track – if it’s a .mp4 file which only has audio data – but that’s pretty unusual, and not something I’ve found in any of my videos.

I can write a new wrapper function:

from pymediainfo import MediaInfo


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    media_info = MediaInfo.parse(path)

    return media_info.video_tracks[0].codec_id == "av01"

This is shorter than the ffprobe code, and faster too – testing locally, this is about 3.5× faster than spawning an ffprobe process per file.
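If you wanted to guard against audio-only files, a hedged variant could treat “no video tracks” as “not AV1” – an extra safety net, not something my files have needed:

def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.

    Files without a video track (e.g. audio-only .mp4 files) return
    False instead of throwing an IndexError.
    """
    video_tracks = MediaInfo.parse(path).video_tracks

    return bool(video_tracks) and video_tracks[0].codec_id == "av01"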

Writing a test to find videos with the AV1 codec

Now we have a function that tells us if a given video uses AV1, we want a test that checks if there are any matching files. This is what I wrote:

import glob


def test_no_videos_are_av1():
    """
    No videos are encoded in AV1 (which doesn't play on my iPhone).

    This test can be removed when I upgrade all my devices to ones with
    hardware AV1 decoding support.

    See https://alexwlchan.net/2025/av1-on-my-iphone/
    """
    av1_videos = {
        p
        for p in glob.glob("**/*.mp4", recursive=True)
        if is_av1_video(p)
    }

    assert av1_videos == set()

It uses the glob module to find .mp4 video files anywhere in the current folder, and then filters for files which use the AV1 codec. The recursive=True argument is important, because it tells glob to search below the current directory.

I’m only looking for .mp4 files because that’s the only format I use for videos, but you might want to search for .mkv or .webm too. If I was doing that, I might drop glob and use my snippet for walking a file tree instead.
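As a sketch of what that might look like – this isn’t my snippet, just an illustrative pathlib version with an example set of extensions:

from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".mkv", ".webm"}

video_paths = [
    p
    for p in Path(".").rglob("*")
    if p.suffix.lower() in VIDEO_SUFFIXES
]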

The test builds a set of all the AV1 videos, then checks that it’s empty. This means that if the test fails, I can see all the affected videos at once. If the test failed on the first AV1 video, I’d only know about one video at a time, which would slow me down.

Putting it all together

You can use ffprobe or MediaInfo – I prefer MediaInfo because it’s faster and I already have it installed, but both approaches are fine.

Here’s my final test, which uses MediaInfo to check if a video uses AV1, and scans a folder using glob. I’ve saved it as test_no_av1_videos.py:

import glob

from pymediainfo import MediaInfo


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    media_info = MediaInfo.parse(path)

    return media_info.video_tracks[0].codec_id == "av01"


def test_no_videos_are_av1():
    """
    No videos are encoded in AV1 (which doesn't play on my iPhone).

    This test can be removed when I upgrade all my devices to ones with
    hardware AV1 decoding support.

    See https://alexwlchan.net/2025/av1-on-my-iphone/
    """
    av1_videos = {
        p
        for p in glob.glob("**/*.mp4", recursive=True)
        if is_av1_video(p)
    }

    assert av1_videos == set()

In one folder with 350 videos, this takes about 8 seconds to run. I could make that faster by reading the video files in parallel, or caching the results, but it’s fast enough for now.
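If it ever gets too slow, a minimal sketch of the parallel version could use a thread pool – the worker count here is a guess, not a tuned value:

from concurrent.futures import ThreadPoolExecutor


def find_av1_videos(paths: list[str]) -> set[str]:
    """
    Check videos in parallel, and return the paths encoded with AV1.
    """
    # An arbitrary worker count; I haven't profiled this
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(is_av1_video, paths))

    return {p for p, is_av1 in zip(paths, results) if is_av1}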

When I buy a new device with hardware AV1 decoding, I’ll delete this test. Until then, it’s a quick and easy way to find and re-encode any videos that won’t play on my iPhone.


Why can’t my iPhone play that video?

2025-10-28 17:59:45

I download a lot of videos, but recently I discovered that some of those videos won’t play on my iPhone. If I try to open the videos or embed them in a webpage, I get a broken video player:

A black box with a play triangle in the middle, and a line striking it out.

These same videos play fine on my Mac – it’s just my iPhone that has issues. The answer involves the AV1 video codec, Apple’s chips, and several new web APIs I learnt along the way.


My iPhone is too old for the AV1 video codec

Doing some research online gave me the answer quickly: the broken videos use the AV1 codec, which isn’t supported on my iPhone. AV1 is a modern video codec that’s designed to be very efficient and royalty-free, but it has only recently been supported on Apple devices.

I have an iPhone 13 mini with an A15 processor. My iPhone doesn’t have hardware decoding support for AV1 videos – that only came with the iPhone 15 Pro and the A17 Pro. This support was included in all subsequent chips, including the M4 Pro in my Mac Mini.

It’s theoretically possible for Apple to decode AV1 in software, but they haven’t. According to Roger Pantos, who works on Apple’s media streaming team, there are no plans to provide software decoding for AV1 video. This means that if your chip doesn’t have this support, you’re out of luck.

I wanted to see if I could have worked this out myself. I couldn’t find any references or documentation for Apple’s video codec support – so failing that, is there some query or check I could run on my devices?

Checking compatibility with web APIs

I’ve found a couple of APIs that tell me whether my browser can play a particular video. These APIs are used by video streaming sites to make sure they send you the correct video. For example, YouTube can work out that my iPhone doesn’t support AV1 video, so they’ll stream me a video that uses a different codec.

Getting the MIME type and bitrate for the video

The APIs I found require a MIME type and bitrate for the video.

A MIME type can be something simple like video/mp4 or image/jpeg, but it can also include information about the video codec. The codec string for AV1 is quite complicated, and includes many parts. If you want more detail, read the AV1 codecs docs on MDN or the Codecs Parameter String section of the AV1 spec.

We can get the key information about the unplayable video using ffprobe:

ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name,profile,level,bits_per_raw_sample \
    -of default=noprint_wrappers=1 "input.mp4"
# codec_name=av1
# profile=Main
# level=8
# bits_per_raw_sample=N/A

The AV1 codec template is av01.P.LLT.DD, which we construct as follows:

  • P = the profile number, and “Main” means 0
  • LL = a two-digit level number, so 08
  • T = the tier indicator, which can be Main or High. I think the “High” tier is for professional workflows, so let’s assume my video is Main, or M.
  • DD = the two-digit bit depth, so 08.

This gives us the MIME type for the unplayable video:

video/mp4; codecs=av01.0.08M.08
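As a sanity check, here’s a small Python sketch that assembles the codec string from those parts (the tier and bit depth are the assumptions above, not values read from the file):

profile = 0     # "Main" profile maps to 0
level = 8       # ffprobe reported level=8
tier = "M"      # assuming the Main tier
bit_depth = 8   # assuming 8-bit video

codec = f"av01.{profile}.{level:02d}{tier}.{bit_depth:02d}"
print(f"video/mp4; codecs={codec}")
# video/mp4; codecs=av01.0.08M.08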

I also got the MIME type for an H.264 video which does play on my iPhone:

video/mp4; codecs=avc1.640028

By swapping out the argument to -show_entries, we can also use ffprobe to get the resolution, frame rate, and bit rate:

ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,bit_rate,r_frame_rate \
  -of default=noprint_wrappers=1:nokey=0 "input.mp4"
# width=1920
# height=1080
# r_frame_rate=24000/1001
# bit_rate=1088190

Now we have this information, let’s pass it to some browser APIs.

HTMLMediaElement: canPlayType()

The video.canPlayType() method on HTMLMediaElement takes a MIME type, and tells you whether a browser is likely able to play that media. Note the word “likely”: the possible responses are "probably", "maybe", and the empty string "" (meaning “no”).

Here’s an example using the MIME type of my AV1 video:

const video = document.createElement("video");
video.canPlayType("video/mp4; codecs=av01.0.08M.08");

Let’s run this with a few different values, and compare the results:

Video            MIME type                           iPhone         Mac
AV1 video        video/mp4; codecs=av01.0.08M.08     "" (= “no”)    "probably"
H.264 video      video/mp4; codecs=avc1.640028       "probably"     "probably"
Generic MP4      video/mp4                           "maybe"        "maybe"
Made-up format   video/mp4000                        "" (= “no”)    "" (= “no”)

This confirms the issue: my iPhone can’t play AV1 videos, while my Mac can.

The generic MP4 is a clue about why this API returns a “likely” result, not something more certain. The MIME type alone doesn’t carry enough information to say for sure whether a video will be playable.

MediaCapabilities: decodingInfo()

For a more nuanced answer, we can use the decodingInfo() method in the MediaCapabilities API. You pass detailed information about the video, including the MIME type and resolution, and it tells you whether the video can be played – and more than that, whether the video can be played in a smooth and power-efficient way.

Here’s an example of how you use it:

await navigator.mediaCapabilities.decodingInfo({
  type: "file",
  video: {
    contentType: "video/mp4; codecs=av01.0.08M.08",
    width:     1920,
    height:    1080,
    bitrate:   1088190,
    framerate: 24
  }
});
// {powerEfficient: false,
//  smooth: false,
//  supported: false,
//  supportedConfiguration: Object}

Let’s try this with two videos:

AV1 video
  Video config:  contentType "video/mp4; codecs=av01.0.08M.08",
                 1920×1080, bitrate 1088190, framerate 24
  iPhone:        not supported
  Mac:           supported, smooth, power efficient

H.264 video
  Video config:  contentType "video/mp4; codecs=avc1.640028",
                 1440×1080, bitrate 1660976, framerate 24
  iPhone:        supported, smooth, power efficient
  Mac:           supported, smooth, power efficient

This re-confirms our theory that my iPhone’s lack of AV1 support is the issue.

It’s worth noting that this is still a heuristic, not a guarantee. I plugged some really large numbers into this API, and my iPhone claims it could play a trillion-pixel H.264 encoded video in a smooth and power efficient way. I know Apple’s hardware is good, but it’s not that good.

What am I going to do about this?

This is only an issue because I have a single video file, it’s encoded with AV1, and I have a slightly older iPhone. Commercial streaming services like YouTube, Vimeo and TikTok don’t have this problem because they store videos with multiple codecs, and use browser APIs to determine the right version to send you.

Apple would like me to buy a new iPhone, but that’s overkill for this problem. That will happen eventually, but not today.

In the meantime, I’m going to convert any AV1 encoded videos to a codec that my iPhone can play, and change my downloader script to do the same to any future downloads. Before I understood the problem, I was playing whack-a-mole with broken videos. Now I know that AV1 encoding is the issue, I can find and fix all of these videos in one go.


Doing my own syntax highlighting (finally)

2025-10-22 20:55:42

I had syntax highlighting in the very first version of this blog, and I never really thought about that decision. I was writing a programming blog, and I was including snippets of code, so obviously I should have syntax highlighting.

Over the next thirteen years, I tweaked and refined the rest of the design, but I never touched the syntax highlighting. I’ve been applying a rainbow wash of colours that somebody else chose, because I didn’t have any better ideas.

This week I read Nikita Prokopov’s article Everyone is getting syntax highlighting wrong, which advocates for a more restrained approach. Rather than giving everything a colour, he suggests colouring just a few key elements, like strings, comments, and variable definitions. I don’t know if that would work for everybody, but I like the idea, and it gave me the push to try something new.

It’s time to give code snippets the same care I’ve given the rest of this site.


What have I changed?

I’ve stripped back the syntax highlighting to a few key rules:

  • Comments in red
  • Strings in green
  • Numbers, booleans, and other constants in magenta
  • Variable definitions in blue
  • Punctuation in grey

Everything else is the default black/white.

This is similar to Nikita’s colour scheme “Alabaster”, but I chose my own colours to match my site palette. I’m also making my own choices about how to interpret these rules, because real code doesn’t always fall into neat buckets.

Here’s a snippet of Rust code with the old syntax highlighting:

Screenshot of some Rust code. Most of the code is coloured either blue, red, or green, with some bold and italicised text. The one comment is in a dull bluish-grey.

Here’s the same code in my new design:

Screenshot of the same Rust code. Now the code is mostly black, with blue highlights for the import/function/variable name, strings in green, and the comment is bright red.

Naturally, these code blocks work in both light and dark mode.

The new design is cleaner, it fits in better with the rest of the site, and I really like it. Some of that will be novelty and the IKEA effect, but I see other benefits to this simplified palette.

How does it work?

Converting code to HTML

I use Rouge as my syntax highlighter. I give it a chunk of code and specify the language, and it parses the code into a sequence of tokens – like operators, variables, or constants. Rouge returns a blob of HTML, with each token wrapped in a <span> that describes the token type.

For example, if I ask Rouge to highlight the Python snippet:

print("hello world")

it produces this HTML:

<span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">hello world</span><span class="sh">"</span><span class="p">)</span>

The first token is print, which is a function name (Name.Function, or class="nf"). The ( and ) are punctuation ("p") and the string is split into quotation marks (sh for String.Heredoc) and the text (s for String). You can see all the tokens in this short example, and all the possible token types are listed in the Pygments docs. (Pygments is another syntax highlighting library, but Rouge uses the same classification system.)

Each token has a different class in the HTML, so I can style tokens with CSS. For example, if I want all function names to be blue, I can target the "nf" class:

.nf { color: blue; }

I wrap the entire block in a <pre> tag with a language class, like <pre class="language-go">, so I can also apply per-language styles if I want.

Separating variable/function definitions and usage

I want to highlight variables when they’re defined, not every time they’re used. This gives you an overview of the structure, without drowning the code in blue.

This is tricky with Rouge, because it has no semantic understanding of the code – it only knows what each token is, not how it’s being used. In the example above, it knows that print is the name of a function, but it doesn’t know if the function is being called or being defined.

I could use something smarter, like the language servers used by modern IDEs, but that’s a lot of extra complexity. It might not even work – many of my code snippets are fragments, not complete programs, and wouldn’t parse cleanly.

Instead, I’m manually annotating my code snippets to mark definitions. I wrote a Jekyll plugin that reads those annotations, and modifies the HTML from Rouge to add the necessary highlights. It’s extra work, but I already spend a lot of time trying to pick the right snippet of code to make my point. These annotations are quick and easy, and it’s worth it for a clearer result.

Older posts don’t have these annotations, so they won’t get the full benefit of the new colour scheme, but I’m gradually updating them.

What do I like about this new design?

It’s pushed me to think more about syntax highlighting

Now that I’m not using somebody else’s rules, I’m paying more attention to how my code looks. I’m thinking more carefully about how my rules should apply. I’m noticing when colours feel confusing or unclear, and finding small ways to tweak them to improve clarity.

For example, “variable definitions in blue” sounds pretty clear cut, but does that include imports? Function parameters? What about HTML or CSS, where variables aren’t really a thing? What parts of the code do I think are important and worth highlighting?

I could have asked these questions at any time, but changing my syntax highlighting gave me the push to actually do it.

It makes comments more prominent

In my first programming job, I worked in a codebase with extensive comments. They were a good starting point in unfamiliar code, with lots of context, explanation, and narrative. The company’s default IDE showed comments in bright blue, and looking back, I’m sure that colour choice encouraged the culture of detailed documentation.

I realise now how unusual that was, but at the time it was my only experience of professional software development. I carried that habit of writing comments into subsequent jobs, but I’d forgotten the colour scheme. Now, I’m finally reviving that good idea.

Comments are bright red in my new theme – not the subdued grey used by so many other themes. The pop of colour makes comments easier to spot and more inviting to read. I’ve also ported this style to my IDE, and now when I write comments, I don’t feel like my words are disappearing.

It might be easier to read

I was inspired to make this change by reading Nikita Prokopov’s article, which argues for a minimal colour scheme – but not everyone agrees.

Syntax highlighting is mostly a matter of taste. Some programmers like a clean, light theme, others prefer high-contrast dark themes with bold colours. There’s lots of research into how we read code, but as far as I know, there’s no strong evidence or consensus in favour of any particular approach.

Whatever your taste, I think code is easier to read in a colour scheme you’re already familiar with. You know what the colours mean, so your brain doesn’t have to learn anything. A new scheme might grow on you over time, but at first, it’s more likely to be distracting than helpful.

That’s a problem for a blog like this. Most readers find a single post through search, read something useful, and never return. They’re not reading enough code here to learn my colour scheme, and unfamiliar colours are just noise.

With that in mind, I think a minimal palette works better. My posts only contain short snippets of code – enough to make a point, but not full files or complex programs. When you’re only reading a small amount of code, it’s more useful to highlight key elements than wash everything in colour.

Better late than never

I’ve wanted to improve my code snippets for a long time, but it always felt overwhelming. I’m used to colour themes which use a large palette, and I don’t have the design skills to choose that many colours. It wasn’t until I thought about using a smaller palette that this felt doable.

I picked my new colours just a few days ago, and already the old design feels stale and tired. I’d planned to spend more time tinkering before I make it live, but it’s such an improvement I want to use it immediately.

I love having a little corner of the web which is my own personal sandbox. Thirteen years in, and I’m still finding new ways to make myself smile.


Creating a personal wrapper around yt-dlp

2025-10-07 14:19:18

I download a lot of videos from YouTube, and yt-dlp is my tool of choice. Sometimes I download videos as a one-off, but more often I’m downloading videos in a project – my bookmarks, my collection of TV clips, or my social media scrapbook.

I’ve noticed myself writing similar logic in each project – finding the downloaded files, converting them to MP4, getting the channel information, and so on. When you write the same thing multiple times, it’s a sign you should extract it into a shared tool – so that’s what I’ve done.

yt-dlp_alexwlchan is a script that calls yt-dlp with my preferred options, in particular:

  • Download the highest-quality video, thumbnail, and subtitles
  • Save the video as MP4 and the thumbnail as a JPEG
  • Get some information about the video (like title and description) and the channel (like the name and avatar)

All this is presented in a CLI command which prints a JSON object that other projects can parse. Here’s an example:

$ yt-dlp_alexwlchan.py "https://www.youtube.com/watch?v=TUQaGhPdlxs"
{
  "id": "TUQaGhPdlxs",
  "url": "https://www.youtube.com/watch?v=TUQaGhPdlxs",
  "title": "\"new york city, manhattan, people\" - Free Public Domain Video",
  "description": "All videos uploaded to this channel are in the Public Domain: Free for use by anyone for any purpose without restriction. #PublicDomain",
  "date_uploaded": "2022-03-25T01:10:38Z",
  "video_path": "\uff02new york city, manhattan, people\uff02 - Free Public Domain Video [TUQaGhPdlxs].mp4",
  "thumbnail_path": "\uff02new york city, manhattan, people\uff02 - Free Public Domain Video [TUQaGhPdlxs].jpg",
  "subtitle_path": null,
  "channel": {
    "id": "UCDeqps8f3hoHm6DHJoseDlg",
    "name": "Public Domain Archive",
    "url": "https://www.youtube.com/channel/UCDeqps8f3hoHm6DHJoseDlg",
    "avatar_url": "https://yt3.googleusercontent.com/ytc/AIdro_kbeCfc5KrnLmdASZQ9u649IxrxEUXsUaxdSUR_jA_4SZQ=s0"
  },
  "site": "youtube"
}

Rather than using the yt-dlp CLI, I’m using the Python interface. I can import the YoutubeDL class and pass it some options, then pull out the important fields from the response. The library is very flexible, and the options are well-documented.
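As a rough sketch of that shape – these options are illustrative, not the full set my script uses:

from yt_dlp import YoutubeDL

# Illustrative options; my real script configures much more
options = {
    "format": "bestvideo*+bestaudio/best",
    "writethumbnail": True,
    "writesubtitles": True,
}

with YoutubeDL(options) as ydl:
    info = ydl.extract_info(
        "https://www.youtube.com/watch?v=TUQaGhPdlxs", download=True
    )

print(info["id"], info["title"])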

This is similar to my create_thumbnail tool. I only have to define my preferred behaviour once, then other code can call it as an external script.

I have ideas for changes I might make in the future, like tidying up filenames or supporting more sites, but I’m pretty happy with this first pass. All the code is in my yt-dlp_alexwlchan GitHub repo.

This script is based on my preferences, so you probably don’t want to use it directly – but if you use yt-dlp a lot, it could be a helpful starting point for writing your own script.

Even if you don’t use yt-dlp, the idea still applies: when you find yourself copy-pasting configuration and options, turn it into a standalone tool. It keeps your projects cleaner and more consistent, and your future self will thank you for it.
