Blog of Alex Wlchan

I'm a software developer, writer, and hand crafter from the UK. I'm queer and trans.

When square pixels aren’t square

2025-12-05 15:54:32

When I embed videos in web pages, I specify an aspect ratio. For example, if my video is 1920 × 1080 pixels, I’d write:

<video style="aspect-ratio: 1920 / 1080">

If I also set a width or a height, the browser now knows exactly how much space this video will take up on the page – even if it hasn’t loaded the video file yet. When it initially renders the page, it can leave the right gap, so it doesn’t need to rearrange when the video eventually loads. (The technical term is “reducing cumulative layout shift”.)

That’s the idea, anyway.

I noticed that some of my videos weren’t fitting in their allocated boxes. When the video file loaded, it could be too small and get letterboxed, or be too big and force the page to rearrange to fit. Clearly there was a bug in my code for computing aspect ratios, but what?

Three aspect ratios, one video

I opened one of the problematic videos in QuickTime Player, and the resolution listed in the Movie Inspector was rather curious: Resolution: 1920 × 1080 (1350 × 1080).

The first resolution is what my code was reporting, but the second resolution is what I actually saw when I played the video. Why are there two?

The storage aspect ratio (SAR) of a video is the pixel resolution of a raw frame. If you extract a single frame as a still image, that’s the size of the image you’d get. This is the first resolution shown by QuickTime Player, and it’s what I was reading in my code.

I was missing a key value – the pixel aspect ratio (PAR). This describes the shape of each pixel, in particular the width-to-height ratio. It tells a video player how to stretch or squash the stored pixels when it displays them. This can sometimes cause square pixels in the stored image to appear as rectangles.

[Figure: three 3×3 grids of pixels. PAR < 1: portrait pixels, each taller than it is wide. PAR = 1: square pixels. PAR > 1: landscape pixels, each wider than it is tall.]

This reminds me of EXIF orientation for still images – a transformation that the viewer applies to the stored data. If you don’t apply this transformation properly, your media will look wrong when you view it. I wasn’t accounting for the pixel aspect ratio in my code.

According to Google, the primary use case for non-square pixels is standard-definition televisions which predate digital video. However, I’ve encountered several videos with an unusual PAR that were made long into the era of digital video, when that seems unlikely to be a consideration. It’s especially common in vertical videos like YouTube Shorts, where the stored resolution is a square 1080 × 1080, and the aspect ratio makes it a portrait.

I wonder if it’s being introduced by a processing step somewhere? I don’t understand why, but I don’t have to – I’m only displaying videos, not producing them.

The display aspect ratio (DAR) is the size of the video as viewed – what happens when you apply the pixel aspect ratio to the stored frames. This is the second resolution shown by QuickTime Player, and it’s the aspect ratio I should be using to preallocate space in my video player.

These three values are linked by a simple formula:

DAR = SAR × PAR

The size of the viewed video is the stored resolution times the shape of each pixel.
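As a minimal sketch of this formula in Python, using a 1920 × 1080 stored resolution and a 45:64 pixel aspect ratio as an example:

```python
from fractions import Fraction

# Stored resolution (SAR) and pixel shape (PAR)
stored_width, stored_height = 1920, 1080
par = Fraction(45, 64)

# DAR = SAR × PAR: only the width changes, because PAR is a
# width-to-height ratio applied to each pixel
display_width = round(stored_width * par)
display_height = stored_height

print(display_width, display_height)  # 1350 1080
```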

The stored frame may not be what you see

One video with a non-unit pixel aspect ratio is my download of Mars EDL 2020 Remastered. This video by Simeon Schmauß tries to match what the human eye would have seen during the landing of NASA’s Perseverance rover in 2021.

We can get the width, height, and sample aspect ratio (which is another name for pixel aspect ratio) using ffprobe:

$ ffprobe -v error \
      -select_streams v:0 \
      -show_entries stream=width,height,sample_aspect_ratio \
      "Mars 2020 EDL Remastered [HHhyznZ2u4E].mp4"
[STREAM]
width=1920
height=1080
sample_aspect_ratio=45:64
[/STREAM]

Here 1920 is the stored width, and 45:64 is the pixel aspect ratio. We can multiply them together to get the display width: 1920 × 45 / 64 = 1350. This matches what I saw in QuickTime Player.

Let’s extract a single frame using ffmpeg, to get the stored pixels. This command saves frame number 5000 (ffmpeg counts frames from zero) as a PNG image:

$ ffmpeg -i "Mars 2020 EDL Remastered [HHhyznZ2u4E].mp4" \
    -filter:v "select=eq(n\,5000)" \
    -frames:v 1 \
    frame.png

The image is 1920 × 1080 pixels, and it looks wrong: the circular parachute is visibly stretched.

[Photo looking up towards a parachute against a dark brown sky. The parachute is made of white-and-orange segments, and is stretched horizontally: the circle is wider than it is tall.]

Suppose we take that same image, but now apply the pixel aspect ratio. This is what the image is meant to look like, and it’s not a small difference – now the parachute actually looks like a circle.

[The same photo as before, but now the parachute is a circle.]

Seeing both versions side-by-side makes the problem obvious: the stored frame isn’t how the video is displayed. The video player in my browser will play it correctly using the pixel aspect ratio, but my layout code wasn’t doing that. I was telling the browser the wrong aspect ratio, and the browser had to update the page when it loaded the video file.

Getting the correct display dimensions in Python

This is my old function for getting the dimensions of a video file, which uses a Python wrapper around MediaInfo to extract the width and height fields. I now realise that this only gives me the storage aspect ratio, and may be misleading for some videos.

from pathlib import Path

from pymediainfo import MediaInfo


def get_storage_aspect_ratio(video_path: Path) -> tuple[int, int]:
    """
    Returns the storage aspect ratio of a video, as a width/height ratio.
    """
    media_info = MediaInfo.parse(video_path)
    
    try:
        video_track = next(
            tr
            for tr in media_info.tracks
            if tr.track_type == "Video"
        )
    except StopIteration:
        raise ValueError(f"No video track found in {video_path}")
    
    return video_track.width, video_track.height

I can’t find an easy way to extract the pixel aspect ratio using pymediainfo. It does expose a Track.aspect_ratio property, but that’s a string which has a rounded value – for example, 45:64 becomes 0.703. That’s close, but the rounding introduces a small inaccuracy. Since I can get the complete value from ffprobe, that’s what I’m doing in my revised function.
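As a quick sketch of that conversion: ffprobe reports the ratio as a colon-separated string, and `Fraction` accepts a "numerator/denominator" string, so swapping the colon for a slash gives an exact value:

```python
from fractions import Fraction

# ffprobe reports the pixel aspect ratio as a colon-separated string
sar_string = "45:64"

# Fraction accepts a "numerator/denominator" string, so swap : for /
par = Fraction(sar_string.replace(":", "/"))

print(par)         # 45/64
print(float(par))  # 0.703125
```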

The new function is longer, but it’s more accurate:

from fractions import Fraction
import json
from pathlib import Path
import subprocess


def get_display_aspect_ratio(video_path: Path) -> tuple[int, int]:
    """
    Returns the display aspect ratio of a video, as a (width, height) pair.
    """
    cmd = [
        "ffprobe",
        #
        # verbosity level = error
        "-v", "error",
        #
        # only get information about the first video stream
        "-select_streams", "v:0",
        #
        # only gather the entries I'm interested in
        "-show_entries", "stream=width,height,sample_aspect_ratio",
        #
        # print output in JSON, which is easier to parse
        "-print_format", "json",
        #
        # input file
        str(video_path)
    ]
    
    output = subprocess.check_output(cmd)
    ffprobe_resp = json.loads(output)
    
    # The output will be structured something like:
    #
    #   {
    #       "streams": [
    #           {
    #               "width": 1920,
    #               "height": 1080,
    #               "sample_aspect_ratio": "45:64"
    #           }
    #       ],
    #       …
    #   }
    #
    # If the video doesn't specify a pixel aspect ratio, then it won't
    # have a `sample_aspect_ratio` key.
    video_stream = ffprobe_resp["streams"][0]
    
    try:
        pixel_aspect_ratio = Fraction(
            video_stream["sample_aspect_ratio"].replace(":", "/")
        )
    except KeyError:
        pixel_aspect_ratio = 1
    
    width = round(video_stream["width"] * pixel_aspect_ratio)
    height = video_stream["height"]
    
    return width, height

This is calling the ffprobe command I showed above, plus -print_format json to print the data in JSON, which is easier for Python to parse.

I have to account for the case where a video doesn’t set a sample aspect ratio – in that case, the displayed video just uses square pixels.

Since the aspect ratio is expressed as a ratio of two integers, this felt like a good chance to try the fractions module. That avoids converting the ratio to a floating-point number, which potentially introduces inaccuracies. It doesn’t make a big difference, but in my video collection treating the aspect ratio as a float produces results that are 1 or 2 pixels different from QuickTime Player.

When I multiply the stored width and aspect ratio, I’m using the round() function to round the final width to the nearest integer. That’s more accurate than int(), which always rounds down.
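To illustrate the difference, here’s a hypothetical stored width of 1917 pixels, chosen so the product isn’t a whole number:

```python
from fractions import Fraction

# Hypothetical stored width where width × PAR isn't a whole number:
# 1917 × 45/64 = 1347.890625
width = 1917
par = Fraction(45, 64)

print(int(width * par))    # 1347 – int() truncates
print(round(width * par))  # 1348 – round() picks the nearest integer
```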

Conclusion: use display aspect ratio

When you want to know how much space a video will take up on a web page, look at the display aspect ratio, not the stored pixel dimensions. Pixels can be squashed or stretched before display, and the stored width/height won’t tell you that.

Videos with non-square pixels are pretty rare, which is why I ignored this for so long. I’m glad I finally understand what’s going on.

After switching to ffprobe and using the display aspect ratio, my pre-allocated video boxes now match what the browser eventually renders – no more letterboxing, no more layout jumps.

[If the formatting of this post looks odd in your feed reader, visit the original article]

Can you take an ox to Oxford?

2025-11-21 08:02:28

At the end of last month, Oxford introduced a £5 congestion charge for passenger cars driving through the city centre. Some vehicles are exempt, like taxis, emergency vehicles, and delivery vans – but what about animals?

i need one of the city councillors following me to advise whether an oxen-driven cart would be chargeable under the congestion charge

madeline odent @oldenoughtosay.com

i mean they are technically zero emissions i guess so you could probably make an argument for driving one down the cornmarket

I’m not a city councillor or a lawyer, but I tried to work out the answer anyway.

The first thing I found was a “Charging order” on the Oxfordshire County Council website, which looks like the legal definition of the congestion charge. Paragraph 3 says “A relevant vehicle is a Class M1 vehicle that is not [non-chargeable or permitted]” – where “non-chargeable” and “permitted” describes the various exemptions.

The order defines Class M1 vehicles as “those falling within class M1(a) and class M1(b) as specified in Schedule 1 of the Vehicle Classes Regulations”, which refers to another bit of UK legislation.

That legislation is the Road User Charging and Workplace Parking Levy (Classes of Motor Vehicles) (England) Regulations 2001 (really rolls off the tongue), the full text of which is available online. Here’s the section we’re interested in:

Category M: Motor vehicles with at least four wheels used for the carriage of passengers

In this Part, references to a “motor vehicle” are to a motor vehicle with or without a semi-trailer which—

(a) has at least four wheels;
(b) has an unladen mass exceeding 400kg or an engine with net power exceeding 15kW;
(c) is used for the carriage of passengers; and
(d) is not a motor caravan within Class L(a) or Class L(b) specified in Part III above.

Class M1(a)
A motor vehicle which comprises no more than eight seats in addition to the driver’s seat.

Class M1(b)
A motor vehicle as defined in Class M1(a) which is drawing a trailer.

At first glance, this definition could almost include an ox-drawn carriage: four wheels, hefty enough to exceed 400kg, carrying passengers, and clearly not a motor caravan.

But I don’t think an ox counts as a “motor vehicle”, which is defined by the Road Traffic Act 1988 as “a mechanically propelled vehicle intended or adapted for use on roads”. If your ox is only mechanically propelled in the sense of kicking and walking, it’s off the hook.

Surprisingly, UK legislation doesn’t define “mechanically propelled”. Lawyers usually define everything, even words that seem obvious. Has anybody ever tried arguing that “mechanically propelled” is ambiguous?

I did find a few mentions of one case, Chief Constable of Avon & Somerset v Fleming [1987], where somebody argued that modifying their motorbike for off-roading meant it was no longer a motor vehicle. The judge was unconvinced. (What is it with Flemings and wacky vehicles?)

Returning to the Oxford congestion charge, Category M is interesting because it excludes several other vehicles you might think of as passenger cars.

As Alisdair notes, a three-wheeled Reliant Robin doesn’t count, nor does a small, underpowered Peel P50. Based on a light reading, I believe the Robin is a Class C(a) “motor tricycle”, while a P50 is a Class D(a) “Light quadricycle”. I look forward to an owners’ club holding their annual meeting in Oxford city centre.

Practically speaking, Oxford’s congestion charge is almost certainly enforced by cameras that scan your number plate. An ox-drawn cart doesn’t have a number plate, so it won’t be charged. Other vehicles like a Renault Twizy or Reliant Robin do have number plates, so they’ll be charged even though they’re technically exempt.

To wrap up: an oxen-driven cart isn’t mechanically propelled, it’s not a motor vehicle, and the Oxford congestion charge doesn’t apply. Your ox can ride in and out of the city as much as you like.

But like all the best ideas, somebody in Cambridge thought of it first.


As a closing thought, I think it’s good that so much of the UK’s legislation is publicly available. I could find all of these documents from my phone, without going to a courtroom or a library or a government building. I don’t think about it often, but this kind of openness isn’t a given. Being able to check the law directly, even for extremely silly questions, helps keep the whole system honest.


Cleaning up messy dates in JSON

2025-11-17 16:52:48

I’ve been cleaning up some messy data, and it includes timestamps written by a variety of humans and machines, which don’t use a consistent format.

Here are a few examples:

2025-10-14T05:34:07+0000
2015-03-01 23:34:39 +00:00
2013-9-21 13:43:00Z
2024-08-30

All of these timestamps are machine-readable, but it would be easier for the downstream code if there weren’t as many different formats. For example, the downstream code uses the JavaScript Date() constructor, which rejects some of the timestamps as invalid.

I wrote a Python script to help me find, validate, and normalise all my timestamps, so the rest of my code can deal with a more consistent set.

Table of contents

Finding all the date strings

All the messy data is in JSON, and the structure is quite inconsistent – a lot of heavily nested objects, differently-named fields, varying models and schemas. This project is about tidying it up.

One saving grace is that the timestamps are named fairly consistently – they’re all stored inside JSON objects, with keys that start with date_, and values which are strings. Here’s an example:

{
  "doc1": {"id": "1", "date_created": "2025-10-14T05:34:07+0000"},
  "shapes": [
    {"color": "blue", "date_saved": "2015-03-01 23:34:39 +00:00"},
    {"color": "yellow", "date_saved": "2013-9-21 13:43:00Z", "is_square": true},
    {"color": "green", "date_saved": null}
  ],
  "date_verified": "2024-08-30"
}

The first thing I want to do is find all the key-value pairs which combine a date_ and a string.

I wrote a Python function to recursively walk the JSON and pull out matching pairs. I’m sure there are libraries for this, but JSON is simple enough that I can write it by hand. It only has a few types, and even fewer that matter here:

  • If it’s a JSON object: inspect its keys, then recurse into each value
  • If it’s a JSON array: recurse into each element
  • If it’s a string, number, bool, or null: ignore it

Here’s my code:

from collections.abc import Iterator
from typing import Any


def find_all_dates(json_value: Any) -> Iterator[tuple[dict[str, Any], str, str]]:
    """
    Find all the timestamps in a heavily nested JSON object.

    This function looks for any JSON objects with a key-value pair
    where the key starts with `date_` and the value is a string, and
    emits a 3-tuple:

    *   the JSON object
    *   the key
    *   the value

    """
    # Case 1: JSON objects
    if isinstance(json_value, dict):
        for key, value in json_value.items():
            if (
                isinstance(key, str)
                and key.startswith("date_")
                and isinstance(value, str)
            ):
                yield json_value, key, value
            else:
                yield from find_all_dates(value)

    # Case 2: JSON arrays
    elif isinstance(json_value, list):
        for value in json_value:
            yield from find_all_dates(value)

    # Case 3: other JSON types
    elif isinstance(json_value, (str, int, float, bool)) or json_value is None:
        return

    # Case 4: handle unexpected types
    else:
        raise TypeError(f"Unexpected type: {type(json_value)}")

There are branches for all the builtin JSON types, then a catch-all branch for anything else.

I added a catch-all TypeError branch to catch list- or dict-like inputs that aren’t actually JSON types – things like dict.values() or custom container classes. Without this check, they’d fall through to the “ignore” case and quietly drop nested data. An explicit error makes the failure obvious, and the fix is easy: wrap the input in list() or dict().
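The distinction is easy to check by hand, using the same `isinstance` tests as the function above:

```python
data = {"shapes": [{"date_saved": "2024-08-30"}]}

# dict_values is neither a dict nor a list, so it would hit the catch-all
print(isinstance(data.values(), (dict, list)))        # False

# Wrapping it in list() makes it a real JSON-style array again
print(isinstance(list(data.values()), (dict, list)))  # True
```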

For each timestamp it finds, the function emits a tuple with the nested object, the key and the value. For the example above, here’s the first tuple it returns:

({"id": "1", "date_created": "2025-10-14T05:34:07+0000"},
 "date_created",
 "2025-10-14T05:34:07+0000")

This return type allows me to both read and modify the JSON with the same function:

# Reading the timestamps
for _, _, date_string in find_all_dates(json_value):
    print(date_string)

# Modifying the timestamps
for json_obj, key, date_string in find_all_dates(json_value):
    json_obj[key] = run_fixup(date_string)

The latter works because json_obj points to the actual dictionary from the nested JSON, not a copy, so when we assign json_obj[key] = …, we modify the original JSON structure in-place.
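A minimal sketch of that aliasing behaviour, without the full function:

```python
data = {"shapes": [{"date_saved": "2013-9-21 13:43:00Z"}]}

# This is a reference to the nested dict, not a copy
json_obj = data["shapes"][0]
json_obj["date_saved"] = "2013-09-21T13:43:00+00:00"

# The change is visible in the original structure
print(data["shapes"][0]["date_saved"])  # 2013-09-21T13:43:00+00:00
```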

Now we can find all the timestamps, we need to check if they use a consistent datetime format.

Checking if a date string matches a given format

I normally parse timestamps in Python with the datetime.strptime function. This is quite a strict function, because you have to pass it a format string that describes exactly how you expect the timestamp to be formatted.

If you want a more flexible approach, you can use the python-dateutil module, which has a generic parser that guesses how to read the timestamp, rather than asking you to specify.

I prefer strict parsing, because I (usually!) know exactly how timestamps will be formatted, and inconsistent formats can hide bugs. There’s no room for ambiguity, and no risk of a timestamp being guessed incorrectly.

The strptime function takes two arguments: the string you want to parse, and the format string. Here’s an example:

>>> from datetime import datetime
>>> datetime.strptime("2001-02-03T04:05:06+00:00", "%Y-%m-%dT%H:%M:%S%z")
datetime.datetime(2001, 2, 3, 4, 5, 6, tzinfo=datetime.timezone.utc)

If you pass a timestamp that doesn’t match the format string, it throws a ValueError:

>>> datetime.strptime("2001-02-03", "%Y-%m-%dT%H:%M:%S%z")
ValueError: time data '2001-02-03' does not match format '%Y-%m-%dT%H:%M:%S%z'

It also checks that the whole string is parsed, and throws a ValueError if it’s an incomplete match:

>>> datetime.strptime("2001-02-03T04:05:06+00:00", "%Y-%m-%d")
ValueError: unconverted data remains: T04:05:06+00:00

This allows us to write a function that checks if a timestamp matches a given format:

from datetime import datetime


def date_matches_format(date_string: str, format: str) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    using `format`, False otherwise.
    """
    try:
        datetime.strptime(date_string, format)
        return True
    except ValueError:
        return False

The format can be any format code supported by strptime(). The Python docs have a list of all the accepted format codes. That list includes a couple of non-standard format codes which only have partial support, which is part of why I want to clean up these date strings.

If we want to allow multiple formats, we can wrap this function using any():

def date_matches_any_format(date_string: str, formats: tuple[str, ...]) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    with any of the `formats`, False otherwise.
    """
    return any(
        date_matches_format(date_string, fmt)
        for fmt in formats
    )

Here’s how we can use this function to find any timestamps that don’t match our allowed formats:

allowed_formats = (
    # 2001-02-03T04:05:06+07:00
    "%Y-%m-%dT%H:%M:%S%z",
    #
    # 2001-02-03
    "%Y-%m-%d",
)

for _, _, date_string in find_all_dates(json_value):
    if not date_matches_any_format(date_string, allowed_formats):
        print(date_string)

Testing that all of my timestamps use consistent formats

With these two functions in hand, I wrote a test I can run with pytest to tell me about any timestamps that don’t match my allowed formats:

def test_all_timestamps_are_consistent():
    """
    All the timestamps in my JSON use a consistent format.

    See https://alexwlchan.net/2025/messy-dates-in-json/
    """
    allowed_formats = (
        # 2001-02-03T04:05:06+07:00
        "%Y-%m-%dT%H:%M:%S%z",
        #
        # 2001-02-03
        "%Y-%m-%d",
    )

    bad_date_strings = {
        date_string
        for _, _, date_string in find_all_dates(JSON_DATA)
        if not date_matches_any_format(date_string, allowed_formats)
    }

    assert bad_date_strings == set()

If you only allow a single format, you could simplify this slightly by using date_matches_format.

If the test passes, all your timestamps match the allowed formats. If the test fails, pytest prints the ones that don’t. Running it on our original example shows two disallowed timestamps:

AssertionError: assert {'2013-9-21 1...34:39 +00:00'} == set()

  Extra items in the left set:
  '2015-03-01 23:34:39 +00:00'
  '2013-9-21 13:43:00Z'

As with the test I wrote in my last post, I like to report on all the failing values, not just the first one. This allows me to see the scale of the problem, and see patterns in the failing output – if I see a bad timestamp, is it a one-off issue I should fix by hand, or does it affect thousands of values that need an automatic cleanup?

When I first ran this test, it failed with thousands of errors. I cleaned up the data and re-ran the test until it passed, and now I can keep running it to ensure no unexpected values sneak back in.

Changing the format of date strings in bulk

This test revealed thousands of errors, and I didn’t want to fix them all by hand. That would be slow, tedious, and prone to manual errors. But among the timestamps I wanted to change, there were pockets of consistency – each tool that contributed to this data would write timestamps in a single format, and I could convert all the timestamps from that tool in one go.

I wrote one-off fixer scripts to perform these conversions. Each script would read the JSON file, look for date strings in a given format, convert them to my preferred format, then write the result back to my JSON file. Here’s one example:

from datetime import datetime
import json

# `find_all_dates` and `date_matches_format` are defined earlier in this post


# e.g. 2001-02-03 04:05:06 +07:00
old_format = "%Y-%m-%d %H:%M:%S %z"

# e.g. 2001-02-03T04:05:06+07:00 (datetime.isoformat())
new_format = "%Y-%m-%dT%H:%M:%S%z"


with open("my_data.json") as in_file:
    json_data = json.load(in_file)

for json_obj, key, date_string in find_all_dates(json_data):
    if date_matches_format(date_string, old_format):
        d = datetime.strptime(date_string, old_format)
        json_obj[key] = d.strftime(new_format)

with open("my_data.json", "w") as out_file:
    out_file.write(json.dumps(json_data, indent=2))

I keep my JSON data files in Git, and I committed every time I ran a successful script. That made it easy to see the changes from each fix-up, and to revert them if I made a mistake.

The nice thing about this approach is that each script is quite small and simple, because it’s only trying to fix one thing at a time. But the scripts add up, and this salami slicing eventually produces dramatically cleaner data. I didn’t fix everything this way (some typos were quicker to fix by hand), but scripting fixed the majority of issues.

Putting it all together

Here’s what we’ve done in this post:

  • Written a recursive function to find all the timestamps in a JSON value
  • Chosen the timestamp formats we allow, and added helpers to check them
  • Added a test to find and prevent unexpected formats
  • Written one-off migration scripts to clean up old timestamps

Here’s the final test which ties this all together. I’ve saved it as test_date_formats.py:

from collections.abc import Iterator
from datetime import datetime
from typing import Any


def find_all_dates(json_value: Any) -> Iterator[tuple[dict[str, Any], str, str]]:
    """
    Find all the timestamps in a heavily nested JSON object.

    This function looks for any JSON objects with a key-value pair
    where the key starts with `date_` and the value is a string, and
    emits a 3-tuple:

    *   the JSON object
    *   the key
    *   the value

    """
    # Case 1: JSON objects
    if isinstance(json_value, dict):
        for key, value in json_value.items():
            if (
                isinstance(key, str)
                and key.startswith("date_")
                and isinstance(value, str)
            ):
                yield json_value, key, value
            else:
                yield from find_all_dates(value)

    # Case 2: JSON arrays
    elif isinstance(json_value, list):
        for value in json_value:
            yield from find_all_dates(value)

    # Case 3: other JSON types
    elif isinstance(json_value, (str, int, float, bool)) or json_value is None:
        return

    # Case 4: handle unexpected types
    else:
        raise TypeError(f"Unexpected type: {type(json_value)}")


def date_matches_format(date_string: str, format: str) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    using `format`, False otherwise.
    """
    try:
        datetime.strptime(date_string, format)
        return True
    except ValueError:
        return False


def date_matches_any_format(date_string: str, formats: tuple[str, ...]) -> bool:
    """
    Returns True if `date_string` can be parsed as a datetime
    with any of the `formats`, False otherwise.
    """
    return any(
        date_matches_format(date_string, fmt)
        for fmt in formats
    )


def test_all_timestamps_are_consistent():
    """
    All the timestamps in my JSON use a consistent format.

    See https://alexwlchan.net/2025/messy-dates-in-json/
    """
    # TODO: Write code for reading JSON_DATA

    allowed_formats = (
        # 2001-02-03T04:05:06+07:00
        "%Y-%m-%dT%H:%M:%S%z",
        #
        # 2001-02-03
        "%Y-%m-%d",
    )

    bad_date_strings = {
        date_string
        for _, _, date_string in find_all_dates(JSON_DATA)
        if not date_matches_any_format(date_string, allowed_formats)
    }

    assert bad_date_strings == set()

If you want to use this test, you’ll need to modify it to read your JSON data, and to specify the list of timestamp formats you accept.

Although I’ve since cleaned up the data so it has a more consistent structure, this test has remained as-is. I could write a more specific test that knows about my JSON schema and looks for timestamps in specific fields, but I quite like having a generic test that doesn’t need to change when I change my data model. That flexibility also makes it easy to copy between projects, and I’ve already reused this test almost unmodified in several of them.


Detecting AV1-encoded videos with Python

2025-11-09 07:13:56

In my previous post, I wrote about how I’ve saved some AV1-encoded videos that I can’t play on my iPhone. Eventually, I’ll upgrade to a new iPhone which supports AV1, but in the meantime, I want to convert all of those videos to an older codec. The problem is finding all the affected videos – I don’t want to wait until I want to watch a video before discovering it won’t play.

I already use pytest to run some checks on my media library: are all the files in the right place, is the metadata in the correct format, do I have any misspelt tags, and so on. I wanted to write a new test that would check for AV1-encoded videos, so I could find and convert them in bulk.

In this post, I’ll show you two ways to check if a video is encoded using AV1, and a test I wrote to find any such videos inside a given folder.

Table of contents

Getting the video codec with ffprobe

In my last post, I wrote an ffprobe command that prints some information about a video, including the codec. (ffprobe is a companion tool to the popular video converter FFmpeg.)

$ ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name,profile,level,bits_per_raw_sample \
    -of default=noprint_wrappers=1 "input.mp4"
codec_name=av1
profile=Main
level=8
bits_per_raw_sample=N/A

I can tweak this command to print just the codec name:

$ ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name \
    -of csv=print_section=0 "input.mp4"
av1

To run this command from Python, I call the check_output function from the subprocess module. This checks the command completes successfully, then returns the output as a string. I can check if the output is the string av1:

import subprocess


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    output = subprocess.check_output([
        "ffprobe",
        #
        # Set the logging level
        "-loglevel", "error",
        #
        # Select the first video stream
        "-select_streams", "v:0",
        #
        # Print the codec_name (e.g. av1)
        "-show_entries", "stream=codec_name",
        #
        # Print just the value
        "-output_format", "csv=print_section=0",
        #
        # Name of the video to check
        path
    ], text=True)

    return output.strip() == "av1"

Most of this function is defining the ffprobe command, which takes quite a few flags. Whenever I embed a shell command in another program, I always replace any flags/arguments with the long versions, and explain their purpose in a comment – for example, I’ve replaced -of with -output_format. Short flags are convenient when I’m typing something by hand, but long flags are more readable when I return to this code later.

This function works, but the ffprobe command is quite long, and it requires spawning a new process for each video I want to check. Is there a faster way?

Getting the video codec with MediaInfo

While working at the Flickr Foundation, I discovered MediaInfo, another tool for analysing video files. It’s used in Data Lifeboat to get the dimensions and duration of videos.

You can run MediaInfo as a command-line program to get the video codec:

$ mediainfo --Inform="Video;%Format%" "input.mp4"
AV1

This is a simpler command than ffprobe, but I’d still be spawning a new process if I called this from subprocess.

Fortunately, MediaInfo is also available as a library, and it has a Python wrapper. Install the wrapper with pip install pymediainfo, then you can use MediaInfo’s functionality inside your Python process:

>>> from pymediainfo import MediaInfo
>>> media_info = MediaInfo.parse("input.mp4")
>>> media_info.video_tracks[0].codec_id
'av01'

This code could throw an IndexError if there’s no video track – if it’s a .mp4 file which only has audio data – but that’s pretty unusual, and not something I’ve found in any of my videos.

I can write a new wrapper function:

from pymediainfo import MediaInfo


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    media_info = MediaInfo.parse(path)

    return media_info.video_tracks[0].codec_id == "av01"

This is shorter than the ffprobe code, and faster too – testing locally, this is about 3.5× faster than spawning an ffprobe process per file.

Writing a test to find videos with the AV1 codec

Now we have a function that tells us if a given video uses AV1, we want a test that checks if there are any matching files. This is what I wrote:

import glob


def test_no_videos_are_av1():
    """
    No videos are encoded in AV1 (which doesn't play on my iPhone).

    This test can be removed when I upgrade all my devices to ones with
    hardware AV1 decoding support.

    See https://alexwlchan.net/2025/av1-on-my-iphone/
    """
    av1_videos = {
        p
        for p in glob.glob("**/*.mp4", recursive=True)
        if is_av1_video(p)
    }

    assert av1_videos == set()

It uses the glob module to find .mp4 video files anywhere in the current folder, and then filters for files which use the AV1 codec. The recursive=True argument is important: it lets the ** pattern match files in subdirectories, not just the top-level folder.

I’m only looking for .mp4 files because that’s the only format I use for videos, but you might want to search for .mkv or .webm too. If I was doing that, I might drop glob and use my snippet for walking a file tree instead.

The test builds a set of all the AV1 videos, then checks that it’s empty. This means that if the test fails, I can see all the affected videos at once. If the test failed on the first AV1 video, I’d only know about one video at a time, which would slow me down.

Putting it all together

You can use ffprobe or MediaInfo – I prefer MediaInfo because it’s faster and I already have it installed, but both approaches are fine.

Here’s my final test, which uses MediaInfo to check if a video uses AV1, and scans a folder using glob. I’ve saved it as test_no_av1_videos.py:

import glob

from pymediainfo import MediaInfo


def is_av1_video(path: str) -> bool:
    """
    Returns True if a video is encoded with AV1, False otherwise.
    """
    media_info = MediaInfo.parse(path)

    return media_info.video_tracks[0].codec_id == "av01"


def test_no_videos_are_av1():
    """
    No videos are encoded in AV1 (which doesn't play on my iPhone).

    This test can be removed when I upgrade all my devices to ones with
    hardware AV1 decoding support.

    See https://alexwlchan.net/2025/av1-on-my-iphone/
    """
    av1_videos = {
        p
        for p in glob.glob("**/*.mp4", recursive=True)
        if is_av1_video(p)
    }

    assert av1_videos == set()

In one folder with 350 videos, this takes about 8 seconds to run. I could make that faster by reading the video files in parallel, or caching the results, but it’s fast enough for now.
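If I did want the parallel version, it might look something like this — a sketch using ThreadPoolExecutor, with the codec check injected as a parameter (the pool size of 8 is an arbitrary choice):

```python
from concurrent.futures import ThreadPoolExecutor


def find_av1_videos(paths, is_av1, max_workers=8):
    """
    Run the codec check over a list of video paths in parallel,
    and return the set of paths for which is_av1() returns True.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = pool.map(is_av1, paths)

    return {path for path, flagged in zip(paths, flags) if flagged}
```

Threads work well here because the time is spent in MediaInfo’s native code and file I/O, not in Python bytecode.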

When I buy a new device with hardware AV1 decoding, I’ll delete this test. Until then, it’s a quick and easy way to find and re-encode any videos that won’t play on my iPhone.

[If the formatting of this post looks odd in your feed reader, visit the original article]

Why can’t my iPhone play that video?

2025-10-28 17:59:45

I download a lot of videos, but recently I discovered that some of those videos won’t play on my iPhone. If I try to open the videos or embed them in a webpage, I get a broken video player:

A black box with a play triangle in the middle, and a line striking it out.

These same videos play fine on my Mac – it’s just my iPhone that has issues. The answer involves the AV1 video codec, Apple’s chips, and several new web APIs I learnt along the way.


My iPhone is too old for the AV1 video codec

Doing some research online gave me the answer quickly: the broken videos use the AV1 codec, which isn’t supported on my iPhone. AV1 is a modern video codec that’s designed to be very efficient and royalty-free, but it has only recently been supported on Apple devices.

I have an iPhone 13 mini with an A15 processor. My iPhone doesn’t have hardware decoding support for AV1 videos – that only came with the iPhone 15 Pro and the A17 Pro. This support was included in all subsequent chips, including the M4 Pro in my Mac Mini.

It’s theoretically possible for Apple to decode AV1 in software, but they haven’t. According to Roger Pantos, who works on Apple’s media streaming team, there are no plans to provide software decoding for AV1 video. This means that if your chip doesn’t have this support, you’re out of luck.

I wanted to see if I could have worked this out myself. I couldn’t find any references or documentation for Apple’s video codec support – so failing that, is there some query or check I could run on my devices?

Checking compatibility with web APIs

I’ve found a couple of APIs that tell me whether my browser can play a particular video. These APIs are used by video streaming sites to make sure they send you the correct video. For example, YouTube can work out that my iPhone doesn’t support AV1 video, so they’ll stream me a video that uses a different codec.

Getting the MIME type and bitrate for the video

The APIs I found require a MIME type and bitrate for the video.

A MIME type can be something simple like video/mp4 or image/jpeg, but it can also include information about the video codec. The codec string for AV1 is quite complicated, and includes many parts. If you want more detail, read the AV1 codecs docs on MDN or the Codecs Parameter String section of the AV1 spec.

We can get the key information about the unplayable video using ffprobe:

ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name,profile,level,bits_per_raw_sample \
    -of default=noprint_wrappers=1 "input.mp4"
# codec_name=av1
# profile=Main
# level=8
# bits_per_raw_sample=N/A

The AV1 codec template is av01.P.LLT.DD, which we construct as follows:

  • P = the profile number, and “Main” means 0
  • LL = a two-digit level number, so 08
  • T = the tier indicator, which can be Main (M) or High (H). I think the “High” tier is for professional workflows, so let’s assume my video is Main, or M.
  • DD = the two-digit bit depth, so 08.

This gives us the MIME type for the unplayable video:

video/mp4; codecs=av01.0.08M.08
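As an illustration, here’s how you might assemble that codec string in Python from the ffprobe fields. The function name and the profile mapping are my own choices; the three profile names are the ones defined by the AV1 spec:

```python
def av1_codec_string(profile, level, tier, bit_depth):
    """
    Build an AV1 codec string (av01.P.LLT.DD) from the values
    reported by ffprobe.
    """
    # Profile names and their numbers, as defined in the AV1 spec
    profiles = {"Main": 0, "High": 1, "Professional": 2}

    return f"av01.{profiles[profile]}.{level:02d}{tier}.{bit_depth:02d}"
```

Calling av1_codec_string("Main", 8, "M", 8) reproduces the string above.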

I also got the MIME type for an H.264 video which does play on my iPhone:

video/mp4; codecs=avc1.640028

By swapping out the argument to -show_entries, we can also use ffprobe to get the resolution, frame rate, and bit rate:

ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,bit_rate,r_frame_rate \
  -of default=noprint_wrappers=1:nokey=0 "input.mp4"
# width=1920
# height=1080
# r_frame_rate=24000/1001
# bit_rate=1088190

Now we have this information, let’s pass it to some browser APIs.

HTMLMediaElement: canPlayType()

The video.canPlayType() method on HTMLMediaElement takes a MIME type, and tells you whether a browser is likely able to play that media. Note the word “likely”: the possible responses are “probably”, “maybe”, and the empty string (which means “no”).

Here’s an example using the MIME type of my AV1 video:

const video = document.createElement("video");
video.canPlayType("video/mp4; codecs=av01.0.08M.08");

Let’s run this with a few different values, and compare the results:

MIME type                                         iPhone         Mac
AV1 video       (video/mp4; codecs=av01.0.08M.08) "" (= “no”)    "probably"
H.264 video     (video/mp4; codecs=avc1.640028)   "probably"     "probably"
Generic MP4     (video/mp4)                       "maybe"        "maybe"
Made-up format  (video/mp4000)                    "" (= “no”)    "" (= “no”)

This confirms the issue: my iPhone can’t play AV1 videos, while my Mac can.

The generic MP4 is a clue about why this API returns a “likely” result, not something more certain. The MIME type doesn’t contain enough information about whether a video will be playable.

MediaCapabilities: decodingInfo()

For a more nuanced answer, we can use the decodingInfo() method in the MediaCapabilities API. You pass detailed information about the video, including the MIME type and resolution, and it tells you whether the video can be played – and more than that, whether the video can be played in a smooth and power-efficient way.

Here’s an example of how you use it:

await navigator.mediaCapabilities.decodingInfo({
  type: "file",
  video: {
    contentType: "video/mp4; codecs=av01.0.08M.08",
    width:     1920,
    height:    1080,
    bitrate:   1088190,
    framerate: 24
  }
});
// {powerEfficient: false,
//  smooth: false,
//  supported: false,
//  supportedConfiguration: Object}

Let’s try this with two videos:

AV1 video:

  {
    contentType: "video/mp4; codecs=av01.0.08M.08",
    width:     1920,
    height:    1080,
    bitrate:   1088190,
    framerate: 24
  }

  iPhone: not supported
  Mac:    supported, smooth, power efficient

H.264 video:

  {
    contentType: "video/mp4; codecs=avc1.640028",
    width:     1440,
    height:    1080,
    bitrate:   1660976,
    framerate: 24
  }

  iPhone: supported, smooth, power efficient
  Mac:    supported, smooth, power efficient

This re-confirms our theory that my iPhone’s lack of AV1 support is the issue.

It’s worth noting that this is still a heuristic, not a guarantee. I plugged some really large numbers into this API, and my iPhone claims it could play a trillion-pixel H.264 encoded video in a smooth and power efficient way. I know Apple’s hardware is good, but it’s not that good.

What am I going to do about this?

This is only an issue because I have a single video file, it’s encoded with AV1, and I have a slightly older iPhone. Commercial streaming services like YouTube, Vimeo and TikTok don’t have this problem because they store videos with multiple codecs, and use browser APIs to determine the right version to send you.

Apple would like me to buy a new iPhone, but that’s overkill for this problem. That will happen eventually, but not today.

In the meantime, I’m going to convert any AV1 encoded videos to a codec that my iPhone can play, and change my downloader script to do the same to any future downloads. Before I understood the problem, I was playing whack-a-mole with broken videos. Now I know that AV1 encoding is the issue, I can find and fix all of these videos in one go.


Doing my own syntax highlighting (finally)

2025-10-22 20:55:42

I had syntax highlighting in the very first version of this blog, and I never really thought about that decision. I was writing a programming blog, and I was including snippets of code, so obviously I should have syntax highlighting.

Over the next thirteen years, I tweaked and refined the rest of the design, but I never touched the syntax highlighting. I’ve been applying a rainbow wash of colours that somebody else chose, because I didn’t have any better ideas.

This week I read Nikita Prokopov’s article Everyone is getting syntax highlighting wrong, which advocates for a more restrained approach. Rather than giving everything a colour, he suggests colouring just a few key elements, like strings, comments, and variable definitions. I don’t know if that would work for everybody, but I like the idea, and it gave me the push to try something new.

It’s time to give code snippets the same care I’ve given the rest of this site.


What have I changed?

I’ve stripped back the syntax highlighting to a few key rules:

  • Comments in red
  • Strings in green
  • Numbers, booleans, and other constants in magenta
  • Variable definitions in blue
  • Punctuation in grey

Everything else is the default black/white.

This is similar to Nikita’s colour scheme “Alabaster”, but I chose my own colours to match my site palette. I’m also making my own choices about how to interpret these rules, because real code doesn’t always fall into neat buckets.

Here’s a snippet of Rust code with the old syntax highlighting:

Screenshot of some Rust code. Most of the code is coloured either blue, red, or green, with some bold and italicised text. The one comment is in a dull bluish-grey.

Here’s the same code in my new design:

Screenshot of the same Rust code. Now the code is mostly black, with blue highlights for the import/function/variable name, strings in green, and the comment is bright red.

Naturally, these code blocks work in both light and dark mode.

The new design is cleaner, it fits in better with the rest of the site, and I really like it. Some of that will be novelty and the IKEA effect, but I see other benefits to this simplified palette.

How does it work?

Converting code to HTML

I use Rouge as my syntax highlighter. I give it a chunk of code and specify the language, and it parses the code into a sequence of tokens – like operators, variables, or constants. Rouge returns a blob of HTML, with each token wrapped in a <span> that describes the token type.

For example, if I ask Rouge to highlight the Python snippet:

print("hello world")

it produces this HTML:

<span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">hello world</span><span class="sh">"</span><span class="p">)</span>

The first token is print, which is a function name (Name.Function, or class="nf"). The ( and ) are punctuation ("p") and the string is split into quotation marks (sh for String.Heredoc) and the text (s for String). You can see all the tokens in this short example, and all the possible token types are listed in the Pygments docs. (Pygments is another syntax highlighting library, but Rouge uses the same classification system.)

Each token has a different class in the HTML, so I can style tokens with CSS. For example, if I want all function names to be blue, I can target the "nf" class:

.nf { color: blue; }

I wrap the entire block in a <pre> tag with a language class, like <pre class="language-go">, so I can also apply per-language styles if I want.

Separating variable/function definitions and usage

I want to highlight variables when they’re defined, not every time they’re used. This gives you an overview of the structure, without drowning the code in blue.

This is tricky with Rouge, because it has no semantic understanding of the code – it only knows what each token is, not how it’s being used. In the example above, it knows that print is the name of a function, but it doesn’t know if the function is being called or being defined.

I could use something smarter, like the language servers used by modern IDEs, but that’s a lot of extra complexity. It might not even work – many of my code snippets are fragments, not complete programs, and wouldn’t parse cleanly.

Instead, I’m manually annotating my code snippets to mark definitions. I wrote a Jekyll plugin that reads those annotations, and modifies the HTML from Rouge to add the necessary highlights. It’s extra work, but I already spend a lot of time trying to pick the right snippet of code to make my point. These annotations are quick and easy, and it’s worth it for a clearer result.
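I won’t reproduce the plugin here (it’s Ruby, and specific to my Jekyll setup), but the core idea is simple enough to sketch in Python: given the HTML from Rouge and the set of names annotated as definitions, add an extra class to the first matching name token. The "def" class name is made up for this example:

```python
def mark_definitions(html, definition_names):
    """
    Add a 'def' class to the first occurrence of each annotated
    name token in Rouge's HTML output.
    """
    for name in definition_names:
        html = html.replace(
            f'<span class="nf">{name}</span>',
            f'<span class="nf def">{name}</span>',
            1,  # only mark the first occurrence, i.e. the definition
        )
    return html
```

A CSS rule like .nf.def { color: blue; } would then colour just the definitions, not every later use.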

Older posts don’t have these annotations, so they won’t get the full benefit of the new colour scheme, but I’m gradually updating them.

What do I like about this new design?

It’s pushed me to think more about syntax highlighting

Now that I’m not using somebody else’s rules, I’m paying more attention to how my code looks. I’m thinking more carefully about how my rules should apply. I’m noticing when colours feel confusing or unclear, and finding small ways to tweak them to improve clarity.

For example, “variable definitions in blue” sounds pretty clear cut, but does that include imports? Function parameters? What about HTML or CSS, where variables aren’t really a thing? What parts of the code do I think are important and worth highlighting?

I could have asked these questions at any time, but changing my syntax highlighting gave me the push to actually do it.

It makes comments more prominent

In my first programming job, I worked in a codebase with extensive comments. They were a good starting point in unfamiliar code, with lots of context, explanation, and narrative. The company’s default IDE showed comments in bright blue, and looking back, I’m sure that colour choice encouraged the culture of detailed documentation.

I realise now how unusual that was, but at the time it was my only experience of professional software development. I carried that habit of writing comments into subsequent jobs, but I’d forgotten the colour scheme. Now, I’m finally reviving that good idea.

Comments are bright red in my new theme – not the subdued grey used by so many other themes. The pop of colour makes comments easier to spot and more inviting to read. I’ve also ported this style to my IDE, and now when I write comments, I don’t feel like my words are disappearing.

It might be easier to read

I was inspired to make this change by reading Nikita Prokopov’s article, which argues for a minimal colour scheme – but not everyone agrees.

Syntax highlighting is mostly a matter of taste. Some programmers like a clean, light theme, others prefer high-contrast dark themes with bold colours. There’s lots of research into how we read code, but as far as I know, there’s no strong evidence or consensus in favour of any particular approach.

Whatever your taste, I think code is easier to read in a colour scheme you’re already familiar with. You know what the colours mean, so your brain doesn’t have to learn anything. A new scheme might grow on you over time, but at first, it’s more likely to be distracting than helpful.

That’s a problem for a blog like this. Most readers find a single post through search, read something useful, and never return. They’re not reading enough code here to learn my colour scheme, and unfamiliar colours are just noise.

With that in mind, I think a minimal palette works better. My posts only contain short snippets of code – enough to make a point, but not full files or complex programs. When you’re only reading a small amount of code, it’s more useful to highlight key elements than wash everything in colour.

Better late than never

I’ve wanted to improve my code snippets for a long time, but it always felt overwhelming. I’m used to colour themes which use a large palette, and I don’t have the design skills to choose that many colours. It wasn’t until I thought about using a smaller palette that this felt doable.

I picked my new colours just a few days ago, and already the old design feels stale and tired. I’d planned to spend more time tinkering before I make it live, but it’s such an improvement I want to use it immediately.

I love having a little corner of the web which is my own personal sandbox. Thirteen years in, and I’m still finding new ways to make myself smile.
