2025-12-20 06:57:16
I was creating a new S3 bucket today, and I had an idea – what if I add a README?
Browsing a list of S3 buckets is often an exercise in code archeology. Although people try to pick meaningful names, it’s easy for context to be forgotten and the purpose lost to time. Looking inside the bucket may not be helpful either, if all you see is binary objects in an unknown format named using UUIDs. A sentence or two of prose could really help a future reader.
We manage our infrastructure with Terraform and the Terraform AWS provider can upload objects to S3, so I only need to add a single resource:
resource "aws_s3_bucket" "example" {
bucket = "alexwlchan-readme-example"
}
resource "aws_s3_object" "readme" {
bucket = aws_s3_bucket.example.id
key = "README.txt"
content = <<EOF
This bucket stores log files for the Widget Wrangler Service.
These log files are anonymised and expire after 30 days.
Docs: http://internal-wiki.example.com/widget-logs
Contact: [email protected]
EOF
content_type = "text/plain"
}
Now when the bucket is created, it comes with its own explanation. When you open the bucket in the S3 console, the README appears as a regular object in the list of files.
This is a toy example, but a real README needn’t be much longer.
This doesn’t replace longer documentation elsewhere, but it can be a useful pointer in the right direction.
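If you’d rather check a bucket’s README from code than from the console, here’s a minimal sketch using boto3, with the bucket name from the example above:

import boto3

# Fetch and print the README for the example bucket.
s3 = boto3.client("s3")

response = s3.get_object(
    Bucket="alexwlchan-readme-example",
    Key="README.txt",
)
print(response["Body"].read().decode("utf-8"))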
It’s a quick and easy way to help the future sysadmin who’s trying to understand an account full of half-forgotten S3 buckets, and my only regret is that I didn’t think to use aws_s3_object this way sooner.
[If the formatting of this post looks odd in your feed reader, visit the original article]
2025-12-17 17:48:51
A while ago I was looking for a palm tree emoji, and the macOS Character Viewer suggested a variety of other characters I didn’t recognise:

Some of the curves look a bit like Hebrew, but it’s definitely not that alphabet. I clicked on the first character (𐡱) and learnt that it’s Palmyrene Letter Pe, which is from the Palmyrene alphabet. I’d never heard of Palmyrene, so I knew I was about to learn something.
These letters are part of the Palmyrene Unicode block, a set of 32 code points for the Palmyrene alphabet and digits. One of the cool things about Unicode is that the proposals for new characters are publicly available on the Unicode Consortium website, and they’re usually pretty readable.
Proposals have to provide some background on the characters they’re proposing. Here’s the introduction from the original proposal in 2010:
The Palmyrene alphabet was used from the first century BCE, in a small independent state established near the Red Sea, north of the Syrian desert between Damascus and the Euphrates. The alphabet was derived as a national script by modification of the customary forms of cursive Aramaic, which themselves developed during the first Persian Empire.
Palmyrene is known from documents distributed over a period from the year 9 BCE until 273 CE, the date of the sack of Palmyra by Aurelian. […] No documents on perishable materials have survived; there are a few painted inscriptions, but many inscriptions on stone.
Here’s an example of a funerary stone inscribed with Palmyrene script, whose shapes match the Unicode characters I didn’t recognise:

The proposal was written by Michael Everson, a prolific contributor who’s submitted hundreds of proposals to add characters to Unicode. His Wikipedia article lists over seventy scripts. He was profiled by the New York Times in 2003 – seven years before proposing Palmyrene – which described his work and his “crucial role in developing Unicode”.
He takes a very long view of his work. Normally I’m sceptical of claims about the longevity of digital work, but Unicode is a rare area where I think it might just last:
“There’s satisfaction in knowing that the work of analyzing and encoding these languages, once done, will never need to be done again,” [Everson] said. “This will be used for the next thousand years.”
And I liked this part at the end:
He likes to tell about how he met the president of the Tibetan Calligraphy Society at a Unicode meeting in Copenhagen. Mr. Everson had helped the organization ensure that Tibetan was included in the standard. The president showed Mr. Everson how to write his name in Tibetan with a highlighter pen.
“He thanked me,” Mr. Everson said with reverence. “I couldn’t believe that, because his organization has been in existence for over a thousand years.”
I spent eight years working in cultural heritage and thinking about the longevity of digital collections, but I never gave much thought to the history or encoding of writing. This is cool and important work, and I should learn more about it.
Palmyrene has 22 letters in its alphabet, which expands to 32 Unicode codepoints when you include alternative letters, numbers, and a pair of symbols.
The only letter I recognise is aleph (𐡠), which looks similar to the Hebrew letter aleph ℵ. I know the latter because it’s used by mathematicians to describe the size of infinite sets. It turns out aleph (or alef) is the name of letters in a variety of languages, not all of which look the same – including Phoenician (𐤀), Syriac (ܐ), and Nabataean (𐢁/𐢀).
The other letters have names which are new to me, like heth (𐡧), samekh (𐡯), and gimel (𐡢).
One especially interesting letter is nun, which appears differently depending on whether it’s in the middle of the word (𐡮) or the end (𐡭). This reminds me of the ancient Greek letter sigma, which is either σ or ς. I can’t help but see a passing resemblance between final nun and final sigma, but surely it’s a coincidence – the rest of the alphabets are so different.
Some of the Palmyrene numbers look similar to the Arabic numerals we use today, but they don’t have the same values. One, two, three and four are regular tally marks (𐡹, 𐡺, 𐡻, 𐡼). The more unusual characters are five (𐡽), ten (𐡾), and twenty (𐡿) – but again, it’s surely a coincidence that the last of these resembles the modern digit 3.
Alongside the letters and numbers, there are two decorative symbols for left/right fleurons (𐡷/𐡸).
Palmyrene is written horizontally from right-to-left, which introduced some new challenges while writing this blog post.
The first issue was in my text editor, which is fairly old and doesn’t have good right-to-left support.
I can include Palmyrene characters directly in my text, but it messes up the ordering and text selection.
I can navigate the text with the arrow keys, but it behaves in weird ways.
To get round this, I used HTML entities in all my source code, writing numeric character references rather than literal characters like 𐡠.
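As a small illustration (this is the general trick, not my actual tooling), you can compute the character reference for any character from its code point – U+10860 is Palmyrene aleph:

def html_entity(ch: str) -> str:
    """Return the hexadecimal numeric character reference for a character."""
    return f"&#x{ord(ch):X};"

print(html_entity("\U00010860"))  # Palmyrene aleph → prints &#x10860;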
The second issue was in the rendered HTML page, where the Unicode characters affect the ordering on the page. In particular, I wanted to show the characters for 1, 2, 3, 4, in that order, so I wrote the four entities – but the browser uses a bidirectional algorithm and renders the sequence of characters as right-to-left. That’s the opposite of what I wanted:
| HTML: | 𐡹, 𐡺, 𐡻, 𐡼 |
|---|---|
| Output: | 𐡹, 𐡺, 𐡻, 𐡼 |
The fix was to wrap each character in the bidirectional isolate <bdi> element.
This tells the browser to isolate the direction of the text within that element, so the direction of each character doesn’t affect the overall sequence.
This gave me what I wanted:
| HTML: | <bdi>𐡹</bdi>, <bdi>𐡺</bdi>, <bdi>𐡻</bdi>, <bdi>𐡼</bdi> |
|---|---|
| Output: | 𐡹, 𐡺, 𐡻, 𐡼 |
This is the first time the <bdi> element has appeared on this blog, and I think it’s the first time I’ve used it anywhere.
I took the original screenshot in September. It took me three months to dig into the detail, and I’m glad I did. This is a corner of history and writing that I’d never heard of, and even now I’ve only scratched the surface.
The Palmyrene alphabet is an example of what I call a “fractally interesting” topic. However deep you dig, however much you learn, there’s always more to uncover.
[If the formatting of this post looks odd in your feed reader, visit the original article]
2025-12-12 18:30:02
I’ve been building a scrapbook of social media, a place where I can save posts and conversations that I want to remember. It has a nice web-based interface for browsing, and a carefully-designed data model that should scale as I add more platforms and more years of my life. As I see new things I want to remember, it’s easy to save them in my scrapbook.
But what about everything I’d saved before?
Across various disks, I’d accumulated over 150,000 posts from Twitter, Tumblr, and other platforms. These sites were important to me and I didn’t want to lose those memories, so I kept trying to back them up – but those snapshots had more enthusiasm than organisation. They were chaotic, devoid of context, and difficult to search – but the data was there.
After so many failed attempts, my scrapbook finally feels sustainable. It has a robust data model and a simple tech stack that I hope will last a long time. I wanted to bring in my older backups, but importing everything wholesale would just reproduce the problems of the past. I’d be polluting a clean space with a decade of disordered history. It was finally time to do some curation.
I went through each post, one-by-one, and asked: Is this worth keeping? Do I want this in the story of my life? Is this best left in the past?
That’s why I started looking back over fifteen years of a life lived online, which became an unexpectedly emotional meeting with my younger self.
One thing I’d forgotten is how much I learnt from being online, especially in fannish spaces. My timeline taught me about feminism and consent; about disability and the barriers of the built world; about racism in a way that went far deeper than anything I’d encountered before. I could learn about issues directly from the people who faced them, not filtered through a journalist’s lens. Today I take that social awareness for granted, but the Internet is where it started.
Social media was a crash course in humanity – broader, richer, and more diverse than anything I got from formal education.
Once I learned to shut up and just listen, Twitter let me follow conversations between people whose lives were nothing like mine. I got the answers to so many questions I’d never even known to ask, and I miss that. I stopped using Twitter after it was bought by Elon Musk, and I have yet to find another platform that replicates that passive, ambient learning.
More than anything else I saw online, queer culture has shaped my life. I’m queer, my partner is queer, and so are most of my friends. There are so many people I’d never have met if social media hadn’t introduced me to this world.
When I was realising I was queer, it all felt very difficult and angsty. Looking back, I can see myself following a classic path – talking to queer people, being a loud and enthusiastic ally, then starting to realise there might be a reason I cared so much. I went through it once when I realised I wasn’t straight, and again a few years later when I realised I wasn’t cis.
My younger self was oblivious, but it’s all so obvious in hindsight. I cringe at some of those older posts, but they helped me become who I am today, and I want to keep them.
There are other posts I look back on with less fondness. I’m embarrassed by how annoying I was when I was younger. I spent too much time on self-indulgent moralising and pointless arguments, often with people I probably agreed with on almost everything else. I wanted to be right more than I wanted to listen, and that got in the way of useful conversations.
Those arguments were worthless then and they’re worthless now. Deleting them was a relief.
Among my less admirable behaviour was the performative outrage toward the “main character” of the day – the unlucky person whose viral tweet had summoned thousands of replies explaining why they were a terrible person. Looking back, it was a symptom of misplaced familiarity. I was reading a stranger’s posts as if I knew them, projecting motives from scraps of context, and joining dogpiles to fit in with the crowd.
Despite ruffling a lot of feathers, I was only the main character once, and in a small corner of the tech community. It was still an unpleasant weekend, and I got off lightly compared to some of my friends – but I’ve never forgotten how quickly online attention can turn to anger and hostility.
Learning about parasocial relationships helped me behave better. I realised how often my reactions were shaped by a false sense of intimacy, and how easy it was to be cruel when I forgot there was a person behind the avatar. I shifted my attention towards friends rather than strangers, and when I did talk to people I didn’t know, I tried to be constructive instead of showing off.
When I joined Twitter, I admired people who were smart. Today, I look up to people who are kind. I’ve come to value generosity and empathy far more than cleverness and nitpicking.
Looking through old conversations, I see the ghosts of friendships and relationships I’ve since lost. Some of those could be recovered if either of us reached out; others are gone for good. A few people have even passed away. I don’t know where most of those friends ended up, but I hope life has been kind to them.
I’ve passed through so many spaces: the PyCon UK community; fandoms like the Marmfish and the Creampuffs; the trans elders who supported me during my transition; the small, loyal group of blog readers who always left thoughtful comments. Some I lost touch with while I was still on Twitter; others I left behind when I left Twitter altogether.
As my interests changed and I moved from one space to another, I often did a poor job of keeping up the friendships I already had. I’d pour my energy into chasing new connections in the spaces I’d just discovered, neglecting the people who had been there all along. That neglect is stark when I look at it over a decade-long span. It was sobering to realise how many more friends I might have today if I hadn’t taken so many past connections for granted.
I’ve tried to keep lingering traces of those friendships by saving my mentions as well as my own tweets. Here’s one I found that made me cry: “One of the things I miss the most from my pre-pandemic Twitter timeline is seeing @alexwlchan traveling on trains and taking train selfies”. I miss that culture too – selfies were such a source of joy and affirmation, especially in queer and trans spaces. I miss seeing pretty pictures of my friends, and sharing mine in return.
When Elon Musk bought Twitter, a lot of my remaining connections there were broken. Some friends went to other platforms; others left social media entirely. I was one of them! I still write here, but it’s a more professional, broadcast space – it’s not a back-and-forth conversation.
I miss the friendships I had, and the ones that might have been.
I started with 150,000 fragments, which I reduced to 4,000 conversations. A lot of it I was glad to forget, but there are gems I want to remember. I’m glad I’ve done this, and it reinforces my belief that social media is an important part of my life that I should preserve properly.
Curating these memories has made them feel smaller and more manageable. The mess of JSON files scattered across disks has been replaced by a meaningful, well-organised collection I can look back on with a smile.
My use of Tumblr fell away gradually, and I stopped tweeting when Elon Musk bought Twitter. I didn’t jump to another platform immediately, because I wanted to pause and reflect on what I wanted from social media. Currently, my social media usage is limited to linking to blog posts.
Looking back over my old posts has helped with those reflections. I’d like to think I’ve grown up a bit in the interim, and that I’d use it better if I made it a bigger part of my life again. A lot of good things started as conversations on social media, and I often wonder if I’m missing out. But I don’t miss the time sunk in pointless arguments, the performative anger, or the abuse from strangers.
I still don’t know what my future with social media will look like, but this project has me wondering. Until I decide, my scrapbook lets me see the best of what’s already been – the friendships, the joy, and the moments that mattered.
[If the formatting of this post looks odd in your feed reader, visit the original article]
2025-12-10 19:43:25
In my previous post, I described my social media scrapbook – a tiny, private archive where I save conversations that I care about.
The implementation is mine, but the ideas aren’t: cultural heritage institutions have been thinking about how to preserve social media for years. There’s decades of theory and practice behind digital preservation, but social media presents some unique challenges.
Institutional archiving has different constraints to individual collections – institutions serve a much wider audience, so their decisions need consistency and boundaries. My own scrapbook is tiny and personal, and comparing it alongside institutional efforts really highlights the differences and difficulties. It’s why I usually call it a “scrapbook”, not an “archive”: it’s informal and a bit chaotic, and that’s fine because it’s only for me.
In this post, I’ll explain what I see as the key issues facing institutional social media archiving: what can be saved, what resists preservation, and how context is so hard to keep.
Social media exists at a scale that’s hard to comprehend: billions of posts, with millions more being added each day.
This makes it difficult for anyone to choose what to preserve, because any one person can only know a tiny fragment of the whole. Making a choice inevitably introduces selection bias, and I’ve spoken to many people who’d like to avoid that bias by “collecting everything” – but that’s far beyond the capacity of any institution.
Since they can’t collect everything, institutions create rules – collection policies that define what’s in scope. These rules are meant to ensure consistency and fairness and to reduce individual bias, but they force archivists to draw boundaries in a medium that inherently resists them.
Social media isn’t a sequence of isolated posts; it’s a dense, interconnected graph. A single post only makes sense in context – the replies, the people, the topic du jour. How much of this context do you gather? How many hops out do you follow? Do you save the whole thread, every reply, every linked account? How do you prevent scope creep from sucking in everything?
My personal scrapbook is subjective and inconsistent, because the only audience is me. My “collection policy” is pure vibes – I save threads I think are interesting; I keep posts that I find moving; I prune replies that are embarrassing or unhelpful. If I’m inconsistent or I delete the wrong thing, nobody else is affected.
Institutions can’t be that casual. They need durable, defensible rules about where their collection starts and ends. On social media, where every post is context to a larger tangle of conversation, drawing that boundary is a major challenge.
Social media archiving efforts often concentrate on publicly available, long-lasting content, which excludes other types of material – even though they make up an ever-growing proportion of social media. Two major categories stand out: private content, such as posts from locked accounts, and ephemeral content, such as stories that are designed to disappear.
Collecting this material is difficult. Technically, it’s behind authentication walls or interfaces that most web archiving tools can’t reach. And even if you save it, can you share it? Ethically, archivists must be careful not to violate social norms or user expectations.
It isn’t impossible, and I’ve seen a handful of projects capture private and ephemeral media – for example, researchers analysing Instagram Stories and their use in political campaigns. These efforts rely on a patchwork of methods: accessing content through user logins, browser plugins, even taking screenshots. They tend to be small, targeted, and short-lived.
My scrapbook has a small amount of private content, mostly conversations between me and locked accounts on Twitter. I’m comfortable with that because I was part of those conversations, and it’s a private archive. I’m not sharing it with anybody else, so I don’t think my friends would begrudge me keeping a copy. I haven’t saved any ephemeral content.
Private and ephemeral posts have a different dynamic from public timelines. People can be more personal, vulnerable, and candid when they know their posts can’t be seen by just anyone, or kept forever. Maybe those moments won’t appear in social media archives – but if so, we should acknowledge that limitation, and what stories it leaves out.
Social media is more than just posts, words, and images – it’s the experience. The interface, interaction design, and the algorithms that shape our feeds are rarely captured in archives.
For example, consider TikTok and the rise of vertical-swipe video. Because the next video is just a swipe away, creators structure their content to hook you immediately, and keep your attention throughout – a shift from the slower pace of older videos. If you only save the video file and not the swiping experience, it’s harder to understand why the creator made those choices.
Even more elusive is the “algorithm”, the black box that decides which posts appear in our timelines. These algorithms shape culture itself – amplifying some voices, suppressing others, deciding which ideas can spread – but their inner workings are deliberately opaque and impossible to archive. Their behaviour is a closely-guarded commercial secret.
A purely technical approach to preserving the experience is doomed to fail – but that doesn’t mean all is lost. We can document how these experiences shaped the flow of content: screenshots, screen recordings, detailed descriptions. Oral histories can give future audiences a sense of what it was like to exist in these digital ecosystems.
One of my favourite parts of any archive is the everyday. Often, something isn’t written down because it seems “obvious” at the time – but decades later, that knowledge has vanished. Social media is evolving quickly, and now is the time to capture these experiences. Future generations, looking back once the landscape settles, will want to understand the path that led there.
In the early 2000s, many platforms were far more supportive of digital preservation. Public APIs were common, scraping was largely tolerated, and some companies even collaborated with heritage institutions.
Twitter is the poster child of this sort of corporate endorsement. Their public API allowed a flourishing ecosystem of third-party clients and research projects; researchers could easily assemble datasets; the Library of Congress even attempted to preserve every public tweet between 2006 and 2017. The project stalled and remains largely inaccessible today – but it would never even get started in 2025.
Today, most platforms resist being preserved, archived, or downloaded en masse. APIs are restricted or paywalled, rate limits are strict, and scraping is aggressively blocked. The rise of generative AI has accelerated this trend, as companies realise their data is valuable for model training. Why give it away for free when you can ask for money?
Reddit is the most recent example. They blocked the Internet Archive after some AI companies used it to access posts for free – posts that Google pays Reddit millions to access.
Attempts to preserve content programmatically are increasingly limited, which makes it difficult to archive at scale. In my scrapbook, I replace APIs with entering data by hand, but that’s only practical if you’re saving a small amount of data.
A lot of web archiving has historically ignored consent. If something is on the public web, many archives consider it eligible for capture – but preserving a post means it’s preserved forever. Embarrassing thoughts or personal pictures can’t be deleted once they’ve been archived.
Not everyone would agree to their posts being permanently preserved, even if they use services like the Wayback Machine. We see this in the popularity of private accounts, closed forums, and ephemeral posts – people want control over how and when their posts are seen. Generative AI and the use of social media for model training has made people even more sensitive about their data.
The general public often ignores copyright and privacy – how many people use images they found online with no regard for the creator? – but institutions hold themselves to a higher standard.
A strict ethical stance would require explicit consent from every creator. Institutions often use donor agreements, where you allow them to keep your material and sign away the right to remove it afterwards – but that solution is hard to scale to social media, where a single conversation may involve dozens of people.
It would also mean losing huge amounts of historically valuable material. It would exclude orphaned accounts, abandoned platforms, and users who have died or lost their password. And web archives preserve content from companies, politicians, and public figures, helping keep them accountable – but these figures would rarely consent to archiving they don’t control.
One interesting approach is Bluesky’s proposal User Intents for Data Reuse, letting users declare how they want their posts to be reused, such as for AI training or archiving. Technology alone is not the solution – you also need enforcement – but this feels like a step in the right direction.
I like the idea of a balanced approach – collecting material from public figures is fair game; anything from private citizens needs explicit consent. Of course, that’s easier said than done, and it’s tricky to codify that as a well-defined rule – but to me, “anything publicly available” feels increasingly insufficient as an ethical guideline.
In my personal scrapbook, I don’t have a formal consent process – something I feel comfortable with because my archive is small, private, and only my own reference. My guiding rule is “don’t be a creep”. I don’t save anything I think the original author would be uncomfortable knowing I kept.
Consent is a preference, but legislation is a hard boundary. Digital collections are affected by a patchwork of laws – copyright, privacy, data protection rules like the right to erasure, and even content-related restrictions. Institutions must ensure their collections comply with all relevant laws, even if those obligations conflict with the goals of long-term preservation.
Social media archiving is especially tricky. Automated, bulk collection can easily capture illegal or sensitive content, and mistakes may go unnoticed.
That’s why I prefer a targeted, human-reviewed approach. It slows you down, but reading all the material allows archivists to catch potential issues before content becomes a liability.
An archive is useless if you can’t look at what you’ve saved. This is often a problem in social media archives: we can save posts at incredible speeds, but we can’t search them in any meaningful way.
Web archiving often saves page-by-page, one page per post, like the Wayback Machine. This scales beautifully for capture, but terribly for discovery. You can retrieve a post if you know its URL, but you can’t find everything about a single topic or written by a given author.
Traditional archives solve this with cataloguing: humans write descriptions, and researchers use those to find what they need. But that model can buckle if you try to save social media at scale: machines can save thousands of posts in the time it takes a human to describe just one.
In my personal scrapbook, I add keyword tags to every conversation. They’re fast, informal, and effective. If I want something specific, I can filter by tag and find it instantly. Since I’m the only person who uses these tags, I can define them in a way I like and change them when I decide. If I was in an institutional context, I’d use a controlled vocabulary like LCSH or MeSH.
These light-touch keywords feel like a realistic middle ground: human-scale data that’s quick to apply, but rich enough to cut through the fog.
Identity on social media is a hard problem. Many accounts are anonymous or pseudonymous, and most people have accounts scattered across multiple platforms. This makes it tricky to track somebody’s presence on social media, because there’s rarely a mapping between a person’s real-world identity and the accounts they use online. Often, this anonymity is intentional.
This ambiguity creates a big headache for support teams at social media companies. When somebody asks for help regaining access to an account because they’ve lost the password or been hacked, how can the platform be sure they’re the real owner? That question is even harder to answer if you’re outside the company.
Institutions and researchers care about identity because it provides context and authority: who wrote this, and how much can we trust their words? Social media makes this hard, because many usernames don’t tell you anything about the person behind them. Although institutions have tools to connect people across records, you need to know who the person is first!
My personal scrapbook sidesteps this complexity. Nearly all of the conversations it contains are with friends I know well, so I can easily connect their identities across different services.
Social media relies on shared knowledge: current events, in-jokes, and memes. Without this context, the meaning of a post can fade – or an entirely new meaning can take its place.
This isn’t a new problem – all human communication requires context – but social media takes it to eleven. The pace and brevity are a fertile breeding ground for memes whose origins disappear almost immediately. Log off for a day, and you’ll return to posts that make no sense at all. You missed the moment that sparked the meme. Imagine how much harder it is to understand if you arrive years – or decades – later.
You can try to fill in the gap with catalogue descriptions, but that’s only possible if somebody understands the references well enough to describe them. With social media’s scale and speed, it’s impossible for anybody to know all the jokes, memes, and ideas that might affect a post.
In my personal scrapbook, I rely on my memory to provide that context. I don’t write longer descriptions, and I don’t know how much I’ll remember. Some posts that made sense in 2020 may be baffling in 2030, others will still be crystal clear. Only one way to find out!
Perhaps we can’t preserve social media perfectly, but that doesn’t mean we shouldn’t try. Every archive ever assembled is incomplete, but they still have immense value. Capturing public posts, threads, or conversations – even if we lose some of the context or ephemeral content – helps preserve a record of cultural history that could otherwise be lost.
Social media archiving may be a new endeavour for large institutions, but it’s not a new idea. There are small, ad-hoc projects happening everywhere, and there’s lots of prior art to learn from. Just today I came across Posty, a tool for archiving a Mastodon account as a static site.
I’m always excited when I see people building tools to save tiny corners of the web – posts from a single account, fanworks from a tight-knit community, or shared advice from a community wiki. Whenever a platform disappears or looks shaky, there’s a renewed push to minimise the loss.
Social media archiving will never be perfect, but it’s possible, and I’m excited to see how institutions rise to the challenge.
[If the formatting of this post looks odd in your feed reader, visit the original article]
2025-12-08 17:46:34
I grew up alongside social media, as it was changing from nerd curiosity to mainstream culture. I joined Twitter and Tumblr in the early 2010s, and I stayed there for over a decade. Those spaces shaped my adult life: I met friends and partners, found a career in cultural heritage, and discovered my queer identity.
That impact will last a long time. The posts themselves? Not so much.
Social media is fragile, and it can disappear quickly. Sites get sold, shut down or blocked. People close their accounts or flee the Internet. Posts get deleted, censored or lost by platforms that don’t care about permanence. We live in an era of abundant technology and storage, but the everyday record of our lives is disappearing before our eyes.
I want to remember social media, and not just as a vague memory. I want to remember exactly what I read, what I saw, what I wrote. If I was born 50 years ago, I’m the sort of person who’d keep a scrapbook full of letters and postcards – physical traces of the people who mattered to me. Today, those traces are digital.
I don’t trust the Internet to remember for me, so I’ve built my own scrapbook of social media. It’s a place where I can save the posts that shaped me, delighted me, or just stuck in my mind.

It’s a static site where I can save conversations from different services, enjoy them in my web browser, and search them using my own tags. It’s less than two years old, but it already feels more permanent than many social media sites. This post is the first in a three-part series about preserving social media, based on both my professional and personal experience.
Before I ever heard the phrase “digital preservation”, I knew I wanted to keep my social media. I wrote scripts to capture my conversations and stash them away on storage I controlled.
Those scripts worked, technically, but the end result was a mess. I focused on saving data, and organisation and presentation were an afterthought. I was left with disordered folders full of JSON and XML files – archives I couldn’t actually use, let alone search or revisit with any joy.
I’ve tried to solve this problem more times than I can count. I have screenshots of at least a dozen different attempts, and there are probably just as many I’ve forgotten.
For the first time, though, I think I have a sustainable solution. I can store conversations, find them later, and the tech stack is simple enough to keep going for a long time. Saying something will last always has a whiff of hubris, especially if software is involved, but I have a good feeling.
Looking back, I realise my previous attempts failed because I focused too much on my tools. I kept thinking that if I just picked the right language, or found a better framework, or wrote cleaner code, I’d finally land on a permanent solution. The tools do matter – and a static site will easily outlive my hacky Python web apps – but other things are more important.
What I really needed was a good data model. Every earlier version started with a small schema that could hold simple conversations, which worked until I tried to save something more complex. Whenever that happened, I’d make a quick fix, thinking about the specific issue rather than the data model as a whole. Too many one-off changes and everything would become a tangled mess, which is usually when I’d start the next rewrite.
This time, I thought carefully about the shape of the data. What’s worth storing, and what’s the best way to store it? How do I clean, validate, and refine my data? How do I design a data schema that can evolve in a more coherent way? More than any language or framework choice, I think this is what will finally give this project some sticking power.
I store metadata in a machine-readable JSON/JavaScript file, and present it as a website that I can open in my browser. Static sites give me a lightweight, flexible way to save and view my data, in a format that’s widely supported and likely to remain usable for a long time.
This is a topic I’ve written about at length, including a detailed explanation of my code.
Within my scrapbook, the unit of storage is a conversation – a set of one or more posts that form a single thread. If I save one post in a conversation, I save them all. This is different to many other social media archives, which only save one post at a time.
The surrounding conversation is often essential to understanding a post. Without it, posts can be difficult to understand and interpret later. For example, a tweet where I said “that’s a great idea!” doesn’t make sense unless you know what I was replying to. Storing all the posts in a conversation together means I always have that context.
A big mistake I made in the past was trying to shoehorn every site into the same data model.
The consistency sounds appealing, but different sites are different. A tweet is a short fragment of plain text, sometimes with attached media. Tumblr posts are longer, with HTML and inline styles. On Flickr the photo is the star, with text-based metadata as a secondary concern.
It’s hard to create a single data model that can store a tweet and a Tumblr post and a Flickr picture and the dozen other sites I want to support. Trying to do so always led me to a reductive model that over-simplified the data.
For my scrapbook, I’m avoiding this problem by creating a different data model for each site I want to save. I can define the exact set of fields used by that site, and I can match the site’s terminology.
Here’s one example: a thread from Twitter, where I saved a tweet and one of the replies.
The site, id, and meta fields are common to the data model across all sites; the site-specific fields live in the body – in this example, the body is an array of tweets.
{
  "site": "twitter",
  "id": "1574527222374977559",
  "meta": {
    "tags": ["trans joy", "gender euphoria"],
    "date_saved": "2025-10-31T07:31:01Z",
    "url": "https://www.twitter.com/alexwlchan/status/1574527222374977559"
  },
  "body": [
    {
      "id": "1574527222374977559",
      "author": "alexwlchan",
      "text": "prepping for bed, I glanced in a mirror\n\nand i was struck by an overwhelming sense of feeling beautiful\n\njust from the angle of my face and the way my hair fell around over it\n\ni hope i never stop appreciating the sense of body confidence and comfort i got from Transition 🥰",
      "date_posted": "2022-09-26T22:31:57Z"
    },
    {
      "id": "1574527342470483970",
      "author": "oldenoughtosay",
      "text": "@alexwlchan you ARE beautiful!!",
      "date_posted": "2022-09-26T22:32:26Z",
      "entities": {
        "hashtags": [],
        "media": [],
        "urls": [],
        "user_mentions": ["alexwlchan"]
      },
      "in_reply_to": {
        "id": "1574527222374977559",
        "user": "alexwlchan"
      }
    }
  ]
}
If this was a conversation from a different site, say Tumblr or Instagram, you’d see something different in the body.
I store all the data as JSON, and I keep the data model small enough that I can fill it in by hand.
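As a rough sketch (this isn’t my actual code), the shape shared by every conversation looks something like this, with the per-site detail living in the body:

from dataclasses import dataclass

@dataclass
class Meta:
    tags: list[str]
    date_saved: str  # ISO 8601 timestamp
    url: str

@dataclass
class Conversation:
    site: str
    id: str
    meta: Meta
    body: list[dict]  # site-specific – for Twitter, a list of tweets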
I’ve been trying to preserve my social media for over a decade, so I have a good idea of what fields I look back on and what I don’t. For example, many social media websites have metrics – how many times a post was viewed, starred, or retweeted – but I don’t keep them. I remember posts because they were fun, thoughtful, or interesting, not because they hit a big number.
Writing my own data model means I know exactly when it changes. In previous tools, I only stored the raw API response I received from each site. That sounds nice – I’m saving as much information as I possibly can! – but APIs change and the model would subtly shift over time. The variation made searching tricky, and in practice I only looked at a small fraction of the saved data.
I try to reuse data structures where appropriate.
Conversations from every site have the same meta schema; conversations from microblogging services (Twitter, Mastodon, Bluesky, Threads) all share one model; and I have a common data structure for images and videos.
Each data model is accompanied by a rendering function, which reads the data and returns a snippet of HTML that appears in one of the “cards” in my web browser. I have a long switch statement that just picks the right rendering function, something like:
function renderConversation(props) {
  switch (props.site) {
    case 'flickr':
      return renderFlickrPicture(props);
    case 'twitter':
      return renderTwitterThread(props);
    case 'youtube':
      return renderYouTubeVideo(props);
    …
  }
}
This approach makes it easy for me to add support for new sites, without breaking anything I’ve already saved. It’s already scaled to twelve different sites (Twitter, Tumblr, Bluesky, Mastodon, Threads, Instagram, YouTube, Vimeo, TikTok, Flickr, Deviantart, Dribbble), and I’m going to add WhatsApp and email in future – which look and feel very different to public social media.
I also have a “generic media” data model, which is a catch-all for images and videos I’ve saved from elsewhere on the web. This lets me save something as a one-off from a blog or a forum without writing a whole new data model or rendering function.
I tag everything with keywords as I save it. If I’m looking for a conversation later, I think of what tags I would have used, and I can filter for them in the web app. These tags mean I can find old conversations, and they let me add my own interpretation to the posts I’m saving.
This is more reliable than full text search, because I can search a consistent set of terms. Social media posts don’t always mention their topic in a consistent, easy-to-find phrase – either because it just didn’t fit into the wording, or because they’re deliberately keeping it as subtext. For example, not all cat pictures include the word “cat”, but I tag them all with “cats” so I can find them later.
I use fuzzy string matching to find and fix mistyped tags.
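My actual script is a bit more involved, but the core of it is difflib from the standard library. Here’s a sketch, where the “common tag” threshold of five uses is arbitrary:

from collections import Counter
import difflib

def suspicious_tags(all_tags: list[str]) -> dict[str, list[str]]:
    """
    For each rarely-used tag, suggest close matches among the common
    tags – these are often typos, like "catss" for "cats".
    """
    counts = Counter(all_tags)
    common = [tag for tag, n in counts.items() if n >= 5]
    rare = [tag for tag, n in counts.items() if n < 5]

    suggestions = {}
    for tag in rare:
        matches = difflib.get_close_matches(tag, common, n=3, cutoff=0.8)
        if matches:
            suggestions[tag] = matches

    return suggestions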
Here’s a quick sketch of how my data and files are laid out on disk:
scrapbook/
├─ avatars/
├─ media/
│  ├─ a/
│  └─ b/
│     └─ bananas.jpg
├─ posts.js
└─ users.js
This metadata forms a little graph:

- All of my post data is in posts.js, which contains objects like the Twitter example above.
- Posts can refer to media files, which I store in the media/ directory and group by the first letter of their filename – this keeps the number of files in each subdirectory manageable.
- Posts point to their author in users.js.
- My user model is small – the path of an avatar image in avatars/, and maybe a display name if the site supports it.

Currently, users are split by site, and I can’t correlate users across sites. For example, I have no way to record that @alexwlchan on Twitter and @[email protected] on Mastodon are the same person. That’s something I’d like to do in future.
I have a test suite, written in Python and pytest, that checks the consistency and correctness of my metadata.
I’m doing a lot of manual editing of metadata, and these tests give me a safety net against mistakes. They’re pretty fast, so I run them every time I make a change.
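I haven’t shown the real tests here, but a representative check might look something like this – load_posts and load_users are hypothetical helpers that read posts.js and users.js:

def test_every_post_has_a_known_author():
    """
    Every author referenced in posts.js should have an entry in users.js.
    """
    users = load_users()  # hypothetical helper that reads users.js
    posts = load_posts()  # hypothetical helper that reads posts.js

    for post in posts:
        for entry in post["body"]:
            author = entry["author"]
            assert author in users[post["site"]], (
                f"unknown author {author!r} in {post['site']} post {post['id']}"
            )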
Pretty much every social media website has a way to export your data, but some exports are better than others. Some sites clearly offer it reluctantly – a zip archive full of JSON files, with minimal documentation or explanation. Enough to comply with data export laws, but nothing more.
Twitter’s archive was much better.
When you downloaded your archive, the first thing you’d see was an HTML file called Your archive.html.
Opening this would launch a static website where you could browse your data, including full-text search for your tweets:


This approach was a big inspiration for me, and put me on the path of using static websites for tiny archives. It’s a remarkably robust piece of engineering, and these archives will last long after Twitter or X have disappeared from the web.
The Twitter archive isn’t exactly what I want, because it only has my tweets. My favourite moments on Twitter were back-and-forth conversations, and my personal archive only contains my side of the conversation. In my custom scrapbook, I can capture both people’s contributions.
Data Lifeboat is a project by the Flickr Foundation to create archival slivers of Flickr. I worked at the Foundation for nearly two years, and I built the first prototypes of Data Lifeboat. I joined because of my interest in archiving social media, and the ideas flowed in both directions: personal experiments informed my work, and vice versa.
Data Lifeboat and my scrapbook differ in some details, but the underlying principles are the same.
One of my favourite parts of that work was pushing static websites for tiny archives further than I ever have before. Each Data Lifeboat package includes a viewer app for browsing the contents, which is a static website built in vanilla JavaScript – very similar to the Twitter archive. It’s the most complex static site I’ve ever built, so much so that I had to write a test suite using Playwright.
That experience made me more ambitious about what I can do with static, self-contained sites.
Earlier this year I wrote about my bookmarks collection, which I also store in a static site. My bookmarks are mostly long-form prose and video – reference material with private notes. The scrapbook is typically short-form content, often with visual media, often with conversations I was a part of. Both give me searchable, durable copies of things I don’t want to lose.
I built my own bookmarks site because I didn’t trust a bookmarking service to last; I built my social media scrapbook because I don’t trust social media platforms to stick around. They’re two different manifestations of the same idea.
Tapestry is an iPhone app that combines posts from multiple platforms into a single unified timeline – social media, RSS feeds, blogs. The app pulls in content using site-specific “connectors”, written with basic web technologies like JavaScript and JSON.

Although I don’t use Tapestry myself, I was struck by the design, especially the connectors. The idea that each site gets its own bit of logic is what inspired me to consider different data models for each site – and of course, I love the use of vanilla web tech.
When I embed social media posts on this site, I don’t use the native embeds offered by platforms, which pull in megabytes of JavaScript and tracking. Instead, I use lightweight HTML snippets styled with my own CSS, an idea I first saw on Dr Drang’s site over thirteen years ago.
The visual appearance of these snippets isn’t a perfect match for the original site, but they’re close enough to be usable. The CSS and HTML templates were a good starting point for my scrapbook.
I’ve spent a lot of time and effort on this project, and I had fun doing it, but you can build something similar with a fraction of the effort. There are lots of simpler ways to save an offline backup of an online page – a screenshot, a text file, a printout.
If there’s something online you care about and wouldn’t want to lose, save your own copy. The history of the Internet tells us that it will almost certainly disappear at some point.
The Internet forgets, but it doesn’t have to take your memories with it.
[If the formatting of this post looks odd in your feed reader, visit the original article]
2025-12-05 15:54:32
When I embed videos in web pages, I specify an aspect ratio. For example, if my video is 1920 × 1080 pixels, I’d write:
<video style="aspect-ratio: 1920 / 1080">
If I also set a width or a height, the browser now knows exactly how much space this video will take up on the page – even if it hasn’t loaded the video file yet. When it initially renders the page, it can leave the right gap, so it doesn’t need to rearrange when the video eventually loads. (The technical term is “reducing cumulative layout shift”.)
That’s the idea, anyway.
I noticed that some of my videos weren’t fitting in their allocated boxes. When the video file loaded, it could be too small and get letterboxed, or be too big and force the page to rearrange to fit. Clearly there was a bug in my code for computing aspect ratios, but what?
I opened one of the problematic videos in QuickTime Player, and the resolution listed in the Movie Inspector was rather curious: Resolution: 1920 × 1080 (1350 × 1080).
The first resolution is what my code was reporting, but the second resolution is what I actually saw when I played the video. Why are there two?
The storage aspect ratio (SAR) of a video is the pixel resolution of a raw frame. If you extract a single frame as a still image, that’s the size of the image you’d get. This is the first resolution shown by QuickTime Player, and it’s what I was reading in my code.
I was missing a key value – the pixel aspect ratio (PAR). This describes the shape of each pixel, in particular the width-to-height ratio. It tells a video player how to stretch or squash the stored pixels when it displays them. This can sometimes cause square pixels in the stored image to appear as rectangles.
This reminds me of EXIF orientation for still images – a transformation that the viewer applies to the stored data. If you don’t apply this transformation properly, your media will look wrong when you view it. I wasn’t accounting for the pixel aspect ratio in my code.
According to Google, the primary use case for non-square pixels is standard-definition television, which predates digital video. However, I’ve encountered several videos with an unusual PAR that were made long into the era of digital video, when that seems unlikely to be a consideration. It’s especially common in vertical videos like YouTube Shorts, where the stored resolution is a square 1080 × 1080 and the pixel aspect ratio turns it into a portrait.
I wonder if it’s being introduced by a processing step somewhere? I don’t understand why, but I don’t have to – I’m only displaying videos, not producing them.
The display aspect ratio (DAR) is the size of the video as viewed – what happens when you apply the pixel aspect ratio to the stored frames. This is the second resolution shown by QuickTime Player, and it’s the aspect ratio I should be using to preallocate space in my video player.
These three values are linked by a simple formula:
DAR = SAR × PAR
The size of the viewed video is the stored resolution times the shape of each pixel.
One video with a non-unit pixel aspect ratio is my download of Mars 2020 EDL Remastered. This video by Simeon Schmauß tries to match what the human eye would have seen during the landing of NASA’s Perseverance rover in 2021.
We can get the width, height, and sample aspect ratio (which is another name for pixel aspect ratio) using ffprobe:
$ ffprobe -v error \
    -select_streams v:0 \
    -show_entries stream=width,height,sample_aspect_ratio \
    "Mars 2020 EDL Remastered [HHhyznZ2u4E].mp4"
[STREAM]
width=1920
height=1080
sample_aspect_ratio=45:64
[/STREAM]
Here 1920 is the stored width, and 45:64 is the pixel aspect ratio.
We can multiply them together to get the display width: 1920 × 45 / 64 = 1350.
This matches what I saw in QuickTime Player.
Let’s extract a single frame using ffmpeg, to get the stored pixels. This command saves the 5000th frame as a PNG image:
$ ffmpeg -i "Mars 2020 EDL Remastered [HHhyznZ2u4E].mp4" \
-filter:v "select=eq(n\,5000)" \
-frames:v 1 \
frame.png
The image is 1920 × 1080 pixels, and it looks wrong: the circular parachute is visibly stretched.

Suppose we take that same image, but now apply the pixel aspect ratio. This is what the image is meant to look like, and it’s not a small difference – now the parachute actually looks like a circle.

Seeing both versions side-by-side makes the problem obvious: the stored frame isn’t how the video is displayed. The video player in my browser will play it correctly using the pixel aspect ratio, but my layout code wasn’t doing that. I was telling the browser the wrong aspect ratio, and the browser had to update the page when it loaded the video file.
This is my old function for getting the dimensions of a video file, which uses a Python wrapper around MediaInfo to extract the width and height fields. I now realise that this only gives me the storage aspect ratio, and may be misleading for some videos.
from pathlib import Path

from pymediainfo import MediaInfo


def get_storage_aspect_ratio(video_path: Path) -> tuple[int, int]:
    """
    Returns the storage aspect ratio of a video, as a width/height ratio.
    """
    media_info = MediaInfo.parse(video_path)

    try:
        video_track = next(
            tr
            for tr in media_info.tracks
            if tr.track_type == "Video"
        )
    except StopIteration:
        raise ValueError(f"No video track found in {video_path}")

    return video_track.width, video_track.height
I can’t find an easy way to extract the pixel aspect ratio using pymediainfo.
It does expose a Track.aspect_ratio property, but that’s a string which has a rounded value – for example, 45:64 becomes 0.703.
That’s close, but the rounding introduces a small inaccuracy.
Since I can get the complete value from ffprobe, that’s what I’m doing in my revised function.
The new function is longer, but it’s more accurate:
from fractions import Fraction
import json
from pathlib import Path
import subprocess


def get_display_aspect_ratio(video_path: Path) -> tuple[int, int]:
    """
    Returns the display aspect ratio of a video, as a width/height fraction.
    """
    cmd = [
        "ffprobe",
        #
        # verbosity level = error
        "-v", "error",
        #
        # only get information about the first video stream
        "-select_streams", "v:0",
        #
        # only gather the entries I'm interested in
        "-show_entries", "stream=width,height,sample_aspect_ratio",
        #
        # print output in JSON, which is easier to parse
        "-print_format", "json",
        #
        # input file
        str(video_path)
    ]

    output = subprocess.check_output(cmd)
    ffprobe_resp = json.loads(output)

    # The output will be structured something like:
    #
    #     {
    #       "streams": [
    #         {
    #           "width": 1920,
    #           "height": 1080,
    #           "sample_aspect_ratio": "45:64"
    #         }
    #       ],
    #       …
    #     }
    #
    # If the video doesn't specify a pixel aspect ratio, then it won't
    # have a `sample_aspect_ratio` key.
    video_stream = ffprobe_resp["streams"][0]

    try:
        pixel_aspect_ratio = Fraction(
            video_stream["sample_aspect_ratio"].replace(":", "/")
        )
    except KeyError:
        pixel_aspect_ratio = 1

    width = round(video_stream["width"] * pixel_aspect_ratio)
    height = video_stream["height"]

    return width, height
This is calling the ffprobe command I showed above, plus -print_format json to print the data in JSON, which is easier for Python to parse.
I have to account for the case where a video doesn’t set a sample aspect ratio – in that case, the displayed video just uses square pixels.
Since the aspect ratio is expressed as a ratio of two integers, this felt like a good chance to try the fractions module.
That avoids converting the ratio to a floating-point number, which potentially introduces inaccuracies.
It doesn’t make a big difference, but in my video collection treating the aspect ratio as a float produces results that are 1 or 2 pixels different from QuickTime Player.
When I multiply the stored width and aspect ratio, I’m using the round() function to round the final width to the nearest integer.
That’s more accurate than int(), which always rounds down.
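Here’s a quick sketch of how the result plugs into the aspect-ratio style from the start of this post, using the Mars video as an example:

from pathlib import Path

width, height = get_display_aspect_ratio(
    Path("Mars 2020 EDL Remastered [HHhyznZ2u4E].mp4")
)

# Tell the browser the display size, so it reserves the right amount
# of space before the video file loads.
print(f'<video style="aspect-ratio: {width} / {height}">')
# prints: <video style="aspect-ratio: 1350 / 1080">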
When you want to know how much space a video will take up on a web page, look at the display aspect ratio, not the stored pixel dimensions. Pixels can be squashed or stretched before display, and the stored width/height won’t tell you that.
Videos with non-square pixels are pretty rare, which is why I ignored this for so long. I’m glad I finally understand what’s going on.
After switching to ffprobe and using the display aspect ratio, my pre-allocated video boxes now match what the browser eventually renders – no more letterboxing, no more layout jumps.
[If the formatting of this post looks odd in your feed reader, visit the original article]