2025-08-14 03:46:03
Spoiler: I have data from the story in the title of this post. It's mostly what I expected it to be, I've just added it to HIBP, where I've called it "Data Troll", and I'm going to give everyone a lot more context below. Here goes:
Headlines one-upping each other on the number of passwords exposed in a data breach have become somewhat of a sport in recent years. Each new story wants to present a number that surpasses the previous story, and the clickbait cycle continues. You can see it coming a mile away, and you just know the reality is somewhat less than the headline, but how much less?
And so it was in June when a story with this title hit the headlines: 16 billion passwords exposed in record-breaking data breach. I thought this would be another standard run-of-the-mill sensational headline that would catch a few eyeballs for a couple of days then be forgotten, but no, apparently not. It started with a huge volume of interest in Have I Been Pwned:
That's Google searches for my "little" project, which I found odd, because we hadn't put any data in HIBP! But that initial story gained so much traction and entered the mainstream media to the extent that many publications directed people to HIBP, and inevitably, there was a bunch of searching done to figure out what the service actually was. And the news is still coming out - this story landed on AOL just last week:
You know it's serious because of all the red and exclamation marks... but per the article, "you don't need to panic" 🤷‍♂️
Enough speculating, let's get into what's actually in here, and for that, I went straight to the source:
Bob is a quality researcher who has been very successful over the years at sniffing out breached data, some of which had previously ended up in HIBP as a result of his good work. So we had a chat about this trove, and the first thing he made clear was that this isn't a single source of exposure, but rather different infostealer data sets that have been publicly exposed this year. The headlines implying this was a massive breach are misleading; stealer logs are produced from individually compromised machines and occasionally bundled up and redistributed. Bob also pointed out that many of the data sets were no longer exposed, and he didn't have a copy of all of them. But he did have a subset of the data he was happy to send over for HIBP, so let's analyse that.
All told, the data Bob sent contained 10 JSON files totalling 775GB across 2.7B rows. An initial cursory check against HIBP showed more than 90% of the email addresses were already in there, and of those that were in previous stealer logs, there was a high correlation of matching website domains. What I mean by this is that if the data Bob sent had someone's email address and password captured when logging into Netflix and Spotify, that person was probably already in HIBP's stealer logs against Netflix and Spotify. In other words, there's a lot of data we've seen before.
So, what do we make of all this, especially since the corpus Bob sent is about 17% of the reported 16B headline? Let me speak generally about how these data sets tend to have hyperbolic headlines, and the numbers of actual impact are way smaller:
The corpus of data I received contained 2.7B rows, of which I was able to extract 325M unique stealer log entries. That's the number of rows I could successfully parse out website, email address and password values from. In my earlier example with the one person's credentials captured for both Netflix and Spotify, that would mean two unique stealer log records. All of this then distilled down to 109M unique email addresses across all the files, and that's the number you'll now see in HIBP. In other words, 2.7B -> 109M is a 96% reduction from headline to people. Could we apply the same maths to the 16B headline? We'll never know for sure, but I betcha the decrease is even greater; I doubt additional corpuses to the tune of that many billion would continue to add new email addresses, and the duplication ratio would increase.
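To make the maths above easy to check, here's a small sketch that recomputes the percentages from the rounded figures quoted in this post. The constant and function names are only illustrative, and because the inputs are rounded, the outputs are approximate too.

```python
# Figures as quoted in the post (rounded, so the results are approximate).
HEADLINE_ROWS = 16_000_000_000   # the "16 billion" headline
CORPUS_ROWS = 2_700_000_000      # rows in the data received
UNIQUE_ENTRIES = 325_000_000     # unique website + email + password records
UNIQUE_EMAILS = 109_000_000      # unique email addresses loaded into HIBP


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


corpus_share = percent(CORPUS_ROWS, HEADLINE_ROWS)    # corpus vs headline
entries_share = percent(UNIQUE_ENTRIES, CORPUS_ROWS)  # parseable unique entries
emails_share = percent(UNIQUE_EMAILS, CORPUS_ROWS)    # unique people-ish count
reduction = 100 - emails_share                        # headline-to-people drop

print(f"Corpus as a share of the 16B headline: {corpus_share:.1f}%")
print(f"Unique stealer log entries per row:    {entries_share:.1f}%")
print(f"Unique email addresses per row:        {emails_share:.1f}%")
print(f"Reduction from rows to addresses:      {reduction:.1f}%")
```

The same four lines of arithmetic can be pointed at any other headline number, provided someone actually has the underlying rows to count.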
Because it always comes up after loading stealer logs, a quick caveat:
Not all email addresses loaded into this breach will contain corresponding stealer log entries. This is because we have one process to regex out all the addresses (the code is open source), and another process that pulls rows with email addresses against valid websites and passwords.
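To illustrate why the two counts differ, here's a minimal sketch of the two passes described above. It is not the actual HIBP code: the regex is deliberately simple, and the row layout ("url email password" separated by whitespace) and the function names are assumptions made purely for the example.

```python
import re

# Deliberately simple pattern for illustration; not the open source HIBP regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical row layout: "<url> <email> <password>" separated by whitespace.
ROW_RE = re.compile(r"^(https?://\S+)\s+(\S+@\S+)\s+(\S+)$")


def extract_emails(rows):
    """Process 1: pull every email-looking string out of every row."""
    found = set()
    for row in rows:
        found.update(match.lower() for match in EMAIL_RE.findall(row))
    return found


def extract_entries(rows):
    """Process 2: keep only rows that parse into website, email and password."""
    entries = set()
    for row in rows:
        match = ROW_RE.match(row.strip())
        if not match:
            continue
        url, email, password = match.groups()
        if not EMAIL_RE.fullmatch(email):
            continue
        domain = url.split("/")[2].lower()  # host part of http(s)://host/...
        entries.add((domain, email.lower(), password))
    return entries


rows = [
    "https://netflix.com/login alice@example.com hunter2",
    "https://spotify.com/login alice@example.com hunter2",
    "contact: bob@example.com (no website or password here)",
]

emails = extract_emails(rows)
entries = extract_entries(rows)
# Addresses that were loaded but have no corresponding stealer log entry.
emails_without_entries = emails - {email for _, email, _ in entries}
print(len(emails), len(entries), emails_without_entries)
```

In this made-up sample, one person logging into two sites produces two stealer log entries, while the third row contributes an address that the first pass picks up and the second pass cannot turn into an entry.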
And because I'll end up copying and pasting this over and over again in responses to queries, another caveat:
Presence in a stealer log is often an indicator of an infected device, but we have no data to indicate when it was infected. There will be a lot of old data in here, just as there's a lot of repackaged data.
Of the passwords in valid stealer log entries, there were 231M unique ones, and we'd seen 96% of them before. Those are now all in Pwned Passwords with updated prevalence counts and are searchable via the website and, of course, via the API. Speaking of which, those passwords are presently being searched a lot:
Every time I look, there's another billion (or two) pic.twitter.com/X7gflzWdCH
— Troy Hunt (@troyhunt) July 30, 2025
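For anyone wanting to check a password via the API mentioned above, here's a minimal sketch against the public Pwned Passwords range endpoint. Only the first 5 characters of the SHA-1 hash are sent, and the matching suffix is looked up locally in the response. The function names and the User-Agent string are my own assumptions for the example; treat it as a starting point, not a reference client.

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_hash(password: str) -> tuple[str, str]:
    """Return the 5-character prefix and 35-character suffix of the SHA-1 hash."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix: str, body: str) -> int:
    """Find the suffix in a range response made of 'SUFFIX:COUNT' lines."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count.strip())
    return 0  # suffix not present in the range, so no recorded prevalence


def pwned_count(password: str) -> int:
    """Query the range endpoint; only the hash prefix leaves the machine."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-passwords-sketch"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return count_in_response(suffix, body)
```

Calling `pwned_count("password")` makes one HTTPS request and returns the prevalence count for that password, or 0 if its suffix isn't in the returned range.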
Of the 109M email addresses we could parse out of the corpus, 96% of them were already in HIBP (that number coincidentally matches the percentage of existing passwords we track). They weren't all from previous stealer logs, of course, but anecdotally, during my testing, I found a lot of crossover between this one and the ALIEN TXTBASE logs from earlier this year. Regardless, we added 4.4M new addresses from Data Troll that we'd never seen before, so that alone is significant. Not significant enough to justify hyperbolic headlines to the effect of "biggest ever", but still sizeable.
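To summarise: the data Bob sent contained 2.7B rows, which yielded 325M unique stealer log entries, 231M unique passwords (96% of which we'd seen before) and 109M unique email addresses (96% of which were already in HIBP), leaving 4.4M addresses we'd never seen before.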
And lastly, there's that "Data Troll" title. When I first saw this story getting so much traction, the image I had in my mind was of a troll sitting on stashes of data. The mass media then picked this up and turned it into deliberately provocative headlines, manipulating the narrative to seek attention. Hopefully, this post tempers all that a little bit and brings some sanity back into the discussion. We need to take data exposures like this seriously, but this story certainly didn't deserve the attention it got.
2025-08-13 10:37:14
We were recently travelling to faraway lands, doing meet and greets with gov partners, when one of them posed an interesting idea:
What if people from our part of the world could see a link through to our local resource on data breaches provided by the gov?
Initially, I was sceptical, primarily because no matter where you are in the world, isn't the guidance the same? Strong and unique passwords, turn on MFA, and so on and so forth. But our host explained the suggestion, which in retrospect made a lot of sense:
Showing people a local resource from a trusted government body has a gravitas that we believe would better support data breach victims.
And he was right. Not just about the significance of a government resource, but as we gave it more thought, all the other things that are specific to the local environment. Additional support resources. Avenues to report scams. Language! Like literally, presenting content in a way that normal everyday folks can understand it based on where they are in the world. And we have the mechanics to do this now as we're already geo-targeting content on the breach pages courtesy of HIBP's partner program.
Whilst we're still working through the mechanics with the gov that initially came up with this suggestion, during a recent chat with our friends "across the ditch" at New Zealand's National Cyber Security Centre, I mentioned the idea. They thought it was great, so we just did it 🙂 As of now, if you're a Kiwi and you open up any one of the 899 breach pages (such as this one), you'll see this advice off to the right of the screen:
That links off to a resource on their Own Your Online initiative, which aims to help everyday folks there protect themselves in cyberspace. There's lots of good practical advice on the site along the lines I mentioned earlier, and even a suggestion to go and check out HIBP (which now links you back to the NZ NCSC...)
I'll be reaching out to our other gov partners around the world and seeing what resources they have that we could integrate. Hopefully, it's just one more little step in the right direction to protect the masses from online nasties.
2025-08-12 11:26:47
I think the most amusing comment I had during this live stream was one to the effect of expecting me to have all my tech things neat and ordered. As I look around me now, there are Shellys with cables hanging off them all over my desk, the keyboard I'm typing on has become very flaky with its Bluetooth connection, and a monitor colour tuning tool I've been meaning to run for years is still sitting there. There are seven boxes of Ubiquiti stuff on the floor waiting to be installed, and an IoT smoke alarm that needs a hub to work is next to me. I'm also looking at the camera that failed me this week, which still has that damn micro USB cable hanging out of it rather than being properly run through the wall to be nice and invisible. Yet somehow, today I've prioritised IoT'ing my rubbish bins with AI 🤷‍♂️ More on that next week!
2025-08-06 03:55:34
I'm often asked if cyber criminals are getting better at impersonating legitimate organisations in order to sneak their phishing attacks through. Yes, they absolutely are, but I also argue that the inverse is true too: legitimate organisations frequently communicate in ways that are indistinguishable from a phishing attack! I can name countless examples of banks, delivery services and even government agencies sending communication that I was convinced was a phish, but turned out to be legit. I once had an argument with an agent from our own tax office on precisely that basis. After having shown all the hallmarks of being a scammer, she instead turned out to be making a legitimate inquiry. And if you need more convincing that even I can't tell the difference between a scam and legit comms, look no further than my own recent failure to spot a phish that successfully extracted my Mailchimp credentials, including the 2FA code!
I don't mind recognising that I struggle with scams, and frankly, it creates a lot more empathy for the masses out there who don't spend their days thinking about cybersecurity. These are the sorts of folks who use Have I Been Pwned and often land there a bit frazzled, looking for answers after learning they've been breached in some nasty incident. They need a proactive defence against this style of attack that can protect them when the human controls fail, as they recently failed me. That's why today, I'm very happy to announce a new HIBP partner, Guardio! You'll find them located on each dedicated breach page, and on the home page of your personal dashboard:
We've now turned the above recommendation on for all US-based visitors and highlighted them for all audiences regardless of locale on the partners page. We believe the service they offer makes a meaningful difference to the security posture of our users, and we are happy to include them here to complement the unique services provided by our existing partners. So it's a big welcome to Guardio, and I look forward to sharing more about the work they're doing to protect us all in the future. Check out what Guardio does on their dedicated HIBP page now.
2025-08-03 15:12:24
I've listened to a few industry podcasts discussing the Tea app breach since recording, and the thing that really struck me was the lack of discussion around the privacy implications of the service before the breach. Here was a tool where people were non-consensually uploading photos of others and leaving fairly intimate commentary about them. That MO seems, at least in part, to explain what came next: a service that already presented massive privacy implications for the people being discussed then, in order to vet its participants' gender, created an even bigger privacy issue by collecting selfies and IDs, which in turn created yet another privacy issue when they were leaked and misused. There were so many red flags about this service before the breach that it's kinda fascinating the focus is now so heavily on the aftermath. A bit more pre-emptive focus on privacy next time, everyone.
2025-07-28 18:37:36
This will be the title of the blog post: "Court Injunctions are the Thoughts and Prayers of Data Breach Response". It's got a nice ring to it, and it resonates so much with the response to other disasters where the term is offered as a platitude that has absolutely no practical benefit at all. You know, like the Qantas injunction to prevent data from their breach being examined by other parties. So, whilst it means journos won't be poring over it (and we won't be loading it into HIBP), criminals will pay no attention to it whatsoever. More to come in the forthcoming blog post.