404 Media

A journalist-founded digital media company exploring the ways technology is shaping–and is shaped by–our world.

AI Models And Parents Don’t Understand ‘Let Him Cook’

2025-06-25 02:17:14

Young people have always felt misunderstood by their parents, but new research shows that Gen Alpha might also be misunderstood by AI. A research paper, written by Manisha Mehta, a soon-to-be 9th grader, and presented today at the ACM Conference on Fairness, Accountability, and Transparency in Athens, shows that Gen Alpha’s distinct mix of meme- and gaming-influenced language might be challenging the automated moderation performed by popular large language models.

The paper compares kid, parent, and professional moderator performance in content moderation to that of four major LLMs: OpenAI’s GPT-4, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama 3. They tested how well each group and AI model understood Gen Alpha phrases, as well as how well they could recognize the context of comments and analyze potential safety risks involved. 

Mehta recruited 24 of her friends to create a dataset of 100 “Gen Alpha” phrases. These included expressions that might be mocking or encouraging depending on the context, like “let him cook” and “ate that up,” as well as expressions from gaming and social media, like “got ratioed,” “secure the bag,” and “sigma.”
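
The comparison is straightforward to picture in code. Below is a minimal sketch of that kind of evaluation, assuming a hypothetical slice of the phrase dataset and crude keyword-overlap scoring in place of the study’s human raters; it illustrates the setup, not the authors’ actual pipeline.

```python
# A minimal sketch of testing an LLM's "Basic Understanding" of
# Gen Alpha phrases. The dataset entries and the scoring are
# hypothetical stand-ins; the study used human evaluation across
# three categories.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

dataset = [  # invented examples in the style of the 100-phrase dataset
    {"phrase": "let him cook", "meaning": "let him keep going, he is doing well"},
    {"phrase": "got ratioed", "meaning": "a reply got more engagement than the post"},
]

def model_gloss(phrase: str) -> str:
    """Ask the model what a phrase means in online slang."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"In one sentence, what does '{phrase}' mean in online slang?",
        }],
    )
    return resp.choices[0].message.content.lower()

def overlap_score(gloss: str, gold: str) -> float:
    """Crude proxy for a human rater: fraction of gold words in the gloss."""
    gold_words = set(gold.lower().split())
    return len(gold_words & set(gloss.split())) / len(gold_words)

for item in dataset:
    score = overlap_score(model_gloss(item["phrase"]), item["meaning"])
    print(f"{item['phrase']!r}: {score:.0%}")
```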

“Our main thesis was that Gen Alpha has no reliable form of content moderation online,” Mehta told me over Zoom, using her dad’s laptop. She described herself as a definite Gen Alpha, and she met her (adult) co-author, who supervises her dad’s PhD, last August. She has seen friends experience online harassment and worries that parents aren’t aware of how young people’s communication styles open them up to risks. “And there’s a hesitancy to ask for help from their guardians because they just don’t think their parents are familiar enough [with] that culture,” she says.

Given the Gen Alpha phrases, “all non-Gen Alpha evaluators—human and AI—struggled significantly” in the categories of “Basic Understanding” (what does a phrase mean?), “Contextual Understanding” (does it mean something different in different contexts?), and “Safety Risk” (is it toxic?). This was particularly true for “emerging expressions” like skibidi and gyatt, for phrases that can be used ironically or in different ways, and for insults hidden in innocent-looking comments. Part of this is due to the unusually rapid speed of Gen Alpha’s language evolution; a model trained on today’s hippest lingo might be totally bogus by the time it’s published six months later.

In the tests, kids broadly recognized the meaning of their own generation-native phrases, scoring 98, 96, and 92 percent across the three categories. However, both parents and professional moderators “showed significant limitations,” according to the paper; parents scored 68, 42, and 35 percent in those categories, while professional moderators did barely any better at 72, 45, and 38 percent. In real-life terms, those numbers suggest a parent might recognize only about a third of the moments when their child is being bullied in their Instagram comments.

The four LLMs performed about the same as the parents, potentially indicating that the models’ training data skews toward more “grown-up” language. This makes sense, since pretty much all novelists are older than 15, but it also means that content-moderation AIs tasked with maintaining young people’s online safety might not be linguistically equipped for the job.

Mehta explains that Gen Alpha, born between 2010-ish and last-year-ish, is the first cohort born fully post-iPhone. They are spending unprecedented amounts of their early childhoods online, where their interactions can’t be effectively monitored. And because of the massive volume of content they produce, much of the moderation of the risks they face is necessarily handed off to ineffective automated tools with little parental oversight. Against a backdrop of steadily increasing exposure to online content, Gen Alpha’s unique linguistic habits pose unique safety challenges.

Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

2025-06-24 23:31:37

A federal judge in California ruled Monday that Anthropic likely violated copyright law when it pirated authors’ books to create a giant dataset and "forever" library but that training its AI on those books without authors' permission constitutes transformative fair use under copyright law. The complex decision is one of the first of its kind in a series of high-profile copyright lawsuits brought by authors and artists against AI companies, and it’s largely a very bad decision for authors, artists, writers, and web developers. 

This case, in which authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson sued Anthropic, maker of the Claude family of large language models, is one of dozens of high-profile lawsuits brought against AI giants. The authors sued Anthropic because the company pulled full copies of their books from a now-notorious dataset called Books3, as well as from the piracy websites LibGen and Pirate Library Mirror (PiLiMi), for the purposes of training its AI models. The suit also claims that Anthropic bought used physical copies of books and scanned them for the purposes of training AI.

"From the start, Anthropic ‘had many places from which’ it could have purchased books, but it preferred to steal them to avoid ‘legal/practice/business slog,’ as cofounder and chief executive officer Dario Amodei put it. So, in January or February 2021, another Anthropic cofounder, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated," William Alsup, a federal judge for the Northern District of California, wrote in his decision Monday. "Anthropic’s next pirated acquisitions involved downloading distributed, reshared copies of other pirate libraries. In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated."

Massive Creator Platform Fansly Bans Furries

2025-06-24 22:19:43

Fansly, a popular platform where independent creators—many of whom are making adult content—sell access to images and videos to subscribers and fans, announced sweeping changes to its terms of service on Monday, including effectively banning furries.

The changes blame payment processors for classifying “some anthropomorphic content as simulated bestiality.” Most people in the furry fandom condemn bestiality and anything resembling it, but payment processors—which have increasingly dictated strict rules for adult sexual content for years—seemingly don’t know the difference and are making it creators’ problem.

The changes include new policies banning chatbots and image generators that respond to user prompts; content featuring alcohol, cannabis, or “other intoxicating substances”; and the sale of access to Snapchat content or content from other social media platforms where that violates those platforms’ terms of service.

‘FuckLAPD.com’ Lets Anyone Use Facial Recognition to Instantly Identify Cops

2025-06-24 21:43:47

A new site, FuckLAPD.com, is using public records and facial recognition technology to allow anyone with a picture of a Los Angeles police officer to identify them. The tool, made by artist Kyle McDonald, is designed to help people identify cops who may otherwise try to conceal their identity, such as by covering their badge or serial number.

“We deserve to know who is shooting us in the face even when they have their badge covered up,” McDonald told me when I asked if the site was made in response to police violence during the LA protests against ICE that started earlier this month. “fucklapd.com is a response to the violence of the LAPD during the recent protests against the horrific ICE raids. And more broadly—the failure of the LAPD to accomplish anything useful with over $2B in funding each year.”

“Cops covering up their badges? ID them with their faces instead,” reads the site, which McDonald said went live this Saturday. The tool allows users to upload an image of a police officer’s face to search over 9,000 LAPD headshots obtained via public records requests. The site says image processing happens on the device, and no photos or data are transmitted or saved. “Blurry, low-resolution photos will not match,” the site says.
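
McDonald hasn’t published the site’s internals, but the general shape of this kind of face matching is well known. Here is a minimal sketch using the open-source face_recognition library, with hypothetical file names; note the real site does its matching in the browser rather than on a server or in Python.

```python
# A sketch of the general approach (not McDonald's implementation):
# embed a set of known headshots once, then compare an uploaded photo
# against those embeddings.
import face_recognition

# Hypothetical paths; the real site searches ~9,000 LAPD headshots
# obtained via public records requests.
headshots = {"officer_0001.jpg": None, "officer_0002.jpg": None}

for path in headshots:
    image = face_recognition.load_image_file(path)
    encodings = face_recognition.face_encodings(image)
    if encodings:  # skip images where no face was detected
        headshots[path] = encodings[0]  # one 128-dimensional face embedding

# Assumes the uploaded photo contains at least one detectable face.
query = face_recognition.face_encodings(
    face_recognition.load_image_file("uploaded_photo.jpg")
)[0]

# Lower distance means a closer match; 0.6 is the library's usual threshold.
for path, known in headshots.items():
    if known is not None:
        dist = face_recognition.face_distance([known], query)[0]
        if dist < 0.6:
            print(f"possible match: {path} (distance {dist:.2f})")
```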

This Queer Online Zine Can Only Be Read Via an Ancient Internet Protocol

2025-06-24 21:00:39

Unless you’re living in a ChatGPT hype-bro bubble, it’s a pretty common sentiment these days that the internet is getting shittier. Social media algorithms have broken our brains, AI slop flows freely through Google search results like raw sewage, and tech companies keep telling us that this new status quo is not only inevitable, but Good.

Standing in stark opposition to these trends is New Session, an online literary zine accessed via the ancient-but-still-functional internet protocol Telnet.

Like any other zine, New Session features user-submitted poems, essays, and other text-based art. But the philosophy behind each of its digital pages is anything but orthodox.

“In the face of right-wing politics, climate change, a forever pandemic, and the ever-present hunger of imperialist capitalism, we have all been forced to adapt,” reads the intro to New Session’s third issue, titled Adaptations, which was released earlier this month. “Both you and this issue will change with each viewing. Select a story by pressing the key associated with it in the index. Read it again. Come back to it tomorrow. Is it the same? Are you?”

The digital zine is accessible on the web via a browser-based Telnet client, or if you’re a purist like me, via the command line. As the intro promises, each text piece changes—adapts—depending on various conditions, like what time of day you access it or how many times you’ve viewed it. Some pieces change every few minutes, while others update every time a user looks at it, like gazing at fish inside a digital aquarium.
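
Part of what makes that possible is how simple Telnet is: at its core, it’s just text over a TCP socket, with a thin layer of option negotiation on top. Below is a minimal read-only client sketch; the host and port are placeholders rather than New Session’s actual address, and subnegotiation is left out.

```python
# A minimal Telnet reader sketch. Telnet's simplicity is why both a
# 1991 ThinkPad and a modern laptop can read the zine.
import socket

HOST, PORT = "zine.example.net", 23  # hypothetical address

def strip_iac(data: bytes) -> bytes:
    """Drop Telnet option-negotiation sequences (IAC + command [+ option])."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 255 and i + 1 < len(data):  # 255 = IAC, "interpret as command"
            # WILL/WONT/DO/DONT (251-254) carry one extra option byte.
            i += 3 if data[i + 1] in (251, 252, 253, 254) else 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # server closed the connection
            break
        print(strip_iac(chunk).decode("utf-8", errors="replace"), end="")
```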

Once logged in, the zine’s main menu lists each piece along with the conditions that cause it to change. For example, Natasja Kisstemaker’s “Sanctuary” changes with every viewing, based on the current weather. “Signature,” by Kaia Peacock, updates every time you press a key, slowly revealing more of the piece when you type a letter contained in the text—like a word puzzle on Wheel of Fortune.
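
The zine’s source isn’t described in detail, but the mechanic is easy to sketch: the server keys the text it sends on some external state, like the hour of day or a per-reader view count. A purely illustrative example, with invented stanzas:

```python
# An illustrative sketch (not New Session's code) of condition-driven
# text: the rendered piece depends on the time of day and on how many
# times this reader has viewed it.
from datetime import datetime

VARIANTS = {  # hypothetical stanzas for one piece
    "morning": "the tide pulls back, the sand remembers",
    "evening": "the tide returns, the sand forgets",
}

def render(view_count: int) -> str:
    key = "morning" if datetime.now().hour < 12 else "evening"
    words = VARIANTS[key].split()
    # Reveal one more word per viewing, puzzle-style, like "Signature".
    return " ".join(words[: min(view_count, len(words))])

print(render(view_count=3))
```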

Cara Esten Hurtle, an artist and software engineer based in the Bay Area, co-founded New Session in 2021 along with Lo Ferris, while searching for something to do with her collection of retro computers during the early days of the COVID-19 pandemic.

“I realized I’d been carrying around a lot of old computers, and I thought it would be cool to be able to do modern stuff on these things,” Hurtle told 404 Media. “I wanted to make something that was broadly usable across every computer that had ever been made. I wanted to be like, yeah, you can run this on a 1991 Thinkpad someone threw away, or you could run it on your modern laptop.”

If you’re of a certain age, you might remember Telnet as a server-based successor to BBS message boards, the latter of which operated by connecting computers directly. It hearkens back to a slower internet age, where you’d log in maybe once or twice a day to read what’s new. Technically, Telnet predates the internet itself, originally developed as a networked teletype system in the late ‘60s for the internet’s military precursor, the ARPAnet. Years later, it was officially adopted as one of the earliest internet protocols, and today it remains the oldest application protocol still in use—though mainly by enthusiasts like Hurtle.

New Session intentionally embraces this slower pace, making it more like lightly interactive fiction than a computer game. For Hurtle, the project isn’t just retro novelty—it’s a radical rejection of the addictive social media and algorithmic attention-mining that have defined the modern-day internet.

“I want it to be something where you don’t necessarily feel like you have to spend a ton of time with it,” said Hurtle. “I want people to come back to it because they’re interested in the stories in the same way you’d come back to a book—not to get your streak on Duolingo.”

I won’t go into too much detail, because discovering how the pieces change is kind of the whole point. But on the whole, reading New Session feels akin to a palette cleanser after a long TikTok binge. Its very design evokes the polar opposite of the hyper-consumerist mindset that brought us infinite scrolls and algorithmic surveillance. The fact that you literally can’t consume it all in one session forces readers to engage with the material more slowly and meaningfully, piquing curiosity and exercising intuition.

At the same time, the zine isn’t meant to be a nostalgic throwback to simpler times. New Session specifically solicits works from queer and trans writers and artists, as a way to reclaim a part of internet history that was credited almost entirely to white straight men. But Hurtle says revisiting things like Telnet can also be a way to explore paths not taken, and re-assess ideas that were left in the dustbin of history.

“You have to avoid the temptation to nostalgize, because that’s really dangerous and it just turns you into a conservative boomer,” laughs Hurtle. “But we can imagine what aspects of this we can take and claim for our own. We can use it as a window to understand what’s broken about the current state of the internet. You just can’t retreat to it.”

Projects like New Session make a lot of sense in a time when more people are looking backward to earlier iterations of the internet—not to see where it all went wrong, but to excavate old ideas that could have shaped it in a radically different way, and perhaps still can. It’s a reminder of that hidden, universal truth—to paraphrase the famous David Graeber quote—that the internet is a thing we make, and could just as easily make differently.

Meta's AI Model 'Memorized' Huge Chunks of Books, Including 'Harry Potter' and '1984'

2025-06-24 01:54:43

A new paper from researchers at Stanford, Cornell, and West Virginia University seems to show that one version of Meta’s flagship AI model, Llama 3.1, has memorized almost the whole of the first Harry Potter book. This finding could have far-reaching copyright implications for the AI industry and impact authors and creatives who are already part of class-action lawsuits against Meta. 

Researchers tested a bunch of widely available free large language models to see what percentage of 56 different books they could reproduce. The researchers fed the models hundreds of short text snippets from those books and measured how well they could recite the next lines. The titles were a random sampling of popular, lesser-known, and public domain works drawn from the now-defunct and controversial Books3 dataset that Meta used to train its models, as well as books by plaintiffs in the recent, and ongoing, Kadrey v. Meta class-action lawsuit.

According to Mark A. Lemley, one of the study authors, this finding might have some interesting implications. AI companies argue that their models are generative—as in, they make new stuff, rather than just being fancy search engines. On the other hand, authors and news outlets are suing on the basis that AI is just remixing existing material, including copyrighted content. “I think what we show in the paper is that neither of those characterizations is accurate,” says Lemley.

The paper shows that the capacity of Meta’s popular Llama 3.1 70B to recite passages from The Sorcerer’s Stone and 1984—among other books—is way higher than could happen by chance. This could indicate that LLMs are not just trained using books, but might actually be storing entire copies of the books themselves. That might mean, under copyright law, that the model is less “inspired by” and more “a bootleg copy of” certain texts.

It’s hard to prove that a model has “memorized” something, because it’s hard to see inside one. But LLMs are trained on the mathematical relationships between little chunks of data called “tokens,” like words or punctuation, and a trained model assigns every possible next token a probability of following the tokens that came before it.
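
Concretely, the probability a model assigns to a whole passage chains those per-token probabilities together; this is standard language-model arithmetic rather than notation from the paper itself:

```latex
P(t_1, \dots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \dots, t_{i-1})
```

For genuinely novel prose that product collapses toward zero very quickly, which is why a model that assigns high probability to a long verbatim passage has, in effect, stored it.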

The researchers were able to extract sections of various books by repeatedly prompting the models with selected lines. They split each book into overlapping 100-token strings, presented the model with the first 50-token half, and measured how well it could produce the second. This might take a few tries, but ultimately the researchers were able to reproduce 91 percent of The Sorcerer’s Stone with this method.
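
In code, that procedure looks roughly like the sketch below, using the Hugging Face transformers library. The 100-token windows and 50/50 split follow the description above; everything else (the model handle, greedy decoding, exact-match scoring) is an illustrative assumption, not the authors’ published code.

```python
# A sketch of the prefix/suffix recitation test: split text into
# overlapping 100-token windows, feed the model the first 50 tokens,
# greedily decode 50 more, and count exact matches.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B"  # gated; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def suffix_match_rate(text: str, window: int = 100, half: int = 50) -> float:
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits = total = 0
    for start in range(0, len(ids) - window, half):  # overlapping windows
        prefix = ids[start : start + half].unsqueeze(0)
        target = ids[start + half : start + window]
        out = model.generate(
            prefix,
            max_new_tokens=half,
            do_sample=False,  # greedy decoding
            pad_token_id=tok.eos_token_id,
        )
        generated = out[0, half : half + len(target)]  # skip the prompt tokens
        hits += int(torch.equal(generated, target))
        total += 1
    return hits / total if total else 0.0
```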

“There’s no way, it’s really improbable, that it can get the next 50 words right if it hadn’t memorized it,” James Grimmelmann, Tessler Family Professor of Digital and Information Law at Cornell, who has worked to define “memorization” in this space, told 404 Media. 

OpenAI has called memorization “a rare failure of the learning process,” and says that it sometimes happens when the topic in question appears many times in training data. It also says that intentionally getting its LLMs to spit out memorized data “is not an appropriate use of our technology and is against our terms of use.”

The study’s authors say in their paper that if the model is storing a book in its memory, the model itself could be considered to literally “be” a copy of the book. If that’s the case, then distributing the LLM at all might be legally equivalent to bootlegging a DVD, and a court could order the destruction of the model itself, the same way courts have ordered the destruction of caches of pirated film box sets. This has never happened in the AI space, and might not be possible given how widespread these models are. Meta doesn’t release usage statistics for its different LLMs, but 3.1 70B is one of its most popular. The paper estimates that the Llama 3.1 70B model has been downloaded a million times since its release, so, technically, Meta could have accidentally distributed a million pirate versions of The Sorcerer’s Stone.

The paper found that different Llama models had memorized widely varying amounts of the tested books. “There are lots of books for which it has essentially nothing,” said Lemley. Some models were amazing at regurgitating and others weren’t, which suggests, the researchers said, that specific choices made in training the 3.1 70B version led to its memorization. That could be something as simple as the choice not to remove duplicated training data, or the fact that Harry Potter and 1984 are pretty popular books online. For comparison, the researchers found that the Game of Thrones books were highly memorized, but the Twilight books weren’t memorized at all.

Grimmelmann said he believes the findings might also be good news overall for those seeking to regulate AI companies. If courts rule against allowing extensive memorization, “then you could give better legal treatment to companies that have mitigated or prevented it than the companies that didn't,” he said. “You could just say, if you memorize more than this much of a book, we'll consider that infringement. It's up to you to figure out how to make sure your models don't memorize more than that.”