Bear Blog Trending Posts
Ranked according to the following algorithm: Score = log10(U) + (S / D * 8600), where U is upvotes and S/D is a time term.

Learning JP #11 - Last exercise, bye routine!

2025-06-25 09:23:09

The last challenge is here!

I'm not gonna lie, I haven't been paying attention to our good old "30 Day Japanese" routine, but that's actually a good thing!
I think I've mentioned it before, but the main purpose of the routine, taken straight from the site, is:

This routine was made in response to decision paralysis some people may experience when learning Japanese. This is by no means definitive and you are free to change any parts of any day to meet your needs.

I think I'm finally understanding everything, especially how parts of the language fall into place when learning it. I have an overall understanding of how I want to tackle this adventure, and that's why...

I'm dropping the routine!

I know this barely counts, because I've been following it for almost the full 30 days, but I've modified so many things that I can barely say that I'm following it.
I am, however, picking up the last challenge that the routine proposes.
The routine asks us to read 100 pages of the よつばと! (Yotsubato!) manga, because it's simple, a slice of life, and features Furigana (sort of subtitles for Kanji).
I don't really feel like reading よつばと! for this challenge, so I'm replacing it with another simple manga with Furigana, good old ドラえもん (Doraemon).

An important part of this challenge is actually being able to read, understand, and translate the text. That's why the routine recommends using a tool called mokuro, which processes the manga you feed it and turns the text in the images into regular text that we can select and copy - which works great with Yomitan!

Let me show you:

First we have a regular screenshot of ドラえもん:
Doraemon - Screenshot 1

We can see mokuro's text (top-left) if we hover over the original text:
Doraemon - Screenshot 2

And if we hold the shift key, we can use Yomitan:
Doraemon - Screenshot 3

So? Pretty cool, huh? You can also see some examples of the Furigana - in the top-right panel, for instance, it shows that 成功 is read as せいこう.

What will I do without the routine?

Well, for now I'll continue with the classics:

And I'll soon be adding in the manga challenge.

I would like to do a soft restart on the 1st of next month, but I'll explain why and how I'll do that in a different post. Today was all about giving an update on the new challenge and how it'll work!

That's all! I hope to see you again soon! ★



The Bitter Lesson is coming for Tokenization

2025-06-24 21:30:00

a world of LLMs without tokenization is desirable and increasingly possible

Published on 24/06/2025 • ⏱️ 29 min read


In this post, we highlight the desire to replace tokenization with a general method that better leverages compute and data. We'll see tokenization's role, its fragility and we'll build a case for removing it. After understanding the design space, we'll explore the potential impacts of a recent promising candidate (Byte Latent Transformer) and build strong intuitions around new core mechanics.

As has been pointed out countless times - if the trend of ML research could be summarised, it'd be the adherence to The Bitter Lesson - opt for general-purpose methods that leverage large amounts of compute and data over methods crafted by domain experts. More succinctly articulated by Ilya Sutskever, "the models, they just want to learn". In recent years, model ability has continued to be blessed with the talent influx, hardware upgrades, architectural advances and initial data ubiquity needed to enable this reality.

the pervasive tokenization problem

However, one of the documented bottlenecks in the text transformer world that has received less optimisation effort is the very mechanism that shapes its world view - tokenization.

If you're not aware, one of the popular text tokenization methods for transformers, Byte-Pair Encoding (BPE), is a learned procedure that extracts an effectively compressed vocabulary (of desired size) from a dataset by iteratively merging the most frequent pairs of existing tokens.

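To make the merge procedure concrete, here's a minimal sketch of BPE learning over a toy corpus in Python (no pre-tokenization or byte-fallback rules - real implementations like the ones behind tiktoken or SentencePiece add plenty of machinery on top):

from collections import Counter

def learn_bpe(corpus: str, num_merges: int):
    """Minimal BPE sketch: repeatedly merge the most frequent adjacent pair."""
    seq = list(corpus.encode("utf-8"))       # start from raw bytes (base vocab of 256)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent token pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        new_id = 256 + len(merges)           # each merge appends a new entry to the vocab
        merges.append(((a, b), new_id))
        out, i = [], 0
        while i < len(seq):                  # replace every occurrence of the pair
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, tokens = learn_bpe("low lower lowest low low", num_merges=5)
print(len("low lower lowest low low".encode()), "bytes ->", len(tokens), "tokens")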

It's worth remembering that this form of tokenization is not a strict requirement of the transformer. In practice, it means that we're able to represent more bytes given a fixed number of entries in the transformer's embedding table. From our earlier definition, effective is doing some heavy lifting. Ideally, the vocabulary of tokens is perfectly constructed for the task at hand such that it obtains the optimal trade-off of byte compression to reduce the transformer's FLOPS while maintaining enough of a granular representation to achieve the lowest possible loss. Another ideal attribute is that tokens that do get merged end up being well-modelled during training.

A crude map of sequence vs vocab size

In this optimal trade-off, the need for byte compression comes from attention's computational complexity, and it's the core reason why transformers have to rely on some form of (often sub-word) tokenization. Character-level RNNs used to be the norm (Sutskever, 2011, Graves, 2013, Karpathy's RNN post) but they struggled to learn directly from characters and were superseded by character-aware models that tokenize via a CNN over characters (an approach that also spilled over into the transformer world). In the case of attention however, tokenization was there from the beginning since it is imperative to avoid clogging up the context, enabling the transformer to attend to the full sequence and cash in on its long-range dependency abilities.

Revisiting the optimal tradeoff, tokenizers are often far from the ideal and the history of LLMs is plagued with downstream issues attributable to them. From these "earlier" days as modern LLMs started seeing more activity, we saw things like:

GPT2's number tokenization

In most of these cases, the tokenizer and its pipeline get tweaked and the problems resolved. There have even been efforts to automatically detect under-trained "glitch" tokens. But examples like the 🍓 meme (and more later in the post) are more fundamental instances of how we're depriving models of information in the name of efficiency via simplistic levers.

For the purpose of this post, we'll limit our exploration to text tokenization, but I would be remiss if I didn't mention that tokenization is a feature across all modalities, with modality-specific tokenizers becoming the standard. This comes with its own host of challenges and continues to perpetuate externalised, separately trained models with competing concerns and their own training dynamics, which then end up being addressed incrementally, via extension or via improved approaches. This is all to say, the problem is evidently non-trivial and has received significant research effort.

In the world of text tokenization, at least from an external point of view (though, not sure about internally), things do seem to have stagnated. Even with this stability, the failure modes of tokenization continue to impede the models. A reasonable question to ask might be - "we have approaches that let us cope with these failure modes, do we really need to solve it?"

can we just ignore it?

From earlier days, chain of thought, tool use and RAG all began addressing these issues, and more recently, increasingly sophisticated undisclosed mid/post-training recipes and the move to reasoning-based models continue in this direction. But it begs the question - how much model ability is being left on the table due to poor tokenization? In my view, this ranges from sub-optimal merges for task diversity to misconfiguring the tokenizer relative to model capacity. The honest answer here is that no one seems to have publicly investigated this thoroughly (from what I could find). However, the revealed preference of the big labs is in favour of subword-level tokenization and hasn't seen much movement. Given no direct research to consider, we'll use the latest in learned tokenization and byte-level end-to-end learned tokenization as proxies for understanding what's being left on the table.

While researching for this post, I ended up reading a bit too much into the text tokenization literature which probably warrants its own post. For the curious reader, this provides a great overview and kick off point but in the interest of my own sanity and your time, just trust me bro, there's quite a bit to it!

can we just delete it?

Before attempting optimising, we should always ask the important question "can we just delete it?". From a domain point of view, some are skeptical that bytes are adequate for modelling natural language. However, if we only entertain the technical feasibility - what does deletion look like?

In the GPT-2 paper, the authors revisit the choice of input representation and empirically register a performance gap on WebText similar to what Google found in their character-level LM with Deeper Self-Attention paper. That paper kicked off the character-level revival by showing that, with the help of auxiliary losses, it outperformed its LSTM character-level counterparts but still registered a gap versus word-level models. The authors follow up with another paper that bridges the gap, but at the expense of much more compute and time to train.

This is all to say: we started at the character level, authors tried going back to it, and they failed for reasons that may now warrant revisiting as the underlying factors shift. If we look at BPE more closely, a commonly cited heuristic is that BPE tokens represent, on average, 4.4 bytes per token, meaning that current BPE-based transformers' 32K-token context windows are able to attend over ~140K bytes with a vocab size of 256K. If we were to use pure bytes and a vanilla transformer modelling UTF-8 bytes, we'd have a vocab size of 256 and be limited to attending to only 32K bytes!
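As a quick sanity check of those numbers (the 4.4 bytes/token figure is the heuristic average mentioned above, not a property of any specific tokenizer):

bytes_per_token = 4.4        # commonly cited average compression of BPE vocabularies
context_tokens = 32_000

print(f"subword context covers ~{context_tokens * bytes_per_token / 1e3:.0f}K bytes")  # ~141K
print(f"a pure byte-level context covers only {context_tokens // 1000}K bytes")        # 32K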

So Google's ByT5 set out to answer the "can we delete?" question in its purest form:

Our goal in designing ByT5 is to take an existing token-based model and perform the minimal set of modifications to make it token-free, thereby limiting experimental confounds.

They showed that pure byte modelling, even when trained on 4x less data, had comparable or better performance than its SentencePiece counterpart on a subset of benchmarks under 1B parameters (namely robustness to noise and word-level tasks like transliteration, morphological inflection and grapheme-to-phoneme1). Given the intentionally naive modification, it increased pre-training time by 33% (wall-clock time) and, in the worst case2, inference by up to a factor of 10x3.

However if one were to, hypothetically, entertain the heretical thoughts of straying from The One True Architecture then one could be free of attention's quadratic complexity and worry less about clogging up the context.

low effort alternative architecture slander

In that case, ByT5's kindred soul (w.r.t simplicity) is MambaByte, which capitalises on the State Space Model's (SSM) fixed-size memory state that doesn't scale with input context size; when not dealing with compressed byte representations via subword-level tokenization, this becomes a great fit for the problem. However, even without the clogged context problem, the sequence length still remains and so do the increased inference steps, so they employ the model in a speculative decoding setup to alleviate the burden. SSMs are a tool in the kit that have found strong utility where MambaByte would be useful, but they come with their own host of challenges that we inherit when relying on them as the core method to remove the tokenizer.

Alas, given we are True Believers we would never have such thoughts. We hold steady faith in the Values of The Transformer and thus heretics we are not.

so... can we learn it?

In comparison to BPE's learning, there's a series of architecture changes we can make to a transformer to remove the requirement of optimised sub-word tokenization. With the bitter lesson in mind, if we're able to learn tokenization more generally, we would expect to see a model:

  1. be competitive or improve loss scores
  2. improve on downstream tasks across the board
  3. demonstrate better scaling curves when thrown more compute and data

Before jumping into transformer modifications - are there any directionally relevant changes we can make to vanilla BPE? Mostly, they're incremental changes that compensate for its limitations but aren't aligned with our previously stated goal. They include things like probabilistically skipping merge operations (common in a variety of tasks), a pretokenization curriculum to first learn subwords then super-words that bridge whitespace, falling back to bytes instead of lumping everything into the <unk> token, and enforcing consistency of predictions over different segmentations. However, methods like updating the tokenizer based on downstream loss under different segmentations and jointly optimizing the tokenizer with the model are more aligned with our goal, but are trickier to apply in practice.

Given we're seeking generality that demonstrates better scaling curves, this isn't going to cut it.

design space so far

Rather than going back three decades to paint a deep picture of the space, we'll focus in on the recent progress in the transformer-centric literature where there's been a few different stabs at addressing the efficiency challenges of pure byte modelling for the transformer case.

Each architecture is some variation on the theme of creating a compressed representation, which usually materialises in a few choices:

  1. down/upsampling to/from that compressed representation
  2. how FLOPS are distributed across levels of representation
  3. decoding strategy 4
  4. fixed or dynamic width of bytes

In language modelling, it's commonplace to compare perplexity but in papers like these where we're not evaluating with a fixed tokenizer, some variation of bits-per-byte will be used as a tokenizer independent version of perplexity:

$$ \operatorname{BPB}(x)=\frac{\mathcal{L}_{CE}(x)}{\ln (2) \cdot n_{\text{bytes}}} $$
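In code, assuming we already have the summed cross-entropy (in nats) that a model assigns to a text and the number of UTF-8 bytes that text spans, the conversion is a one-liner:

import math

def bits_per_byte(total_ce_nats: float, n_bytes: int) -> float:
    """Tokenizer-independent 'perplexity': total cross-entropy in nats, normalised to bits per byte."""
    return total_ce_nats / (math.log(2) * n_bytes)

# e.g. a model scoring a 1,000-byte passage at 600 nats of summed cross-entropy:
print(bits_per_byte(600.0, 1_000))   # ≈ 0.87 bpb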

Without getting bogged down in excessive detail, let's consider some directionally-aligned landmark papers in recent memory.

CANINE's encoder (targeting non-generative tasks) used a combination of n-gram hash embeddings, local attention and strided convolutions to downsample from character-level to a compressed representation processable by a larger transformer 5.

Charformer is an encoder-decoder model that also learns to downsample end-to-end via a gradient-based block scoring function up to some fixed block size6. It isn't designed to be autoregressive either.

Concretely, from the character sequence it builds byte embeddings from which it constructs a series of candidate latent subword blocks up to a max block size at some stride. Stride size is set to the size of block for that block size.

At each position, which latent subword block should we use? This is enabled by a block scoring network to select the right block which gives us a score per block for each position $i$. Scores are then softmax'd to get a probability distribution $P_{i}$ over blocks for position $i$. These subword block representations are summed and weighted by their $P_{b,i}$ to form the final latent subword representation for position $i$:

$$ \hat{X}_i=\sum_{b}^{M} P_{b, i} X_{b, i} $$

And visually7:

In its original form (like CANINE), it can't be used in an autoregressive setting due to the downsampling for block scoring since no mask can be applied to ensure no subword is formed with future bytes at each position.
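For intuition, here's a rough sketch of that block scoring step, heavily simplified from the paper's GBST: mean-pooled candidate blocks, a single linear scoring layer, and none of the stride handling or cross-block score calibration:

import torch
import torch.nn.functional as F

def gbst_downsample(char_embeds: torch.Tensor, scorer: torch.nn.Linear, max_block: int = 4):
    """char_embeds: [T, d]. Returns one soft 'latent subword' representation per position."""
    T, d = char_embeds.shape
    candidates, scores = [], []
    for b in range(1, max_block + 1):
        x = F.pad(char_embeds, (0, 0, 0, (-T) % b))        # pad sequence length to a multiple of b
        blocks = x.reshape(-1, b, d).mean(dim=1)           # mean-pool non-overlapping blocks of size b
        per_pos = blocks.repeat_interleave(b, dim=0)[:T]   # broadcast each block back to its positions
        candidates.append(per_pos)
        scores.append(scorer(per_pos))                     # one score per candidate block, per position
    X = torch.stack(candidates)                            # [max_block, T, d]
    P = torch.softmax(torch.stack(scores), dim=0)          # distribution over block sizes, per position
    return (P * X).sum(dim=0)                              # [T, d] weighted latent subword representation

x_hat = gbst_downsample(torch.randn(10, 64), torch.nn.Linear(64, 1))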

Building off of this, the Hourglass Transformers paper is a U-Net-like architecture that shows the success of adapting an autoregressive transformer with downsampling by some static factor at different stages (hence hierarchy in paper title) followed by upsampling with residual connections from the pre-pooled representation8. The down/upsampling are attention-based where they down/upsample the attention's queries via some arbitrary function (respectively average pool, linear upsampling).

Given they're partially targeting the task of language modelling, they resolve the information leak problem by doing an additional patch-aware shifting of labels to preserve the autoregressive property of the model. They also conduct interesting ablations such as scaling the intermediate layers acting on the downsampled sequence representation (thematically relevant). Crucially though, each time a token is decoded, the entire new sequence has to be passed through the entire network.
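The down/upsampling itself is very lightweight. Here's a sketch with a static shortening factor of 2, using average pooling on the way down and (for simplicity here) nearest-neighbour repetition plus the pre-pooled residual on the way up, where the paper's choice is a linear upsampling:

import torch

def downsample(x: torch.Tensor, k: int = 2) -> torch.Tensor:
    """[T, d] -> [T/k, d]: average-pool groups of k positions (T assumed divisible by k)."""
    T, d = x.shape
    return x.reshape(T // k, k, d).mean(dim=1)

def upsample(x_short: torch.Tensor, x_residual: torch.Tensor, k: int = 2) -> torch.Tensor:
    """[T/k, d] -> [T, d]: repeat each position k times, then add the pre-pooled residual."""
    return x_short.repeat_interleave(k, dim=0) + x_residual

x = torch.randn(8, 16)
y = upsample(downsample(x), x)   # round-trips back to [8, 16]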

They show that they're able to improve upon a baseline byte-level model while still reducing the total number of parameters (at the 150M scale):

One of the primary authors extends this architecture and goes on to experiment in Efficient Transformers with Dynamic Token Pooling to replace the static patching. Given the patch boundary is to be dynamic, they experiment with learning a boundary predictor during training via:

  1. supervision via tokenizer (ala CANINE-S5)
  2. supervision via spikes in the conditional entropy of the predictive distribution
  3. end-to-end via stochastic re-parametrisation

They also experiment with not learning the boundary predictor and just relying on a modality-specific boundary via the whitespace character.

Just as with the previous architecture, it still requires the full model (i.e. the boundary predictor network, token model and decoder) to be invoked after each character is decoded9.

MEGABYTE's multiscale transformers is the next autoregressive approach to look at:

It downsamples from the byte-level by embedding each position in the byte-sequence and chunking it into static patches of length $P$ and then employing multiscale transformers to model the patch-level and byte-level sequences. If you haven't come across the term "multiscale", much like "hierarchical" in the previous paper, it isn't a new term and it's used here to refer to a large global model that functions on a compressed sequence (i.e patch-level) and a small local model that operates on the full sequence (i.e byte-level). After reviewing the past few papers, hopefully this hierarchy definition should seem familiar.

When thinking about a larger model being triggered by a smaller model and some heuristics, speculative decoding might come to mind10. However if we were to have a byte-level draft model and a byte-level oracle model, it would miss out on the crucial sequence length compression and cripple the global model to have a much shorter max context length.

The core contribution of this paper is this specific separation of computation which impacts the frequency of execution and FLOPs distribution. With a full sequence $T$ bytes and $P = 4$ at inference time, $K = \frac{T}{P}$ where the global model is executed $K$ times (i.e 4x less in this case) versus the local model that runs $T$ times. The global model is also executed with $T/4$ patches in its context putting it in a similar position as the average subword-level token size.

In this setup, the local model is predicting each sequence element's likelihood conditioned solely on its respective patch (and not all previous patches!). In practice, they create $K$ copies of the local model so that both at prefill (during inference) and training, they're able to parallelise. Its autoregressive property is upheld by, as you would also expect, the inclusion of padding and offsetting inputs to the local and global models to avoid leaking information about future positions.
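To make the frequency-of-execution point concrete (the numbers below are illustrative, not from the paper):

T, P = 8_192, 4            # total bytes in the sequence, static patch size
K = T // P                 # number of patches = positions seen by the global model

print(f"global model: {K} patch positions per sequence ({P}x fewer than the {T} byte positions)")
print(f"local model:  {T} byte positions, each conditioned only on the bytes of its own patch")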

One of the paper's aims is general, modality-agnostic byte modelling, which is demonstrated by their evaluation against language, audio and image modelling11. With respect to language modelling, they show that they outperform other byte-level models in compute-controlled experiments (to much excitement), but they fail to demonstrate its performance against subword-level transformers in a compute-controlled setting. Given the paper takes on a lot, this seems to have fallen lower on the list of priorities, and so they compared in a compute-variable setting12.

A follow up paper, SpaceByte demonstrates that when done in a compute-controlled setting alongside subword-level transformers, it doesn't perform quite as competitively as the Megabyte authors had figured.

unit is bits-per-byte

Given that SpaceByte is more fixated on language modelling, they invest more into conducting this benchmarking to have baselines against which to compare. Stemming from the observation that patch boundaries will occur regardless of structure (i.e in the middle of words), they introduce a modality-specific patching rule (ala hourglass-based dynamic token pooling) that gives rise to dynamic patches (i.e patch on word boundaries via whitespace-like byte characters). In order to handle the dynamic patches, they introduce another local model before the global transformer. In this way, they end up approximating a simpler version of the dynamic patch hourglass transformer in the "whitespace as the dynamic patch boundary predictor" configuration.

back up for a breather

So where does this leave us?

As mentioned earlier, you'll see that all the architectures are primarily concerned with their down/upsampling method to create a compressed representation upon which they then disproportionately spend FLOPs from their budget. When designing the downsampling scheme, they're also making a choice as to whether the compressed representation will capture a fixed or dynamic width of bytes and how they're going to prevent future position information leakage.

Up until now, byte-level models seem to have found their fit in the wild for specific tasks where the granularity excels like toxicity detection which are known to outperform tokenized models on academic benchmarks. For the "default" case, they haven't seen much adoption. A recent contender has built on all these previous works, invested in proper studies and achieved some interesting results that seem to be best aligned with our previously stated goal.


Byte Latent Transformer

Using the set of approaches we've accumulated thus far, let's break down the BLT. Much like SpaceByte, it's solely focused on language modelling. Starting with a broad overview of the components, it has:

  1. Patcher $\mathcal{P}$ that decides the dynamic patch boundaries for a stream of bytes
  2. Local Encoder $\mathcal{E}$ responsible for going from bytes to patches
  3. Global Transformer $\mathcal{G}$ that contextualises the patches
  4. Local Decoder $\mathcal{D}$ uses byte-level info from $\mathcal{E}$ and patches from $\mathcal{G}$ to predict the next-byte of (what will be) the next patch.

Putting that all together into an animation:

For most intents and purposes, the animation above should suffice in explaining the architecture to the point that we can investigate the paper's results. For the curious, continue on, for the time-pressed - jump to the results section or the quirks section that helps build intuition via tinkering.

mechanics

I'll refer to BLT as the local encoder/decoder + global transformer and the Patcher as a separate entity. The Patcher $\mathcal{P}$'s goal is similar to that in the Hourglass Transformer but rather than using a classifier trained to predict entropy-based patch boundaries, it uses the next-byte prediction of a small byte-level autoregressive LLM's to determine the boundaries via thresholds. Concretely, they're computed as the next byte entropies under the LM distribution $p_e$ over the byte vocabulary $\mathcal{V}$:

$$ H\left(x_i\right)=-\sum_{v \in \mathcal{V}} p_e\left(x_i=v \mid x_{<i}\right) \log p_e\left(x_i=v \mid x_{<i}\right) $$

It's trained separately from the BLT (but on the same pre-training data mix) with sliding window attention ($n_{ctx} = 512$). A patch boundary is then decided on the basis of either one of two thresholds:

$$ \text{Global Constraint:} \quad H\left(x_t\right)>\theta_{g} $$

$$ \text{Approx. Monotonic Constraint:} \quad H\left(x_t\right)-H\left(x_{t-1}\right)>\theta_r $$

The thresholds are calibrated on the basis of the average desired patch size on the pre-training data mix.
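A small sketch of that boundary rule, assuming we already have the Patcher LM's next-byte distribution at every position (the threshold values here are placeholders; in the paper they're calibrated to hit a desired average patch size):

import torch

def patch_starts(probs: torch.Tensor, theta_g: float = 2.0, theta_r: float = 0.5):
    """probs: [T, 256] next-byte distributions from the small Patcher LM.
    Returns the byte indices at which a new patch begins."""
    H = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # next-byte entropy per position
    starts = [0]
    for t in range(1, len(H)):
        if H[t] > theta_g or (H[t] - H[t - 1]) > theta_r:     # global / approx. monotonic constraints
            starts.append(t)
    return starts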

For notation, anything subscript ${i}$ denotes byte-level positions while ${j}$ refers to patch-level.

With patch boundaries defined, the BLT has the Local Encoder $\mathcal{E}$ that embeds bytes $b_i$ into $x_i$ via a $\mathbb{R}^{256 \times h_{\mathcal{E}}}$ matrix. It has alternating layers of transformer blocks and multi-headed cross attention. It downsamples using $x_i$ and the patch boundaries (from $\mathcal{P}$) to create patch representations $p_j$ (ala SpaceByte). $\mathcal{G}$ does a bog-standard pass through the transformer to produce the contextualised $o_{j}$. The Local Decoder $\mathcal{D}$ also has the alternating layers just in reverse order (i.e starts with cross attention). It uses both enriched patch-level $o_{j}$ and intermediate byte-level $\hat{x}_i$ from $\mathcal{E}$ to predict the next byte $b_{i+1}$ of (what will be) the next patch. Prior to $\mathcal{E}$, a local block causal mask is applied such that byte positions can attend across patch boundaries but not across document boundaries.

Focusing on $\mathcal{E}$, $\hat{x}_{i}$ is downsampled via pooling that's then projected via a linear layer $\mathcal{E}_C \in \mathbb{R}^{h_{\mathcal{E}} \times\left(h_{\mathcal{E}} \times U_{\mathcal{E}}\right)}$ to become $Q_j$ where $U_{\mathcal{E}}$ is the number of cross attention heads and $h_{\mathcal{E}}$ is the local encoder's embedding dimension. $Q_{j}$ is the patch-level query for the cross attention used in combination with byte-level $K_{i}$ and $V_{i}$ projections of $x_{i}$. For the cross attention, a special mask is used where each $Q_{j}$ only attends to the $K$ and $V$ that corresponds to the bytes in its patch $j$.
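Here's a heavily simplified, single-head sketch of that byte-to-patch cross attention (mean pooling as the downsample, ignoring the multi-head reshaping and the $U_{\mathcal{E}}$ query expansion):

import torch

def bytes_to_patches(x, patch_lens, W_q, W_k, W_v):
    """x: [T, h_E] byte representations; patch_lens: e.g. [1, 4, 4].
    Each patch query attends only to the keys/values of its own bytes."""
    out, start = [], 0
    for L in patch_lens:
        chunk = x[start:start + L]                   # the bytes belonging to this patch
        q = W_q(chunk.mean(dim=0, keepdim=True))     # pooled then projected patch query
        k, v = W_k(chunk), W_v(chunk)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        out.append(attn @ v)                         # [1, h_E] patch representation
        start += L
    return torch.cat(out, dim=0)                     # [num_patches, h_E]

h_E = 256
W_q, W_k, W_v = (torch.nn.Linear(h_E, h_E) for _ in range(3))
p_j = bytes_to_patches(torch.randn(9, h_E), [1, 4, 4], W_q, W_k, W_v)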

On the other side, there's $\mathcal{D}$ that upsamples by using $\hat{x}_{i}$ (i.e last transformer block output) as the byte-level queries and patch-level keys and values via projections of the enriched $o_{j}$. Optionally, instead of using $x_{i}$ directly they imbue each position with CANINE-like n-byte hash embeddings via:

$$ e_i=x_i+\sum_{n=3, \ldots, 8} E_n^{\text {hash }}\left(\operatorname{Hash}\left(g_{i, n}\right)\right) $$

where $x_i$ is the position's byte embedding, nudged in embedding space by each n-gram's embedding (i.e. here we're adding 6 embeddings to $x_i$). If you want an even more granular pass through the model, the appendix has a concrete forward pass with shapes.
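A toy version of that hash-embedding trick, with Python's built-in hash standing in for the paper's rolling hash and the bucket count matching the reported 500K per n-gram length:

import torch

h_E, N_BUCKETS = 256, 500_000
hash_tables = torch.nn.ModuleDict({str(n): torch.nn.Embedding(N_BUCKETS, h_E) for n in range(3, 9)})

def add_ngram_hash_embeddings(byte_ids: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """byte_ids: [T] ints in 0..255; x: [T, h_E] byte embeddings. Returns e_i = x_i + sum_n E_n(Hash(g_{i,n}))."""
    e = x.clone()
    ids = byte_ids.tolist()
    for i in range(len(ids)):
        for n in range(3, 9):                                   # n-grams of length 3..8 ending at i
            if i - n + 1 < 0:
                continue
            gram = tuple(ids[i - n + 1 : i + 1])
            bucket = hash(gram) % N_BUCKETS                     # stand-in for a proper rolling hash
            e[i] = e[i] + hash_tables[str(n)](torch.tensor(bucket))
    return e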

Aligned with our previous analysis of the design space, BLT is attempting to create a more efficient representation while still integrating byte-level information such that the sparingly-run latent transformer can model, on average, more bytes per patch OR reduce a patch down to a single byte per step on particularly difficult problems (i.e low resource languages, reverse spelling etc). Put succinctly:

"the hypothesis that larger models taking fewer steps on larger patches might perform better than smaller models taking more steps".

results

should we even care about understanding this architecture & its quirks more deeply?

Using the criteria from earlier, let's review the loss results first.

As the old adage goes "if you're getting better results, are you sure that you didn’t add more compute to your network?". Hedging against this, they benchmark in compute-controlled settings for non-trivial model sizes (up to 8B and data up to 1T tokens/4T bytes). Importantly, they're comparing against both byte-level and subword-level models (i.e LLaMa 2, 3 and 3.1) which yielded the claim-to-fame graph:

In this study with fixed inference FLOPS and trained beyond compute-optimal point (much like Llama 3.1), they're claiming that:

  1. BLT generally has a better scaling curve vs LLaMa 2 & 3
  2. Increasing the patch size for BLT gives better scaling curves

The second claim seems much weaker in the larger model case, and they attribute this weakness to the decreasing share of total FLOPS used by the byte-level local models, which seem to scale slower than the global model. If this turns out to be true, it might just be a matter of shifting inference-time FLOPS to the local models and finding the right conjoined scaling method. With the first claim, we already satisfy criteria (1) and (3) that we set out, so we can move on to criterion (2).

Sticking with training an 8B beyond compute-optimal point for a 1T token dataset, they show that BLT performs better on most general-interest tasks:

The downstream task performance (each dataset explained here) focusing on character-level tasks really shines:

It shows that even a model trained on 16x more data is unable to get anywhere near the performance on simple noised data and basic character-level tasks. To paint a more colourful picture, some examples include:

Going back to scaling trends, if we maintain the requirement of compute-optimality, they show matching scaling trends:

The obvious disclaimer here is that equal training FLOPS != equal wall-clock training time, given that the implementation of those FLOPS varies. In reality, this translates into more expensive training runs, since the Hardware FLOPS Utilisation (HFU) won't be as high and therefore requires longer cluster usage. Unfortunately, they don't share any concrete details but, just as we'll see in a bit, 50% less inference FLOPS are on the table, which could warrant a slightly more expensive training (much like the current trend of training models beyond compute-optimality for cheaper inference).

quirks

entropy-based dynamic patching

Even when considering only the entropy-based patching, there are quite a few interesting implications! They train this small byte-level LLM (aka the Patcher) on the same data mix as the BLT model, hoping that low/high entropy regions of the Patcher should strongly correlate with those of the BLT13. The implication of this is that the BLT ends up being able to dedicate less compute per byte to less surprising sub-sequences (since more bytes get included into a patch) or more compute to more surprising sub-sequences. Interestingly, this gives the architecture a bounded anti-fragile property in that it's able to gain (dedicate more compute = better performance) from uncertainty (higher entropy) for OOD or near-OOD events.

Unlike the "less compute" case, the "more compute" case has a hard cap at 1 patch being 1 byte. Given that the less compute (higher compression) case enables the global transformer to be more efficient in the sequence dimension, it's able to squeeze more bytes into the same context length. Since the Patcher (that determines the surprise) is an LLM it also inherits the interesting properties of LLMs where in-context sequences become less surprising and are also further compressed! For the purpose of gaining intuition of this property, tinker around with this HF Space to see how compute will be distributed across your prompt and compare against tiktoken (GPT models) and Llama 3:

In the paper, the authors explicitly address this in-context patching with (what I'd regard as) a hack: flushing the context window on newlines to avoid "entropy drift", where the patching behaviour departs from its designed efficiency/difficulty property and instead impinges on performance in downstream tasks such as MMLU (reasoning):

As the hack gets addressed, the variable-compression property seems to make this architecture appealing for its harmonious combination with reasoning. Reasoning models are foundational (haha) to current frontier models, but they tend to quickly clog up their context window with long reasoning traces14 and end up spending more tokens handling issues due to tokenization. I'd be interested to see follow-up work combining a BLT-like architecture with reasoning to see the impact.

Since the average patch size is determined by some threshold, the authors show that they're able to change the patch size at inference time (i.e from a higher threshold to a lower one) so a model trained on larger patches can continue to work on smaller patches. This exists as another lever which can be used on a task-dependent basis. While nice to have at our disposal, we can anticipate it to be quite fragile and its lifetime to be tied to the eventual success of training the entropy patcher in an end-to-end fashion.

patch size and FLOPS

When modulating patch size, only the global model's FLOPS contribution is affected. The local models' FLOPS contribution won't change since they operate at the byte level. Taking that into consideration, larger patch sizes at a smaller total BLT model size will cause the local models to make up a more significant share of the total FLOPS. Larger patch sizes should therefore be scaled in tandem with the global model's size to properly distribute FLOPS. For this reason, at patch size = 8, they're able to grow the total model parameters to 1.7x its tokenized equivalent for the same inference budget:

If you're curious about this relationship between bytes, patch size and model compute, see how the FLOPS distribution changes as you change the parameters:
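In lieu of that widget, here's a crude estimate using the usual ~2 × parameters FLOPs-per-position rule (it ignores attention's quadratic term, and the parameter counts below are made up purely for illustration):

def flops_per_byte(global_params: float, local_params: float, patch_size: float):
    g = 2 * global_params / patch_size   # the global model sees one patch per `patch_size` bytes
    loc = 2 * local_params               # the local encoder + decoder touch every byte
    return g, loc

for ps in (4, 6, 8):
    g, loc = flops_per_byte(global_params=6.4e9, local_params=0.6e9, patch_size=ps)
    # larger patches shift the per-byte FLOPs share away from the global model
    print(f"patch size {ps}: global model's share of FLOPs/byte = {g / (g + loc):.0%}")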

n-gram hash embeddings

Surprisingly, the n-gram hash embeddings account for a total embedding table of shape[3_000_000, 256], since each of the 6 n-gram hash groups has 500K embeddings. They aren't included in the parameter count nor in the FLOPS (as they assume it's implemented as an efficient lookup table), which is in line with OpenAI's scaling laws paper.

But you might be thinking: other LLMs have vocab sizes of 256K and the final linear layer demands non-trivial amounts of FLOPs, so why isn't this an issue here? Since these embeddings are used only to nudge the byte-level embeddings by the neighbouring n-grams' embeddings, but are never candidates for prediction themselves, the costly linear layer is avoided. They serve to imbue byte-level positions with some context at no theoretical cost15.

Given the emphasis on the compute-controlled study, I assume that the n-gram hash embeddings are a method of offloading FLOPs from the architecture via feature engineering. In the ablations, one finds clues. For this paper, it's Table 9 where they're ablating how 10 total local layers should be split amongst the local encoder/decoder. They show that with a sufficiently parametrised local encoder, the n-gram hash embeddings register no meaningful impact16:

Besides their ablations, they claim that at the 8B scale, going from 500K to 300K hashes per group changed performance by 0.001 bpb on 15K steps from which they highlight how crucial they are to performance at larger scales. I struggle to follow that conclusion given that their 8B configuration still has $l_\mathcal{E}=1$. The authors also note that when they're training patch size 8 models, they're using 3 encoder layers instead of 1 giving us an idea as to how quickly the feature engineered n-gram hash embeddings become insufficient.

tokens to bytes in record time?

They run an experiment with initialising the global model from Llama 3.1's weights, train it for 220B tokens with a 10x lower learning rate for the global model than random init'd local models. Once complete, this model would have cumulatively been trained on 15.2T tokens. Presumably, they chose 220B tokens since that's ~1T bytes which was the crossover point in their fixed inference scaling study against the Llama 3 4B model.

They compare it against a randomly init'd BLT (trained on 200B tokens) and the original Llama 3 (trained on 15T tokens) to find that it does worse than the model it's init'd from, but better than if BLT were trained from scratch:

In the interest of mitigating sunk cost and promoting adoption, it's cool that this "works", but given they conclude that "byte-ifying loses some of the performance" and write it off as needing "further work to take full advantage", it isn't feasible as a "quick" conversion to byte-level. However, if you're one of the big labs wanting to run an experiment to check the feasibility of reducing your inference costs without lobotomising your model, this is decent news. And if this architecture truly does contribute to unlocking an additional scaling dimension and significantly reduces inference FLOPS, that conversion cost probably matters less.

implications

With the supply chain groaning to satisfy the blistering demand for intelligence across the economy, the total share of GPUs for research continues to be under pressure. Until more clusters come online, it might mean that in practice fewer FLOPS at inference is not only a cost reduction measure but also a means of affording more of the FLOPS budget to research!

Even in the event that BLT training is less efficient w.r.t HFU, given this shift to serving being an increasing cost, it might be a tradeoff that carries positive ROI. Consider that mid/post-training has some amount of the training budget dedicated to it to handle lower resource languages, vocab extension, failure of tokenization etc. If a subset of that budget isn't needed and goes to a less efficient HFU BLT training, will it come out to a clear positive ROI?

If the multiscale style architecture proliferates, we'll also see serving change (similar to industrial adoption of spec decoding) given the varying frequency and memory footprint with which these byte-level and patch-level models run. Given their experiments were run against Llama 3 where they almost 2x the dense model parameters, we can only imagine what usage along with MoE will do to cluster HBM requirements.

If sequence compression truly is pushed into an end-to-end model, sharing tokenizers as a static entity across models might be a thing of the past but it could be substituted for transfer-learning from patcher/encoder/decoder models into your specific domain. A few more "boring" impacts can be found in the appendix.

future

what does the next iteration look like?

With this externalised and separately trained Patcher, BLT still has some BPE-type fragility in that it depends on a component that requires its own separate tweaking, which might lead it to also become a fragile upstream dependency (but less so than BPE, hopefully).

What does a multi-modal BLT look like? If n-gram hash embeddings have to be replaced, learned modality-specific pre-processing into modality-specific embedding tables could be possible. More than that, I'd expect a deeper encoder but if the need for some additional structured signal remains, possibly some charformer-like GBST block generalised to multiple dimensions might be useful.

Multi-modality will require some dynamic patch boundary predictor, since the current Patcher LLM is limited to sliding window attention of 512 bytes. Given that a 640x640 RGB image is over 1 million bytes, this probably breaks down quickly. On this point, I'd expect to see the reliance on flushing context for resolving "entropy drift" to be addressed with a more robust solution, or its need completely negated as the Patcher LLM gets integrated or trained jointly with the BLT.

In an interview, MEGABYTE's first author (also author on BLT) mentions how they were exploring (at that time) what a multi-scale transformer operating on some pre-tokenized input would look like. Given the lack of mention in the BLT paper, it might have turned out to be a dead end which didn't quite fit in the paper so I'd be less inclined to expect this in the next iteration.

Rather than having this external patch boundary detector and multi scale transformers to handle the dynamic patches, does the desire for adaptive compute leak into the tokenizers directly like in the compute adaptive tokenizers paper?

Will the Bitter Lesson prevail? How far will the path of externalised tokenizers with early fusion on existing architectures take us? Will iterations that increase complexity in the architecture, decoupling processing by modality or possibly by task, find the right balance? If history is our guide and external factors like compute and data continue to trend upwards, we can expect that these will be stepping stones to the Ultimate Architecture (it might be big).

Thanks to Christopher Fleetwood, Jochem Gietema, Philip Botros for their feedback on this post.


appendix

BLT Granular Mechanics

I wanted to think through every step of this model at inference time to fully grok it and had this higher-level pass through already written down. I thought it'd be a shame to not share as there are probably others in the same situation since other explanations aren't quite accurate. A useful mental model is that for each patch, we're squeezing out as many bytes as possible until the small autoregressive LLM's (Patcher's) entropy threshold decides that patch has been exhausted. Each patch is not predicted via next token prediction in $l_G$ but rather created by pooling bytes once the Patcher determines a patch boundary.

We can split decoding into two stages - inter-patch prediction and intra-patch prediction. I'm omitting the hash n-gram embeddings from the description. I'll stick to the notation from the paper, worth calling out that anything subscript by $i$ is referencing byte-level, $j$ is referencing patch-level.

For clarity's sake, I'll take a bs=1 scenario mid-decoding where we start with the intra-patch decoding case. We have a list of patch lengths [1, 4, 4] determined by the small autoregressive LLM (i.e the Patcher). Assuming $h_\mathcal{E} = 256$ and $h_{G} = 1024$, the byte-level local representation has shape[1, 9, 256] (denoted as $x_i$) and the patch-level global representation $o_j$ has already been passed through $l_G$ and has shape[1, 2, 1024]. Notice how we only have 2 patches even though we have 3 patch lengths.

After sampling to get shape[1, 10] (denoted as $b_i$), the Patcher's entropy threshold isn't reached and so no new patch is determined. We now expect to re-use the patch-level global representation $p_j$ with shape[1, 2, 1024]. The next step of the decoder requires the byte-level representation $b_{i}$ with shape[1, 10] to be embedded to $x_i$ with shape[1, 10, 256], and to go through $l_\mathcal{E}$'s transformer blocks to get the latent representation such that it can be used as the Query for the $l_{D}$'s Cross Attention (CA). Assuming a KV cache exists attached to the CA, we're able to do another full step through the $l_D$'s CA and transformer layers, and sample the next position so we have a new $b_i$ with shape[1, 11].

If this turns out to be an end-of-patch event, we have new patch lengths [1, 4, 6]. This is the inter-patch prediction which requires us to get the new $o_j$ patch representations such that we have a new patch to squeeze out the next set of bytes!

It starts with the $b_i$ shape[1, 11] becoming $x_i$ with shape[1, 11, 256], which goes through the first transformer layer in $l_\mathcal{E}$, into a downsample (i.e pooling) and projection to become the patch-based Query for the $l_\mathcal{E}$ CA, followed by the rest of $l_\mathcal{E}$. This gives us a new patch-level representation $p_j$ with shape[1, 3, 1024] that goes through $l_{G}$, gets enriched and becomes $o_j$ with shape[1, 3, 1024], and then a normal $l_D$ forward pass (i.e with $l_\mathcal{E}$'s byte-based transformer output skip connection as the Query, $o_j$ as the patch-based KV for the $l_D$'s CA) from which we can then begin the intra-patch decoding by sampling the new first byte position of this provisional patch!

boring impacts

If tokenizers aren't standardised, what does pricing look like? It's structured exactly the same! Patches are just tokens but with variable byte compression, and the concept of tokens has such strong customer understanding that if you can avoid introducing new terminology that adds complication for your customer, you would (applies to all but OpenAI with their great model naming conventions). Similarly, even though OpenAI are yet to roll out SLAs, Microsoft's Azure OpenAI Service does have SLAs for latency which wouldn't be materially affected.

It does, however, add a level of obscurity for all model customers, given it will be less obvious how you're being charged. But pricing obscurity is already the norm - for reasoning models, reasoning traces aren't shown yet customers are charged for them - which shows that a bit of obscurity isn't a deterrent for adoption.

footnotes

  1. It's worth noting that there are also mixed findings on these traits for different tasks, as this survey for machine translation reports that character models "show neither better domain robustness, nor better morphological generalization, despite being often so motivated" but do show "The only clear advantage of character models is high robustness towards source-side noise."

  2. With an imbalanced encoder/decoder setup, inference times are going to vary on the basis of the task type where short inputs and/or long targets are more favourable for a ByT5 encoder-heavy architecture.

  3. In this vein, MrT5 addresses the compute problem by introducing token deletion gates to force the model to learn to merge input tokens into a more compact sequence, reducing sequence length by 50% and reaping the speedups. EvaByte, meanwhile, stays architecturally "simpler" by incorporating multi-byte prediction & a linearized attention variant.

  4. for autoregressive models, guaranteeing no information leakage from future positions

  5. Another model is also evaluated "CANINE-S" where the model uses a tokenizer for pre-training targets for the masked language modelling task.

  6. A note in honour of the authors, this paper seems undeservingly un-cited in the MEGABYTE & Byte Latent Transformer papers

  7. At this point, scores across blocks are independent and so they introduce an optional dot product across all scores to enable the model to consider scores across blocks when weighting the subword block representations. Meaning: $\hat{P}=\operatorname{softmax}\left(P P^{\top}\right) P$

  8. It strongly resembles the work of the Funnel Transformer but doesn't target the language modelling case - the hourglass paper gets the mention instead given that it works autoregressively and has follow-up work that closely aligns with this post's design goal.

  9. Follow up work Toucan removes this requirement for one of the boundary predictor cases by "reusing" the shared representations used to decode each character and they're able to speed up decoding by 2x.

  10. A few of the methods for triggering the oracle's forward pass for verification are:

    • static K (i.e after 4 tokens, call oracle model)
    • tuned probability threshold (i.e generate until draft model is not confident)
    • staged speculative decoding or methods where you remove the draft model entirely:
  11. Seemingly, the first author's feed is also an indication of the fixation on freeing multi-modality modelling from the tyranny of tokenizers.

  12. "We also compare with the best previously reported numbers for sub-word models. These results may be confounded by differing amounts of compute and tuning used, but show that MEGABYTE gives results competitive with state-of-theart models trained on subwords. These results suggest that MEGABYTE may allow future large language models to be tokenization-free."

  13. they ablate model size and context length to see the scaling laws, seeing diminishing returns at 50M.

  14. this holds true regardless of whether current models are wasting tokens but this may be less of a concern in the future as the RL matures

  15. if they were to use the DeepMind Chinchilla paper's method, the embeddings FLOPS would be included

  16. It's still valid that increasing the encoder layer count will reduce the need for them. It's a bit hard to reason about, though, since their n-gram ablations all have the encoder layer count set to 1. It seems odd that they don't run ablations with a heavy encoder and a light decoder with no n-gram hash embeddings.

i like getting older

2025-06-24 13:16:00

There are seconds here and there when I feel a deep gratitude about getting older. They're really random; I've sometimes held dishes while cleaning them, thinking "I might hate doing this right now, but I'll miss it once I'm no longer on this earth. I'll miss everything, no matter how mundane."

I realize what a privilege getting older is. I get to keep following my goals, I get to finish things I set out to do and watch things conclude. I get to witness animals and plants and everything earth has to offer one more day (I'm scheduling this post to be posted at a later time, let's hope I don't die in-between or else this is gonna age like milk!). I get to see my loved ones one more day, and it's one more day added to how long I have known them.

I get more and more courage to do things the older I get. I have more freedoms because some mental blocks and self-restrictions and insecurities fall away and there's better jobs, shared income, help, a support network. I have more experience to draw from to make better decisions. As time passes and I get further and further away from my teens, I don’t have to be liked by everyone and I'm getting comfortable with the fact that some things I do or say will make others dislike me. I no longer have to impress people I don't even like in the first place! I'm more comfortable with losing people and moving on. More comfortable speaking up, asking for things and not tolerating unbearable behavior.

I realize that for some people, these might have gotten worse as they got older, but not for me.

Highly subjective, but I really find most of childhood, all of teens and most of 20s to be miserable in comparison. You’re always very restricted either by people or means - you’re dependent on so many other people, forced into things you don’t wanna do. You’re expected to ask everyone for permission to do the most basic things, you have little to no money. No one gives a fuck about your opinion, and your emotions are confusing. You’re underestimated left and right, belittled and infantilized. Getting diagnoses is hell because everyone thinks you’re faking it because you’re just too lazy or want an excuse to miss school. Fitting in and the opinions of your parents and teachers and friends and bullies and randoms on the street dictate everything.

I would rather shoot myself in the head than start over. 26 and onward are the best years of my life so far, all things considered, even when 2024 was so hard and almost killed me. I will be 30 at the end of the year and I couldn’t be more grateful for that because for a lot of my life I planned to be dead by 18.

Reply via email

learning piano is FREAKING hard

2025-06-24 08:20:00

After my brief few months of taking Spanish audio lessons, I decided to pick up a hobby that was more interactive and had a better feedback loop. Learning Spanish is great, but I don't really have anyone to practice with, and going for walks and listening to the lessons is very passive. I need something active!!

So I searched online and found a private instructor in San Diego, bought a cheap $100 keyboard, and got to work.

The first few weeks were great. Getting back into music after a roughly 10 year break was easier than I thought. I used to play percussion instruments in high school and owned a drum kit until just a few years ago. This helped accelerate my piano learning, since I could already count a beat, and I knew how to read most of the notations on sheet music, like time signatures, accents, and dynamics. Learning the notes and where they fall on the staff (not to mention reading 2 staffs at one time) became my biggest hurdle.

So with all that I was making rapid progress, but now it's starting to come to a screeching halt. Learning piano feels like learning how to completely rewire your brain. The most challenging aspect is getting both your hands to coordinate with each other, while still being able to read the music in front of you. I am in awe at people that have been doing this their entire lives. Playing simple music gave me such a confidence boost, but now that I'm starting to play somewhat more challenging music, I feel hopeless and humbled as hell.

Right now, I'm learning a baby version of Bach's Minuet in G Major. I also wanted to try learning a song I actually like, so I bought the sheet music for Beach Bunny's Prom Queen (an absolute banger song).

If any of you people out there have any suggestions on how to best learn piano, feel free to email me down below! I'm also looking to upgrade my cheap $100 keyboard now that I'm more committed to this hobby, so again if anyone has a suggestion, I'd love to hear.


like this post? discuss it on the forums at basementcommunity.com. Or you can also send me an email!

notes on a heatwave

2025-06-24 06:48:51

Most of the things I do in my life walk a fine border between stupidity and curiosity. For one, I tend to ignore severe weather alerts. I never go too far - you won’t find me at the waterfront during a hurricane - but I like to be reminded of the natural power of the Earth at its strongest and most extreme moments. I don’t want to forget about it, as it seems so far away when I spend my days indoors at work with my eyes clinging to 24-inch monitor screens. I love outside. I love nature, despite all the peril she can inflict on our little human lives.

And in her power, she has given us a colossal heatwave today. I’m writing to you from my red and blue plaid picnic blanket, perched on the grass on the side of a small incline. I’m poorly positioned as my other writing utensils and water bottle keep rolling away from me. The 40 degree weather is making my eyes feel like they’re bulging out of my head. My senses feel heightened, every body sensation more noticeable. I’m grateful for the weather’s ability to make me cognisant of every drop of sweat down my neck, of every laborious breath, of every piece of skin cooking under the light of the sun. If I focus, I swear I can hear my melanin cells absorbing the UV.

It’s the hottest day here that I can remember in recent history. Somehow, it was only a few weeks ago I was complaining about the snow left on the ground. It’s hot enough here that people will die today. Pre-existing health issues, old age, and no A/C are a terrible combination in this heat that used to be so rare. It is only June. What do they mean when they say it can get hotter than this? I repeat the old climate activists’ adage to myself, “it is the coldest summer for the rest of my life”.

A small nuisance as I write this page, I keep having to recline backwards so as to stop my sweat from dripping onto these pages and spreading the ink, making my words indecipherable. It mimics the same cloudy way the words are feeling in my head.

I’ve maintained my fascination with weather since I was a little girl. I remember in the -25 degree hellscape of winter I would give away my thick winter coat at recess to the kids who came to school a bit under-prepared. I’d like to tell you that I did that because I was an unselfish child - but this would be a lie. The shedding of my coat was an experiment in seeing how much cold I could endure, and to feel the power of the Earth on my bare skin, chilling through the layers of my body. I always chickened out eventually. Just in time to prevent frostbite.

I have the urge to chicken out again as the hot day is turning into a hot night. There is no reprieve in sight. I wonder when the last reprieve of my lifetime will come. I can only be grateful that the hope is still there.

In the haze of this new evening I am listening to Ora Cogan’s Formless. It is a psychedelic country album that feels exactly like the air outside.

I am snacking on fresh strawberries from the big farm outside of town - so ripe it’s painful.

I hope you have your reprieve, wherever you are.

yours,

sam


special thanks: to this song for carrying me through the heat:

Formless by Ora Cogan

🖤 new toast button 🖤

2025-06-24 05:20:00

~ 3 minutes to read

I changed my toast button to be a heart! I also removed the text that displayed the total toasts, and replaced it with as many hearts as there are toasts on a post.

What it looks like with one toast and you haven't toasted it yet: The text "Leave a like" followed by a plus sign and one heart

What it looks like with multiple toasts and after you've toasted: The text "Leave a like" followed by four hearts

Here's the JavaScript that gets the total number of toasts and adds that number of hearts.

<script>
document.addEventListener("DOMContentLoaded", function(){
  // get the total number of toasts
  let countText = document.querySelector('.upvote-count');
  let total = parseInt(countText.innerText, 10);
  let count = total > 1 ? total : 1;

  // find the upvote button and add the hearts as span elements
  let heartList = document.querySelector('.upvote-button');
  heartList.title = 'Thank you 🖤';

  for (let i = 0; i < count; i++) {
    let heartspan = document.createElement('span');
    let heart = document.createTextNode('🖤');
    heartspan.appendChild(heart);
    heartList.appendChild(heartspan);
  }
});
</script>

This is what the DOM looks like after that. I tried adding all of the hearts in one span, and also outside of the button, but to get all the hearts and the plus sign aligned how I wanted them, adding them as individual spans in the button ended up working the best.

A screenshot of some HTML showing a bunch of span elements with hearts inside of them

And here's the CSS:


/* lets the text and hearts sit next to each other */
#upvote-form {
  display: flex !important;
}

/* adds the text before the button */
#upvote-form::before {
  content: "Leave a like:";
  margin-right: .3rem;
  background-color: var(--second-color);
  height: 26px;
  min-width: 124px;
}

/* lays out the hearts so they're next to each other and wrap when there's a lot */
.upvote-button {
  position: relative;
  display: flex;
  font-size: 16px;
  flex-wrap: wrap;
  flex-direction: row !important;
}

/* hides the default icon and text */
.upvote-button svg,
.upvote-button small {
	display: none;
}


/* changes the button content from + to 🖤 depending on its status */
.upvote-button::before,
.upvote-button:hover::before {
  content: '+';
  display: block;
  font-size: 20px;
  margin: 0 5px;
}

.upvote-button:disabled::before,
.upvote-button.upvoted::before,
.upvote-button.upvoted:hover::before {
  content: '🖤' !important;
  display: block;
  font-size: 16px;
  margin: 0;
}

I think I also found a weird bug because after I changed the CSS for the upvote button, I was able to upvote my posts again? Kinda handy because I needed to test the button, but not sure if that's supposed to happen.

Shout out to these two posts that I referenced to change the toast button to an emoji:

  • Written: 23 Jun, 2025
  • Reply via email