2025-05-09 21:24:20
After I wrote the code to make the bar graph of papal names for the previous post, I decided to reuse the code to make a similar graph for monarchs of England. Just as there are complications in counting papal names, there are even more complications in counting the names of English monarchs.
Who was the first king of England? I went with Æthelstan (reigned 927–939). Was Lady Jane Grey queen of England? Not for my chart. Note that Edward the Elder and Edward the Martyr came before Edward I.
Incidentally, John is the most common name for a pope and one of the least common for a king of England. Several monarch names appear only once, but John’s is conspicuously unrepeated: he was such an odious king that the name has never been reused. I remember my world history teacher saying there would never be another English king named John, something I found disappointing at the time.
2025-05-09 19:35:37
The new pope chose the name Leo XIV. That made me curious about the distribution of papal names, and so I made the graph below. (I’m Protestant, so this wasn’t familiar to me.)
Looks like Leo is tied with Clement for fourth place, the top three names being John, Benedict, and Gregory.
There are a few oddities in counting the names due to the time in the Middle Ages when there was disagreement over who was pope. For this reason some popes are listed twice, sorta like how Grover Cleveland and Donald Trump each appear twice in the list of US presidents. And although the last pope named John was John XXIII, 21 popes have been named John: there was no John XX due to a clerical error, and John XVI was declared an antipope.
I also made a higher resolution PDF.
The post Frequency of papal names first appeared on John D. Cook.

2025-05-09 04:25:53
The recent post Converting between quaternions and rotation matrices describes an algorithm as “a mathematically correct but numerically suboptimal method.”
Let’s just look at a small piece of that post. The post explains how to find a rotation matrix describing the same rotation as pre- and post-multiplying by a unit quaternion q, and how to recover q from a rotation matrix. The latter involves
q0 = ½ √(1 + r11 + r22 + r33).
If you look at the definition of the r terms you’ll see that the quantity under the square root equals 4q0², and so the term is positive. In theory. In floating point arithmetic, the term you’re taking the square root of could be small or even slightly negative.
I mentioned at the bottom of the earlier post that I came up with a unit quaternion q that caused the code to throw an exception because it attempted to take the square root of a negative number.
There’s an easy way to keep the code from throwing an exception: check whether the argument is positive before taking the square root. If it’s positive, forge ahead. If it’s negative, then round it up to zero.
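Here’s a minimal sketch of that guard in Python. The function name clamped_sqrt is mine, not from any library, and the usage line is only illustrative.

import numpy as np

def clamped_sqrt(x):
    # x is theoretically nonnegative; round a slightly negative argument up
    # to zero so floating point rounding error can't cause an exception
    return np.sqrt(x if x > 0 else 0.0)

# e.g. q0 = 0.5 * clamped_sqrt(1 + r11 + r22 + r33)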
This makes sense. The argument is theoretically positive, so rounding it up to zero can’t do any harm, and in fact it would do a tiny bit of good. But this is missing something.
As a general rule of thumb, if an algorithm has trouble when a quantity is negative, it might have trouble when the quantity is small but positive too.
How could the term
1 + r11 + r22 + r33
which is theoretically positive be computed as a negative number? When all precision is lost in the sum. And this happens when
r11 + r22 + r33
is nearly −1, i.e. when you’re subtracting nearly equal numbers. In general, if two positive numbers a and b are the same up to n bits, you can lose n bits of precision when subtracting them. You could lose all of your precision if two numbers agree to n bits and all you have are n bits.
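To make this concrete, here’s a small illustration with a made-up tiny value of q0. For a unit quaternion, the trace r11 + r22 + r33 equals 4q0² − 1, so when q0 is tiny the sum 1 + r11 + r22 + r33 is formed from numbers that agree to nearly all their bits.

q0 = 1e-9                 # hypothetical tiny first component
trace = 4*q0**2 - 1       # what r11 + r22 + r33 equals in exact arithmetic
print(1 + trace)          # prints 0.0: the true value, about 4e-18, is completely lost
print(4*q0**2)            # prints about 4e-18: the value we should have gotten

And with independent rounding errors in r11, r22, and r33, the computed sum can just as easily land slightly below zero, which is how the exception arises.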
Putting in a “breaker” to prevent taking the square root of a negative number keeps the code from throwing an exception, but it doesn’t address the real problem: code that doesn’t throw an exception and yet returns an inaccurate result. Maybe you’re subtracting two numbers that don’t agree to all the bits of precision you have, but they agree to more bits than you’d care to lose.
The solution is to come up with another expression for q0 and the other components, an expression that is equal in theory but numerically less sensitive, i.e. one that avoids subtracting nearly equal numbers. That’s what the authors do in the paper cited in the previous post.
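Here is a sketch of the idea in Python for the q0 component. The two branches are equal in exact arithmetic (both reduce to |q0| for a true rotation matrix), but the second avoids forming 1 + r11 + r22 + r33 when the trace is near −1. This is only meant to illustrate the strategy; it is not a transcription of the formulas in the paper, and the branch threshold is a choice.

import numpy as np

def q0_from_rotation(R):
    r11, r22, r33 = R[0, 0], R[1, 1], R[2, 2]
    trace = r11 + r22 + r33
    if trace > 0:
        # safe region: no catastrophic cancellation in 1 + trace
        return 0.5 * np.sqrt(1 + trace)
    else:
        # theoretically equal expression built from off-diagonal elements,
        # which retain the information that cancellation destroys above
        num = (R[2, 1] - R[1, 2])**2 + (R[0, 2] - R[2, 0])**2 + (R[1, 0] - R[0, 1])**2
        return 0.5 * np.sqrt(num / (3 - trace))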
The post Square root of a small number first appeared on John D. Cook.

2025-05-09 00:50:13
Large language models display emergent behaviors: when the parameter count is scaled up to a certain value, suddenly the LLM is capable of performing a new task that was not possible at a smaller size. Some say the abruptness of this change is merely a spurious artifact of how it is measured. Even so, many would like to understand, predict, and even facilitate the emergence of these capabilities.
The following is not a mathematical proof, but a plausibility argument as to why such behavior should not be surprising, along with a possible mechanism. I’ll start with simple cases and work up to more complex ones.
An obvious point. Emergence is ubiquitous in nature. Ice near the freezing point that is slightly heated suddenly becomes drinkable (phase change). An undrivable car with three wheels gets a fourth wheel and is suddenly drivable. Nonlinearity exists in nature.
A simple example: consider fitting N arbitrary points in one dimension by linear regression with a monomial basis. For any degree less than N−1, for most possible sets of data points (excluding “special” cases, like collinear points or, more generally, points lying exactly on a lower-degree polynomial), the regression error will be nonzero, and, reciprocally, the accuracy will be some finite value. Increase the degree to N−1, so that the number of monomials (the parameter count) reaches N, and suddenly the error drops to zero and the accuracy jumps to infinity.
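A quick numerical sketch of this, using numpy.polyfit on made-up data (the specific numbers are arbitrary):

import numpy as np

np.random.seed(0)
N = 6
x = np.linspace(0, 1, N)
y = np.random.normal(size=N)        # "arbitrary" points, almost surely not on a low-degree polynomial

for degree in range(N):
    coeffs = np.polyfit(x, y, degree)               # least-squares fit with monomials up to x**degree
    error = np.linalg.norm(np.polyval(coeffs, x) - y)
    print(degree, error)

# The error is of moderate size for every degree below N-1, then drops to
# essentially zero at degree N-1, where the N parameters can interpolate the N points.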
Similarly for k-means clustering: if the data falls into N clusters and one runs k-means with K < N cluster centers, the error will be significant, but when K = N, suddenly the cluster centers can model all the clusters well and the error drops dramatically.
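The same effect in a few lines, using SciPy’s k-means on synthetic one-dimensional data with N = 5 well-separated clusters (all numbers here are made up for illustration):

import numpy as np
from scipy.cluster.vq import kmeans

np.random.seed(0)
N = 5
centers = 10.0 * np.arange(N)
data = np.concatenate([c + 0.1 * np.random.normal(size=50) for c in centers]).reshape(-1, 1)

for K in range(1, N + 1):
    _, distortion = kmeans(data, K)   # mean distance from each point to its nearest center
    print(K, distortion)

# The distortion stays large for K < N, then drops by more than an order of
# magnitude at K = N, when every cluster can get its own center.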
Consider all Boolean circuits composed from some fixed, logically complete set of gate types. Now consider the goal of constructing a Boolean circuit that takes a single byte representing the integer N and increments it to N+1, modulo 256 (8 bits input, 8 bits output). Clearly such a circuit exists, for example the standard chain of 1-bit add-and-carry circuits. Note that one can in principle enumerate all possible circuits of finite gate count. It is manifest that an integer K > 0 exists for which no circuit with fewer than K gates solves the problem but some circuit with K gates does. The standard chain of 8 1-bit adders might be such a minimizer, or maybe the optimal circuit is more exotic (for example see here, though this method is not guaranteed to compute a minimizer).
One would thus see this capability potentially emerge as soon as one reaches a gate budget of K gates. Now, one could argue that for a smaller gate budget a partial result might be possible, for example incrementing any 7-bit number, so the increase in capability is continuous, not emergent or wholly new. However, if all you care about is correctly incrementing any byte (for example, for manipulating ASCII text), then it’s all or nothing; there’s no partial credit. Even so, the gate budget required for incrementing 8-bit integers is only slightly higher than for 7-bit integers, but this minor increase in gate count doubles the quantity of integers that can be incremented, which might be perceived as a surprising, unexpected (emergent) jump.
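For concreteness, here is the standard chain of 1-bit add-and-carry steps mentioned above, written in Python rather than as gates; each step uses one XOR and one AND.

def increment_byte(bits):
    # bits: list of 8 values in {0, 1}, least significant bit first
    out, carry = [], 1
    for b in bits:
        out.append(b ^ carry)   # sum bit of a half adder
        carry = b & carry       # carry into the next position
    return out                  # dropping the final carry gives arithmetic mod 256

print(increment_byte([1, 1, 1, 1, 1, 1, 1, 1]))   # 255 + 1 = 0 mod 256: all zeros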
The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks the final LLM will be capable of, as defined by the architecture and the training process (in particular, the specific mix of training data). These tasks are implemented as “algorithms” (circuits) within the LLM. The algorithms are mixed together and (to some extent) overlap in a complex way that is difficult to analyze.
Suppose one of these desired capabilities is some task X. Suppose all possible input/output pairs for this operation are represented in the training data (or, maybe not—maybe some parts of the algorithm can be interpolated from the training data). The LLM is trained with SGD, typically with 2-norm minimization. The unit ball in the 2-norm is a sphere in high dimensional space. Thus “all directions” of the loss are pressed down equally by the minimization process—which is to say, the LLM is optimized on all the inputs for many, many tasks, not just task X. The limited parameter bit budget must be spread across many, many other tasks the LLM must be trained to do. As LLMs of increasing size are trained, at some point enough parameter bits in the budget will be allocatable to represent a fully accurate algorithm for task X, and at this point the substantially accurate capability to do “task X” will be perceivable—“suddenly.”
Task X could be the 8-bit incrementer, which from an optimal circuit standpoint would manifest emergence, as described above. However, due to the weakness of the SGD training methodology and possibly the architecture, there is evidence that LLM training does not learn optimal arithmetic circuits at all but instead does arithmetic by a “bag of heuristics” (which incidentally really is, itself, an algorithm, albeit a piecemeal one). In this case, gradually adding more and more heuristics might be perceived to increase the number of correct answers in a somewhat more incremental way, to be sure. However, this approach is not scalable—to perform accurate arithmetic for any number of digits, if one does not use an exact arithmetic algorithm or circuit, one must use increasingly more heuristics to increase coverage to try to capture all possible inputs accurately. And still, transitioning from an approximate to an exact 8-bit incrementer might in practice be perceived as an abrupt new capability, albeit a small one for this example.
One could alternatively consider tool use (for example, a calculator function that is external to the LLM proper), but then a new tool must be written for every new task, and the LLM needs to understand how to use the tool. (Maybe at some point LLMs will know how to write and use their own algorithmic tools?)
The real question is how we can predict when a new LLM will achieve some new capability X. For example, X = “Write a short story that resonates with the social mood of the present time and is a runaway hit” (and do the same thing again once a year based on new data, indefinitely into the future, without failure). We don’t know an “algorithm” for this, and we can’t even begin to guess the required parameter budget or the training data needed. That’s the point of using an LLM: its training internally “discovers” new, never-before-seen algorithms from data, algorithms that would be difficult for humans to formulate or express from first principles. Perhaps there is some indirect way of predicting the emergence of such an X, but it doesn’t seem obvious on the face of it how to predict this directly.
Based on these examples, it would seem not at all surprising for LLMs to exhibit emergent behaviors, though encountering them in practice may still be startling. Predicting them may be possible to a limited extent, but in the general case it seems really hard.
Do you have any thoughts? If so, please leave them in the comments.
The post Why do LLMs have emergent properties? first appeared on John D. Cook.

2025-05-07 21:52:27
In the previous post I wrote about representing rotations with quaternions. This representation has several advantages, such as making it clear how rotations compose. Rotations are often represented as matrices, and so it’s useful to be able to go between the two representations.
A unit-length quaternion (q0, q1, q2, q3) represents a rotation by an angle θ around an axis in the direction of (q1, q2, q3) where cos(θ/2) = q0. The corresponding rotation matrix is given below.
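In terms of the components of q (this is the same matrix the Python code later in the post constructs), the matrix is

R = [ 2(q0² + q1²) − 1    2(q1q2 − q0q3)      2(q1q3 + q0q2)   ]
    [ 2(q1q2 + q0q3)      2(q0² + q2²) − 1    2(q2q3 − q0q1)   ]
    [ 2(q1q3 − q0q2)      2(q2q3 + q0q1)      2(q0² + q3²) − 1 ]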
Going the other way around, inferring a quaternion representation from a rotation matrix, is harder. Here is a mathematically correct but numerically suboptimal method known [1] as the Chiaverini-Siciliano method.
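Stated explicitly (these are the same formulas the implementation below uses):

q0 = ½ √(1 + r11 + r22 + r33)
q1 = ½ √(1 + r11 − r22 − r33) · sgn(r32 − r23)
q2 = ½ √(1 − r11 + r22 − r33) · sgn(r13 − r31)
q3 = ½ √(1 − r11 − r22 + r33) · sgn(r21 − r12)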
Here sgn is the sign function; sgn(x) equals 1 if x is positive and −1 if x is negative. Note that the components only depend on the diagonal of the rotation matrix, aside from the sign terms. Better numerical algorithms make more use of the off-diagonal elements.
Something seems a little suspicious here. Quaternions contain four real numbers, and 3 by 3 matrices contain nine. How can four numbers determine nine numbers? And going the other way, out of the nine, we essentially choose three that determine the four components of a quaternion.
Quaternions have four degrees of freedom, but we’re using unit quaternions, so there are basically three degrees of freedom. Likewise orthogonal matrices have three degrees of freedom. An axis of rotation is a point on a sphere, so that accounts for two degrees of freedom, and the angle of rotation is the third degree of freedom.
In topological terms, the unit quaternions and the set of 3 by 3 rotation matrices are both three-dimensional manifolds, and the former is a double cover of the latter. It is a double cover because a unit quaternion q corresponds to the same rotation as −q.
Implementing the equations above is straightforward.
import numpy as np

def quaternion_to_rotation_matrix(q):
    q0, q1, q2, q3 = q
    return np.array([
        [2*(q0**2 + q1**2) - 1, 2*(q1*q2 - q0*q3),     2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),     2*(q0**2 + q2**2) - 1, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),     2*(q2*q3 + q0*q1),     2*(q0**2 + q3**2) - 1]
    ])

def rotation_matrix_to_quaternion(R):
    r11, r12, r13 = R[0, 0], R[0, 1], R[0, 2]
    r21, r22, r23 = R[1, 0], R[1, 1], R[1, 2]
    r31, r32, r33 = R[2, 0], R[2, 1], R[2, 2]

    # Calculate quaternion components
    q0 = 0.5 * np.sqrt(1 + r11 + r22 + r33)
    q1 = 0.5 * np.sqrt(1 + r11 - r22 - r33) * np.sign(r32 - r23)
    q2 = 0.5 * np.sqrt(1 - r11 + r22 - r33) * np.sign(r13 - r31)
    q3 = 0.5 * np.sqrt(1 - r11 - r22 + r33) * np.sign(r21 - r12)

    return np.array([q0, q1, q2, q3])
We’d like to test the code above by generating random quaternions, converting the quaternions to rotation matrices, then back to quaternions to verify that the round trip puts us back essentially where we started. Then we’d like to go the other way around, starting with randomly generated rotation matrices.
To generate a random unit quaternion, we generate a vector of four independent normal random values, then normalize by dividing by its length. (See this recent post.)
To generate a random rotation matrix, we use a generator that is part of SciPy.
Here’s the test code:
from scipy.stats import norm, special_ortho_group

def randomq():
    q = norm.rvs(size=4)
    return q/np.linalg.norm(q)

def randomR():
    return special_ortho_group.rvs(dim=3)

np.random.seed(20250507)
N = 10

for _ in range(N):
    q = randomq()
    R = quaternion_to_rotation_matrix(q)
    t = rotation_matrix_to_quaternion(R)
    print(np.linalg.norm(q - t))

for _ in range(N):
    R = randomR()
    q = rotation_matrix_to_quaternion(R)
    T = quaternion_to_rotation_matrix(q)
    print(np.linalg.norm(R - T))
The first test utterly fails, returning six 2s, i.e. the round trip vector is as far as possible from the vector we started with. How could that happen? It must be returning the negative of the original vector. Now go back to the discussion above about double covers: q and −q correspond to the same rotation.
If we go back and add the line
q *= np.sign(q[0])
then we standardize our random vectors to have a positive first component, just like the vectors returned by rotation_matrix_to_quaternion.
Now our tests all return norms on the order of 10⁻¹⁶ to 10⁻¹⁴. There’s a little room to improve the accuracy, but the results are good.
Update: I did some more random testing, and found errors on the order of 10⁻¹⁰. Then I was able to create a test case where rotation_matrix_to_quaternion threw an exception because one of the square roots had a negative argument. In [1] the authors get around this problem by evaluating two theoretically equivalent expressions for each of the square root arguments. The expressions are complementary in the sense that both should not lead to numerical difficulties at the same time.
[1] See “Accurate Computation of Quaternions from Rotation Matrices” by Soheil Sarabandi and Federico Thomas for a better numerical algorithm. See also the article “A Survey on the Computation of Quaternions From Rotation Matrices” by the same authors.
The post Converting between quaternions and rotation matrices first appeared on John D. Cook.

2025-05-07 21:50:46
Every rotation in 3D fixes an axis [1]. This is Euler’s rotation theorem from 1775. Another way to state the theorem is that no matter how you rotate a sphere about its center, two points stay fixed.
The composition of two rotations is a rotation. So the first rotation fixes some axis, the second rotation fixes some axis, and the composition fixes some third axis. It’s easy to see what these axes are if we work with quaternions. (Quaternions were discovered decades after Euler proved his rotation theorem.)
A rotation by θ about the axis given by a unit vector v = (v1, v2, v3) corresponds to the quaternion
q = (cos(θ/2), sin(θ/2)v1, sin(θ/2)v2, sin(θ/2)v3).
To rotate a point p = (p1, p2, p3) by an angle θ about the axis v, first embed p as a quaternion by setting its first coordinate to 0:
p → (0, p1, p2, p3)
and multiply the quaternion p on the left by q and on the right by the conjugate of q, written q*. That is, the rotation takes p to
qpq*.
Because q has unit length [2], its conjugate is also its inverse: q* = q⁻¹.
Composing rotations represented by quaternions is simple. Rotating by a quaternion q and then by a quaternion r is the same as rotating by rq. Proof:
r(qpq*)r* = (rq)p(q*r*) = (rq)p(rq)*.
Here we use the property of quaternion conjugates that
(rq)* = q*r*.
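Here’s a small numerical check of this composition rule (my own sketch, not code from the post). The helper functions implement the Hamilton product and the rotation p → qpq* described above.

import numpy as np

def qmult(a, b):
    # Hamilton product of quaternions stored as (q0, q1, q2, q3)
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def conj(a):
    return np.array([a[0], -a[1], -a[2], -a[3]])

def rotate(q, p):
    # embed the 3-vector p as (0, p1, p2, p3) and compute q p q*
    return qmult(qmult(q, np.concatenate(([0.0], p))), conj(q))[1:]

np.random.seed(0)
q = np.random.normal(size=4); q /= np.linalg.norm(q)
r = np.random.normal(size=4); r /= np.linalg.norm(r)
p = np.random.normal(size=3)

# rotating by q and then by r agrees with rotating by rq
print(np.allclose(rotate(r, rotate(q, p)), rotate(qmult(r, q), p)))   # True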
See the next post for how to convert between quaternion and matrix representations of a rotation.
[1] Note that this is not the case in two dimensions, nor is it true in higher even dimensions.
[2] We assumed v had unit length as a vector in ℝ³. So
||q||² = cos²(θ/2) + sin²(θ/2)(v1² + v2² + v3²) = cos²(θ/2) + sin²(θ/2) = 1.
The post Composing rotations using quaternions first appeared on John D. Cook.