Blog of John D. Cook

Bi-twin prime chains

2026-01-10 21:56:54

I mentioned bi-twin prime chains in the previous post, but didn’t say much about them so as not to interrupt the flow of the article.

A pair of prime numbers are called twins if they differ by 2. For example, 17 and 19 are twin primes.

A bi-twin chain is a sequence of twin prime pairs in which the average of each pair is double that of the preceding pair. For example,

5, 7, 11, 13

is a bi-twin chain because (5, 7) and (11, 13) are twin pairs of primes, and the average of the latter pair is twice the average of the former pair.

There is some variation in how to describe how long a bi-twin chain is. We will define the length of a bi-twin chain to be the number of prime pairs it contains, and so the chain above has length 2. The BiTwin Record page describes a bi-twin chain of length k as a chain with k − 1 links.

It is widely believed that there are infinitely many twin primes, but this has not been proven. So we don’t know for certain whether there are infinitely many bi-twin prime chains of length 1, much less chains of longer length.

The bi-twin chain of length three with smallest beginning number is

211049, 211051, 422099, 422101, 844199, 844201

Bi-twin records are expressed in terms of primorial numbers. I first mentioned primorial numbers a few days ago and now they come up again. Just as n factorial is the product of the positive integers up to n, n primorial is the product of the primes up to n. The primorial of n is written n#. [1]

The longest known bi-twin chain starts with

112511682470782472978099, 112511682470782472978101

and has length 9.

We can verify the chains mentioned above with the following Python code.

from sympy import isprime

def bitwin_length(average):
    # Count how many times the average of a twin pair can be doubled
    # while there is still a prime on either side of it.
    n = average
    c = 0
    while isprime(n - 1) and isprime(n + 1):
        c += 1
        n *= 2
    return c

# Averages of the first twin pairs of the chains above; prints 2, 3, and 9.
for n in [6, 211050, 112511682470782472978100]:
    print(bitwin_length(n))

[1] There are two conventions for defining n#. The definition used here, and on the BiTwin Record page, is the product of primes up to n. Another convention is to define n# to be the product of the first n primes.

The primorial function in SymPy takes two arguments and will compute either definition, depending on whether the second argument is True or False. The default second argument is True, in which case primorial(n) returns the product of the first n primes.
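For example, assuming the current SymPy interface, here's a quick check of both conventions:

from sympy import primorial

print(primorial(3))             # product of the first 3 primes: 2 * 3 * 5 = 30
print(primorial(3, nth=False))  # product of primes up to 3: 2 * 3 = 6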


Prime chains

2026-01-10 20:32:19

The title of this post has a double meaning. We will look at chains in the sense of number theory and in the sense of cryptocurrency, i.e. Cunningham chains and blockchains, that involve prime numbers.

Cunningham chains

A chain of primes is a sequence of prime numbers in which each is almost double its predecessor. That is, the next number after p is 2p ± 1.

In a Cunningham chain of the first kind, the successor of p is 2p + 1. For example,

41, 83, 167.

In a Cunningham chain of the second kind, the successor of p is 2p − 1. For example,

19, 37, 73.

Two questions come up immediately. First, are there infinitely many Cunningham chains? Second, how long can a Cunningham chain be? What is known and what is conjectured are at opposite ends of the spectrum. It is unknown whether there are infinitely many Cunningham chains of length 2, but it is conjectured that there are infinitely many Cunningham chains of all lengths.

According to this page, the longest known Cunningham chain of the first kind has length 17, and the longest known Cunningham chain of the second kind has length 19. We can verify these results with the following Python code.

from sympy import isprime

def chain_length(start, kind):
    # kind is +1 for a chain of the first kind, -1 for the second kind
    p = start
    c = 0
    while isprime(p):
        c += 1
        p = 2*p + kind
    return c

print(chain_length(2759832934171386593519, 1))       # 17
print(chain_length(79910197721667870187016101, -1))  # 19

Bi-twin chains

A number n is the basis of a bi-twin chain of length k if n − 1 is the start of a Cunningham chain of the first kind of length k and n + 1 is the start of a Cunningham chain of the second kind of length k.
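Equivalently, the length of the bi-twin chain based at n is the minimum of the two Cunningham chain lengths. Here's a quick check using the chain_length function above and the basis of the length-3 chain from the bi-twin post:

n = 211050  # basis of the length-3 bi-twin chain
print(min(chain_length(n - 1, 1), chain_length(n + 1, -1)))  # prints 3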

I say more about bi-twin prime chains in the next post.

Primecoin

Primecoin was one of the first cryptocurrencies, coming out four years after Bitcoin. Primecoin is still going, though its market cap is six orders of magnitude smaller than that of Bitcoin.

What’s interesting about Primecoin is that it uses finding prime chains as its proof of work task. To mint a new Primecoin block, you must find a prime chain of the required length whose origin is a multiple of the hash of the block header [1].

Primecoin allows any of the three kinds of prime chains mentioned above: Cunningham chains of the first or second kind, or a bi-twin prime chain. Primecoin adjusts its mining difficulty over time by varying the length of Cunningham or bi-twin chain needed to mint a new block.


[1] The origin of a prime chain is n if the first item in a Cunningham chain of the second kind is n + 1, or if the first item in a Cunningham chain of the first kind or a bi-twin chain is n − 1.


Compressing a set of hash values

2026-01-09 21:01:39

Suppose you have a set of k hash values, each n bits long. Can you compress the set into less than kn bits?

It’s not possible to compress a list of hashes into less than kn bits, but you can compress a set into fewer bits.

Suppose you have a set of 2³⁰, roughly a billion, 64-bit hashes. Sort the set and look at the size of gaps between elements. You might expect that consecutive items on the list are roughly 2³⁴ apart, and so you could reconstruct the list by reporting the first item and the gaps between the rest, which are 34-bit numbers, not 64-bit numbers, a savings of 30 bits per hash.

This doesn’t exactly work, but it’s the kernel of a good idea. We don’t know that the distance between hashes can be represented by a 34-bit number. The gap could be more or less than 2³⁴, but we don’t expect it to often be much more than 2³⁴. So we use a variable-length encoding such that when the distance between values is on the order of 2³⁴, or less, we save bits, but we allow for the distance to be any size.

Applications

What we’ve described is the essence of Golomb-Rice coding. Its implementation in the Bitcoin protocol is referred to as Golomb-Coded Sets (GCS), described in BIP 158. Golomb-Rice coding is also used in other applications where the values to be compressed are not hashes, such as lossless audio compression.

Encoding

Let’s go into some detail as to how the distances between sorted values are represented. Suppose you expect the differences to be on the order of M where M is a power of 2. For each difference d, let q and r be the quotient and remainder by M, i.e.

d = qM + r.

Encode q as a unary number, i.e. a string of q 1s, and encode r as an ordinary binary number. Then the Golomb-Rice coding of d is the concatenation of the representations of q and r, with a 0 in the middle as a separator. Using || to denote string concatenation we have

unary(q)  ||  0  ||  binary(r).

In general, unary encoding is extremely inefficient, but we’re betting that q will typically be quite small.

The reason we require M to be a power of 2 is so the representation of r will have a fixed length [1].

Let’s work out an example. Suppose M = 2²⁰ and

d = 2²² + 123456 = 4 × M + 123456.

Then we write q as 1111 and r as 00011110001001000000 and encode d as the string

1111000011110001001000000
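Here is a minimal encoder sketch in Python. The function name encode is mine, not something from BIP 158; it reproduces the string above.

def encode(d, M):
    # quotient in unary, a 0 separator, remainder as fixed-width binary
    q, r = divmod(d, M)
    width = M.bit_length() - 1  # log2(M), since M is a power of 2
    return "1" * q + "0" + format(r, f"0{width}b")

print(encode(2**22 + 123456, 2**20))  # 1111000011110001001000000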

Decoding

You can concatenate the encodings of consecutive d values and still be able to unambiguously decode the result. Because the r representations have a constant length, you know when an r ends and the next q begins.

For example, suppose we have the following string of bits.

1110010011001011001011111001000000110010001110011101111000101111011

We decode the string from left to right. We count how many 1s we see, skip over a 0, then regard the next 20 bits as a binary number.

111 0 01001100101100101111 1 0 01000000110010001110 0 11101111000101111011

We see three 1s before the first 0 and so we conclude q1 = 3. We skip over the 0 and read the value of r.

r1 = 01001100101100101111₂ = 314159.

Next we see a single 1 before the next 0, so q2 = 1. We read the next value of r as

r2 = 01000000110010001110₂ = 265358.

Next we see a 0, i.e. there are no 1s, and so the final q3 = 0. And we have

r3 = 11101111000101111011₂ = 979323.

So our reconstructed values of d are

d1 = q1 M + r1 = 3 × 2²⁰ + 314159 = 3459887

d2 = q2 M + r2 = 1 × 2²⁰ + 265358 = 1313934

d3 = q3 M + r3 = 0 × 2²⁰ + 979323 = 979323.

Now these are difference values. We need to know the smallest value m in order to construct the original set of values from the differences. Then the full values are m, m + d1, m + d1 + d2, and m + d1 + d2 + d3.
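A matching decoder sketch, with the name decode again mine, recovers the three differences from the bit string above.

def decode(bits, M):
    # split a concatenation of Golomb-Rice codes back into values
    width = M.bit_length() - 1
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":  # count 1s up to the 0 separator
            q += 1
            i += 1
        i += 1                 # skip the 0 separator
        values.append(q * M + int(bits[i:i+width], 2))
        i += width
    return values

s = "1110010011001011001011111001000000110010001110011101111000101111011"
print(decode(s, 2**20))  # [3459887, 1313934, 979323]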


[1] This is the Rice part. Robert Rice simplified Samuel Golomb’s encoding scheme in the special case that M is a power of 2.


Memorizing chemical element symbols

2026-01-07 21:58:01

Here’s something I’ve wondered about before: are there good mnemonics for chemical element symbols?

Some element symbols are based on Latin or German names and seem arbitrary to English speakers, such as K (kalium) for potassium or Fe (ferrum) for iron. However, these elements are very common and so their names and symbols are familiar.

When you take out the elements whose symbols are mnemonic in another language, every element symbol begins with the first letter of the element name. The tricky part is the second letter. For example, does Ra stand for radon or radium?

The following rule of thumb usually holds whenever there is a chemical symbol that corresponds to the first letters of two different elements:

 The lightest/longest-known element wins.

Scientists didn’t wait until the periodic table was complete before assigning symbols, and the easiest names were handed out first. Calcium (20) was assigned Ca, for example, before cadmium (48) and californium (98) were known.

The elements were discovered roughly in order of atomic weight. For example, beryllium (4) was discovered before berkelium (97) and neon (10) was discovered before neptunium (93). So sometimes you can substitute knowledge of chemistry for knowledge of history. [1]

There are instances where the heavier element got to claim the first-two-letter symbol. Usually in these cases the heavier element was discovered first. That’s why Ra stands for radium (88) and not radon (86). One glaring exception to this rule: palladium (Pd) was discovered a century before protactinium (Pa), and yet protactinium claimed the symbol Pa.

Often the element that was discovered first is more familiar, and so you could almost say that when there’s a conflict, the more familiar element wins. For example, Li stands for lithium and not livermorium. This revises our rule of thumb above:

The lightest/longest-known/most familiar element wins.

To return to the question at the top of the post, I’m not aware of a satisfying set of mnemonics for chemical element symbols. But there are some heuristics. Generally the elements that are the lightest, most familiar, and have been known the longest get the simpler symbols. Maybe you can remember, for example, that berkelium must be Bk because B, Be, and Br were already taken by the time berkelium was discovered.

After using this heuristic, you could apply more brute-force mnemonic techniques for whenever the heuristic doesn’t work. (Whenever it doesn’t work for you: mnemonics are very personal.) For example, you might imagine a registered nurse (an RN) spraying the insecticide Raid on a fish, fish being a Major system encoding of the number 86, the atomic number of radon.


[1] Chemical elements named after scientists, planets, and laboratories appear toward the end of the table and are recent discoveries.


Largest known compositorial prime

2026-01-07 01:11:41

I ran across a blog post here that said a new record has been set for the largest compositorial prime. [1]

OK, so what is a compositorial prime? It is a prime number of the form

n! / n# + 1

where n# denotes n primorial, the product of the prime numbers no greater than n.

The newly discovered prime is

N = 751882!/751882# + 1

It was described in the article cited above as

751882!/751879# + 1,

but the two numbers are the same because there are no primes p with 751879 < p ≤ 751882, i.e.

751879# = 751882#.
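We can verify that this gap contains no primes:

from sympy import isprime
print(any(isprime(k) for k in range(751880, 751883)))  # False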

About how large is N? We can calculate the log of the numerator easily enough:

>>> import scipy.special
>>> scipy.special.loggamma(751883)
9421340.780760147

However, the denominator is harder to compute. According to OEIS we have

n# = exp((1 + o(1)) n)

which would give us the estimate

log(751882#) ≈ 751882.

So

log10(N) = log(N) / log(10) ≈ (9421340 − 751882) / log(10) ≈ 3765097.
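Here’s that arithmetic as a quick Python check, using the approximation log(751882#) ≈ 751882 from above:

from math import log
from scipy.special import loggamma

log_N = loggamma(751883) - 751882  # log of numerator minus approximate log of denominator
print(log_N / log(10))             # ≈ 3.7651 million, matching the estimate above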

According to this page,

log10(N) = 3765620.3395779

and so our approximation above was good to four figures.

So N has between 3 and 4 million digits, making it much smaller than the largest known prime, which has roughly 41 million digits. Overall, N is the 110th largest known prime.

 

[1] I misread the post at first and thought it said there was a new prime record (skipping over the “compositorial” part) and was surprised because the number is not a Mersenne number. For a long time now the largest known prime has been a Mersenne prime because there is a special algorithm for testing whether Mersenne numbers are prime, one that is much more efficient than testing numbers in general.

 


log2(3) and log2(5)

2026-01-06 09:02:08

AlmostSure on X pointed out that

log2 3 ≈ 19/12,

an approximation that’s pretty good relative to the size of the denominator. Approximating log2 5 with equal or better accuracy requires a larger denominator:

log2 5 ≈ 65/28

The observations above are correct, but are they indicative of a more general pattern? Is log2 3 easier to approximate than log2 5 using rational numbers? There are theoretical ways to quantify this (irrationality measures), but they’re hard to compute.

If you look at the series of approximations for both numbers, based on continued fraction convergents, the nth convergent for log2 5 is more accurate than the nth convergent for log2 3, at least for the first 16 terms. After that I ran out of floating point precision and wasn’t sufficiently interested to resort to extended precision.
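Here’s a rough sketch of how one might generate those convergents. The helper convergents is mine, and as noted above, double precision corrupts the continued fraction terms after roughly 15 iterations.

from math import floor, log2

def convergents(x, n):
    # first n continued fraction convergents h/k of x, via the standard recurrence
    h0, k0 = 1, 0
    h1, k1 = floor(x), 1
    yield h1, k1
    for _ in range(n - 1):
        x = 1 / (x - floor(x))
        a = floor(x)
        h0, h1 = h1, a*h1 + h0
        k0, k1 = k1, a*k1 + k0
        yield h1, k1

print(list(convergents(log2(3), 5)))  # ends with (19, 12)
print(list(convergents(log2(5), 3)))  # ends with (65, 28)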

Admittedly this is a non-standard way to evaluate approximation error. Typically you look at the approximation error relative to the size of the denominator, not relative to the index of the convergents.

Here’s a more conventional comparison, plotting the log of approximation error against the log of the denominators.

