Blog of John D. Cook

Trying to fit exponential data

2025-12-22 23:00:06

The first difficulty in trying to fit an exponential curve to data is that the data may not follow an exponential curve. Nothing grows exponentially forever. Eventually growth slows down. The simplest way growth can slow down is to follow a logistic curve, but fitting a logistic curve has its own problems, as detailed in the previous post.

Suppose you are convinced that whatever you want to model follows an exponential curve, at least over the time scale you’re interested in. This is easier to fit than a logistic curve. If you take the logarithm of the data, you now have a linear regression problem. Linear regression is numerically well-behaved and has been thoroughly explored.

There is a catch, however. When you extrapolate a linear regression, your uncertainty region flares out as you go from observed data to predictions based on that data. Your uncertainty grows linearly. But remember that we’re not working with the data per se; we’re working with the logarithm of the data. So on the original scale, the uncertainty flares out exponentially.
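Here’s a minimal sketch of the idea (the simulated data, growth rate, and names are all mine, not from the post). Fit a line to the log of the data; the prediction interval widens roughly linearly on the log scale, and exponentiating makes the band flare exponentially:

    import numpy as np

    rng = np.random.default_rng(0)

    # simulated exponential growth with multiplicative noise
    t = np.linspace(0, 10, 20)
    y = np.exp(0.5 * t + rng.normal(0, 0.1, t.size))

    # taking logs turns the problem into ordinary linear regression
    logy = np.log(y)
    n = t.size
    slope, intercept = np.polyfit(t, logy, 1)

    # residual standard error and spread of t, for prediction intervals
    resid = logy - (intercept + slope * t)
    s = np.sqrt(resid @ resid / (n - 2))
    Sxx = ((t - t.mean()) ** 2).sum()

    # on the log scale the +/- 2 standard error band widens roughly
    # linearly in t; exponentiating makes it flare out exponentially
    for t_star in (12, 14, 16):
        se = s * np.sqrt(1 + 1 / n + (t_star - t.mean()) ** 2 / Sxx)
        mid = intercept + slope * t_star
        print(f"t = {t_star}: ({np.exp(mid - 2 * se):.1f}, {np.exp(mid + 2 * se):.1f})")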


Trying to fit a logistic curve

2025-12-21 05:58:48

A logistic curve, sometimes called an S curve, looks different in different regions. Like the proverbial blind men feeling different parts of an elephant, people looking at different segments of the curve could come to very different impressions of the full picture.
[Figure: logistic curve with extrapolations]

It’s naive to look at the left end and assume the curve will grow exponentially forever, even if the data are statistically indistinguishable from exponential growth.

A slightly less naive approach is to look at the left end, assume logistic growth, and try to infer the parameters of the logistic curve. In the image above, you may be able to forecast the asymptotic value if you have data up to time t = 2, but it would be hopeless to do so with only data up to time t = −2. (This post was motivated by seeing someone trying to extrapolate a logistic curve from just its left tail.)

Suppose you know with absolute certainty that your data have the form

y(t) = \frac{a}{\exp(-b(t - c)) + 1} + \varepsilon

where ε is some small amount of measurement error. The world is not obligated to follow a simple mathematical model, or any mathematical model for that matter, but for this post we will assume that for some inexplicable reason you know the future follows a logistic curve; the only question is what the parameters are.

Furthermore, we only care about fitting the a parameter. That is, we only want to predict the asymptotic value of the curve. This is easier than trying to fit the b or c parameters.

Simulation experiment

I generated 16 random t values between −5 and −2, plugged them into the logistic function with parameters a = 1, b = 1, and c = 0, then added Gaussian noise with standard deviation 0.05.
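Here’s a minimal sketch of one run of that experiment (the seed and code details are my choices; the post doesn’t show its script):

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(t, a, b, c):
        return a / (np.exp(-b * (t - c)) + 1)

    rng = np.random.default_rng(0)                     # seed is arbitrary
    t = rng.uniform(-5, -2, 16)                        # left tail only
    y = logistic(t, 1, 1, 0) + rng.normal(0, 0.05, 16)

    popt, pcov = curve_fit(logistic, t, y)             # may raise RuntimeError
    print("fitted a =", popt[0])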

My intention was to do this 1000 times and report the range of fitted values for a. However, the software I was using (scipy.optimize.curve_fit) failed to converge. Instead it returned the following error message.

    RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.

When you see a message like that, your first response is probably to tweak the code so that it converges. Sometimes that’s the right thing to do, but often such numerical difficulties are trying to tell you that you’re solving the wrong problem.

When I generated points between −5 and 0, the curve_fit algorithm still failed to converge.

When I generated points between −5 and 2, the fitting algorithm converged. The range of a values was from 0.8254 to 1.6965.

When I generated points between −5 and 3, the range of a values was from 0.9039 to 1.1815.

Increasing the number of generated points did not change whether the curve fitting method converged, though it did result in a smaller range of fitted parameter values when it did converge.

I said we’re only interested in fitting the a parameter, but I looked at the other parameters as well. As expected, their fitted values varied over wider ranges.

So in summary, fitting a logistic curve using data only from the left side of the curve, to the left of the inflection point, may fail completely or give results with wide error estimates. And it’s better to have a few points spread throughout the domain of the function than to have a large number of points on one end only.



Regular expressions that cross lines

2025-12-19 22:55:01

One of the fiddly parts of regular expressions is how to handle line breaks. Should regular expression searches be applied one line at a time, or should an entire file be treated as a single line?

This morning I was trying to track down a LaTeX file that said “discussed in the Section” rather than simply “discussed in Section.” I wanted to search on “the Section” to see whether I had a similar error in other files.

Line breaks don’t matter to LaTeX [1], so “the” could be at the end of one line and “Section” at the beginning of another. I found what I was after by using

    grep -Pzo "the\s+Section" foo.tex

Here -P tells grep to use Perl regular expressions. That’s not necessary here, but I imprinted on Perl regular expressions long ago, and I use PCRE (Perl compatible regular expressions) whenever possible so I don’t have to remember the annoying little syntax differences between various regex implementations.

The -z option says to treat the entire file as one long string. This eliminates the line break issue.

The -o option says to output only what the regular expression matches. Otherwise grep will return the matching line. Ordinarily that wouldn’t be so bad, but because of the -z option, the matching line is the entire file.

The \s+ between “the” and “Section” matches one or more whitespace characters, such as a space or a newline.
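A rough Python equivalent (the file name is just for illustration) reads the whole file as one string, the way -z does, and lets \s+ match across newlines:

    import re

    with open("foo.tex") as f:
        text = f.read()              # whole file as one string, like -z

    # \s+ matches spaces and newlines, so matches can cross lines
    for m in re.finditer(r"the\s+Section", text):
        print(m.group())             # print only the match, like -o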

The -P flag is a GNU feature, so it works on Linux. But macOS ships with BSD-derived versions of its utilities, and its version of grep does not support the -P option. On my MacBook I have ggrep mapped to the GNU version of grep.

Another option is to use ripgrep (invoked as rg) rather than grep. It uses Perl-like regular expressions, so there is no need for anything like the -P flag. The analog of -z in ripgrep is -U, so the counterpart of the command above would be

    rg -Uo "the\s+Section" foo.tex

Usually regular expression searches are so fast that execution time doesn’t matter. But when it does matter, ripgrep can be an order of magnitude faster than grep.


[1] LaTeX decides how to break lines in the output independent of line breaks in the input. This allows you to arrange the source file logically rather than aesthetically.


Multiples with no large digits

2025-12-16 23:49:52

Here’s a curious theorem I stumbled across recently [1]. Take an integer N which is not a multiple of 10. Then there is some multiple of N which only contains the digits 1, 2, 3, 4, and 5.

For example, my business phone number 8324228646 has a couple 8s and a couple 6s. But

6312 × 8324228646 = 52542531213552

which contains only digits 1 through 5.
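A multiplier like 6312 can be found by brute force. Here’s a sketch of one way to do it (my own illustration, not the technique of [1]): breadth-first search over remainders mod N, appending one allowed digit at a time.

    from collections import deque

    def small_digit_multiple(n, digits="12345"):
        # BFS over remainders mod n; there are at most n distinct
        # remainders, so the search terminates.
        start = [(int(d) % n, d) for d in digits]
        seen = {r for r, _ in start}
        queue = deque(start)
        while queue:
            r, s = queue.popleft()
            if r == 0:
                return int(s)   # s is a multiple of n with allowed digits
            for d in digits:
                r2 = (r * 10 + int(d)) % n
                if r2 not in seen:
                    seen.add(r2)
                    queue.append((r2, s + d))
        return None             # happens when n is a multiple of 10

    print(small_digit_multiple(19))   # 114 = 6 × 19

This runs quickly when N is small, since the search visits each remainder at most once. For a ten-digit N like the phone number above, the remainder space is too large for this naive search, though the theorem still guarantees a multiple exists.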

For a general base b, let p be the smallest prime factor of b. Then for every integer N that is not a multiple of b, there is some multiple of N whose base b representation contains only the digits 1, 2, 3, …, b/p.

This means that for every number N that is not a multiple of 16, there is some k such that the hex representation of kN contains only the digits 1 through 8. For example, if we take the magic number at the beginning of every Java class file, 0xCAFEBABE, we find

1341 × CAFEBABE₁₆ = 42758583546₁₆.

In the examples above, we’re looking for a multiple containing only half the possible digits. If the smallest prime dividing the base is larger than 2, then we can find a multiple with digits in a smaller range. For example, in base 35 we can find a multiple containing only the digits 1 through 7.

[1] Gregory Galperin and Michael Reid. Multiples Without Large Digits. The American Mathematical Monthly, Vol. 126, No. 10 (December 2019), pp. 950-951.


Golden iteration

2025-12-12 23:31:14

The expression

\sqrt{1 + \sqrt{1 + \sqrt{1 + \cdots}}}

converges to the golden ratio φ. Another way to say this is that the sequence defined by x₀ = 1 and

x_{n+1} = \sqrt{x_n + 1}

for n ≥ 0 converges to φ. This post will be about how it converges.

I wrote a little script to look at the error in approximating φ by xₙ and noticed that the error is about three times smaller at each step. Here’s why that observation was correct.

The ratio of the error at one step to the error at the previous step is

\frac{\sqrt{x + 1} - \varphi}{x - \varphi}

If x = φ + ε the expression above becomes

\frac{\sqrt{\varphi + \epsilon + 1} - \varphi}{\epsilon} = \frac{1}{\sqrt{2(3 + \sqrt{5})}} - \frac{\epsilon}{\sqrt{8} (3 + \sqrt{5})^{3/2}} + {\cal O}(\epsilon^2)

when you expand as a Taylor series in ε centered at 0. This says the error is multiplied by a factor of about

\frac{1}{\sqrt{2(3 + \sqrt{5})}} = 0.309016994\ldots

at each step. The next term in the Taylor series is approximately −0.03ε, so the exact rate of convergence is slightly faster at first, but essentially the error is multiplied by 0.309 at each iteration.
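As a numerical check, here’s the sort of script that shows this (a reconstruction; the post doesn’t include its code). It prints the error at each step and the ratio of successive errors, which settles near 0.309:

    from math import sqrt

    phi = (1 + sqrt(5)) / 2          # golden ratio

    x = 1.0
    prev = x - phi
    for n in range(1, 11):
        x = sqrt(x + 1)
        err = x - phi
        print(f"n = {n:2d}  error = {err:+.2e}  ratio = {err / prev:.6f}")
        prev = err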


Just change the key

2025-12-12 02:37:41

When I was a kid, I suppose sometime in my early teens, I was interested in music theory, but I couldn’t play piano. One time I asked a lady who played piano at our church to play a piece of sheet music for me so I could hear how it sounded. The music was in the key of A, but she played it in A♭. She didn’t say she was going to change the key, but I could tell from looking at her hands that she had.

[Figure: key signatures of A and A♭]

I was shocked by the audacity of changing the music to be what you wanted it to be rather than playing what was on the page. I was in band, and there you certainly don’t decide unilaterally that you’re going to play in a different key!

In retrospect, what the pianist was doing makes sense. Hymns are very often in the key of A♭. One reason is that it’s a comfortable key for SATB singing. Another is that when many hymns are in the same key, it’s easy to go directly from one into another. If a traditional hymn is not in A♭, it’s probably in a key with flats, like B♭ or D♭. (Contemporary church music is often in keys with sharps because guitarists like open strings, which leads to keys like A or E.)

The pianist wasn’t a great musician, but she was good enough. Picking her key was a coping mechanism that worked well. Unless someone in the congregation has perfect pitch, you can change a song from the key of D to the key of D♭ and nobody will know.

There’s something to be said for clever coping mechanisms, especially if they’re declared: “You asked for A. Is it OK if I give you B?” It’s better than saying “Sorry, I can’t help you.”
