2026-03-04 22:15:02
It is sometimes said that neural networks are “just” logistic regression. (Remember neural networks? LLMs are neural networks, but nobody talks about neural networks anymore.) In some sense a neural network is logistic regression with more parameters, a lot more parameters, but more is different. New phenomena emerge at scale that could not have been anticipated at a smaller scale.
Logistic regression can work surprisingly well on small data sets. One of my clients filed a patent on a simple logistic regression model I created for them. You can’t patent logistic regression—the idea goes back to the 1840s—but you can patent its application to a particular problem. Or at least you can try; I don’t know whether the patent was ever granted.
Some of the clinical trial models that we developed at MD Anderson Cancer Center were built on Bayesian logistic regression. These methods were used to run early phase clinical trials, with dozens of patients. Far from “big data.” Because we had modest amounts of data, our models could not be very complicated, though we tried. The idea was that informative priors would let you fit more parameters than would otherwise be possible. That idea was partially correct, though it leads to a sensitive dependence on priors.
When you don’t have enough data, additional parameters do more harm than good, at least in the classical setting. Over-parameterization is bad in classical models, though over-parameterization can be good for neural networks. So for a small data set you commonly have only two parameters. With a larger data set you might have three or four.
There is a rule of thumb that you need at least 10 events per parameter (EVP) [1]. For example, if you’re looking at an outcome that happens say 20% of the time, you need about 50 data points per parameter. If you’re analyzing a clinical trial with 200 patients, you could fit a four-parameter model. But those four parameters had better pull their weight, and so you typically compute some sort of information criterion (AIC, BIC, DIC, etc.) to judge whether the data justify a particular set of parameters. Statisticians agonize over each parameter because it really matters.
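The arithmetic behind this rule of thumb can be sketched in a few lines of bash. The function names are made up for illustration, and the threshold of 10 events per parameter is the rule of thumb quoted above, not a hard limit.

```shell
#!/bin/bash
# Rule-of-thumb sample size arithmetic (integer math only).
# events_per_param=10 is the rule of thumb from the text; the
# function names are hypothetical, chosen for this sketch.
events_per_param=10

# Data points needed per parameter when the outcome occurs
# rate_pct percent of the time.
points_per_param() {
    local rate_pct=$1
    echo $(( events_per_param * 100 / rate_pct ))
}

# Largest number of parameters the rule of thumb allows for
# n observations with an event rate of rate_pct percent.
max_params() {
    local n=$1 rate_pct=$2
    echo $(( n * rate_pct / 100 / events_per_param ))
}

points_per_param 20     # prints 50
max_params 200 20       # prints 4
```

With a 20% event rate, 200 patients yield about 40 events, which is where the four-parameter limit comes from.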
Imagine working in the world of modest-sized data sets, carefully considering one parameter at a time for inclusion in a model, and hearing about people fitting models with millions, and later billions, of parameters. It just sounds insane. And sometimes it is insane [2]. And yet it can work. Not automatically; developing large models is still a bit of a black art. But large models can do amazing things.
How do LLMs compare to logistic regression as far as the ratio of data points to parameters? Various scaling laws have been suggested. These laws have some basis in theory, but they’re largely empirical, not derived from first principles. “Open” AI no longer shares stats on the size of their training data or the number of parameters they use, but other models do, and as a very rough rule of thumb, models are trained using around 100 tokens per parameter, which is not very different from the EVP rule of thumb for logistic regression.
Simply counting tokens and parameters doesn’t tell the full story. In a logistic regression model, data are typically binary variables, or maybe categorical variables coming from a small number of possibilities. Parameters are floating point values, typically 64 bits, but maybe the parameter values are important to three decimal places or 10 bits. In the example above, 200 samples of 4 binary variables determine 4 ten-bit parameters, so 20 bits of data for every bit of parameter. If the inputs were 10-bit numbers, there would be 200 bits of data for every bit of parameter.
When training an LLM, a token is typically a 32-bit number, not a binary variable. And a parameter might be a 32-bit number, but quantized to 8 bits for inference [3]. If a model uses 100 tokens per parameter, that corresponds to 400 bits of training data per inference parameter bit.
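The bit counts in the last two paragraphs can be checked with integer arithmetic. This is only a sketch using the round numbers from the text; the variable names are mine.

```shell
#!/bin/bash
# Ratio of data bits to parameter bits, using the figures in the text.

# Logistic regression: 200 samples, 4 inputs, 4 parameters at ~10 bits each.
samples=200; inputs=4; params=4; bits_per_param=10

lr_binary=$(( samples * inputs * 1  / (params * bits_per_param) ))   # 1-bit inputs
lr_10bit=$((  samples * inputs * 10 / (params * bits_per_param) ))   # 10-bit inputs

# LLM: 100 tokens per parameter, 32-bit tokens, 8-bit inference parameters.
tokens_per_param=100; bits_per_token=32; bits_per_llm_param=8
llm=$(( tokens_per_param * bits_per_token / bits_per_llm_param ))

echo "logistic regression, binary inputs: $lr_binary"   # 20
echo "logistic regression, 10-bit inputs: $lr_10bit"    # 200
echo "LLM: $llm"                                        # 400
```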
In short, the ratio of data bits to parameter bits is roughly similar between logistic regression and LLMs. I find that surprising, especially because there’s a sort of no man’s land between [2] a handful of parameters and billions of parameters.
[1] P Peduzzi, J Concato, E Kemper, T R Holford, A R Feinstein. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 1996 Dec; 49(12):1373-9. doi: 10.1016/s0895-4356(96)00236-3.
[2] A lot of times neural networks don’t scale down to the small data regime well at all. It took a lot of audacity to believe that models would perform disproportionately better with more training data. Classical statistics gives you good reason to expect diminishing returns, not increasing returns.
[3] There has been a lot of work lately to find low precision parameters directly. So you might find 16-bit parameters rather than finding 32-bit parameters then quantizing to 16 bits.
The post From logistic regression to AI first appeared on John D. Cook.
2026-03-04 22:04:30
I was working with a colleague recently on a project involving the use of the OpenAI API.
I brought up the idea that it might be possible to improve the accuracy of API responses by modifying the API call to increase the amount of reasoning performed.
My colleague quickly asked ChatGPT if this was possible, and the answer came back “No, it’s not possible to do that.” Then I asked essentially the same question of my own instance of ChatGPT, and the answer was, “Yes, you can do it, but you need to use the OpenAI Responses API.”
How did we get such different answers? Was it the wording of the prompt? Was it the custom instructions given in the account personalization, where you describe who you are and how you want ChatGPT to respond? Was it a different conversation history? Many factors could have contributed to the different responses. Unfortunately, many of these factors are either not easily controllable at the user level or not convenient to vary in a protracted trial and error search.
I’ve had other times when I will first get a highly standardized, generic answer from ChatGPT, even in Thinking mode, that I know is not quite right or just seems off. Then when I push back, I may get a profoundly different answer.
It’s simply a fact that large language models are conditional probabilistic systems that do not guarantee reproducibility in practice, even given the same inputs, even at temperature=0 [1]. Their outputs depend sensitively on prompt wording, context window contents, system instructions, and model configuration. Small differences in these inputs can yield substantially different outputs.
How well an AI chatbot responds can obviously have a massive impact on how effective the tool will be for your use case. Differences in responses could materially affect the outcome of your project. I take this as a wake-up call to be persistent, vigilant and flexible in attempts to obtain reliable answers from these new AI tools.
[1] (some) sources of nondeterminism: floating point / GPU nondeterminism, differing order of operations from distributed collectives, ties or near-ties in token probabilities, backend/infrastructure changes, model routing, hidden system prompt differences or tool availability.
The post An AI Odyssey, Part 2: Prompting Peril first appeared on John D. Cook.
2026-03-03 10:40:22
I recently talked with a contact who repeated what he’d heard regarding agentic AI systems—namely, that they can greatly increase productivity in professional financial management tasks. However, I pointed out that though this is true, these tools do not guarantee correctness, so one has to be very careful letting them manage critical assets such as financial data.
It is widely recognized that AI models, even reasoning models and agentic systems, can make mistakes. One example is a case showing that one of the most recent and capable AI models made multiple factual mistakes in drawing together information for a single slide of a presentation. Sure, people can give examples where agentic systems can perform amazing tasks. But it’s another question as to whether they can always do them reliably. Unfortunately, we do not yet have procedural frameworks that provide reliability guarantees comparable to those required in other high-stakes engineering domains.
Many leading researchers have acknowledged that current AI systems have a degree of technical unpredictability that has not been resolved. For example, one has recently said, “Anyone who has worked with AI models understands that there is a basic unpredictability to them, that in a purely technical way we have not solved.”
Manufacturing has the notion of Six Sigma quality, to reduce the level of manufacturing defects to an extremely low level. In computing, the correctness requirements are even higher, sometimes necessitating provable correctness. The Pentium FDIV bug in the 1990s caused actual errors in calculations to occur in the wild, even though the chance of error was supposedly “rare.” These were silent errors that might have occurred undetected in mission critical applications, leading to failure. This being said, the Pentium FDIV error modes were precisely definable, whereas AI models are probabilistic, making it much harder to bound the errors.
Exact correctness is at times considered so important that there is an entire discipline, known as formal verification, to prove specified correctness properties for critical hardware and software systems (for example, the manufacture of computing devices). These methods play a key role in multi-billion dollar industries.
When provable correctness is not available, having at least a rigorous certification process (see here for one effort) is a step in the right direction.
It has long been recognized that reliability is a fundamental challenge in modern AI systems. Despite dramatic advances in capability, strong correctness guarantees remain an open technical problem. The central question is how to build AI systems whose behavior can be bounded, verified, or certified at domain-appropriate levels. Until this is satisfactorily resolved, we should use these incredibly useful tools in smart ways that do not create unnecessary risks.
The post An AI Odyssey, Part 1: Correctness Conundrum first appeared on John D. Cook.
2026-03-03 00:08:49
In grad school I specialized in differential equations, but never worked with delay-differential equations, equations specifying that a solution depends not only on its derivatives but also on the state of the function at a previous time. The first time I worked with a delay-differential equation would come a couple decades later when I did some modeling work for a pharmaceutical company.
Large delays can change the qualitative behavior of a differential equation, but it seems plausible that sufficiently small delays should not. This is correct, and we will show just how small “sufficiently small” is in a simple special case. We’ll look at the equation
x′(t) = a x(t) + b x(t − τ)
where the coefficients a and b are non-zero real constants and the delay τ is a positive constant. Then [1] proves that the equation above has the same qualitative behavior as the same equation with the delay removed, i.e. with τ = 0, provided τ is small enough. Here “small enough” means
−1/e < bτ exp(−aτ) < e
and
aτ < 1.
There is a further hypothesis for the theorem cited above, a technical condition that holds on a nowhere dense set. The solution to a first order delay-differential equation like the one we’re looking at here is not determined by an initial condition x(0) = x0 alone. We have to specify the solution over the interval [−τ, 0]. This can be any function of t, subject only to a technical condition that holds on a nowhere-dense set of initial conditions. See [1] for details.
Let’s look at a specific example,
x′(t) = −3 x(t) + 2 x(t − τ)
with the initial condition x(1) = 1. If there were no delay term τ, the solution would be x(t) = exp(1 − t). In this case the solution monotonically decays to zero.
The theorem above says we should expect the same behavior as long as
−1/e < 2τ exp(3τ) < e
which holds as long as τ < 0.404218.
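The bound on τ can be reproduced numerically. Since τ is positive, the left inequality holds automatically, so the threshold is where 2τ exp(3τ) = e. Here is a minimal sketch that finds it by bisection, using awk from a bash script for the floating point work. The bracketing interval [0, 1] and the iteration count are my choices, not from the paper.

```shell
#!/bin/bash
# Find the tau at which 2*tau*exp(3*tau) = e by bisection.
# The left side is increasing for tau > 0, so the root is unique.
tau_max=$(awk 'BEGIN {
    e = exp(1)
    lo = 0; hi = 1           # f(0) = 0 < e and f(1) = 2*exp(3) > e
    for (i = 0; i < 60; i++) {
        mid = (lo + hi) / 2
        if (2 * mid * exp(3 * mid) < e) lo = mid; else hi = mid
    }
    printf "%.6f\n", lo
}')
echo "$tau_max"              # approximately 0.4042
```

The second condition, aτ < 1, needs no computation here because a = −3 is negative.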
Let’s solve our equation for the case τ = 0.4 using Mathematica.
tau = 0.4
solution = NDSolveValue[
{x'[t] == -3 x[t] + 2 x[t - tau], x[t /; t <= 1] == t },
x, {t, 0, 10}]
Plot[solution[t], {t, 0, 10}, PlotRange -> All]
This produces the following plot.

The solution initially ramps up to 1, because that’s what we specified, but it seems that eventually the solution monotonically decays to 0, just as when τ = 0.
When we change the delay to τ = 3 and rerun the code we get oscillations.

[1] R. D. Driver, D. W. Sasser, M. L. Slater. The Equation x′(t) = ax(t) + bx(t − τ) with “Small” Delay. The American Mathematical Monthly, Vol. 80, No. 9 (Nov., 1973), pp. 990–995
The post Differential equation with a small delay first appeared on John D. Cook.
2026-03-02 02:15:16
After writing the previous post, I poked around in the bash shell documentation and found a handy feature I’d never seen before, the shortcut ~-.
I frequently use the command cd - to return to the previous working directory, but didn’t know about ~- as a shortcut for the shell variable $OLDPWD which contains the name of the previous working directory.
Here’s how I will be using this feature now that I know about it. Fairly often I work in two directories, moving back and forth between them using cd -, and need to compare files in the two locations. If I have files in both directories with the same name, say notes.org, I can diff them by running
diff notes.org ~-/notes.org
I was curious why I’d never run into ~- before. Maybe it’s a relatively recent bash feature? No, it’s been there since bash was released in 1989. The feature was part of C shell before that, though not part of Bourne shell.
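Here is a small sketch for convincing yourself that ~- and $OLDPWD agree. The two directories are throwaway temporary directories created just for the demonstration, and the variable names are made up.

```shell
#!/bin/bash
# Demonstrate that ~- expands to $OLDPWD.
# The two directories are temporary and exist only for this sketch.
dir_a=$(mktemp -d)
dir_b=$(mktemp -d)

cd "$dir_a"
cd "$dir_b"          # OLDPWD is now dir_a

tilde_minus=~-       # tilde expansion also happens in assignments
[ "$tilde_minus" = "$OLDPWD" ] && echo "~- is $OLDPWD"

# Compare same-named files in the current and previous directories.
echo "one" > notes.org
echo "two" > ~-/notes.org
if diff -q notes.org ~-/notes.org > /dev/null; then
    same=yes
else
    same=no          # the files differ, so this branch runs
fi
echo "files identical: $same"
```

Note that ~- is only expanded when unquoted, which is why it appears literally inside the double-quoted echo string.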
2026-03-01 02:20:37
I’ve never been good at shell scripting. I’d much rather write scripts in a general purpose language like Python. But occasionally a shell script can do something so simply that it’s worth writing a shell script.
Sometimes a shell scripting feature is terse and cryptic precisely because it solves a common problem succinctly. One example of this is working with file extensions.
For example, maybe you have a script that takes a source file name like foo.java and needs to do something with the class file foo.class. In my case, I had a script that takes a LaTeX file name and needs to create the corresponding DVI and SVG file names.
Here’s a little script to create an SVG file from a LaTeX file.
#!/bin/bash
latex "$1"
dvisvgm --no-fonts "${1%.tex}.dvi" -o "${1%.tex}.svg"
The pattern ${parameter%word} is a bash shell parameter expansion that removes the shortest match to the pattern word from the end of the expansion of parameter. So if $1 equals foo.tex, then
${1%.tex}
evaluates to
foo
and so
${1%.tex}.dvi
and
${1%.tex}.svg
expand to foo.dvi and foo.svg.
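The % operator has a few close relatives that follow the same logic. Here is a brief sketch applied to a made-up path; the file names are hypothetical.

```shell
#!/bin/bash
# Related parameter expansions, applied to a hypothetical path.
f="/home/user/paper.draft.tex"

echo "${f%.tex}"     # /home/user/paper.draft   (shortest suffix match removed)
echo "${f%.*}"       # /home/user/paper.draft   (strip last extension, whatever it is)
echo "${f%%.*}"      # /home/user/paper         (longest suffix match removed)
echo "${f##*/}"      # paper.draft.tex          (longest prefix match: strip directory)
echo "${f##*.}"      # tex                      (keep only the extension)

# If the suffix does not match, the value is returned unchanged.
g="notes.org"
echo "${g%.tex}"     # notes.org
```

The last case matters for the script above: if it is called with a name that does not end in .tex, the expansion leaves the name alone and simply appends .dvi or .svg to it.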
You can get much fancier with shell parameter expansions if you’d like. See the documentation here.
The post Working with file extensions in bash scripts first appeared on John D. Cook.