RSS preview of the blog of John D. Cook

Approximating even functions by powers of cosine

2026-05-01 08:06:12

A couple days ago I wrote a post about turning a trick into a technique, finding another use for a clever way to construct simple, accurate approximations. I used as my example approximating the Bessel function J0(x) with (1 + cos(x))/2. I learned via a helpful comment on Mathstodon that my approximation was the first-order part of a more general series

J_0(x) = 1 + \frac{\cos(x) - 1}{2} - \frac{(\cos(x) - 1)^2}{48} + \frac{7(\cos(x) - 1)^3}{1440} + \cdots

The first-order approximation has error O(x⁴), as shown in the earlier post. Adding the second-order term makes the error O(x⁶), and adding the third-order term makes it O(x⁸).

I’ve written a few times about cosine approximations to the normal probability density. For example, see this post. We could use the same idea as the series above to approximate the normal density with a series of powers of cosine. This gives us

\exp(-x^2/2) = 1 + (\cos(x) - 1) + \frac{(\cos(x) - 1)^2}{3} + \frac{2(\cos(x) - 1)^3}{45} + \cdots

and as before, the first, second, and third order truncated series have error O(x⁴), O(x⁶), and O(x⁸).
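Both series are easy to check numerically. Here's a quick sanity check, assuming SciPy's `j0`, that the truncation error shrinks at the stated rates.

```python
# Verify the truncation orders of the cosine-power series for J_0(x)
# and exp(-x^2/2): error O(x^4), O(x^6), O(x^8) for orders 1, 2, 3.
import numpy as np
from scipy.special import j0

def series(x, coeffs):
    """Evaluate sum of coeffs[k] * (cos(x) - 1)**k."""
    u = np.cos(x) - 1
    return sum(c * u**k for k, c in enumerate(coeffs))

j0_coeffs  = [1, 1/2, -1/48, 7/1440]   # coefficients for J_0(x)
exp_coeffs = [1, 1, 1/3, 2/45]         # coefficients for exp(-x^2/2)

x = 0.1
for order, p in [(1, 4), (2, 6), (3, 8)]:
    assert abs(series(x, j0_coeffs[:order+1]) - j0(x)) < 10 * x**p
    assert abs(series(x, exp_coeffs[:order+1]) - np.exp(-x**2/2)) < 10 * x**p
```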

The general theory behind what’s going on here is an extension of Bürmann’s theorem. The original version of the theorem relies on a series inversion theorem that in turn relies on the approximating function, in our case cos(x) − 1, not having zero derivative at the center of the series. But there is a more general form of Bürmann’s theorem based on a more general form of series inversion. We will always need a more general version of the theorem when working with even functions because even functions have zero derivative at zero.

Here’s another example, this time using the Bessel function J1, an odd function, which does use the original version of Bürmann’s theorem to approximate J1 by powers of sine.

J_1(x) = \frac{1}{2} \sin(x) + \frac{1}{48} \sin^3(x) + \frac{17}{1920} \sin^5(x) + \cdots

In this case truncating the series after the sinᵏ(x) term gives an error O(xᵏ⁺²).
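As a sanity check, assuming SciPy's `j1`, the error of each truncation behaves as claimed:

```python
# Truncating the sine series for J_1(x) after sin^k(x) should leave
# an error of order O(x^(k+2)).
import numpy as np
from scipy.special import j1

def j1_approx(x, k):
    """Partial sums of the sine series for J_1, through sin^k(x)."""
    s = np.sin(x)
    coeffs = [(1, 1/2), (3, 1/48), (5, 17/1920)]
    return sum(c * s**p for p, c in coeffs if p <= k)

x = 0.1
for k in (1, 3, 5):
    assert abs(j1_approx(x, k) - j1(x)) < 10 * x**(k + 2)
```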

You can find more on Bürmann’s theorem in Whittaker and Watson.

The post Approximating even functions by powers of cosine first appeared on John D. Cook.

Three ways to differentiate ReLU

2026-04-30 22:51:08

When a function is not differentiable in the classical sense there are multiple ways to compute a generalized derivative. This post will look at three generalizations of the classical derivative, each applied to the ReLU (rectified linear unit) function. The ReLU function is a commonly used activation function for neural networks. It’s also called the ramp function for obvious reasons.

The function is simply r(x) = max(0, x).

Pointwise derivative

The pointwise derivative would be 0 for x < 0, 1 for x > 0, and undefined at x = 0. So except at 0, the pointwise derivative of the ramp function is the Heaviside function.

H(x) = \left\{ \begin{array}{ll} 1 & \mbox{if } x \geq 0 \\ 0 & \mbox{if } x < 0 \end{array} \right.

In a real analysis course, you’d simply say r′(x) = H(x) because functions are only defined up to equivalence modulo sets of measure zero, i.e. the definition at x = 0 doesn’t matter.

Distributional derivative

In distribution theory you’d identify the function r(x) with the distribution whose action on a test function φ is

\langle r, \varphi \rangle = \int_{-\infty}^\infty r(x)\, \varphi(x) \, dx

Then the derivative of r would be the distribution r′ satisfying

\langle r^{\prime}, \varphi\rangle = -\langle r, \varphi^{\prime} \rangle

for all smooth functions φ with compact support. You can prove using integration by parts that the above equals the integral of φ from 0 to ∞, which is the same as the action of H(x) on φ.
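Here's a numerical illustration of that integration by parts, a sketch rather than a proof: it uses φ(x) = exp(−x²), which is not compactly supported but decays fast enough for the integrals to converge.

```python
# Check numerically that <r', φ> = -<r, φ'> equals the action of H on φ,
# i.e. the integral of φ from 0 to ∞, using φ(x) = exp(-x^2).
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2)
phi_prime = lambda x: -2 * x * np.exp(-x**2)

# r(x) = max(0, x) vanishes for x < 0, so integrate over (0, ∞)
lhs, _ = quad(lambda x: -x * phi_prime(x), 0, np.inf)
rhs, _ = quad(phi, 0, np.inf)

assert abs(lhs - rhs) < 1e-8   # both equal sqrt(pi)/2
```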

In this case the distributional derivative of r is the same as the pointwise derivative of r interpreted as a distribution. This does not happen in general [1]. For example, the pointwise derivative of H is zero but the distributional derivative of H is δ, the Dirac delta distribution.

For more on distributional derivatives, see How to differentiate a non-differentiable function.

Subgradient

The subgradient of a function f at a point x, written ∂f(x), is the set of slopes of tangent lines to the graph of f at x. If f is differentiable at x, then there is only one slope, namely f′(x), and we typically say the subgradient of f at x is simply f′(x) when strictly speaking we should say it is the one-element set {f′(x)}.

A line tangent to the graph of the ReLU function at a negative value of x has slope 0, and a tangent line at a positive x has slope 1. But because there’s a sharp corner at x = 0, a tangent at this point could have any slope between 0 and 1.

\partial r(x) = \left\{ \begin{array}{cl} 1 & \text{if } x > 0 \\ \left[0,1\right] & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{array} \right.
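A minimal sketch of the subgradient as a set-valued function, with `relu_subgradient` a hypothetical helper returning the interval of admissible slopes. The defining inequality r(y) ≥ r(x) + g(y − x) holds for every g in the set, as the check at the kink illustrates.

```python
# Set-valued subgradient of ReLU, returned as an interval (lo, hi).
def relu_subgradient(x):
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (0.0, 0.0)
    return (0.0, 1.0)   # at the kink, any slope in [0, 1] is a subgradient

relu = lambda x: max(0.0, x)

# Subgradient inequality at x = 0: relu(y) >= g * y for every g in [0, 1]
for g in (0.0, 0.5, 1.0):
    assert all(relu(y) >= g * y for y in (-2.0, -0.5, 0.5, 2.0))
```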

My dissertation was full of subgradients of convex functions. This made me uneasy because subgradients are not real-valued functions; they’re set-valued functions. Most of the time you can blithely ignore this distinction, but there’s always a nagging suspicion that it’s going to bite you unexpectedly.

 

[1] When is the pointwise derivative of f as a function equal to the derivative of f as a distribution? It’s not enough for f to be continuous, but it is sufficient for f to be absolutely continuous.


Turning a trick into a technique

2026-04-29 05:44:58

Someone said a technique is a trick that works twice.

I wanted to see if I could get anything interesting by turning the trick in the previous post into a technique. The trick created a high-order approximation by subtracting a multiple of one even function from another. Even functions only have even-order terms, and by using the right multiple you can cancel out the second-order term as well.

For an example, I’d like to approximate the Bessel function J0(x) by the better known cosine function. Both are even functions.

J0(x) = 1 − x²/4 + x⁴/64 + …
cos(x) = 1 − x²/2 + x⁴/24 + …

and so

2 J0(x) − cos(x) = 1 − x⁴/96 + …

which means

J0(x) ≈ (1 + cos(x))/2

is an excellent approximation for small x.

Let’s try this for a couple examples.

J0(0.2) = 0.990025 and (1 + cos(0.2))/2 = 0.990033.

J0(0.05) = 0.99937510 and (1 + cos(0.05))/2 = 0.99937513.
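These values are easy to reproduce with SciPy as a quick sanity check:

```python
# Reproduce the examples: J0(x) ≈ (1 + cos(x))/2 for small x.
import numpy as np
from scipy.special import j0

for x in (0.2, 0.05):
    approx = (1 + np.cos(x)) / 2
    # leading error term is x^4/192, from 2*J0 - cos = 1 - x^4/96 + ...
    assert abs(approx - j0(x)) < x**4 / 50
```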


Circular arc approximation

2026-04-28 21:09:42

Suppose you have an arc a, a portion of a circle of radius r, and you know two things: the length c of the chord of the arc, and the length b of the chord of half the arc, illustrated below.

Here θ is the central angle of the arc. Then the length of the arc, rθ, is approximately

a = rθ ≈ 12b²/(c + 4b) [1].

If the arc is moderately small, the approximation is very accurate.

This approximation is simple, accurate, and not obvious, much like the one in this post.

Derivation

Let φ = θ/4. Then the angle between the chords b and c is φ. This follows from the inscribed angle theorem, illustrated below.

There are two right triangles in the diagram above that have an angle φ: a smaller triangle with hypotenuse b and a larger triangle with hypotenuse 2r. From the smaller triangle we learn

cos(φ) = c / 2b

and from the larger triangle we learn

sin(φ) = b / 2r.

Now expand in power series.

c/2b = cos(φ) = 1 − φ²/2! + φ⁴/4! − …
2b/a = sin(φ)/φ = 1 − φ²/3! + φ⁴/5! − …

If we multiply 2b/a by 3 and subtract c/2b then the φ² terms cancel out and we get

6b/a − c/2b = 2 − φ⁴/60 + …

and so

6b/a − c/2b ≈ 2

to a very high degree of accuracy when φ is small. The approximation follows by solving for a.

Example

Let θ = π/3 and so φ = 0.26…, not a particularly small value of φ, but small enough for the approximation to work well.

Set r = 1 so a = θ. Then

b = 2 sin(π/12) = 0.51764

and

c = 2b cos(π/12) = 1.

Now in application, we know b and c, not θ, and so pretend we measured b = 0.51764 and c = 1. Then we would approximate a by

12b²/(c + 4b) = 1.04718

while the exact value is 1.04720. Unless you can measure lengths to more than four significant figures, the approximation may as well be exact because the approximation error would be less than the measurement error.
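The worked example, in code, using exact chord lengths rather than measured ones:

```python
# Arc length approximation a ≈ 12 b^2 / (c + 4b) for θ = π/3 on a unit circle.
import numpy as np

theta = np.pi / 3
a = theta                          # exact arc length (r = 1)
b = 2 * np.sin(theta / 4)          # chord of half the arc
c = 2 * np.sin(theta / 2)          # chord of the full arc, equal to 1 here

approx = 12 * b**2 / (c + 4 * b)
assert abs(approx - a) < 5e-5      # agreement to about four decimal places
```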

 

[1] J. M. Bruce. Approximation to a Circular Arc. The American Mathematical Monthly. Vol. 49, No. 3 (March 1942), pp. 184–185


Closed-form solution to the nonlinear pendulum equation

2026-04-26 00:16:12

The previous post looks at the nonlinear pendulum equation and what difference it makes to the solutions if you linearize the equation.

If the initial displacement is small enough, you can simply replace sin θ with θ. If the initial displacement is larger, you can improve the accuracy quite a bit by solving the linearized equation and then adjusting the period.

You can also find an exact solution, but not in terms of elementary functions; you have to use Jacobi elliptic functions. These are functions somewhat analogous to trig functions, though it’s not helpful to try to pin down the analogies. For example, the Jacobi function sn is like the sine function in some ways but very different in others, depending on the range of arguments.

We start with the differential equation

θ″(t) + c² sin( θ(t) ) = 0

where c² = g/L, i.e. the gravitational acceleration divided by the pendulum length, and initial conditions θ(0) = θ0 and θ′(0) = 0. We assume −π < θ0 < π.

Then the solution is

θ(t) = 2 arcsin( a cd(ct | m ) )

where a = sin(θ0/2), m = a², and cd is one of the 12 Jacobi elliptic functions. Note that cd, like all the Jacobi functions, has an argument and a parameter. In the equation above the argument is ct and the parameter is m.

The last plot in the previous post was misleading, showing roughly equal parts genuine difference and error from solving the differential equation numerically. Here’s the code that was used to solve the nonlinear equation.

from scipy.special import ellipj, ellipk
from numpy import sin, cos, pi, linspace, arcsin
from scipy.integrate import solve_ivp

def exact_period(θ):
    # pendulum period (with c = 1) divided by 2π
    return 2*ellipk(sin(θ/2)**2)/pi

def nonlinear_ode(t, z):
    # pendulum equation as a first-order system, with c = 1
    x, y = z
    return [y, -sin(x)]

theta0 = pi/3
b = 2*pi*exact_period(theta0)
t = linspace(0, 2*b, 2000)

sol = solve_ivp(nonlinear_ode, [0, 2*b], [theta0, 0], t_eval=t)

The solution is contained in sol.y[0].

Let’s compare the numerical solution to the exact solution.

def f(t, c, theta0):
    # exact solution θ(t) = 2 arcsin(a cd(ct | m))
    a = sin(theta0/2)
    m = a**2
    sn, cn, dn, ph = ellipj(c*t, m)
    return 2*arcsin(a*cn/dn)  # cd = cn/dn

There are a couple things to note about the code. First, SciPy doesn’t implement the cd function, but it can be computed as cn/dn. Second, the function ellipj returns four functions at once because it takes about as much time to calculate all four as it does to compute one of them.
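Putting the pieces together, here is a self-contained comparison of the numerical and exact solutions, with c = 1 as in the code above, and with tighter solver tolerances so the ODE error doesn't dominate the comparison:

```python
# Compare solve_ivp's solution of θ'' + sin(θ) = 0 to the closed form
# θ(t) = 2 arcsin(a cd(t | m)) with a = sin(θ0/2), m = a^2, c = 1.
import numpy as np
from scipy.special import ellipj, ellipk
from scipy.integrate import solve_ivp

theta0 = np.pi / 3
a = np.sin(theta0 / 2)
m = a**2
period = 4 * ellipk(m)                 # full pendulum period when c = 1
t = np.linspace(0, 2 * period, 2000)

sol = solve_ivp(lambda t, z: [z[1], -np.sin(z[0])],
                [0, 2 * period], [theta0, 0], t_eval=t,
                rtol=1e-10, atol=1e-12)

sn, cn, dn, ph = ellipj(t, m)
exact = 2 * np.arcsin(a * cn / dn)     # cd = cn / dn

assert np.max(np.abs(sol.y[0] - exact)) < 1e-6
```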

Here is a plot of the error in solving the differential equation.

And here is the difference between the exact solution to the nonlinear pendulum equation and the stretched solution to the linear equation.


nth derivative of a quotient

2026-04-25 21:12:39

There’s a nice formula for the nth derivative of a product. It looks a lot like the binomial theorem.

(gh)^{(n)} = \sum_{k=0}^n \binom{n}{k} g^{(k)} h^{(n-k)}

There is also a formula for the nth derivative of a quotient, but it’s more complicated and less known.

We start by writing the quotient rule in an unusual way.

\left(\frac{g}{h}\right)^{(1)} = \frac{1}{h^2} \left| \begin{array}{cc} h & g \\ h^\prime & g^\prime \\ \end{array} \right|

Applying the quotient rule twice gives the following.

\left(\frac{g}{h}\right)^{(2)} = \frac{1}{h^3} \left| \begin{array}{ccc} h & 0 & g \\ h^\prime & h & g^\prime \\ h^{\prime\prime} & 2h^\prime & g^{\prime\prime} \\ \end{array} \right|

And here’s the general rule in all its glory.

\left(\frac{g}{h}\right)^{(n)} = \frac{1}{h^{\,n+1}} \left| \begin{array}{cccccc} h & 0 & 0 & \cdots & 0 & g \\[3pt] h^\prime & h & 0 & \cdots & 0 & g^\prime \\[3pt] h^{\prime\prime} & 2h^\prime & h & \cdots & 0 & g^{\prime\prime} \\[3pt] \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\[3pt] h^{(n)} & \binom{n}{1}h^{(n-1)} & \binom{n}{2}h^{(n-2)} & \cdots & \binom{n}{1}h^\prime & g^{(n)} \end{array} \right|
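The formula can be checked symbolically. Here's a sketch using SymPy (assumed available), with g = sin(x) and h = 1 + x² chosen arbitrarily as test functions:

```python
# Verify the determinant formula for (g/h)^(n) against direct
# differentiation, for small n.
import sympy as sp

x = sp.symbols('x')
g = sp.sin(x)
h = 1 + x**2

def quotient_derivative(g, h, n):
    # Row i, column j (j < n) holds binom(i, j) * h^{(i-j)}, zero above
    # the diagonal; the last column holds g^{(i)}.
    M = sp.zeros(n + 1, n + 1)
    for i in range(n + 1):
        for j in range(n):
            if i >= j:
                M[i, j] = sp.binomial(i, j) * sp.diff(h, x, i - j)
        M[i, n] = sp.diff(g, x, i)
    return M.det() / h**(n + 1)

for n in range(1, 4):
    assert quotient_derivative(g, h, n).equals(sp.diff(g / h, x, n))
```

For n = 1 the matrix is the 2×2 determinant above, so the loop reproduces the ordinary quotient rule as its first case.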

 

Source: V. F. Ivanoff. The nth Derivative of a Fractional Function. The American Mathematical Monthly, Vol. 55, No. 8 (Oct., 1948), p. 491
