Original notebook: statisticalTools.ipynb
Common Statistical Tools
Author: Yannick Copin y.copin@ipnl.in2p3.fr
Abstract: A restricted number of statistical tools are presented, with emphasis on details sometimes overlooked in the literature. This may be too simple, or too technical, or whatever: comments and suggestions are welcome.
[1]:
# NB: this cell includes technical LaTeX definitions.
# $
# \renewcommand{\d}{\mathrm{d}}
# \renewcommand{\v}[1]{\boldsymbol{#1}} % Vector
# \newcommand{\m}[1]{\mathsf{#1}} % Matrix
# %\DeclareMathOperator{\Var}{Var}
# \DeclareMathOperator{\Cov}{Cov}
# %\DeclareMathOperator{\med}{med}
# $
[2]:
import numpy as np
from scipy import stats
from scipy import linalg
import matplotlib.pyplot as plt
[3]:
%matplotlib inline
Sample statistics
We define two test datasets:
Case without intrinsic dispersion: \(\boldsymbol{x} = (x_1,\ldots, x_N)\) is a sample of \(N\) measurements of the same null quantity, with (usually estimated) gaussian measurement uncertainties \(\d x_{i}\) whose squares follow a log-normal distribution (i.e., \(\log\d x_i^2\) follows a normal distribution, here \(\mathcal{N}(0, 0.3^2)\)).
Case with intrinsic dispersion: \(\boldsymbol{y} = (y_1,\ldots, y_N)\) is an intrinsically dispersed \(\mathcal{N}(0,\varsigma^2)\) sample, measured with the same gaussian measurement uncertainties \(\d x_{i}\) as above. Unfortunately, the intrinsic dispersion \(\varsigma\) is usually not known a priori.
[4]:
n = 100
rng = np.random.default_rng(12345) # Random generator
dx = np.sqrt(rng.lognormal(sigma=0.3, size=n)) # Log-normally distributed uncertainties
x = rng.normal(loc=0, scale=dx) # Pure gaussian errors
y = x + rng.normal(size=n) # Gaussian uncertainties + intrinsic dispersion ς=1
[5]:
fig, ax = plt.subplots()
ax.errorbar(np.arange(n), x, yerr=dx, fmt='bo', label='x: no intrinsic dispersion (ς = 0)')
ax.errorbar(np.arange(n) + 0.2, y, yerr=dx, fmt='rs', label='y: x + intrinsic dispersion ς > 0')
ax.set(xlabel="Experiment", ylabel="Value")
ax.legend();
Maximum-likelihood estimators
Sample location
Arithmetic average:
\[\hat{\mu} = \frac{1}{N}\sum_i x_i, \qquad \sigma_{\hat{\mu}} = \frac{\hat{\sigma}}{\sqrt{N}},\]
where \(\hat{\sigma}\) is the sample standard deviation (see Sample scale below).
[6]:
mu = np.mean(x) # Mean
dmu = np.std(x, ddof=1) / np.sqrt(n) # Standard error on mean
print(f"Arithmetic mean: µ = {mu:.4f} ± {dmu:.4f}")
Arithmetic mean: µ = 0.0204 ± 0.1067
Weighted ("optimal") average:
\[\hat{\mu}_w = \frac{\sum_i w_i\,x_i}{\sum_i w_i} \quad\text{with}\quad w_i = 1/\sigma_i^2, \qquad \sigma_{\hat{\mu}_w}^2 = \frac{1}{\sum_i w_i}.\]
[7]:
muw, sumw = np.average(x, weights=dx ** -2, returned=True) # "Optimally" (inverse-variance) weighted mean
dmuw = np.sqrt(1 / sumw) # Uncertainty on weighted mean
print(f"x weighted mean: µ = {muw:.4f} ± {dmuw:.4f}")
print(f"y weighted mean (wrong disp.): µ = {np.average(y, weights=dx ** -2):.4f} ± {dmuw:.4f}")
muyw, sumwy = np.average(y, weights=1 / (1 + dx ** 2), returned=True)
dmuyw = np.sqrt(1 / sumwy)
print(f"y weighted mean (true disp.): µ = {muyw:.4f} ± {dmuyw:.4f}")
x weighted mean: µ = 0.0455 ± 0.0976
y weighted mean (wrong disp.): µ = 0.0441 ± 0.0976
y weighted mean (true disp.): µ = 0.0252 ± 0.1411
Attention:
In the optimally weighted mean, \(\sigma_i^2 = 1/w_{i}\) is supposed to be the total variance of normally-distributed \(x_i\), therefore including both measurement error \(\d x_i\) and intrinsic distribution dispersion \(\varsigma\) if any: \(\sigma_i^2 = \d x_i^2 + \varsigma^2\).
If the weighting is not the true inverse (total) variance, the weighted average remains unbiased but is no longer optimal (i.e. of minimal variance), and its formal uncertainty \(\sigma_{\mu_w}^2 = 1/\sum_i w_i\) no longer reflects the true variance of the estimator.
The reduced \(\chi^2\) around the estimated mean, defined as \(\chi^2_\mu = 1/(N-1) \times \sum_i (x_i - \mu_w)^2 / \sigma_i^2\), should not be significantly different from one if \(\sigma_i\) is compatible with the total (observed) dispersion.
If this is not the case, \(\sigma_i\) has probably been misestimated, either because it ignores the intrinsic dispersion \(\varsigma\), or because the measurement uncertainties \(\d x_i\) are misestimated, or both.
[8]:
print(f"DoF: k = {n - 1}")
chi2 = np.sum(((x - muw) / dx) ** 2)
p = stats.distributions.chi2.sf(chi2, n - 1)
z = max(0, stats.distributions.norm.isf(p)) # z-value
print(f"x (no disp.): χ² = {chi2:6.2f}, p-value = {p:.3f} ({z:.1f}-σ)")
chi2 = np.sum(((y - np.average(y, weights=dx ** -2))/dx) ** 2) # Wrong dispersion: sigma_i**2 = dx**2
p = stats.distributions.chi2.sf(chi2, n - 1)
z = max(0, stats.distributions.norm.isf(p)) # z-value
print(f"y (wrong disp.): χ² = {chi2:6.2f}, p-value = {p:.3f} ({z:.1f}-σ)")
chi2 = np.sum(((y - muyw) / np.hypot(1, dx)) ** 2) # Correct dispersion: sigma_i**2 = sigma**2 + dx**2
p = stats.distributions.chi2.sf(chi2, n - 1)
z = max(0, stats.distributions.norm.isf(p)) # z-value
print(f"y (true disp.): χ² = {chi2:6.2f}, p-value = {p:.3f} ({z:.1f}-σ)")
DoF: k = 99
x (no disp.): χ² = 93.89, p-value = 0.626 (0.0-σ)
y (wrong disp.): χ² = 220.03, p-value = 0.000 (6.5-σ)
y (true disp.): χ² = 109.42, p-value = 0.223 (0.8-σ)
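A common follow-up (a sketch, not part of the original notebook) is to estimate the intrinsic dispersion \(\varsigma\) of \(\boldsymbol{y}\) by adjusting it until the reduced \(\chi^2\) around the weighted mean equals one:
# Sketch: solve for the intrinsic dispersion ς such that the reduced χ² of y equals 1.
from scipy import optimize

def red_chi2(sig_int, vals, dvals):
    """Reduced χ² around the weighted mean, for a trial intrinsic dispersion."""
    var = dvals ** 2 + sig_int ** 2          # Total variance σᵢ² = dxᵢ² + ς²
    m = np.average(vals, weights=1 / var)    # Weighted mean with trial weights
    return np.sum((vals - m) ** 2 / var) / (len(vals) - 1)

sig_int = optimize.brentq(lambda s: red_chi2(s, y, dx) - 1, 0, 10)
print(f"Estimated intrinsic dispersion: ς = {sig_int:.4f} (simulated value: 1)")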
Sample scale
Unweighted: the supposedly unbiased sample variance wrt. the unknown mean \(\mu\) (but see the discussion in Oliphant, 2006) is:
\[\hat{\sigma}^2 = \frac{1}{N-1}\sum_i \left(x_i - \hat{\mu}\right)^2.\]
[9]:
sig = np.std(x, ddof=1)
dsig = sig / np.sqrt(2 * (n-1)) # Normal approximation, n>10
print(f"Dispersion: σ = {sig:.4f} ± {dsig:.4f}")
Dispersion: σ = 1.0666 ± 0.0758
Weighted: the biased and unbiased weighted variances around the weighted mean are respectively
\[\hat{\sigma}_w^2 = \frac{\sum_i w_i\,(x_i - \hat{\mu}_w)^2}{\sum_i w_i} \qquad\text{and}\qquad \hat{\sigma}_w^2 = \frac{\sum_i w_i\,(x_i - \hat{\mu}_w)^2 \times \sum_i w_i}{\left(\sum_i w_i\right)^2 - \sum_i w_i^2}.\]
[10]:
w = dx ** -2
sumw = np.sum(w)
sigw = np.sqrt( np.sum(w * (x - muw) ** 2) / sumw )
print(f"x Weighted dispersion (biased): σ = {sigw:.4f}")
sigw = np.sqrt( np.sum(w * (x - muw) ** 2) * sumw / (sumw ** 2 - np.dot(w, w)) )
print(f"x Weighted dispersion (unbiased): σ = {sigw:.4f}")
sigyw = np.sqrt( np.sum(w * (y - np.average(y, weights=w)) ** 2) * sumw / (sumw ** 2 - np.dot(w, w)) )
print(f"y Weighted dispersion (wrong disp., unbiased): σ = {sigyw:.4f}")
wy = 1 / (1 + dx ** 2) # Total dispersion (incl. intrinsic)
sigyw = np.sqrt( np.sum(wy * (y - muyw) ** 2) * sumwy / (sumwy ** 2 - np.dot(wy, wy)) )
print(f"y Weighted dispersion (true disp., unbiased): σ = {sigyw:.4f}")
x Weighted dispersion (biased): σ = 0.9461
x Weighted dispersion (unbiased): σ = 0.9513
y Weighted dispersion (wrong disp., unbiased): σ = 1.4563
y Weighted dispersion (true disp., unbiased): σ = 1.4838
The weighted RMS is usually defined as:
\[\mathrm{WRMS}^2 = \frac{\sum_i w_i\,(x_i - \hat{\mu}_w)^2}{\sum_i w_i}.\]
[11]:
def wrms(x, dx):
    w = 1 / dx**2
    muw, sumw = np.average(x, weights=w, returned=True)
    return np.sqrt(np.sum(w * (x - muw)**2) / sumw)
wrmsx = wrms(x,dx)
print(f"WRMS = {wrmsx:.4f}")
WRMS = 0.9461
This strictly corresponds to the biased weighted variance. When using inverse-variance weighting \(w_i = 1/\sigma_i^2\), one has \(\mathrm{WRMS}^2 = \chi^2 \times \sigma_{\mu_w}^2\). Under the assumption that the intrinsic scatter is null (\(\varsigma\equiv 0\)), the WRMS can therefore be used to rescale the uncertainty on the weighted mean, \(\sigma_{\mu_w} \to \mathrm{WRMS}/\sqrt{N-1}\), when the errors \(\d x_i\) have been misestimated by a common multiplicative factor:
[12]:
muw, sumw = np.average(x, weights=(0.1*dx) ** -2, returned=True) # Errors are wrong by a factor 10
dmuw = np.sqrt(1 / sumw)
print(f"x Weighted mean (wrong errors): µ = {muw:.4f} ± {dmuw:.4f}") # Wrong weighted mean uncertainty
dmur = wrms(x, 0.1 * dx) / np.sqrt(n - 1)
print(f"x Weighted mean (rescaled errors): µ = {muw:.4f} ± {dmur:.4f}") # Rescaled weighted mean uncertainty
x Weighted mean (wrong errors): µ = 0.0455 ± 0.0098
x Weighted mean (rescaled errors): µ = 0.0455 ± 0.0951
Note: the weighted mean \(\hat{\mu}_{w}\) is unaffected by a scaling on the errors \(\d x_{i}\).
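A quick numerical check (not part of the original notebook) of the identity \(\mathrm{WRMS}^2 = \chi^2 \times \sigma_{\mu_w}^2\) and of the invariance of \(\hat{\mu}_w\) under a global rescaling of the errors:
# Sketch: verify WRMS² = χ² × σ_µw², and that µw is unchanged when dx is rescaled.
w = dx ** -2
muw, sumw = np.average(x, weights=w, returned=True)
chi2 = np.sum(w * (x - muw) ** 2)
print(f"WRMS² = {wrms(x, dx) ** 2:.4f} vs χ² × σ_µw² = {chi2 / sumw:.4f}")
print(f"µw(dx) = {muw:.4f} vs µw(10 dx) = {np.average(x, weights=(10 * dx) ** -2):.4f}")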
Bayesian estimates
They are studied in Oliphant (2006), and implemented in scipy.stats.mvsdist (Bayesian estimates of the sample mean, variance and standard deviation PDFs) and scipy.stats.bayes_mvs (Bayesian confidence intervals).
[13]:
(mu, (lmu, hmu)), (var, (lvar, hvar)), (sig, (lsig, hsig)) = stats.bayes_mvs(x, alpha=0.95) # 95%
print("Bayesian estimates: µ = %.4f +%.4f -%.4f (95%%)" % (mu, hmu-mu, mu-lmu))
print("Bayesian estimates: σ = %.4f +%.4f -%.4f (95%%)" % (sig, hsig-sig, sig-lsig))
Bayesian estimates: µ = 0.0204 +0.2116 -0.2116 (95%)
Bayesian estimates: σ = 1.0747 +0.1643 -0.1383 (95%)
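For completeness, a short sketch (not in the original notebook) of scipy.stats.mvsdist, which returns the full posterior distributions rather than point estimates and intervals:
# Sketch: posterior PDFs of the mean, variance and standard deviation from stats.mvsdist.
mdist, vdist, sdist = stats.mvsdist(x)
lo, hi = mdist.interval(0.95)
print(f"µ posterior: mean = {mdist.mean():.4f}, 95% interval = [{lo:.4f}, {hi:.4f}]")
lo, hi = sdist.interval(0.95)
print(f"σ posterior: mean = {sdist.mean():.4f}, 95% interval = [{lo:.4f}, {hi:.4f}]")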
Robust estimators
Median-based estimators are robust (up to 50% of the points can be corrupted), but have a low statistical efficiency. More refined estimators can be defined.
Location
The variance of the median \(m\) for large samples and normal distributions is:
\[\mathrm{Var}(m) \simeq \frac{\pi}{2}\,\frac{s^2}{N},\]
where \(s\) is a (presumably robust) estimate of the scale.
Scale
Median Absolute Deviation:
\[\mathrm{MAD} = \mathrm{med}_i\left(\left|x_i - \mathrm{med}_j(x_j)\right|\right).\]
Inter-Quartile Range: this is the 25%-trimmed range, i.e. the difference between the 75th- and the 25th-percentiles of a sample.
Both estimators have an optimal breakdown point of 0.5: up to half of the sample points can be arbitrarily large before the estimator returns an arbitrarily large estimate. (By comparison, the sample standard deviation has a breakdown point of 0.)
For a normal distribution, these are related to the standard deviation \(\sigma\) by:
\[\sigma = \frac{\mathrm{MAD}}{\Phi^{-1}(3/4)} \simeq 1.4826\,\mathrm{MAD}, \qquad \sigma = \frac{\mathrm{IQR}}{2\,\Phi^{-1}(3/4)} \simeq \frac{\mathrm{IQR}}{1.349},\]
where \(\Phi\) is the standard normal cumulative distribution function (scipy.stats.norm.cdf), and its inverse \(\Phi^{-1}\) is known as the normal quantile function, or probit function (scipy.stats.norm.ppf). Note that the MAD is biased for a low number of points (see Croux & Rousseeuw, 1992).
Unfortunately, \(\mathrm{MAD}\) is only 37% as efficient as the sample standard deviation.
[14]:
med = np.median(x)
nmad = stats.median_abs_deviation(x, scale='normal') # Normalized Median Absolute Deviation (for normal distributions)
dmed = nmad / np.sqrt(2 * n / np.pi) # For large normal samples
print(f"Median: {med:.4f} ± {dmed:.4f}")
Median: 0.0599 ± 0.1182
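For comparison, the normalized IQR-based scale estimate is directly available from scipy.stats.iqr (a short sketch, not part of the original notebook):
# Sketch: compare robust scale estimates of x to the sample standard deviation.
print(f"Sample std:     {np.std(x, ddof=1):.4f}")
print(f"Normalized MAD: {stats.median_abs_deviation(x, scale='normal'):.4f}")
print(f"Normalized IQR: {stats.iqr(x, scale='normal'):.4f}")  # IQR / (2 Φ⁻¹(3/4)) ≈ IQR / 1.349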
More efficient robust estimators of scale, still with maximal breakdown point, are presented in Croux and Rousseeuw (1992).
Absolute pairwise differences: scale estimators based on the pairwise differences \(|x_i - x_j|\) are defined in Rousseeuw and Croux (1993).
Covariance matters
\(\chi^2\) minimisation
Consider \(\boldsymbol{y}\) a column-vector of \(N\) measurements \(y_i\), each of which is assumed to be Gaussian distributed with a mean \(F_i(\boldsymbol{\theta})\) (hereafter the model \(\boldsymbol{F}(\boldsymbol{\theta})\), depending on \(M\) parameters \(\theta_k\)). The maximum-likelihood estimate \(\hat{\boldsymbol{\theta}}\) of the set of parameters \(\boldsymbol{\theta}\) is the one minimizing the \(\chi^2\) defined by:
\[\chi^2(\boldsymbol{\theta}) = \left(\boldsymbol{y} - \boldsymbol{F}(\boldsymbol{\theta})\right)^T \cdot \m{V}^{-1} \cdot \left(\boldsymbol{y} - \boldsymbol{F}(\boldsymbol{\theta})\right),\]
where \(\m{V}\) is the (supposedly known) covariance matrix between the measurements: \(V_{ij} = \text{Cov}(y_i, y_j)\).
In that case, the minimum \(\chi^2(\hat{\boldsymbol{\theta}})\) follows a \(\chi^2\)-distribution with \(k=N-M\) degrees of freedom. In particular:
The goodness-of-fit (level of agreement between the measurements and the model) can be assessed from the minimum value \(\chi^2_\min = \chi^2(\hat{\boldsymbol{\theta}})\).
\(\chi^2 = \chi^2_\min + 1\) defines the \(1\,\sigma\)-isocontours around the maximum-likelihood estimate \(\hat{\boldsymbol{\theta}}\). Using a quadratic Taylor expansion of \(\chi^2(\boldsymbol{\theta})\) around its minimum, one gets an approximation of the best-fit parameter covariance matrix:
\[\Cov(\hat{\boldsymbol{\theta}}) \simeq 2\,\m{H}^{-1},\]
where \(\m{H}\) is the Hessian matrix of the \(\chi^2\), evaluated at \(\hat{\boldsymbol{\theta}}\):
\[\begin{split}\begin{align} H_{ij} &= \left.\frac{\partial^2\chi^2}{\partial\theta_i\partial\theta_j}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}} \\ &= 2 \left.\frac{\partial\boldsymbol{F}}{\partial\theta_i}\right|_{\hat{\boldsymbol{\theta}}}^T \cdot \m{V}^{-1} \cdot \left.\frac{\partial\boldsymbol{F}}{\partial\theta_j}\right|_{\hat{\boldsymbol{\theta}}}. \end{align}\end{split}\]
If the measurements are uncorrelated, the covariance matrix \(\m{V}\) is purely diagonal, with \(V_{ii} = \sigma_i^2\) the (supposedly known) variance of measurement \(y_i\), and the previous equations reduce to:
\[\chi^2(\boldsymbol{\theta}) = \sum_i \frac{\left(y_i - F_i(\boldsymbol{\theta})\right)^2}{\sigma_i^2}, \qquad H_{ij} = 2\sum_k \frac{1}{\sigma_k^2} \left.\frac{\partial F_k}{\partial\theta_i}\right|_{\hat{\boldsymbol{\theta}}} \left.\frac{\partial F_k}{\partial\theta_j}\right|_{\hat{\boldsymbol{\theta}}}.\]
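As a generic illustration (a sketch, not part of the original notebook: the straight-line model, the simulated dataset and the use of scipy.optimize.minimize with its BFGS inverse-Hessian approximation are assumptions made for this example), the \(\chi^2\) can be minimized numerically and the parameter covariance approximated from \(\Cov(\hat{\boldsymbol{\theta}}) \simeq 2\,\m{H}^{-1}\):
# Sketch: generic χ² minimization with a covariance matrix; the parameter covariance
# Cov(θ) ≈ 2 H⁻¹ is roughly approximated from the BFGS inverse Hessian.
from scipy import optimize

def chi2_gen(theta, yobs, icov, model):
    r = yobs - model(theta)                  # Residuals to the model F(θ)
    return r @ icov @ r                      # (y - F)ᵀ V⁻¹ (y - F)

rng2 = np.random.default_rng(0)
t = np.linspace(0, 1, 20)                    # Hypothetical independent variable
covm = 0.05 * np.exp(-np.abs(t[:, None] - t[None, :]) / 0.1)  # Assumed covariance V
yobs = 1 + 2 * t + rng2.multivariate_normal(np.zeros_like(t), covm)
model = lambda theta: theta[0] + theta[1] * t  # Straight-line model F(θ)

res = optimize.minimize(chi2_gen, x0=[0., 0.], args=(yobs, linalg.pinvh(covm), model))
covpar = 2 * res.hess_inv                    # Rough Cov(θ); an explicit Hessian would be more accurate
print("θ =", res.x, "±", np.sqrt(np.diag(covpar)))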
Example: weighted mean
We define two datasets:
an uncorrelated sample \(\boldsymbol{x}\) of \(N=100\) repeated measurements of a null quantity, with log-normally distributed measurement uncertainties,
a correlated sample \(\boldsymbol{y}\), derived from \(\boldsymbol{x}\) with an exponential correlation length of \(\tau = 5\) px and the same measurement uncertainties.
[15]:
def vec2corr(vec):
    """Define n×n correlation matrix from *vec* of length n, such that
    corr[0,:] = corr[:,0] = vec."""
    n = len(vec)
    tmp = np.concatenate((vec[:0:-1], vec))
    corr = np.array([ tmp[n-(i+1):2*n-(i+1)] for i in range(n) ])
    return corr
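Aside (not in the original notebook): for a real input vector this construction is simply a symmetric Toeplitz matrix, so scipy.linalg.toeplitz could be used instead:
# Sketch: vec2corr(vec) builds the symmetric Toeplitz matrix corr[i, j] = vec[|i - j|].
vec = np.exp(-np.arange(5) / 2.)
assert np.allclose(vec2corr(vec), linalg.toeplitz(vec))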
[16]:
n = 100
tau = 5. # Exponential correlation length
rng = np.random.default_rng(12345)
sig = np.sqrt(rng.lognormal(sigma=0.3, size=n)) # Log-normally distributed uncertainties
cor = vec2corr(np.exp(-np.arange(n)/tau)) # Correlation matrix
cov = cor * np.outer(sig, sig) # Covariance matrix
x = rng.normal(loc=0, scale=sig) # Uncorrelated gaussian measurements
y = np.dot(linalg.sqrtm(cor), x) # Correlated gaussian measurements
[17]:
fig, ax = plt.subplots()
ax.errorbar(np.arange(n), x, yerr=sig, fmt='b.', label='x: uncorrelated')
ax.errorbar(np.arange(n)+0.2, y, yerr=sig, fmt='rs', label='y: correlated')
ax.legend();
The goodness-of-fit with respect to the constant value \(\mu = 0\) is given by:
[18]:
icov = linalg.pinvh(cov)
def chi2(mu, x):
    """Estimate chi2 *without* accounting for covariance."""
    return np.sum(((x - mu) / sig)**2)

def chi2Cov(mu, y):
    """Estimate chi2 accounting for covariance."""
    return np.dot(y - mu, np.dot(icov, y - mu))
print(f"DoF: k={n}")
c = chi2(0, x) # χ²
p = stats.distributions.chi2.sf(c, n) # p-value
z = max(0, stats.distributions.norm.isf(p)) # z-value
print(f"x (uncorrelated): χ² = {c:6.2f}, p-value={p:.3f} ({z:+.1f}-σ)")
c = chi2(0, y) # No account for covariance
p = stats.distributions.chi2.sf(c, n) # p-value
z = max(0, stats.distributions.norm.isf(p)) # z-value
print(f"y (correlated, w/o cov.): χ² = {c:6.2f}, p-value={p:.3f} ({z:+.1f}-σ)")
c = chi2Cov(0, y) # Account of covariance
p = stats.distributions.chi2.sf(c, n) # p-value
z = max(0, stats.distributions.norm.isf(p)) # z-value
print(f"y (correlated, w/ cov.): χ² = {c:6.2f}, p-value={p:.3f} ({z:+.1f}-σ)")
DoF: k=100
x (uncorrelated): χ² = 94.10, p-value=0.647 (+0.0-σ)
y (correlated, w/o cov.): χ² = 115.53, p-value=0.137 (+1.1-σ)
y (correlated, w/ cov.): χ² = 89.93, p-value=0.755 (+0.0-σ)
In this particular case, disregarding the covariance of the correlated dataset creates a spurious (here ~1σ) tension with the true average \(\mu = 0\), whereas the \(\chi^2\) computed with the full covariance matrix is perfectly acceptable.
Since the objective is to derive the maximum-likelihood estimate of the average value, the model is simply the 1-parameter constant function \(\boldsymbol{F}(\mu) = (\mu, \ldots, \mu)\). In the presence of a covariance matrix \(\m{V}\), one then has \(\sigma_{\hat{\mu}}^2 = \left(\sum_{ij} (\m{V}^{-1})_{ij}\right)^{-1}\). For uncorrelated measurements, this reduces to \(\sigma_{\hat{\mu}}^2 = \left(\sum_{i} 1/\sigma_i^2\right)^{-1}\) (the uncertainty on the inverse-variance weighted mean, see above).
[19]:
mus = np.linspace(-1, 1, 1000)
print(f"DoF: k={n - 1}")
def p2z(p):
    """Express the input one-sided *p*-value as a sigma equivalent
    significance from a normal distribution (the so-called *z*-value)."""
    return np.where(p >= 0.5, 0, stats.distributions.norm.isf(p))
xchi2s = [ chi2(mu, x) for mu in mus ]
iminx = np.argmin(xchi2s) # Poor man's minimization!
mux = mus[iminx]
dmux = np.sqrt(1/np.sum(1/sig**2))
print("x ML mean: µ = %+.4f ± %.4f (%.1f-sigma)" % (mux, dmux, np.abs(mux)/dmux))
c = chi2(mux, x) # χ²
p = stats.distributions.chi2.sf(c, n-1) # p-value
z = p2z(p) # z-value
print(f" χ² = {c:6.2f}, p-value={p:.3f} ({z:+.1f}-σ)")
ychi2s = [ chi2(mu, y) for mu in mus ]
iminy = np.argmin(ychi2s) # Poor man's minimization!
muy = mus[iminy]
dmuy = np.sqrt(1/np.sum(1/sig**2))
print("y ML mean (w/o cov): µ = %+.4f ± %.4f (%.1f-sigma)" % (muy, dmuy, np.abs(muy)/dmuy))
c = chi2(muy, y) # No account for covariance
p = stats.distributions.chi2.sf(c, n-1) # p-value
z = p2z(p) # z-value
print(f" χ² = {c:6.2f}, p-value={p:.3f} ({z:+.1f}-σ)")
ychi2Covs = [ chi2Cov(mu, y) for mu in mus ]
iminyCov = np.argmin(ychi2Covs) # Poor man's minimization!
muyCov = mus[iminyCov]
dmuyCov = np.sqrt(1/np.sum(icov)) # Simple expression of dµ derived from model
print("y ML mean (w/ cov): µ = %+.4f ± %.4f (%.1f-sigma)" % (muyCov, dmuyCov, np.abs(muyCov)/dmuyCov))
c = chi2Cov(muyCov, y) # Account of covariance
p = stats.distributions.chi2.sf(c, n-1) # p-value
z = p2z(p) # z-value
print(f" χ² = {c:6.2f}, p-value={p:.3f} ({z:+.1f}-σ)")
DoF: k=99
x ML mean: µ = +0.0450 ± 0.0976 (0.5-sigma)
χ² = 93.89, p-value=0.626 (+0.0-σ)
y ML mean (w/o cov): µ = +0.1031 ± 0.0976 (1.1-sigma)
χ² = 114.40, p-value=0.138 (+1.1-σ)
y ML mean (w/ cov): µ = +0.2112 ± 0.2146 (1.0-sigma)
χ² = 88.96, p-value=0.755 (+0.0-σ)
Neglecting the covariance also affects the ML estimate of the mean: the quoted uncertainty is underestimated by more than a factor of two (±0.10 instead of ±0.21), so that any offset from the true value would appear spuriously significant.
[20]:
fig, ax = plt.subplots()
ax.plot(mus, xchi2s, 'g:')
ax.plot(mus, ychi2s, 'r--')
ax.plot(mus, ychi2Covs, 'b-')
ax.errorbar([mux],[xchi2s[iminx]], xerr=dmux, fmt='g^', label='UNcorrelated')
ax.errorbar([muy],[ychi2s[iminy]], xerr=dmuy, fmt='rs', label='Correlated (w/o cov)')
ax.errorbar([muyCov],[ychi2Covs[iminyCov]], xerr=dmuyCov, fmt='bo', label='Correlated (w/ cov)')
ax.set(xlabel="µ", ylabel="χ²")
ax.legend(numpoints=1)
ax.grid()
Propagation of uncertainties
Consider a set of \(N\) quantities (measurements, parameter estimates, etc.) \(\boldsymbol{y} = (y_1, \ldots, y_N)\) associated with a covariance matrix \(V_{ij} = \Cov(y_i, y_j)\) (estimated from the measurement or adjustment procedure), and a set of \(M\) functions \(\boldsymbol{f} = (f_1, \ldots, f_M)\). The covariance matrix among the propagated values \(\boldsymbol{f}(\boldsymbol{y})\) can be estimated from a Taylor expansion around \(\boldsymbol{y}\). To first order, one has:
\[\Cov(\boldsymbol{f}) \simeq \m{J}\cdot\m{V}\cdot\m{J}^T,\]
where the matrix of derivatives (the Jacobian) is given by \(J_{ij} = \left.\frac{\partial f_i}{\partial y_j}\right|_{\boldsymbol{y}}\).
This 1st-order expression is exact if \(\boldsymbol{f}\) is linear: \(\boldsymbol{f}(\boldsymbol{y}) = \m{J}\cdot\boldsymbol{y}\). If this is not the case, one has to make sure the functions are reasonably linear over the domain \(\boldsymbol{y} \pm \boldsymbol{\sigma}_{\boldsymbol{y}}\), or expand the Taylor series to higher orders.
The 1st-order expression can be rewritten in non-matrix form:
\[\Cov(f_k, f_l) \simeq \sum_{i,j} \left.\frac{\partial f_k}{\partial y_i}\right|_{\boldsymbol{y}} \left.\frac{\partial f_l}{\partial y_j}\right|_{\boldsymbol{y}}\,V_{ij}.\]
In the case of uncorrelated input quantities, \(V_{ij} = \sigma_i^2\,\delta_{ij}\), and one recovers the traditional uncertainty propagation expression:
\[\sigma_{f_k}^2 \simeq \sum_i \left(\frac{\partial f_k}{\partial y_i}\right)^2 \sigma_i^2.\]
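As a small illustration (not part of the original notebook: the function \(f(\boldsymbol{y}) = (y_0\,y_1,\ y_0/y_1)\), the input values and the covariance are arbitrary), a first-order propagation with an analytic Jacobian:
# Sketch: first-order propagation Cov(f) ≈ J·V·Jᵀ for f(y) = (y0·y1, y0/y1).
yv = np.array([2.0, 4.0])                        # Hypothetical input values
V = np.array([[0.01, 0.002],                     # Hypothetical input covariance
              [0.002, 0.04]])
J = np.array([[yv[1], yv[0]],                    # ∂(y0·y1)/∂(y0, y1)
              [1 / yv[1], -yv[0] / yv[1] ** 2]]) # ∂(y0/y1)/∂(y0, y1)
fv = np.array([yv[0] * yv[1], yv[0] / yv[1]])    # Propagated values
covf = J @ V @ J.T                               # Propagated covariance
print("f =", fv, "±", np.sqrt(np.diag(covf)))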
Example: synthetic photometry
Consider a spectrum made of \(N\) flux measurements \(\boldsymbol{y} = (y_1, \ldots, y_N)\) associated with a covariance matrix \(V_{ij} = \Cov(y_i, y_j)\). One wants to compute the synthetic photometry (i.e. the integrated flux) in bands \(A\) and \(B\),
\[A = \sum_i \alpha_i\,y_i, \qquad B = \sum_i \beta_i\,y_i,\]
where the fixed coefficients \(\alpha_i\) and \(\beta_i\) define the filter bandpasses, as well as the resulting variances and covariances.
In this simple linear case, fluctuations \(\delta A\) are directly related to flux fluctuations \(\delta y_i\):
\[\delta A = \sum_i \alpha_i\,\delta y_i,\]
and therefore
\[\mathrm{Var}(A) = \sum_{i,j} \alpha_i\,\alpha_j\,V_{ij}, \qquad \mathrm{Var}(B) = \sum_{i,j} \beta_i\,\beta_j\,V_{ij}, \qquad \Cov(A, B) = \sum_{i,j} \alpha_i\,\beta_j\,V_{ij}.\]
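A minimal numerical sketch of this linear propagation (not part of the original notebook: the simulated spectrum, its covariance and the Gaussian bandpass coefficients \(\alpha_i\) and \(\beta_i\) are arbitrary assumptions):
# Sketch: synthetic photometry A = Σ αᵢyᵢ, B = Σ βᵢyᵢ, with propagated covariance J·V·Jᵀ.
rng3 = np.random.default_rng(1)
nw = 50
wave = np.linspace(4000., 7000., nw)                       # Wavelengths [Å]
alpha = np.exp(-0.5 * ((wave - 5000.) / 300.) ** 2)        # Band-A coefficients (assumed)
beta = np.exp(-0.5 * ((wave - 6000.) / 300.) ** 2)         # Band-B coefficients (assumed)
covy = 0.01 * np.exp(-np.abs(wave[:, None] - wave[None, :]) / 100.)  # Flux covariance V
flux = 1 + rng3.multivariate_normal(np.zeros(nw), covy)    # Simulated spectrum y

J = np.vstack([alpha, beta])                               # Jacobian (2, N): ∂f_k/∂y_i
AB = J @ flux                                              # Synthetic fluxes (A, B)
covAB = J @ covy @ J.T                                     # Propagated 2×2 covariance
print(f"A = {AB[0]:.2f} ± {np.sqrt(covAB[0, 0]):.2f}, B = {AB[1]:.2f} ± {np.sqrt(covAB[1, 1]):.2f}")
print(f"Correlation(A, B) = {covAB[0, 1] / np.sqrt(covAB[0, 0] * covAB[1, 1]):.3f}")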
References
Besides standard statistics handbooks, the references cited in the text are Oliphant (2006), Croux & Rousseeuw (1992), and Rousseeuw & Croux (1993).