Class KolmogorovSmirnovTest
- java.lang.Object
  - org.apache.commons.math4.legacy.stat.inference.KolmogorovSmirnovTest
public class KolmogorovSmirnovTest extends Object
Implementation of the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.
The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follows a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t | F_n(t)-F_m(t)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:
- When the product of the sample sizes is less than 10000, the method presented in [4] is used to compute the exact p-value for the 2-sample test.
- When the product of the sample sizes is larger, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.
For small samples (the former case), if the data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[],double[],int,boolean,UniformRandomProvider) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.
In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.
The methods used by the 2-sample default implementation are also exposed directly:
- exactP(double, int, int, boolean) computes exact 2-sample p-values
- approximateP(double, int, int) uses the asymptotic distribution
The boolean argument of exactP allows the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See kolmogorovSmirnovTest(double[], double[], boolean).
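The selection rule for the default two-sample method can be sketched as follows. This is only an illustration, not the library's code: the class name TwoSampleMethodChooser, the enum PValueMethod and the constant EXACT_LIMIT are hypothetical; only the threshold of 10000 on the product of the sample sizes is taken from the description above.

```java
/** Hypothetical sketch of the method selection rule; not the library's implementation. */
final class TwoSampleMethodChooser {
    /** Threshold on the product of the sample sizes, taken from the class description. */
    static final long EXACT_LIMIT = 10000;

    /** Hypothetical labels for the two ways of computing the p-value. */
    enum PValueMethod { EXACT, ASYMPTOTIC }

    /** Chooses how the two-sample p-value would be computed for sample sizes n and m. */
    static PValueMethod choose(int n, int m) {
        // Multiply as long so that large sample sizes cannot overflow int.
        long product = (long) n * m;
        return product < EXACT_LIMIT ? PValueMethod.EXACT : PValueMethod.ASYMPTOTIC;
    }

    public static void main(String[] args) {
        System.out.println(choose(50, 60));   // product is 3000
        System.out.println(choose(100, 100)); // product is 10000
    }
}
```

Note that a product of exactly 10000 already falls in the asymptotic branch, because the exact method applies only when the product is strictly less than 10000.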
References:
- [1] Evaluating Kolmogorov's Distribution by George Marsaglia, Wai Wan Tsang, and Jingbo Wang
- [2] Computing the Two-Sided Kolmogorov-Smirnov Distribution by Richard Simard and Pierre L'Ecuyer
- [3] Jasjeet S. Sekhon. 2011. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7): 1-52.
- [4] Wilcox, Rand. 2012. Introduction to Robust Estimation and Hypothesis Testing, Chapter 5, 3rd Ed. Academic Press.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
- Since:
- 3.3
Constructor Summary
- KolmogorovSmirnovTest()
-
Method Summary
Modifier and Type, Method - Description
- double approximateP(double d, int n, int m) - Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- double bootstrap(double[] x, double[] y, int iterations, boolean strict, org.apache.commons.rng.UniformRandomProvider rng) - Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- double cdf(double d, int n) - Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
- double cdf(double d, int n, boolean exact) - Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
- double cdfExact(double d, int n) - Calculates \(P(D_n < d)\).
- double exactP(double d, int n, int m, boolean strict) - Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- double kolmogorovSmirnovStatistic(double[] x, double[] y) - Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
- double kolmogorovSmirnovStatistic(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data) - Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
- double kolmogorovSmirnovTest(double[] x, double[] y) - Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict) - Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data) - Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, boolean exact) - Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- boolean kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, double alpha) - Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- double ksSum(double t, double tolerance, int maxIterations) - Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.
- double monteCarloP(double d, int n, int m, boolean strict, int iterations, org.apache.commons.rng.UniformRandomProvider rng) - Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- double pelzGood(double d, int n) - Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
Constructor Detail
-
KolmogorovSmirnovTest
public KolmogorovSmirnovTest()
-
-
Method Detail
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
- Parameters:
distribution - reference distribution
data - sample being evaluated
exact - whether or not to force exact computation of the p-value
- Returns:
- the p-value associated with the null hypothesis that data is a sample from distribution
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
-
kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
- Parameters:
distribution - reference distribution
data - sample being evaluated
- Returns:
- Kolmogorov-Smirnov statistic \(D_n\)
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
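As an illustration of the formula for \(D_n\), here is a minimal sketch that evaluates the supremum at the sorted data points. It is not the library's implementation: the class name OneSampleKsSketch is hypothetical, and the reference cdf is passed as a plain DoubleUnaryOperator rather than a ContinuousDistribution, which is an assumption made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the one-sample statistic; not the library's implementation. */
final class OneSampleKsSketch {
    /**
     * Computes D_n = sup_x |F_n(x) - F(x)| for a continuous reference cdf F.
     * The cdf is supplied as a plain function (an assumption of this sketch).
     */
    static double statistic(DoubleUnaryOperator cdf, double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0;
        for (int i = 1; i <= n; i++) {
            double f = cdf.applyAsDouble(sorted[i - 1]);
            // F_n jumps from (i - 1)/n to i/n at the i-th order statistic, so the
            // largest deviation occurs just before or exactly at a data point.
            double below = f - (i - 1) / (double) n;
            double above = i / (double) n - f;
            d = Math.max(d, Math.max(below, above));
        }
        return d;
    }
}
```

For example, with the uniform cdf on [0, 1] and the data 0.1, 0.4, 0.7, the largest deviation is reached at 0.7, where \(F_n\) jumps to 1 while \(F\) equals 0.7.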
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).
- Parameters:
x - first sample dataset.
y - second sample dataset.
strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples).
- Returns:
- p-value associated with the null hypothesis that x and y represent samples from the same distribution.
- Throws:
InsufficientDataException - if either x or y does not have length at least 2.
NullArgumentException - if either x or y is null.
NotANumberException - if the input arrays contain NaN values.
- See Also:
bootstrap(double[],double[],int,boolean,UniformRandomProvider)
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
- Parameters:
x - first sample dataset
y - second sample dataset
- Returns:
- p-value associated with the null hypothesis that x and y represent samples from the same distribution
- Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
-
kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
- Parameters:
x - first sample
y - second sample
- Returns:
- test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
- Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
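The definition of \(D_{n,m}\) can be illustrated by walking through the two sorted samples in a single merge pass. The sketch below is an assumption-laden illustration, not the library's code: the class name TwoSampleKsSketch is hypothetical, input validation is omitted, and the difference of the two empirical distributions is tracked as the integer \(|i m - j n|\) so that mathematically equal values compare as equal.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two-sample statistic; not the library's implementation. */
final class TwoSampleKsSketch {
    /** Computes D_{n,m} = sup_t |F_n(t) - F_m(t)|; NaN values are assumed to be absent. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0; // number of x values <= t
        int j = 0; // number of y values <= t
        long maxDiff = 0; // largest |i*m - j*n|, which equals n*m*|F_n(t) - F_m(t)|
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            // Advance both samples past t so that tied values are handled together.
            while (i < n && sx[i] <= t) {
                i++;
            }
            while (j < m && sy[j] <= t) {
                j++;
            }
            maxDiff = Math.max(maxDiff, Math.abs((long) i * m - (long) j * n));
        }
        // Once one sample is exhausted the difference can only shrink, so the loop may stop.
        return maxDiff / ((double) n * m);
    }
}
```

Two completely separated samples give a statistic of 1, while identical samples give 0.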
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- Parameters:
distribution - reference distribution
data - sample being evaluated
- Returns:
- the p-value associated with the null hypothesis that data is a sample from distribution
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
-
kolmogorovSmirnovTest
public boolean kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
- Parameters:
distribution - reference distribution
data - sample being evaluated
alpha - significance level of the test
- Returns:
- true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
- Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
-
bootstrap
public double bootstrap(double[] x, double[] y, int iterations, boolean strict, org.apache.commons.rng.UniformRandomProvider rng)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
- Parameters:
x - First sample.
y - Second sample.
iterations - Number of bootstrap resampling iterations.
strict - Whether or not the null hypothesis is expressed as a strict inequality.
rng - RNG for creating the sampling sets.
- Returns:
- the estimated p-value.
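The resampling idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the class name KsBootstrapSketch is hypothetical, java.util.Random stands in for UniformRandomProvider, and the statistic is recomputed with a simple merge pass.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of a bootstrap p-value estimate; not the library's implementation. */
final class KsBootstrapSketch {
    /** Two-sample statistic D_{n,m}; NaN values are assumed to be absent. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        long maxDiff = 0; // largest |i*m - j*n| seen so far
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] <= t) {
                i++;
            }
            while (j < m && sy[j] <= t) {
                j++;
            }
            maxDiff = Math.max(maxDiff, Math.abs((long) i * m - (long) j * n));
        }
        return maxDiff / ((double) n * m);
    }

    /**
     * Estimates the p-value by drawing, with replacement, sets of size x.length and
     * y.length from the combined sample. java.util.Random is a stand-in for the RNG.
     */
    static double bootstrap(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        double observed = statistic(x, y);
        double[] combined = new double[x.length + y.length];
        System.arraycopy(x, 0, combined, 0, x.length);
        System.arraycopy(y, 0, combined, x.length, y.length);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            for (int k = 0; k < bx.length; k++) {
                bx[k] = combined[rng.nextInt(combined.length)];
            }
            for (int k = 0; k < by.length; k++) {
                by[k] = combined[rng.nextInt(combined.length)];
            }
            double d = statistic(bx, by);
            // Count resampled statistics that exceed (or reach) the observed one.
            if (strict ? d > observed : d >= observed) {
                hits++;
            }
        }
        return (double) hits / iterations;
    }
}
```

Because resampling is done with replacement, the bootstrap samples generally contain ties, which is why the statistic in this sketch advances past tied values in both samples before comparing the empirical distributions.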
-
cdf
public double cdf(double d, int n)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above). Unlike cdfExact(double, int), the result is not exact, because calculations are based on double rather than BigFraction.
- Parameters:
d - statistic
n - sample size
- Returns:
- \(P(D_n < d)\)
- Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
-
cdfExact
public double cdfExact(double d, int n)
Calculates \(P(D_n < d)\). The result is exact in the sense that BigFraction/BigReal is used everywhere, at the expense of very slow execution time. This method exists almost solely for verification purposes; avoid it in real applications unless you are very sure you need it. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
- Parameters:
d - statistic
n - sample size
- Returns:
- \(P(D_n < d)\)
- Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
-
cdf
public double cdf(double d, int n, boolean exact)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
- Parameters:
d - statistic
n - sample size
exact - whether the probability should be calculated exactly, using BigFraction everywhere at the expense of very slow execution time, or whether double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
- Returns:
- \(P(D_n < d)\)
- Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
-
pelzGood
public double pelzGood(double d, int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
- Parameters:
d - value of d-statistic (x in [2])
n - sample size
- Returns:
- \(P(D_n < d)\)
- Since:
- 3.4
-
ksSum
public double ksSum(double t, double tolerance, int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations, a TooManyIterationsException is thrown.
- Parameters:
t - argument
tolerance - Cauchy criterion for partial sums
maxIterations - maximum number of partial sums to compute
- Returns:
- Kolmogorov sum evaluated at t
- Throws:
TooManyIterationsException - if the series does not converge
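The stopping rule can be illustrated with a short sketch of the series. It is not the library's implementation: the class name KsSumSketch is hypothetical, and a plain IllegalStateException is thrown in place of TooManyIterationsException so that the example needs no external dependency.

```java
/** Hypothetical sketch of the Kolmogorov sum; not the library's implementation. */
final class KsSumSketch {
    /**
     * Evaluates 1 + 2 * sum_{i>=1} (-1)^i e^{-2 i^2 t^2}, stopping once two
     * successive partial sums differ by no more than the tolerance.
     */
    static double ksSum(double t, double tolerance, int maxIterations) {
        double sum = 1;
        double sign = -1;
        for (int i = 1; i <= maxIterations; i++) {
            // Size of the i-th term; successive partial sums differ by exactly this amount.
            double term = 2 * Math.exp(-2.0 * i * i * t * t);
            sum += sign * term;
            if (term <= tolerance) {
                return sum;
            }
            sign = -sign;
        }
        // Stand-in for TooManyIterationsException (an assumption of this sketch).
        throw new IllegalStateException("no convergence after " + maxIterations + " partial sums");
    }
}
```

For \(t = 1\) the terms shrink very quickly, so a handful of partial sums is enough; for \(t = 0\) every term has size 2 and the sketch reports non-convergence.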
-
exactP
public double exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).
- Parameters:
d - D-statistic value
n - first sample size
m - second sample size
strict - whether or not the probability to compute is expressed as a strict inequality
- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than (resp. greater than or equal to) d
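To make the meaning of this probability concrete, the sketch below counts partitions directly with a lattice-path dynamic program. This is a different, swapped-in technique chosen for brevity, not the recursion from [4] and not the library's implementation. The class name ExactPSketch and the tolerance eps are hypothetical, and path counts are held in double, so the sketch is only suitable for moderate sample sizes.

```java
/** Hypothetical lattice-path sketch of the exact two-sample p-value; not the library's code. */
final class ExactPSketch {
    /**
     * Each partition corresponds to a monotone path from (0,0) to (n,m); at point (i,j)
     * the scaled difference of the empirical distributions is |i*m - j*n|.
     */
    static double exactP(double d, int n, int m, boolean strict) {
        double threshold = d * n * m;
        double eps = 1e-9; // guards against rounding in d * n * m (sketch assumption)
        // within[j] = number of paths reaching (i,j) without touching the extreme region
        double[] within = new double[m + 1];
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                double paths;
                if (i == 0 && j == 0) {
                    paths = 1;
                } else {
                    // within[j] still holds row i - 1; within[j - 1] already holds row i.
                    paths = (i > 0 ? within[j] : 0) + (j > 0 ? within[j - 1] : 0);
                }
                long diff = Math.abs((long) i * m - (long) j * n);
                boolean extreme = strict ? diff > threshold + eps : diff >= threshold - eps;
                within[j] = extreme ? 0 : paths;
            }
        }
        // Total number of partitions: the binomial coefficient C(n + m, n).
        double total = 1;
        for (int k = 1; k <= m; k++) {
            total = total * (n + k) / k;
        }
        return 1 - within[m] / total;
    }
}
```

With n = m = 2 there are 6 partitions, of which 2 give a statistic of 1 and the other 4 give 0.5, so the non-strict probability at d = 1 is 1/3 while the strict probability is 0. The difference is the mass of the observed value, which is what the strict parameter distinguishes.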
-
approximateP
public double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined.
- Parameters:
d - D-statistic value
n - first sample size
m - second sample size
- Returns:
- approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
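The formula \(1 - k(d \sqrt{mn / (m + n)})\) can be evaluated with a few lines of code. The sketch below is an illustration only: the class name ApproximatePSketch is hypothetical, and the truncation tolerance, the iteration cap and the shortcut for d <= 0 are choices made for this example rather than documented library behavior.

```java
/** Hypothetical sketch of the asymptotic two-sample p-value; not the library's implementation. */
final class ApproximatePSketch {
    /** k(t) = 1 + 2 * sum_{i>=1} (-1)^i e^{-2 i^2 t^2}, truncated when terms become negligible. */
    static double k(double t) {
        double sum = 1;
        double sign = -1;
        for (int i = 1; i <= 100000; i++) { // iteration cap chosen for this sketch
            double term = 2 * Math.exp(-2.0 * i * i * t * t);
            sum += sign * term;
            if (term < 1e-20) { // tolerance chosen for this sketch
                break;
            }
            sign = -sign;
        }
        return sum;
    }

    /** Approximates P(D_{n,m} > d) as 1 - k(d * sqrt(m * n / (m + n))). */
    static double approximateP(double d, int n, int m) {
        if (d <= 0) {
            return 1; // the series does not converge at t = 0; the limit of 1 - k(t) is 1
        }
        // Use double arithmetic so that n * m and n + m cannot overflow int.
        double t = d * Math.sqrt((double) n * m / ((double) n + m));
        return 1 - k(t);
    }
}
```

For instance, with n = m = 50 and d = 0.2 the argument is \(t = 0.2 \sqrt{2500 / 100} = 1\), so the result is \(1 - k(1)\), roughly 0.27.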
-
monteCarloP
public double monteCarloP(double d, int n, int m, boolean strict, int iterations, org.apache.commons.rng.UniformRandomProvider rng)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The simulation generates iterations random partitions of m + n into an n set and an m set, computing \(D_{n,m}\) for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.
- Parameters:
d - D-statistic value.
n - First sample size.
m - Second sample size.
iterations - Number of random partitions to generate.
strict - whether or not the probability to compute is expressed as a strict inequality
rng - RNG used for generating the partitions.
- Returns:
- proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than (resp. greater than or equal to) d.
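The simulation described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the class name MonteCarloPSketch and the tolerance eps are hypothetical, java.util.Random stands in for UniformRandomProvider, and each random partition is produced by shuffling an array of membership flags.

```java
import java.util.Random;

/** Hypothetical sketch of the Monte Carlo estimate; not the library's implementation. */
final class MonteCarloPSketch {
    static double monteCarloP(double d, int n, int m, boolean strict, int iterations, Random rng) {
        // inFirst[k] tells whether the k-th smallest of the m + n values belongs to the n-set.
        boolean[] inFirst = new boolean[n + m];
        for (int k = 0; k < n; k++) {
            inFirst[k] = true;
        }
        double threshold = d * n * m;
        double eps = 1e-9; // guards against rounding in d * n * m (sketch assumption)
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle: every partition into an n-set and an m-set is equally likely.
            for (int k = inFirst.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                boolean tmp = inFirst[k];
                inFirst[k] = inFirst[r];
                inFirst[r] = tmp;
            }
            long i = 0;
            long j = 0;
            long maxDiff = 0; // largest |i*m - j*n|, which equals n*m*D_{n,m} for this partition
            for (boolean first : inFirst) {
                if (first) {
                    i++;
                } else {
                    j++;
                }
                maxDiff = Math.max(maxDiff, Math.abs(i * m - j * n));
            }
            if (strict ? maxDiff > threshold + eps : maxDiff >= threshold - eps) {
                hits++;
            }
        }
        return (double) hits / iterations;
    }
}
```

With n = m = 2 and d = 1, the estimate with strict set to false should settle near 1/3 as the number of iterations grows, since 2 of the 6 possible partitions give a statistic of 1.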