Stochastic (Monte Carlo) Estimation of Type I Error Probability in Dixon's Q-test

[based on: C. E. Efstathiou, "Estimation of Type I Error Probability from Experimental Dixon's 'Q' Parameter on Testing for Outliers within small Size Data Sets", Talanta, 69(5), 1068-1071 (2006)]

The traditional way of performing the Q-test relies on tabulated critical Q values. However, as is common to all significance tests performed this way, it cannot provide the exact value of the type I error probability (p). Hence if, for example, the suspect value can be rejected at the 95% confidence level but must be retained at 99%, we know only that 0.01 < p < 0.05.
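
A minimal Python sketch of this table-based procedure is given below. It assumes that Q is the two-sided ratio (the larger of the two end gaps divided by the range) and uses the commonly tabulated two-sided critical values for N = 3 to 10; the values are illustrative and should be checked against a published table before use. The names `dixon_q`, `bracket_p` and `Q_CRIT` are hypothetical.

```python
# Commonly tabulated two-sided critical Q values (illustrative; verify
# against a published table). Columns: 90%, 95%, 99% confidence.
Q_CRIT = {
    3: (0.941, 0.970, 0.994),
    4: (0.765, 0.829, 0.926),
    5: (0.642, 0.710, 0.821),
    6: (0.560, 0.625, 0.740),
    7: (0.507, 0.568, 0.680),
    8: (0.468, 0.526, 0.634),
    9: (0.437, 0.493, 0.598),
    10: (0.412, 0.466, 0.568),
}
ALPHAS = (0.10, 0.05, 0.01)  # p thresholds matching the three columns


def dixon_q(values):
    """Assumed two-sided Q: larger end gap divided by the range."""
    xs = sorted(values)
    spread = xs[-1] - xs[0]
    if spread == 0:
        raise ValueError("all values are identical; Q is undefined")
    return max(xs[1] - xs[0], xs[-1] - xs[-2]) / spread


def bracket_p(values):
    """Return (lower, upper) bounds on p obtainable from the table alone."""
    q = dixon_q(values)
    lower, upper = 0.0, 1.0
    for alpha, q_crit in zip(ALPHAS, Q_CRIT[len(values)]):
        if q > q_crit:
            upper = alpha   # rejected at this level: p < alpha
        else:
            lower = alpha   # retained at this level: p > alpha
            break
    return lower, upper


data = [0.189, 0.167, 0.187, 0.183, 0.186, 0.182, 0.181, 0.184, 0.181, 0.177]
print(bracket_p(data))  # (0.05, 0.1)
```

For this made-up data set the suspect value 0.167 gives Q of about 0.455, which exceeds the 90% critical value but not the 95% one, so the table can only say that 0.05 < p < 0.10.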

Statistical software packages for significance testing have made tabulated critical values obsolete, since they usually report the value of p directly, computing it internally.

Unfortunately, in the case of the Q-test, the mathematics required to calculate p is particularly arduous, and special numerical integration techniques are needed. In his original publication [Ann. Math. Stat., 22 (1951) 68], Dixon gave analytical expressions of the type Q = F(p) only for N = 3 and 4. These functions can be rearranged into the form p = f(Q), which is the form used in this applet.

Thus, for N = 3 the equation used is:

[equation not reproduced in this text]

whereas for N = 4 the equation used is (provided that p < 0.5; otherwise the Monte Carlo approach is used):

[equation not reproduced in this text]
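
As an illustration for N = 3 only, the sketch below uses a closed form derived independently from the distribution of the gap ratio for three normal values, p = 1 - (6/π)·arctan((2Q - 1)/√3). It assumes that Q is the two-sided ratio (the larger end gap divided by the range, so that 0.5 ≤ Q ≤ 1). It is a reconstruction, not necessarily the expression used in the applet, and the function name `p_value_n3` is hypothetical. No corresponding sketch is attempted for N = 4.

```python
import math


def p_value_n3(q):
    """Two-sided p for N = 3 (reconstructed closed form, see assumptions above)."""
    if not 0.5 <= q <= 1.0:
        raise ValueError("for N = 3 the two-sided Q lies between 0.5 and 1")
    return 1.0 - (6.0 / math.pi) * math.atan((2.0 * q - 1.0) / math.sqrt(3.0))


for q in (0.941, 0.970, 0.994):
    print(q, round(p_value_n3(q), 3))
```

With the commonly tabulated two-sided critical values for N = 3 (0.941, 0.970 and 0.994), the function returns values close to 0.10, 0.05 and 0.01 respectively, which is a useful consistency check on the reconstruction.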

For N > 4 there are no similar analytical expressions, but a simple Monte Carlo approach can be used to estimate p. [Note: for a short introduction to Monte Carlo techniques, see applet: Drunken sailor's random walk].

The algorithm used is as follows:

(i) A set of N random values from the same normally distributed population is obtained, and the Q-value corresponding to this random set (Q_rand) is calculated.

(ii) If this Q_rand value is greater than the input Q_exp value, then a counter (C) is incremented.

(iii) Steps (i) and (ii) are repeated N_sim times (N_sim : number of simulations).

(iv) The ratio C / N_sim is the estimated value of p.
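
Steps (i) to (iv) can be sketched in Python as follows. This is a minimal illustration, not the applet's code: it assumes that Q is the two-sided ratio (the larger end gap divided by the range), it draws the normal values with the standard library's `random.gauss` rather than with the generator described in the NOTE at the end of this page, and the names `dixon_q` and `estimate_p` are hypothetical.

```python
import random


def dixon_q(values):
    """Assumed two-sided Q: larger end gap divided by the range."""
    xs = sorted(values)
    spread = xs[-1] - xs[0]
    if spread == 0:
        raise ValueError("all values are identical; Q is undefined")
    return max(xs[1] - xs[0], xs[-1] - xs[-2]) / spread


def estimate_p(q_exp, n, n_sim=300_000, seed=None):
    """Monte Carlo estimate of p for an experimental Q value and set size n."""
    rng = random.Random(seed)
    count = 0                                   # counter C
    for _ in range(n_sim):                      # step (iii)
        # step (i): N values from one normal population (mean and standard
        # deviation do not matter, because Q is a ratio of differences)
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if dixon_q(sample) > q_exp:             # step (ii)
            count += 1
    return count / n_sim                        # step (iv)


if __name__ == "__main__":
    print(estimate_p(0.710, n=5, n_sim=50_000, seed=1))
```

If 0.710 is taken as the tabulated two-sided 95% critical value for N = 5, the printed estimate should land near 0.05. In pure Python the full 300,000 simulations take noticeably longer than the reduced count used in this call.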

The flow-chart of this algorithm is shown to the right:

The reasoning behind this algorithm is that all N_sim Q_rand values are obtained from sets of N random values that are "outlier-free" by definition, since they have all been drawn from the same normal population. Therefore, the fraction of Q_rand values greater than the examined Q_exp value represents the probability that a Q value this large occurs by chance alone, in the absence of any outlier.

Obviously, a large number of simulations N_sim is needed to obtain a reasonably accurate value of p. In the present applet, N_sim = 300,000. This number yields p values accurate to at least 2 significant figures within a reasonably short calculation period (typically 5-10 s).
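
The effect of N_sim on accuracy can be gauged with the usual binomial standard error of a Monte Carlo proportion, sqrt(p(1 - p)/N_sim). The short sketch below evaluates it for p = 0.05, a value chosen here only as an example; the function name `mc_standard_error` is hypothetical.

```python
import math


def mc_standard_error(p, n_sim):
    """Binomial standard error of a proportion estimated from n_sim trials."""
    return math.sqrt(p * (1.0 - p) / n_sim)


se = mc_standard_error(0.05, 300_000)
print(f"{se:.6f}")  # 0.000398
```

For p = 0.05 this is a relative uncertainty of about 0.8%. Because the error shrinks only with the square root of N_sim, each additional significant figure costs roughly a hundredfold increase in the number of simulations.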

NOTE: Normally distributed random numbers with mean μ and standard deviation σ_x can be readily produced by using the equation shown to the right as a "random-number generator":
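
The equation itself is not reproduced in this text. Judging from the description that follows (n uniform random numbers r_j, approach to normality by the Central Limit Theorem, n = 12 as the typical choice), it is presumably the standard sum-of-uniforms form below; treat it as a reconstruction rather than the applet's exact expression:

```latex
x = \mu + \sigma_x \, \frac{\sum_{j=1}^{n} r_j - n/2}{\sqrt{n/12}}
```

The sum of n uniform numbers on (0, 1) has mean n/2 and variance n/12, so the fraction has zero mean and unit variance. For n = 12 the expression reduces to x = μ + σ_x (Σ r_j - 6).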

Here r_j is a random number uniformly distributed between 0 and 1. Such random numbers (actually pseudo-random numbers) are provided by most high-level computer languages. As n increases, the distribution of the generated x values tends toward a normal distribution. Typically, n = 12. This "generator" is based on the Central Limit Theorem (see applet: Central Limit Theorem).
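
A minimal Python sketch of such a sum-of-uniforms generator is shown below. It assumes the standard scaling, in which the sum of n uniform numbers (mean n/2, variance n/12) is centred and normalised before being shifted to mean μ and scaled to standard deviation σ_x; the function name `clt_normal` is hypothetical.

```python
import math
import random


def clt_normal(mu, sigma, n=12, rng=random):
    """Approximately normal value built from n uniform(0, 1) numbers."""
    total = sum(rng.random() for _ in range(n))        # sum of r_j
    return mu + sigma * (total - n / 2) / math.sqrt(n / 12)


rng = random.Random(0)
xs = [clt_normal(10.0, 2.0, rng=rng) for _ in range(20_000)]
mean = sum(xs) / len(xs)
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
print(round(mean, 2), round(sd, 2))
```

The printed mean and standard deviation should be close to 10 and 2. One limitation of this construction is that, for n = 12, the generated values can never lie more than 6σ_x from μ, so the extreme tails of the normal distribution are truncated.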