# The Variability of a Word List Percentage

**by Noel Mann**

The following information has not been researched thoroughly, but is the opinion of the author. It is included here because it explains some of the reasons behind the data analysis procedure promoted in this book.

WordSurv calculates a variance for a word list percentage, based on Simons (1977)<ref>Simons, Gary. 1977. Tables of significance for lexicostatistics, Survey reference manual, ed. by T. G. Bergman. Dallas: SIL.</ref>, and that variance decreases as the word list gets longer. But a longer word list is NOT necessarily more precise (i.e. it does not necessarily have lower variance). As Simons indicates, quite a number of factors contribute to the error in a lexical similarity percentage (e.g. elicitation error, transcription error, etc.). These are summed up in a subjective reliability code, which accounts for the measurement error. So far, so good. There is nothing wrong with using a reasonable approximation to estimate a component of error.

But Simons goes on to say that there is another source of error, namely sampling error. His assumption is that the word list is a random sample from some larger set of “basic” vocabulary. This assumption leads to the variance decreasing with word list length. However, it has been argued that the “basic-ness” of the most commonly used word lists decreases with the length of the list. For example, Swadesh’s first 100 words are more “basic” than his next 100. So, if you had 70 words of the first 100, you might be able to say that these are a random sample of those 100 words, and use Simons’ formulas (if you really think those 100 are equally basic). But his tables allow for sample sizes far greater than 100. This implies that a very long list contains words just as “basic” as the Swadesh 100.
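The way a random-sampling assumption makes the variance shrink with list length can be illustrated with a short sketch. It assumes a plain binomial model for the sample proportion, which is a simplification chosen for illustration and not necessarily the exact formula behind Simons’ tables; the function name `binomial_variance` and the example value of 80% similarity are hypothetical.

```python
def binomial_variance(p: float, n: int) -> float:
    """Variance of a sample proportion under a simple binomial model.

    p is the assumed "true" similarity (0..1); n is the word list length.
    This is an illustrative stand-in, not Simons' published formula.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return p * (1.0 - p) / n


# Under this model the variance falls in proportion to 1/n,
# so the standard deviation falls in proportion to 1/sqrt(n).
for n in (50, 100, 200, 400):
    sd = binomial_variance(0.8, n) ** 0.5
    print(f"n = {n:3d}: standard deviation = {100 * sd:.1f} percentage points")
```

Under this model, quadrupling the list length halves the standard deviation, whether or not the added words are as “basic” as the original ones.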

Even if you did use Simons’ formula for the case of randomly sampling 70 words from the Swadesh 100, there is another problem. Sampling from a finite list is different from sampling from an infinite list. When sampling from a finite population, the sampling error is actually smaller. The sampling variance should be multiplied by the “finite population correction factor” (1 – n/N), where n is the sample size (e.g. 70) and N is the population size (e.g. 100). For 70 out of 100 words, this means multiplying the usual variance formula by (1 – 0.7) = 0.3. So Simons’ formula would give a variance that is 1/0.3 = 3.33 times too large, and a standard deviation that is 1/√0.3 = 1.83 times too large. So, if you really had 70 such words, and WordSurv reported a margin of error of, say, ± 10%, then after applying the finite population correction factor the margin of error should only be about ± 5.5%.
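The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of the correction as stated in the text, using the factor (1 – n/N); the function names `fpc` and `corrected_margin` are hypothetical and are not part of WordSurv.

```python
import math


def fpc(n: int, N: int) -> float:
    """Finite population correction factor (1 - n/N), as given in the text."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return 1.0 - n / N


def corrected_margin(margin: float, n: int, N: int) -> float:
    """Scale a reported margin of error by sqrt(1 - n/N)."""
    return margin * math.sqrt(fpc(n, N))


factor = fpc(70, 100)
print(round(factor, 2))                            # 0.3
print(round(1 / factor, 2))                        # 3.33 (variance too large by this factor)
print(round(1 / math.sqrt(factor), 2))             # 1.83 (standard deviation too large by this factor)
print(round(corrected_margin(10.0, 70, 100), 1))   # 5.5  (corrected margin of error, in %)
```

Note that when n equals N the factor is zero: if every word in the finite list has been used, there is no sampling error left, only measurement error.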

More importantly, what we usually have is not a random sample from a list of equally “basic” words, so the variance formula is not valid. This is seen most clearly in the evidence given in Section 3.1.3 that the lexical similarity percentage decreases with word list length. Simons’ formula assumes that the “true” lexical similarity percentage is an unknown constant, and that the longer the list, the more closely one can estimate it. But I am proposing that the “true” underlying lexical similarity percentage itself depends on the word list length. Increasing the word list length will only help within a set of equally “basic” words, and then only up to the maximum size of that set.

So, in conclusion, I recommend not using Simons’ tables to assign a margin of error to a lexical similarity percentage. IF you can reasonably assume that your list (of size n) is a random sample from a list (of size N) of equally “basic” vocabulary, then you can apply his formula, provided you multiply the resulting variance by the finite population correction factor (1 – n/N). Equivalently, multiply the margin of error by √(1 – n/N).

<references />