Interpreting Word List Data

From SurveyWiki

Once you have the percentages, how do you interpret them? This section discusses how to use lexical similarity percentages to make inferences about intelligibility and dialect groupings. Remember that there is much more to groupings and intelligibility than just lexical similarity! While lexical similarity does not tell you everything, it does give you a starting point.

For further reference as you read these procedures, you can refer to the Field Guide Glossary and also the Discussion Tab of this page.

Limits of Lexicostatistics

Lexicostatistics is the technique we use to calculate percentages of lexical similarity between two languages. It has some limitations that you need to be aware of, however:

  • It should not be used to give specific dates of divergence of languages or dialects. The more rigorous Comparative Method can be used for that if needed. Tree diagrams of relatedness are often used to represent the findings of the Comparative Method, but they should not be drawn on the basis of lexicostatistical data, which does not give enough information to reconstruct linguistic relatedness in such detail.
  • Lexicostatistics will provide you with percentages of lexical similarity, but these percentages have no value in themselves as exact measures of similarity between one language and another. There are several reasons for this: 1) varieties are usually not intelligible to each other to the same degree, because of social and other factors; 2) lexical similarity does not necessarily represent total language similarity; and 3) most methods of comparison are hybrid methods, meaning they use several factors to measure similarity.

You also have to consider the variance of your scores (see below). Suppose you are comparing languages A and B with language Z. If you arrive at 68% lexical similarity between A and Z and 69% between B and Z, you cannot say that B is more intelligible to Z speakers than A is. With the tiny samples of each language that we have to work with, a 1% difference is unlikely to be meaningful; the variance in your scores will be greater than 1%, so you cannot say which pair of languages is more intelligible. To assess intelligibility, use the more accurate methods of intelligibility testing.

Because of these limitations, lexicostatistics should only be used to indicate the lack of intelligibility and nothing more. SIL recommends that, if lexical similarity is below 70%, you can conclude that there is lack of intelligibility between the two varieties (Douglas W. Boone. 2007. On the uses of word lists and implications for surveyors).

Even though the only strong conclusion you can draw from a lexical similarity percentage is lack of intelligibility, it is acceptable to use lexical similarity as the basis for a first guess at language groupings or clusters. This first guess is helpful, for example, in guiding your choice of test points for intelligibility testing. To make it, consider various thresholds and see which language varieties group together because their lexical similarity is above the threshold.


Lexical Similarity Matrix

Often, a survey involves collecting word lists from more than just two varieties. A meaningful way to present the resulting lexical similarity percentages is in a matrix. In the absence of other information about how the varieties are grouped, start by setting up the matrix in some geographical ordering. Then insert the percentages.

Suppose you are investigating three dialects that are located along a river in the order A, B, C. Based on your data, you find that they have the following lexical similarity percentages:

A and B 85%
A and C 80%
B and C 60%

The resulting geographically ordered matrix would be:

A
85%  B
80%  60%  C

Note that you do not need to enter anything in the top right cells since they would be identical to the bottom left cells. The full matrix is symmetric. That is, the lexical similarity between A and B is the same as that between B and A.
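As a rough illustration of this layout, the following Python sketch builds the lower-left half of the matrix from the three pairwise percentages given above. The function names (`similarity`, `lower_triangle`, `format_matrix`) are made up for this example and have nothing to do with WordSurv.

```python
# Pairwise lexical similarity percentages from the river example above.
pairs = {("A", "B"): 85, ("A", "C"): 80, ("B", "C"): 60}

def similarity(pairs, x, y):
    """Look up a percentage regardless of the order of the two varieties."""
    return pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]

def lower_triangle(pairs, order):
    """Return one row per variety: the percentages with every earlier
    variety in the ordering, followed by the variety's own label."""
    rows = []
    for i, name in enumerate(order):
        cells = [f"{similarity(pairs, name, order[j])}%" for j in range(i)]
        rows.append(cells + [name])
    return rows

def format_matrix(rows):
    """Join the cells of each row into fixed-width columns."""
    return "\n".join(
        "  ".join(f"{cell:<4}" for cell in row).rstrip() for row in rows
    )

# Geographical ordering along the river: A, B, C.
print(format_matrix(lower_triangle(pairs, ["A", "B", "C"])))
```

Because the lookup ignores the order of the two varieties, only one half of the symmetric matrix ever needs to be stored.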

Just because dialects are in some order geographically does not necessarily mean that they actually group closest with their geographic neighbors. The next step is to rearrange the matrix such that more lexically similar varieties are closer to each other in the matrix. When you do this, you have to be careful to keep all the percentages in the right places! One of the most useful features of WordSurv is that it can do this rearranging automatically. In general, an optimally ordered matrix should have:

  • The larger percentages closer to the diagonal.
  • The smaller percentages further from the diagonal, closer to the bottom left corner.

It might not always be possible to have the entire matrix follow these rules. For the simple example above, the rearranged matrix would be:

B
85%  A
60%  80%  C

If you were using a 70% cutoff for lack of intelligibility, then you would conclude that B and C are mutually (inherently) unintelligible. You would need to use intelligibility testing to help determine if A and B understand each other and if A and C understand each other.

If you did intelligibility testing and found that these two pairs do understand each other, you might conclude that both B and C understand A, but not each other, and that you could possibly just develop literature in A which both B and C could use. There are, however, many other factors, such as sociolinguistic ones, that you must look at before making this conclusion.
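The rearranging and screening steps in this example can be sketched in a few lines of Python. This is not how WordSurv does it (its method is not described here); it is a brute-force illustration. It assumes that a good ordering minimizes the sum of each percentage multiplied by its distance from the diagonal. The names `spread`, `best_order`, and `below_cutoff` are hypothetical.

```python
from itertools import permutations

# Pairwise lexical similarity percentages from the river example.
pairs = {("A", "B"): 85, ("A", "C"): 80, ("B", "C"): 60}

def similarity(pairs, x, y):
    """Look up a percentage regardless of the order of the two varieties."""
    return pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]

def spread(pairs, order):
    """Sum of each percentage times its distance from the diagonal.
    Lower is better: large percentages then sit close to the diagonal."""
    return sum(
        similarity(pairs, order[i], order[j]) * (j - i)
        for i in range(len(order))
        for j in range(i + 1, len(order))
    )

def best_order(pairs):
    """Try every ordering and keep the first one with the lowest spread."""
    names = sorted({name for pair in pairs for name in pair})
    return min(permutations(names), key=lambda order: spread(pairs, order))

def below_cutoff(pairs, cutoff=70):
    """Pairs whose similarity falls below the cutoff."""
    return sorted(pair for pair, pct in pairs.items() if pct < cutoff)

print(best_order(pairs))    # ('B', 'A', 'C')
print(below_cutoff(pairs))  # [('B', 'C')]
```

Trying every ordering is only practical for a handful of varieties, since the number of orderings grows factorially. An ordering and its reverse always score the same, so the result is one of two equally good layouts.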

Precision of a Lexical Similarity Percentage

WordSurv produces a variance for each lexical similarity percentage. The variance is the square of the standard deviation, so it gives you a measure of how precise the percentage is. The method that WordSurv uses takes into account an estimate (that you provide) of how reliable your data is. Reliability can be affected by many things. For example, if you are a new surveyor, or you are investigating a language group you have never tried to transcribe before, or an informant was missing some teeth, then the reliability would be lower.

The method also, however, uses some statistical theory that implicitly assumes that the word list is a random sample from all the words of the language, which it is not. A consequence of this assumption is that WordSurv’s formula leads to a smaller variance (higher precision) for a longer word list. As discussed in section 2 of these procedures, what actually happens with a longer word list is that the lexical similarity percentage will tend to decrease because a longer word list includes more words that are more likely to change over time (words outside of the traditional 'basic lists' which are theoretically more stable). Thus, the random sample assumption is not valid. Whether the percentage is more accurate for longer lists or not is hard to say. If it is, then it is a more accurate estimate, but of a different quantity than for a shorter list.
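To see why a random-sample assumption produces a smaller variance for a longer list, consider the textbook binomial approximation. This is an illustration only: it is not WordSurv's formula, it ignores the reliability estimate, and it rests on the same random-sample assumption criticized above. The function name `binomial_sd` is hypothetical.

```python
import math

def binomial_sd(percent, n_items):
    """Standard deviation, in percentage points, of a similarity percentage
    if the word list were a random sample of n_items words.
    The variance of the proportion is p * (1 - p) / n."""
    p = percent / 100
    variance = p * (1 - p) / n_items
    return 100 * math.sqrt(variance)

# The same 68% figure on word lists of different lengths.
for n in (100, 200, 400):
    print(n, round(binomial_sd(68, n), 1))
# 100 4.7
# 200 3.3
# 400 2.3
```

Even under this optimistic assumption, a 200-item list leaves an uncertainty of about three percentage points, so a difference between 68% and 69% cannot be told apart.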

If a lexical similarity percentage is 68%, does that mean that the two varieties are unintelligible? If you use a strict 70% cutoff, the answer is “yes”. But look at other factors such as reported comprehension, contact, and attitudes in order to decide whether or not to consider intelligibility testing. Similarly, if the percentage is not much greater than 70%, consider other factors before commencing intelligibility testing rather than basing your decision on an arbitrary cutoff value.

Lexical Similarity Groupings

Besides screening for lack of intelligibility, lexical similarity can also be used to form preliminary dialect groupings. Note that the basis for these groupings is intelligibility and not, as with the Comparative Method, any genetic similarity. Lexical similarity groupings are useful in forming a hypothesis about intelligibility groups which can then be tested using RTT, sociolinguistic investigation, and linguistic analysis.

Consider the following lexical similarity matrix for some varieties of Chin.

Lexical similarity matrix for some varieties of Chin

Using a cutoff of 75%, there are four groups (A, B, C, and D). Within each group, all the percentages are at least 75%. Between groups, all the percentages are below 75%.

Given the groupings, you can report the ranges of lexical similarity within each group and between groups. This involves simply looking at the portion of the matrix corresponding to the comparisons of interest and noting the range (the smallest to the largest) of the percentages. Within group B, for example, the lexical similarity ranges from 75% to 88%. Comparing B with A yields a lexical similarity range of 54% to 71%.

It is possible that you could end up with a matrix where the groupings are not as clear as this one. For example, suppose you used a 70% cutoff instead of a 75% cutoff. Where would you place C and D? They have at least 70% similarity with at least one of the B varieties, but not with all of them, nor with each other.

This matrix is not the final answer to dialect groupings, just a way to get a preliminary picture. Start with a low cutoff and see what happens. Then, as you increase the cutoff, the picture will become steadily clearer. However, you probably do not want to be making dialect distinctions based on a really high cutoff.
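The grouping procedure can be sketched in Python as follows. The percentages are invented for illustration (they are not the Chin data), and the helper names are hypothetical. The sketch links any two varieties at or above the cutoff, then checks whether every pair inside a resulting group also meets the cutoff. When that check fails, the grouping is unclear in the sense described above.

```python
from itertools import combinations

# Invented percentages for four hypothetical varieties P, Q, R, S.
pairs = {
    ("P", "Q"): 88, ("P", "R"): 78, ("Q", "R"): 76,
    ("P", "S"): 71, ("Q", "S"): 64, ("R", "S"): 58,
}

def similarity(pairs, x, y):
    """Look up a percentage regardless of the order of the two varieties."""
    return pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]

def groups_at(pairs, cutoff):
    """Link varieties whose similarity is at least the cutoff and return
    the resulting groups (connected components), each as a sorted list."""
    names = sorted({name for pair in pairs for name in pair})
    group_of = {name: {name} for name in names}
    for (x, y), pct in pairs.items():
        if pct >= cutoff and group_of[x] is not group_of[y]:
            merged = group_of[x] | group_of[y]
            for member in merged:
                group_of[member] = merged
    return sorted(sorted(g) for g in {frozenset(g) for g in group_of.values()})

def is_clear(pairs, group, cutoff):
    """True if every pair inside the group meets the cutoff."""
    return all(similarity(pairs, x, y) >= cutoff
               for x, y in combinations(group, 2))

def within_range(pairs, group):
    """Smallest and largest percentage inside a group (None for one member)."""
    values = [similarity(pairs, x, y) for x, y in combinations(group, 2)]
    return (min(values), max(values)) if values else None

def between_range(pairs, group1, group2):
    """Smallest and largest percentage between two groups."""
    values = [similarity(pairs, x, y) for x in group1 for y in group2]
    return min(values), max(values)

for cutoff in (70, 75):
    for group in groups_at(pairs, cutoff):
        print(cutoff, group, is_clear(pairs, group, cutoff))
```

With these invented figures, a 75% cutoff gives a clear group P, Q, R (ranging from 76% to 88%) plus S on its own. A 70% cutoff pulls S in through its 71% link to P, even though S stays below 70% with Q and R, so the check fails.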

Using a map, you can draw lexical similarity contours for various threshold percentages. For example, consider the following fictitious lexical similarity matrix.

A fictitious lexical similarity matrix

Using cutoffs of 60%, 70%, 80%, and 90%, the contours would look like the following:


Make sure to clearly indicate in your report that these are lexical similarity contours. Otherwise, someone might interpret your figure to be indicating intelligibility groupings.

See the Lexical Similarity Grouping Examples page for more.

Using Lexical Similarity for Intelligibility Testing Site Selection

Consider the matrix and contours in the previous section. That matrix is just a beginning. It provides good information for deciding where you should do intelligibility testing. In general, you would want to test any two varieties for which the lexical similarity was at least 70%, but there might be situations where you would want to test for intelligibility at lower levels.

Based on the lexical similarity contours for varieties A, B, C, and D, you might hypothesize that a possible single reference dialect is C. It is the only location that has at least 60% lexical similarity with all the others. At any higher threshold, there would have to be at least two reference dialects. You would want to pursue intelligibility testing to go further. These lexical similarity contours give you some idea of where to begin. For example, you would definitely want to test to see if A, B, and D can understand C. In any case, you would also want to investigate sociolinguistic factors such as patterns of contact and acquired bidialectalism before deciding on dialect groupings.
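The search for a candidate reference dialect can be sketched in Python. The percentages below are invented so that they agree with the description above (only C reaches 60% with every other variety); they are not the values from the fictitious matrix. The function name `reference_candidates` is hypothetical.

```python
# Invented percentages for varieties A, B, C, D (illustration only).
pairs = {
    ("A", "B"): 72, ("A", "C"): 65, ("A", "D"): 45,
    ("B", "C"): 68, ("B", "D"): 50, ("C", "D"): 62,
}

def similarity(pairs, x, y):
    """Look up a percentage regardless of the order of the two varieties."""
    return pairs[(x, y)] if (x, y) in pairs else pairs[(y, x)]

def reference_candidates(pairs, cutoff):
    """Varieties whose similarity with every other variety is at least
    the cutoff. These are only candidates for intelligibility testing."""
    names = sorted({name for pair in pairs for name in pair})
    return [
        name for name in names
        if all(similarity(pairs, name, other) >= cutoff
               for other in names if other != name)
    ]

print(reference_candidates(pairs, 60))  # ['C']
print(reference_candidates(pairs, 70))  # []
```

An empty result at a given threshold means that no single variety qualifies at that level, so more than one reference dialect would have to be considered.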


Word lists are very useful in gaining a preliminary picture of the relationships between language varieties. While lexical similarity percentages computed from a word list are a very imprecise indicator of high intelligibility, they can be used as a reliable screen for low intelligibility. This reliability can be maximized by careful attention to the protocol used in word list elicitation, transcription, and analysis. Additionally, lexical similarity percentages can be used to form a hypothesis for dialect groupings as well as provide information relevant to intelligibility testing site selection.