Difference between revisions of "Analyzing Word List Data"

From SurveyWiki
 
(14 intermediate revisions by 2 users not shown)
Line 6: Line 6:
  
 
'''Introduction'''
 
'''Introduction'''
As we said in the introduction to these procedures, the most common reason to use word lists is to distinguish one variety from another and to decide when one variety is unintelligible to speakers of another. To do this, we analyse how similar one word list is to another and produce a lexical similarity percentage. We have to decide on a principle of comparison detailing the criteria we use to decide whether words are similar or not. This section will detail procedures you can carry out to achieve this.
+
As we said in [[Word Lists|the introduction to these procedures]], the most common reason to use word lists is to distinguish one variety from another and to decide when one variety is unintelligible to speakers of another. To do this, we analyse how similar one word list is to another and produce a lexical similarity percentage. We have to decide on a principle of comparison detailing the criteria we use to decide whether words are similar or not. This section will detail procedures you can carry out to achieve this.
  
 
For further reference as you read these procedures, you can refer to the [[Field Guide Glossary]].
 
For further reference as you read these procedures, you can refer to the [[Field Guide Glossary]].
  
 
==Data Entry==
 
==Data Entry==
Type the transcriptions into a spreadsheet program such as Excel, or into WordSurv. Make sure to use Unicode fonts for IPA characters. Type the transcriptions exactly as you wrote them in the field. You can revise them later based on the recordings you made (see the section on checking transcriptions below).
+
Type the transcriptions into a spreadsheet program such as Excel, or directly into [[WordSurv]]. If you use a spreadsheet program though, you will probably be more able to export the data into other tools that may make analysis easier, such as [[Phonology Assistant]]. For detailed instructions about inputting word lists into WordSurv 6.0.2, see [[Inputting Data into WordSurv]].
 +
 
 +
Make sure to use [[Unicode]] fonts for your [[IPA]] characters. If you don't, they may not display correctly on other computers. For instructions on entering IPA Unicode text, see our [[Typing IPA]] page.
 +
 
 +
Type the transcriptions exactly as you wrote them in the field. You can revise them later based on the recordings you made (see the section on checking transcriptions below).
  
 
==Transferring the Recording onto a Computer==
 
==Transferring the Recording onto a Computer==
Line 20: Line 24:
 
Transfer the recording into SA track by track, if using a digital recorder, or, if using a tape recorder, in chunks of a few minutes each. If you transfer it all at once, you will get a very large file that takes a long time to open and split up later. You will save time by recording in chunks now. Track by track is the best way to make sure you don't lose anything. You can set an MD player to play one-track-at-a-time. With a tape recorder, just transfer about 5 minutes at a time. If you happened to stop in the middle of a word, then rewind a little before transferring the next section to avoid losing any data.
 
Transfer the recording into SA track by track, if using a digital recorder, or, if using a tape recorder, in chunks of a few minutes each. If you transfer it all at once, you will get a very large file that takes a long time to open and split up later. You will save time by recording in chunks now. Track by track is the best way to make sure you don't lose anything. You can set an MD player to play one-track-at-a-time. With a tape recorder, just transfer about 5 minutes at a time. If you happened to stop in the middle of a word, then rewind a little before transferring the next section to avoid losing any data.
  
See [[Appendix J]] for details about what settings to use in SA when transferring a recording. It is very important to make sure that the settings are checked, otherwise, you could end up with a poor quality transfer.
+
See [[Speech Analyzer transfer settings]] for details about what settings to use in SA when transferring a recording. It is very important to make sure that the settings are checked, otherwise, you could end up with a poor quality transfer.
  
 
==Checking the Transcriptions==
 
==Checking the Transcriptions==
Listen to the recordings in order to check the field transcriptions you earlier entered into the computer. Make sure you have used the same symbol for the same sound throughout and, as much as possible, phonetically accurate. You can use software like [[SA|Speech Analyzer]]  to do acoustic analysis to get a more accurate idea of the phonetics. The SA Help files are indeed very helpful for understanding acoustic phonetics. An excellent text to help you with this is Peter Ladefoged's 2003 book ''Phonetic Data Analysis''.
+
Listen to the recordings in order to check the field transcriptions you earlier entered into the computer. Make sure you have used the same symbol for the same sound throughout and, as much as possible, phonetically accurate. You can use software like [[Speech Analyzer]] (SA) to do acoustic analysis to get a more accurate idea of the phonetics. The SA Help files are indeed very helpful for understanding acoustic phonetics. An excellent text to help you with this is Peter Ladefoged's 2003 book ''Phonetic Data Analysis''.
  
 
Sometimes you will come across sounds that you will want to compare closely. Using SA, you can combine the recordings for two words into a separate file (this is even easier if you have split the recording up into separate files for each word). Then you can listen to the words one after the other, and compare the waveforms, pitch, formants, or whatever else you want.
 
Sometimes you will come across sounds that you will want to compare closely. Using SA, you can combine the recordings for two words into a separate file (this is even easier if you have split the recording up into separate files for each word). Then you can listen to the words one after the other, and compare the waveforms, pitch, formants, or whatever else you want.
Take some time at this stage to save some time later: split the recording up into separate files for each word in the list. These files should contain the word number, the English gloss, the elicitation prompt (all spoken by you), and the informant saying the word three times. There are many times when checking the transcriptions that you will want to quickly refer to another word and be able to listen to them one after the other. Doing this also means that researchers who later access the data can do so very efficiently.
 
  
Once you've transferred the data, make a backup copy of the transcriptions. It's vital to keep this in a separate place from the computer with the original files. If you back up a copy to the same hard drive or physical location as your original data and your computer equipment gets stolen or suffers a fault you might lose both your original and backup. Your backup is also useful in case you change something in your files and need to recover the original part or compare it with the original.
+
Take some time at this stage to save some time later: split the recording up into separate files for each word in the list. These files should contain the word number, the English [[gloss]], the elicitation prompt (all spoken by you), and the informant saying the word three times. There are many times when checking the transcriptions that you will want to quickly refer to another word and be able to listen to them one after the other. Doing this also means that researchers who later access the data can do so very efficiently.
 +
 
 +
Once you've transferred the data, make a backup copy of the transcriptions. It's vital to '''keep this in a separate place from the computer with the original files'''. If you back up a copy to the same hard drive or physical location as your original data and your computer equipment gets stolen or suffers a fault you might lose both your original and backup. Your backup is also useful in case you change something in your files and need to recover the original part or compare it with the original.
  
 
==Identifying Roots==
 
==Identifying Roots==
Line 45: Line 50:
 
#For the first word ‘I’ (first person singular), the form is unambiguously monosyllabic; thus no further analysis is required and these forms can be directly compared.
 
#For the first word ‘I’ (first person singular), the form is unambiguously monosyllabic; thus no further analysis is required and these forms can be directly compared.
 
#For the words egg and seed, all of the varieties have generally similar forms. This disyllabic form is made up of a shared first weak syllable and two different major syllables following that. In this case, the weak syllable is lost in one case and including it would lead to an artificially lower lexical similarity percentage. It is further noted that the similarities between the form for both egg and seed may indicate that this is some sort of classification particle and should be eliminated from the lexicostatistic comparison; thus only the major syllable is compared.
 
#For the words egg and seed, all of the varieties have generally similar forms. This disyllabic form is made up of a shared first weak syllable and two different major syllables following that. In this case, the weak syllable is lost in one case and including it would lead to an artificially lower lexical similarity percentage. It is further noted that the similarities between the form for both egg and seed may indicate that this is some sort of classification particle and should be eliminated from the lexicostatistic comparison; thus only the major syllable is compared.
#The word forms for warm and sit are similar, with an apparent morphologic vowel particle ad the end of the word, and an unusual form in two of the words ending in a glottal stop.
+
#The word forms for warm and sit are similar, with an apparent morphologic vowel particle at the end of the word, and an unusual form in two of the words ending in a glottal stop.
#Considering the word forms for leaf and root, it is apparent that there is an initial morpheme which possibly means having to do with trees. This morpheme provides supplemental semantic information that is not necessary to the core meaning of the major syllable. Including it in the comparison would lead to an artificially higher lexical similarity percentage. Thus, this morpheme is ignored.
+
#Considering the word forms for leaf and root, it is apparent that there is an initial morpheme having a meaning relating to 'tree'. This morpheme provides supplemental semantic information that is not necessary to the core meaning of the major syllable. Including it in the comparison would lead to an artificially higher lexical similarity percentage. Thus, this morpheme is ignored.
  
 
Applying these basic steps, the data can be reduced to the roots by eliminating weak and supplemental syllables resulting in the following data:
 
Applying these basic steps, the data can be reduced to the roots by eliminating weak and supplemental syllables resulting in the following data:
Line 71: Line 76:
 
===The Blair Method===
 
===The Blair Method===
  
A method which takes the middle ground between the Comparative Method and simple inspection has been developed in South Asia. It's based on the method outlined in Frank Blair's 1990 book ''Survey on a Shoestring''.<ref name="blair" />
+
A method which takes the middle ground between the [[Comparative Method]] and simple inspection has been developed in South Asia. It's based on the method outlined in Frank Blair's 1990 book ''Survey on a Shoestring''<ref name="blair" /> and is thus called the ''Blair Method''.  
 
 
The basis of this method is the ''segment''. A segment is a phoneme unit e.g. an initial phoneme. The corresponding segments in a pair of words are analysed and placed in the following categories.
 
 
 
*Category 1
 
** Exact matches (e.g., [b] occurs in the same position in each word.)
 
** Vowels which differ by only one phonological feature (e.g., [i] and [e] occur in the same position in each word.)
 
** Phonetically similar segments which occur consistently in the same position in three or more word pairs. For example, the [g]/[gɦ] correspondences in the following entries from these two dialects would be considered category one:
 
 
 
{|class=wikitable border=1 cellpadding=5
 
|-
 
! Gloss
 
! Dialect One
 
! Dialect Two
 
|-
 
| Fingernail
 
| [goru]
 
| [gɦoru]
 
|-
 
| Axe
 
| [godeli]
 
| [gɦodel]
 
|-
 
| Cloth
 
| [guda]
 
| [gɦuda]
 
|-
 
| Boy
 
| [peka]
 
| [pekal]
 
|}
 
 
 
*Category 2
 
** Phonetically similar consonant segments which do not correspond regularly, that is, which are not attested in a sufficient number of pairs. The exact number of pairs needed for a correspondence to count as regular will vary depending on the number of words in a list. For a list of 210 words, 3 would be reasonable. For more on this, see the page on the Significance of Sound Correspondences.
 
** Vowels which differ by two or more phonological features (e.g., [a] and [u]).
 
 
 
*Category 3
 
** All corresponding segments which are not phonetically similar.
 
** A segment which corresponds to nothing in the second word of the pair. For example, the [l]/[#] correspondence in the word for boy in the example above.
 
 
 
Depending on the context, you might have to allow for some localised issues. For example, SIL's South Asia survey team ignores:
 
 
 
* Interconsonantal [ə]
 
* Word initial, word final, or intervocalic [h] or [ɦ]
 
* Any deletion
 
 
 
====Phonetic Similarity====
 
Notice that Blair's categories imply that you have to judge whether consonant pairs are phonetically similar or not. Additionally, you need to know what vowels differ by only one feature. So, before applying Blair’s method, you need to make charts showing what consonants you consider to be phonetically similar and what vowels you consider to differ by only one feature.
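As a rough illustration of how such charts feed into the three categories, the sketch below stores each chart as a set of unordered segment pairs and classifies one pair of corresponding segments. The tiny charts, the vowel inventory, and the names `VOWELS`, `ONE_FEATURE_VOWELS`, `SIMILAR_CONSONANTS` and `classify` are assumptions made for this example only; they are not Blair's or Mann's charts, and you would replace them with the charts you draw up for your own language family.

```python
# Hypothetical charts: each entry is an unordered pair of segments.
VOWELS = {"i", "e", "a", "o", "u"}
ONE_FEATURE_VOWELS = {frozenset(p) for p in [("i", "e"), ("u", "o")]}
SIMILAR_CONSONANTS = {frozenset(p) for p in [("g", "gɦ"), ("p", "b"), ("n", "ŋ")]}


def classify(seg1, seg2, regular=frozenset()):
    """Return the category (1, 2 or 3) for one pair of corresponding segments.

    seg1 and seg2 come from the two words being compared; None stands for
    a segment that corresponds to nothing in the other word.
    `regular` holds the unordered pairs already judged to correspond regularly.
    """
    if seg1 is None or seg2 is None:
        return 3  # a segment which corresponds to nothing
    if seg1 == seg2:
        return 1  # exact match
    pair = frozenset((seg1, seg2))
    if seg1 in VOWELS and seg2 in VOWELS:
        # One feature apart is Category 1, two or more is Category 2.
        return 1 if pair in ONE_FEATURE_VOWELS else 2
    if pair in SIMILAR_CONSONANTS:
        # Phonetically similar: Category 1 only if the correspondence is regular.
        return 1 if pair in regular else 2
    return 3  # not phonetically similar


regular = {frozenset(("g", "gɦ"))}
print(classify("g", "gɦ", regular))  # 1
print(classify("p", "b", regular))   # 2
print(classify("l", None, regular))  # 3
```

Keeping the charts as plain data makes it easy to add a line to the chart later, which is what tends to happen once regular correspondences start to show up in the data.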
 
 
 
Consult a phonology text for help. The specifics will depend on the language family. Here is another place where the [[Comparative Method]] can be useful. If you have, or another researcher has, reconstructed a proto-language for the language family of interest, then consult it for help as to what sorts of sound correspondences to look for. Figure 1 shows an example for consonants from surveyor Noel Mann. Phonetically similar segments are joined by a single line.
 
 
 
[[File:Consonants.jpg|center|Example of chart showing phonetically similar segments.]]
 
 
 
Figure 2 shows an example for vowels.
 
 
 
[[File:Vowels.jpg|center|Example of chart showing vowels differing by one feature.]]
 
 
 
Noel notes that “the schwa is a tricky element; although the criterion says we can ignore it interconsonantally, it can be one of a number of disguised phones.” So sometimes it should be ignored, but other times included in the comparison. How can you tell the difference?
 
 
 
Noel says
 
 
 
<blockquote>This may vary depending on the language. I would use the general rule of ignoring schwa when it is in the onset position, that is between consonants in the onset but not when it is the nuclear element of a syllable. The reason for ignoring it in the onset is that it is often merely a transitional element and some linguists will transcribe it while others will not (this appears to be the case in many Asian languages). In this position, it is often a slight transition and not a full segment. For the nucleus however, it tends to be the remnant of a vowel at a different place on the phonetic chart (particularly in Tibeto-Burman) which has become reduced to schwa.</blockquote>
 
 
 
An example from Mpi, a Tibeto-Burman language in the Loloish branch, is shown in Figures 3 and 4 (Nahhas 2005)<ref name="nahhas">Nahhas, Ramzi W. 2005. Sociolinguistic survey of Mpi in Thailand. Linguistics Department Research Paper #202. Chiang Mai: Payap University.</ref>. Nahhas comments:
 
 
 
<blockquote>In practice, what actually happened for Mpi was that after I had inventoried the segments and consulted a phonology text, if I encountered correspondences which seemed to be regular but were not connected by a line in the chart I added a line! “Phonetically similar” is a fuzzy concept. Sound changes that seem strange in one part of the world might be reasonable in another.</blockquote>
 
 
 
[[File:Note.jpg|center|Mpi phonetically similar consonants (Nahhas 2005)<ref name="nahhas" /> (Consonants joined by a line segment are considered to be phonetically similar.)]]
 
 
 
[[File:Nahhasvowels.jpg|center|Mpi vowels (Nahhas 2005)<ref name="nahhas" /> (Vowel pairs joined by a line segment are considered to differ by one feature. Those joined by two segments with only one intervening vowel are considered to differ by two features.)]]
 
 
 
====Counting Correspondences====
 
 
 
When you do encounter a correspondence between phonetically similar segments, then you need to look through all the other words in the list (not just the 100 or so used in the comparison). Count how many times these two segments appear in the same word pair and in the same relationship (that is, count the correspondences in which one segment is always in one language and the other is always in the other). Blair says that a correspondence is considered to be “regular” if there are three or more occurrences in the data. This was based on a word list of length 210. This seems to be a good rule of thumb, and you will often be correct in using it. However, in fact, the right number to use is not three in all cases. It depends on the length of the word list and the frequency of occurrence of the segments in question. For more, refer to the [[Significance of Sound Correspondences]] page.
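The counting step can be sketched in a few lines of Python. The example below uses the Fingernail, Axe, Cloth and Boy entries from the Dialect One / Dialect Two table earlier in this section. The segment-by-segment alignment was done by hand for this illustration, and the function names and the default threshold of 3 (Blair's figure for a 210-item list) are assumptions to adjust for your own data.

```python
from collections import Counter

# Hand-aligned segments (Dialect One, Dialect Two);
# None marks a segment with no counterpart in the other word.
ALIGNED_PAIRS = [
    (["g", "o", "r", "u"], ["gɦ", "o", "r", "u"]),                       # fingernail
    (["g", "o", "d", "e", "l", "i"], ["gɦ", "o", "d", "e", "l", None]),  # axe
    (["g", "u", "d", "a"], ["gɦ", "u", "d", "a"]),                       # cloth
    (["p", "e", "k", "a", None], ["p", "e", "k", "a", "l"]),             # boy
]


def count_correspondences(aligned_pairs):
    """Count, for each non-identical correspondence, the word pairs it occurs in.

    Keys are ordered (segment in variety 1, segment in variety 2), so a
    correspondence is only counted when each segment stays on its own side.
    """
    counts = Counter()
    for word1, word2 in aligned_pairs:
        if len(word1) != len(word2):
            raise ValueError("aligned words must have the same length")
        # A set ensures each correspondence counts once per word pair.
        counts.update({(a, b) for a, b in zip(word1, word2) if a != b})
    return counts


def regular_correspondences(counts, threshold=3):
    """Return the correspondences attested in at least `threshold` word pairs."""
    return {pair for pair, n in counts.items() if n >= threshold}


counts = count_correspondences(ALIGNED_PAIRS)
print(counts[("g", "gɦ")])              # 3
print(regular_correspondences(counts))  # {('g', 'gɦ')}
```

With only four word pairs this is a toy run; on a real list you would feed in every aligned word pair, not just the items used in the comparison.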
 
 
 
Table 2 provides an example of the observed sound correspondences obtained from a lexical comparison of two Mpi villages, Ban Dong and Ban Sakoen, using a 436-item word list (see Nahhas 2005)<ref name="nahhas" />.
 
 
 
[[File:Mpicorrespondences.jpg|600px|center|Regular sound correspondences observed in Mpi (Nahhas 2005).]]
 
 
 
Table 2 brings up a couple of interesting points. First, for some languages, you should consider a consonant cluster as two segments, for others as one. For Mpi, they were counted as one segment. Also, occurrences of the same sound change, even with different segments, can be grouped together. For Mpi, there were three different environments (all bilabial) where a [j]-[l] correspondence occurred. These were grouped together to give a total of 10 occurrences. Thus, some single occurrences were counted as “regular” because the change in question was observed in enough other environments.
 
 
 
====Examples of Segment Comparisons====
 
 
 
The following example is from Noel Mann. In it, the [n]-[ŋ] correspondence is counted wherever it occurs (initially, medially, or finally).
 
 
 
[[File:Mannsexample.jpg|center|Examples of segment comparisons.]]
 
 
 
====Criteria for Lexical Similarity====
 
 
 
Once all the segment pairs for a word pair are categorized, the following rule is applied. The rule given here has been formulated independently by a few different surveyors as a concise way of summarizing Blair’s chart. Also, this rule allows Blair’s chart to be generalized to any number of phones. For more, see the [[Segment Table]] page.
 
 
 
Two items are judged to be phonetically similar if:
 
 
 
At least 50% of the segments compared are in Category 1
 
'''AND'''
 
At least 75% of the segments compared are in Category 1 and Category 2.
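The two-part rule is easy to check mechanically. Here is a minimal sketch, assuming each word pair has already been reduced to a list of category numbers (1, 2 or 3), one per segment pair; the function name `is_lexically_similar` is chosen for this example only.

```python
def is_lexically_similar(categories):
    """Apply the 50% / 75% rule to the categories of one word pair's segments."""
    n = len(categories)
    if n == 0:
        raise ValueError("at least one segment pair is required")
    cat1 = categories.count(1)
    cat2 = categories.count(2)
    # Integer arithmetic avoids rounding trouble exactly at the 50% and 75% boundaries.
    return 2 * cat1 >= n and 4 * (cat1 + cat2) >= 3 * n


print(is_lexically_similar([1, 1, 2, 3]))  # True: 2 of 4 in Category 1, 3 of 4 in Categories 1 and 2
print(is_lexically_similar([1, 1, 3]))     # False: only 2 of 3 (67%) in Categories 1 and 2
```

Because the rule is stated as percentages, the same function works for any number of segments.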
 
 
 
For one to five segments, Table 3 shows the possible combinations of segments that lead to a conclusion of lexical similarity. Any combination not in the table would not be considered lexically similar.
 
 
 
 
 
[[File:5segments.jpg|center|Criteria for Lexical Similarity
 
(This is for up to 5 segments. For other amounts, see the Segment Table page).]]
 
 
 
Let’s look again at the example from Section 4.6.3.3:
 
  
[[File:Example.jpg|center]]
+
See the [[Blair Method]] page for detailed guidance on how comparisons are made between varieties.
  
Thus, based on Table 3, two of the three pairs are considered lexically similar.
+
===Levenshtein Distance Method===
 +
See our [[Levenshtein Distance]] page for more info about this method.
  
 
===Computing the Percentage===
 
===Computing the Percentage===
  
Whatever method is used to determine lexical similarity for word pairs, the method of computing the percentage is simple. Just divide the number of lexically similar items by the number of items you are comparing. In the example from the previous section, this would be 2 ÷ 3 = 0.667 = 67%. Of course, you would never do this with only three words, but you get the idea. Suppose instead that you had compared 98 pairs of words and found that 74 of them were lexically similar. Then your lexical similarity percentage would be 74 ÷ 98 = 0.755 = 76%.
+
Whatever method is used to determine lexical similarity for word pairs, the method of computing the percentage is simple. Just divide the number of lexically similar items by the number of items you are comparing. In the example from the end of the [[Blair method]] page, this would be 2 ÷ 3 = 0.667 = 67%. Of course, you would never do this with only three words, but you get the idea. Suppose instead that you had compared 98 pairs of words and found that 74 of them were lexically similar. Then your lexical similarity percentage would be 74 ÷ 98 = 0.755 = 76%.
  
 
The Blair Method is applied to language varieties pairwise. That is, if you have 10 varieties you are comparing, you have to apply the method to each of the 45 unique pairs. Sometimes it is faster to do some of the steps for many varieties at once. Suppose again that you are comparing 10 varieties. For each item in the list, place each of the 10 words in a “correspondence set” (or “similarity set”). When items are clearly lexically similar (e.g. there are 5 phones and 4 of them are identical) or clearly not similar, there is no point in taking very much time writing which categories and sub-categories each segment is in. Just mark all the words that are clearly lexically similar with the letter “a”, indicating that they are in the same set. If there is more than one distinct set of similar words, use more letters to distinguish each set. Later, you can go back and figure out the categories for the remaining, less clear, words, and put them in the right sets. Doing this in [[WordSurv]] or Excel can speed the process of computing the lexical similarity percentages for all the possible pairs of varieties. [[WordSurv]] does it automatically, and Excel can be programmed to do so. Once you have computed the percentages for each pair of varieties, you can organize them in a matrix, such as Table 4.
 
The Blair Method is applied to language varieties pairwise. That is, if you have 10 varieties you are comparing, you have to apply the method to each of the 45 unique pairs. Sometimes it is faster to do some of the steps for many varieties at once. Suppose again that you are comparing 10 varieties. For each item in the list, place each of the 10 words in a “correspondence set” (or “similarity set”). When items are clearly lexically similar (e.g. there are 5 phones and 4 of them are identical) or clearly not similar, there is no point in taking very much time writing which categories and sub-categories each segment is in. Just mark all the words that are clearly lexically similar with the letter “a”, indicating that they are in the same set. If there is more than one distinct set of similar words, use more letters to distinguish each set. Later, you can go back and figure out the categories for the remaining, less clear, words, and put them in the right sets. Doing this in [[WordSurv]] or Excel can speed the process of computing the lexical similarity percentages for all the possible pairs of varieties. [[WordSurv]] does it automatically, and Excel can be programmed to do so. Once you have computed the percentages for each pair of varieties, you can organize them in a matrix, such as Table 4.
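If you prefer a script to a spreadsheet, the same bookkeeping can be sketched as below. This is not how WordSurv works internally. It is a minimal illustration which assumes that every word has been given exactly one similarity-set letter per item, and that items missing in either variety are left out of the count. The variety names and set letters are made up.

```python
from itertools import combinations


def similarity_percentage(similar, compared):
    """Lexical similarity as a whole-number percentage."""
    if compared == 0:
        raise ValueError("no items were compared")
    return round(100 * similar / compared)


def pairwise_percentages(set_labels):
    """Compute the percentage for every unique pair of varieties.

    set_labels maps a variety name to a list holding one similarity-set
    letter per word list item, or None where no word was elicited.
    """
    results = {}
    for v1, v2 in combinations(sorted(set_labels), 2):
        both = [(a, b) for a, b in zip(set_labels[v1], set_labels[v2])
                if a is not None and b is not None]
        similar = sum(1 for a, b in both if a == b)
        results[(v1, v2)] = similarity_percentage(similar, len(both))
    return results


# Made-up set letters for three hypothetical varieties and four items.
labels = {
    "A": ["a", "a", "a", "a"],
    "B": ["a", "b", "a", "a"],
    "C": ["a", "b", "b", None],
}
print(similarity_percentage(74, 98))   # 76
print(pairwise_percentages(labels))    # {('A', 'B'): 75, ('A', 'C'): 33, ('B', 'C'): 67}
print(len(list(combinations(range(10), 2))))  # 45 unique pairs for 10 varieties
```

The resulting dictionary holds one value per unique pair, which is the content of the lower triangle of a matrix such as Table 4.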
Line 198: Line 101:
 
==References==
 
==References==
 
<references />
 
<references />
 +
[[Category:Word_Lists]]

Latest revision as of 16:35, 6 August 2012


Introduction

As we said in the introduction to these procedures, the most common reason to use word lists is to distinguish one variety from another and to decide when one variety is unintelligible to speakers of another. To do this, we analyse how similar one word list is to another and produce a lexical similarity percentage. We have to decide on a principle of comparison detailing the criteria we use to decide whether words are similar or not. This section will detail procedures you can carry out to achieve this.

For further reference as you read these procedures, you can refer to the Field Guide Glossary.

Data Entry

Type the transcriptions into a spreadsheet program such as Excel, or directly into WordSurv. If you use a spreadsheet program though, you will probably be more able to export the data into other tools that may make analysis easier, such as Phonology Assistant. For detailed instructions about inputting word lists into WordSurv 6.0.2, see Inputting Data into WordSurv.

Make sure to use Unicode fonts for your IPA characters. If you don't, they may not display correctly on other computers. For instructions on entering IPA Unicode text, see our Typing IPA page.

Type the transcriptions exactly as you wrote them in the field. You can revise them later based on the recordings you made (see the section on checking transcriptions below).

Transferring the Recording onto a Computer

If you used a digital recording device, transfer the data using either a USB cable or, if the data is stored on a memory card, a card reader. For tape recorders or older Minidisc (MD) players, you have to use a patch cord which goes from the headphone (or line-out) jack of the recorder to the microphone (or line-in) jack of the computer.

Use software such as SIL’s Speech Analyzer (SA) for recording sound on the computer as you play the recording back from the recorder. SA not only allows you to record sound, but also to view the waveform, pitch contours, spectrogram, and formants. These can be extremely useful for analysis.

Transfer the recording into SA track by track, if using a digital recorder, or, if using a tape recorder, in chunks of a few minutes each. If you transfer it all at once, you will get a very large file that takes a long time to open and split up later. You will save time by recording in chunks now. Track by track is the best way to make sure you don't lose anything. You can set an MD player to play one-track-at-a-time. With a tape recorder, just transfer about 5 minutes at a time. If you happened to stop in the middle of a word, then rewind a little before transferring the next section to avoid losing any data.

See Speech Analyzer transfer settings for details about what settings to use in SA when transferring a recording. It is very important to check these settings; otherwise, you could end up with a poor-quality transfer.

Checking the Transcriptions

Listen to the recordings in order to check the field transcriptions you earlier entered into the computer. Make sure you have used the same symbol for the same sound throughout and that the transcriptions are, as much as possible, phonetically accurate. You can use software like Speech Analyzer (SA) to do acoustic analysis to get a more accurate idea of the phonetics. The SA Help files are very helpful for understanding acoustic phonetics. An excellent text to help you with this is Peter Ladefoged's 2003 book Phonetic Data Analysis.

Sometimes you will come across sounds that you will want to compare closely. Using SA, you can combine the recordings for two words into a separate file (this is even easier if you have split the recording up into separate files for each word). Then you can listen to the words one after the other, and compare the waveforms, pitch, formants, or whatever else you want.

Take some time at this stage to save some time later: split the recording up into separate files for each word in the list. These files should contain the word number, the English gloss, the elicitation prompt (all spoken by you), and the informant saying the word three times. There are many times when checking the transcriptions that you will want to quickly refer to another word and be able to listen to them one after the other. Doing this also means that researchers who later access the data can do so very efficiently.

Once you've transferred the data, make a backup copy of the transcriptions. It's vital to keep this in a separate place from the computer with the original files. If you back up a copy to the same hard drive or physical location as your original data and your computer equipment gets stolen or suffers a fault you might lose both your original and backup. Your backup is also useful in case you change something in your files and need to recover the original part or compare it with the original.

Identifying Roots

When eliciting words, you might get the word you want plus some extra information. For example, if you ask for the word for run, you might get run (going) or run (coming). When comparing the words for run across languages, the extra words for going or coming are irrelevant, and so should be dropped. Although you want to isolate run in your recording, don't cut or delete the extra information from your original data. This could be very useful for anyone researching this language in the future. Instead, make a copy of the original and then cut out the extra information in the copy.

Also, some languages have syllables that are added on before or after many different words. These are not part of the root and so should be dropped. Suppose that in two related varieties, many words have a nasal pre-syllable that is determined by the following consonant (e.g. mb, nd, or ŋg). If these were included in the comparison of words, they would artificially increase the lexical similarity, because they are the same in every word that has a nasal pre-syllable. Again, as explained in the previous paragraph, don't cut or delete extra information from the original files but make a copy to edit.

Thus, the first step in comparing the words is to reduce each transcription to its root form. If there is a common syllable added onto many words, drop it. If there are extra words and you know their meaning is not relevant, then drop them. If there are extra words whose meaning you do not know, then keep the pair of words that seem to be most similar.
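As a small illustration of this reduction step, the sketch below strips a nasal pre-syllable of the kind described above (mb, nd, ŋg) and returns a new list, leaving the original transcriptions untouched. The forms are invented for the example, each character is assumed to be one segment, and the mapping `HOMORGANIC` and both function names are hypothetical. Whether an initial nasal really is a pre-syllable in your data is a judgement you have to make before running anything like this.

```python
# Hypothetical mapping from a nasal pre-syllable to the consonant that conditions it.
HOMORGANIC = {"m": "b", "n": "d", "ŋ": "g"}


def strip_nasal_presyllable(form):
    """Return the form without an initial nasal that matches the following stop."""
    if len(form) >= 2 and HOMORGANIC.get(form[0]) == form[1]:
        return form[1:]
    return form


def reduce_to_roots(forms):
    """Build a new list of reduced forms; the input list is not modified."""
    return [strip_nasal_presyllable(f) for f in forms]


raw = ["mba", "nda", "ŋgu", "ma"]  # invented forms, not real data
roots = reduce_to_roots(raw)
print(roots)  # ['ba', 'da', 'gu', 'ma']
print(raw)    # ['mba', 'nda', 'ŋgu', 'ma'] (the original is kept as it was)
```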

For example, consider the data below from Noel Mann:

Raw word list data of 5 varieties.

Analysis for these data would be something like this:

  1. For the first word ‘I’ (first person singular), the form is unambiguously monosyllabic; thus no further analysis is required and these forms can be directly compared.
  2. For the words egg and seed, all of the varieties have generally similar forms. This disyllabic form is made up of a shared weak first syllable followed by one of two different major syllables. The weak syllable is lost in one instance, and including it would lead to an artificially lower lexical similarity percentage. The similarity between the forms for egg and seed may also indicate that the weak syllable is some sort of classification particle, which should be eliminated from the lexicostatistic comparison; thus only the major syllable is compared.
  3. The word forms for warm and sit are similar, with an apparent morphological vowel particle at the end of the word and an unusual form ending in a glottal stop in two of the words.
  4. Considering the word forms for leaf and root, it is apparent that there is an initial morpheme having a meaning relating to 'tree'. This morpheme provides supplemental semantic information that is not necessary to the core meaning of the major syllable. Including it in the comparison would lead to an artificially higher lexical similarity percentage. Thus, this morpheme is ignored.

Applying these basic steps, the data can be reduced to the roots by eliminating weak and supplemental syllables resulting in the following data:

Reduced data of 5 varieties.

Borrowings

If the purpose of your survey is to investigate current intelligibility, the best way to treat borrowings is to always keep borrowings from older languages but always drop borrowings from more recently introduced languages. For more, refer to the information by Nahhas about borrowings.

Comparing Words

The final product we are after is a lexical similarity percentage for each pair of varieties being compared. To get these, we need to compare word pairs and then calculate the proportion of similar pairs in each list. There are many ways to compare words and some are listed in this section.

Inspection Method

This is the easiest method but the least accurate. The researcher simply looks at the pairs and decides which ones are similar and which ones are not. It might be useful for a quick count, to get an idea of what a more thorough analysis might reveal. But it should never be mistaken for a thorough method, because it has no scientific criteria for judging relationships between words. It isn't reliable, because different linguists who use this method on the same set of words are likely to come to different conclusions.

Comparative Method

The most thorough way to decide which words are similar is through the use of the Comparative Method to establish which word pairs are cognate. In this case, if words are cognate, then they are considered to be lexically similar and the lexical similarity percentage can also be called a cognate percentage. However, as Frank Blair says in his 1990 book Survey on a Shoestring<ref name="blair">Blair, Frank. 1990. Survey on a Shoestring. Dallas: SIL.</ref>, “this process is often more time-consuming than a researcher on survey could desire. It also may require information not readily available to the surveyor.”

If the only use for the word list is to screen for low intelligibility, the Comparative Method may be too time-consuming and, as this method favours large lists of words, there may not be enough data from an intelligibility survey to apply it anyway. Surveyors should keep this end in mind when deciding how many words to collect. Collecting more data than you require will not only meet your needs adequately but can also better serve the greater linguistic community by enabling other types of analysis to be applied.

The Blair Method

A method which takes the middle ground between the Comparative Method and simple inspection has been developed in South Asia. It's based on the method outlined in Frank Blair's 1990 book Survey on a Shoestring<ref name="blair" /> and is thus called the Blair Method.

See the Blair Method page for detailed guidance on how comparisons are made between varieties.

Levenshtein Distance Method

See our Levenshtein Distance page for more info about this method.
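As a rough illustration only, Levenshtein distance counts the minimum number of insertions, deletions and substitutions needed to turn one form into another. The sketch below is a standard dynamic-programming version. Dividing by the length of the longer form is one common way to normalise, and is an assumption here rather than necessarily the procedure described on the Levenshtein Distance page. Passing each word as a list of phones means that a phone written with several characters still counts as one unit.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Distance divided by the longer length (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Invented forms, each given as a list of phones.
distance = levenshtein(["n", "a", "ŋ"], ["n", "a", "m"])
```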

Computing the Percentage

Whatever method is used to determine lexical similarity for word pairs, the method of computing the percentage is simple. Just divide the number of lexically similar items by the number of items you are comparing. In the example from the end of the Blair Method page, this would be 2 ÷ 3 ≈ 0.667, or 67%. Of course, you would never do this with only three words, but you get the idea. Suppose instead that you had compared 98 pairs of words and found that 74 of them were lexically similar. Then your lexical similarity percentage would be 74 ÷ 98 ≈ 0.755, or 76% when rounded to the nearest whole number.
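The same arithmetic can be written as a small Python function. The function name is made up for this illustration, and rounding to the nearest whole percent is assumed because that is how the figures above are reported.

```python
def lexical_similarity_percentage(similar, compared):
    """Percentage of compared items judged lexically similar, as a whole number."""
    if compared <= 0:
        raise ValueError("at least one item must be compared")
    if not 0 <= similar <= compared:
        raise ValueError("similar must be between 0 and compared")
    return round(100 * similar / compared)


small_example = lexical_similarity_percentage(2, 3)
larger_example = lexical_similarity_percentage(74, 98)
```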

The Blair Method is applied to language varieties pairwise. That is, if you have 10 varieties you are comparing, you have to apply the method to each of the 45 unique pairs. Sometimes it is faster to do some of the steps for many varieties at once.

Suppose again that you are comparing 10 varieties. For each item in the list, place each of the 10 words in a “correspondence set” (or “similarity set”). When items are clearly lexically similar (e.g. there are 5 phones and 4 of them are identical) or clearly not similar, there is no point in spending much time writing down which categories and sub-categories each segment is in. Just mark all the words that are clearly lexically similar with the letter “a”, indicating that they are in the same set. If there is more than one distinct set of similar words, use more letters to distinguish each set. Later, you can go back, work out the categories for the remaining, less clear words, and put them in the right sets.

Doing this in WordSurv or Excel can speed up the process of computing the lexical similarity percentages for all the possible pairs of varieties. WordSurv does it automatically, and Excel can be programmed to do so. Once you have computed the percentages for each pair of varieties, you can organize them in a matrix, such as Table 4.

Lexical similarity of some Bahnaric varieties in Cambodia (Julie Barr, SIL)

Notice that the comparison of a variety with itself is always 100%. Also, the upper right of the matrix is blank. That is because these cells, if filled in, would be redundant. It is helpful to arrange the matrix such that the languages that group together lexically are next to each other. WordSurv has an algorithm which will do this automatically.
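If the set letters are kept in a spreadsheet or similar, the pairwise percentages can be computed along the following lines. This is a minimal sketch, not a description of what WordSurv does: the data layout (one dictionary per item, mapping a variety name to its set letter) and the variety names are invented for illustration, and it assumes each word belongs to exactly one set. Items missing from either variety are left out of that pair's count.

```python
from itertools import combinations

# Invented example: one dict per elicited item, variety name -> set letter.
sets_by_item = [
    {"A": "a", "B": "a", "C": "b"},
    {"A": "a", "B": "b", "C": "b"},
    {"A": "a", "B": "a", "C": "a"},
    {"A": "a", "B": "a"},  # no word was elicited for this item in variety C
]


def similarity_matrix(sets_by_item, varieties):
    """Map each unordered pair of varieties to a whole-number percentage."""
    matrix = {}
    for v1, v2 in combinations(varieties, 2):
        compared = similar = 0
        for item in sets_by_item:
            if v1 in item and v2 in item:  # only items present in both lists
                compared += 1
                similar += item[v1] == item[v2]
        matrix[(v1, v2)] = round(100 * similar / compared) if compared else None
    return matrix


matrix = similarity_matrix(sets_by_item, ["A", "B", "C"])
# Ten varieties give 45 unique pairs, as stated above.
pairs_for_ten = len(list(combinations(range(10), 2)))
```

Because only unordered pairs are stored, the result corresponds to the lower-left half of the matrix; the diagonal (each variety compared with itself) is always 100% and is not computed.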

It is important, in constructing such a matrix, that all the percentages be based on roughly the same set of words. For more on why this is, see the Variability of a Word List Percentage page. WordSurv calculates such a thing, but there are good reasons to think it dubious.

Syllostatistics

Noel Mann has modified the Blair Method using what he calls syllostatistics. In Southeast Asia, many languages are monosyllabic, and comparing the corresponding onsets and rhymes of word pairs makes more sense than comparing individual phones. Thus, for a word like [naŋ], the comparison would consist of two segments, the onset [n-] and the rhyme [-aŋ], rather than three as in the Blair Method. See the Syllostatistics page for more.

References

<references />