Preparing a Word List

From SurveyWiki

Introduction

Before you can collect word list data, you need to have a list of words to collect! You also need to translate the list into the language of elicitation and pilot test the list to see if the words you have chosen (and their translations) are appropriate. Finally, you need to make sure to bring with you all the materials you need for the fieldwork.

While you read, you might find referring to the Field Guide Glossary useful.

1 Getting the Word List Together

Note: The information below and on subsequent word list pages has largely been made available by the kind permission of Ramzi W. Nahhas and Noel W. Mann from their document The Steps of Eliciting and Analyzing Word Lists: A practical guide of Payap University Graduate School, September 2006.

Word lists that are used around the world vary because they are used for different purposes, regions, cultures, or language families. If there is no standard list in the place where you work, or if you need to use a different list because conditions like the ones mentioned have changed, then you will want to adapt one that best suits your needs and circumstances. The following checklist will help you carry out the necessary background research you need to have an informed approach to creation of your word list and data collection and analysis:

  • Learn all of the variations of the names for the languages, dialects, and peoples you will be studying that are identified in Ethnologue and other sources.
  • Look for previous research or publications in this particular language or dialect cluster. Be sure to search according to variations of the language names discovered in the first step and keep track of any new names that come up. Check in-house for publications your organisation might have which might not be publicly available.
  • Visit all relevant libraries and look for previous studies of the language you're researching.
  • Make copies of maps that you find in reading and write in the margins any additional information or possible contradictions that you notice.
  • Record all available information about the sound systems of the lects in question and the larger language family to which they belong. Create a summary of this information that you can review before leaving on the survey trip and/or before taking the first couple of word lists.
  • Take notes on hypotheses or findings regarding the linguistic relationships of the lects in question, especially if they are the results of historical-comparative work.
  • If you find any word lists from the lects you are interested in or in related lects, make copies of them if copyright will not be infringed.
  • Note the names of any speakers mentioned who live outside the language area and the names of outsiders or organizations who work in or frequently visit the area. If possible, contact them for more information.

1.1 Information to Include

All word lists should contain, on their first page, basic information. You will need to include the following information about the list and the researcher:

  • date
  • language name
  • alternate language names
  • location of the language (including the wider geopolitical names, e.g. district, county)
  • origin of the word list (this may not be the place where it's elicited)
  • researcher's name

When writing down the location of the language group and the place where the list is being collected, be as detailed and informative as possible. Include the country, the province or region, and all relevant lower level political divisions as well as the name of the specific village.

You will also need to include information about the participant/s. However, bear in mind that you may be required to protect the identity of your participant/s. If this is the case for any reason, you should list the following details not on the front page of the word list but in a separate notebook, and assign a code to each participant or group of participants. On the word list front page, write the code to identify which participant/s the data relates to. Include the following information about the participant/s:

  • name
  • age
  • sex
  • place of birth
  • present residence
  • travel history (i.e. time spent away from the speech community)
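
The split between front-page information and protected participant details can be sketched as a pair of records. This is only an illustration, assuming you keep your notes digitally; the class names, field names, and the `P001`-style code format are hypothetical and not part of any standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Participant:
    """Personal details; kept in a separate notebook when identity must be protected."""
    name: str
    age: int
    sex: str
    place_of_birth: str
    present_residence: str
    travel_history: str


@dataclass
class WordListHeader:
    """First-page information about the list and the researcher."""
    collected_on: date
    language_name: str
    alternate_names: list[str]
    location: str
    origin: str
    researcher: str
    # Only codes appear here, never names (hypothetical convention).
    participant_codes: list[str] = field(default_factory=list)


def assign_code(notebook: dict[str, Participant], person: Participant) -> str:
    """Store the participant in the private notebook and return a neutral code."""
    code = f"P{len(notebook) + 1:03d}"
    notebook[code] = person
    return code
```

The header then carries only the code returned by `assign_code`, while the notebook that maps codes to people is stored separately.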

1.2 Words to Include or Exclude

Because you want to be able to compare your list of words with data from other word lists that have been done previously, it's important that your list includes some of the same words. This does not mean that you cannot change some words for your particular situation, but it does mean that you should include the Swadesh 100 word list or possibly the Swadesh 200. If you are studying a tonal language, be sure to include words that may help focus your data on tone and consider including words that tend to distinguish between language varieties in the language family you are studying.

There are good reasons to exclude certain words from your list. Here are some of them:

  • Non-local concepts
Don't include words for concepts that are unlikely to be part of the language. For example, there may be no words for “snow” or “apple.”
  • Multiple items with the same roots
For example, the pairs ‘bark’- ‘skin’, ‘hair’- ‘feather’, and ‘blood’- ‘red’. In each set, you might find that the local words are the same, or have the same root with different modifiers. Eliminate all but one from each set because eliciting more than one will not add any new data.
  • One-to-many word mappings
A single word list item might correspond to many different words in the local language. Such items will be difficult to elicit consistently unless you make the word list item more specific. For example, if you ask for the word for “to carry” in Southeast Asia, you might have a problem because many languages have a number of words for this action and no single generic word for it. They might have “to carry on one’s back”, “to carry on one’s shoulders”, etc. Another example is the pronoun paradigm. You may ask for their word for “I”, but the local language might have many different words for this depending on age, gender, status, etc. You might get the male pronoun in one language and the female pronoun in another, but you want to base lexical similarity judgments on words that were elicited for the same concept. Conversely, you might have more than one item on your word list that is commonly expressed by the same word in the local languages, for example, “woman” and “wife”. All but one of these items should be excluded from your word list.
  • Semantic range differences
For example, what you call “blue” might be what they call “green”. Their word for “arm” might include all the parts of the body from the shoulder to the end of the fingers (whereas in English, there are the words “arm” and “hand”). There might be “semantic shifts” where the meaning of a word in some varieties has changed. In such cases, the truly cognate word pairs would not be elicited by the same gloss. A different word could have been adapted (or borrowed) for the old meaning. Thus, there will be a seeming non-similarity caused by semantic shift.
  • Compound words
Eliminate words that in the local languages are typically compound words which include words already occurring elsewhere in the list. Such words do not add any new information. For example, the local word for “branch” might be “tree” + “arm”. Some nouns might just be a verb + nominalizer (or a verb might be a noun + verbalizer). If you already have the root word on the list and languages in this area typically use a nominalizer/verbalizer for that word, then it will not add any new information.
  • Taboo/Embarrassing words
Some words are not allowed to be spoken by certain people. Other words are spoken, but are embarrassing to say to a stranger. You do not want to make the subject feel uncomfortable.
  • Onomatopoeia
If words are derived from sounds, for example, if the word for ‘cat’ is just the sound a cat makes, then you could end up with words that are lexically similar but not because of any linguistic relatedness.
  • Elicitation difficulty
Eliminate any other words which you conclude are difficult to elicit consistently across the languages of the region. One possible reason is if the word in the language of elicitation is not well known and/or the concept is not easy to demonstrate in any other way (e.g. by a picture, or by acting).

You may need to adopt a frame to elicit specific types of items such as numbers and colors which might not ever be said in isolation. For example, if numbers are usually used with a noun, then you can use “one person” as the elicitation prompt rather than just “one.” You might also need a frame for eliciting tones. Often, tones are more comparable when they occur in the same environment.

If you want to exclude words based on the principles in this section before you collect data, then you will need to look at previous research and pilot test the word list. Problematic words that you did not know about until after the fieldwork could also be dropped in the analysis stage.

1.3 Elicitation Probes

The word glosses in the language of elicitation are called elicitation probes. They should meet the following criteria:

  • Use common words – the elicitation probe should use common vocabulary items that are likely to be known by second language speakers, including the person eliciting the list.
  • Fits the semantic range – the probe consistently elicits glosses that have the meaning that is intended by the original item. If necessary, limit the semantic range in order to get a consistent elicitation. In some languages, for example, “to carry on one’s back” is better than “to carry” since the latter could be expressed by many different words depending on the method of carrying.
  • Include directions for clarification – if the probe includes a word that can have more than one meaning, include some guidelines for specifying the desired meaning. For example, pronouns may have inclusive/exclusive distinctions. Be careful about the distinction between generic and specific words. For example, you might ask for their word for “animal” and get their word for one particular kind of animal (like “water buffalo”).

1.4 Word List Length

In a perfect world, you would collect all the words of a language! Of course in real life, the longer your word list, the more time you will have to spend eliciting and recording it, the more tired your participant will become, and the more time you will have to spend processing, analyzing, and archiving the data. But if you collect too few words, you may not have enough data for your analysis. Given the remote locations of some villages, you want to get all the data you need, or that others might want in the future, on your first visit. So how many words should you elicit?

For computing a lexical similarity percentage, you only need about 100 words. But it is still a good idea to collect more words. First, there may be problems in eliciting some words, or classes of words, and you will need to eliminate these words during the analysis. Secondly, given the difficulty of getting to the speakers in many cases, it is worth the extra effort to elicit more words for the sake of other researchers who might want to use the word list for other purposes (e.g. tone analysis, phonology, etc.). Finally, for the analysis, it is helpful to have more words than you will formally compare so that you have more opportunities to spot regular sound correspondences between language varieties. SIL language surveyors typically elicit 200-450 words for a word list.

If your purpose is to do historical reconstruction, however, you need many more words (at least 500, but 1,000, or even 5,000 words are better). You will always wish you had more words when doing historical reconstruction.

See Nahhas' comments about Simons' claim that higher numbers of words give higher accuracy of lexical similarity.

1.5 Participant Screening Questions

What you want to research is a language variety. But in fact, what you do is interview people. Therefore, it is crucial to make sure that your participants really represent the language variety you are interested in. When you meet a potential word list participant, ask some screening questions.

In general, a word list participant should be representative of their L1, the language variety you are studying. What this actually looks like will vary. For example, it might be that there has been a lot of migration due to civil unrest, or that everyone marries someone from somewhere else. But, typically, the following criteria will work to ensure that the person knows and uses the local dialect:

  • Born in that village
  • Grew up in that village
  • If they have lived elsewhere, it is not a significant amount of recent time because, if so, this influences loan words and fluency.
  • If you feel this subject represents a village other than the one you are in, note that. You could dismiss the subject, or go ahead and collect a word list if you are interested in that variety and may not be able to go there.
  • L1 was their first language
  • L1 is their best language
  • Both parents are L1 people from that village
  • Both parents spoke L1 to the subject as a child
  • Spouse is an L1 person from that village
  • Is the right age and gender for the population you want to sample. It is good to choose a specific age and gender combination and try to find a participant to match that combination. In this way, your word lists will be more comparable. Men and women tend to have different pitches and languages do change over time. The old might use a form that the young no longer use. Which you want, again, depends on your purpose. Also, young people might not know some words in their language due to lack of experience. You could always try to get a 40-50 year old man, for example.
  • Educated to the level you require for your sample.

The Screening Questions page gives an example of a suggested screening interview. There may also be things you can observe about a person which might exclude them before you even ask any of the screening questions. Make sure to translate the participant screening questions into the language of elicitation and pilot test them along with the word list.
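
As a rough illustration of how the screening answers could be tallied, the sketch below lists which of the typical yes/no criteria a potential participant does not meet. The dictionary keys and the `unmet_criteria` function are hypothetical names chosen for this example; the criteria about time spent elsewhere, age, gender, and education are left out because they depend on your purpose and need a judgment call.

```python
# Yes/no criteria from the list above; keys are hypothetical short names.
CRITERIA = {
    "born_in_village": "Born in that village",
    "grew_up_in_village": "Grew up in that village",
    "l1_first_language": "L1 was their first language",
    "l1_best_language": "L1 is their best language",
    "parents_l1_from_village": "Both parents are L1 people from that village",
    "parents_spoke_l1": "Both parents spoke L1 to the participant as a child",
    "spouse_l1_from_village": "Spouse is an L1 person from that village",
}


def unmet_criteria(answers: dict[str, bool]) -> list[str]:
    """Return the criteria a person does not meet; unanswered items count as unmet."""
    return [label for key, label in CRITERIA.items() if not answers.get(key, False)]
```

The function reports rather than rejects: since what counts as representative varies from place to place, the surveyor still decides whether an unmet criterion matters in a given community.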

1.6 Organizing the Word List

Group the words by semantic domain (e.g. body parts, numbers, action verbs, etc.). Also, indicate which words on the list are the highest priority to elicit. If you end up having only limited time in an area, you can make sure to get the most essential data first. It is best to print the word list ahead of time, rather than write it in a notebook, because it saves time and helps reduce errors. The printout should include the following for each item on the list (see the example word list):

  • Number
  • Elicitation probes in the following languages/scripts
    • English
    • Language(s) of elicitation (language of elicitation script)
    • Language(s) of elicitation (IPA transcription) (if you cannot read the script)
    • Related languages (IPA transcriptions from previous research)
  • Directions for clarifying a probe’s meaning (if necessary)
  • Blank columns or rows for the new varieties to be elicited
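
One way to produce such a printout is to generate it from a simple item list. The sketch below is an assumption about how you might store the items (dictionaries with hypothetical keys `domain`, `english`, `probe`, `clarification`, `high_priority`); it groups items by semantic domain, numbers them, marks priority items with `*`, and adds one blank column per variety to be elicited. Columns for IPA transcriptions of the probe or of related languages could be added in the same way.

```python
import csv
import io


def build_printout(items, varieties):
    """Return CSV text with items grouped by domain and blank columns per variety."""
    # Keep domains in the order they first appear in the item list.
    domain_order = []
    for item in items:
        if item["domain"] not in domain_order:
            domain_order.append(item["domain"])
    # sorted() is stable, so the original order is kept within each domain.
    ordered = sorted(items, key=lambda it: domain_order.index(it["domain"]))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(
        ["No.", "Domain", "English", "Probe", "Clarification", "Priority"] + list(varieties)
    )
    for number, item in enumerate(ordered, start=1):
        writer.writerow(
            [
                number,
                item["domain"],
                item["english"],
                item.get("probe", ""),
                item.get("clarification", ""),
                "*" if item.get("high_priority") else "",
            ]
            + [""] * len(varieties)
        )
    return out.getvalue()
```

The resulting CSV text can be opened in a spreadsheet and printed, with the blank columns filled in by hand during elicitation.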

1.7 Further Preparation

Before collecting the first list, familiarize yourself with the sound system of the language family to be surveyed and the pronunciation of your elicitation probes.

2 Site Selection

In some cases, it is very obvious where you should collect the word list. Perhaps there is clearly a main village for a particular variety that everyone recognizes as representing that variety. But often the point of the word list study is to begin the process of finding such dialect centers, or to begin the process of grouping varieties into languages. I say “begin the process” just to remind the reader that lexical similarity can only take you so far. Intelligibility testing and, more importantly, sociolinguistics play much more important roles. In these cases, site selection becomes a very important part of the study since your conclusions will be said to represent certain varieties.

In actual fact, your conclusions will only represent the sites you have visited, and you will need to be cautious as you generalise about the whole population from a limited sample. How do you know the sites you have chosen really represent the varieties you claim they do? Sometimes you cannot know without further research, but by applying good site selection principles, you can do the best you can.

2.1 Preliminary Visit

If you don't have enough information about the language area to be able to know how to select sites, then a preliminary visit is a good idea.

The aim is to get village-level information about each village and its neighbours, so you do not want to spend a lot of time in any one place. As the point is not to go “deep” in any one place, but to gather information about the possible sites, the sorts of tools you might use would include questionnaires. Resist the temptation to administer more in-depth survey tools during a preliminary trip. The deeper you go at any one site, the less time you have to visit more sites, and the whole point of a preliminary visit is to get information about as many sites as possible.

2.2 Site Selection Principles

You could either pick sites using random selection or pick them intentionally. While it is possible to use simple random sampling to choose the sites you will visit, in most cases your background research will reveal information that you can use in selecting sites more strategically. For example, you may learn that there are thought to be two dialect areas. In that case, you want to make sure to choose sites from each area. In general, you should use all the information at your disposal to choose sites and so random sampling should be reserved for when you lack any distinguishing information at all. If you have sites that are equally similar with respect to what you are measuring, you can choose randomly among those sites.

For your particular survey, some of the following guiding principles might be more important than others.

2.2.1 Grouping Sites

Put all the possible sites into distinct groups, where sites in the same group are, as far as you know, similar with respect to what you are measuring. Then select at least one site from each group.

You can judge similarity of sites based on related information. For example, language vitality is related to language contact, which is related to geographic location. So in a language vitality survey, sites could be grouped according to location. Here are some other criteria that you could use to group sites:

  • Bilingualism, Language Vitality, Language Use and Attitudes
Sites may differ in contact with the language of wider communication (LWC), presence of schools, markets, LWC people living in the village, remoteness, etc. So, for example, you would want to pick some sites that have schools and some sites that do not; or some sites that have only one ethnolinguistic group and some that are mixed.
  • Dialect Relationships
Sites may have different dialects. So you would want to pick sites from each dialect group. This could be based on background reading and dialect perception interviews.
  • Comprehension
Sites may differ not only in the dialect they speak, but also in the amount of contact they have with other sites. So, for example, within a dialect, you would want to pick some sites that have a lot of contact with other sites, and some that do not.
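
The grouping principle can be sketched as follows: assign each candidate site to a group, then pick at least one site per group, choosing randomly only among sites you consider equivalent. The function name, the `(site, group)` input format, and the `seed` parameter are assumptions made for this illustration.

```python
import random
from collections import defaultdict


def select_sites(sites, per_group=1, seed=None):
    """Pick up to `per_group` sites at random from each group.

    `sites` is an iterable of (site_name, group_label) pairs, where sites
    sharing a label are believed to be similar for what is being measured.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for name, group in sites:
        groups[group].append(name)
    return {
        group: rng.sample(names, min(per_group, len(names)))
        for group, names in groups.items()
    }
```

Passing a fixed `seed` makes the selection repeatable, which can help when documenting how sites were chosen. If you have reasons to prefer a particular site within a group, choose it intentionally instead of relying on the random pick.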

2.2.2 Select both central and peripheral sites

Central and peripheral are often defined in terms of geographical location, but not always. A site could be central or peripheral for geographic, social, cultural, economic, religious, historic, or political reasons. Peripheral can also be defined in terms of what you are trying to measure; for example, sites with the highest or lowest level of bilingualism.

While peripheral sites might represent the extremes of what you are measuring, central sites are not necessarily the sites that have an in between value for what you are measuring. Rather, they are central because they are more important or influential. For example, in studying bilingualism, the peripheral sites might be those with high or low bilingualism, while the central sites might just be the population centers, regardless of bilingual ability.

Choosing central sites allows you to find out information about the most influential sites. Choosing peripheral sites allows you to find out information about the extremes. Then you can assume that the other sites fall somewhere in between these extremes. Some examples of central and peripheral sites follow, categorized by what you might want to measure.

  • Language Vitality
A central site might be a site that has the largest population, or is the historic homeland. A peripheral site might be a site that has very high contact with the LWC, or perhaps a site that is very remote. The definitions of “central” and “peripheral” here could be switched depending on whether you suspect high or low vitality.
  • Linguistic Relatedness
A central site might be a site that is considered by the people to best represent a particular dialect. A peripheral site might be a site that is considered to be sort of in between two dialects, or one that is clearly part of one dialect, but is not the representative variety.
  • Comprehension
A central site might be a place with a variety that everyone else understands. A peripheral site might be a place with a variety that no one else understands.

In some cases, you could just pick peripheral sites. For example, if you suspect that there is low language vitality at all sites, then you might just select one or a few sites where you guess the language vitality is highest. If you find low vitality there, then you could assume that the others are even lower. In other cases, due to time constraints, you might only choose a central site. But then be aware that you may not have gained any information about the periphery.

2.2.3 Beware of convenience sampling!

A reason sometimes given by researchers for selecting certain sites is convenience. It is tempting to only visit the sites that are the easiest to get to. But are these sites representative of the whole population? If there is a group of sites that really are equivalent as far as you know, and the fact that one is more convenient to visit has nothing to do with what you are studying, then picking that site is fine. But in many cases this will not be true! Usually the more convenient sites are also the ones with more language contact, more wealth, more education, etc. and these all affect the language varieties spoken there. Such sites are more convenient for everyone, not just for language surveyors!

One example where convenience might be a legitimate reason for picking a site is if you have a contact at a particular site. Having a contact can greatly improve your ability to get good data since you already have a relationship in the community.

2.2.4 Check boundaries

If you are not sure how far the ethnolinguistic group you are surveying extends, then you might want to pick sites even beyond the known geographic boundaries. You might find that there are more villages than you thought! Or you might be able to confirm that the group only extends so far.

2.2.5 Modify site selection during fieldwork

Sometimes information gathered during the fieldwork will cause you to rethink your site selection. What information will influence your site selection depends on what you are trying to measure. Suppose you are trying to determine which dialects understand each other. If one of your survey tools includes questions about dialect perceptions, then the survey team can modify the site selection during the fieldwork based on the answers to these questions. Suppose you have grouped together a set of villages thinking they are all about the same linguistically, but subjects in the selected site identify another site in that group as speaking differently from them. You could then add that other site, since it no longer fits in the original group of similar sites.

2.2.6 Check assumptions

You will always have to make assumptions in order to select sites. State these assumptions, and why you think they are true, clearly in your Initial Plan. For example, “I am assuming the following three villages have the same language vitality so I am only going to visit one of them. I think this assumption might be true because...” Then make sure to select a few additional sites that you will visit if you have time in order to check your assumptions.

For example, if you are assuming that language vitality is the same in a set of three villages, and you only have time to visit one of them, then have in mind the possibility of checking this assumption by visiting one or both of the other sites if circumstances change and it turns out that you do have time. If you do not plan for this possibility, then you might not be prepared to take advantage of it should it arise.

2.3 Can You Generalize to the Target Population?

If you select sites well, the data you collect should provide you with information which represents the whole of the target population. This is possible with random selection, but only if it is perfectly random. Otherwise, you risk missing important sites, especially if you have limited time to do fieldwork. Choosing sites intentionally can ensure that important sites are not missed, but scope is always limited because you cannot select and visit every site.

Because of this, care must be taken to interpret the results of any survey. You should state your desired scope. This is also called your target population. Then, based on your site selection methods and your subject selection methods, determine what your actual scope is. This is also called your effective population. In the survey report, be careful to draw direct conclusions about only your effective population. As shown in some of the later sections, you might still be able to draw indirect conclusions about the target population by making some reasonable assumptions based on background research.

3 Pilot Testing the Word List

After you have printed out your draft word list, you must test it for problematic words. Some words may be problematic due to factors specific to a language family, while others may be problematic due to factors specific to a geographic region. Pilot testing the wordlist on a related language in the same region will help you discover problems you might not have thought of ahead of time. Make sure to pilot test the translated participant screening questions, as well.

One suggestion is to pilot test a word list from three varieties related to the varieties you will study. When eliciting a specific item, it should result in a response that is different from all the other items on the list, and should result in words in the three varieties that have the same meaning. If a particular gloss results in words with different meanings, then it is not a reliable prompt. Also, if the gloss results in the same word as another gloss, then one of these should be eliminated. See the next section for the protocol for eliciting a word list. Follow this protocol during the pilot test as well so that you are well-practiced when you go on the actual fieldwork trip.
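
The check for glosses that elicit the same word can be done mechanically once the pilot data is typed up. The sketch below assumes the responses are stored as a mapping from variety name to a `{gloss: elicited word}` dictionary; that layout and the function name are hypothetical. It only catches exact matches within a variety. Judging whether responses across varieties share the same meaning, or share a root with different modifiers, still needs a person.

```python
from collections import defaultdict


def duplicate_glosses(responses):
    """Map each variety to groups of glosses that elicited the identical word."""
    flagged = {}
    for variety, words in responses.items():
        by_word = defaultdict(list)
        for gloss, word in words.items():
            by_word[word].append(gloss)
        groups = [sorted(glosses) for glosses in by_word.values() if len(glosses) > 1]
        if groups:
            flagged[variety] = groups
    return flagged
```

A gloss pair flagged in most or all pilot varieties is a candidate for dropping one of the two items from the list.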

4 Making a Word List Book

Print and bind (with a plastic cover) the word list, including a number of copies of the participant screening questions (have 2-3 for each word list you intend to collect). This is the word list book.

5 Equipment Checklist

Before you go to elicit a word list, make sure you have all the equipment you need, including backups. Make sure to test all the equipment and also the backup equipment before you go! You want to make sure it works and that you know how to use everything. It is not good to have asked someone for their time and then be wasting it trying to get the equipment to work while they wait. Here's a basic checklist:

  • Word List Book (includes the participant screening questions and the word list).
  • IPA chart
  • 7 or more pens (at least 3 of one color and 2 of each of two other colors). Try to use pens with waterproof ink that will not smear or fade.
  • 2 audio recorders (e.g. cassette, MiniDisc, or MP3) (one is a backup)
  • Media for recorder (e.g. cassette tapes, minidiscs). You will need 25-35 minutes’ worth of recording space for each 100 items.
  • 2 unidirectional microphones (one is a backup)
  • 2 sets of headphones (one is a backup)
  • Spare batteries
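
The recording-space guideline of 25-35 minutes per 100 items scales linearly with list length. A minimal sketch of that arithmetic follows; the function name is hypothetical and the default range is taken from the checklist item above.

```python
def recording_minutes(item_count, minutes_per_100=(25, 35)):
    """Return the (low, high) estimate of recording minutes for a word list."""
    low, high = minutes_per_100
    return (item_count * low / 100, item_count * high / 100)


# A 300-item list needs roughly 75 to 105 minutes of recording space.
print(recording_minutes(300))  # (75.0, 105.0)
```

Multiply the upper figure by the number of word lists you plan to collect, and add a margin, when deciding how much media to pack.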


  • PDA, powered by Palm OS
  • Spare styli
  • PalmSurv software available online
  • PalmSurv word list template
  • Backup media (e.g. SD card, Memory Stick, Compact Flash card)
  • Spare batteries or 12V charger