Digital English Studies
Corpus analysis: references & key concepts
List of key concepts for understanding the basics of doing corpus linguistics in English studies.
Corpus linguistics
Concept | Explanation |
---|---|
absolute frequency | the number of times a particular piece of data or a particular value appears during a study; a simple count of the number of times a value is observed |
collocation | a co-occurrence relationship between two words; words are said to collocate with one another if one is more likely to occur in the presence of the other than elsewhere |
concordance | a display of every instance of a specified word or other search term in a corpus, together with a given amount of preceding and following context for each result or “hit” |
corpus | a collection of texts stored on a computer |
data-driven learning | a way of using corpora in language teaching that involves the learners being given direct access to the corpus and a tool for searching it, the intention being that their exploration of the corpus helps their learning of the language |
frequency distribution | information about frequency of use of a term across texts, speakers, etc. |
encoding | the process of representing a text as a sequence of characters in computer memory (e.g. UNICODE, UTF-8, ANSI) |
frequency list | a list of all the items of a given type in a corpus (e.g. all words, all POS-tags) together with a count of how often each one occurs |
KWIC | key word in context; a format for displaying a concordance where the search result is lined up in a central column, and the columns on either side contain a short chunk of the context preceding and following each result in the corpus; the standard abbreviation is KWIC; “key word” here means the search term |
lemma | a group of wordforms that are related by being inflectional forms of the same base word; e.g. in English destroy, destroys, destroying, destroyed are all part of the verb lemma destroy; the notion of a headword (as found in a dictionary) is generally equivalent to that of lemma |
n-gram | a sequence of n elements (usually words) that occur directly one after another in a corpus, where n is two or more; studying n-grams (also called clusters or lexical bundles) is one way to operationalise the analysis of collocation |
normalized frequency | same as relative frequency; a frequency expressed relative to some other value, as a proportion of the whole – e.g. frequency of a word relative to the total number of words in the corpus; normalized frequencies can be compared even if they arise from datasets of different sizes |
raw frequency | same as absolute frequency; the number of times a particular piece of data or a particular value appears during a study; a simple count of the number of times a value is observed |
register | a way of classifying texts according to non-linguistic criteria, such as the purpose for which a text was produced, the intended audience, the level of formality, whether its purpose is narration or description and so on |
relative frequency | same as normalized frequency; a frequency expressed relative to some other value, as a proportion of the whole – e.g. frequency of a word relative to the total number of words in the corpus; relative frequencies can be compared even if they arise from datasets of different sizes |
token | any single, particular instance of an individual word in a text or corpus |
type | a single particular wordform; any difference of form (e.g. spelling) makes a word into a different type; one type may occur many times in a text or corpus |
type-token ratio | a measure of vocabulary diversity in a corpus, equal to the number of types divided by the total number of tokens; the closer the ratio is to 1 (or 100%), the more varied the vocabulary |
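
Several of the concepts above (token, type, frequency list, relative frequency, type-token ratio, n-gram, KWIC) can be illustrated with a few lines of Python. The following is only a minimal sketch, not a replacement for a dedicated corpus tool: the function names, the two-sentence sample text, the naive tokenization rule (lowercase alphabetic strings) and the normalization base of 1,000 words are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens):
    """Map each type to its raw (absolute) frequency."""
    return Counter(tokens)


def relative_frequency(count, total, per=1000):
    """Normalize a raw count to a rate per `per` tokens (assumed base: 1,000)."""
    return count / total * per


def type_token_ratio(tokens):
    """Number of distinct types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


def ngrams(tokens, n=2):
    """Return every sequence of n directly adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, term, window=2):
    """Return (left context, hit, right context) for each occurrence of term."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence "corpus", used only for demonstration.
text = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(text)
freqs = frequency_list(tokens)

print("tokens:", len(tokens), "types:", len(freqs))
print("raw frequency of 'the':", freqs["the"])
print("per 1,000 words:", round(relative_frequency(freqs["the"], len(tokens)), 1))
print("type-token ratio:", round(type_token_ratio(tokens), 3))
print("most frequent bigrams:", Counter(ngrams(tokens, 2)).most_common(2))
for left, hit, right in kwic(tokens, "sat"):
    print(f"{left:>10} | {hit} | {right}")
```

Because the tokenizer lowercases everything, "The" and "the" count as one type here; a stricter reading of the definition of type above, where any difference of form creates a new type, would keep them apart. A lemma-based count would additionally need a lemmatizer, which this sketch does not attempt.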