Digital English Studies

Corpus analysis: references & key concepts

List of key concepts for understanding the basics of doing corpus linguistics in English studies. 

Corpus linguistics

Concept Explanation
absolute frequency same as raw frequency; the number of times a particular piece of data or a particular value appears during a study; a simple count of the number of times a value is observed
collocation a co-occurrence relationship between two words; words are said to collocate with one another if one is more likely to occur in the presence of the other than elsewhere
concordance a display of every instance of a specified word or other search term in a corpus, together with a given amount of preceding and following context for each result or “hit”
corpus a collection of texts stored on a computer
data-driven learning a way of using corpora in language teaching that involves the learners being given direct access to the corpus and a tool for searching it, the intention being that their exploration of the corpus helps their learning of the language
encoding the process of representing a text as a sequence of characters in computer memory (e.g. Unicode, UTF-8, ANSI)
frequency distribution information about frequency of use of a term across texts, speakers, etc.
frequency list a list of all the items of a given type in a corpus (e.g. all words, all POS-tags) together with a count of how often each one occurs
KWIC key word in context; a format for displaying a concordance where the search result is lined up in a central column, and the columns on either side contain a short chunk of the context preceding and following each result in the corpus; the standard abbreviation is KWIC; “key word” here means the search term
lemma a group of wordforms that are related by being inflectional forms of the same base word; e.g. in English destroy, destroys, destroying, destroyed are all part of the verb lemma destroy; the notion of a headword (as found in a dictionary) is generally equivalent to that of lemma
n-gram a sequence of n elements (usually words) that occur directly one after another in a corpus, where n is two or more; studying n-grams (also called clusters or lexical bundles) is one way to operationalise the analysis of collocation
normalized frequency same as relative frequency; a frequency expressed relative to some other value, as a proportion of the whole – e.g. frequency of a word relative to the total number of words in the corpus; normalized frequencies can be compared even if they arise from datasets of different sizes
raw frequency same as absolute frequency; the number of times a particular piece of data or a particular value appears during a study; a simple count of the number of times a value is observed
register a way of classifying texts according to non-linguistic criteria, such as the purpose for which a text was produced, the intended audience, the level of formality, whether its purpose is narration or description and so on
relative frequency same as normalized frequency; a frequency expressed relative to some other value, as a proportion of the whole – e.g. frequency of a word relative to the total number of words in the corpus; relative frequencies can be compared even if they arise from datasets of different sizes
token any single, particular instance of an individual word in a text or corpus
type a single particular wordform; any difference of form (e.g. spelling) makes a word into a different type; one type may occur many times in a text or corpus
type-token ratio a measure of vocabulary diversity in a corpus, equal to the number of types divided by the total number of tokens; the closer the ratio is to 1 (or 100%), the more varied the vocabulary
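
Several of the concepts above (token, type, frequency list, raw and relative frequency, type-token ratio, n-gram, KWIC) can be tried out on a toy example. The following is a minimal sketch in Python using only the standard library. The sample sentence, the function names (tokenize, ngrams, kwic) and the very naive tokenisation rule are assumptions made for illustration; they are not part of any particular corpus tool, and real corpus software tokenises far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n=2):
    """Return every sequence of n directly adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, keyword, width=3):
    """Return one (left context, hit, right context) triple per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical two-sentence "corpus", invented for this sketch.
sample = "The cat sat on the mat. The dog sat on the log."

tokens = tokenize(sample)        # every running word is one token
freq_list = Counter(tokens)      # frequency list: type -> raw frequency
n_tokens = len(tokens)
n_types = len(freq_list)         # number of distinct wordforms

ttr = n_types / n_tokens                   # type-token ratio
rel_the = freq_list["the"] / n_tokens      # relative frequency as a proportion
per_million = rel_the * 1_000_000          # the same value normalized per million words

bigram_counts = Counter(ngrams(tokens, 2))  # 2-grams and their raw frequencies

print(f"tokens: {n_tokens}, types: {n_types}, TTR: {ttr:.3f}")
print(f"'the': raw {freq_list['the']}, relative {rel_the:.3f}, per million {per_million:.0f}")
print("most frequent bigrams:", bigram_counts.most_common(2))

# KWIC display: the search term is lined up in a central column.
for left, hit, right in kwic(tokens, "sat", width=2):
    print(f"{left:>12}  {hit}  {right}")
```

Because the text is lowercased before counting, "The" and "the" are treated as the same type here. Without that step they would count as two types, since any difference of form creates a new type. Whether to lowercase is a design choice that depends on the research question.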