Description of the sessions

Statistics lectures (Harald Baayen)

The goal of this course is to familiarize students with a range of regression techniques for the analysis of a single response variable (e.g., reaction time, pupil dilation, pitch, or accuracy) that is modeled as a function of one or more predictors. Modeling techniques will be introduced conceptually, with an emphasis on worked examples of their application. The first lecture addresses collinearity in multiple regression: how should data with strongly correlated predictors be analysed? The second and third lectures introduce the generalized additive model (GAM), which relaxes the assumption that the functional relation between the response and one or more predictors is linear, and is therefore well suited to modeling wiggly curves and wiggly (hyper)surfaces. Model criticism, and tools for dealing with model residuals that are not independently and identically distributed, will also be introduced. The fourth lecture introduces quantile regression with GAMs, which enables the researcher to study not only how the mean of the response variable varies with the predictors, but also how its quantiles depend on them. Quantile regression can thus tease apart which factors dominate, e.g., short acoustic durations or short reaction times, and which are specifically influential for long durations or long reaction times. Each session will consist of a lecture followed by a hands-on lab session with worked examples.
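
As a rough illustration of the idea behind quantile regression (not of the GAM-based method taught in the course), the sketch below uses the check ("pinball") loss, whose minimizer is a quantile rather than the mean. The reaction times and the function names are invented for illustration, and the "model" is reduced to a single constant without predictors.

```python
# Hypothetical reaction times in ms; one slow response pulls the mean upward.
reaction_times = [420, 450, 480, 510, 900]


def pinball_loss(y, q, tau):
    """Check loss for one observation y, a candidate quantile q, and level tau."""
    diff = y - q
    # Observations at or above q are weighted by tau, those below q by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def best_constant_quantile(values, tau):
    """Return the observed value that minimizes the summed pinball loss."""
    return min(values, key=lambda q: sum(pinball_loss(y, q, tau) for y in values))


mean_rt = sum(reaction_times) / len(reaction_times)
median_rt = best_constant_quantile(reaction_times, 0.5)
upper_rt = best_constant_quantile(reaction_times, 0.9)
print(mean_rt, median_rt, upper_rt)
```

With these made-up values the mean lies above the median because of the single slow response, while the 0.9 quantile picks out the slow tail. Quantile regression lets each such quantile depend on predictors instead of being a constant.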


Baayen, R. H., and Divjak, D. (2017). Ordinal GAMMs: A New Window on Human Ratings. In Makarova, A., Dickey, S. M., and Divjak, D. (Eds.) Each Venture a New Beginning. Studies in Honor of Laura A. Janda. Bloomington, Slavica, 39-56.

Baayen, R. H. and Linke, M. (2019). Introduction to the generalized additive model. Manuscript, University of Tuebingen.

Baayen, R. H., Rij, J. van, De Cat, C., and Wood, S. N. (2018). Autocorrelated errors in experimental data in the language sciences: Some solutions offered by Generalized Additive Mixed Models. In Speelman, D., Heylen, K., and Geeraerts, D. (Eds.) Mixed Effects Regression Models in Linguistics, 49-69. Springer, Berlin.

Baayen, R. H., Vasishth, S., Kliegl, R., and Bates, D. (2017). The cave of Shadows. Addressing the human factor with generalized additive mixed models. Journal of Memory and Language, 206-234.

Roettger, T. B., Winter, B., and Baayen, R. H. (2018). Emergent data analysis in phonetic sciences: Towards pluralism and reproducibility. Journal of Phonetics, 73, 1-7.

Tomaschek, F., Hendrix, P., and Baayen, R. H. (2018). Strategies for addressing collinearity in multivariate linguistic data. Journal of Phonetics, 71, 249-267.

Modeling lectures (Harald Baayen)

Naive discriminative learning (NDL) is a computational implementation of central ideas of discriminative linguistics, a theory of language that is under development at the quantitative linguistics lab of the University of Tübingen. Instead of grounding language in a compositional calculus defined over phonemes and morphemes, discriminative linguistics takes inspiration from Shannon's information theory as well as the learning theory of Rescorla and Wagner. Discrimination, not composition, is taken to be fundamental to language and language processing. Discrimination is achieved through error-driven learning, with constant recalibration as experience accumulates over the lifetime. The first lecture introduces naive discriminative learning, which can be seen as an implementation of incremental logistic regression, and provides examples of how it can be used to understand morphological priming effects as well as differences in segment duration in spontaneous speech. The second lecture turns to linear discriminative learning (LDL), which can be viewed as incremental multivariate multiple linear regression. A model for the mental lexicon based on linear discriminative learning networks is introduced and used to study lexical processing of both real words and pseudowords. This model can produce and understand morphologically complex words without building on units representing morphemes (defined as the smallest linguistic elements combining form and meaning). The third lecture will illustrate the model's power for inflectional paradigms of several languages, including Latin, Polish, Classical Hebrew, and Estonian. Software for both NDL and LDL (R and Python packages) is introduced, and participants will receive some training in the use of these packages.
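
The error-driven learning mentioned above can be sketched with a simplified Rescorla-Wagner update. This is a toy illustration under stated assumptions, not the interface of the NDL or LDL packages: the salience parameters are collapsed into a single `learning_rate`, the maximum associative strength is fixed at 1, and the letter-bigram cues and the two outcomes are invented for the example.

```python
from collections import defaultdict


def rescorla_wagner(events, learning_rate=0.1, epochs=100):
    """Learn cue-to-outcome weights with a simplified Rescorla-Wagner rule."""
    outcomes = sorted({o for _, present in events for o in present})
    weights = defaultdict(float)  # (cue, outcome) -> association strength
    for _ in range(epochs):
        for cues, present in events:
            for outcome in outcomes:
                # Target is 1 if the outcome occurs in this event, else 0.
                target = 1.0 if outcome in present else 0.0
                predicted = sum(weights[(cue, outcome)] for cue in cues)
                error = target - predicted
                # Every cue present in the event shares the same correction.
                for cue in cues:
                    weights[(cue, outcome)] += learning_rate * error
    return weights


def activation(weights, cues, outcome):
    """Summed support that a set of cues provides for one outcome."""
    return sum(weights[(cue, outcome)] for cue in cues)


# Hypothetical learning events: letter bigrams as cues, meanings as outcomes.
hand = ["#h", "ha", "an", "nd", "d#"]
land = ["#l", "la", "an", "nd", "d#"]
events = [(hand, {"HAND"}), (land, {"LAND"})]

weights = rescorla_wagner(events)
print(round(activation(weights, hand, "HAND"), 3))
print(round(activation(weights, hand, "LAND"), 3))
```

In this toy setting the cues that distinguish the two words (`#h`, `ha` versus `#l`, `la`) end up with stronger weights than the shared cues (`an`, `nd`, `d#`), which is the sense in which learning is discriminative.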


Arnold, D., Tomaschek, F., Sering, K., Lopez, F., and Baayen, R.H. (2017). Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit. PLoS ONE 12(4): e0174623, 1-16.

Baayen, R. H., Milin, P., and Ramscar, M. (2016). Frequency in lexical processing. Aphasiology, 1174-1220.

Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270.

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., and Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39.

Geeraert, K., Newman, J., and Baayen, R. H. (2017). Idiom variation: Experimental data and a blueprint of a computational model. In Christiansen, M., and Arnon, I. (Eds.) More than Words: The Role of Multiword Sequences in Language Learning and Use. Special issue of Topics in Cognitive Science, 9, 653-669.

Linke, M., Bröker, F., Ramscar, M., and Baayen, R. H. (2017). Are baboons learning "orthographic" representations? Probably not. PLoS ONE, 12 (8):e0183876.

Shafaei-Bajestan, E. and Baayen, R. H. (2018). Wide Learning for Auditory Comprehension. In Yegnanarayana, B. (Chair) Proceedings of Interspeech 2018, 966-970. Hyderabad, India: International Speech Communication Association (ISCA).

Trawling in the Unknown: Exploratory Statistical Approaches in Learner Corpus Research (Ilmari Ivaska)

Statistically oriented research designs are often described as either confirmatory or exploratory in nature. Confirmatory approaches are structured around testing a hypothesis by means of theoretically motivated variables, whereas exploratory approaches are (corpus- or) data-driven: they are used to describe the typical tendencies of a dataset, and they often operate with a large number of potentially interesting variables. When exploratory techniques are applied in learner corpus research, the goal could be to detect co-occurrence patterns of linguistic features that characterize the differences between native and non-native writing, or to search for the most consistent grammatical differences between learners with different language backgrounds, thus revealing potential candidates for crosslinguistic influence. In this session, we will familiarize ourselves with two exploratory methods, Exploratory Factor Analysis (EFA) and Statistical Keyness Analysis. We will apply them to learner corpus data to see what they can tell us, and discuss what they do not tell us.
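
To make the idea of keyness concrete, the sketch below compares the frequency of one feature in two corpora. The session description does not say which keyness metrics will be used, so the choices here are assumptions: a two-cell log-likelihood statistic (G2) as a significance-style measure and the binary log of the ratio of relative frequencies as an effect size. All counts and names are invented.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-cell log-likelihood (G2) for one feature in corpora A and B."""
    total = freq_a + freq_b
    # Expected counts if the feature were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def log_ratio(freq_a, size_a, freq_b, size_b):
    """Binary log of the ratio of relative frequencies (needs nonzero counts)."""
    return math.log2((freq_a / size_a) / (freq_b / size_b))


# Hypothetical counts of one construction in learner and native writing.
learner_freq, learner_size = 120, 10_000
native_freq, native_size = 40, 10_000

g2 = log_likelihood(learner_freq, learner_size, native_freq, native_size)
lr = log_ratio(learner_freq, learner_size, native_freq, native_size)
print(round(g2, 2), round(lr, 2))
```

The two numbers answer different questions: G2 grows with corpus size as well as with the size of the difference, whereas the log ratio reflects only how much more frequent the feature is in one corpus than in the other.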


Gabrielatos, Costas. 2018. Keyness analysis: nature, metrics and techniques. In C. Taylor & A. Marchi (eds.), Corpus Approaches to Discourse: A critical review, 225–258. Oxford: Routledge.

Kruger, Haidee & Bertus van Rooy. 2018. Register variation in written contact varieties of English. English World-Wide 39(2), 214–242.

Comparability Paradox and Choosing the Unit of Observation in Learner Corpus Research: Grappling with the unavoidable (Ilmari Ivaska)

Corpus linguistics research contrasting two or more language varieties, such as learner corpus research, inevitably has to reconcile two theoretical and methodological positions that stand in opposition to each other. On the one hand, the two (or more) datasets compared should be as similar to each other as possible, ideally diverging in only one respect: the variable whose influence is being investigated. On the other hand, in order to maximize the representativeness of the data and thus the generalizability of results, the datasets should contain as much variation as possible. The paradox is related to the unit of observation: are we interested in the behavior of certain kinds of constructions or in that of certain kinds of texts? In this session, the core questions tackled are “What constitutes an observation?” and “What do we count when we count?” We will contrast the construction-based and the text-based approaches to the unit of observation and discuss the methodological considerations related to them in light of learner corpus research. We will use learner corpus data to showcase the pros and cons of the two approaches and discuss the kinds of answers obtained using different techniques.
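
A small numerical sketch may help show why the choice of unit matters. It assumes a made-up corpus of three texts and contrasts a pooled rate, where every occurrence counts equally and long texts dominate, with a mean of per-text rates, where every text counts equally. The data and function names are hypothetical and stand in for whatever measures the session actually uses.

```python
# Hypothetical corpus: (text id, number of tokens, occurrences of a construction).
texts = [("t1", 1000, 2), ("t2", 1000, 4), ("t3", 8000, 80)]


def pooled_rate(texts, per=1000):
    """Occurrences per `per` tokens, pooling all texts into one sample."""
    total_tokens = sum(tokens for _, tokens, _ in texts)
    total_hits = sum(hits for _, _, hits in texts)
    return per * total_hits / total_tokens


def mean_text_rate(texts, per=1000):
    """Average of the per-text rates, treating each text as one observation."""
    rates = [per * hits / tokens for _, tokens, hits in texts]
    return sum(rates) / len(rates)


pooled = pooled_rate(texts)
per_text = mean_text_rate(texts)
print(round(pooled, 2), round(per_text, 2))
```

Here the single long text drives the pooled rate well above the per-text mean, so the two ways of counting give different answers for the same data.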


Biber, Douglas. 2012. Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory 8(1), 9–37.

Leech, Geoffrey. 2006. New resources, or just better old ones? The Holy Grail of representativeness. In N. Nesselhauf & C. Biewer (eds.), Corpus Linguistics and the Web (Language and Computers 59), 133–149. London: Brill.