Automated processing of an English learner corpus: the case of this and that

Abstract: In this paper, we address the automatic annotation of English learner corpus data. Our project explores a method for automatically annotating any learner corpus of English in order to support multi-corpus linguistic analysis of errors. We first apply traditional automatic POS annotation to a learner corpus. We then go one step further and use the resulting token-tag pairs to identify specific features that play a role in distinguishing expected from unexpected uses of language forms. As a comprehensive search for the features of all forms would be too vast, we limit our experiment to the demonstratives this and that. We can then use their linguistic features to automate an error classification process in the learner corpus.

Corpus linguistics applied to the domain of second language acquisition (SLA) has evolved rapidly over the last decade. Initially annotated with native English POS annotation schemes, learner corpora were then examined in relation to errors, which led to manual annotation with specific fine-grained, error-oriented tagsets (Dagneaux et al 1998, Diaz Negrillo 2009). However, because of high volumes of data and inter-annotator variation, such methods can become cumbersome. Over the same period, algorithms based on probabilistic methods have been developed and show satisfactory performance in automatic POS tagging of non-native data (De Haan 2000, van Rooy et al 2003).

In our experiment, we use two corpora. The first is the Diderot-Longdale corpus, a spoken English learner corpus that is a subset of the Longdale corpus (Meunier et al 2008). The second is the Penn Treebank POS-tagged Wall Street Journal (WSJ) corpus (Charniak et al 1987). Our protocol has two phases. In the first phase, we use TreeTagger (Schmid 1994) to POS tag our learner corpus. In doing so, we extend the Penn Treebank tagset (Marcus 1993) used by the tagger, because its lack of granularity prevents the characterisation of all possible uses of that and this.
For instance, in the current version of the tagset, that may only be assigned four tags: DT for determiner, IN/that for subordinator, WDT for relative pronoun, and RB for adverbial. These tags do not distinguish proforms. In the second phase, we use a training corpus composed of data from the WSJ and the Diderot-Longdale, tagged respectively with the initial and our refined version of the Penn Treebank tagset. We train a memory-based learner program (TiMBL, Daelemans et al 2010) with features based on the token-tag pairs described above, so as to classify expected and unexpected uses of this and that. Our experiment shows that automatic fine-grained annotation of this and that in a learner corpus is possible: it not only provides descriptive grammatical information via POS tags, but also separates unexpected from expected uses. The method can then be applied to other corpora, allowing for cross-corpus comparison of data or, in other words, corpus interoperability.
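The second phase can be illustrated with a minimal sketch in Python. It is not TiMBL and not the feature set used in the experiment, which the abstract does not specify. It assumes a hypothetical window of two POS tags on each side of the demonstrative, an unweighted overlap distance, and a plain nearest-neighbour vote, which simplifies memory-based learning. The function names, the two training sentences, and their expected/unexpected labels are all invented for illustration.

```python
from collections import Counter

TARGETS = {"this", "that"}


def window_features(tagged, i, size=2):
    """Build a feature tuple for position i: the POS tags of the
    neighbours within `size` positions, plus the target token and its tag.
    Positions outside the sentence are padded with "_"."""
    feats = []
    for offset in range(-size, size + 1):
        j = i + offset
        if offset == 0:
            token, tag = tagged[i]
            feats.extend([token.lower(), tag])
        elif 0 <= j < len(tagged):
            feats.append(tagged[j][1])
        else:
            feats.append("_")
    return tuple(feats)


def extract_instances(tagged, size=2):
    """Return one feature tuple per occurrence of this/that in a sentence."""
    return [window_features(tagged, i, size)
            for i, (token, _) in enumerate(tagged)
            if token.lower() in TARGETS]


def overlap_distance(a, b):
    """Count the feature positions where two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(memory, feats, k=1):
    """Majority label among the k stored instances closest to `feats`."""
    ranked = sorted(memory, key=lambda item: overlap_distance(item[0], feats))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data (invented): token-tag pairs plus a usage label.
train = [
    ([("I", "PRP"), ("like", "VBP"), ("this", "DT"),
      ("book", "NN"), (".", "SENT")], "expected"),
    ([("I", "PRP"), ("like", "VBP"), ("that", "DT"),
      ("books", "NNS"), (".", "SENT")], "unexpected"),
]
memory = [(feats, label)
          for sentence, label in train
          for feats in extract_instances(sentence)]

# A new tagged sentence to classify (also invented).
query = [("We", "PRP"), ("read", "VBP"), ("that", "DT"),
         ("papers", "NNS"), (".", "SENT")]
prediction = classify(memory, extract_instances(query)[0])
print(prediction)  # unexpected
```

In this toy setting the query matches the stored "that + plural noun" instance on every feature, so it receives that instance's label. A real setup would rely on the refined tagset and on TiMBL's own distance metrics and feature weighting rather than this unweighted count.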
Contributor: Nicolas Ballier
Submitted on: Tuesday, 8 December 2015 - 12:29:52
Last modified on: Thursday, 11 January 2018 - 02:08:08


  • HAL Id : hal-01239864, version 1


Thomas Gaillat, Pascale Sébillot, Nicolas Ballier. Automated processing of an English learner corpus: the case of this and that. ICAME 33 2012: Corpora at the centre and crossroads of English linguistics, May 2012, Louvain, Belgium. 2012. 〈hal-01239864〉


