Paper abstract (round table on Lexicométrie et corpus multilingues: lexicometry and multilingual corpora)

Abstract : Beyond automatic parallel text alignment, which is now well known to our scientific community, this panel session focuses on how to extend statistical techniques in order to explore multilingual textual data. For parallel corpora, new tools and methodologies have emerged. Processing comparable corpora (i.e. corpora made up of similar texts that are not translations of one another) is also a significant challenge. Textual statistics developed for monolingual corpora can be adapted to this new type of data. Furthermore, some corpora are written in languages that raise new issues for textual statistics software: for example, the handling of character encoding, the tokenisation of the corpus into meaningful word-like units, and the definition of clear and coherent linguistic annotation schemes. International standards have recently been published and others are in preparation. They provide effective guidelines for encoding corpora and linguistic resources. Because they take into account the genuine diversity of the world's languages, these standards make textual data comparable and reusable.
Document type :
Other publications

https://hal-univ-diderot.archives-ouvertes.fr/hal-01224679
Contributor : Maria Zimina
Submitted on : Wednesday, November 4, 2015 - 11:02:16 PM
Last modification on : Friday, January 4, 2019 - 5:33:30 PM

Identifiers

  • HAL Id : hal-01224679, version 1

Citation

Maria Zimina. Résumé de communication (table-ronde Lexicométrie et corpus multilingues). 2004, pp.1203-1206. ⟨hal-01224679⟩
