Building an electronic combinatory dictionary as a writing aid tool for researchers in biology

Abstract : The present paper reports on a method of exploring the combinatorial properties of terms belonging to a specific field of biology, yeast biology, based on the analysis of a corpus of scientific articles. This research has led to the production of a writing aid tool meant to help non-native authors write scientific papers in English. The tool meets the needs of young French researchers, who are required to publish in English as early as the post-doctoral level. The imperative that governs a researcher’s career nowadays, “Publish in English or perish”, is made all the more daunting by the lack of both specialised dictionaries and teaching materials targeted at these needs. In order to better estimate our users’ needs, we sent a questionnaire to the teaching and research staff of the Life Sciences Department at the University Paris Diderot. Analysis of the results showed that almost 96% of all scientific publications are written directly in English and that 90% of respondents use other scientific articles as a writing aid. What they look for in published articles is, in equal measure, scientific information and phraseological information such as obligatory prepositions, connectors and terminological collocations ([to] clone, express, cut, carry a gene), but also collocations belonging to “general scientific language” (Pecman, 2004), such as to strengthen, reinforce, support a hypothesis. Mastering phraseological information is one of the ways in which a scientist demonstrates membership of a scientific community. In order to extract terminological collocations specific to yeast biology, as well as collocations belonging to general scientific language, we built a specialised corpus composed of research articles on yeast biology, selected with the help of biologists working at the University Paris Diderot.
We have thus gathered a large working corpus of over 5.5 million words, which we have POS-tagged and parsed using the Stanford dependency parser (Marneffe, 2006). In the first research stage (reported in this article) we focused on restrictive collocations, for which we adopted the following working definition, drawing on a number of defining features discussed, among others, in Hausmann (1989), Benson (1986) and Lin (1998): restrictive collocations are recurrent binary combinations whose members are in a direct syntactic relation. As the orientation of the collocation (between the base and the collocate) is parallel to that of the syntactic dependency, we adopted a hybrid automatic collocation extraction method similar to that of Lin (1998) or Kilgarriff & Tugwell (2001), based on the dependency parsing of our corpus. We first extract co-occurrences of items in a given syntactic relation. Unlike the methods cited above, we do not pre-define the syntactic patterns we are interested in, but rather eliminate a number of auxiliary relations (such as negation or determination, although these should be subject to further investigation) and examine all remaining syntactic relations. The method uses a common association measure, mutual information, to rank the co-occurrences extracted on the basis of syntactic patterns, and a few additional heuristics (frequency and coverage) to distinguish collocations from free combinations and from a single author's idiosyncrasies. Using this hybrid method we extracted collocations occurring at least three times in the corpus, in at least three different documents, recording at the same time their frequency of occurrence, the number of documents in which they appeared and the mutual information of the co-occurrence.
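The filtering and ranking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the parsed corpus has already been reduced to (document id, relation, head, dependent) tuples, that the skipped auxiliary relations carry the labels "neg" and "det", and that mutual information is computed pointwise within each relation. The function name, the field names and these labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumed labels for the auxiliary relations left out of the analysis.
SKIPPED_RELATIONS = {"neg", "det"}


def extract_collocations(triples, min_freq=3, min_docs=3):
    """Rank dependency co-occurrences by pointwise mutual information.

    triples: iterable of (doc_id, relation, head, dependent) tuples.
    A pair is kept only if it occurs at least min_freq times and in at
    least min_docs different documents.
    """
    pair_freq = Counter()          # (relation, head, dependent) -> count
    head_freq = Counter()          # (relation, head) -> count
    dep_freq = Counter()           # (relation, dependent) -> count
    rel_total = Counter()          # relation -> count
    pair_docs = defaultdict(set)   # (relation, head, dependent) -> doc ids

    for doc_id, rel, head, dep in triples:
        if rel in SKIPPED_RELATIONS:
            continue
        pair_freq[(rel, head, dep)] += 1
        head_freq[(rel, head)] += 1
        dep_freq[(rel, dep)] += 1
        rel_total[rel] += 1
        pair_docs[(rel, head, dep)].add(doc_id)

    results = []
    for (rel, head, dep), freq in pair_freq.items():
        n_docs = len(pair_docs[(rel, head, dep)])
        if freq < min_freq or n_docs < min_docs:
            continue  # too rare, or confined to too few documents
        # PMI within the relation: log2( f(h,d) * N / (f(h) * f(d)) )
        mi = math.log2(
            freq * rel_total[rel]
            / (head_freq[(rel, head)] * dep_freq[(rel, dep)])
        )
        results.append({"relation": rel, "head": head, "dep": dep,
                        "freq": freq, "docs": n_docs, "mi": mi})

    # Strongest associations first.
    results.sort(key=lambda r: r["mi"], reverse=True)
    return results
```

The document-count threshold plays the role of the coverage heuristic: a pair repeated many times by one author in a single article does not pass, however frequent it is.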
The results of this extraction process were included in an electronic dictionary, a preliminary version of which may be consulted online (http://ytat2.ijm.jussieu.fr/LangYeast/LangYeast_index.html). Choosing an electronic dictionary format has allowed us to use both bases and collocates as entries in the dictionary and to supply one illustrative example for each candidate collocation. The preliminary version of the tool contains the combinatory profiles of 2810 nouns, 1034 verbs and 1334 adjectives, comprising more than 78,000 collocations. Several improvements to the dictionary may be envisaged: among other things, a lexicographical validation of the dictionary entries (selected mostly on frequency criteria), more illustrative examples for each entry and, most importantly, a way of presenting results that is better adapted to the end users of our dictionary, for whom notions such as “governor”, “argument” or “nominal modification” are irrelevant. The writing aid tool we wish to supply for biologists writing in English as a second language will be extended in two research directions which we have begun to explore. On the one hand, we wish to extend our analysis to the argument structure of a number of specialised verbs, taking into account syntagmatic constraints on verb arguments. These structures, also extracted from the dependency parses of the corpus by analysing all dependency relations involving the verb, should provide biologists with a clearer picture of verb usage, of which collocational analysis can provide only a partial view. On the other hand, we envisage extending our analysis from restricted collocations (which we define as binary recurrent combinations) to larger collocational complexes (cf. Howarth, 1996) or usage patterns (such as idiomatic formulae specific to scientific discourse).
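As an illustration of how an electronic format lets both bases and collocates serve as entries, the following minimal sketch indexes each collocation record under both of its members. The record fields ("base", "collocate", "relation", "example") and the function name are hypothetical, and the example sentence is invented for the sketch; none of this reflects the actual structure of the online dictionary.

```python
from collections import defaultdict


def build_entries(collocations):
    """Index every collocation under both its base and its collocate.

    collocations: iterable of dicts with (hypothetical) keys
    "base", "collocate", "relation" and "example".
    Returns a dict mapping each headword to its list of collocations.
    """
    entries = defaultdict(list)
    for record in collocations:
        entries[record["base"]].append(record)
        # Avoid listing the same record twice under one headword.
        if record["collocate"] != record["base"]:
            entries[record["collocate"]].append(record)
    return dict(entries)


# Invented record, for illustration only.
records = [
    {"base": "gene", "collocate": "clone", "relation": "dobj",
     "example": "We cloned the gene into a plasmid."},
]
entries = build_entries(records)
```

A user can then look up either "gene" or "clone" and reach the same collocation together with its example sentence, which a printed dictionary organised by base alone would not allow.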
Finally, we envisage using the dictionary we have developed as well as the corpus from which it is derived in English for Special Purposes courses for biologists and building teaching materials derived from these resources.
Document type :
Book sections

https://hal-univ-diderot.archives-ouvertes.fr/hal-01217595
Contributor : Natalie Kübler
Submitted on : Monday, October 19, 2015 - 6:24:09 PM
Last modification on : Friday, January 4, 2019 - 5:33:30 PM

Identifiers

  • HAL Id : hal-01217595, version 1

Citation

Alexandra Volanschi, Natalie Kübler. Building an electronic combinatory dictionary as a writing aid tool for researchers in biology. Granger, S.; Paquot, M. Lexicography in the 21st Century: New Applications, ⟨Presses Universitaires de Louvain⟩, pp.343-355, 2010. ⟨hal-01217595⟩
