UAM MT Project

"Sistema de Gestión de Publicaciones Técnicas en Eurotech"
(FIT-350101-2005-19)

Principal Investigators (UAM): Mick O'Donnell, Susana Murcia

Research Assistants: Ismael Pascual Nieto, Irene Eleta, Jose Maria Martinez, Cesar Dante, Fernando Maquedano

Project Description

This project was funded by the Ministerio de Industria, Turismo y Comercio from January 2006 until March 2007. The project, under the PROFIT call, was a cooperation between Seinet, Eurotech and UAM to make Seinet's content management system, Xtent, more suitable for working with technical publications such as those used at Eurotech. The role of UAM was to integrate automatic translation software into Xtent.

However, one of our first results was the discovery that none of the available MT systems were suitable for this purpose: they either did not function in server mode (a requirement of the project), or if they did, their cost was well outside the budget.

For this reason, we redirected our component of the project towards developing our own MT system. Our eventual goal is to build a complete English-Spanish MT system for technical documentation. Only some of the necessary tasks were completed while the project ran. During that time, we developed the following software:
  • Sentence Alignment Software: software to determine which sentences in parallel corpora are translations of each other.
  • Translation Dictionary Extraction Software: software to create a translation dictionary from a parallel corpus.
  • Word Alignment Software: software to determine which words correspond to each other in a pair of aligned sentences.
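To make the dictionary extraction step concrete, here is a minimal sketch of one common approach: scoring word pairs by how often they co-occur in aligned sentences, using the Dice coefficient. This is an illustrative stand-in, not the method implemented in the project; the function name `extract_dictionary`, the `min_score` threshold and the toy corpus are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def extract_dictionary(pairs, min_score=0.5):
    """Build a toy translation dictionary from sentence-aligned text.

    pairs: list of (source_sentence, target_sentence) strings that are
    assumed to be already aligned. Returns {source_word: (target_word, score)}.
    Hypothetical helper for illustration only.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word once per sentence (whitespace tokenisation).
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word in the pair.
        joint.update(product(src_words, tgt_words))

    best = {}
    for (s, t), n in joint.items():
        # Dice coefficient: 2 * joint count / (source count + target count).
        score = 2 * n / (src_count[s] + tgt_count[t])
        if score >= min_score and score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return best


# Toy sentence-aligned corpus (invented for this example).
corpus = [
    ("close the pump", "cierre la bomba"),
    ("open the pump", "abra la bomba"),
    ("close the door", "cierre la puerta"),
    ("open the door", "abra la puerta"),
]
dictionary = extract_dictionary(corpus)
```

The same scores could also drive a simple word aligner: within an aligned sentence pair, each source word is linked to the target word with which it has the highest score. A real system would need proper tokenisation, handling of multi-word terms, and far more data than this toy corpus.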
Apart from these automatic routines, we developed an interface to allow humans to edit the automatically produced resources:
  • Sentence Alignment Tool: Allows a human editor to inspect sentence alignment of a file and correct errors.
  • Word Alignment Tool: Allows a human editor to inspect word alignment of a file and correct errors. The changes are fed back into the translation dictionary.
Additional resources were also developed:
  • A large syntactic dictionary of English: containing more than 200,000 unique surface terms.
  • A large syntactic dictionary of Spanish: containing approximately 200,000 surface terms.
  • A corpus of technical documentation (100,000 words), annotated with semantic and syntactic information. Part of our goal is to explore how particular 'speech functions' in technical documents are translated, e.g. giving information, giving directions, warnings. We wish to show that the way text is translated needs to take into account what speech function it performs.
  • Description of Simplified Technical English and Spanish: a text report bringing together resources which describe STE and STS.

Future Work

We are currently extending the system. Firstly, we are building a GUI which will be available to researchers working with parallel corpora, for use in dictionary creation, sentence alignment and word alignment. Secondly, we have developed software for Named Entity Recognition, which needs to be incorporated into the current system (so that named entities are aligned as phrases, not as single words). Our eventual goal is to produce sentence patterns by extracting the NPs and adverbs from sentences, and normalising for tense/aspect/number. Our system will thus produce a Translation Memory consisting of a set of source language sentence patterns and the corresponding sentence patterns in the target language. These sentence patterns will then form a resource for automatic translation of new texts.
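As a rough illustration of what a sentence-pattern pair might look like, the sketch below replaces already-aligned noun phrases with numbered slots in both sentences. It assumes the NP alignments are given as input and leaves out adverb extraction and tense/aspect/number normalisation; the function name `make_pattern_pair`, the `<NPn>` slot notation and the example sentences are hypothetical and not taken from the project.

```python
def make_pattern_pair(src, tgt, aligned_phrases):
    """Turn an aligned sentence pair into a pair of sentence patterns.

    aligned_phrases: list of (source_phrase, target_phrase) tuples that
    are assumed to occur in src and tgt respectively. Each pair is
    replaced by the same numbered slot in both sentences.
    Hypothetical helper for illustration only.
    """
    for i, (s_phrase, t_phrase) in enumerate(aligned_phrases, start=1):
        slot = f"<NP{i}>"
        # Replace only the first occurrence of each phrase.
        src = src.replace(s_phrase, slot, 1)
        tgt = tgt.replace(t_phrase, slot, 1)
    return src, tgt


# Invented example sentence pair with two aligned noun phrases.
pattern = make_pattern_pair(
    "Remove the fuel filter from the engine housing",
    "Retire el filtro de combustible de la carcasa del motor",
    [
        ("the fuel filter", "el filtro de combustible"),
        ("the engine housing", "la carcasa del motor"),
    ],
)
# pattern == ("Remove <NP1> from <NP2>", "Retire <NP1> de <NP2>")
```

A Translation Memory built this way would store the pattern pair once and fill the slots with translated phrases when a new sentence matches the source pattern.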

Publications

  • Ismael Pascual Nieto and Michael O'Donnell 2007 "Flexible statistical construction of bilingual dictionaries". Proceedings of the XXIII Congreso de la SEPLN, 10-12 September 2007, Seville.
  • César Dante Barragán, Susana Murcia Bielsa, & Mick O'Donnell 2007 "Levels of technicality in product documentation and its realisation in syntactic forms used to express directions and warnings". Paper presented at the 34th International Systemic Functional Congress, Odense, Denmark.