Tutorial Announcement      

Coling 2000 Tutorial Announcement

Pascaline Merten: SGML/XML with NLP in mind


NLP deals with linguistic data. A growing number of text and speech corpora are available, making it possible to carry out empirical and statistical linguistic research, compile terminologies, and develop and test systems. However, the data must be in a format that allows easy use and exchange, regardless of the computer system in use. Translation tools, for their part, are becoming mature. Still, it is not easy to transfer data from one system to another, whether these are terminological data, a translation memory, or the lexical data coded for an MT system. The situation is even worse when one seeks to transfer data between heterogeneous systems, for example between a terminological database and the dictionary of an MT system (or vice versa).

SGML was designed precisely to facilitate the exchange of textual data and to provide a system- and application-independent storage format. HTML is an application of SGML. The complexity of SGML has confined it to the industrial and academic worlds. XML offers a simplified, web-oriented version of SGML. This tutorial will first introduce the problem of exchanging data. It will be argued that the solution to this problem lies in a structured and layout-independent approach. The fundamentals of SGML and XML will be explained, and their syntactic differences will be stressed.
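
As a minimal sketch of what a structured, layout-independent approach can look like in practice, the Python fragment below reads a small XML terminology entry and retrieves its content by element and attribute name rather than by position on the page. The element names (entry, term), the lang attribute, and the sample data are hypothetical and chosen for illustration only; they are not taken from any of the DTDs discussed in this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical terminology entry; element and attribute names are illustrative.
SAMPLE = """<entry id="e1">
  <term lang="en">translation memory</term>
  <term lang="fr">mémoire de traduction</term>
</entry>"""

# Same content on a single line: the layout differs, the structure does not.
SAMPLE_ONE_LINE = (
    '<entry id="e1"><term lang="en">translation memory</term>'
    '<term lang="fr">mémoire de traduction</term></entry>'
)


def terms_by_language(xml_text):
    """Map each language code to its term, using the markup only."""
    root = ET.fromstring(xml_text)
    return {term.get("lang"): term.text for term in root.iter("term")}


if __name__ == "__main__":
    print(terms_by_language(SAMPLE))
    # Both layouts yield the same data because only the markup is consulted.
    print(terms_by_language(SAMPLE) == terms_by_language(SAMPLE_ONE_LINE))
```

Because the program relies on the markup alone, any system able to parse XML can recover the same entry, whatever the line breaks or indentation of the file.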

We will then cover the applications that have been developed in the field of NLP. The TEI Guidelines, published in 1994, provide a powerful and flexible DTD for the description of all types of texts. Various initiatives have supplemented it for corpus markup (CES) or for the exchange of terminological (MARTIF, GENETER) and lexical (OLIF) data. TMX is an XML DTD used for the exchange of translation memories. Other projects aim at the integration of tools and use either XML or SGML as a pivot format (SALT, MULTIDOC). The conclusion is twofold: agreeing on a description scheme does not mean agreeing on the content; but by facilitating the input, storage, and exchange of data, standard markup formats allow researchers to concentrate on what is essential: language.
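
To illustrate how a translation memory stored in an XML exchange format can be read by any tool, the sketch below extracts source/target segment pairs from a heavily simplified, TMX-like fragment. The tu, tuv, and seg element names follow TMX naming as far as we are aware, but the fragment omits the header and other material a real TMX file would carry and has not been validated against the TMX DTD; the sample sentences and the translation_pairs helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, TMX-like fragment (illustrative only, not a valid TMX file).
TMX_LIKE = """<tmx>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def translation_pairs(xml_text, source, target):
    """Return (source, target) segment pairs for the two requested languages."""
    root = ET.fromstring(xml_text)
    pairs = []
    for tu in root.iter("tu"):
        # Collect the segment of each language variant in this translation unit.
        segments = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        if source in segments and target in segments:
            pairs.append((segments[source], segments[target]))
    return pairs


if __name__ == "__main__":
    for src, tgt in translation_pairs(TMX_LIKE, "en", "fr"):
        print(src, "->", tgt)
```

A translation unit that lacks one of the two requested languages is simply skipped, so the same function can be applied to multilingual memories.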

Useful References

Projects & DTDs

CES: http://www.lpl.univ-aix.fr/projects/multext/CES/

EAGLES: http://www.ilc.pi.cnr.it/EAGLES96/home.html

GENETER: http://www.uhb.fr/geneter

MARTIF: http://www.ttt.org

MULTEXT: http://issco-www.unige.ch/projects/MULTEXT.html

MULTIDOC: http://www.iai.uni-sb.de/multidoc/

OTELO: http://www.otelo.lu

PAROLE: http://www.ilc.pi.cnr.it/parole/parole.html

SALT: http://www.ttt.org/salt/

TMX: http://www.lisa.org/tmx/index.html


van Herwijnen (E.), 1990
Practical SGML, Kluwer Academic Publishers, 1990


Sperberg-McQueen (C.M.), Burnard (L.), eds., 1994
Guidelines for Electronic Text Encoding and Interchange (TEI P3).
Chicago/Oxford: Text Encoding Initiative, 1994


Goldfarb (Ch.), Prescod (P.), 1998
The XML Handbook, New York, Prentice Hall PTR, 1998


Duration: 3 hours

The Author

Pascaline Merten, 35
Teacher of computer science and machine-aided translation at the Institut supérieur de traducteurs et interprètes of the Haute Ecole de Bruxelles, 34 rue Hazard, 1180 Brussels; +322-3401280; pmerten@isti.be

I have been a researcher in the fields of machine translation and terminology. I have worked as an SGML/XML consultant for a law publisher and then on projects involving the Belgian and French governments. I am currently working on a PhD on the use of exchange standards in the field of NLP at the Free University of Brussels (ULB).


Prerequisites: nothing more than a well-structured mind and some mathematical and logical aptitude.
