Coling 2000 Tutorial Announcement
NLP deals with linguistic data. A growing number of text and speech corpora are available, which makes it possible to carry out empirical and statistical linguistic research, to work out terminologies, and to develop and test systems. However, the data must be in a format that allows easy use and exchange, regardless of the computer system in use. Meanwhile, translation tools are maturing. Still, it is not easy to transfer data from one system to another, whether the data are terminological data, a translation memory, or the lexical data coded for an MT system. The situation is even worse when one seeks to transfer data between heterogeneous systems: from a terminological database to the dictionary of an MT system, or vice versa.
SGML is designed precisely to facilitate the exchange of textual data and to provide a system- and application-independent storage format. HTML is an application of SGML. The complexity of SGML has confined it to the industrial and academic worlds. XML offers a simplified, web-oriented version of SGML. This tutorial will first introduce the problem of exchanging data. It will be argued that the solution to this problem lies in a structured and layout-independent approach. The fundamentals of SGML and XML will be explained, and their syntactic differences will be stressed.
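To make the structured, layout-independent approach more concrete, here is a minimal Python sketch. It assumes a hypothetical terminology entry whose element and attribute names (`termEntry`, `subject`, `term`, `lang`) are invented for illustration and do not follow any published DTD. It also illustrates one syntactic difference between the two standards: XML requires every element to be explicitly closed, whereas an SGML DTD may allow end tags to be omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical terminology entry: the element and attribute names are
# invented for this illustration and follow no published DTD.
ENTRY = """\
<termEntry id="t1">
  <subject>computing</subject>
  <term lang="en">translation memory</term>
  <term lang="fr">mémoire de traduction</term>
</termEntry>
"""


def terms_by_language(xml_text):
    """Return a {language code: term} mapping for one entry."""
    root = ET.fromstring(xml_text)
    return {t.get("lang"): t.text for t in root.findall("term")}


def render_plain(terms):
    """One possible layout: a single line, sorted by language code."""
    return "; ".join(f"{lang}: {term}" for lang, term in sorted(terms.items()))


def is_well_formed(xml_text):
    """True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


terms = terms_by_language(ENTRY)
print(render_plain(terms))

# Omitted end tags are not accepted by an XML parser.
print(is_well_formed("<p>one<p>two"))
print(is_well_formed("<d><p>one</p><p>two</p></d>"))
```

The markup records what each piece of data is, not how it should look, so `render_plain` is only one of many layouts that could be derived from the same entry.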
We will then cover the applications which have been developed in the field of NLP. The TEI, published in 1994, is a powerful and flexible DTD for describing all types of texts. Various initiatives have supplemented it for corpus markup (CES) and for the exchange of terminological (MARTIF, GENETER) and lexical (OLIF) data. An XML DTD is also used for the exchange of translation memories. Other projects aim at the integration of tools and use either XML or SGML as a pivot format (SALT, MULTIDOC). The conclusion is twofold: agreeing on a description scheme does not mean agreeing on the content, but by facilitating the input, storage and exchange of data, standard markup formats allow researchers to concentrate on what is essential: language.
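As an illustration of what exchanging a translation memory through XML can look like, the following Python sketch exports a list of segment pairs and reads them back. The element names (`memory`, `unit`, `seg`) and the `lang` attribute are assumptions made for this example only; they are not taken from any of the DTDs named above.

```python
import xml.etree.ElementTree as ET


def memory_to_xml(pairs, src="en", tgt="fr"):
    """Serialise (source, target) segment pairs to an XML string.

    The vocabulary used here is hypothetical, not a published format.
    """
    root = ET.Element("memory")
    for source, target in pairs:
        unit = ET.SubElement(root, "unit")
        ET.SubElement(unit, "seg", lang=src).text = source
        ET.SubElement(unit, "seg", lang=tgt).text = target
    return ET.tostring(root, encoding="unicode")


def xml_to_memory(xml_text, src="en", tgt="fr"):
    """Read the segment pairs back from the XML string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.findall("unit"):
        segs = {s.get("lang"): s.text for s in unit.findall("seg")}
        pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = [
    ("Save the file.", "Enregistrez le fichier."),
    ("Close the window.", "Fermez la fenêtre."),
]
exported = memory_to_xml(pairs)
restored = xml_to_memory(exported)
print(restored == pairs)
```

Because the exported text depends only on the agreed element names, any tool that knows them can import the memory, whatever system produced it.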
Projects & DTDs
van Herwijnen, E. (1990). Practical SGML. Kluwer Academic Publishers.
I have been a researcher in machine translation and in terminology. I have worked as an SGML/XML consultant for a law publisher and then on projects involving the Belgian and French governments. I am currently working on a PhD on the use of exchange standards in NLP at the Free University of Brussels (ULB).
Nothing more than a well-structured mind and some mathematical and logical aptitude.