Zajac: Practical development of large computational
This course provides
an in-depth practical introduction to the development of large computational
dictionaries with examples for bilingual dictionaries for machine translation.
It covers issues ranging from the conception of the dictionary schema and of
the structure of lexical entries to the use/development of a lexical toolset
and acquisition team management issues.
The course will cover the following topics:
Overview; Encoding and formats: SGML/XML (esp. TEI), Unicode, flat
formats (e.g. for relational databases), hierarchical formats (e.g
and levels of linguistic knowledge: morphological, syntactic, word-senses,
for lexical acquisition: applications' requirements, depth/breadth
issues, planning for scalability, using resources (MRDs, corpora
and associated tools), training issues.
of a lexical database: structure of a lexical entry, structure of
a dictionary, defining the lexical database schema, defaults and
for lexical acquisition. Corpora: processing raw corpora (e.g. HTML
corpora), building a stemmer for stem/POS extraction, building acquisition
files. MRDs: processing MRDs, building acquisition files from MRDs.
On-line resources: WordNet and others, thesauri and ontologies,
online corpora. Paper dictionaries: as a reference for checking
the dictionary; whether or not to OCR them.
acquisition: primary acquisition tools vs. revision tools; Team
application dictionaries: Generic lexical databases vs. application
dictionaries; Compilation of indexes; Compilation of entries: extracting
and testing: Sampling method; Testing using a tagged corpus; Testing
Researchers and practitioners
in Language Engineering: developers of LE systems and in particular linguists
and lexicographers. The course is aimed at the graduate level in linguistics/lexicography
or computational linguistics.