Reference name of working document: DXLT specification draft 1b
Date: 2000-09-16
SALT project — XML representations of Lexicons and Terminologies
(XLT) — Default XLT Format (DXLT)
Document type: SALT working draft
Document language: en
Warning
This document is not an ISO standard. It is distributed
for review and comment. It is subject to change without notice and may not be
referred to as an International Standard. It is in the format of an
International Standard but is produced by the SALT project. It is derived from
the MSC project of ISO TC/37, which was superseded by the MTF project.
Recipients of this document are
invited to submit, with their comments, notification of any relevant patent rights
of which they are aware and to provide supporting documentation.
Copyright notice
This SALT document is a project draft and is
copyright-protected by SALT, with the Brigham Young University Linguistics
Department Translation Research Group, Provo 84602, USA, as its agent. While
the reproduction of project drafts in any form for use by official participants
in the SALT project is permitted without prior permission from SALT, neither
this document nor any extract from it may be reproduced, stored or transmitted
for any other purpose without prior
written permission from the SALT project.
The SALT project (see, for example,
www.ttt.org/salt/ for more information) is an international project funded by
the European Union and by private industry and supported by a number of
universities and industry groups.
Please send comments on this document to the SALT
project c/o Alan K. Melby:
e-mail: <akm@byu.edu>
telephone number: +1 801 378-2144 (Provo, USA)
Reproduction for sales purposes may be subject
to royalty payments or a licensing agreement. Violators may be prosecuted.
Foreword
Introduction
0.1 Intended audience
0.2 A family of formats
0.3 Distinction between DXLT and other XLT formats
1 Scope
2 Relevant ISO Standards
3 Terms and definitions
4 Requirements for DXLT documents
5 An example of a DXLT document
6 Definition of the core-structure component
6.1 General
6.2 Hierarchical overview
6.3 Text elements, i.e., elements that contain plain, basic or note text
6.4 Meta data categories
7 Definition of the default data-constraint specification (DCS) component
7.1 General
7.2 Systematic listing of XML-element data categories in DXLT
8 Defining user-group subsets
8.1 General
8.2 An example of a user-group DCS file
Annex A Core structure component
A.1 The core-structure DTD for DXLT
A.2 The schema version of the DXLT core structure
Annex B The data-constraint component
B.1 The DCS schema
B.2 The Default DCS file
Annex C Examples
C.1 Low-level encoding (characters, dates, locales, etc) in DXLT
C.2 Representing DXLT data categories in terminological entries
C.3 Encoding guidelines
Annex D Design, application, and context of XLT
D.1 Design principles
D.2 Applications of DXLT and other XLT formats
D.3 Connections between XLT and TMF
Annex E Conformance checking
Bibliography
SALT is the acronym for "Standards-based
Access to Lexicons and Terminologies". The SALT project is working in
co-operation with ISO Technical Committee 37, the LISA OSCAR group, the OLIF2
consortium, the Text Encoding Initiative, the ISLES project, and other entities
with common interests. As the name implies, SALT is based on various existing
standards.
A principal objective of the SALT project is to
facilitate the representation, dissemination, and exchange of highly-structured
information from both human-oriented terminological data collections
(terminologies) and machine-translation lexicons.
XLT, which is being developed within the SALT
project, stands for XML-based formats for Lexicons and Terminologies. It is
anticipated that XLT will (1) support
the merging and extraction of OLIF2 files, (2) provide the basis for the
OSCAR TBX format, and (3), when restricted to the Terminologies side, fall
within the Terminology Markup Framework (TMF) currently being developed by ISO
Technical Committee 37.
This SALT document defines an XML-based
application referred to as the Default XLT Format (DXLT). DXLT is the
primary member of the XLT family of
formats. This document also provides the basis for defining other members of
the XLT family. The intended audience for this document consists of three
groups: (1) programmers and analysts
who desire to develop software applications that process XLT-compliant data
streams, for example, by converting them to
data streams in some other format or by deriving XLT-compliant data streams from some other format; (2) terminologists and other language
specialists who desire to analyze a terminological data collection for
representation in some XLT format, in particular in DXLT, or to define either a
user-group subset of DXLT or some other XLT format; and (3) managers who desire to obtain an
overview of the XLT family and its default format, DXLT.
Each of these three groups should be familiar with
this Introduction. In addition to an understanding of this Introduction,
terminologists and other language specialists need a basic understanding of the
structure of XML documents and the data categories in ISO 12620. Besides having
or obtaining this background information, they should study the body of this
SALT document (sections 1-8) and annexes C and D, but they do not need the ability to write or modify
XML DTDs or schemas. An introduction to the data categories of ISO 12620 is
available through www.ttt.org. Programmers and analysts developing software
applications to process DXLT and other XLT formats must have a thorough
knowledge of XML and familiarity with the entirety of this SALT document and
the various standards on which it is based.
The XLT family of formats is based on various international
standards. The X in XLT stands for XML, indicating that each member of the
XLT family is an XML application. The L
in XLT stands for Lexicons,
indicating that information from human-oriented lexicons and NLP lexicons
(especially machine translation lexicons) can be incorporated into XLT. The NLP
aspect of XLT is based on OLIF (see Otelo project,
http://www.olif.net/olif/OLIF1.html). The T
in XLT stands for Terminologies. The
terminological approach of XLT is based on two ISO standards (ISO 12620 and
12200). ISO 12620 provides an inventory of data categories (i.e., data element
types, often implemented as column names in a table or field names in a
record). ISO 12200, also known as Martif, provides the basis for the core
structure for the family of formats. Thus, XLT is a standards-based family of
formats for representing, manipulating, and sharing terminological data.
Each member of the XLT family differs from others only in which data
categories are allowed and what values they can take. These choices are
represented in a Data Constraint Specification (DCS) file. The following figure
shows how XLT is based on the classic form-content distinction. Each
combination of the core DTD/schema (which defines the structure) and a
particular DCS file (which defines the allowed content) results in a format
that is a member of the XLT family of formats.
[Figure: XLT Family of Formats. Form: the core DTD/schema. Content: DCS 1, DCS 2, … Each combination of the core DTD/schema with one DCS file yields Format 1, Format 2, …, Format n.]
Default-XLT (DXLT) is one member of the XLT family of formats. The DCS file that defines DXLT is naturally called the Default DCS file of XLT. It is anticipated that the data categories in the Default DCS file will suffice for most dissemination and interchange tasks. It is thus expected that most members of the XLT family of formats will be defined using strict subsets of the Default DCS file. However, it is possible that some particular application will require data categories or data-category values not allowed by the Default DCS file. In that case, a DCS file can be defined that is not a subset of the Default DCS file. Subsets of the Default DCS file define "children" of DXLT, and custom DCS files that are not subsets of the Default DCS file define "siblings" of DXLT. XLT is simply the family of formats defined by the XLT core structure and all the various DCS files that combine with it.
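The child/sibling distinction can be illustrated with a minimal sketch. It assumes a DCS file has already been loaded into a simple in-memory mapping from data-category name to allowed values. The mapping layout, the function names, and the sample categories and values are hypothetical illustrations, not the DCS schema of Annex B.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical in-memory view of a DCS file:
# data-category name -> allowed values (None means "any text is allowed").
Dcs = Dict[str, Optional[FrozenSet[str]]]

# Illustrative excerpt only; the real Default DCS file is given in Annex B.
DEFAULT_DCS: Dcs = {
    "definition": None,
    "termType": frozenset({"abbreviation", "acronym", "fullForm"}),
}


def is_subset(candidate: Dcs, default: Dcs) -> bool:
    """True if every category and value allowed by candidate is allowed by default."""
    for category, values in candidate.items():
        if category not in default:
            return False  # category unknown to the default DCS
        default_values = default[category]
        if default_values is None:
            continue  # the default places no restriction on values
        if values is None or not values <= default_values:
            return False  # candidate allows a value the default does not
    return True


def relation_to_dxlt(candidate: Dcs, default: Dcs = DEFAULT_DCS) -> str:
    """Classify the format defined by candidate relative to DXLT."""
    if candidate == default:
        return "DXLT"
    return "child" if is_subset(candidate, default) else "sibling"
```

Under these assumptions, a DCS that keeps only "definition" and restricts "termType" to "acronym" is classified as a child, while a DCS that introduces a category absent from the default excerpt is classified as a sibling.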
The data models underlying terminology resources can be very complex, and therefore XLT formats can also be complex. Complexity is managed by identifying generalizations and breaking down complex objects into simpler modules that can each be understood on its own. The XLT approach abstracts away the structure found in a variety of formats and places it in the core structure module that contains very general data elements such as <descrip> (descriptive information) and <admin> (administrative information). The specialization of the core structure to specific data categories is represented in a DCS file, which may include the data category definition as a particular type of <descrip> element. This allows XLT-aware software to deal with a relatively simple core structure and adapt automatically to various members of the XLT family by consulting a DCS file, which has a very simple structure. Complexity is not magically eliminated, since the logical combination of the core structure and a particular DCS file can indeed be rather complex. But in XLT each of the two modules (form and content) can be dealt with separately, in accordance with basic principles of object-oriented design. No one terminology format can satisfy the needs of all user groups; however, based on experiments to date, most user groups can use the same core structure and accommodate their particular needs using a user-group-specific DCS file.
It is anticipated that the LISA OSCAR TermBase eXchange format (TBX) will be a subset of DXLT. Also, the European Union project called IATE is using an intermediate format (IATE-XLT) that is a subset of DXLT. Any two members of the XLT family are interoperable in so far as their respective DCS files are compatible.
1 Scope
For various types of machine processing,
including transmission over the Internet, terminological data can be
represented using XML. The format defined by this SALT document is an XML
application designed to support machine processing of terminological data in
various computer environments, including standalone computers, the Internet,
and intranets.
The format defined in this SALT document is
designed to represent terminological data in a relatively "blind",
that is, neutralized fashion for purposes of (a) interchange, (b)
dissemination, and (c) data analysis. This SALT document is based on (1) an
XML-compliant core structure compatible with “Negotiated MARTIF” (ISO 12200)
and (2) an XML formalism called the Data Constraint Specification (DCS) schema
for specifying constraints on the core structure. In addition this SALT
document contains one set of constraints, the Default set of constraints,
expressed in that formalism. Each set of constraints specifies (a) which data
categories, primarily from ISO 12620, are allowed as instantiations of the meta
data categories in the core structure, (b) which values the data categories can
take, and (c) at which levels in the core structure data-category elements can
appear. In addition, this set of constraints can de-activate selected modules
and options of the core structure, such as which languages are allowed, whether
certain text markup tags are allowed, and whether particular types of
Complementary Information are allowed in the current family member. The format
defined by the core structure and data-category specification included in this
SALT document is called DXLT (the Default-XLT format).
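The three kinds of constraint listed above, (a) which data categories instantiate a meta data category, (b) which values they can take, and (c) at which levels they can appear, can be pictured as one record per data category. The following is only a sketch: the class, its field names, the level labels ("entry", "language", "term"), and the sample values are assumptions made for illustration and are not taken from the DCS schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CategoryConstraint:
    """One hypothetical DCS entry describing a single data category."""
    name: str                                 # data category, e.g. "definition"
    meta_category: str                        # (a) meta data category it instantiates
    allowed_values: Optional[FrozenSet[str]]  # (b) permitted values; None = any text
    allowed_levels: FrozenSet[str]            # (c) levels at which it can appear

    def permits(self, meta_category: str, value: str, level: str) -> bool:
        """Check one occurrence of this data category against the constraint."""
        if meta_category != self.meta_category:
            return False
        if level not in self.allowed_levels:
            return False
        return self.allowed_values is None or value in self.allowed_values


# Illustrative entries; the levels and values shown are assumptions.
DEFINITION = CategoryConstraint(
    name="definition",
    meta_category="descrip",
    allowed_values=None,
    allowed_levels=frozenset({"entry", "language"}),
)
TERM_TYPE = CategoryConstraint(
    name="termType",
    meta_category="termNote",
    allowed_values=frozenset({"acronym", "fullForm"}),
    allowed_levels=frozenset({"term"}),
)
```

With these sample entries, `DEFINITION.permits("descrip", "a young cat", "entry")` holds, while `TERM_TYPE.permits("termNote", "nickname", "term")` does not, because "nickname" is not among the values allowed in this illustration.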
This SALT document further provides guidelines
for specifying user subsets of DXLT. The specification of a user subset does
not involve modification of any XML DTDs or schemas. Other members of the XLT
family of formats can be defined using the core structure and DCS formalism
included in this document. XLT formats include no recursive XML elements, thus
reducing the processing burden on import routines.
XLT formats are members of the collection of
formats intended to be compliant with the ISO Technical Committee 37 project called TMF
(ISO/CD 16642 – Terminology Markup Framework). XLT is being developed in
parallel with TMF (see Annex D). It is intended that DXLT and its subsets, in
particular, will qualify as Terminology Markup Languages (TMLs) within TMF.
2 Relevant ISO Standards
The following ISO standards are relevant. For dated
references, subsequent amendments to, or revisions of, any of these
publications do not apply. However, parties to agreements based on documents
are encouraged to investigate the possibility of applying the most recent
editions of the standards indicated below. For undated references, the latest
edition of the standard referred to applies. Members of ISO and IEC maintain
registers of currently valid International Standards.
The key ISO standards and projects upon which this document is based
are: (1) ISO/CD 16642 (the TMF project), (2) ISO 12200:1999 (Negotiated MARTIF)
as amended by TC37/SC3 NWI 318, (3) ISO 12620:1999 (Data Categories), (4) ISO
8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to
allow for the definition of XML, and (5) ISO 10646-1 (commonly known as
Unicode).
Expanded list of relevant ISO standards (not including projects which are not yet International Standards):
- ISO 639:1988, Code for the
representation of names of languages.
- ISO 639-2:1998, Codes for the
representation of names of languages – Part 2: Alpha-3 code.
- ISO/IEC 646:1991, Information
technology – ISO 7-bit coded character set for information interchange.
- ISO 1087:1990, Terminology –
Vocabulary.
- ISO 1087-2:1999, Terminology
work – Vocabulary – Part 2: Computer applications.
- ISO 3166-1:1997, Code for the
representation of names of countries and their subdivisions – Part 1: Country
codes.
- ISO 8601:1988, Data elements and
interchange formats – Information interchange – Representation of dates and
times.
- ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N
029:1998-12-06) to allow for XML.
- ISO/IEC 10646-1:1993, Information
technology—Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and basic
multilingual plane.
- ISO 12200 as amended, Computer
applications in terminology – Machine-readable terminology interchange format
(MARTIF) – Negotiated interchange.
- ISO 12620, Terminology – Computer applications
– Data categories.
3 Terms and definitions
For the purposes of this SALT document, the
following terms and definitions apply:
3.1
analysis
identification of the elements and structure of
a terminological data collection so that the data fields, their types, and
their relationships are made explicit
3.2
blindness
property of a data format indicating the degree
to which the data are so rigorously defined that it is unnecessary for the
importer to establish contact with the originator of the data in order to
interpret them
NOTE: The property of blindness is achieved through
the process of neutralization of differences between original formats. The
metaphor behind the term blindness, which has its origin in the engineering
phrase “blind transmission”, is that on the receiving end of a transmission, it
is unnecessary to “see” who is sending the information in order to process it.
Blindness is not an absolute property but is a matter of degree.
3.3
core-structure module
component of a format’s definition that
specifies some elements as meta data categories and indicates which structural
relations are allowed among elements
3.4
data category
result of the specification of a given data
field [ISO 1087-2:2000], (i.e. a type of data field, such as definition)
NOTE: ISO 12620 is an inventory of data categories.
3.5
data stream
a sequence of bytes that correspond to the
contents of a document or file
NOTE: An XML document can be called a “document”, a “file”,
or a “data stream” interchangeably.
3.6
data constraint module
component of a format’s definition that
constrains the core-structure module, e.g., by specifying which data categories
are allowed and how each data category can be used
3.7
dissemination
representation of data in an intermediate
format that allows a wide range of potential users to access and reuse the data
3.8
pre-negotiation
property of an intermediate format indicating
that it is adapted to maximizing the preservation of both content and
structural nuances found in the source data, even at the expense of blindness
NOTE: Pre-negotiation and blindness, although
sometimes at odds with each other, should not be considered antonyms, but
rather choices imposed by the tension between complete neutralization and
complete preservation of information in a data collection.
3.9
interchange
transaction involving exporting data from and
importing data into a terminological data collection where those data are
represented in some intermediate format for the purpose of facilitating access
to the data by computer programs
3.10
meta data category
a name used to group similar data categories
together; thus, a category of data categories
NOTE: Meta data categories in XLT include descrip, admin, and termNote.
3.11
modularity
property of an
electronic format whereby the complexity of the structure and content treated
by the format is managed by defining
sub-components that can be studied separately, side by side, and then logically
combined
NOTE: In XLT, one module defines the core structure
using meta data categories, and the other module specifies constraints on the
core structure module, including which data categories can instantiate each
meta data category.
3.12
metadata registry
description of the
fields in a database for the purpose of facilitating understanding by outside
parties [cf. definition in ISO 11179].
3.13
neutralization
process whereby the differences between the
representation of data elements from various original data collections are reduced
by re-expressing them using the pre-specified structural features, data
categories, and data-category values of an intermediate format
3.14
representation
expression of data content and structural
relationships in an intermediate format outside the environment of the
originating data collection
NOTE: Representation may involve the retention of all
or part of the information from the originating data collection; in addition,
it can involve various degrees of
neutralization and thus tend toward either blindness or pre-negotiation.
3.15
XML™ (eXtensible
Markup Language)
universal format for structured documents and
data on the World Wide Web (WWW); a particular subset of SGML.
NOTE: XSLT is a programming language
specifically designed for manipulating XML documents
4 Requirements for DXLT documents
For an XML document to be considered
DXLT-compliant, it must qualify on three counts: (1) It must be a well-formed
XML document. (Well-formedness is a purely formal XML notion based on such
criteria as all elements being explicitly empty or explicitly terminated and
not overlapping.) (2) It must be valid according to the XLT core-structure
module (described informally in section 6 and defined formally by the XML DTD
in Annex A). (Validity is also a formal XML notion.) (3) It must adhere to the
constraints in the Default data constraint specification (DCS) module or in the
user-defined subset thereof that is currently applicable. These three counts are levels
of conformance to the DXLT specification. Requirements for other members of the
XLT family are similar, the only difference being that the third count requires
adherence to the particular DCS module associated with that family member.
In practice, DXLT documents are typically
created by an export routine in some piece of HLT (Human Language Technology)
software, and they can either be displayed using a tool such as XSLT or be
processed by an import routine that is part of some other piece of HLT
software. So long as the XML documents that are created and processed are DXLT-compliant,
it is not necessary for a human to inspect them and no formal conformance check
is necessary. However, in some circumstances, such as dealing with suspected
data corruption, DXLT-compliance can be checked using DXLT-validation software.
The first two aspects of DXLT-compliance can be
checked by validating the DXLT document against the DTD of the core structure
using a validating XML parser, and the third aspect can be checked using a
custom software application that checks for adherence to the constraints in the
DCS module.
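The layering of the three checks can be sketched in a few lines. This is an illustration under stated assumptions, not DXLT-validation software: the well-formedness check uses the standard-library parser, but the validity check is replaced by a toy set of element names (a real check would validate against the core-structure DTD with a validating XML parser), and the DCS check uses a one-entry stand-in for the Default DCS file. The `<fragment>` wrapper element and all names in the code are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the core-structure DTD: a set of known element names.
# "fragment" is a hypothetical wrapper used only to give the sample a root.
CORE_ELEMENTS = {"fragment", "term", "descrip", "admin", "termNote"}

# Meta data categories whose "type" attribute names a data category.
META_CATEGORIES = {"descrip", "admin", "termNote"}

# Illustrative excerpt standing in for the Default DCS file.
DCS_CATEGORIES = {"descrip": {"definition"}}


def conformance_level(xml_text: str) -> str:
    """Report the first of the three counts on which the text fails."""
    # Count 1: well-formedness.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well-formed"

    # Count 2: validity (approximated here by a check on element names).
    for element in root.iter():
        if element.tag not in CORE_ELEMENTS:
            return "well-formed but not valid"

    # Count 3: adherence to the data constraints.
    for element in root.iter():
        if element.tag in META_CATEGORIES:
            allowed = DCS_CATEGORIES.get(element.tag, set())
            if element.get("type") not in allowed:
                return "valid but not DCS-adherent"

    return "DXLT-compliant"
```

For instance, `conformance_level('<fragment><term>kitten</term><descrip type="conflagration">content</descrip></fragment>')` yields "valid but not DCS-adherent" in this sketch, because "conflagration" is not listed in the stand-in DCS excerpt. Each check runs only if the previous one passed, which mirrors the ordering of the three counts.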
As noted above, it is possible to validate
whether any given well-formed XML data stream is DXLT-compliant. However, this
validation is a formal process and does not ensure that appropriate
terminological methods have been used to create the data or that the content of
the data categories is accurate. Validation may determine, for instance, that
the value of an XML element such as term type is not one of the allowed values,
but validation cannot detect a poorly written definition. See Figure 4.1 for
examples of these distinctions in DXLT. The first part is not well-formed,
since the first <descrip> element has a spelling error in the end tag and
since the second <descrip> has no closing tag at all. The second part is
well-formed but not valid, since the core-structure module of DXLT does not
allow for a <desskrip> tag. The third part conforms to the XLT DTD but
not to the Default DCS of DXLT, since there is no DXLT data category called
"conflagration". The fourth part is valid but not accurate, since a
kitten is not a dog or wolf.
Not well-formed:
<term>kitten</term>
<descrip type='definition'>content</decrip>
<descrip type='definition'>other content
Well-formed but not valid:
<term>kitten</term>
<desskrip type="definition">content</desskrip>
<descrip type="definition">other content</descrip>
Valid but not DCS-adherent:
<term>kitten</term>
<descrip type="conflagration">content</descrip>
Valid and DCS-adherent but not accurate: