Reference name of working document:  DXLT specification draft 1b

Date:  2000-09-16

Source: SALT project

SALT project — XML representations of Lexicons and Terminologies (XLT) — Default XLT Format (DXLT)

 

 

 

 

 

Document type:  SALT working draft

Document language:  en

 

 

 

 

Warning

This document is not an ISO standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard. It is in the format of an International Standard but is produced by the SALT project. It is derived from the MSC project of ISO TC/37, which was superseded by the MTF project.

Recipients of this document are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

 


Copyright notice

This SALT document is a project draft and is copyright-protected by SALT, with the Brigham Young University Linguistics Department Translation Research Group, Provo 84602, USA, as its agent. While the reproduction of project drafts in any form for use by official participants in the SALT project is permitted without prior permission from SALT, neither this document nor any extract from it may be reproduced, stored or transmitted for any other purpose without prior written permission from the SALT project.

The SALT project (see, for example, www.ttt.org/salt/ for more information) is an international project funded by the European Union and by private industry and supported by a number of universities and industry groups.

Please send comments on this document to the SALT project c/o Alan K. Melby:

e-mail: <akm@byu.edu>

telephone number: +1 801 378-2144 (Provo, USA)

 

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement. Violators may be prosecuted.

Contents

Foreword

Introduction

0.1        Intended audience

0.2        A family of formats

0.3        Distinction between DXLT and other XLT formats

1          Scope

2          Relevant ISO standards

3          Terms and definitions

4          Requirements for DXLT documents

5          An example of a DXLT document

6          Definition of the core-structure component

6.1        General

6.2        Hierarchical overview

6.3        Text elements, i.e., elements that contain plain, basic or note text

6.4        Meta data categories

7          Definition of the default data-constraint specification (DCS) component

7.1        General

7.2        Systematic listing of XML-element data categories in DXLT

8          Defining user-group subsets

8.1        General

8.2        An example of a user-group DCS file

Annex A  Core structure component

A.1       The core-structure DTD for DXLT

A.2       The schema version of the DXLT core structure

Annex B  The data-constraint component

B.1       The DCS schema

B.2       The Default DCS file

Annex C  Examples

C.1       Low-level encoding (characters, dates, locales, etc.) in DXLT

C.2       Representing DXLT data categories in terminological entries

C.3       Encoding guidelines

Annex D  Design, application, and context of XLT

D.1       Design principles

D.2       Applications of DXLT and other XLT formats

D.3       Connections between XLT and TMF

Annex E  Conformance checking

Bibliography

 

Foreword

SALT is the acronym for "Standards-based Access to Lexicons and Terminologies". The SALT project is working in co-operation with ISO Technical Committee 37, the LISA OSCAR group, the OLIF2 consortium, the Text Encoding Initiative, the ISLES project, and other entities with common interests. As the name implies, SALT is based on various existing standards.

A principal objective of the SALT project is to facilitate the representation, dissemination, and exchange of highly-structured information from both human-oriented terminological data collections (terminologies) and machine-translation lexicons.

XLT, which is being developed within the SALT project, stands for XML-based formats for Lexicons and Terminologies. It is anticipated that XLT will (1) support  the merging and extraction of OLIF2 files, (2) provide the basis for the OSCAR TBX format, and (3), when restricted to the Terminologies side, fall within the Terminology Markup Framework (TMF) currently being developed by ISO Technical Committee 37.

 


Introduction

0.1    Intended audience

This SALT document defines an XML-based application referred to as the Default XLT Format (DXLT). DXLT is the primary member of the XLT family of formats. This document also provides the basis for defining other members of the XLT family. The intended audience for this document consists of three groups: (1) programmers and analysts who desire to develop software applications that process XLT-compliant data streams, for example, by converting them to data streams in some other format or by deriving XLT-compliant data streams from some other format; (2) terminologists and other language specialists who desire to analyze a terminological data collection for representation in some XLT format, in particular in DXLT, or to define either a user-group subset of DXLT or some other XLT format; and (3) managers who desire to obtain an overview of the XLT family and its default format, DXLT.

Each of these three groups should be familiar with this Introduction. In addition to an understanding of this Introduction, terminologists and other language specialists need a basic understanding of the structure of XML documents and the data categories in ISO 12620. Besides having or obtaining this background information, they should study the body of this SALT document (sections 1-8) and annexes C and D, but they do not need the ability to write or modify XML DTDs or schemas. An introduction to the data categories of ISO 12620 is available through www.ttt.org. Programmers and analysts developing software applications to process DXLT and other XLT formats must have a thorough knowledge of XML and familiarity with the entirety of this SALT document and the various standards on which it is based.

 

0.2        A family of formats

The XLT family of formats is based on various international standards. The X in XLT stands for XML, indicating that each member of the XLT family is an XML application. The L in XLT stands for Lexicons, indicating that information from human-oriented lexicons and NLP lexicons (especially machine translation lexicons) can be incorporated into XLT. The NLP aspect of XLT is based on OLIF (see Otelo project, http://www.olif.net/olif/OLIF1.html). The T in XLT stands for Terminologies. The terminological approach of XLT is based on two ISO standards (ISO 12620 and 12200). ISO 12620 provides an inventory of data categories (i.e., data element types, often implemented as column names in a table or field names in a record). ISO 12200, also known as Martif, provides the basis for the core structure for the family of formats. Thus, XLT is a standards-based family of formats for representing, manipulating, and sharing terminological data.

Each member of the XLT family differs from others only in which data categories are allowed and what values they can take. These choices are represented in a Data Constraint Specification (DCS) file. The following figure shows how XLT is based on the classic form-content distinction. Each combination of the core DTD/schema (which defines the structure) and a particular DCS file (which defines the allowed content) results in a format that is a member of the XLT family of formats.

 

[Figure: XLT Family of Formats. Form (the core DTD/schema) combines with Content (DCS 1, DCS 2) to yield Format 1, Format 2, and so on up to Format n.]

 

0.3        Distinction between DXLT and other XLT formats

Default-XLT (DXLT) is one member of the XLT family of formats. The DCS file that defines DXLT is naturally called the Default DCS file of XLT. It is anticipated that the data categories in the Default DCS file will suffice for most dissemination and interchange tasks. It is thus expected that most members of the XLT family of formats will be defined using strict subsets of the Default DCS file. However, it is possible that some particular application will require data categories or data-category values not allowed by the Default DCS file. In that case, a DCS file can be defined that is not a subset of the Default DCS file. Subsets of the Default DCS file define "children" of DXLT, and custom DCS files that are not subsets of the Default DCS file define "siblings" of DXLT. XLT is simply the family of formats defined by the XLT core structure and all the various DCS files that combine with it.
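The child/sibling distinction can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a DCS file can be reduced to a set of data-category names; the names in DEFAULT_DCS and the function relation_to_dxlt are hypothetical and are not taken from the Default DCS file. A real comparison would also have to consider the allowed data-category values.

```python
# Hypothetical data-category names, NOT the contents of the real Default DCS file.
DEFAULT_DCS = frozenset({"definition", "termType", "partOfSpeech", "subjectField"})


def relation_to_dxlt(dcs):
    """Classify the format defined by `dcs` relative to DXLT.

    `dcs` is a set of data-category names (a simplification that
    ignores data-category values).
    """
    if dcs == DEFAULT_DCS:
        return "DXLT"
    if dcs < DEFAULT_DCS:  # strict subset of the Default DCS
        return "child"
    return "sibling"       # uses something the Default DCS does not allow
```

Under these assumptions, a DCS that keeps only "definition" and "termType" defines a child of DXLT, while one that adds a category unknown to the Default DCS defines a sibling.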

The data models underlying terminology resources can be very complex, and therefore XLT formats can also be complex. Complexity is managed by identifying generalizations and breaking down complex objects into simpler modules that can each be understood on its own. The XLT approach abstracts away the structure found in a variety of formats and places it in the core structure module that contains very general data elements such as <descrip> (descriptive information) and <admin> (administrative information). The specialization of the core structure to specific data categories is represented in a DCS file, which may include the data category definition as a particular type of <descrip> element. This allows XLT-aware software to deal with a relatively simple core structure and adapt automatically to various members of the XLT family by consulting a DCS file, which has a very simple structure. Complexity is not magically eliminated, since the logical combination of the core structure and a particular DCS file can indeed be rather complex. But in XLT each of the two modules (form and content) can be dealt with separately, in accordance with basic principles of object-oriented design. No one terminology format can satisfy the needs of all user groups; however, based on experiments to date, most user groups can use the same core structure and accommodate their particular needs using a user-group-specific DCS file.

It is anticipated that the LISA OSCAR TermBase eXchange format (TBX) will be a subset of DXLT.  Also, the European Union project called IATE is using an intermediate format (IATE-XLT) that is a subset of DXLT.  Any two members of the XLT family are interoperable in so far as their respective DCS files are compatible.


 

SALT project — XML representations of Lexicons and Terminologies (XLT) — Default XLT Format (DXLT)

1     Scope

For various types of machine processing, including transmission over the Internet, terminological data can be represented using XML. The format defined by this SALT document is an XML application designed to support machine processing of terminological data in various computer environments, including standalone computers, the Internet, and intranets.

The format defined in this SALT document is designed to represent terminological data in a relatively "blind", that is, neutralized fashion for purposes of (a) interchange, (b) dissemination, and (c) data analysis. This SALT document is based on (1) an XML-compliant core structure compatible with “Negotiated MARTIF” (ISO 12200) and (2) an XML formalism called the Data Constraint Specification (DCS) schema for specifying constraints on the core structure. In addition this SALT document contains one set of constraints, the Default set of constraints, expressed in that formalism. Each set of constraints specifies (a) which data categories, primarily from ISO 12620, are allowed as instantiations of the meta data categories in the core structure, (b) which values the data categories can take, and (c) at which levels in the core structure data-category elements can appear. In addition, this set of constraints can de-activate selected modules and options of the core structure, such as which languages are allowed, whether certain text markup tags are allowed, and whether particular types of Complementary Information are allowed in the current family member. The format defined by the core structure and data-category specification included in this SALT document is called DXLT (the Default-XLT format).
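The three kinds of constraint, (a) allowed data categories, (b) allowed values, and (c) allowed levels, can be pictured with a small Python sketch. This is not the DCS formalism itself, which is an XML schema; the class CategoryConstraint, the function is_allowed, the level names ("entry", "language", "term"), and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryConstraint:
    meta_category: str         # core-structure element, e.g. "descrip"
    allowed_values: frozenset  # empty set means any text is accepted
    levels: frozenset          # levels at which the element may appear


# Illustrative entries only: names, values, and levels are assumptions.
EXAMPLE_DCS = {
    "definition": CategoryConstraint(
        "descrip", frozenset(), frozenset({"entry", "language"})),
    "termType": CategoryConstraint(
        "termNote", frozenset({"fullForm", "abbreviation"}), frozenset({"term"})),
}


def is_allowed(dcs, category, value, level):
    """Apply checks (a), (b), and (c) to a single data-category element."""
    constraint = dcs.get(category)
    if constraint is None:
        return False  # (a) data category not allowed
    if constraint.allowed_values and value not in constraint.allowed_values:
        return False  # (b) value not allowed
    return level in constraint.levels  # (c) level check
```

In this sketch, is_allowed(EXAMPLE_DCS, "termType", "abbreviation", "term") is accepted, whereas the same category at the "entry" level, or an unlisted value, is rejected.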

This SALT document further provides guidelines for specifying user subsets of DXLT. The specification of a user subset does not involve modification of any XML DTDs or schemas. Other members of the XLT family of formats can be defined using the core structure and DCS formalism included in this document. XLT formats include no recursive XML elements, thus reducing the processing burden on import routines.

XLT formats are members of the collection of formats intended to be compliant with the ISO Technical Committee 37 project called TMF (ISO/CD 16642 – Terminology Markup Framework). XLT is being developed in parallel with TMF (see Annex D). It is intended that DXLT and its subsets, in particular, will qualify as Terminology Markup Languages (TMLs) within TMF.

2     Relevant ISO standards

The following ISO standards are relevant. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on this document are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. For undated references, the latest edition of the standard referred to applies. Members of ISO and IEC maintain registers of currently valid International Standards.

The key ISO standards and projects upon which this document is based are: (1) ISO/CD 16642 (the TMF project), (2) ISO 12200:1999 (Negotiated MARTIF) as amended by TC37/SC3 NWI 318, (3) ISO 12620:1999 (Data Categories), (4) ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for the definition of XML, and (5) ISO 10646-1 (commonly known as Unicode).

Expanded list of relevant ISO standards (not including projects which are not yet International Standards):

- ISO 639:1988, Code for the representation of names of languages.

- ISO 639-2:1998, Codes for the representation of names of languages – Part 2: Alpha-3 code.

- ISO/IEC 646:1991, Information technology – ISO 7-bit coded character set for information interchange.

- ISO 1087:1990, Terminology – Vocabulary.

- ISO 1087-2:2000, Terminology work – Vocabulary – Part 2: Computer applications.

- ISO 3166-1:1997, Code for the representation of names of countries and their subdivisions – Part 1: Country codes

- ISO 8601:1988, Data elements and interchange formats – Information interchange – Representation of dates and times.

- ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for XML.

- ISO/IEC 10646-1:1993, Information technology—Universal Multiple-Octet Coded Character Set  (UCS)—Part 1: Architecture and basic multilingual plane.

- ISO 12200 as amended, Computer applications in terminology – Machine-readable terminology interchange format (MARTIF) – Negotiated interchange.

- ISO 12620, Terminology – Computer applications – Data categories.

 

 

3     Terms and definitions

For the purposes of this SALT document, the following terms and definitions apply:

3.1

analysis

identification of the elements and structure of a terminological data collection so that the data fields, their types, and their relationships are made explicit

3.2

blindness

property of a data format indicating the degree to which the data are so rigorously defined that it is unnecessary for the importer to establish contact with the originator of the data in order to interpret them

NOTE: The property of blindness is achieved through the process of neutralization of differences between original formats. The metaphor behind the term blindness, which has its origin in the engineering phrase “blind transmission”, is that on the receiving end of a transmission, it is unnecessary to “see” who is sending the information in order to process it. Blindness is not an absolute property but is a matter of degree.

3.3

core-structure module

component of a format’s definition that specifies some elements as meta data categories and indicates which structural relations are allowed among elements

 3.4

data category

result of the specification of a given data field [ISO 1087-2:2000], (i.e. a type of data field, such as definition)

NOTE: ISO 12620 is an inventory of data categories.

3.5

data stream

a sequence of bytes that correspond to the contents of a document or file

NOTE: an XML document can be called a “document”, a “file”, or a “data stream” interchangeably

3.6

data constraint module

component of a format’s definition that constrains the core-structure module, e.g., by specifying which data categories are allowed and how each data category can be used

3.7

dissemination

representation of data in an intermediate format that allows a wide range of potential users to access and reuse the data

3.8

pre-negotiation

property of an intermediate format indicating that it is adapted to maximizing the preservation of both content and structural nuances found in the source data, even at the expense of blindness

NOTE: Pre-negotiation and blindness, although sometimes at odds with each other, should not be considered antonyms, but rather choices imposed by the tension between complete neutralization and complete preservation of information in a data collection.

3.9

interchange

transaction involving exporting data from and importing data into a terminological data collection where those data are represented in some intermediate format for the purpose of facilitating access to the data by computer programs

3.10

meta data category

a name used to group similar data categories together; thus, a category of data categories

NOTE: Meta data categories in XLT include descrip, admin and termNote.

3.11

modularity

property of an electronic format whereby the complexity of the structure and content treated by the format is managed by defining sub-components that can be studied separately, side by side, and then logically combined

NOTE: In XLT, one module defines the core structure using meta data categories, and the other module specifies constraints on the core structure module, including which data categories can instantiate each meta data category.

3.12

metadata registry

description of the fields in a database for the purpose of facilitating understanding by outside parties [cf. definition in ISO 11179].

3.13

neutralization

process whereby the differences between the representation of data elements from various original data collections are reduced by re-expressing them using the pre-specified structural features, data categories, and data-category values of an intermediate format

3.14

representation

expression of data content and structural relationships in an intermediate format outside the environment of the originating data collection

NOTE: Representation may involve the retention of all or part of the information from the originating data collection; in addition, it can  involve various degrees of neutralization and thus tend toward either blindness or pre-negotiation.

3.15

XML™ (eXtensible Markup Language)

universal format for structured documents and data on the World Wide Web (WWW); a particular subset of SGML.

NOTE: XSLT is a programming language specifically designed for manipulating XML documents

 

4     Requirements for DXLT documents

For an XML document to be considered DXLT-compliant, it must qualify on three counts: (1) It must be a well-formed XML document. (Well-formedness is a purely formal XML notion based on such criteria as all elements being explicitly empty or explicitly terminated and not overlapping.) (2) It must be valid according to the XLT core-structure module (described informally in section 6 and defined formally by the XML DTD in Annex A). (Validity is also a formal XML notion.) (3) It must adhere to the constraints in the Default data constraint specification (DCS) module or user-defined subset thereof currently applicable. These three counts are levels of conformance to the DXLT specification. Requirements for other members of the XLT family are similar, the only difference being that the third count requires adherence to the particular DCS module associated with that family member.

In practice, DXLT documents are typically created by an export routine in some piece of HLT (Human Language Technology) software, and they can either be displayed using a tool such as XSLT or be processed by an import routine that is part of some other piece of HLT software. So long as the XML documents that are created and processed are DXLT-compliant, it is not necessary for a human to inspect them and no formal conformance check is necessary. However, in some circumstances, such as dealing with suspected data corruption, DXLT-compliance can be checked using DXLT-validation software.

The first two aspects of DXLT-compliance can be checked by validating the DXLT document against the DTD of the core structure using a validating XML parser, and the third aspect can be checked using a custom software application that checks for adherence to the constraints in the DCS module.
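The layering of the three checks can be sketched in Python. This is only an approximation under stated assumptions: the standard library parser used here does not validate against a DTD, so level 2 is replaced by a check of element names against a hypothetical set (CORE_ELEMENTS), and level 3 by a check of type values against a hypothetical table (DCS_TYPES). Neither is taken from Annex A or from the Default DCS file, and the <fragment> wrapper element is likewise an invention that allows short fragments to be parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins, NOT taken from the DXLT DTD or the Default DCS file.
CORE_ELEMENTS = {"fragment", "term", "descrip", "admin", "termNote"}
DCS_TYPES = {"descrip": {"definition"}}


def conformance_level(fragment):
    """Report the first conformance level that a fragment fails."""
    # Level 1: well-formedness. The fragment is wrapped in a dummy root
    # because an XML document must have exactly one root element.
    try:
        root = ET.fromstring("<fragment>" + fragment + "</fragment>")
    except ET.ParseError:
        return "not well-formed"

    # Level 2: stand-in for DTD validation (element names only).
    for element in root.iter():
        if element.tag not in CORE_ELEMENTS:
            return "well-formed but not valid"

    # Level 3: stand-in for the DCS check (allowed type values only).
    for element in root.iter():
        allowed = DCS_TYPES.get(element.tag)
        if allowed is not None and element.get("type") not in allowed:
            return "valid but not DCS-adherent"

    return "valid and DCS-adherent"
```

A production check would use a validating XML parser for the second level, as the text states. No software check can establish the accuracy of the content itself.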

As noted above, it is possible to validate whether any given well-formed XML data stream is DXLT compliant. However, this validation is a formal process and does not ensure that appropriate terminological methods have been used to create the data or that the content of the data categories is accurate. Validation may determine, for instance, that the value of an XML element such as term type is not one of the allowed values, but validation cannot detect a poorly written definition. See Figure 4.1 for examples of these distinctions in DXLT. The first part is not well-formed, since the first <descrip> element has a spelling error in the end tag and since the second <descrip> has no closing tag at all. The second part is well-formed but not valid, since the core-structure module of DXLT does not allow for a <desskrip> tag. The third part conforms to the XLT DTD but not to the Default DCS of DXLT, since there is no DXLT data category called "conflagration". The fourth part is valid but not accurate, since a kitten is not a dog or wolf.

Not well-formed:

   <term>kitten</term>

   <descrip type="definition">content</decrip>

   <descrip type="definition">other content

 

Well-formed but not valid:

   <term>kitten</term>

   <desskrip type="definition">content</desskrip>

   <descrip type="definition">other content</descrip>

 

Valid but not DCS-adherent:

   <term>kitten</term>

   <descrip type="conflagration">content</descrip>

 

Valid and DCS-adherent but not accurate: