Reference name of working document:  DXLT specification draft 1b

Date:  2000-09-16

Source: SALT project

SALT project — XML representations of Lexicons and Terminologies (XLT) — Default XLT Format (DXLT)

 

 

 

 

 

Document type:  SALT working draft

Document language:  en

 

 

 

 

Warning

This document is not an ISO standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard. It is in the format of an International Standard but is produced by the SALT project. It is derived from the MSC project of ISO TC/37, which was superseded by the MTF project.

Recipients of this document are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

 


Copyright notice

This SALT document is a project draft and is copyright-protected by SALT, with the Brigham Young University Linguistics Department Translation Research Group, Provo 84602, USA, as its agent. While the reproduction of project drafts in any form for use by official participants in the SALT project is permitted without prior permission from SALT, neither this document nor any extract from it may be reproduced, stored or transmitted for any other purpose without prior written permission from the SALT project.

The SALT project (see, for example, www.ttt.org/salt/ for more information) is an international project funded by the European Union and by private industry and supported by a number of universities and industry groups.

Please send comments on this document to the SALT project c/o Alan K. Melby:

e-mail: <akm@byu.edu>

telephone number: +1 801 378-2144 (Provo, USA)

 

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement. Violators may be prosecuted.

Contents

Foreword

Introduction

0.1    Intended audience

0.2    A family of formats

0.3    Distinction between DXLT and other XLT formats

1    Scope

2    Relevant ISO Standards

3    Terms and definitions

4    Requirements for DXLT documents

5    An example of a DXLT document

6    Definition of the core-structure component

6.1    General

6.2    Hierarchical overview

6.3    Text elements, i.e., elements that contain plain, basic or note text

6.4    Meta data categories

7    Definition of the default data-constraint specification (DCS) component

7.1    General

7.2    Systematic listing of XML-element data categories in DXLT

8    Defining user-group subsets

8.1    General

8.2    An example of a user-group DCS file

Annex A  Core structure component

A.1    The core-structure DTD for DXLT

A.2    The schema version of the DXLT core structure

Annex B  The data-constraint component

B.1    The DCS schema

B.2    The Default DCS file

Annex C  Examples

C.1    Low-level encoding (characters, dates, locales, etc) in DXLT

C.2    Representing DXLT data categories in terminological entries

C.3    Encoding guidelines

Annex D  Design, application, and context of XLT

D.1    Design principles

D.2    Applications of DXLT and other XLT formats

D.3    Connections between XLT and TMF

Annex E  Conformance checking

Bibliography

 

Foreword

SALT is the acronym for "Standards-based Access to Lexicons and Terminologies". The SALT project is working in co-operation with ISO Technical Committee 37, the LISA OSCAR group, the OLIF2 consortium, the Text Encoding Initiative, the ISLES project, and other entities with common interests. As the name implies, SALT is based on various existing standards.

A principal objective of the SALT project is to facilitate the representation, dissemination, and exchange of highly-structured information from both human-oriented terminological data collections (terminologies) and machine-translation lexicons.

XLT, which is being developed within the SALT project, stands for XML-based formats for Lexicons and Terminologies. It is anticipated that XLT will (1) support  the merging and extraction of OLIF2 files, (2) provide the basis for the OSCAR TBX format, and (3), when restricted to the Terminologies side, fall within the Terminology Markup Framework (TMF) currently being developed by ISO Technical Committee 37.

 


Introduction

0.1    Intended audience

This SALT document defines an XML-based application referred to as the Default XLT Format (DXLT). DXLT is the primary member of the XLT family of formats. This document also provides the basis for defining other members of the XLT family. The intended audience for this document consists of three groups: (1) programmers and analysts who desire to develop software applications that process XLT-compliant data streams, for example, by converting them to data streams in some other format or by deriving XLT-compliant data streams from some other format; (2) terminologists and other language specialists who desire to analyze a terminological data collection for representation in some XLT format, in particular in DXLT, or to define either a user-group subset of DXLT or some other XLT format; and (3) managers who desire to obtain an overview of the XLT family and its default format, DXLT.

Each of these three groups should be familiar with this Introduction. In addition to an understanding of this Introduction, terminologists and other language specialists need a basic understanding of the structure of XML documents and the data categories in ISO 12620. Besides having or obtaining this background information, they should study the body of this SALT document (sections 1-8) and annexes C and D, but they do not need the ability to write or modify XML DTDs or schemas. An introduction to the data categories of ISO 12620 is available through www.ttt.org. Programmers and analysts developing software applications to process DXLT and other XLT formats must have a thorough knowledge of XML and familiarity with the entirety of this SALT document and the various standards on which it is based.

 

0.2        A family of formats

The XLT family of formats is based on various international standards. The X in XLT stands for XML, indicating that each member of the XLT family is an XML application. The L in XLT stands for Lexicons, indicating that information from human-oriented lexicons and NLP lexicons (especially machine translation lexicons) can be incorporated into XLT. The NLP aspect of XLT is based on OLIF (see Otelo project, http://www.olif.net/olif/OLIF1.html). The T in XLT stands for Terminologies. The terminological approach of XLT is based on two ISO standards (ISO 12620 and ISO 12200). ISO 12620 provides an inventory of data categories (i.e., data element types, often implemented as column names in a table or field names in a record). ISO 12200, also known as MARTIF, provides the basis for the core structure for the family of formats. Thus, XLT is a standards-based family of formats for representing, manipulating, and sharing terminological data.

Each member of the XLT family differs from others only in which data categories are allowed and what values they can take. These choices are represented in a Data Constraint Specification (DCS) file. The following figure shows how XLT is based on the classic form-content distinction. Each combination of the core DTD/schema (which defines the structure) and a particular DCS file (which defines the allowed content) results in a format that is a member of the XLT family of formats.

 

[Figure: XLT Family of Formats. Form: the core DTD/schema. Content: DCS 1, DCS 2, and so on. Combining the core DTD/schema with each DCS file yields Format 1, Format 2, ... Format n.]

 

0.3        Distinction between DXLT and other XLT formats

Default-XLT (DXLT) is one member of the XLT family of formats. The DCS file that defines DXLT is naturally called the Default DCS file of XLT. It is anticipated that the data categories in the Default DCS file will suffice for most dissemination and interchange tasks. It is thus expected that most members of the XLT family of formats will be defined using strict subsets of the Default DCS file. However, it is possible that some particular application will require data categories or data-category values not allowed by the Default DCS file. In that case, a DCS file can be defined that is not a subset of the Default DCS file. Subsets of the Default DCS file define "children" of DXLT, and custom DCS files that are not subsets of the Default DCS file define "siblings" of DXLT. XLT is simply the family of formats defined by the XLT core structure and all the various DCS files that combine with it.
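
The child/sibling distinction can be illustrated with a minimal sketch in Python. It assumes, purely for illustration, that a DCS file can be reduced to a mapping from each meta data category to the set of data categories allowed to instantiate it. The dictionaries and the names DEFAULT_DCS, is_subset_dcs, relation_to_dxlt, and "corpusFrequency" are hypothetical and are not taken from the actual Default DCS file. The sketch uses a plain (non-strict) subset test.

```python
# Hypothetical, heavily abbreviated stand-ins for DCS files: each maps a
# meta data category to the set of data categories allowed to instantiate it.
DEFAULT_DCS = {
    "descrip": {"definition", "subjectField", "context"},
    "termNote": {"termType", "partOfSpeech"},
    "admin": {"source"},
}


def is_subset_dcs(candidate, default):
    """True if every data category allowed by candidate is allowed by default."""
    return all(
        categories <= default.get(meta, set())
        for meta, categories in candidate.items()
    )


def relation_to_dxlt(candidate, default=DEFAULT_DCS):
    """Classify a DCS as defining a 'child' or a 'sibling' of DXLT."""
    return "child" if is_subset_dcs(candidate, default) else "sibling"


# A user-group DCS that only removes categories defines a child of DXLT.
user_group_dcs = {"descrip": {"definition"}, "termNote": {"termType"}}
# A DCS that adds a category unknown to the default defines a sibling.
custom_dcs = {"descrip": {"definition", "corpusFrequency"}}

print(relation_to_dxlt(user_group_dcs))
print(relation_to_dxlt(custom_dcs))
```

A real DCS file also constrains data-category values and the levels at which a category may appear, so an actual subset test would have to compare those as well.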

The data models underlying terminology resources can be very complex, and therefore XLT formats can also be complex. Complexity is managed by identifying generalizations and breaking down complex objects into simpler modules that can each be understood on its own. The XLT approach abstracts away the structure found in a variety of formats and places it in the core structure module that contains very general data elements such as <descrip> (descriptive information) and <admin> (administrative information). The specialization of the core structure to specific data categories is represented in a DCS file, which may include the data category definition as a particular type of <descrip> element. This allows XLT-aware software to deal with a relatively simple core structure and adapt automatically to various members of the XLT family by consulting a DCS file, which has a very simple structure. Complexity is not magically eliminated, since the logical combination of the core structure and a particular DCS file can indeed be rather complex. But in XLT each of the two modules (form and content) can be dealt with separately, in accordance with basic principles of object-oriented design. No one terminology format can satisfy the needs of all user groups; however, based on experiments to date, most user groups can use the same core structure and accommodate their particular needs using a user-group-specific DCS file.

It is anticipated that the LISA OSCAR TermBase eXchange format (TBX) will be a subset of DXLT.  Also, the European Union project called IATE is using an intermediate format (IATE-XLT) that is a subset of DXLT.  Any two members of the XLT family are interoperable in so far as their respective DCS files are compatible.


 

SALT project — XML representations of Lexicons and Terminologies (XLT) — Default XLT Format (DXLT)

1     Scope

For various types of machine processing, including transmission over the Internet, terminological data can be represented using XML. The format defined by this SALT document is an XML application designed to support machine processing of terminological data in various computer environments, including standalone computers, the Internet, and intranets.

The format defined in this SALT document is designed to represent terminological data in a relatively "blind", that is, neutralized fashion for purposes of (a) interchange, (b) dissemination, and (c) data analysis. This SALT document is based on (1) an XML-compliant core structure compatible with “Negotiated MARTIF” (ISO 12200) and (2) an XML formalism called the Data Constraint Specification (DCS) schema for specifying constraints on the core structure. In addition this SALT document contains one set of constraints, the Default set of constraints, expressed in that formalism. Each set of constraints specifies (a) which data categories, primarily from ISO 12620, are allowed as instantiations of the meta data categories in the core structure, (b) which values the data categories can take, and (c) at which levels in the core structure data-category elements can appear. In addition, this set of constraints can de-activate selected modules and options of the core structure, such as which languages are allowed, whether certain text markup tags are allowed, and whether particular types of Complementary Information are allowed in the current family member. The format defined by the core structure and data-category specification included in this SALT document is called DXLT (the Default-XLT format).
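
The three kinds of constraint listed above, (a) allowed data categories, (b) allowed values, and (c) allowed levels, can be pictured with the following minimal Python sketch. It is not the DCS formalism itself, which is an XML formalism. The class CategoryConstraint, the function check, the value "abbreviation", and the level assignments are assumptions made for illustration, loosely modelled on the example in section 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryConstraint:
    meta: str          # meta data category element, e.g. "termNote"
    name: str          # data category, i.e. the value of the type attribute
    values: frozenset  # permitted values; an empty set means free text
    levels: frozenset  # core-structure levels where the element may appear


# Hypothetical constraint set; the real Default DCS is far larger.
CONSTRAINTS = [
    CategoryConstraint("descrip", "subjectField",
                       frozenset(), frozenset({"termEntry"})),
    CategoryConstraint("descrip", "definition",
                       frozenset(), frozenset({"termEntry", "langSet", "tig"})),
    CategoryConstraint("termNote", "termType",
                       frozenset({"fullForm", "abbreviation"}), frozenset({"tig"})),
]


def check(meta, name, value, level, constraints=CONSTRAINTS):
    """Return a list of problems; an empty list means the element adheres."""
    for c in constraints:
        if (c.meta, c.name) == (meta, name):
            problems = []
            if c.values and value not in c.values:
                problems.append(f"value {value!r} not allowed for {name}")
            if level not in c.levels:
                problems.append(f"{name} not allowed at level {level}")
            return problems
    return [f"data category {name!r} not allowed for <{meta}>"]


print(check("termNote", "termType", "fullForm", "tig"))
print(check("termNote", "termType", "fullForm", "termEntry"))
print(check("descrip", "conflagration", "content", "termEntry"))
```

Keeping the constraints in data rather than in code mirrors the point that a user subset can be specified without modifying any DTD or schema.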

This SALT document further provides guidelines for specifying user subsets of DXLT. The specification of a user subset does not involve modification of any XML DTDs or schemas. Other members of the XLT family of formats can be defined using the core structure and DCS formalism included in this document. XLT formats include no recursive XML elements, thus reducing the processing burden on import routines.

XLT formats are members of the collection of formats intended to be compliant with the ISO Technical Committee 37 project called TMF (ISO/CD 16642 – Terminology Markup Framework). XLT is being developed in parallel with TMF (see Annex D). It is intended that DXLT and its subsets, in particular, will qualify as Terminology Markup Languages (TMLs) within TMF.

2     Relevant ISO standards

The following ISO standards are relevant to this SALT document. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on this document are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. For undated references, the latest edition of the standard referred to applies. Members of ISO and IEC maintain registers of currently valid International Standards.

The key ISO standards and projects upon which this document is based are: (1) ISO/CD 16642 (the TMF project), (2) ISO 12200:1999 (Negotiated MARTIF) as amended by TC37/SC3 NWI 318, (3) ISO 12620:1999 (Data Categories), (4) ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for the definition of XML, and (5) ISO 10646-1 (commonly known as Unicode).

Expanded list of relevant ISO standards (not including projects which are not yet International Standards):

- ISO 639:1988, Code for the representation of names of languages.

- ISO 639-2:1998, Codes for the representation of names of languages – Part 2: Alpha-3 code.

- ISO/IEC 646:1991, Information technology – ISO 7-bit coded character set for information interchange.

- ISO 1087:1990, Terminology – Vocabulary.

- ISO 1087-2:2000, Terminology work – Vocabulary – Part 2: Computer applications.

- ISO 3166-1:1997, Codes for the representation of names of countries and their subdivisions – Part 1: Country codes.

- ISO 8601:1988, Data elements and interchange formats – Information interchange – Representation of dates and times.

- ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for XML.

- ISO/IEC 10646-1:1993, Information technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and basic multilingual plane.

- ISO 12200 as amended, Computer applications in terminology – Machine-readable terminology interchange format (MARTIF) – Negotiated interchange.

- ISO 12620, Terminology – Computer applications – Data categories.

 

 

3     Terms and definitions

For the purposes of this SALT document, the following terms and definitions apply:

3.1

analysis

identification of the elements and structure of a terminological data collection so that the data fields, their types, and their relationships are made explicit

3.2

blindness

property of a data format indicating the degree to which the data are so rigorously defined that it is unnecessary for the importer to establish contact with the originator of the data in order to interpret them

NOTE: The property of blindness is achieved through the process of neutralization of differences between original formats. The metaphor behind the term blindness, which has its origin in the engineering phrase “blind transmission”, is that on the receiving end of a transmission, it is unnecessary to “see” who is sending the information in order to process it. Blindness is not an absolute property but is a matter of degree.

3.3

core-structure module

component of a format’s definition that specifies some elements as meta data categories and indicates which structural relations are allowed among elements

 3.4

data category

result of the specification of a given data field [ISO 1087-2:2000], (i.e. a type of data field, such as definition)

NOTE: ISO 12620 is an inventory of data categories.

3.5

data stream

a sequence of bytes that correspond to the contents of a document or file

NOTE: An XML document can be called a “document”, a “file”, or a “data stream” interchangeably.

3.6

data constraint module

component of a format’s definition that constrains the core-structure module, e.g., by specifying which data categories are allowed and how each data category can be used

3.7

dissemination

representation of data in an intermediate format that allows a wide range of potential users to access and reuse the data

3.8

pre-negotiation

property of an intermediate format indicating that it is adapted to maximizing the preservation of both content and structural nuances found in the source data, even at the expense of blindness

NOTE: Pre-negotiation and blindness, although sometimes at odds with each other, should not be considered antonyms, but rather choices imposed by the tension between complete neutralization and complete preservation of information in a data collection.

3.9

interchange

transaction involving exporting data from and importing data into a terminological data collection where those data are represented in some intermediate format for the purpose of facilitating access to the data by computer programs

3.10

meta data category

a name used to group similar data categories together; thus, a category of data categories

NOTE: Meta data categories in XLT include descrip, admin, and termNote.

3.11

modularity

property of an electronic format whereby the complexity of the structure and content treated by the format is managed by defining sub-components that can be studied separately, side by side, and then logically combined

NOTE: In XLT, one module defines the core structure using meta data categories, and the other module specifies constraints on the core structure module, including which data categories can instantiate each meta data category.

3.12

metadata registry

description of the fields in a database for the purpose of facilitating understanding by outside parties [cf. definition in ISO 11179].

3.13

neutralization

process whereby the differences between the representation of data elements from various original data collections are reduced by re-expressing them using the pre-specified structural features, data categories, and data-category values of an intermediate format

3.14

representation

expression of data content and structural relationships in an intermediate format outside the environment of the originating data collection

NOTE: Representation may involve the retention of all or part of the information from the originating data collection; in addition, it can  involve various degrees of neutralization and thus tend toward either blindness or pre-negotiation.

3.15

XML™ (eXtensible Markup Language)

universal format for structured documents and data on the World Wide Web (WWW); a particular subset of SGML.

NOTE: XSLT is a programming language specifically designed for manipulating XML documents.

 

4     Requirements for DXLT documents

For an XML document to be considered DXLT-compliant, it must qualify on three counts: (1) It must be a well-formed XML document. (Well-formedness is a purely formal XML notion based on such criteria as all elements being explicitly empty or explicitly terminated and not overlapping.) (2) It must be valid according to the XLT core-structure module (described informally in section 6 and defined formally by the XML DTD in Annex A). (Validity is also a formal XML notion.) (3) It must adhere to the constraints in the Default data constraint specification (DCS) module or user-defined subset thereof currently applicable. These three counts are levels of conformance to the DXLT specification. Requirements for other members of the XLT family are similar, the only difference being that the third count requires adherence to the particular DCS module associated with that family member.

In practice, DXLT documents are typically created by an export routine in some piece of HLT (Human Language Technology) software, and they can either be displayed using a tool such as XSLT or be processed by an import routine that is part of some other piece of HLT software. So long as the XML documents that are created and processed are DXLT-compliant, it is not necessary for a human to inspect them and no formal conformance check is necessary. However, in some circumstances, such as dealing with suspected data corruption, DXLT-compliance can be checked using DXLT-validation software.

The first two aspects of DXLT-compliance can be checked by validating the DXLT document against the DTD of the core structure using a validating XML parser, and the third aspect can be checked using a custom software application that checks for adherence to the constraints in the DCS module.

As noted above, it is possible to validate whether any given well-formed XML data stream is DXLT compliant. However, this validation is a formal process and does not ensure that appropriate terminological methods have been used to create the data or that the content of the data categories is accurate. Validation may determine, for instance, that the value of an XML element such as term type is not one of the allowed values, but validation cannot detect a poorly written definition. See Figure 4.1 for examples of these distinctions in DXLT. The first part is not well-formed, since the first <descrip> element has a spelling error in the end tag and since the second <descrip> has no closing tag at all. The second part is well-formed but not valid, since the core-structure module of DXLT does not allow for a <desskrip> tag. The third part conforms to the XLT DTD but not to the Default DCS of DXLT, since there is no DXLT data category called "conflagration". The fourth part is valid but not accurate, since a kitten is not a dog or wolf.

Not well-formed:

   <term>kitten</term>

   <descrip type="definition">content</decrip>

   <descrip type="definition">other content

 

Well-formed but not valid:

   <term>kitten</term>

   <desskrip type="definition">content</desskrip>

   <descrip type="definition">other content</descrip>

 

Valid but not DCS-adherent:

   <term>kitten</term>

   <descrip type="conflagration">content</descrip>

 

Valid and DCS-adherent but not accurate:

   <term>kitten</term>

   <descrip type="definition">a  young dog (canis lupus)</descrip>

Figure 4.1 — Well-formedness, validity, adherence, and accuracy
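
The distinctions in Figure 4.1 can be approximated in code. The following Python sketch uses only the standard library, whose parser checks well-formedness but does not validate against a DTD. For that reason the validity step is replaced by a simple element-name check, which is only a stand-in for a validating XML parser. The names CORE_ELEMENTS, DESCRIP_TYPES, classify, and the wrapper element "entry" are assumptions introduced for illustration, and the two sets are tiny excerpts rather than the real core structure or Default DCS.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated stand-ins for the core structure and Default DCS.
CORE_ELEMENTS = {"entry", "term", "descrip"}
DESCRIP_TYPES = {"definition", "subjectField"}


def classify(fragment):
    """Report the first conformance level that an XML fragment fails."""
    try:
        # Wrap the fragment so that it has a single root element.
        root = ET.fromstring(f"<entry>{fragment}</entry>")
    except ET.ParseError:
        return "not well-formed"
    # Stand-in for DTD validation: every element name must be known.
    if any(el.tag not in CORE_ELEMENTS for el in root.iter()):
        return "well-formed but not valid"
    # Stand-in for the DCS check: every descrip type must be allowed.
    if any(el.get("type") not in DESCRIP_TYPES for el in root.iter("descrip")):
        return "valid but not DCS-adherent"
    return "valid and DCS-adherent"


fragments = [
    '<term>kitten</term><descrip type="definition">content</decrip>',
    '<term>kitten</term><desskrip type="definition">content</desskrip>',
    '<term>kitten</term><descrip type="conflagration">content</descrip>',
    '<term>kitten</term><descrip type="definition">a young dog</descrip>',
]
for fragment in fragments:
    print(classify(fragment))
```

The last fragment passes every formal check even though its definition is wrong, which is the point made above: accuracy is outside the reach of validation.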

 

5          An example of a DXLT document

The following is an example of a simple but complete DXLT document. The numbers in square brackets to the left of certain lines are not part of the DXLT document. They serve as footnote numbers to the comments below.

[1]    <?xml version='1.0'?>
       <!DOCTYPE martif PUBLIC "ISO 12200:1999A//DTD MARTIF core (XLTcdV04)//EN">
[2]    <martif type='DXLT' lang='en'>
[3]      <martifHeader>
           <fileDesc><sourceDesc><p>from an Oracle corporation termBase</p></sourceDesc></fileDesc>
           <encodingDesc><p type='DCSName'>DXLTdV04</p></encodingDesc>
         </martifHeader>
[4]      <text> <body>
[5]        <termEntry id='ID67'>
[6]          <descrip type='subjectField'>manufacturing</descrip>
[7]          <descrip type='definition'>A value between 0 and 1 used in …</descrip>
[8]          <langSet lang='en'>
[9]            <tig>
                 <term>alpha smoothing factor</term>
[10]             <termNote type='termType'>fullForm</termNote>
[11]           </tig>
[12]         </langSet>
[13]         <langSet lang='hu'>
[14]           <tig><term>Alfa sim&#x00ED;t&#x00E1;si t&#x00E9;nyez&#x00F5;</term></tig>
[15]         </langSet>
[16]       </termEntry>
[17]     </body> </text>
[18]   </martif>

 

Only a minimal acquaintance with XML is assumed in the following explanation. Indeed an acquaintance with HTML from building simple web pages, along with the knowledge that XML allows user-defined tag names whereas HTML comes with a set of pre-defined tag names, should be sufficient to allow understanding of the following explanation. For key DXLT elements, the correspondence to the structural component of the meta-model in the ISO TC 37 TMF project is given.

[1]   <?xml ... : These lines state that the following lines constitute an XML document that conforms to version 1.0 of the definition of XML by the World Wide Web Consortium (W3C) and to the DXLT DTD.

[2]   <martif ...: This line states that this particular XML document is a DXLT document and thus, along with other members of the XLT family, can be validated against a specification of the XLT core structure, which, for this document, is called XLTcdV04, and can be checked for adherence against the master Default DCS module. The lang attribute indicates that the default language for text in this document is English (ISO 639 code 'en').

[3]   <martifHeader ...: These lines provide global information about the collection: specifically, a file description indicating that the example was derived from an entry in a termbase used at Oracle corporation, and an encoding description indicating that the DXLT DCS (DXLTdV04, 'd' for DCS), not to be confused with the XLT core DTD (XLTcdV04, 'cd' for core DTD), is being used.

[4]   <text> <body>: The text element surrounds the body element, which contains the collection of concept-oriented "Terminological Entry" (<termEntry>) elements.

[5]   <termEntry ...: Each termEntry element is one instance of the "Terminological Entry" object class. The id attribute has a value that is unique throughout the document, making it possible for other elements to point unambiguously to this element.

[6]   <descrip type='subjectField' ...: The subject field data category is authorized by the DCS (Data Constraint Specification) mentioned above. It consists of a meta data category element (descrip) with the specific data category indicated in the value of the type attribute.

[7]   <descrip type='definition' ...: This piece of descriptive information is also associated with the concept.

[8]   <langSet lang='en'>: The langSet element corresponds to a "Language Section" object class, according to which a Terminological Entry consists of associated information and language sections. This line begins the English Language Section.

[9]   <tig><term> ...: The meta-model states that a Language Section consists of instances of a "Term Section" object class, which, in DXLT corresponds to a <tig> (or <ntig>) element. An instance of a Term Section consists of a term and associated information, which in this case is the term type. The name tig stands for term information group.

[10] <termNote type='termType' ...: This piece of descriptive information associated with the term is the 12620 data category "term type". Its value is "fullForm". A termNote tag is used instead of descrip since the information is closely associated with the term itself rather than the concept being described.

[11] </tig>: This element simply ends the current Term Section.

[12] </langSet>: This element ends the English Language Section.

[13] <langSet lang='hu'>: This element begins the Hungarian Language Section.

[14]     <tig> ...: This line consists of a Term Section with a Hungarian term but no definition and no explicit term type. Each character of the term that is not found in ISO 646 is represented as a hex character reference corresponding directly to a Unicode character. The actual Hungarian term is "Alfa simítási tényező". Note that the final character is an o-double-acute ("ő"), which is represented by the Unicode hex character reference "&#x0151;". Because this character is not available in a typical font, it is sometimes displayed as an o-tilde ("õ") instead. In XML, a Unicode hex character reference consists of "&#x" + the hex digits of the character's code point in the Unicode standard (four digits in this example) + a semicolon.

[15] </langSet>: This element ends the Hungarian Language Section.

[16] </termEntry>: This element ends the current Terminological Entry.

[17] </body> </text>: These elements end the set of terminological entries, which in this case consist of only one entry, and the XLT text element, which is the composite of terminological entries and other resources called Complementary Information in the meta-model. In this DXLT document, there are no resources outside the terminological entry. If there were, they would be in the XLT element back.

[18] </martif>: This element ends the entire DXLT document.
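The character-reference convention described in item [14] can be sketched in Python as follows. This is only an illustration and not part of DXLT: the function name to_hex_char_refs is hypothetical, ISO 646 is approximated here by 7-bit ASCII, and the escaping of markup characters such as "&" and "<" is deliberately left out.

```python
def to_hex_char_refs(text: str) -> str:
    """Replace each non-ASCII character with an XML hex character reference.

    Assumption: ISO 646 is treated as 7-bit ASCII (code points below 128).
    """
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
        else:
            # "&#x" + hex digits of the Unicode code point + ";"
            pieces.append("&#x{:04X};".format(ord(ch)))
    return "".join(pieces)


# The Hungarian term from the sample entry, written with Python escapes.
term = "Alfa sim\u00edt\u00e1si t\u00e9nyez\u0151"
encoded = to_hex_char_refs(term)
print(encoded)  # Alfa sim&#x00ED;t&#x00E1;si t&#x00E9;nyez&#x0151;
```

Any XML parser turns these references back into the original characters when the document is read, so no separate decoding step is needed on import.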

 

This sample DXLT entry has several properties:

1.    It corresponds directly to the meta-model in the TMF project.

2.    It is a well-formed XML document.

3.    It conforms to DXLT, by being well-formed as well as valid according to the core structure, and by adhering to the master data constraint specification (DCS) module of DXLT.

 

6     Definition of the core-structure module

6.1    General

This section defines the core structure of XLT informally. It is intended particularly for a human analyst who is seeking either to understand an XLT document or to analyze source or target terminological data in order to prepare a mapping. A programmer can then use that mapping to write an automatic conversion routine from the source format to, for example, DXLT, or from DXLT to the target format.

 

6.2    Hierarchical overview

The highest-level XML element in an XLT document is the "martif" element, which consists of a "martifHeader" element and a "text" element. (See Figure 6.1.)

 

The text element in Figure 6.1 consists of terminological entries (that together make up the XLT body element) and "Complementary Information" (a meta-model object class) that are found in the front and back elements.

The martifHeader element corresponds to "Global Information" in the meta-model and consists of a description of the whole terminological data collection (in the fileDesc element), information about the data-category specification and character encoding (in the encodingDesc element), and a history of major revisions to the collection (in the revisionDesc element).

A question mark after an element in the box-and-line diagrams below indicates that it is optional.

See Annex A for more detail on these elements.

 

Figure 6.1 — The highest-level elements

Each terminological concept entry in the body element is called a termEntry (see Figure 6.2) and follows the structure of the meta-model.

 

The "auxInfo" element in Figure 6.2 corresponds to "Terminology-related Information" in the meta-model, and each piece of terminology-related information can be associated with any one of three levels: the Terminological Entry level (termEntry in XLT, i.e. the concept level), the Language-section level (langSet in XLT), and the Term-section level (ntig, or its simplified version, tig, in XLT). The termNote and termNoteGrp elements at the Term-section level are also part of Terminology-related Information in the meta-model and consist of term-related descriptive elements that can only appear at the Term-section level and below. The termCompList element corresponds to the "Term Component Section" object class of the meta-model.

 

(Figure 6.2 column labels: entry-level, language-level, term-level)

Figure 6.2 — The structure of a terminological entry in body

In XLT, auxInfo consists of any combination of the following elements:

descrip, descripGrp, admin, adminGrp, transacGrp, note, ref, and xref.

A ref element is a cross-reference that points somewhere inside the martif element. An xref element is a cross-reference that points to an external object using a URI (a URL or other Web address). A note element, as expected, is a note. These three elements appear at various levels to allow the creation of links and the recording of supplementary information.

A transacGrp element gives information about a transaction. ISO 12620 (A.10.2) states that the two terminology management functions concerning a transaction are date and responsibility. A date is specified by a date element, and a responsibility is specified by a transacNote element. Thus, a transacGrp contains a transac element that describes the transaction, accompanied by any combination of transacNote, date, note, ref, and xref elements that apply to the transaction. Any date in XLT must appear within a transacGrp, even if an implicit transaction must be made explicit.
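The rule that every date must appear within a transacGrp can be checked mechanically. The following Python sketch is one possible way to do so, under stated assumptions: the function name stray_dates is hypothetical, "within" is interpreted as "direct child of", and the sample fragment (including its date format) is invented for illustration and is not a complete DXLT document.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<termEntry id='eid-1'>
  <transacGrp>
    <transac type='transactionType'>creation</transac>
    <date>2000-09-16</date>
  </transacGrp>
  <langSet lang='en'>
    <date>2000-09-16</date>
  </langSet>
</termEntry>"""


def stray_dates(root):
    """Return the date elements whose parent is not a transacGrp."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        d for d in root.iter("date")
        if d not in parent_of or parent_of[d].tag != "transacGrp"
    ]


root = ET.fromstring(SAMPLE)
stray = stray_dates(root)
print(len(stray))  # the date placed directly in the langSet is reported
```

A converter could use such a check to decide where an implicit transaction has to be made explicit by wrapping the date in a new transacGrp.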

An adminGrp element is similar to a transacGrp in that it groups information pertaining to another element, in this case an admin rather than a transac, specifically, a combination of adminNote, note, ref, and xref. An admin is a simplified adminGrp in which there is just a single admin element and the adminGrp container has been omitted.

A descripGrp element consists of a descrip element followed by any combination of descripNote, admin, adminGrp, transacGrp, note, ref, and xref elements.

The descrip and admin elements are examples of meta data categories in XLT. Each instance of a meta data category in XLT is an element that is specialized by the value of its type attribute. The various instantiations of the meta data categories are given in section 7.  The DXLT DCS file restricts each instantiation of a descrip to certain levels.

A termNoteGrp element, like other …Grp elements, consists of a base element, in this case a termNote, and auxiliary information, in this case, admin, adminGrp, transacGrp, note, ref, and xref elements. A comparison with descripGrp shows that the difference is that there are no descrip elements in a termNoteGrp.  This is because descrip contains concept-related data categories that do not apply to the term itself.

A termCompList element shows the internal composition of a term and consists of a combination of termCompGrp and, in the simplified case, termComp elements. A termCompGrp, consistent with the pattern set by other ...Grp elements, consists of a termComp element and a combination of termNote, termNoteGrp, admin, adminGrp, transacGrp, note, ref, and xref elements that apply to it. Each termComp element contains some component of a term, such as one of the words of which it is composed.

6.3       Text elements, i.e., elements that contain plain, basic or note text

In XLT, elements such as descrip, descripNote, admin, adminNote, transac, transacNote, termNote, note, ref, and xref contain text. Sometimes, the permissible values of the element are restricted to a picklist. In other cases, the element can contain free text. There are three types of free text in XLT: plain, basic, and note. Plain text (#PCDATA) is defined by the XML specification. It contains no elements, only characters and character entities. Basic text is plain text with the addition of optional embedded hi elements. A hi element highlights a segment of text and optionally points to another element. One use of hi is to mark an entailed term inside a definition. A term element contains basic text. Note text, which is used in definitions, contextual examples, and similar elements, allows the following additional embedded elements besides hi: foreign, bpt, ept, it, ph, and ut. The foreign element is used to mark a segment of text that is in a different language from the surrounding text, e.g. "a <foreign lang='fr'> pamplemousse </foreign> is a grapefruit."
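The three types of free text differ only in which embedded elements they allow, so the narrowest type that fits a given element can be determined from its descendants. The Python sketch below illustrates this. The function name text_type and the sample fragments are hypothetical; the type names plainText, basicText, and noteText follow the TextType column used in section 7.

```python
import xml.etree.ElementTree as ET

BASIC_TEXT_ELEMENTS = {"hi"}
NOTE_TEXT_ELEMENTS = BASIC_TEXT_ELEMENTS | {"foreign", "bpt", "ept", "it", "ph", "ut"}


def text_type(elem):
    """Return the narrowest text type that fits the content of elem."""
    embedded = {d.tag for d in elem.iter() if d is not elem}
    if not embedded:
        return "plainText"
    if embedded <= BASIC_TEXT_ELEMENTS:
        return "basicText"
    if embedded <= NOTE_TEXT_ELEMENTS:
        return "noteText"
    return None  # contains elements that no XLT text type allows


plain = ET.fromstring("<term>hunting dog</term>")
basic = ET.fromstring("<term>hunting <hi>dog</hi></term>")
note = ET.fromstring(
    "<descrip type='definition'>a <foreign lang='fr'>pamplemousse</foreign>"
    " is a grapefruit</descrip>"
)
print(text_type(plain), text_type(basic), text_type(note))
```

Because plain text is also valid basic text, and basic text is also valid note text, a checker would accept an element whenever the narrowest type found is no wider than the type its data category allows.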

The five elements, bpt, ept, it, ph, and ut, are meta-markup tags that are used to mark up (i.e., encapsulate) markup to distinguish it from text. They allow XLT elements to contain various kinds of markup that needs to be retained but not necessarily processed during terminology management functions. Any such enclosed markup is modified so that start-tag characters ('<') become entities (&lt;) and ampersands become entities (&amp;). If a piece of markup to be encapsulated consists of two paired pieces of markup, such as the markup used to show that a piece of text is to be in bold or italics, then bpt and ept (begin and end paired tags) are used.  If the markup to be encapsulated consists of one piece that would be paired except that the other piece was cut off and appears outside the current element, then an it (isolated tag) is used. If the piece of markup to be encapsulated stands on its own, marking a place such as a footnote, then ph (placeholder) is used.  If the categorization of the piece of markup is unknown, then ut (unknown tag) is used.

Suppose one has the following segment of text to put into an XML element in XLT:

"We need a  big  dog."

 

The marked-up text underlying this presentation might be:

"We need a <bold> big </bold> dog."

 

This is not a problem for meta-markup tags. One can put it into an XLT element as follows:

"We need a <bpt i='1'>&lt;bold></bpt> big <ept i='1'>&lt;/bold></ept> dog."

 

Then one can get the original segment back by taking out the meta-markup tags and converting any "&lt;" inside a meta-markup tag back to "<".

 

Now consider the following segment (that uses SGML markup):

"We need a  big but < 50 pound  dog", which might have the following underlying SGML markup:

"We need a <bold> big but &lt; 50 pound </bold> dog"

(i.e. a "big but less-than-fifty-pound dog" in which the less-than sign "<" has already been converted to an SGML entity in the source segment before placing it into XLT, since in this case the less-than sign is a literal rather than an escape character).

 

One would put it into an XLT segment as follows:

We need a <bpt i='1'>&lt;bold></bpt> big but &amp;lt; 50 pound <ept i='1'>&lt;/bold></ept> dog.

Then, when we try to reconstruct the original segment, we will get what we started with, since the &amp; will be converted back to an ampersand.
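The encapsulation and reconstruction steps walked through above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the helper names escape_native, encapsulate_pair, and restore are hypothetical, only a single paired tag is handled, and the source markup is located by plain string matching rather than by a real SGML or HTML parser.

```python
import re


def escape_native(text):
    """Escape ampersands and start-tag characters of the source segment."""
    # Ampersands first, so the '&' introduced by '&lt;' is not escaped twice.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def encapsulate_pair(segment, name, i=1):
    """Wrap the native <name>...</name> pair in bpt/ept meta-markup tags."""
    open_tag, close_tag = "<%s>" % name, "</%s>" % name
    before, rest = segment.split(open_tag, 1)
    inner, after = rest.split(close_tag, 1)
    return "".join([
        escape_native(before),
        "<bpt i='%d'>%s</bpt>" % (i, escape_native(open_tag)),
        escape_native(inner),
        "<ept i='%d'>%s</ept>" % (i, escape_native(close_tag)),
        escape_native(after),
    ])


def restore(xlt_text):
    """Remove the meta-markup tags and undo the escaping."""
    stripped = re.sub(r"</?(?:bpt|ept|it|ph|ut)\b[^>]*>", "", xlt_text)
    # '&lt;' first: '&amp;lt;' must come back as '&lt;', not as '<'.
    return stripped.replace("&lt;", "<").replace("&amp;", "&")


source = "We need a <bold> big but &lt; 50 pound </bold> dog."
encoded = encapsulate_pair(source, "bold")
print(encoded)
print(restore(encoded) == source)  # True
```

The order of the two replacements in each direction is what makes the round trip lossless: ampersands are escaped first and unescaped last, so an entity that was already present in the source segment survives unchanged.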

 

HTML tags are one kind of markup that may be enclosed inside meta-markup elements. This allows the markup to be retained and processed during display or import without unduly complicating the core structure by including the XHTML DTD in it. Any kind of markup, including RTF, can be encapsulated in meta-markup tags and later retrieved without loss of information. The XLT approach to meta markup is borrowed from the TMX format of LISA, an ISO/TC 37 liaison organization and supporter of the SALT project.

 

6.4       Meta data categories

The meta data categories of DXLT are as follows. Each of them can potentially be given multiple instantiations in a DCS module, each instantiation specifying one data category. In DXLT, the specific data category instantiation is indicated by the value of a type attribute (e.g. <descrip type='definition'>).

a)       termNote
(A termNoteGrp element receives the data category of its termNote element.)

b)       termComp
(Each termComp element in a termCompList inherits the data category of the list; then each termCompGrp element receives the data category of its termComp element.)

c)       admin
(An adminGrp element receives the data category of its admin.)

d)       adminNote

e)       transac
(A transacGrp element receives the data category of its transac element.)

f)         transacNote

g)       descrip
(A descripGrp element receives the data category of its descrip element.)

h)       descripNote

i)         ref

j)         xref

k)       refObject
(Each refObject element in a refObjectList inherits the data category of the list.)

In general, a  …Grp element in DXLT receives the data category of the first element of the group, and all the elements of a …List element inherit the data category of the list. If the …Grp elements were not optional in the simple case of a single element, then the data category would be specified on the …Grp element directly.
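The rule for …Grp elements can be illustrated with a short Python sketch. The function name data_category and the sample fragments are hypothetical, and the inheritance rule for …List elements is not covered here.

```python
import xml.etree.ElementTree as ET


def data_category(elem):
    """Return the data category of elem as given by a type attribute.

    A ...Grp element takes the data category of its first child element;
    any other element is expected to carry the type attribute itself.
    """
    if elem.tag.endswith("Grp"):
        if len(elem) == 0:
            return None
        return elem[0].get("type")
    return elem.get("type")


grp = ET.fromstring(
    "<descripGrp>"
    "<descrip type='definition'>(definition text)</descrip>"
    "<descripNote type='definitionType'>intensionalDefinition</descripNote>"
    "</descripGrp>"
)
single = ET.fromstring("<admin type='keyword'>(keyword text)</admin>")
print(data_category(grp), data_category(single))  # definition keyword
```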

A term is not formally a meta data category in DXLT, but the termType data category used with a termNote element is used to specify term type, thus rendering a term element an indirect meta data category.

 

6.5  Attributes

The main attributes used in DXLT are lang (language), type, id (to identify an element uniquely), and target (to point to an ID). Additional attributes are found in Annex A.

The value of the lang attribute inherits downward through the implied tree structure of the XML document unless overridden by another lang attribute. The martif element is required to have a lang attribute. The language specified in the martif element becomes the working language of the entire DXLT file. Each langSet element must also specify a language that applies to that Language Section. Thus, a definition at Terminological Entry level is assumed to be in the working language of the martif file unless otherwise specified, and a note in a Language Section is assumed to be in the language of that Language Section unless otherwise specified.

The allowed values of the lang attribute in XLT are the same as the allowed values of the lang attribute in TMX.
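The downward inheritance of the lang attribute can be sketched in Python as follows. The function name effective_langs is hypothetical, and the sample is a reduced fragment for illustration only (it omits the martifHeader and is therefore not a complete DXLT document).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<martif lang='en'>
  <text><body>
    <termEntry id='eid-1'>
      <descrip type='definition'>(definition text)</descrip>
      <langSet lang='hu'>
        <tig><term>Alfa sim&#x00ED;t&#x00E1;si t&#x00E9;nyez&#x0151;</term></tig>
        <note>(note text)</note>
      </langSet>
    </termEntry>
  </body></text>
</martif>"""


def effective_langs(root):
    """Map every element to the language that applies to it."""
    result = {}

    def walk(elem, inherited):
        # A lang attribute on the element overrides the inherited value.
        lang = elem.get("lang", inherited)
        result[elem] = lang
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result


root = ET.fromstring(SAMPLE)
langs = effective_langs(root)
print(langs[root.find(".//descrip")])  # en (working language of the file)
print(langs[root.find(".//note")])     # hu (language of the Language Section)
```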

The id and target attributes work together to point unambiguously between elements in the same martif file. For example, one entry:

<termEntry id="c5574">

...(entry for "hunting dog")

</termEntry>

 

could be pointed to by another entry:

<termEntry>

<descrip type="superordinateConceptGeneric" target="c5574">hunting dog</descrip>

…(entry for "Retriever" [a type of hunting dog])

</termEntry>

 

The redundant content "hunting dog" in the second entry is for display purposes. It provides a name for the link to the other entry that can be viewed by a human who is deciding whether to follow the link.
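The interplay of id and target can be checked with a few lines of Python. The sketch below uses the two entries from the example, wrapped in a body element. The function name resolve_targets is hypothetical, the term sections are filled in for illustration, and xref elements are skipped on the assumption that they point outside the file.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
  <termEntry id="c5574">
    <langSet lang="en"><tig><term>hunting dog</term></tig></langSet>
  </termEntry>
  <termEntry>
    <descrip type="superordinateConceptGeneric" target="c5574">hunting dog</descrip>
    <langSet lang="en"><tig><term>Retriever</term></tig></langSet>
  </termEntry>
</body>"""


def resolve_targets(root):
    """Pair each pointing element with the element whose id it targets."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    resolved, dangling = [], []
    for elem in root.iter():
        target = elem.get("target")
        if target is None or elem.tag == "xref":
            continue
        if target in by_id:
            resolved.append((elem, by_id[target]))
        else:
            dangling.append(elem)
    return resolved, dangling


root = ET.fromstring(SAMPLE)
resolved, dangling = resolve_targets(root)
pointer, pointed_to = resolved[0]
print(pointer.get("type"), "->", pointed_to.tag, pointed_to.get("id"))
```

An empty dangling list indicates that every internal pointer in the file leads to an existing element.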

 

 

7     Definition of the data-category module selected from ISO 12620

7.1    General

This section describes the default data constraint specification (DCS) module for DXLT, which is based on a selection of data categories from ISO 12620 chosen to support somewhat blind interchange. The formal, machine-processable version of the DXLT master DCS module can be found in Annex B. It is referred to as the master DCS of DXLT when distinguishing it from a particular user-group subset DCS.

NOTE: The list orders the data categories according to the section of ISO 12620 in which they are described. It is also the order in which they appear in the master DCS module.

 

7.2        Systematic listing of data categories in DXLT

The following tables define the DXLT master DCS (data constraint specification), which describes the data categories in DXLT that are implemented as XML elements that instantiate a meta data category. The remaining data categories are implemented as the term element, the note element, the date element, the lang attribute, the id attribute, the hi element, and the foreign element. These basic data categories are mentioned in section 6, since they are part of the core structure.

Guidelines for encoding particular data categories in DXLT as XML are given in Annex C.

Each data category other than the basic data categories is related to the meta-model by being classified as either administrative or descriptive. Descriptive data categories may describe either a concept or a term. All data categories that use the Martif tag name descrip are concept-related descriptive data categories.  All data categories that use the Martif tag names termNote or termComp are term-related descriptive data categories. All data categories that use the tag name admin are administrative. Descriptive and administrative data categories are further divided into properties and relations.  In DXLT, a data category is a relation if the target attribute is allowed by the DCS file. Notes can be either administrative or descriptive.

In the following table (split into parts for convenience), the first column (ISO 12620) is the position code of the data category in ISO 12620. The second column (Martif Data Category Name) is the name of that data category when given as the value of the type attribute. Typically, it consists of the name in ISO 12620 with spaces removed and the first letter of the second and subsequent words upper-cased. The third column (TextType) tells what kind of text is allowed in the element. The fourth column tells whether this element can take a target attribute, in which case it indicates what kind of element can be targeted. The fifth column (Martif Tag Name) tells which meta data category is used in DXLT for this data category. The sixth column (Level) gives any exceptional information about the levels in the meta-model at which a particular data category can appear. Admin elements can appear at any level. Descrip elements can appear at the entry, language, or term levels unless otherwise restricted (using codes TE for Terminological Entry, LS for Language Section, and TM for term). TermNote elements can appear only at the term level, unless authorized (by a TC code) to appear also at the Term Component level.

The last column (column seven) contains various comments. The code PA means that this data category is not yet officially in ISO 12620 and is thus Pending Approval (PA). Picklists are found after the tables in footnotes. If the comment column contains a position code of a data category, this indicates that the listed data category has been combined with the data category of the current row.

Data categories that do not have a picklist in the DXLT master DCS can have a picklist in a user-group subset DCS of DXLT (see section 8) if the user-group in question can agree on a picklist for that data category. One obvious candidate for a user-group picklist is partOfSpeech, for which there is no agreed-on picklist when all the languages of the world and all linguistic theories are to be taken into account.
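The typical naming convention for the second column (spaces removed, second and subsequent words capitalized) can be sketched in Python as follows. The function name data_category_name is hypothetical. The convention is only typical, so the result has to be checked against the tables: for example, A.04.02 is named classificationCode in DXLT although ISO 12620 calls it classificationNumber.

```python
def data_category_name(iso_name: str) -> str:
    """Derive a type-attribute value from an ISO 12620 data category name."""
    words = iso_name.split()
    if not words:
        return ""
    # The first word is kept as is; later words get an upper-case first letter.
    return words[0] + "".join(w[:1].upper() + w[1:] for w in words[1:])


print(data_category_name("term type"))       # termType
print(data_category_name("part of speech"))  # partOfSpeech
```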

 

List 7.2

Basic data categories:

-          term [A.01]: <term>

-          highlighted text: <hi>

-          foreign language text: <foreign>

-          language [A.10.07]: the lang attribute, e.g., lang="es" on an element

-          element identifier [A.10.15]: the id attribute, e.g. <termEntry id="eid-45631">

-          date [A.10.02.01]: <date>

-          comment [A.08]: <note>

 

Table 7.2a-n

Table 7.2a — Types of terms (12620: A.2.1)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.02.01 | termType | picklist | none | termNote | | f1 |
| A.02.01.05 | commonNameFor | basicText | term | termNote | | |
| A.02.01.08 | abbreviatedFormFor | basicText | term | termNote | | |

 

f1:      picklist: mainEntryTerm, synonym, internationalScientificTerm, commonName, internationalism, fullForm, shortForm, abbreviatedForm, variant, transliteratedForm, transcribedForm, symbol, formula, equation, logicalExpression, sku, partNumber, phraseologicalUnit, standardText

 

Table 7.2b — Grammar, Usage, and Origin (12620: A.2.2 – A.2.4)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.02.02.01 | partOfSpeech | plainText | none | termNote | TC | |
| A.02.02.02 | grammaticalGender | picklist | none | termNote | TC | f2 |
| A.02.02.03 | grammaticalNumber | picklist | none | termNote | TC | f3 |
| A.02.02.04 | animacy | picklist | none | termNote | TC | f4 |
| A.02.02.07 | grammaticalValency | plainText | none | termNote | TC | PA |
| A.02.03.01 | usageNote | noteText | none | termNote | | |
| A.02.03.02 | geographicalUsage | picklist | none | termNote | | f5 |
| A.02.03.03 | register | picklist | none | termNote | | f6 |
| A.02.03.04 | frequency | picklist | none | termNote | | f7 |
| A.02.03.05 | temporalQualifier | picklist | none | termNote | | f8 |
| A.02.03.06 | timeRestriction | noteText | none | termNote | | |
| A.02.03.07 | proprietaryRestriction | picklist | none | termNote | | f9 |
| A.02.04.01 | termProvenance | picklist | none | termNote | | f10 |
| A.02.04.02 | etymology | basicText | none | termNote | TC | |

 

f2:      picklist: masculine, feminine, neuter,  other

f3:      picklist: singular, plural, dual, mass, other

f4:      picklist: animate, inanimate, other

f5:      picklist: SF, CH, FR etc from ISO 3166 (country codes)

f6:      picklist: neutralRegister, technicalRegister, in-houseRegister, bench-levelRegister, slangRegister, vulgarRegister

f7:      picklist: commonlyUsed, infrequentlyUsed, rarelyUsed

f8:      picklist: archaicTerm, outdatedTerm, obsoleteTerm

f9:      picklist: trademark, tradeName

f10:    picklist: transdisciplinaryBorrowing, translingualBorrowing, loan, translation, neologism

PA = Pending Approval

 

 

Table 7.2c — Term components (12620: A.2.5 – A.2.8)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.02.05 | pronunciation | basicText | none | termNote | | |
| A.02.06 | syllabification | basicText | none | termCompList | | |
| A.02.07 | hyphenation | basicText | none | termCompList | | |
| A.02.08.01 | morphologicalElement | basicText | none | termCompList | | |
| A.02.08.02 | termElement | basicText | none | termCompList | | |
| A.02.08.03 | termStructure | noteText | none | termNote | | PA |

 

PA = Pending Approval

 

 

Table 7.2d — Term status (12620: A.2.9)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.02.09.01 | normativeAuthorization | picklist | none | termNote | | f11 |
| A.02.09.02 | languagePlanningQualifier | picklist | none | termNote | | f12 |
| A.02.09.03 | administrativeStatus | picklist | none | termNote | | f13 |
| A.02.09.04 | processStatus | picklist | none | termNote | | f14 |

 

Note: discussion is needed as to whether processStatus should have a picklist or be plainText

 

f11:    picklist: standardizedTerm, preferredTerm, admittedTerm, deprecatedTerm, supercededTerm, legalTerm, regulatedTerm

f12:    picklist: recommendedTerm, nonstandardizedTerm, proposedTerm, newTerm

f13:    picklist: standardizedTerm, preferredTerm, admittedTerm, deprecatedTerm, supercededTerm, legalTerm, regulatedTerm

f14:    picklist: unprocessed, provisionallyProcessed, finalized

 

 

Table 7.2e — Equivalence (12620: A.3)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.03.02 | falseFriend | basicText | term | termNote | | |
| A.03.04 | reliabilityCode | picklist | none | descrip | | f15 |
| A.03.05 | transferComment | noteText | term | termNote | | |

 

f15:    picklist: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

 

 

Table 7.2f — Classification System (12620: A.4)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.04 | subjectField | plaintext | none | descrip | LS, TM | |
| A.04.02 | classificationCode | plaintext | classSysDescrip | descrip | LS, TM | A.4.1+ |

 

Note: In 12620, A.04.02 is called classificationNumber.

 

Table 7.2g — Concept-related descriptions (12620: A.5)

 

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.05.01 | definition | noteText | none | descrip | LS, TM | |
| A.05.02 | explanation | noteText | none | descrip | LS, TM | |
| A.05.03 | context | noteText | none | descrip | LS, TM | |
| A.05.04 | example | noteText | none | descrip | LS, TM | |
| A.05.05.01 | figure | noteText | binaryData | descrip | TE, LS, TM | |
| A.05.05.02 | audio | noteText | binaryData | descrip | TE, LS, TM | |
| A.05.05.03 | video | noteText | binaryData | descrip | TE, LS, TM | |
| A.05.05.04 | table | noteText | binaryData | descrip | TE, LS, TM | |
| A.05.05.05 | otherBinaryData | noteText | binaryData | descrip | TE, LS, TM | |
| A.05.06 | unit | noteText | none | descrip | TM | |
| A.05.07 | range | noteText | none | descrip | TM | |
| A.05.08 | characteristic | noteText | none | descrip | TM | |

 

 

Table 7.2h — Concept relations (12620: A.6 and A.7 combined)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.07.02 | conceptPosition | plaintext | conceptSysDescrip | descrip | TE, LS | A.07.01+ |
| A.07.02.01 | broaderConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
| A.07.02.01 | broaderConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
| A.07.02.02 | superordinateConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
| A.07.02.02 | superordinateConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
| A.07.02.03 | subordinateConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
| A.07.02.03 | subordinateConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
| A.07.02.04 | coordinateConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
| A.07.02.04 | coordinateConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
| A.07.02.05.01 | relatedConceptBroader | basicText | entry | descrip | TE, LS | |
| A.07.02.05.02 | relatedConceptNarrower | basicText | entry | descrip | TE, LS | |
| A.07.02.05 | relatedConcept | basicText | entry | descrip | TE, LS | |
| A.07.02.06 | sequentialRelatedConcept | basicText | entry | descrip | TE, LS | A.06.03 |
| A.07.02.07 | temporallyRelatedConcept | basicText | entry | descrip | TE, LS | A.06.03.01 |
| A.07.02.08 | spatiallyRelatedConcept | basicText | entry | descrip | TE, LS | A.06.03.02 |
| A.07.02.09 | associatedConcept | basicText | entry | descrip | TE, LS | A.06.04 |
| A.10.18.06 | antonymTerm | basicText | term | descrip | TM | |
| A.10.18.06 | antonymConcept | basicText | entry | descrip | TE | |

 

Note: further discussion is needed concerning whether antonyms are term-relations, concept-relations, or both

 

 

 

 

Table 7.2i — Specialized notes (12620: A.8)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.08.01 | descripType | picklist | element | descripNote | | PA, f16 |
| A.08.02 | definitionType | picklist | element | descripNote | | PA, f17 |
| A.08.03 | contextType | picklist | element | descripNote | | PA, f18 |

 

Note: The general note is not a meta data category.

f16:    picklist: translation

f17:    picklist: intensionalDefinition, extensionalDefinition, partitiveDefinition

f18:    picklist: definingContext, explanatoryContext, associativeContext, linguisticContext, metalinguisticContext

 

Table 7.2j — Documentary language  (e.g., thesaurus) (12620: A.9)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.09.02 | thesaurusDescriptor | basicText | thesaurusDescrip | descrip | TE | A.09.01+ |
| A.09.04 | keyword | plaintext | none | admin | | |
| A.09.05 | indexHeading | plaintext | none | admin | | |

 

 

Table 7.2k — Transactions (12620: A.10.1 – A:10.2-3)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.10.01 | transactionType | picklist | none | transac | | f19 |
| A.10.02.02 | responsibility | basicText | Person/Org | transacNote | | two data categories |
| A.10.02.03 | count | plaintext | none | transacNote | | |
| A.10.02.10 | subsetOwner | basicText | personOrg | admin | | |

Notes: <date> can also appear in a <transacGrp>; A.10.02.02 data categories are:  responsiblePerson and responsibleOrg

f19:    picklist: creation [formerly origination], input, modification, check, approval, withdrawal, standardization, exportation, importation, proposal, userAccess

 

Table 7.2l — Subsets (12620: A.10.3)

| ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
| A.10.03.01 | customerSubset | plaintext | none | admin | | |
| A.10.03.03 | projectSubset | plaintext | none | admin | | |
| A.10.03.05 | productSubset | plaintext | none | admin | | |
| A.10.03.06 | applicationSubset | plaintext | none | admin | | |
| A.10.03.07 | environmentSubset | plaintext | none | admin | | |
| A.10.03.08 | businessUnitSubset | plaintext | none | admin | | |
| A.10.03.09 | securitySubset | picklist | none | admin | | f20 |

 

Note:  entailed term (10.6.1) is implemented in DXLT using the hi element, and foreign (10.8) is implemented in DXLT using  the foreign element.

Note: (1 = public; 10 = highly confidential)

f20:    1, 2, 3, 4, 5, 6, 7, 8, 9, 10

 

Table 7.2m — Other administrative information (12620: A.10.4 – A.10.21, except antonym)

ISO 12620

Martif Data Category Name

TextType

Target

Martif Tag Name

Level

Comments

A.10.06.02

sortKey

plainText

none

admin

 

 

A.10.06.03

searchTerm

basicText

none

admin

 

 

A.10.13

entrySource

noteText

?

admin

TE, LS, TM

 

A.10.14

conceptIdentifier

noteText

?

admin

TE, LS, TM

 

A.10.18

see

noteText

element

ref

TE, LS, TM

 

A.10.18

crossReference

noteText

any element

ref

TE, LS, TM

 

A.10.18

xCrossReference

noteText

external

xref

TE, LS, TM

 

A.10.18.05

homograph

basicText

term

termNote

 

 

A.10.18.06

antonym

basicText

term

descrip

 

 

A.10.19