
Copyright © 2000
Translation Research Group
Send comments to comments@ttt.org
Last updated: January 27, 2001

The CLS Framework: Reltef specification


Daniel Hardman, CMR-TermSoft

Revised 1999-03-18

Reltef™ and MetaRef™ © 1996-2000
Daniel Hardman and CMR-TermSoft. This format is copyrighted to control its evolution. Feel free to use it in your own applications, but please send us a note to let us know that you've found it useful or to get the latest updates.

Background assumptions

This document builds on the termbase data model framework outlined in a separate document entitled "The CLS Framework". The CLS Framework is a logical organization of ISO 12620 data categories that reflects certain tenets of terminology theory and conceptualizes terminology data models using building blocks familiar to professional terminologists.

[Figure: ISO 12620 --> CLS Framework; CLS Framework --SGML--> MARTIF; CLS Framework --MetaREF--> Reltef]

In order to arrive at a final data model for a termbase, we assume the CLS Framework as a point of departure. The data categories and the categorical hierarchy proposed in the CLS Framework can be implemented in various ways. Two possible encoding paradigms are SGML and relational database systems. When the CLS Framework is implemented with SGML, the result is MARTIF (ISO 12200). When the CLS Framework is implemented in relational database systems using MetaREF™ (outlined below), the result is Reltef™. Because MARTIF and Reltef™ share a common framework, they are completely compatible and congruent.

Reltef™ is a specific application of MetaREF™, an abstract data model designed by CMR-TermSoft. The MetaREF™ model consists of an E-R diagram and a set of tables and relationships that can be implemented in any mid-range (e.g., MS Access) to high-range RDBMS (e.g., Oracle, SQL Server, Ingres, Informix, Sybase, etc.). The Reltef™ model consists of the abstract structure of MetaREF™ plus tuples in several MetaREF™ "meta" tables. This data locks the DBMS into the specific terminology model an application requires. A Reltef™ implementation consists of the Reltef™ model plus a coherent body of terminological data stored in non-meta tables.

Reltef™ takes advantage of many strengths of a traditional DBMS, including integrity constraints, data normalization, access to data through traditional and extended SQL, business rules, and so forth. It also provides enormous flexibility and can be customized to address the needs of many data sets in a wide variety of human languages.

Basic Philosophy

Relational systems do certain things very well when it comes to data storage, manipulation, and retrieval. They share a fairly standardized query language (SQL); they are able to impose many constraints to ensure the validity of data; they typically import and export data to other RDBMS engines with a high degree of transparency (assuming congruent data models); they support multiple concurrent access, record locking, etc. However, RDBMSs also have certain weaknesses when it comes to highly linguistic data. They typically index information using a single collating sequence that must be applied to the database as a whole; they have no standardized way of encoding multiple languages (especially if the CJK languages are part of the mix); many validation routines for text fields are English-oriented; their query interface has no built-in linguistic savvy. Also, various relational implementations of terminological data models are highly incompatible with one another because of fundamentally different assumptions about what the atomic units of data and structure in the database should be.

SGML can be considered more properly a document encoding language as opposed to a database system. It allows for detailed "templates" (DTDs) that specify a particular structure; these templates can be populated by an infinite (but bounded) variety of specific instantiations. A well-known SGML DTD is HTML, the language used to encode documents on the Web. Like RDBMSs, SGML has both strengths and weaknesses. Its strengths include true platform independence, extremely flexible handling of multiple languages, automated processing and conversion by SGML-aware software, structural and data integrity constraints, and conformity validation. Its weaknesses include the lack of a query language, no indexing capability, an inability to regulate and distribute access to data, and lack of SGML management engines that provide database-like functionality.

Reltef™ stands for "Relational terminology encoding format", and represents a fusion of these two normally distinct approaches. Reltef™ data is stored in relational tables and joined by standard relationships; it can be queried using SQL, described by an E-R diagram, and edited using RDBMS tools. However, it also parallels the structure of an SGML document and borrows from SGML to deal with character encoding, collating sequences, and other multilingual issues. It can be converted automatically between an SGML document and relational tables to allow information to flow across platform, language, and software boundaries with maximum ease and minimum data loss.

Molecular and atomic views of data

Chemists make important generalizations about matter on at least two very different levels: molecular and atomic. Certain molecules may have common characteristics even though their specific structures are somewhat different (acid~base, organic~inorganic, strand~ring~crystal, etc.). It is valid and useful to think about matter from a molecular perspective. However, an atomic-level classification is also necessary to explain many phenomena, such as why particular molecules form and how they can combine.

Reltef™ views terminological data at two different levels of abstraction. At one level the major objects under consideration are data category instances--pieces of information that can be classified using the CLS Framework, such as concepts, terms, definitions, and the like. At a more elemental level, Reltef™ recognizes the root entities of MetaREF™: languages, picklists, text values, dates, and so forth. These two perspectives could be called "molecular" and "atomic", respectively. To illustrate these perspectives, consider the following piece of fictional terminological data:

 

hyperdrive
  POS: noun
  NUM: singular
  SYN: faster-than-light drive
  SYN: FTL
  DOM: science fiction ~ space travel ~ propulsion
  DEF: Any method of propulsion that achieves travel at speeds that exceed the speed of light. Typically this travel is conceived of as non-linear, meaning that no continuous traversal of space-time occurs.
  REF: Dictionary of Science Fiction and Fantasy, vol. 1, p. 234.
  CON: "Hold on, kid. I'm gonna kick this thing into hyperdrive. Let's see the star cruiser follow us then!"
  REF: Star Wars, Act II, scene 6.

 

What pieces of data are present in this sample? A terminologist might identify the following data category instances: three terms (one preferred, the other two synonyms), a part of speech, an indication of the grammatical number of a term, a domain, a definition, a context, and two bibliographical references.

This valid and relatively intuitive way of conceptualizing the data (in which the primary unit of data is the data category instance) can be thought of as "molecular", in the sense that the data units can be further reduced or generalized. Terms, definitions, contexts and the like share certain common features, regardless of their specific type: they each modify a single parent element; they each have a single value; they each fall into a single data category. A particular configuration of these more elemental building blocks [atoms] yields a data category instance [molecule]. Molecular perspectives are important both to MARTIF and to relational terminology applications.

Molecular approaches

Because of its origins in SGML, MARTIF is able to address information at an atomic level. For convenience, however, most discussions of MARTIF focus on the molecular perspective, working directly with data categories familiar to a terminologist. Thus the major building block of a MARTIF file is a <termEntry> (container for all information on a given concept). A <termEntry> must consist of one or more <ntig>s (container for all information about a particular term), plus additional (optional) descriptive information (such as a definition). The <ntig>s, in turn, each contain exactly one <term>, followed by optional descriptive information (the part of speech, a contextual example, etc.).

Typical relational terminology models are also molecule-oriented. Concepts, terms, definitions, contexts, bibliographic references, and grammatical information are each conceived of as separate entities, with predefined relationships between the entities encoded as part of the data model: CONCEPT is labeled by one or more TERMs; TERM is demonstrated by one or more CONTEXTs; CONCEPT is identified by one or more DEFINITIONs, etc. In such systems, the concept is usually privileged over all other entities, in the sense that the concept forms a nucleus around which other pieces of information must cluster. Each entity usually corresponds to a separate table, with fields that reflect its particular data characteristics. For example, the term table might have a numeric ID, a 120-character text field to hold the value of the term, and a small field to identify the term's language. It could also have a status field to classify the term as preferred or deprecated, a part of speech field, a grammatical number field, and so forth.

Molecular-only models of terminology data have the advantage of being relatively intuitive to a terminologist. However, they also have certain drawbacks. They are typically built in such a way as to preclude the sharing of information with incompatible models. For example, if a relational model requires that all terms have an associated status, and if terminology from an external system has no status on the terms, then the integrity constraints of the DBMS will make it impossible to import the external data. This is a good thing in terms of internal consistency, but may make it difficult to share information.
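The import problem described above can be sketched with a small molecule-oriented schema. This is only an illustration, not part of Reltef™: the table layout, the column names, and the helper import_term are hypothetical, and SQLite (through Python's sqlite3 module) stands in for whatever RDBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (id INTEGER PRIMARY KEY);
    -- Hypothetical molecule-oriented term table: every term must carry a status.
    -- (SQLite accepts VARCHAR(120) but does not enforce the length.)
    CREATE TABLE term (
        id         INTEGER PRIMARY KEY,
        concept_id INTEGER NOT NULL REFERENCES concept(id),
        value      VARCHAR(120) NOT NULL,
        lang       CHAR(2) NOT NULL,
        status     TEXT NOT NULL CHECK (status IN ('preferred', 'deprecated'))
    );
    INSERT INTO concept (id) VALUES (1);
""")


def import_term(concept_id, value, lang, status=None):
    """Try to import one external term; return True if stored, False if rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO term (concept_id, value, lang, status) "
                "VALUES (?, ?, ?, ?)",
                (concept_id, value, lang, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = import_term(1, "hyperdrive", "en", "preferred")
rejected = import_term(1, "warp drive", "en")  # external record carries no status
```

The second call fails because the NOT NULL constraint on status cannot be satisfied, which is exactly the trade-off between internal consistency and data sharing noted above.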

Once a molecular model has been implemented, it also adopts all of the linguistic idiosyncrasies of its RDBMS. For example, Japanese information in an Oracle database may be reliably and safely encoded on a particular flavor of UNIX, but may not transfer at all if that database must be moved to a different machine running a different OS. It almost certainly will not transfer if Oracle is laid aside in favor of another RDBMS somewhere in the future. Some database engines use Unicode internally, but few export it and even fewer allow Unicode-aware queries. If they do allow Unicode queries, there is still the issue of conversion between the OS code page and Unicode during keyboard input, web access, and so forth; any solutions are virtually guaranteed to be implementation-specific and non-transferable. And the RDBMS will probably not sort CJK languages properly. If it does, then it is likely to impose a CJK sort order throughout the database (or possibly allow CJK plus one other order).

A Dual Perspective

Reltef™ is innovative in that it conceives of terminological data at both the atomic and molecular levels of abstraction. It uses the MetaREF™ data model to coordinate data at the atomic level, and gains from MetaREF™ all the inherent power of an RDBMS to manage entities, relationships, and integrity constraints. And it allows a query-driven management of the more familiar molecular-level constructs as well.

How it works

The atomic level in Reltef™ is implemented directly through the MetaREF™ data model. The core of this data model is a set of thirteen entities [atoms]:

 

Entity name | Description
data category | a particular class of terminological information (e.g., a term, a part of speech, etc.)
data category name | a language-dependent, user-friendly name for a given data category (e.g., in Spanish a term could be called "término")
data category index type | an indexing strategy that corresponds to a data category (e.g., don't index it, index it as a single value, index it word-by-word)
lang | a distinct language, consisting of a uniform encoding scheme that employs a single charset (e.g., French, German, Italian, etc.)
charset | a unique combination of characters that may be used to represent one or more languages (e.g., ISO 8879-1)
picklist | a set of possible values for terminological data from a given data category (e.g., for the data category "part of speech": {noun, verb, adjective})
element | a unique piece of terminological data
date value | a date(-time stamp) that constitutes the value of an element
number value | a number that constitutes the value of an element
picklist value | a member of a picklist that constitutes the value of an element
text value | a sequence of characters that constitutes the value of an element
index value | a string that represents a normalized, indexed form of all or part of the value of a particular element
link | a connection between two elements

 

The first six items above are "meta" entities; they are created (and their corresponding tables filled with information) before any terminological data is added to the database. It is from these meta entities that "MetaREF" (Meta Relational Encoding Format) gets its name. MetaREF™ defines a very abstract layer of informational possibilities. By populating the six meta entities (plus one meta relationship), the specific data model used by an application (in this case, a termbase) is outlined and enforced. In other words, the meta tables together define the structure that constrains and unifies the terminological data at a molecular level. They might be considered catalyst atoms, necessary to the combination of the other atoms in molecular reactions.

The remaining seven MetaREF™ entities are populated directly through traditional data entry or import and (in Reltef™) hold the actual terminology data visible to an end user of the system. The information these entities contain may be validated at a molecular level using standard SQL queries. Most information-retrieval queries formulated by end users of the DBMS focus almost entirely on information contained by these entities.

[Figure: atomic and molecular perspectives on terminological data]
 
 

Sample database

To illustrate how these entities are actually implemented in the tables of a Reltef™ database, we will create a small sample. We begin by encoding the CLS Framework in the seven meta tables inherited from MetaREF™. This process identifies all the data categories that the Reltef™ database will recognize, how instances of the categories may combine and be indexed, what values are valid for a given data category, and so forth. Once we have completed the meta table population, our database will be specifically terminology-oriented, and will be ready for the input of data recognizable to the termbase end user.

The first step is identifying the languages the database will support. This fills the langs table and implements the lang entity. Note that the collating sequence is important for certain kinds of linguistic processing, but is left blank to simplify our example for now.

 

Langs table

langID | description | charsetID | collating sequence | noise list
en | English | ISO 8879-1 | (blank) | the, of, is, are, it, to, be, not, and
es | Spanish | ISO 8879-1 | (blank) | de, y, o, la, el, que, los, las

 

We must also explain what charset "ISO 8879-1" is. This fills the charsets table and implements the charset entity.

 

Charsets table

charsetID | description | wsd
ISO 8879-1 | standard "lower ASCII" plus Western European language chars such as accented vowels, etc. Supports English, French, German, Spanish, Italian, Swedish, Dutch, Norwegian, etc. | ISOLAT1.WSD
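As a rough sketch of how the langs and charsets tables and the "lang uses charset" relationship might be declared, the following uses SQLite through Python's sqlite3 module. The column types, the underscore spellings of the column names, and the helper add_lang are assumptions made for this illustration; the source does not prescribe DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE charsets (
        charsetID   TEXT PRIMARY KEY,
        description TEXT,
        wsd         TEXT
    );
    CREATE TABLE langs (
        langID             TEXT PRIMARY KEY,
        description        TEXT NOT NULL,
        charsetID          TEXT NOT NULL REFERENCES charsets(charsetID),
        collating_sequence TEXT,   -- left blank in the sample
        noise_list         TEXT
    );
""")
conn.execute(
    "INSERT INTO charsets VALUES (?, ?, ?)",
    ("ISO 8879-1", "lower ASCII plus Western European characters", "ISOLAT1.WSD"),
)
conn.executemany("INSERT INTO langs VALUES (?, ?, ?, ?, ?)", [
    ("en", "English", "ISO 8879-1", None, "the, of, is, are, it, to, be, not, and"),
    ("es", "Spanish", "ISO 8879-1", None, "de, y, o, la, el, que, los, las"),
])
conn.commit()


def add_lang(lang_id, description, charset_id):
    """Insert a language; return False if its charset is not defined."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO langs (langID, description, charsetID) VALUES (?, ?, ?)",
                (lang_id, description, charset_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = add_lang("fr", "French", "ISO 8879-1")
bad = add_lang("ja", "Japanese", "no-such-charset")  # rejected by the foreign key
```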

 

Next we should define some data categories. All Reltef™ databases populate the data categories table (and implement the data category entity) from the CLS Framework (and via CTI, from ISO 12620). Of course, a particular implementation of Reltef™ need not use all or even most of the myriad data categories outlined there; we define only those kinds of data we are interested in. But we must draw our definitions from the standard to ensure compatibility and convertibility.

Here it is vital to note that from an atomic perspective, all data categories are created equal. A basic assumption of most relational terminology databases is the centrality of the concept entity, to which terms, definitions and so forth are subservient. There is no intrinsic quality of the MetaREF™ data model that parallels this assumption. In the data categories table, a concept is just another tuple. It is possible to enforce the concept's core quality in Reltef™, but the enforcing occurs in the valid families table (discussed below), not here.

The structure of the data categories table is relatively straightforward. The values of each categoryID, GI name, and type derive directly from ISO 12620. Possible values of the "value type" field are: container (element is a shell that contains other elements), none (element has no value at all), text (free-form language-specific information), code (free-form language-independent information), number, date, and picklist. The "needs lang" field tells whether instances of this data category take a lang attribute. The "forms link" field tells whether this data category is required to point to another element (as in cross references).

 

Data categories table

categoryID | GI name | type | value type | needs lang | forms link
0 | termbase | * | container | No | No
0.1 | body | * | container | No | No
0.3 | termEntry | * | container | No | No
1.1.1 | term | * | text | Yes | No
2.1.1 | termNote | partOfSpeech | picklist | No | No
2.1.2 | termNote | grammaticalGender | picklist | No | No
2.1.3 | termNote | grammaticalNumber | picklist | No | No
2.2.2 | termNote | geographicalUsage | text | Yes | No
5.1 | descrip | definition | text | Yes | No
5.3 | descrip | context | text | Yes | No
8 | note | * | text | Yes | No
10.18 | ptr | cross-reference | none | No | Yes

 

Now we need to give these data categories (each identified by a somewhat cryptic categoryID) a meaningful, language-specific name. This fills the data category names table and implements the data category name entity.

 

Data category names table

categoryID | langID | name
0 | en | termbase
0 | es | base de datos de terminología
0.1 | en | terminological information
0.1 | es | información terminológica
0.3 | en | concept
0.3 | es | concepto
1.1.1 | en | term
1.1.1 | es | término
2.1.1 | en | part of speech
2.1.1 | es | función gramatical
2.1.2 | en | gender
2.1.2 | es | masculino/femenino
2.1.3 | en | number
2.1.3 | es | número gramatical
2.2.2 | en | geographical usage
2.2.2 | es | uso geográfico
5.1 | en | definition
5.1 | es | definición
5.3 | en | contextual example
5.3 | es | ejemplo contextual
8 | en | note
8 | es | nota
10.18 | en | cross-reference
10.18 | es | véase también

 

Having given a name to each data category, we need to create picklists for those data categories that require them. This fills the picklists table and implements the picklist entity.

 

Picklists table

categoryID | value
2.1.1 | adjective
2.1.1 | noun
2.1.1 | verb
2.1.2 | feminine
2.1.2 | masculine
2.1.2 | neuter
2.1.3 | mass
2.1.3 | plural
2.1.3 | singular

 

Now we identify a data category index type for all data categories that can be indexed (value-less data categories cannot be indexed). This fills the data category index types table and implements the data category index type entity. Note that in the "index type" field, the allowed values are: "none" (the data category is not indexed), "whole" (the entire value of each instance of the data category is indexed as a single value), "chunk" (the value of each instance of the data category is divided into chunks and indexed on a per-chunk basis), and "both" (whole + chunk).

 

Data category index types table

categoryID | index type
1.1.1 | both
2.1.1 | whole
2.1.2 | whole
2.1.3 | whole
2.2.2 | chunk
5.1 | chunk
5.3 | chunk
8 | chunk

 

Our last task with the "meta" entities is to specify how data categories may combine. This fills the valid families table and implements the relationship data category A may be parent of B (see E-R diagram below). Note that in field "child occurrence rule", the allowed values are: 1 (child must occur exactly once), ? (child must occur 0 or 1 times), * (child must occur 0 or more times), and + (child must occur 1 or more times).

The data we place in this table will be used to enforce so-called "business rules" as well as certain kinds of data constraints. For example, by specifying that all concepts (data category 0.3) must have at least 1 child term (data category 1.1.1), we impose a portion of the concept centrality feature mentioned earlier. If a concept is added, then the valid families table will require that at least one term be added as well. We can check for conformance to this information with a simple query. Many DBMS engines allow this kind of query as a validity check whenever a record is added, making molecular-level integrity constraints straightforward and bulletproof.

Valid families table

parent categoryID | child categoryID | child occurrence rule
0 | 0.1 | 1
0.1 | 0.3 | +
0.3 | 1.1.1 | +
0.3 | 5.1 | +
0.3 | 8 | *
0.3 | 10.18 | *
1.1.1 | 2.1.1 | ?
1.1.1 | 2.1.2 | *
1.1.1 | 2.1.3 | *
1.1.1 | 2.2.2 | *
1.1.1 | 5.3 | *
1.1.1 | 8 | *
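The conformance check mentioned above might look something like the following sketch, which counts the children each element actually has and compares the count with the occurrence rule. It uses SQLite through Python's sqlite3 module; the underscore column names and the column called occurrence are assumptions for the example. The element rows are the ones this document enters in its elements table. The query checks occurrence rules only; it does not flag parent/child pairs that are absent from the valid families table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE valid_families (
        parent_categoryID TEXT, child_categoryID TEXT, occurrence TEXT);
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
""")
conn.executemany("INSERT INTO valid_families VALUES (?, ?, ?)", [
    ("0", "0.1", "1"), ("0.1", "0.3", "+"), ("0.3", "1.1.1", "+"),
    ("0.3", "5.1", "+"), ("0.3", "8", "*"), ("0.3", "10.18", "*"),
    ("1.1.1", "2.1.1", "?"), ("1.1.1", "2.1.2", "*"), ("1.1.1", "2.1.3", "*"),
    ("1.1.1", "2.2.2", "*"), ("1.1.1", "5.3", "*"), ("1.1.1", "8", "*"),
])
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("testdb", "0", "testdb"), ("body1", "0.1", "testdb"),
    ("concept1", "0.3", "body1"), ("concept2", "0.3", "body1"),
    ("definition1", "5.1", "concept1"), ("term1", "1.1.1", "concept1"),
    ("ptr1", "10.18", "concept1"), ("term2", "1.1.1", "concept1"),
    ("term3", "1.1.1", "concept2"), ("context1", "5.3", "term1"),
    ("geousage1", "2.2.2", "term1"), ("note1", "8", "term2"),
    ("pos1", "2.1.1", "term2"), ("gender1", "2.1.2", "term2"),
])

# For every element and every rule that applies to its category, count the
# matching children and keep only the combinations that break the rule.
OCCURRENCE_VIOLATIONS = """
    SELECT p.ID, vf.child_categoryID, vf.occurrence, COUNT(c.ID) AS found
    FROM elements AS p
    JOIN valid_families AS vf ON vf.parent_categoryID = p.categoryID
    LEFT JOIN elements AS c
           ON c.parentID = p.ID
          AND c.categoryID = vf.child_categoryID
          AND c.ID <> p.ID              -- the root lists itself as its parent
    GROUP BY p.ID, vf.child_categoryID, vf.occurrence
    HAVING (vf.occurrence = '1' AND COUNT(c.ID) <> 1)
        OR (vf.occurrence = '?' AND COUNT(c.ID) > 1)
        OR (vf.occurrence = '+' AND COUNT(c.ID) < 1)
    ORDER BY p.ID, vf.child_categoryID
"""
violations = conn.execute(OCCURRENCE_VIOLATIONS).fetchall()
# -> [('concept2', '5.1', '+', 0)]
```

With the sample rows, the one violation reported is that concept2 has no definition child, although the rule for 0.3 and 5.1 is "+".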

 

Once these seven tables contain data, we have a sufficient framework to begin actually entering terminology data. Let's create two concepts, three terms, and some simple supporting information, using the following structure:

 

[Figure: data tree. The database is the root, concepts are children, terms are grandchildren...]

 

Here we have uniquely labeled each piece of information using a somewhat arbitrary ID ("term1", "concept2", etc.). Reltef™ imposes some restrictions on these IDs. They are strings of at most 32 characters. They must begin with an alphabetic character, and can contain alphabetic characters, digits, hyphens, and periods ("."). But they need not reflect their associated data category. The terminologist who manages a Reltef™ database decides whether the IDs should follow a particular convention; alternatively, the IDs can be generated automatically during data entry or import. In this case, we have chosen a reasonable convention that seems easy to follow. We begin by adding each data category instance to our elements table:

 

ELEMENTS table

ID | categoryID | parentID
testdb | 0 | testdb
body1 | 0.1 | testdb
concept1 | 0.3 | body1
concept2 | 0.3 | body1
definition1 | 5.1 | concept1
term1 | 1.1.1 | concept1
ptr1 | 10.18 | concept1
term2 | 1.1.1 | concept1
term3 | 1.1.1 | concept2
context1 | 5.3 | term1
geousage1 | 2.2.2 | term1
note1 | 8 | term2
pos1 | 2.1.1 | term2
gender1 | 2.1.2 | term2
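The ID restrictions stated before the elements table can be checked with a short pattern. The sketch below is one possible reading of those rules in Python; it assumes that "alpha" means an ASCII letter, which the source does not state, and the names ID_PATTERN and is_valid_id are invented for the example.

```python
import re

# One reading of the ID rules: at most 32 characters, first character a letter,
# remaining characters letters, digits, hyphens, or periods.
# Assumption: only ASCII letters count as "alpha".
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,31}")


def is_valid_id(candidate: str) -> bool:
    """Return True if the candidate satisfies the ID restrictions above."""
    return ID_PATTERN.fullmatch(candidate) is not None


sample_ids = ["testdb", "concept1", "geousage1", "1term", "term_1", "a" * 33]
valid_ids = [s for s in sample_ids if is_valid_id(s)]
```

A check like this could run during data entry or import, before a row reaches the elements table.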

 

By adding these tuples to the elements table, we have created instances of various terminological (molecular) data categories. However, at a molecular level some of these newly-created objects are incomplete. For example, the terms, the definition, the context, the note, and the geographical usage note all have text values. In Reltef™'s atomic perspective, the text values are separate entities associated through the element has text value relationship by a common ID. Thus we add the following information to the text values table:

 

TEXT VALUES table

ID | value
definition1 | Any method of propulsion that achieves travel at speeds that exceed the speed of light. Typically this travel is conceived of as non-linear, meaning that no continuous traversal of space-time occurs.
term1 | hyperdrive
term2 | sistema de propulsión más rápido que la luz
term3 | warp drive
note1 | this particular equivalent is not likely to be reliable
geousage1 | USA
context1 | "Hold on, kid. I'm gonna kick this thing into hyperdrive. Let's see the star cruiser follow us then!"

 

Each element may also have an explicitly associated lang in the elements use lang table (if the element's data category is defined to take a lang attribute):

 

ELEMENTS USE LANG table

ID | langID
definition1 | en
term1 | en
term2 | es
term3 | en
note1 | en
geousage1 | en
context1 | en

 

The gender and part of speech data category instances both draw their value from picklists. Consulting the earlier information we entered in the picklists table, we choose a valid value for each data category instance and enter it in the picklist values table:

 

PICKLIST VALUES table

ID | value
pos1 | noun
gender1 | masculine
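Whether a chosen picklist value is actually valid for its element's data category can be verified with a join against the picklists table. The following is a minimal sketch using SQLite through Python's sqlite3 module. The underscore table name picklist_values and the extra element num1 with the value "dual" are hypothetical, added only so that the check has something to report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE picklists (
        categoryID TEXT, value TEXT, PRIMARY KEY (categoryID, value));
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE picklist_values (ID TEXT PRIMARY KEY, value TEXT);
""")
conn.executemany("INSERT INTO picklists VALUES (?, ?)", [
    ("2.1.1", "adjective"), ("2.1.1", "noun"), ("2.1.1", "verb"),
    ("2.1.2", "feminine"), ("2.1.2", "masculine"), ("2.1.2", "neuter"),
    ("2.1.3", "mass"), ("2.1.3", "plural"), ("2.1.3", "singular"),
])
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("pos1", "2.1.1", "term2"),
    ("gender1", "2.1.2", "term2"),
    ("num1", "2.1.3", "term2"),      # hypothetical element, not in the sample
])
conn.executemany("INSERT INTO picklist_values VALUES (?, ?)", [
    ("pos1", "noun"),
    ("gender1", "masculine"),
    ("num1", "dual"),                # deliberately not in the 2.1.3 picklist
])

# An element's value is invalid when no picklist row exists for the
# combination of the element's category and the chosen value.
INVALID_PICKLIST_VALUES = """
    SELECT e.ID, e.categoryID, v.value
    FROM elements AS e
    JOIN picklist_values AS v ON v.ID = e.ID
    LEFT JOIN picklists AS p
           ON p.categoryID = e.categoryID AND p.value = v.value
    WHERE p.value IS NULL
    ORDER BY e.ID
"""
invalid = conn.execute(INVALID_PICKLIST_VALUES).fetchall()
```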

 

Now we can implement the link created by ptr1. We do this in the links table:

 

LINKS table

ID | targetID
ptr1 | concept2
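Because every element records its parent, the rows entered so far can be walked recursively and written out as nested markup, which hints at how a relational-to-SGML conversion could proceed. The sketch below uses SQLite through Python's sqlite3 module and a subset of the sample rows. The element and attribute names are simply taken from the GI name and type columns; the function render and the id, lang, and target attributes are assumptions for illustration, and the output is not claimed to be valid MARTIF.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_categories (categoryID TEXT PRIMARY KEY, gi_name TEXT, type TEXT);
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE text_values (ID TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE elements_use_lang (ID TEXT PRIMARY KEY, langID TEXT);
    CREATE TABLE links (ID TEXT PRIMARY KEY, targetID TEXT);
""")
conn.executemany("INSERT INTO data_categories VALUES (?, ?, ?)", [
    ("0", "termbase", "*"), ("0.1", "body", "*"), ("0.3", "termEntry", "*"),
    ("1.1.1", "term", "*"), ("10.18", "ptr", "cross-reference"),
])
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("testdb", "0", "testdb"), ("body1", "0.1", "testdb"),
    ("concept1", "0.3", "body1"), ("concept2", "0.3", "body1"),
    ("term1", "1.1.1", "concept1"), ("ptr1", "10.18", "concept1"),
    ("term3", "1.1.1", "concept2"),
])
conn.executemany("INSERT INTO text_values VALUES (?, ?)",
                 [("term1", "hyperdrive"), ("term3", "warp drive")])
conn.executemany("INSERT INTO elements_use_lang VALUES (?, ?)",
                 [("term1", "en"), ("term3", "en")])
conn.execute("INSERT INTO links VALUES ('ptr1', 'concept2')")


def render(element_id, depth=0):
    """Return nested markup for one element and all of its descendants."""
    gi, dtype, value, lang, target = conn.execute("""
        SELECT dc.gi_name, dc.type, tv.value, ul.langID, ln.targetID
        FROM elements AS e
        JOIN data_categories AS dc ON dc.categoryID = e.categoryID
        LEFT JOIN text_values AS tv ON tv.ID = e.ID
        LEFT JOIN elements_use_lang AS ul ON ul.ID = e.ID
        LEFT JOIN links AS ln ON ln.ID = e.ID
        WHERE e.ID = ?""", (element_id,)).fetchone()
    attrs = f" id={quoteattr(element_id)}"
    if dtype != "*":
        attrs += f" type={quoteattr(dtype)}"
    if lang:
        attrs += f" lang={quoteattr(lang)}"
    if target:
        attrs += f" target={quoteattr(target)}"
    # Children in insertion order; the root is skipped because it is its own parent.
    children = [row[0] for row in conn.execute(
        "SELECT ID FROM elements WHERE parentID = ? AND ID <> ? ORDER BY rowid",
        (element_id, element_id))]
    pad = "  " * depth
    if not children:
        return f"{pad}<{gi}{attrs}>{escape(value or '')}</{gi}>"
    lines = [f"{pad}<{gi}{attrs}>"]
    if value:
        lines.append(f"{pad}  {escape(value)}")
    lines.extend(render(child, depth + 1) for child in children)
    lines.append(f"{pad}</{gi}>")
    return "\n".join(lines)


markup = render("testdb")
```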

 

Our final step is to index the values of various elements, according to the instructions in the data category index types table. Indexed values are stored in the index values table.

 

INDEX VALUES table

ID | offset | value
definition1 | 1 | any
definition1 | 2 | method
definition1 | 3 | propulsion
definition1 | 4 | achieves
definition1 | 5 | travel
definition1 | 6 | speeds
definition1 | 7 | exceed
definition1 | 8 | speed
definition1 | 9 | light
definition1 | 10 | typically
definition1 | 11 | travel
definition1 | 12 | conceived
definition1 | 13 | non-linear
definition1 | 14 | meaning
definition1 | 15 | continuous
definition1 | 16 | traversal
definition1 | 17 | space-time
definition1 | 18 | occurs
context1 | 1 | hold
context1 | 2 | kid
context1 | 3 | gonna
context1 | 4 | kick
context1 | 5 | thing
context1 | 6 | hyperdrive
context1 | 7 | lets
context1 | 8 | see
context1 | 9 | star
context1 | 10 | cruiser
context1 | 11 | follow
context1 | 12 | us
context1 | 13 | then
term1 | 0 | hyperdrive
term1 | 1 | hyperdrive
term2 | 0 | sistema de propulsion mas rapido que la luz
term2 | 1 | sistema
term2 | 2 | propulsion
term2 | 3 | mas
term2 | 4 | rapido
term2 | 5 | luz
term3 | 0 | warp drive
term3 | 1 | warp
term3 | 2 | drive
note1 | 1 | particular
note1 | 2 | equivalent
note1 | 3 | likely
note1 | 4 | reliable
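The source does not spell out the indexing procedure, but the term rows above suggest roughly the following: values are lowercased and stripped of accents, the whole value is stored at offset 0, and chunks are numbered from 1 after noise words are removed. The Python sketch below implements that inferred reading; the function names and the punctuation rule are assumptions. It reproduces the rows shown for term2 and term3. It would not reproduce the definition1 and context1 rows exactly, since those omit some words (such as "that" and "at") that are not on the noise list printed earlier.

```python
import re
import unicodedata

# Noise lists copied from the langs table.
NOISE = {
    "en": {"the", "of", "is", "are", "it", "to", "be", "not", "and"},
    "es": {"de", "y", "o", "la", "el", "que", "los", "las"},
}


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop punctuation other than hyphens."""
    decomposed = unicodedata.normalize("NFD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s-]", "", unaccented.lower())
    return " ".join(cleaned.split())


def index_rows(element_id, text, lang, index_type):
    """Build (ID, offset, value) rows for one element value.

    index_type is one of "none", "whole", "chunk", or "both".
    """
    rows = []
    norm = normalize(text)
    if index_type in ("whole", "both"):
        rows.append((element_id, 0, norm))            # whole value at offset 0
    if index_type in ("chunk", "both"):
        noise = NOISE.get(lang, set())
        chunks = [word for word in norm.split() if word not in noise]
        rows.extend((element_id, i, word) for i, word in enumerate(chunks, start=1))
    return rows


term2_rows = index_rows(
    "term2", "sistema de propulsión más rápido que la luz", "es", "both")
term3_rows = index_rows("term3", "warp drive", "en", "both")
```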

 

At this point all of our data has been entered. Several Reltef™ tables have not been used (date values, code values, number values). This is because our tiny sample has no data category instances that have date, code, or number values. The structure of these tables and the method of data entry are similar to those of the text values table.

Molecular-level interface

Once the tables and relationships have been created in a Reltef™ database, the DBMS will automatically maintain what we have called the "atomic" or MetaREF™ level components (the most abstract layer of entities and relationships). For example, the database engine will prevent a text value from being associated with a non-existent language; it will only allow elements to be assigned to valid data categories; it will force picklist values to come from pre-defined picklists; when a particular element is deleted, its associated value, index value, and so forth will also be deleted. This provides one important layer of validation.
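Two of these atomic-level guarantees, rejecting a value that belongs to no element and deleting dependent rows along with their element, can be sketched with declarative foreign keys. The example uses SQLite through Python's sqlite3 module; the DDL is an assumption for illustration, not the MetaREF™ definition, and other engines express cascading deletes in their own ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE elements (
        ID TEXT PRIMARY KEY, categoryID TEXT NOT NULL, parentID TEXT NOT NULL);
    CREATE TABLE text_values (
        ID    TEXT PRIMARY KEY REFERENCES elements(ID) ON DELETE CASCADE,
        value TEXT NOT NULL);
    CREATE TABLE index_values (
        ID       TEXT NOT NULL REFERENCES elements(ID) ON DELETE CASCADE,
        "offset" INTEGER NOT NULL,
        value    TEXT NOT NULL,
        PRIMARY KEY (ID, "offset"));
""")
conn.execute("INSERT INTO elements VALUES ('term3', '1.1.1', 'concept2')")
conn.execute("INSERT INTO text_values VALUES ('term3', 'warp drive')")
conn.executemany("INSERT INTO index_values VALUES (?, ?, ?)", [
    ("term3", 0, "warp drive"), ("term3", 1, "warp"), ("term3", 2, "drive"),
])
conn.commit()

# A text value that points at no element is refused by the engine.
try:
    conn.execute("INSERT INTO text_values VALUES ('no-such-element', 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
conn.rollback()

# Deleting the element removes its text value and index values as well.
conn.execute("DELETE FROM elements WHERE ID = 'term3'")
conn.commit()
remaining = (
    conn.execute("SELECT COUNT(*) FROM text_values").fetchone()[0],
    conn.execute("SELECT COUNT(*) FROM index_values").fetchone()[0],
)
```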

However, the atomic level is highly abstract, and does not parallel the typical needs of an end user who knows about terms and definitions but not elements and data categories. The typical end user will need to answer queries such as "find all French software engineering terms that are only appropriate in Canada" or "list Spanish equivalents for English terms containing 'polymorphic'". And the database needs to perform validation at the molecular level as well, to implement rules like "all English terms must have a contextual example" or "each definition must have a bibliographic reference". Both of these kinds of operations are possible by running SQL queries against the Reltef™ database:

 

To find: all concepts
Use SQL:
    SELECT DISTINCTROW elements.ID
    FROM elements
    WHERE (((elements.categoryID)="0.3"));

To find: all contexts
Use SQL:
    SELECT DISTINCTROW elements.ID, [text values].value, [elements use lang].langID, elements.parentID
    FROM elements INNER JOIN ([text values] INNER JOIN [elements use lang] ON [text values].ID = [elements use lang].ID) ON (elements.ID = [text values].ID) AND (elements.ID = [elements use lang].ID)
    WHERE (((elements.categoryID)="5.3"));

To find: all terms
Use SQL:
    SELECT DISTINCTROW elements.ID, [text values].value, [elements use lang].langID, elements.parentID
    FROM elements INNER JOIN ([text values] INNER JOIN [elements use lang] ON [text values].ID = [elements use lang].ID) ON (elements.ID = [text values].ID) AND (elements.ID = [elements use lang].ID)
    WHERE (((elements.categoryID)="1.1.1"));

To find: all terms matching pattern __
Use SQL:
    SELECT DISTINCTROW terms.ID, terms.value, terms.langID, terms.parentID
    FROM terms INNER JOIN [index values] ON terms.ID = [index values].ID
    WHERE ((([index values].value) Like [search pattern:]) AND (([index values].offset)=0));

To find: all equivalents for term __ in lang __
Use SQL:
    SELECT DISTINCTROW terms_1.value, terms_1.langID, terms_1.parentID
    FROM [terms matching pattern __] INNER JOIN (terms AS terms_1 INNER JOIN concepts ON terms_1.parentID = concepts.ID) ON [terms matching pattern __].parentID = concepts.ID
    WHERE (((terms_1.langID)=[target language code:]));

Note: the last two queries refer to terms, concepts, and [terms matching pattern __]; these appear to be the earlier queries, saved under those names.
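The queries above are written in the Microsoft Access dialect (DISTINCTROW, bracketed names, prompt parameters). As a hedged sketch of how the same molecular views might look elsewhere, the following ports the "all concepts", "all terms", and "all equivalents" queries to SQLite through Python's sqlite3 module. The underscore table names, the views, and the function equivalents are assumptions for this example; note also that SQLite's LIKE uses % as its wildcard where Access's default syntax uses *.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE text_values (ID TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE elements_use_lang (ID TEXT PRIMARY KEY, langID TEXT);
    CREATE TABLE index_values (ID TEXT, "offset" INTEGER, value TEXT);

    -- Views standing in for the saved queries "all concepts" and "all terms".
    CREATE VIEW concepts AS
        SELECT ID FROM elements WHERE categoryID = '0.3';
    CREATE VIEW terms AS
        SELECT e.ID, tv.value, ul.langID, e.parentID
        FROM elements AS e
        JOIN text_values AS tv ON tv.ID = e.ID
        JOIN elements_use_lang AS ul ON ul.ID = e.ID
        WHERE e.categoryID = '1.1.1';
""")
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("concept1", "0.3", "body1"), ("concept2", "0.3", "body1"),
    ("term1", "1.1.1", "concept1"), ("term2", "1.1.1", "concept1"),
    ("term3", "1.1.1", "concept2"),
])
conn.executemany("INSERT INTO text_values VALUES (?, ?)", [
    ("term1", "hyperdrive"),
    ("term2", "sistema de propulsión más rápido que la luz"),
    ("term3", "warp drive"),
])
conn.executemany("INSERT INTO elements_use_lang VALUES (?, ?)", [
    ("term1", "en"), ("term2", "es"), ("term3", "en"),
])
conn.executemany("INSERT INTO index_values VALUES (?, ?, ?)", [
    ("term1", 0, "hyperdrive"),
    ("term2", 0, "sistema de propulsion mas rapido que la luz"),
    ("term3", 0, "warp drive"),
])


def equivalents(pattern, target_lang):
    """Terms in target_lang that share a concept with a term matching pattern."""
    return conn.execute("""
        SELECT DISTINCT t2.value, t2.langID, t2.parentID
        FROM terms AS t1
        JOIN index_values AS iv ON iv.ID = t1.ID AND iv."offset" = 0
        JOIN concepts AS c ON c.ID = t1.parentID
        JOIN terms AS t2 ON t2.parentID = c.ID
        WHERE iv.value LIKE ? AND t2.langID = ?
        ORDER BY t2.value
    """, (pattern, target_lang)).fetchall()


spanish = equivalents("hyper%", "es")
none_found = equivalents("warp%", "es")
```

As in the original query, a search whose target language equals the source language returns the matching term itself along with any synonyms.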

 

Of course, many more queries could be written. The queries make molecular-level constructs accessible to the end-user, and also provide a mechanism for molecular-level validation of the database.
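As one example of molecular-level validation, the rule "all English terms must have a contextual example" can be phrased as a query that lists the offending terms. This is a sketch in SQLite through Python's sqlite3 module, using rows from the sample database; the underscore table name elements_use_lang is an assumption of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE elements_use_lang (ID TEXT PRIMARY KEY, langID TEXT);
""")
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("term1", "1.1.1", "concept1"), ("term2", "1.1.1", "concept1"),
    ("term3", "1.1.1", "concept2"), ("context1", "5.3", "term1"),
])
conn.executemany("INSERT INTO elements_use_lang VALUES (?, ?)", [
    ("term1", "en"), ("term2", "es"), ("term3", "en"), ("context1", "en"),
])

# English terms (category 1.1.1) that have no child in category 5.3 (context).
ENGLISH_TERMS_WITHOUT_CONTEXT = """
    SELECT t.ID
    FROM elements AS t
    JOIN elements_use_lang AS ul ON ul.ID = t.ID
    WHERE t.categoryID = '1.1.1'
      AND ul.langID = 'en'
      AND NOT EXISTS (
          SELECT 1 FROM elements AS c
          WHERE c.parentID = t.ID AND c.categoryID = '5.3')
    ORDER BY t.ID
"""
missing_context = [row[0] for row in conn.execute(ENGLISH_TERMS_WITHOUT_CONTEXT)]
```

An empty result means the rule holds; with the sample rows, term3 is reported because it has no context child.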

 

Correspondence between tables and entities/relationships

Table name | Entity (relationship) name
charsets | charset
code values | code value; also relationship "element has code value"
data categories | data category
data category index types | data category index type; also relationship "data category takes index type"
data category names | data category name; also relationships "data category name uses lang" and "data category has lang-specific name"
date values | date value; also relationship "element has date value"
elements | element; also relationship "element A is parent of element B"
index values | index value; also relationship "element value is indexed"
langs | lang; also relationship "lang uses charset"
links | associative relationship "link"
number values | number value; also relationship "element has number value"
picklist values | picklist value; also relationship "element has picklist value"
picklists | picklist; also relationships "data category uses picklist" and "picklist value derives from picklist"
text values | text value; also relationship "element has text value"
elements use lang | relationship "element uses lang"
valid families | relationship "data category A may be parent of B"

 

 
