TMX Format

LISA SIG

Implementation Notes

This document is a collection of notes to implement support for the Translation Memory eXchange format. See the TMX Specifications document for additional information.


Recommended values for attributes

This section enumerates the recommended values for some of the attributes.

Recommended values for the datatype attribute:

   "unknown"   = undefined (default)          "mif"       = FrameMaker MIF
   "rtf"       = Rich Text Format             "xml"       = XML-based format
   "transit"   = Transit                      "sgml"      = SGML-based format
   "opentag"   = OpenTag                      "xptag"     = Quark XPressTag
   "stf-f"     = S-Tagger for FrameMaker      "pm"        = PageMaker
   "stf-i"     = S-Tagger for Interleaf       "html"      = HTML
   "ipf"       = IPF/BookMaster               "hptag"     = HP-tag
   "plaintext" = Plain Text                   "interleaf" = Interleaf
   "alptext"   = WinJoust                     "winres"    = Windows Resources
   "cstyle"    = C/C++ style text             "java"      = Java style text
   "pascal"    = Pascal style text            "cmx"       = Corel CMX format

Recommended values for the type attribute of <bpt> and <it>:

   "bold"      = Bold                         "italic"    = Italic
   "ulined"    = Underline                    "scap"      = Small caps
   "dulined"   = Double-underlined            "font"      = Font change
   "color"     = Color change                 "link"      = Linked text
   "struct"    = XML/SGML structure

Recommended values for the type attribute of <ph>:

   "index"     = Index marker                 "date"      = Date
   "time"      = Time                         "fnote"     = Footnote
   "enote"     = End-note                     "alt"       = Alternate text
   "image"     = Image                        "pb"        = Page break
   "lb"        = Line break                   "cb"        = Column break
   "inset"     = Inset                        

TMX prologue

A TMX document usually starts with the XML declaration, followed by the DOCTYPE declaration, followed by the actual TMX document. XML comments (<!-- -->) can be inserted anywhere XML allows them.

The XML declaration and the DOCTYPE declaration are not mandatory but recommended. The DOCTYPE statement is useful when the document is validated against the DTD.

<?xml version="1.0" ?>
<!DOCTYPE tmx SYSTEM "tmx12.dtd">
<tmx version="1.2">
....

Using the <prop> element

The <prop> element allows the tools to insert non-standard information in a TMX document. The <prop> element is defined by its type attribute and its content.

As your tool is fully responsible for handling the content of a <prop> element, you can use it in any way you wish. For example, the content can be a list of instructions your tool can parse, not just plain text.

<prop type="user-defined">name="domain" value="Computer science"</prop>
<prop type="x-domain">Computer science</prop>

Each tool provider is responsible for providing the types and values of the properties it uses.
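As an illustration of how a tool might read its own properties back, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The property types "x-domain" and "x-project", the sample content, and the helper name read_props are made up for this example; they are not defined by TMX.

```python
import xml.etree.ElementTree as ET

# Hypothetical <tu> fragment: the "x-domain" and "x-project" types are
# invented for this sketch and are not part of TMX.
TU_XML = """
<tu>
 <prop type="x-domain">Computer science</prop>
 <prop type="x-project">Manual 2002</prop>
 <tuv xml:lang="en"><seg>Hello</seg></tuv>
</tu>
"""


def read_props(tu):
    """Return a {type: content} dict for the <prop> children of a <tu>."""
    return {p.get("type"): (p.text or "") for p in tu.findall("prop")}


tu = ET.fromstring(TU_XML)
props = read_props(tu)
```

A tool that stores structured instructions inside a <prop> would parse the returned string further; that parsing is entirely tool-specific.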


Using the segtype attribute

The segtype attribute can have several values: "block", "paragraph", "sentence" and "phrase".

The "block" value is used when the segment does not correspond to one of the other values, for example when you want to store a chapter composed of several paragraphs in a single <tu>.

<tu segtype="block">
 <prop type="sentbreak">$#$</prop>
 <tuv xml:lang="en"><seg>This is the first paragraph of a big section.$#$
This is the second paragraph.$#$This is the third.</seg></tuv>
</tu>

Here the property "sentbreak" defines the token used to indicate the separation between sentences within the block of text. You can therefore easily break down the segment into smaller units if needed. You can imagine many other ways to use this mechanism.

Sentence-level segmentation gives a TMX file maximum portability, so it is recommended that you use it rather than a specific, proprietary method like the one above.
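For a tool that does receive such a block, the breakdown into smaller units could look like the following Python sketch. It assumes the proprietary "sentbreak" convention from the example above and a segment without inline elements; the function name split_block is hypothetical.

```python
import xml.etree.ElementTree as ET

TU_XML = """<tu segtype="block">
 <prop type="sentbreak">$#$</prop>
 <tuv xml:lang="en"><seg>This is the first paragraph of a big section.$#$
This is the second paragraph.$#$This is the third.</seg></tuv>
</tu>"""


def split_block(tu):
    """Split the first <seg> of a block <tu> on its "sentbreak" token.

    Assumes the segment holds plain text only (no <bpt>, <ph>, ... children).
    """
    token = tu.findtext("prop[@type='sentbreak']")
    seg_text = tu.findtext("tuv/seg") or ""
    if not token:
        # No token declared: the block cannot be broken down further.
        return [seg_text]
    return [part.strip() for part in seg_text.split(token) if part.strip()]


units = split_block(ET.fromstring(TU_XML))
```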


Grouping <tu> elements

If you want to indicate that several <tu> elements belong to a logical group, you can specify a <prop> element in each <tu> that makes up the group.

<tu>
 <prop type="group">1</prop>
 <tuv xml:lang="en"><seg>First segment</seg></tuv>
 <tuv xml:lang="fr"><seg>Premier segment</seg></tuv>
</tu>
<tu>
 <prop type="group">1</prop>
 <tuv xml:lang="en"><seg>Second segment</seg></tuv>
 <tuv xml:lang="fr"><seg>Second segment</seg></tuv>
</tu>

TMX does not implement the notion of order. If the order of the <tu> elements is relevant, you may want to use the tuid attribute or a <prop> element to reflect it. See the following example.

<tu>
 <!-- Group 1, first item -->
 <prop type="group">1-1</prop>
 <tuv xml:lang="en"><seg>First segment</seg></tuv>
 <tuv xml:lang="fr"><seg>Premier segment</seg></tuv>
</tu>
<tu>
 <!-- Group 1, second item -->
 <prop type="group">1-2</prop>
 <tuv xml:lang="en"><seg>Second segment</seg></tuv>
 <tuv xml:lang="fr"><seg>Second segment</seg></tuv>
</tu>
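A reader of such a file could rebuild the groups and their order along the lines of this Python sketch. It assumes the "group" property and the "group-position" value layout used in the example above, with a numeric position; both are conventions of this example rather than TMX rules, and group_units is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The <tu> elements are deliberately stored out of order.
TMX_BODY = """<body>
<tu>
 <prop type="group">1-2</prop>
 <tuv xml:lang="en"><seg>Second segment</seg></tuv>
</tu>
<tu>
 <prop type="group">1-1</prop>
 <tuv xml:lang="en"><seg>First segment</seg></tuv>
</tu>
</body>"""


def group_units(body):
    """Map each group id to its <tu> elements, sorted by position."""
    groups = defaultdict(list)
    for tu in body.findall("tu"):
        value = tu.findtext("prop[@type='group']")
        if value is None:
            continue  # this <tu> does not belong to any group
        group_id, _, position = value.partition("-")
        # A value without a position part ("1") sorts first, as position 0.
        groups[group_id].append((int(position or 0), tu))
    return {
        gid: [tu for _, tu in sorted(items, key=lambda item: item[0])]
        for gid, items in groups.items()
    }


ordered = group_units(ET.fromstring(TMX_BODY))
texts = [tu.findtext("tuv/seg") for tu in ordered["1"]]
```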

Embedded text

In some cases a segment can contain a sub-flow of text that is not directly part of the main text. For example, sometimes footnote or index definition text is contained within the footnote or index marker.

The <sub> element allows you to delimit these chunks of embedded text. Here are a few examples.

- Footnote in RTF
Original RTF:
Elephants{\cs16\super \chftn {\footnote \pard\plain
\s15\widctlpar \f4\fs20
{\cs16\super \chftn } An elephant is a very
large animal.}} are big.

TMX with content mark-up:
Elephants<ph type="fnote">{\cs16\super \chftn {\footnote \pard\plain
\s15\widctlpar \f4\fs20
{\cs16\super \chftn } <sub>An elephant is a very
large animal.</sub>}}</ph> are big.

- Index marker in RTF
Original RTF:
Elephants{\pard\plain \widctlpar
\v\f4\fs20 {\xe {Big animal\bxe }}} are big.

TMX with content mark-up:
Elephants<ph type="index">{\pard\plain \widctlpar
\v\f4\fs20 {\xe {<sub>Big animal</sub>\bxe }}}</ph> are big.

- Text of an attribute in an HTML element
Original HTML:
See the <A TITLE="Go to Notes"
HREF="notes.htm">Notes</A> for more details.

TMX with content mark-up:
See the <bpt i="1" type="link">&lt;A TITLE="<sub>Go to Notes</sub>"
HREF="notes.htm"></bpt>Notes<ept i="1">&lt;/A></ept> for more details.

Note that many systems may have the text originally embedded as separate <tu> elements.
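To show how the main text and the embedded text can be told apart when reading a segment, here is a minimal Python sketch based on the HTML example above. The helper names main_text and sub_texts are hypothetical, and the sketch assumes inline elements appear only as direct children of <seg>.

```python
import xml.etree.ElementTree as ET

SEG_XML = """<seg>See the <bpt i="1" type="link">&lt;A TITLE="<sub>Go to Notes</sub>" HREF="notes.htm"></bpt>Notes<ept i="1">&lt;/A></ept> for more details.</seg>"""


def main_text(seg):
    """Return the text outside all inline elements of a <seg>."""
    parts = [seg.text or ""]
    for child in seg:
        # The tail is the text that follows the child's closing tag.
        parts.append(child.tail or "")
    return "".join(parts)


def sub_texts(seg):
    """Return the content of every <sub> element, at any depth."""
    return ["".join(sub.itertext()) for sub in seg.iter("sub")]


seg = ET.fromstring(SEG_XML)
main = main_text(seg)
embedded = sub_texts(seg)
```

A tool that stores embedded text as separate <tu> elements could use the second list as its source.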

Overlapping codes

To delimit paired sequences of native code, TMX uses the <bpt> and <ept> elements. Within a segment, every <bpt> element has a corresponding <ept> element. However, if the segment contains several groups of codes, the first <ept> after a given <bpt> may not be the corresponding one.

TMX (with correct content mark-up):
<seg><bpt i="1">&lt;B></bpt>Bold,
<bpt i="2">&lt;I></bpt>Bold+Italic<ept i="1">&lt;/B></ept>,
Italic<ept i="2">&lt;/I></ept></seg>

The steps to process the segment correctly are:

  1. If no attribute i is specified in a <bpt>, the next <ept> is its corresponding closing element.
  2. If the <bpt> element has an i attribute value specified, the TMX application must locate the <ept> element with the same i attribute value, and exclude this pair of elements from the default processing.
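The two steps above can be sketched in Python as follows. This is one possible reading, not a reference implementation: the function name pair_codes is hypothetical, and for step 1 the sketch assumes that an <ept> without an i attribute closes the most recently opened <bpt> that also lacks one.

```python
import xml.etree.ElementTree as ET

SEG_XML = (
    '<seg><bpt i="1">&lt;B></bpt>Bold, '
    '<bpt i="2">&lt;I></bpt>Bold+Italic<ept i="1">&lt;/B></ept>, '
    'Italic<ept i="2">&lt;/I></ept></seg>'
)


def pair_codes(seg):
    """Return (bpt, ept) element pairs for the direct children of a <seg>."""
    pairs = []
    open_by_i = {}   # <bpt> elements waiting for an <ept> with the same i
    open_plain = []  # <bpt> elements without i, closed by the next plain <ept>
    for el in seg:
        if el.tag == "bpt":
            i = el.get("i")
            if i is None:
                open_plain.append(el)
            else:
                open_by_i[i] = el
        elif el.tag == "ept":
            i = el.get("i")
            if i is not None and i in open_by_i:
                # Step 2: match on the i value, whatever lies in between.
                pairs.append((open_by_i.pop(i), el))
            elif i is None and open_plain:
                # Step 1: default processing for codes without i.
                pairs.append((open_plain.pop(), el))
    return pairs


pairs = pair_codes(ET.fromstring(SEG_XML))
codes = [(bpt.text, ept.text) for bpt, ept in pairs]
```

For the overlapping segment above, the bold pair is matched first even though the italic <bpt> sits between its two halves.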

Inline codes handling

TMX offers several ways to handle inline codes (non-text parts of a segment, such as formatting codes, like <B> in HTML). There are five TMX elements for encapsulating inline codes: <bpt>, <ept>, <it>, <ph>, and <ut>.

The guidelines for using these elements are as follows:

1. Use <bpt> for opening each code that has a corresponding closing code in the segment.

2. Use <ept> for closing each code that has a corresponding opening code in the segment.

For example:

<seg>The <bpt i="1">&lt;i></bpt><bpt i="2">&lt;b></bpt>big<ept i="2">&lt;/b></ept> 
black<ept i="1">&lt;/i></ept> cat.</seg>

3. Use <it> for opening or closing each code that has no corresponding closing or opening code in the segment.

In some cases, because of the segmentation, you may have opening and closing codes that have no corresponding closing or opening codes within the same segment. Use <it> to encapsulate those codes. <it> has a mandatory attribute pos that should be set to "begin" or "end" depending on whether the isolated code is an opening or a closing code.

4. Use <ph> for standalone codes.

Use <ph> to encapsulate codes that are not opening/closing, for example empty elements in XML:

<seg>The icon <ph>&lt;img src="testNode.gif"/></ph> represents a conditional node.</seg>

5. Use <ut> if the tool cannot apply any of the four preceding rules.

Use <ut> only when none of the previous tagging rules can be applied.
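The five rules can be read as a small decision procedure. The Python sketch below assumes the tool has already classified each native code of a segment as "open", "close" or "standalone" and given paired codes a common name; that input format and the function name choose_elements are assumptions of this example, not part of TMX.

```python
def choose_elements(codes):
    """Pick a TMX element for each native code of one segment.

    codes is a list of (kind, name) tuples, where kind is "open", "close"
    or "standalone". Returns one label per code, in the same order.
    """
    result = [None] * len(codes)
    stack = []  # indexes of opening codes that are not closed yet
    for idx, (kind, name) in enumerate(codes):
        if kind == "standalone":
            result[idx] = "ph"                      # rule 4
        elif kind == "open":
            stack.append(idx)                       # decided once we see a close
        elif kind == "close":
            # Look for the most recent unmatched opening code with this name.
            for pos in range(len(stack) - 1, -1, -1):
                if codes[stack[pos]][1] == name:
                    result[stack.pop(pos)] = "bpt"  # rule 1
                    result[idx] = "ept"             # rule 2
                    break
            else:
                result[idx] = 'it pos="end"'        # rule 3, isolated close
        else:
            result[idx] = "ut"                      # rule 5, nothing else fits
    for idx in stack:
        result[idx] = 'it pos="begin"'              # rule 3, isolated open
    return result


labels = choose_elements([
    ("close", "b"),         # closes a code opened in a previous segment
    ("open", "i"),
    ("standalone", "img"),
    ("close", "i"),
    ("open", "u"),          # closed in a later segment
])
```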


Last update of this document: Jan-03-2002