Preliminary Stage
In the Preliminary Stage, we prepare for the Error-Annotation stage by producing the following resources:
- The specifications of the translation, in the format of an STS XML file
- The metric selected for evaluation, compatible with the specifications, in the format of a metric XML file, including:
  - Error types and severity levels
  - A scoring model
  - A cutscore
- A bitext of source text and target text, aligned and segmented. This may be the whole document, or a sample thereof.
B1.1 Formalizing Specifications: Making an STS File
In order to ensure a transparent TQE process, the first step is to review, and ensure access to, the translation specifications that were originally negotiated between the requester of the translation and the project manager for the undertaking entity. For ease of data transfer, this is done by means of a Structured Translation Specifications (STS) file, which is an XML file with data fields corresponding to the structured parameters in ASTM F2575. Once the appropriate specifications are filled in, they may be viewed on a variety of different platforms, including the TRG Annotation Tool that we will use in Section B2.
If the original specifications of the project are unavailable or were never created, it is still possible to reconstruct the specifications by making educated guesses about the situation under which the translation was commissioned. Even such hypothetical specifications may still be important and useful in a TQE.
Formalizing recorded or reconstructed negotiations is a nontrivial task that should be conducted by a professional translation project manager, just as error annotation should be conducted by a professional translator. It requires intimate knowledge of the translation process and of stakeholder expectations.
A tool that provides a visual interface for producing an STS file is available here.
B1.2 Selecting a TQE Metric: Making a Metric File
The metric file (to be described hereunder) is an XML file that contains information essential to the Error-Annotation Stage (see Section B2). This information itself is known as the TQE metric. It is embedded here as an XML file for ease of data interchange.
It is the responsibility of the translation project manager to produce a metric that is tailored to the specifications of the translation project that they are managing. Like specifications, a metric must be reconstructed if one was never created or made available. Also as with specifications, it should be a professional project manager who creates, selects, or verifies the metric to be used in a TQE.
A key assumption in the MQM framework is that the design of any measure of goodness must be justified in relation to the subject whose goodness is to be measured (see the article What is quality? under Section A1). In other words: Because the translation project specifications vary by use case, there is no single, universal TQE metric.
For example, an error may be considered critical in the context of one translation project, but minor or neutral in another. That being said, metrics may be reused if two translation projects share the same specifications (such as two translations in a series that differ only in their due dates).
A metric is composed of three components, as stated in ASTM WK46396 New Practice for Analytic Evaluation of Translation Quality §4.5:
- an appropriate selection of error types,
- a scoring model, and
- a mechanism to determine whether the evaluated translation has passed or failed.
The MQM implementation of each of these components is described in the subsections below (B1.2.1–B1.2.3), followed by a description of a webapp that can be used to generate a metric XML file (B1.2.4). The way in which each component is embedded in XML can be found in the related sections on the tools page.
B1.2.1 An Appropriate Selection of Error Types
Instead of a universal metric, the MQM framework provides a universal typology of errors that may occur in translation (https://themqm.org/error-types-2/typology/). Each MQM TQE metric pulls its error selection from this typology. These errors are then used to annotate segments of a translation. Having every MQM metric be a subset of the MQM error typology ensures transparency and minimizes data loss between TQE systems, as well as between evaluators, translators, and clients. In this example, we use the MQM Full error typology, rather than the MQM Core error typology; MQM Core is a subset of MQM Full.
In addition to the error types themselves, a metric indicates the weight of each error type selected. This is the number by which the error counts of this type are multiplied during scoring. By default, this number is one.
Furthermore, a metric indicates the multipliers for each severity level. MQM provides four error severity levels: neutral, minor, major, and severe. The severity level penalty point multiplier is the number by which the error counts of each severity level are multiplied during scoring. By default, these numbers are 0, 1, 5, and 25, respectively.
After annotation is complete, each error count is multiplied by the relevant error type weight and severity level multiplier, and the products are summed to give an Error Type Penalty Total (ETPT) for each error type. More on this will be given in Section B3.
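As a minimal illustration of this arithmetic, the sketch below computes the ETPT for a single error type using the default weight of 1 and the default severity multipliers; the error counts are purely hypothetical.

```python
# Hypothetical example: Error Type Penalty Total (ETPT) for one error type,
# using the default severity level multipliers and a default weight of 1.
severity_multipliers = {"neutral": 0, "minor": 1, "major": 5, "severe": 25}

error_type_weight = 1  # default weight for this error type

# Hypothetical counts of errors of this type found during annotation:
error_counts = {"neutral": 2, "minor": 3, "major": 1, "severe": 0}

etpt = sum(
    count * severity_multipliers[severity] * error_type_weight
    for severity, count in error_counts.items()
)
print(etpt)  # (2*0 + 3*1 + 1*5 + 0*25) * 1 = 8
```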
B1.2.2 A Scoring Model
While error annotation is a crucial part of the TQE process—and also the most time- and effort-intensive—it only produces raw annotation data, not a quality score. The method to calculate the quality rating from the raw annotation data is called a scoring model.
A scoring model uses a series of formulas to convert error information (how many errors occurred, of what kind and where, how severe they were, etc.) into a single number representing a judgement of quality. In MQM, this is a conversion of the ETPT into the Quality Score (QS).
There are multiple types of scoring models that may be used, varying in reliability according to the size of the translation document and the sample being evaluated. A reliable scoring model more accurately predicts a human expert’s quality judgements. It also provides quality scores that are easily interpretable as a single-number representation of quality.
In this tutorial, we will focus on two types of linear scoring model: with and without calibration. The linear scoring model without calibration, also called the “raw” scoring model, is well known in the industry. The calibrated linear scoring model uses a quality scale that is different from the raw model’s error count scale. This is done for ease of human use and application. Both models are described in the paper The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control (13 authors).
The TQE Calculator that will be used in Section B3 automatically applies the scoring model, which can be specified as raw (non-calibrated) or calibrated in the metric. Certain calculation values are also part of the scoring model and must be included in the metric:
- The Reference Word Count (RWC) is an arbitrary number of words in a hypothetical reference evaluation text. Implementers use this uniform word count to compare results across different projects. The RWC is often set at 1000.
- The Maximum Score Value (MSV) of 100 is the maximum possible QS that a translation may obtain—the score that the QS is “out of.” In a calibrated scoring model, it is also an arbitrary value designed to manipulate the QS in order to shift its value into a range which is easier to understand. It converts the score to a percentage-like value.
- For a calibrated scoring model, the number of Acceptable Penalty Points (APP) is the number of penalty points that stakeholders would deem as still acceptable for the Reference Word Count. So, this is usually the maximum acceptable number of penalty points per 1000 words.
These values are used to calculate intermediate values before a QS is actually obtained.
For all linear scoring models, this includes the Per-Word Penalty Total (PWPT), which is determined by dividing the Absolute Penalty Total (APT, the sum of the ETPTs over all error types) by the Evaluation Word Count (EWC; see Section B1.3).
A Raw QS (RQS) is obtained by simply subtracting the PWPT from 1:
RQS = 1 - PWPT
The RQS can then be manipulated so that it may be displayed as a fraction of the MSV (or as a percentage, if the MSV is 100):
QS (displayed) = RQS × MSV
Thus, given an MSV of 100, a RQS of 0.74 could be displayed as 74%, or as 74 out of 100. This allows stakeholders to set their own scale and have quality scores be out of any number.
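To make the raw calculation concrete, the sketch below works through these steps with hypothetical numbers chosen to reproduce the 0.74 example above (an APT of 260 penalty points over a 1000-word sample); the variable names simply mirror the abbreviations used in this section.

```python
# Hypothetical raw (non-calibrated) linear scoring example.
APT = 260    # Absolute Penalty Total: sum of the ETPTs over all error types (hypothetical)
EWC = 1000   # Evaluation Word Count of the evaluated sample (hypothetical)
MSV = 100    # Maximum Score Value

PWPT = APT / EWC       # Per-Word Penalty Total: 0.26
RQS = 1 - PWPT         # Raw Quality Score: 0.74
displayed = RQS * MSV  # 74.0, i.e. "74 out of 100"
print(PWPT, RQS, displayed)
```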
The raw scores calculated using non-calibrated scoring models can be problematic because they are not human-readable. A high-quality translation will always have a RQS very close to 1, and the small difference between 0.981 and 0.983 may actually have meaningful quality implications that are not apparent. Calibrated linear scoring models approach this problem by defining a scale and projecting that scale onto the range of possible quality scores, creating a "window" between the PT and the MSV. QSs are scaled to fit into this window.
This makes quality scores more human-readable, as there will be more obvious differences between each score and the PT. It also allows stakeholders to bring together different content types and situations under a single, standardized quality measurement scale, even if each use case has a different PT.
For calibrated linear scoring models, the extra values supplied to calibrate the model (including the RWC, MSV, and APP) are used to calculate intermediate values, finally arriving at a Calibrated QS (CQS):
- The Defined Passing Interval (DPI) is the interval between the MSV (often 100) and the PT defined in B1.2.3, that is, DPI = MSV - PT.
- The Normed Penalty Total (NPT) represents the PWPT relative to the RWC. Typically, 1000 is used as the RWC; therefore the NPT is sometimes referred to as the Error Penalty Total per Thousand Words. It is obtained by multiplying the PWPT by the RWC (NPT = PWPT × RWC).
- The Scaling Factor (SF) is the parameter that scales the NPT into the DPI so that it stands in a meaningful relationship to the PT.
The CQS is obtained by using the SF to scale the NPT into the "window" represented by the DPI, and then subtracting that number, which represents a calibrated number of penalty points, from the MSV.
Or, more formally:
CQS = MSV - (SF × NPT)
These definitions are taken from sections 2.1 and 2.3 of the MQM website's description of scoring models.
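Putting these definitions together, the sketch below computes a CQS from purely hypothetical values. The scaling factor is computed here as the DPI divided by the APP, which follows from the definitions above (a sample whose NPT equals the APP then scores exactly at the PT); consult the MQM scoring-model description for the authoritative formulation.

```python
# Hypothetical calibrated linear scoring example.
MSV = 100   # Maximum Score Value
PT = 85     # Passing Threshold (cutscore), hypothetical
RWC = 1000  # Reference Word Count
APP = 10    # Acceptable Penalty Points per RWC, hypothetical

APT = 8     # Absolute Penalty Total from annotation (hypothetical)
EWC = 1000  # Evaluation Word Count of the sample (hypothetical)

PWPT = APT / EWC      # Per-Word Penalty Total: 0.008
NPT = PWPT * RWC      # Normed Penalty Total: 8.0
DPI = MSV - PT        # Defined Passing Interval: 15
SF = DPI / APP        # Scaling Factor: 1.5 (a sample with NPT == APP scores exactly at the PT)
CQS = MSV - SF * NPT  # Calibrated Quality Score: 100 - 1.5 * 8 = 88.0
print(CQS)
```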
A linear scoring model thus calibrated assumes a constant (linear) rate of quality degradation (represented by the penalty points assigned) as more errors appear in the text. That is, a 500-word text with 5 minor errors is penalized as heavily as a 10,000-word text with 100 minor errors. Unfortunately, the non-linearity of human perception shows itself across a wide range of sample sizes: human readers tolerate fewer errors per unit of text as the text grows, due to cognitive effects such as priming.
For simplicity, a linear scoring model is usually applied, because any non-linear function can be approximated by a linear one over a small interval. Therefore, if the scoring model is set up and verified for a certain sample size, it will produce a reasonably reliable QS for nearby sample sizes. However, in order to produce scores that accurately reflect human perception across a wide range of sample sizes, it may be necessary to use a non-linear scoring model. More information about non-linear scoring models is forthcoming. Sampling and sample sizes are discussed in further detail in Section B1.3.
B1.2.3 A Mechanism to Determine Whether the Evaluated Translation Has Passed or Failed
Because numeric scores can be open to interpretation, a TQE metric must specify a method of converting the QS output by the scoring model into a quality rating, i.e. a binary verdict on whether the translation is acceptable or not. The most straightforward way, and the way implemented in this tutorial, to determine whether a translation has passed or failed a TQE is with a cutscore, also known as a Passing Threshold (PT). If the QS is above the cutscore, the translation passes. Otherwise, it fails.
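Continuing the hypothetical numbers from the sketches above, the verdict itself is a single comparison (here a score exactly at the cutscore is treated as passing; a project may adopt the opposite convention):

```python
# Hypothetical pass/fail verdict using the cutscore (Passing Threshold).
QS = 88.0  # quality score produced by the scoring model (hypothetical)
PT = 85    # cutscore agreed for this project (hypothetical)

verdict = "pass" if QS >= PT else "fail"
print(verdict)  # pass
```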
While a cutscore takes the unassuming form of a simple number, there is actually wide discussion in the quality management community about how best to select one. A good cutscore reflects real-world human perception, and a metric is validated by how well it predicts what experts will say about a translation. However, the process of setting a cutscore in a real-life project is beyond the scope of this tutorial.
B1.2.4 Generating a Metric XML File
This webapp is a simple tool that allows a user to upload a typology and then select which error types they wish to include in their metric, along with the weights for each error type. In the "Scoring Information" box, the user inputs the severity level multipliers, all scoring model information, and the cutscore.
This tool provides the user with the XML version of the MQM Full error typology. If a user is conducting a TQE with another error typology, they may upload their own, but it must conform to the same XML structure as the provided mqm_typology_full.xml file.
If you have more questions about the MQM framework used for selecting metrics and defining error types, you can read about it here: https://themqm.org/.
With the selected metric's XML file created, the project manager delivers it to the evaluator, who then proceeds with the next step in the TQE: preparing the text.
B1.3 Preparing the Document for Annotation: Making a New Bitext File
Once the translation specifications and the TQE metric are finalized, the last step before the Error-Annotation stage is the preparation of the source text and the target text, or sample thereof, for evaluation.
If it were possible to annotate the entire document for translation errors, then we could be confident that the translation quality evaluation would be as reliable as the TQE process itself, which has its own inherent reliability.
The reliability of any particular evaluation hangs upon the qualifications of the evaluator, who needs to be a highly proficient and experienced linguist.
Such highly qualified resources are scarce and expensive, and their availability is limited. This is the primary reason why TQE is typically performed on samples of limited size, rather than on an entire large document, or on all small documents in the case of a series of self-sufficient texts within a project. (Not to mention that TQE is usually done for someone else; if the goal is to fix the errors, the linguist can simply edit the text without annotating it.)
It is therefore common practice to evaluate only a sample and extrapolate the score to estimate the quality of the full document. In this sense, TQE evaluation works with a sample of a certain size taken from a larger document or content stream.
Both sample size and document size significantly affect the validity and reliability of a TQE, for two main reasons:
- the mathematics of defect measurement is rooted in statistics, and
- there is strong practical evidence that human perception is non-linear.
Statistics tells us that very small samples (under 250 words) introduce high uncertainty into quality measurements, regardless of the method, making them unsuitable for MQM’s linear analytic scoring and requiring alternative approaches.
Medium-sized samples (500 to 5000 words) are better suited for linear assumptions and work well with traditional MQM-based evaluation. For instance, a scoring model calibrated for a 2000-word reference sample can be applied, with reasonable accuracy, to samples between 1500 and 2500 words.
In order to score samples that are larger than 5000 words, it is recommended to avoid linear scoring models entirely. However, non-linear scoring models are outside of the scope of this Tutorial.
Once a sample has been selected, the evaluator determines the Evaluation Word Count (EWC), usually by means of a software app such as a CAT (computer assisted translation) tool. Usually, this is the word count of the source text.
Finally, the sample of the source text and the sample of the target text are combined into a tab-delimited bitext TXT file, where they must be segmented (split into translation units) and aligned (each source segment appears on the same line as its corresponding target segment, separated by a tab character), just like the example in Section A4. This may be done in a CAT tool, using a command line interface or text editor, or with this webapp.
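If no CAT tool or webapp is at hand, a minimal sketch of assembling such a file (together with a naive word count for the EWC) might look like the following; the file name and segments are purely hypothetical, and a real CAT tool may count words differently.

```python
# Hypothetical example: writing an aligned, segmented bitext as a
# tab-delimited TXT file (one source/target pair per line).
source_segments = [
    "Der Hund bellt.",
    "Die Katze schläft.",
]
target_segments = [
    "The dog barks.",
    "The cat is sleeping.",
]

assert len(source_segments) == len(target_segments)  # alignment check

# A naive source word count; a CAT tool's EWC may be computed differently.
ewc = sum(len(segment.split()) for segment in source_segments)
print(ewc)  # 6

with open("bitext_sample.txt", "w", encoding="utf-8") as f:
    for src, tgt in zip(source_segments, target_segments):
        f.write(f"{src}\t{tgt}\n")
```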