About Translation Quality Evaluation (TQE)
The intended audience for this tutorial is members of the language community who are not necessarily professional translators, but are interested in learning about the processes and tools involved in evaluating finished translations. First, Part A will present the basics of TQE using a concrete example of the evaluation of a very short translation; and then Part B will dive deeper into the TQE process and describe how to conduct a new TQE on any real-life translation. We include links to one possible system of tools that can be used to this end. One intended audience for this tutorial is the machine translation community at the 20th Machine Translation Summit in Geneva, Switzerland, taking place in June 2025.
MQM vs BLEU
TQE using MQM focuses on producing analytic quality scores based on annotated, individual errors and in relation to human perceptions of quality. The type of evaluation explained here is different from that normally used to evaluate incremental versions of a machine translation engine, or to compare separate engines using the same source text and reference translation (e.g., a BLEU score).
In brief, MQM is a framework for conducting TQEs in a way that is both analytic, meaning that it produces a quality rating by tabulating penalties for individual, annotated errors (as opposed to a holistic approach, which would produce a quality rating for the translation as a whole); and reference-free, meaning that it does not require translations to be evaluated in comparison to another translation, such as a gold standard or previous translation of the same source text.
Although it is technically possible to conduct a TQE with MQM in many settings—anywhere from a Python script to the back of an envelope—there are advantages to using tools specifically designed to work with MQM. Such tools highlight the analytic approach of MQM, which gives immediate feedback that is both 1) actionable enough to be used in human-centric quality improvement, and 2) specific enough to preserve the root cause of errors and enable a “stack trace” for a more thorough evaluation of a human or system’s translation competency. [the root causes page of themqm.org is incomplete and isn’t referenced here. I hope that will change.]
Part A: TQE Basics with a Concrete Example
Overview
Although there are many methods with which to evaluate the quality of translations, this tutorial is focused on conducting a TQE (Translation Quality Evaluation) based on MQM (Multidimensional Quality Metrics). It will consist of the following sections:
- An introduction to the concrete example we will run with throughout Part A
- Roles in TQE
- Translation Quality (TQ)
- Evaluation of TQ (the “E” of TQE)
- A system of tools for all steps in a TQE
Part A of this tutorial will go through each of these sections as they apply to the concrete example. As mentioned above, the reader will then be prepared to apply these tools to their own TQE in Part B. Readers who consider themselves already well-versed in the principles and theory behind conducting a TQE may proceed directly to Part B, referring back to Part A only as needed for examples.
A1. The Concrete Example
Below is a very short French text that discusses several aspects of European Union law, as well as its English translation. Both are segmented into translation units (TUs) that correspond between the source and target texts. Together, these texts make up the document that we will use to illustrate the principal parts of a TQE in Part A of this tutorial. Many artefacts are generated over the course of a TQE, and all of the artefacts for Part A are provided to the reader in a public GitHub repository.
As will be discussed in Part B, the concrete example here in Part A is far too short for a real TQE. While such a small sample size is not sufficient for statistically significant evaluation, it is useful for demonstrating the basic principles of TQE.
TU | Source Text (French) |
---|---|
1 | Le Parlement européen a adopté, le 11 novembre 2015, une résolution sur la réforme de la loi électorale de l’Union européenne. |
2 | Plusieurs principes ont alors été retenus: |
3 | (1) l’organisation des élections sur la base d’un scrutin de liste ou d’un vote unique transférable de type proportionnel; |
4 | (2) la suppression du cumul de tout mandat national avec celui de député européen; |
5 | (3) la liberté pour les États membres de constituer des circonscriptions au niveau national; |
TU | Target Text (English) |
---|---|
1 | On November 11th, 2015, the European Parliament adopted a resolution on the reform of the laws of the European Union. |
2 | Several principles were retained: |
3 | (1) conducting elections on the basis of proportional representation; using a list system or a single transferable ballot system; |
4 | (2) prohibiting the cumulation of any national office with one as Member of the European Parliament; |
5 | (3) upholding the freedom of Member States to draw up constituencies at national level; |
A2. Roles in TQE
A clear distinction should be made between professional translators and other personnel involved in the language service industry. Relevant to TQE are:
- Project managers, who oversee the translation, the translation evaluation, or both.
- Technicians, who set up and run the mechanics of a TQE, creating and maintaining tools that are critical to all parts of the process.
- Evaluators, who examine the source and target texts to identify errors relative to the metric and produce annotation metadata.
One person may fill more than one of these roles, but it is critical that an evaluator have at least the competence of a professional translator for the languages, subject field, and type of text in question. Non-translators might fill the roles of technician or project manager. Part A of this tutorial will help anyone understand all the roles in a TQE by following a guided, concrete example, even if they will never fill the role of evaluator; Part B will enable readers to contribute to other TQEs by preparing tools, specifications, and metrics for evaluators, and by producing quality ratings based on the error data received from evaluators. After following this tutorial, conducting another TQE will require non-translators to work with a qualified translator to fill at least the role of evaluator.
A3. What is Translation Quality (TQ)?
If you are not familiar with the idea of translation quality (TQ) and its measurement, the following intertwined articles give an exhaustive overview:
These articles can be summed up with the following key terms:
- What is translation? “Translation … is a cover term for the creation of written output that corresponds to source content according to agreed-upon specifications.” (This comes from this paper by the International Federation of Translators.)
- What are specifications? They are the answers to the standardized translation parameters, produced by dialog between stakeholders (the requester and the project manager).
  - Translation parameters are standardized questions about a translation project.
  - Translation specifications are the use-case-specific answers to those questions.
  - It may be useful to look at this list of translation parameters: click here for the PDF.
- What is translation quality? The degree to which a translation meets the agreed-on specifications.
A4. What is TQE?
Once the reader has understood the notion of translation quality (the “TQ” in TQE), then the next point to approach is the evaluation step: the “E” of TQE. The evaluator of a translation should themselves be a professional translator.
The goal of the evaluation process is to identify errors in the text. The most obvious way to do this is with an annotation tool. In our example, we annotate using the TRG Annotation Tool. This tool must first be configured with a bitext, a metric, and an optional but recommended specifications file. These files are prepared in a preliminary step and handed off to an evaluator, who then uses the tool as a visual interface to highlight and assign errors to portions of the text. The annotation data is then exported. A visualization of the concrete example text annotated for errors can be found here.
Aside: Segmentation and Alignment
The above page contains a source text on the left side and a target text on the right side. They have been segmented, meaning that they have been split into translation units, which are manageable chunks of text that correspond roughly to the size of a clause or short sentence. And they have been aligned, meaning that the source and target texts’ segments have been set side-by-side. Together, they form a document that has been annotated for translation errors (in this system, click on one of the yellow or orange error buttons to see the erroneous text highlighted in the viewer), and is now ready to be assigned a quality rating.
After annotation, the exported data can be scored. In our example, we calculate scores using the MQM Tools TQE Calculator. This tool performs calculations on the Annotation Tool’s exported data, and returns a numeric quality score and a pass/fail quality rating. It also exports a summary of its calculations, as well as an error count table, in the form of an Excel spreadsheet.
The scoring process concludes the TQE. Each of these steps will be broken down further in Section A5.
A5. Putting the Concrete Example Together
This section guides the reader through each step of the TQE process described in Section A4, and includes all of the files and artefacts generated throughout the process, which can be found in this public GitHub repository. Part B of this tutorial gives a deeper explanation of how each of these files gets created on the pathway towards calculating a quality rating.
As a matter of practicality, and as implied in Section A4, a TQE is split into three stages:
- Preliminary Stage
- Error Annotation Stage
- Automatic Calculation & Follow-Up Stage
A5.1 Preliminary Stage
The purpose of the preliminary stage is:
- To prepare the sample for evaluation by segmenting and aligning it,
- To formalize the specifications negotiated when the translation was created, and
- To select a metric to be used in later scoring.
This is all done in preparation for the Error Annotation Stage. For the concrete example, it results in:
- a bitext .txt file, which contains the source text and the target text, concatenated line-by-line with tab characters as delimiters (see the sketch after this list),
- an STS .xml file, which contains detailed information on the source and target text, as well as the translation process, and
- a metric .xml file, which contains a list of errors that the evaluator will look out for, as well as information that will later be used to score the translation based on the evaluator’s reporting of those errors, such as the minimum threshold quality score, called the cutscore.
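To make the bitext format concrete, here is a minimal sketch that writes the first two translation units of the concrete example as a tab-delimited bitext file in Python. This is an illustration only; the exact layout expected by a given annotation tool may differ.

```python
# Minimal sketch: writing a bitext .txt file for the first two TUs of the
# concrete example, one TU per line, source and target separated by a tab.
# (Illustration only; a real tool may expect a slightly different layout.)
tus = [
    ("Le Parlement européen a adopté, le 11 novembre 2015, une résolution "
     "sur la réforme de la loi électorale de l’Union européenne.",
     "On November 11th, 2015, the European Parliament adopted a resolution "
     "on the reform of the laws of the European Union."),
    ("Plusieurs principes ont alors été retenus:",
     "Several principles were retained:"),
]

with open("bitext.txt", "w", encoding="utf-8") as f:
    for source, target in tus:
        f.write(f"{source}\t{target}\n")
```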
A5.2 Error Annotation Stage
The purpose of the error annotation stage is for the evaluator to annotate the bitext with errors from the metric file. The three files from the Preliminary Stage are uploaded to the TRG Annotation Tool, and the evaluator highlights and annotates the errors. In this concrete example, the evaluator determined that there were 6 errors:
- A minor Organizational Style error in TU1: the style guide for this project requires that the date be formatted “11 November 2015,” not “November 11th, 2015.”
- A major Omission error in TU1: the target text omits the word “electoral” when describing the types of laws mentioned in the EU resolution.
- A major Mistranslation error in TU2: “retained” is not an acceptable translation of “retenus.” It is a false cognate. An acceptable translation here would be “included.”
- A minor Punctuation error in TU3: the semicolon after “representation” should be a comma.
- A major Unidiomatic Style error in TU4: although it accurately conveys the intended meaning of the source text, the target text is unwieldy because of the literal word-for-word translation of the source material.
  - The evaluator offered an idiomatic alternative: “prohibiting individuals from holding office as a member of a national parliament and as a Member of the European Parliament at the same time.”
- A minor Awkward Style error in TU5: the phrase “draw up” is unclear. A better alternative might be “establish.”
Once annotation is finished, the TRG Annotation Tool exports its data as a JSON file. This data can be converted into a TEI file (TEI is a widely used XML format in the Digital Humanities), then visually inspected for data integrity.
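The exact schema of the exported JSON belongs to the TRG Annotation Tool and is not reproduced here. Purely as an illustration, the sketch below assumes a hypothetical export in which each annotation records a TU number, an error type, and a severity (these field names are ours, not the tool’s), and tallies the errors the way an error count table would.

```python
import json
from collections import Counter

# Hypothetical export structure (NOT the TRG Annotation Tool's actual schema):
# a list of annotation records with a TU number, error type, and severity,
# matching the six errors annotated in the concrete example.
sample_export = json.loads("""
[
  {"tu": 1, "error_type": "Organizational Style", "severity": "minor"},
  {"tu": 1, "error_type": "Omission",             "severity": "major"},
  {"tu": 2, "error_type": "Mistranslation",       "severity": "major"},
  {"tu": 3, "error_type": "Punctuation",          "severity": "minor"},
  {"tu": 4, "error_type": "Unidiomatic Style",    "severity": "major"},
  {"tu": 5, "error_type": "Awkward Style",        "severity": "minor"}
]
""")

# Tally errors by (type, severity), as an error count table summarizes them.
counts = Counter((a["error_type"], a["severity"]) for a in sample_export)
for (error_type, severity), n in counts.items():
    print(f"{error_type} ({severity}): {n}")
```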
A5.3 Automatic Calculation & Follow-Up Stage
The purpose of the Automatic Calculation Stage is to get a quality rating by comparing a calculated overall quality score to a minimum threshold value called the cutscore. For this concrete example, we use a tool called the TQE Calculator, which has an automatic functionality that takes in two files:
- the TEI file containing error data exported by the TRG Annotation Tool, and
- the metric file, containing scoring model parameters.
The TQE Calculator automatically parses these files and generates an error count table, which summarizes the number and type of errors that were annotated in the TRG Annotation Tool in the context of all the possible error types in the metric, as well as those error types’ weights and the penalty multipliers for each severity level:

The error count table contains all the information necessary to calculate the Absolute Penalty Total (APT), where each error, initially worth 1 penalty point, is multiplied first by its weight (here, they are all 1), and then by its severity multiplier (here, we use the standard of ×1 for minor errors and ×5 for major errors). The TQE Calculator automatically computes this, pulling everything from the error count table:

Then, the TQE Calculator automatically populates the parameters to be used in the scoring model specified in the metric. The only parameter that needs to be input is the length of the target text portion of the document in words, which is 74. The tool calculates an overall quality score, which is then compared to the cutscore to provide a quality rating:

The Overall Quality Score (OQS) falls short of the established cutscore of 80, and thus receives a quality rating of “fail.”
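To make the arithmetic transparent, here is a minimal sketch that reproduces these numbers, assuming the common raw linear form in which the Overall Quality Score is 100 × (1 − APT / word count); the TQE Calculator’s exact formula is defined by the metric file and may differ.

```python
# Worked recalculation of the concrete example, assuming a raw linear model.
# 3 minor errors and 3 major errors, all with error-type weight 1.
minor_errors, major_errors = 3, 3
weight = 1
minor_multiplier, major_multiplier = 1, 5   # standard severity multipliers

# Absolute Penalty Total: each error starts at 1 point, times weight, times severity.
apt = minor_errors * weight * minor_multiplier + major_errors * weight * major_multiplier
print(apt)  # 18

word_count = 74   # length of the target text in words
cutscore = 80     # minimum passing threshold from the metric

# Assumed raw linear formula: OQS = 100 * (1 - APT / word count).
oqs = 100 * (1 - apt / word_count)
print(round(oqs, 1))                          # 75.7
print("pass" if oqs >= cutscore else "fail")  # fail
```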
Note the option to download a summary of this whole process as an Excel spreadsheet (the spreadsheet summary for the concrete example is here).
In this example, we used a linear noncalibrated scoring model. The exact formulas used in different scoring models are outside the scope of Part A. To learn about scoring models, including calibrated scoring models, see Part B Section 2.2.
Again, it is emphasized that this concrete example is too short a sample to be statistically meaningful, but is provided as a demonstration of the TQE process. Now, Part B will describe how to expand these procedures to perform a TQE on any real-world translation, if the reader so dares.
Part B: Expanded Theory & Your Own TQE
After studying the critical components of a TQE and their applications to the concrete example provided, the reader can move on to their own TQE. The resources that were provided freely during the tutorial must now be produced independently. These include:
- The specifications of the translation, in the format of an STS XML file
- The metric selected for evaluation, in the format of a metric XML file, including:
  - Error types and severity levels
  - A scoring model
  - A cutscore
- A bitext of source text and target text, aligned and segmented
- Error annotations assigning issues from the metric to textual segments in the bitext
- A tool that runs the scoring model on the error annotation data and compares the output to a cutscore, delivering a final quality rating
More information about how to use the tools in this section can be found at TQE Tools for General Application. [tools page should have info for technicians, this page has info for PMs and other non-technicians]
Preliminary Stage
B1. Making an STS File
[explain that it is structured translation specifications, based on ASTMF2575 structured parameters]
This stage involves reviewing and ensuring access to the agreed-on translation specifications.
[not trivial, needs a good PM]
B2. Designing a New TQE Metric
A key assumption in the MQM framework is that, because the specifications vary by use case, there is no single, universal translation quality metric. For example, an error may be considered critical in the context of one translation project, but minor or neutral in another. The design of any measure of goodness must be justified in relation to the translation project that is to be evaluated.
It is the responsibility of the project manager to select a metric that is tailored to the specifications of the translation project that they are managing. Metrics may be reused if two translation projects share the same specifications (such as two translations in a series, differing only in due date). The project manager delivers the selected metric to the evaluator, who then proceeds with the TQE.
This step consists of verifying (or selecting or creating, if necessary) the metric for performing the evaluation, based on the translation specifications.
A metric is composed of three components, as stated in WK46396 §4.5 [note: how should we call this?]:
A TQE metric shall consist of (1) an appropriate selection of error types, (2) a scoring model, and (3) a mechanism to determine whether the evaluated translation has passed or failed. In this standard practice all three components are based on the MQM framework.
B2.1 An Appropriate Selection of Error Types
Instead of a universal metric, MQM provides a common typology of errors that may occur in translation (https://themqm.org/error-types-2/typology/). Each MQM metric pulls its error selection from this typology. These errors are then used to annotate or otherwise grade a translation.
[provide typology.xml here]
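As a small illustration of what “an appropriate selection of error types” might look like in practice, the sketch below lists the error types used in the concrete example of Part A as a simple Python structure. This is illustrative only; a real metric expresses this selection, along with its weights, in the metric XML file.

```python
# Illustrative only: an error-type selection drawn from the MQM typology,
# matching the error types used in the concrete example of Part A.
# Weights of 1 mirror that example; a real metric file defines its own values.
selected_error_types = {
    "Omission":             {"weight": 1},
    "Mistranslation":       {"weight": 1},
    "Punctuation":          {"weight": 1},
    "Organizational Style": {"weight": 1},
    "Unidiomatic Style":    {"weight": 1},
    "Awkward Style":        {"weight": 1},
}
severity_multipliers = {"minor": 1, "major": 5}  # standard multipliers from Part A
```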
B2.2 A Scoring Model
This is a method of converting error information (how many errors occurred, of what kind and where, how severe they were, etc.) into a single number: the Quality Score (QS).
[Discuss calibration and noncalibration here, as well as sampling]
[from Serge]
In this Tutorial, we will focus on two linear scoring models, with and without calibration. The linear scoring model without calibration, also called the “raw” scoring model, is well known in the industry. The calibrated linear scoring model uses a quality scale that is different from the raw model’s error count scale. This is done for ease of human use and application. Both models are described in the paper The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control (13 authors).
B2.3 A Mechanism to Determine Whether the Evaluated Translation Has Passed or Failed
The most straightforward way to implement this is with a cutscore, also known as a Passing Threshold (PT). If the score output by the scoring model is above the cutscore, the translation passes.
[Huge QM discussion around this. Needs to reflect real-world perception. A metric is validated based on how well it predicts what experts will say about a translation.]
Setting a cutscore in a real-life project is beyond the scope of this tutorial.
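Putting B2.2 and B2.3 together, a scoring model plus a cutscore can be expressed compactly as a single function. The sketch below assumes the raw (noncalibrated) linear model used in Part A, where the quality score is 100 × (1 − APT / evaluation word count); calibrated models map onto a different quality scale, as described in the Multi-Range Theory paper.

```python
def raw_linear_quality_rating(apt: float, word_count: int, cutscore: float) -> tuple[float, str]:
    """Raw (noncalibrated) linear scoring model with a pass/fail cutscore.

    Assumes QS = 100 * (1 - APT / word_count); a calibrated model would
    transform this onto a different quality scale.
    """
    quality_score = 100 * (1 - apt / word_count)
    rating = "pass" if quality_score >= cutscore else "fail"
    return quality_score, rating

# Concrete example from Part A: APT = 18, 74 words, cutscore = 80 -> fail.
print(raw_linear_quality_rating(18, 74, 80))
```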
If you have more questions about the MQM framework that we use for selecting metrics and defining error types, you can read about it here: https://themqm.org/.
B3. Making a New Bitext (Preparing the text)
- preparing the source text and target text, or a sample thereof, for evaluation
- determining the Evaluation Word Count, usually by means of a software app such as a CAT (computer-assisted translation) tool (see the sketch below)
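If a CAT tool is not at hand, a rough Evaluation Word Count can be approximated with a few lines of code, as in the sketch below. Note that CAT tools apply their own tokenization rules, so their counts may differ slightly from a simple whitespace split.

```python
# Rough word count of the target text (or sample) by whitespace splitting.
# CAT tools use their own tokenization, so expect small discrepancies.
def evaluation_word_count(segments: list[str]) -> int:
    return sum(len(segment.split()) for segment in segments)

target_segments = [
    "Several principles were retained:",
    "(3) upholding the freedom of Member States to draw up constituencies at national level;",
]
print(evaluation_word_count(target_segments))  # 18
```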
[begin Serge]
If it is possible to annotate the entire document for translation errors, then we can be confident that the translation quality evaluation is as reliable as the TQE process itself, which has its own inherent reliability.
The reliability of any particular evaluation hinges on the qualifications of the evaluator, who needs to be a highly proficient and experienced linguist.
Such highly qualified resources are scarce, expensive, and have limited availability. This is the primary reason why TQE is typically performed on samples of limited size, rather than on an entire large document, or on all small documents in the case of a series of
self-sufficient texts within a project. (Not to mention that TQE is usually done for someone else; if the goal is to fix the errors, the linguist can simply edit the text without annotating it.)
It is therefore common practice to evaluate only a sample and extrapolate the score to estimate the quality of the full document. In this sense, TQE evaluation works with a sample of a certain size taken from a larger document or content stream.
Error annotation is a crucial part of this process—and also the most time- and effort-intensive—but it only produces raw annotation data, not a quality score.
The method to calculate the quality rating from the raw annotation data is called a scoring model. For simplicity, a linear scoring model is usually applied—even though the world is profoundly non-linear—because any non-linear function can be approximated by a linear one over a small interval. Therefore, if the scoring model is set up and verified for a certain sample size, it will perform reasonably well for nearby sample sizes.
This brings us to a key point: both sample size and document size significantly affect the validity and reliability of a TQE, for two main reasons:
- the mathematics of defect measurement is rooted in statistics, and
- there is strong practical evidence that human perception is non-linear.
Statistics tells us that very small samples (under 250 words) introduce high uncertainty into quality measurements, regardless of the method, making them unsuitable for MQM’s linear analytic scoring and requiring alternative approaches.
Medium-sized samples (500 to 5000 words) are better suited for linear assumptions and work well with traditional MQM-based evaluation. For instance, a scoring model calibrated for a 2000-word reference sample can be applied, with reasonable accuracy,
to samples between 1500 and 2500 words.
Note that, unfortunately, the non-linearity of human perception shows itself across a wide range of sample sizes, because human readers tolerate fewer errors per unit as the text grows, due to cognitive effects like priming. In order to produce scores that accurately reflect human perception across a wide range of sample sizes, it may be necessary to use a non-linear scoring model. More information about non-linear scoring models is forthcoming.
[end Serge]
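As a back-of-the-envelope illustration of why very small samples introduce high uncertainty (our crude model, not taken from the cited paper), suppose errors behaved like independent per-word events with some true error rate. The sketch below compares the relative uncertainty of the observed error count for a 250-word sample and a 2000-word sample under that assumption.

```python
# Crude illustration: errors modeled as independent per-word events.
# Smaller samples carry much larger relative uncertainty in the error count.
from math import sqrt

error_rate = 0.02  # assumed true rate: 2 errors per 100 words (illustrative)

for sample_words in (250, 2000):
    expected_errors = sample_words * error_rate
    std_dev = sqrt(sample_words * error_rate * (1 - error_rate))
    print(f"{sample_words} words: expect {expected_errors:.0f} errors "
          f"± {std_dev:.1f} ({std_dev / expected_errors:.0%} relative uncertainty)")
# 250 words:  expect 5 errors  ± 2.2 (44% relative uncertainty)
# 2000 words: expect 40 errors ± 6.3 (16% relative uncertainty)
```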
The paper The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control describes linear scoring models, with and without calibration. Both types of linear scoring models will be discussed later on.
B4. Evaluation Stage: Annotating Errors with the TRG Annotation Tool
[point towards Microsoft Word – TC43-OnTheWeb-proceedings page 17]
[possibly point towards Valerie R. Mariana’s MA Thesis The Multidimensional Quality Metric (MQM) Framework: A New Framework for Translation Quality Assessment]
B5. Automatic Calculation & Follow-Up Stage: Computing the Score
The pass/fail quality rating is the ultimate goal of a TQE.