July 11, 2025 5:42 pm

MQM vs. Other TQE Methods

TQE using MQM focuses on producing analytic quality scores based on individually annotated errors and grounded in human perceptions of quality. This type of evaluation differs from the kind normally used to compare incremental versions of a machine-translation engine, or separate engines, against a source text and a reference translation (e.g., a BLEU score).

In brief, MQM is a framework for conducting TQEs in a way that is both analytic and reference-free. Analytic means that it produces a quality rating by tabulating penalties for individual, annotated errors (as opposed to a non-analytic approach, which would produce a quality rating for the translation as a whole). Reference-free means that it does not require translations to be evaluated in comparison to another translation, such as a gold standard or a previous translation of the same source text. This contrasts with a metric like BLEU, which needs a reference translation.

Non-analytic methods of scoring translation products produce quality ratings without pointing to specific errors. One example is Translation Quality Estimation (another possible expansion of the acronym TQE, not to be confused with Translation Quality Evaluation in general), which produces an estimated quality score, usually via machine learning. Although such algorithms need no reference translation, they still depend on their (inevitably biased) training data as an implicit reference.
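To make the contrast concrete, here is a deliberately toy sketch of the quality-estimation idea: a model maps features of a (source, translation) pair to a single score, with no reference translation and no error annotations. Real QE systems use trained neural models; the features, weights, and example sentences below are invented for illustration.

```python
# Toy reference-free quality estimation. A real system would learn its
# weights from (biased) human-labeled training data; these are made up.

def features(source: str, translation: str) -> list[float]:
    """Extract two crude features from a source/translation pair."""
    src, tgt = source.split(), translation.split()
    length_ratio = len(tgt) / max(len(src), 1)
    # Fraction of source tokens copied verbatim (possible untranslated words).
    copied = len(set(src) & set(tgt)) / max(len(src), 1)
    return [length_ratio, copied]

# Hypothetical learned parameters.
WEIGHTS = [0.2, -0.5]
BIAS = 0.6

def estimate_quality(source: str, translation: str) -> float:
    """Return an estimated quality score in [0, 1] -- no reference needed."""
    f = features(source, translation)
    raw = BIAS + sum(w * x for w, x in zip(WEIGHTS, f))
    return max(0.0, min(1.0, raw))

print(estimate_quality("le chat dort", "the cat sleeps"))  # → 0.8
```

Note that the output is a single holistic number: unlike an MQM annotation, it tells you nothing about which errors occurred or where.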

Although it is technically possible to conduct a TQE with MQM in many settings (anywhere from a Python script to the back of an envelope), there are advantages to using tools specifically designed for it. Such tools highlight the analytic approach of MQM: they give immediate feedback that is both 1) actionable enough to be used in human-centric quality improvement, and 2) specific enough to preserve the root cause of each error, enabling a “stack trace” for a more thorough evaluation of a human’s or system’s translation competency.
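The “Python script” end of that spectrum can be sketched in a few lines: tabulate a penalty for each annotated error and normalize by text length. The error categories, severity weights, and 0–100 normalization below are illustrative assumptions, not official MQM defaults, which vary by MQM version and implementation.

```python
# Minimal sketch of an analytic, MQM-style score: sum severity-weighted
# penalties over annotated errors, then normalize by word count.
# Weights and scale are illustrative, not the official MQM values.

SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}

def mqm_score(errors: list[dict], word_count: int, per_words: int = 100) -> float:
    """Return a 0-100 quality rating from a list of annotated errors."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    # Penalty points per `per_words` words, subtracted from a perfect 100.
    penalty_rate = penalty * per_words / word_count
    return max(0.0, 100.0 - penalty_rate)

# Each annotation keeps its category, so the score stays traceable
# to specific errors -- the "stack trace" property described above.
errors = [
    {"category": "accuracy/mistranslation", "severity": "major"},
    {"category": "fluency/spelling", "severity": "minor"},
]
print(mqm_score(errors, word_count=120))  # → 95.0
```

Because the inputs are individual annotations rather than a holistic judgment, the same data supports both the aggregate score and a per-category breakdown for quality improvement.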

References

Lommel, A. (2016). Blues for BLEU: Reconsidering the Validity of Reference-Based MT Evaluation. Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem”, pp. 63–70.

Vashee, K. (2019, April 12). Understanding MT Quality: Bleu Scores. RWS.