This Translator Turing Test (TTT) evaluates translations of the same Spanish (es-ES) source text produced by GPT-4, DeepL Pro, and human professionals for two distinct use cases: a full translation for a UK academic journal (en-GB), and a summary translation for a US newspaper (en-US). Based on three criteria—adherence to specifications, correspondence, and fluency—the two machine-produced summary translations would not be mistaken for human-produced summary translations, and the two machine-produced full translations come closer but ultimately fail to pass as human-produced translations.
This case study provides a valuable counterpoint to recent studies that compared GPT-4 with “human-only” translations (for which the human professionals were not allowed to use any of their customary tools, e.g., translation memories or NMT) and characterized GPT-4’s output as almost as good as that of the best professionals (see Yan et al. 2024 and Marshall 2024). Such studies approach the question within an automation paradigm rather than an augmentation paradigm (see Brynjolfsson 2022).
On this page you will find the source text and the full thesis from which it was excerpted; the two sets of specifications; each provider’s translation for the two use cases; and notes (“appendices”) that supplement the write-up of the case study (under review). These appendices include a dry run with MS Copilot and examples of correspondence and fluency issues whose root cause is the language of the source text.
Full thesis (including bibliography)
GPT-4
DeepL Pro
Human