A cluster-based representation for multi-system MT evaluation
Stroppa, Nicolas, Owczarzak, Karolina and Way, Andy (ORCID: 0000-0001-5736-5930)
(2007)
In: TMI-07 - Proceedings of The 11th Conference on Theoretical and Methodological Issues in Machine Translation, 7-9 September 2007, Skövde, Sweden.
Automatic evaluation metrics are often used to compare the quality of different systems. However, a small difference
between the scores of two systems does not necessarily reflect a real difference between their performance. Because
such a difference can be significant or only due to chance, it is inadvisable to use a hard ranking to represent
the evaluation of multiple systems. In this paper, we propose a cluster-based representation for quality ranking
of Machine Translation systems. A comparison of rankings produced by clustering based on automatic MT evaluation
metrics with those based on human judgements shows that such an interpretation of automatic metric scores provides a dependable means of ordering MT systems with respect to their quality. We report experimental results comparing clusterings produced by BLEU, NIST, METEOR, and GTM
with those derived from human judgement (of adequacy and fluency) on the IWSLT-2006 evaluation campaign data.
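
To illustrate the idea of replacing a hard ranking with clusters, the following is a minimal sketch, not the procedure used in the paper. It assumes that per-segment scores are available for every system on the same test set, uses a paired bootstrap test as a stand-in significance test, and groups systems greedily; the names `significantly_better` and `cluster_systems`, the resample count, and the 0.05 threshold are all hypothetical choices made for this example.

```python
import random
from statistics import mean


def significantly_better(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Paired bootstrap test (an assumed stand-in, not the paper's test).

    a, b: per-segment scores of two systems on the same segments.
    Returns True if A outscores B in at least (1 - alpha) of the resamples.
    """
    if len(a) != len(b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(a)
    wins = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement, same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / n_resamples >= 1 - alpha


def cluster_systems(scores, **kwargs):
    """Group systems into ordered clusters instead of a hard ranking.

    scores: dict mapping system name -> list of per-segment scores.
    Systems are sorted by mean score; a system joins the current cluster
    unless that cluster's best system is significantly better than it.
    """
    ranked = sorted(scores, key=lambda name: mean(scores[name]), reverse=True)
    clusters = []
    for name in ranked:
        if clusters and not significantly_better(
            scores[clusters[-1][0]], scores[name], **kwargs
        ):
            clusters[-1].append(name)
        else:
            clusters.append([name])
    return clusters


# Toy data (invented for illustration): A and C are indistinguishable, B is clearly worse.
demo = {"A": [0.9] * 50, "C": [0.9] * 50, "B": [0.1] * 50}
print(cluster_systems(demo))
```

On this toy input, A and C fall into one cluster and B into a second, lower one. Note that corpus-level metrics such as BLEU are not simple sums of segment scores, so a real implementation would recompute the metric on each resampled test set rather than summing per-segment values as this sketch does.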