Why is it so difficult to compare treebanks? TIGER and TüBa-D/Z revisited
Rehbein, Ines and van Genabith, Josef (2007) Why is it so difficult to compare treebanks? TIGER and TüBa-D/Z revisited. In: TLT 2007 - The 6th International Workshop on Treebanks and Linguistic Theories, 7-8 December 2007, Bergen, Norway.
This paper is a contribution to the ongoing discussion on treebank annotation schemes and their impact on PCFG parsing results. We provide a thorough comparison of two German treebanks: the TIGER treebank and the TüBa-D/Z. We use simple statistics on sentence length and vocabulary size, and more refined methods such as perplexity and its correlation with PCFG parsing results, as well as a Principal Components Analysis. Finally, we present a qualitative evaluation of a set of 100 sentences from the TüBa-D/Z, manually annotated in the TIGER as well as in the TüBa-D/Z annotation scheme, and show that even the existence of a parallel subcorpus does not support a straightforward and easy comparison of the two annotation schemes.