From news to comment: Resources and benchmarks for parsing the language of web 2.0
Foster, Jennifer (ORCID: 0000-0002-7789-4853), Cetinoglu, Ozlem, Wagner, Joachim (ORCID: 0000-0002-8290-3849), Le Roux, Joseph, Nivre, Joakim, Hogan, Deirdre and van Genabith, Josef (ORCID: 0000-0003-1322-7944)
(2011)
From news to comment: Resources and benchmarks for parsing the language of web 2.0.
In: The 5th International Joint Conference on Natural Language Processing (IJCNLP), 08-13 Nov 2011, Chiang Mai, Thailand.
ISBN 978-974-466-564-5
We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.
Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP).
Asian Federation of Natural Language Processing. ISBN 978-974-466-564-5