An automatically built named entity lexicon for Arabic

Attia, Mohammed, Toral, Antonio ORCID: 0000-0003-2357-2960, Tounsi, Lamia, Monachini, Monica and van Genabith, Josef ORCID: 0000-0003-1322-7944 (2010) An automatically built named entity lexicon for Arabic. In: LREC 2010 - 7th conference on International Language Resources and Evaluation, 17-23 May 2010, Valletta, Malta.

Abstract

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold.
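The trade-off reported above, where a stricter threshold raises precision and lowers recall, can be illustrated with a minimal sketch. The candidate names, confidence scores, gold set and function name below are invented for illustration and are not taken from the paper. The abstract does not say how the lexicon's threshold is defined, so a simple "keep candidates scoring at or above the threshold" rule is assumed here.

```python
def precision_recall(scores, gold, threshold):
    """Score a thresholded candidate list against a gold standard.

    scores: dict mapping each candidate entry to a confidence score
            (hypothetical; the paper's actual scoring is not given here).
    gold:   set of entries manually annotated as named entities.
    """
    # Assumed decision rule: accept a candidate if its score meets the threshold.
    predicted = {cand for cand, score in scores.items() if score >= threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: three true named entities and one common noun ("book").
scores = {"Cairo": 0.9, "Dublin": 0.6, "book": 0.5, "Malta": 0.2}
gold = {"Cairo", "Dublin", "Malta"}

for threshold in (0.4, 0.8):
    p, r = precision_recall(scores, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold removes the false positive and also drops a true entity, which is the same direction of movement as the reported figures.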

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	Research Initiatives and Centres > National Centre for Language Technology (NCLT); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in:	Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association.
Publisher:	European Language Resources Association
Official URL:	http://www.lrec-conf.org/proceedings/lrec2010/summ...
Copyright Information:	Copyright 2010 European Language Resources Association
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	European Framework Programme 7, Enterprise Ireland, Irish Research Council for Science Engineering and Technology
ID Code:	15979
Deposited On:	08 Dec 2010 14:37 by Shane Harper. Last Modified 20 Jan 2022 16:05

Documents

Full text available as:

An_automatically_built_Named_Entity_lexicon_for_Arabic.pdf (PDF, 2MB)
