Attia, Mohammed, Toral, Antonio ORCID: 0000-0003-2357-2960, Tounsi, Lamia, Monachini, Monica and van Genabith, Josef ORCID: 0000-0003-1322-7944 (2010) An automatically built named entity lexicon for Arabic. In: LREC 2010 - 7th conference on International Language Resources and Evaluation, 17-23 May 2010, Valletta, Malta.
Abstract
We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold.
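The precision/recall trade-off reported at the end of the abstract can be illustrated with a minimal sketch. The abstract does not say how the threshold is defined, so the code below assumes, purely for illustration, that each candidate entry carries a numeric confidence score and is accepted when the score reaches the threshold. The function names, the scores and the toy data are all hypothetical and are not taken from the paper.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def evaluate_at_threshold(scored_candidates, gold, threshold):
    """Accept candidates scoring >= threshold, then compare with the gold set."""
    accepted = {name for name, score in scored_candidates.items()
                if score >= threshold}
    return precision_recall(accepted, gold)


# Hypothetical toy data: candidate -> assumed confidence score.
scored = {"Cairo": 0.9, "Dublin": 0.8, "Malta": 0.6, "book": 0.4}
# Hypothetical manually annotated gold standard of true named entities.
gold = {"Cairo", "Dublin", "Malta", "Baghdad"}

for threshold in (0.3, 0.7):
    p, r = evaluate_at_threshold(scored, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 drops the one false positive but also loses a true entity, so precision rises while recall falls. That is the same direction of trade-off as the figures quoted in the abstract.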
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | Research Institutes and Centres > National Centre for Language Technology (NCLT); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Published in: | Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association. |
Publisher: | European Language Resources Association |
Official URL: | http://www.lrec-conf.org/proceedings/lrec2010/summ... |
Copyright Information: | Copyright 2010 European Language Resources Association |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | European Framework Programme 7, Enterprise Ireland, Irish Research Council for Science Engineering and Technology |
ID Code: | 15979 |
Deposited On: | 08 Dec 2010 14:37 by Shane Harper. Last Modified 20 Jan 2022 16:05 |
Documents
Full text available as: PDF (2MB). Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader.