Nguyen, Dac, Nguyen, Quy H., Dao, Minh-Son, Dang-Nguyen, Duc-Tien ORCID: 0000-0002-2761-2213, Gurrin, Cathal ORCID: 0000-0003-2903-3968 and Nguyen, Binh T. (2020) Duplicate identification algorithms in SaaS platforms. In: 2020 Intelligent Cross-Data Analysis and Retrieval Workshop (ICDAR'20), 20-26 Oct 2020, Dublin, Ireland. ISBN 978-1-4503-7509-2
Abstract
Duplicate records are one of the most common issues in
many Software-as-a-Service (SaaS) platforms. In this paper, we
study the duplicate identification problem in one specific SaaS platform for quality and compliance management, using
address information. We analyze the typical user mistakes
that can produce duplicate organization records in a
dataset collected from the SaaS platform. We also create a second
dataset by crawling location data from Open Address (US Zone). We
compare different methods, including Bag-of-words (using Cosine
Distance), Record Linkage Toolkits, and Siamese Neural Networks
trained with the triplet loss, in terms of precision, recall, and F1-score. The
experimental results show that Siamese Neural Networks
achieve better performance than the other techniques.
We plan to publish our Open Address dataset and all implementation code to facilitate further research in related fields.
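The quantities named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenization, the example addresses, the margin value, and all function names are assumptions chosen for illustration, and the triplet loss is computed on plain coordinate tuples rather than on embeddings produced by a Siamese network.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, treat commas as spaces, and count tokens."""
    return Counter(text.lower().replace(",", " ").split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty record matches nothing
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin),
    using Euclidean distance; the margin value is a placeholder."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 from parallel lists of booleans,
    where True means 'this pair is a duplicate'."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical address pair: one token differs ("St" vs "Street").
d = cosine_distance(bag_of_words("12 Main St, Springfield"),
                    bag_of_words("12 Main Street, Springfield"))
print(d)
```

A pair would then be flagged as a duplicate when its distance falls below a chosen threshold, and the flagged pairs can be scored against labeled ground truth with `precision_recall_f1`.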
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Workshop |
Refereed: | Yes |
Uncontrolled Keywords: | siamese; software-as-a-service; bi-gru; triplet loss; duplicate identification |
Subjects: | Computer Science > Computer security; Computer Science > Software engineering |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT |
Published in: | Proceedings of the 2020 Intelligent Cross-Data Analysis and Retrieval Workshop (ICDAR'20). Association for Computing Machinery (ACM). ISBN 978-1-4503-7509-2 |
Publisher: | Association for Computing Machinery (ACM) |
Official URL: | https://doi.org/10.1145/3379174.3392319 |
Copyright Information: | © 2020 The Authors |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland under grant number SFI/13/RC/2106, L. Meltzers Høyskolefonds, UiB 2019/2259-NILSO |
ID Code: | 24667 |
Deposited On: | 22 Jun 2020 15:32 by Cathal Gurrin. Last Modified 15 Dec 2021 15:40 |
Documents
Full text available as: PDF (1MB)