An active learning framework for duplicate detection in SaaS platforms
Nguyen, Quy H., Nguyen, Dac, Dao, Minh-Son, Dang-Nguyen, Duc-Tien (ORCID: 0000-0002-2761-2213), Gurrin, Cathal (ORCID: 0000-0003-2903-3968) and Nguyen, Binh T.
(2020)
An active learning framework for duplicate detection in SaaS platforms.
In: Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20), 26–29 Oct 2020, Dublin, Ireland.
ISBN 978-1-4503-7087-5
With the rapid growth of users' data in SaaS (Software-as-a-Service) platforms built on microservices, detecting duplicated entities has become essential for ensuring the integrity and consistency of data in many companies and businesses (primarily multinational corporations). Because today's databases are so large, duplicate detection algorithms need to be not only accurate but also practical, meaning that they return detection results as quickly as possible for a given request. Among existing algorithms for the duplicate detection problem, Siamese neural networks trained with the triplet loss have become one of the robust ways to measure the similarity of two entities (texts, paragraphs, or documents) and thereby identify possible duplicates. In this paper, we first propose a practical framework for building a duplicate detection system in a SaaS platform. Second, we present a new active learning schema for training and updating duplicate detection algorithms. In this schema, we not only allow the crowd to provide more annotated data to enhance the chosen learning model but also use Siamese neural networks with the triplet loss to construct an efficient model for the problem. Finally, we design a user interface for our proposed duplicate detection system, which can easily be applied in practical settings at different companies.
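As a rough illustration of the triplet loss mentioned in the abstract, the sketch below computes one common form of it, max(0, d(a, p) - d(a, n) + margin), over plain Python lists. The abstract does not specify the embedding network, the distance function, or the margin, so the Euclidean distance, the margin of 1.0, the function names, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """One common form of the triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive (a duplicate of the anchor) is closer
    to the anchor than the negative (a distinct entity) by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings standing in for the output of a Siamese encoder.
anchor = [0.0, 0.0]
duplicate = [0.0, 1.0]   # distance 1 from the anchor
distinct = [3.0, 4.0]    # distance 5 from the anchor

well_separated = triplet_loss(anchor, duplicate, distinct)  # 1 - 5 + 1 < 0, clipped to 0
violating = triplet_loss(anchor, distinct, duplicate)       # 5 - 1 + 1 = 5
print(well_separated, violating)
```

In a Siamese setup the three vectors would come from the same encoder with shared weights, and minimizing this loss pulls duplicates together while pushing distinct entities apart in the embedding space.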
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: active learning; datasets; triplet loss; duplicate removal
Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20). Association for Computing Machinery (ACM). ISBN 978-1-4503-7087-5