McCarthy, Suzanne, McCarren, Andrew (ORCID: 0000-0002-7297-0984) and Roantree, Mark
(2019)
An automated ETL for online datasets.
In: 23rd Enterprise Computing Conference (EDOC), 28-31 Oct 2019, Paris, France.
While using online datasets for machine learning is commonplace today, the quality of these datasets impacts on the performance
of prediction algorithms. One method for improving the semantics of new data sources is to map these sources to a common
data model or ontology. While semantic and structural heterogeneities must still be resolved, this offers a well-established
approach to providing clean datasets suitable for machine learning and analysis. However, when online data must be used in
close to real time, a method for dynamic Extract-Transform-Load (ETL) of new source data must be developed.
In this work, we present a framework for integrating online and enterprise data sources, in close to real time, to provide
datasets for machine learning and predictive algorithms. An exhaustive evaluation compares a human-built data transformation
process with our system’s machine-generated ETL process, with very favourable results, illustrating the value and impact of
an automated approach.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: ETL; data warehousing; data transformation; data mining; data models