Data quality for web log data using a Hadoop environment
Yang, Qishan and Helfert, Markus (ORCID: 0000-0001-6546-6408)
(2016)
Data quality for web log data using a Hadoop environment.
In: 21st ICIQ 2016, 22-23 Jun 2016, Ciudad Real, Spain.
Solving data quality problems is important for the construction and operation of a data warehouse. This paper is based on the development of a web log warehouse. It proposes a methodology for handling data quality problems during data preprocessing within the log warehouse, and it presents a hierarchical data warehouse architecture suited to saving resources and serving ad hoc requirements. Data preprocessing is carried out using Hadoop together with sub-projects such as Hive and HBase. The paper also compares the Hadoop setup with an Oracle-based architecture.