Lecture Series:

Example-driven Data Cleaning

Wednesday, 03.06.2020 · 16:00

Speaker: Ziawasch Abedjan, TU Berlin

Data cleaning is one of the most time-consuming and tedious steps in data-driven work. Typically, it entails identifying erroneous values and correcting them. Effective error detection can significantly improve the subsequent correction step. Research in error detection has produced a variety of approaches, most of which require prior knowledge about the dataset in order to configure them with rules, sensitivity thresholds, or other parameters. Often, these approaches cover only certain types of errors. Recently, novel machine learning techniques have been proposed that treat error detection as a classification task. These approaches still require large amounts of training data, growing with the size of the dataset, to cover the variety of error types it contains. In this talk, I will present our work in progress towards a holistic error detection system that significantly reduces the number of required labels by leveraging label propagation techniques and meta-learning.