Developing and applying data science methods typically involves specifying and executing complex data processing and analysis pipelines, comprising pre-processing steps, model building, and evaluation. Heterogeneous data sources and execution systems can introduce complex dependencies on data or even on processing architectures. When training a neural network, for example, sample transformations and preprocessing steps may be carried out with custom scripts, while the actual training may be executed on state-of-the-art systems such as TensorFlow or MXNet, or on scalable systems such as Spark and Flink.

In order to simplify and automate the data analysis process, including interactive and iterative data selection or hyperparameter tuning, it is imperative to declaratively specify such pipelines and map them to potentially changing target systems and data sets. A declarative specification could enable automation and reproducibility of a data analysis process, and even help detect and validate properties of responsible data management, such as fairness, transparency, or diversity.

Current data analysis pipelines lack holistic declarative end-to-end specifications, which prevents automatic reproducibility, comparability, re-use of previous results and models, and testing of experiments for properties of responsible data management. Training performance and prediction quality critically depend on configuration parameters and hyperparameters, but metadata, lineage information, and experiment results are not systematically tracked and stored in a structured manner. Instead, these parameters are determined ad hoc, or via heuristics or exploratory grid search, anew for each pipeline.
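To make the idea of a declarative, system-independent pipeline specification concrete, the following is a minimal sketch. All class, operator, and backend names here are illustrative assumptions, not part of any existing system: the point is that the logical pipeline is plain data that can be re-targeted to a different execution backend without rewriting the steps.

```python
# Hypothetical sketch of a declarative pipeline specification.
# The logical steps are decoupled from the engine (e.g. TensorFlow, Spark)
# that eventually executes them.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str                               # e.g. "normalize", "train"
    operator: str                           # logical operator, system-independent
    params: dict = field(default_factory=dict)

@dataclass
class Pipeline:
    steps: list
    target: str                             # execution backend, chosen at deployment time

    def compile_for(self, target: str) -> "Pipeline":
        """Re-target the same logical pipeline to another backend."""
        return Pipeline(steps=self.steps, target=target)

pipeline = Pipeline(
    steps=[
        Step("normalize", "scale_features", {"method": "zscore"}),
        Step("train", "fit_classifier", {"model": "logreg", "lr": 0.01}),
        Step("evaluate", "holdout_eval", {"metric": "accuracy"}),
    ],
    target="spark",
)

# The same specification, mapped to a different target system.
retargeted = pipeline.compile_for("tensorflow")
```

Because the specification is ordinary data, it can be stored, diffed, and re-executed, which is what enables reproducibility and comparability across changing target systems.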
In order to overcome these deficiencies and challenges, we propose truly declarative specifications of such pipelines and the creation of a repository of declarative descriptions of machine learning experiments, together with their corresponding evaluation data, in an experiment database. We further plan to research and evaluate the optimization and automation of the data science process, both in multi-tenant environments and in the continuous deployment of machine learning pipelines.
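A hypothetical sketch of what a structured experiment-database entry could capture follows. The function and field names are assumptions for illustration; a real experiment database would persist these records rather than append them to an in-memory list. Hashing the pipeline specification gives a simple lineage key, so runs of the same specification under different hyperparameters can be grouped and compared.

```python
# Hypothetical sketch: structured tracking of experiment metadata,
# hyperparameters, lineage, and results (in-memory store for illustration).
import datetime
import hashlib
import json

def log_experiment(store, pipeline_spec, hyperparams, metrics):
    """Record one run with enough metadata to reproduce and compare it."""
    spec_json = json.dumps(pipeline_spec, sort_keys=True)
    entry = {
        # Deterministic hash of the declarative spec serves as a lineage key.
        "spec_hash": hashlib.sha256(spec_json.encode()).hexdigest(),
        "hyperparams": hyperparams,
        "metrics": metrics,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    store.append(entry)
    return entry

store = []
run = log_experiment(
    store,
    pipeline_spec={"steps": ["normalize", "train", "evaluate"]},
    hyperparams={"learning_rate": 0.01, "batch_size": 32},
    metrics={"accuracy": 0.91},
)
```

Since hyperparameters and results are stored in a structured manner, later runs can query previous entries instead of redoing an exploratory grid search for each pipeline anew.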
- Sergey Redyuk, Volker Markl, and Sebastian Schelter. Towards Unsupervised Data Quality Validation on Dynamic Data. Workshop paper at ETMLP 2020, Copenhagen, Denmark, 30 March 2020.
- Sergey Redyuk, Sebastian Schelter, Tammo Rukat, Volker Markl, and Felix Biessmann. Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. Workshop paper at HILDA'19, Amsterdam, Netherlands, 5 July 2019. doi.org/10.1145/3328519.3329126
- Sergey Redyuk. Automated Documentation of End-to-End Experiments in Data Science. Poster presentation at ICDE'19, Macau, China, 8-11 April 2019. doi.org/10.1109/ICDE.2019.00243