Developing and applying data science methods typically involves specifying and executing complex data processing and analysis pipelines that comprise preprocessing, model building, and evaluation. Heterogeneous data sources and execution systems can introduce complex dependencies on data or even on processing architectures. When training a neural network, for example, sample transformations and preprocessing steps may be carried out with custom scripts, while the actual training may be executed on state-of-the-art systems such as TensorFlow or MXNet, or on scalable systems such as Spark and Flink. To simplify and automate the data analysis process, including interactive and iterative data selection and hyperparameter tuning, such pipelines need to be specified declaratively and mapped to potentially changing target systems and data sets. A declarative specification could enable automation and reproducibility of a data analysis process, and could even help with detecting and validating properties of responsible data management, such as fairness, transparency, or diversity.
Current data analysis pipelines lack holistic, declarative, end-to-end specifications. This prevents automatic reproducibility, comparability, re-use of previous results and models, and testing of experiments for properties of responsible data management. Training performance and prediction quality depend critically on configuration parameters and hyperparameters, yet metadata, lineage information, and experiment results are not systematically tracked or stored in a structured manner. Instead, these parameters are determined ad hoc, by heuristics, or by exploratory grid search, anew for each pipeline.
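To make the idea of a declarative specification more concrete, the following minimal sketch separates what a pipeline computes from how a target system executes it. All names here (`Step`, `SPEC`, `LOCAL_BACKEND`, `run`) are hypothetical and chosen for illustration; they do not describe the project's actual specification language or any particular system's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One pipeline stage, identified by operation name, not by implementation."""
    op: str
    params: Dict[str, Any] = field(default_factory=dict)


# Hypothetical specification: states what to compute, not how or where.
SPEC: List[Step] = [
    Step("scale", {"factor": 0.5}),
    Step("clip", {"low": 0.0, "high": 1.0}),
    Step("mean"),
]

# A backend maps operation names to concrete implementations.
# A second backend (for example one that delegates to a cluster)
# could be swapped in without touching SPEC.
Backend = Dict[str, Callable[..., Any]]

LOCAL_BACKEND: Backend = {
    "scale": lambda xs, factor: [x * factor for x in xs],
    "clip": lambda xs, low, high: [min(max(x, low), high) for x in xs],
    "mean": lambda xs: sum(xs) / len(xs),
}


def run(spec: List[Step], backend: Backend, data: Any) -> Any:
    """Execute each step of the specification on the chosen backend, in order."""
    result = data
    for step in spec:
        if step.op not in backend:
            raise KeyError(f"backend does not support operation {step.op!r}")
        result = backend[step.op](result, **step.params)
    return result
```

Calling `run(SPEC, LOCAL_BACKEND, [1.0, 2.0, 4.0])` scales, clips, and averages the input. Because the specification is plain data, it can be stored, compared, and re-executed on a different backend, which is the property the text argues for.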
To overcome these deficiencies, we propose introducing truly declarative specifications of such pipelines and creating an experiment database: a repository of declarative descriptions of machine learning experiments together with their corresponding evaluation data. We further plan to research and evaluate the optimization and automation of the data science process, both in multi-tenant environments and in the continuous deployment of machine learning pipelines.
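As a rough illustration of what such an experiment database might look like, the sketch below stores a declarative experiment description together with one evaluation metric in SQLite and looks up earlier results by the content of the description. The schema, the function names (`spec_id`, `open_db`, `record`, `lookup`), and the example fields are assumptions made for this sketch, not the project's design.

```python
import hashlib
import json
import sqlite3
from typing import Optional, Tuple


def spec_id(spec: dict) -> str:
    """Derive a stable identifier from the canonical JSON form of a specification."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the experiment database with a single illustrative table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS experiments ("
        " spec_id TEXT PRIMARY KEY,"
        " spec_json TEXT NOT NULL,"
        " metric_name TEXT NOT NULL,"
        " metric_value REAL NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, spec: dict, metric_name: str,
           metric_value: float) -> str:
    """Store a specification and its evaluation result; return its identifier."""
    sid = spec_id(spec)
    conn.execute(
        "INSERT OR REPLACE INTO experiments VALUES (?, ?, ?, ?)",
        (sid, json.dumps(spec, sort_keys=True), metric_name, metric_value),
    )
    conn.commit()
    return sid


def lookup(conn: sqlite3.Connection, spec: dict) -> Optional[Tuple[str, float]]:
    """Return (metric_name, metric_value) if this exact specification was run before."""
    return conn.execute(
        "SELECT metric_name, metric_value FROM experiments WHERE spec_id = ?",
        (spec_id(spec),),
    ).fetchone()
```

A caller would check `lookup(conn, spec)` before running an experiment and call `record(...)` afterwards. Because the identifier is computed from sorted keys, two descriptions that differ only in key order map to the same entry, so an earlier result can be re-used instead of being recomputed.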
- S. Redyuk, Z. Kaoudi, V. Markl, and S. Schelter (2021). Automating data quality validation for dynamic data ingestion. In Proceedings of the International Conference on Extending Database Technology (EDBT). ISBN 978-3-89318-084-4 on OpenProceedings.org.
- S. Baunsgaard, M. Boehm, A. Chaudhary, B. Derakhshan, S. Geißelsöder, P. M. Grulich, M. Hildebrand, K. Innerebner, V. Markl, C. Neubauer, S. Osterburg, O. Ovcharenko, S. Redyuk, T. Rieger, A. R. Mahdiraji, S. B. Wrede, S. Zeuch (2021). ExDRa: Exploratory data science on federated raw data. In Proceedings of the 2021 International Conference on Management of Data, 2450-2463.
- S. Redyuk, S. Schelter, T. Rukat, V. Markl, and F. Biessmann. Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. (Workshop paper and presentation), HILDA’19, Amsterdam, Netherlands, 5 July 2019. doi.org/10.1145/3328519.3329126
- S. Redyuk. Automated Documentation of End-to-End Experiments in Data Science. (Workshop paper and presentation), ICDE’19, Macau, China, 8-11 April 2019. doi.org/10.1109/ICDE.2019.00243
- H.J. Meyer, H. Grunert, T. Waizenegger, L. Woltmann, C. Hartmann, W. Lehner, M. Esmailoghli, S. Redyuk, R. Martinez, Z. Abedjan, and A. Ziehn (2019). Particulate Matter Matters - The Data Science Challenge @ BTW 2019. Datenbank-Spektrum, 19(3), pp. 165-182.
- M. Esmailoghli, S. Redyuk, R. Martinez, Z. Abedjan, T. Rabl, and V. Markl (2019). Explanation of air pollution using external data sources. BTW 2019–Workshopband.
- S. Redyuk, V. Markl, and S. Schelter. Towards Unsupervised Data Quality Validation on Dynamic Data. (Workshop paper and presentation), ETMLP 2020, Copenhagen, Denmark, 30 March 2020. https://www.youtube.com/watch?v=Xhq8X64RA1Q
- S. Redyuk, Z. Kaoudi, V. Markl, and S. Schelter. Automating data quality validation for dynamic data ingestion. (Oral presentation), International Conference on Extending Database Technology (EDBT), Nicosia, Cyprus, 23-26 March 2021. https://www.youtube.com/watch?v=v9IR1zjqAek