Unravelling the Interior Evolution of Terrestrial Planets Through Machine Learning Understanding the physical processes that control the evolution and present-day state of terrestrial planets around the Sun (Mercury, Venus, Earth, Moon, and Mars) and in extrasolar systems requires computationally-intensive forward models of the thermal history of the interior. These models are constrained by spacecraft- and telescope-based observations bearing information on interior processes, such as gravity and magnetic fields, surface topography, and composition. Yet the problem of inferring the interior evolution from surface observables is severely underdetermined: a large number of parameters – initial conditions as well as material properties – are poorly known and need to be systematically varied. The application of advanced machine learning (ML) (e.g. deep learning algorithms) can obviate the need to perform extensive explorations of the parameter space, which are often impractical if not impossible. We combine state-of-the-art forward models of the thermal evolution of terrestrial planets with machine learning algorithms with the goal of identifying and constraining the key parameters controlling the planetary evolution. Through this joint approach we will develop an innovative computational framework for the interpretation of the growing amount of data delivered by spacecraft missions to planetary bodies of the solar system and by telescope missions aimed at detecting planets orbiting other stars.
Low-power data analytics for self-localization systems Recent advances in ultra-low-power microcontrollers and FPGAs, together with the possibility of tailoring optimization algorithms and new machine learning techniques to such hardware, make it possible to perform on the edge complex data analytics that were previously only possible on powerful computers. These techniques are especially relevant in applications such as planetary exploration missions, where real-time communication is not available and all computations must occur on board. This project focuses on the following three areas: Development of novel methods for embedded data analytics: Many applications in the space sciences or the Internet of Things require the use of low-power devices. New algorithms for the solution of optimization problems and new machine learning techniques will be developed that are tailored to new hardware architectures. In particular, ultra-low-power microcontrollers and FPGAs will be studied. Low-power and energy-aware data analytics: A co-design of the developed algorithms will be performed by analyzing performance and energy consumption. The goal is to provide optimal tradeoffs between performance and energy consumption, which can be adapted according to the current energy availability in different applications. Self-localization systems: When satellite-based systems are not available, self-localization is critical to any task that requires autonomous decision making, as in planetary exploration missions. The developed methods will be applied and tailored to the challenging tasks usually encountered in self-localization systems for exploration missions.
Online Learning and Decision Making for Real-Time Analytics of Synthetic Aperture Radar (SAR) Data Enabled by recent technological advances, the field of radar remote sensing has entered an era of rapidly growing numbers of wide-swath Synthetic Aperture Radar (SAR) missions with short revisit times (1-6 days), such as Sentinel-1 and the planned Tandem-L and NISAR missions, providing an unprecedented wealth of topography and surface-change time series using the interferometric SAR (InSAR) technique. Such data can be characterized by (i) huge volume and large variety; (ii) complexity and high dimension; (iii) partial unreliability; and (iv) correlation or similarity. Thus, for the retrieval of geophysical signals from InSAR time series, the data should be classified, clustered and cleaned by means of data analytics, in order to reliably detect anomalous changes and improve susceptibility in areas affected by deformation due to natural and man-made hazards. In the state-of-the-art literature on exploiting InSAR time-series data, the separation of geophysical signals from noise artifacts such as atmosphere and decorrelation can be divided into two main parts: (i) data processing for efficient estimation of the phase from long data stacks; and (ii) data analysis and decision. In brief, the main goal of this project is to develop a generic framework for real-time InSAR data analytics using the theory and methods of online machine learning and sequential decision making, and to design efficient algorithmic solutions for the retrieval of geophysical signals from SAR measurements. In particular, the focus is on SAR data from the Sentinel-1 satellite. On the online learning and classification side, the methodology concentrates on online machine learning algorithms, with specific attention given to submodular optimization. On the online decision making side, the basic method is sequential optimization with limited feedback, especially multi-armed bandits.
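To illustrate what sequential optimization with limited feedback means in practice, the following is a minimal sketch of the standard UCB1 multi-armed bandit algorithm. It is not the project's method: the three "arms" with Bernoulli success rates are hypothetical stand-ins (for example, for alternative processing strategies), and the function name `ucb1` is an assumption made for this example.

```python
import math
import random


def ucb1(reward_fns, rounds, seed=0):
    """Run UCB1 over the given arms; return (pull counts, mean rewards)."""
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            # Initialisation: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the highest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Limited feedback: only the reward of the chosen arm is observed.
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts, means


# Hypothetical arms: Bernoulli rewards with success rates 0.2, 0.5 and 0.8.
arms = [lambda rng, p=p: 1.0 if rng.random() < p else 0.0 for p in (0.2, 0.5, 0.8)]
counts, means = ucb1(arms, rounds=2000)
print(counts)
```

The learner never sees the rewards of the arms it did not pull, yet the pull counts concentrate on the arm with the highest success rate as the confidence bounds of the other arms shrink.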
On-board Image Classification based on Space-Based FPGA Processing A general trend in remote sensing is the simultaneous increase in the number of spectral bands and the geometric resolution. Data rates and data volumes approach the physical limits of onboard memory and downlink data rates to Earth. However, the feasibility of much more expensive and complex calculations directly on the satellite has already been demonstrated. Application areas beyond the early detection of fires include, for instance, the situation description after a hurricane or earthquake. For disaster and security research applications, short-term visual and radar-derived information is required to describe the situation for rescue workers and relevant services. Reconfigurable logic on FPGAs is a promising direction for low-latency, real-time, high-volume data processing (also) in space. The goal of the thesis is to bring FPGA-based in-satellite data processing solutions to representative real-time applications.
Enhanced Computational Approaches for Seismic Risk Assessment of Infrastructure Networks In many regions of the world earthquakes pose a persistent threat to the built environment, especially with respect to the civil infrastructures that are now fundamental to our society. In the aftermath of recent earthquakes, such as the 2010‐2011 Christchurch (New Zealand) events, damage to road, railway and utility/communications networks may be the dominant contributor to economic loss, with socio‐economic impacts that can last for a long period after the event and impede the recovery. The importance of analysing the seismic risk and vulnerability of spatially distributed infrastructure networks is becoming widely recognized by engineers, insurers and the scientific community at large. Such analyses present a challenge to scientists and engineers due to the complex interactions between interconnected elements within the infrastructure. The statistical models require a computational complexity so large as to prohibit the real‐time assessment of the post‐event network state. Conversely, simplified models may fail to capture the correlations and dependencies within a system in its entirety. In this project we introduce novel machine learning techniques into this process to provide statistically robust assessments of the performance of a network, in terms of both connectivity and flow, that would allow for rapid evaluation of the impact of an event for use in the immediate aftermath and recovery phase, or as part of a probabilistic assessment of economic loss.
An unsupervised census of astrophysical transients in the universe The Universe holds several avenues for the (catastrophic) end of stars. These include their gravitational collapse to a neutron star, resulting in a so-called core-collapse supernova, stars being swallowed by the central black hole of a galaxy, as well as kilonovae, the result of two merging neutron stars, recently detected for the first time through the electromagnetic follow-up of a gravitational wave event. The diversity of energetic and explosive events serves as a laboratory for fundamental physics that is explored through increasingly powerful observational facilities. With the start of the Zwicky Transient Facility (ZTF), the detection rate of time-variable phenomena in the Universe will increase by a factor of 10 compared to existing surveys, far beyond what can be manually examined by astronomers. This PhD project focuses on developing new data management and machine learning approaches that will allow the scalable analysis of ZTF data through the implementation of a flexible and scalable data infrastructure for classifying new transients. As the computing resources needed for this kind of computation will vary, they also need to be managed in an elastic manner, which calls for new monitoring and resource management strategies.
Fast assessment of earthquakes Earthquakes emit two basic types of waves: fast travelling but less energetic P waves (pressure waves, i.e. acoustic), and slower travelling, more energetic S waves (shear waves, i.e. transverse) and surface waves (also mostly supported by shear motion). P waves arrive first, but all the shaking damage is caused by the later arriving shear and surface waves. Early warning works by recording the P waves, ideally close to the source, locating the earthquake and determining its magnitude based on these, and thus warning of the impending damaging S waves up to 10-20 s before they arrive (depending on the geological situation). This works quite well for earthquakes with magnitude up to M~6.5, and algorithms based on a very small number of waveform features do quite well at quickly estimating earthquake size. For larger earthquakes the total rupture time of the earthquake becomes comparable to or longer than the typical warning times, meaning that the first damaging waves arrive while the earthquake itself is still progressing, which makes it very difficult to set the proper alarm level. In addition, some earthquakes have a slow start before suddenly growing large.
The research question studied in this project is whether the ultimate size (rupture duration) of an earthquake can be predicted based on the initial few seconds of the P wave and by taking into account supplementary data about the environment in which the earthquake happens and where the station is located. A related and also important research goal concerns a fundamental question of earthquake physics: Is the whole fault interface in a preparatory condition before a great earthquake, e.g., due to an accelerating creeping motion (nucleation model), or is the growth of a small earthquake into a large earthquake ultimately a stochastic phenomenon (cascade model)? Both questions can be studied based on data available openly or at the GFZ. Earthquakes with M>~6 have a sufficient duration that it is meaningful to discuss the progression of the rupture. Such earthquakes occur globally approximately every 3 days, with data from dense global networks available for at least the last 15 years. Where borehole sensors are available, much smaller events can be examined.
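The early-warning principle described for this project can be made concrete with a small calculation. The sketch below assumes straight-line travel paths, constant wave speeds, and a fixed processing delay; the default values (6.0 km/s for P waves, 3.5 km/s for S waves, 3 s delay) and the distances in the example are illustrative assumptions, not values from the project.

```python
def warning_time_s(target_km, station_km, vp_km_s=6.0, vs_km_s=3.5, delay_s=3.0):
    """Seconds between the alert being issued and the S-wave arrival at the target.

    target_km  -- distance from the source to the site to be warned
    station_km -- distance from the source to the station that records the P wave
    delay_s    -- assumed time needed to locate the event and estimate its size
    """
    alert_time = station_km / vp_km_s + delay_s  # P wave reaches station, then processing
    s_arrival = target_km / vs_km_s              # damaging S wave reaches the target
    return s_arrival - alert_time


# Illustrative case: station 12 km and target site 70 km from the source.
lead = warning_time_s(target_km=70.0, station_km=12.0)
print(f"warning time: {lead:.1f} s")
```

With these assumed numbers the lead time falls inside the 10-20 s range quoted above. A negative result would mean the site lies in the "blind zone" where the S wave arrives before any alert can be issued.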
Arctic Environmental Data Analytics The goal of this PhD is to detect past ecosystem-climate relationships in Arctic lake settings through big data analytics of a polar proxy dataset. We focus on two topics: Data management and data science - development of a data analytics system for palaeolimnological proxy data designed for multivariate statistics. Geoscience - past and present environmental dynamics in Arctic landscapes and their impact on polar lake ecosystems. A unique, standardized data set of proxy data from lake sediment cores in the Eastern Arctic will be compiled using the new PALIM Database. To correlate ecosystem changes with climate changes, multivariate statistics will be performed on quality-controlled biotic and abiotic proxy data. The objective of this project is to develop a state-of-the-art data analytics system that makes it possible to detect the main relationships between ecosystem dynamics and climate changes, and their spatiotemporal patterns as a function of lake attributes, i.e. thermokarst or glacial origin, landscape type, lake ecosystem type, lake age, and catchment vegetation.
End-to-End Management of Experimental Data Science on Biomedical Molecular Data Developing and applying data science methods typically involves specifying and executing complex data processing and analysis pipelines, comprising pre-processing steps, model building, and evaluation. Heterogeneous data sources and systems for executing such pipelines can introduce complex dependencies on data or even processing architectures. When training a neural network, for example, sample transformations and preprocessing steps may be carried out with custom scripts, while the actual training may be executed on state-of-the-art systems such as TensorFlow or MXNet, or scalable systems such as Spark and Flink. In order to simplify and automate the data analysis process, including interactive and iterative data selection or hyperparameter tuning, it is imperative to specify such pipelines declaratively and map them to potentially changing target systems and data sets. A declarative specification could enable automation and reproducibility of a data analysis process, and even help with detecting and validating properties of responsible data management, such as fairness, transparency, or diversity. Current data analysis pipelines lack holistic declarative end-to-end specifications, which prevents automatic reproducibility, comparability, re-use of previous results and models, and testing of experiments for properties of responsible data management. Training performance and prediction quality critically depend on configuration parameters and hyperparameters, but metadata, lineage information, and results of experiments are not systematically tracked and stored in a structured manner. Rather, these parameters are determined ad hoc, or by using heuristics or explorative grid search for each pipeline anew.
In order to overcome these deficiencies and challenges, we propose the introduction of truly declarative specifications of such pipelines and the creation of a repository of declarative descriptions of machine learning experiments and their corresponding evaluation data in an experiment database. We further plan to research and evaluate optimization and automation of the data science process, both in multi-tenant environments and in the continuous deployment of machine learning pipelines.
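As a rough illustration of what a declarative pipeline specification stored in an experiment database could look like, the sketch below describes a pipeline purely as data and derives a stable fingerprint from it, so that identical configurations map to the same database key. All names (`PipelineSpec`, `record_experiment`, the field names, the dataset and model labels, and the placeholder metric) are hypothetical and not taken from the project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class PipelineSpec:
    """Declarative description of a pipeline: what to run, not how to run it."""
    dataset: str
    preprocessing: tuple
    model: str
    hyperparameters: dict = field(default_factory=dict)

    def fingerprint(self):
        # Canonical JSON (sorted keys) makes the hash independent of dict order.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Stand-in for the experiment database: fingerprint -> spec and results.
experiment_db = {}


def record_experiment(spec, metrics):
    """Store the spec together with its evaluation results."""
    experiment_db[spec.fingerprint()] = {"spec": asdict(spec), "metrics": metrics}


spec_a = PipelineSpec("expression_v1", ("normalize", "log1p"), "mlp",
                      {"lr": 0.01, "layers": 2})
spec_b = PipelineSpec("expression_v1", ("normalize", "log1p"), "mlp",
                      {"layers": 2, "lr": 0.01})  # same content, different key order
record_experiment(spec_a, {"accuracy": None})  # placeholder, no real result
print(spec_a.fingerprint() == spec_b.fingerprint())
```

Because the specification is plain data, an execution layer could in principle map it to different target systems, and previously recorded results can be looked up by fingerprint instead of re-running a grid search for every pipeline.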
Optimizing nanotextured solar cells for realistic weather conditions Currently, perovskite-silicon (pero-Si) tandem solar cells are the most investigated concept to overcome the theoretical limit for the power conversion efficiency of single-junction silicon solar cells, which is 29.4%. Optical simulations are extremely valuable for studying the distribution of light within the solar cells, and make it possible to minimize losses from reflection and parasitic absorption. For monolithic perovskite-silicon solar cells, it is vital that the available light is equally distributed between the two subcells, which is known as current matching. Nanotextures have proven to strongly reduce reflective losses. In this project we investigate how realistic weather conditions affect the performance of pero-Si modules. We study how different light management approaches, such as pyramidal texturing or (sinusoidal) nanotexturing, influence the sensitivity of the solar module to the illumination conditions. In contrast to single-junction silicon solar cells, (two-terminal) tandem solar cells are more sensitive to the spectral distribution of the incident light.
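The role of current matching can be illustrated with a very simple model. In a two-terminal tandem the subcells are connected in series, so to first approximation the device current is limited by the subcell that generates less photocurrent. The sketch below encodes only this approximation; the photocurrent values are made-up illustrative numbers, not simulation results, and the function names are assumptions.

```python
def tandem_current(j_top, j_bottom):
    """Series-connected subcells: the smaller photocurrent limits the device."""
    return min(j_top, j_bottom)


def current_mismatch(j_top, j_bottom):
    """Relative mismatch between the subcell photocurrents (0 = matched)."""
    return abs(j_top - j_bottom) / max(j_top, j_bottom)


# Illustrative photocurrent densities in mA/cm^2 (hypothetical values).
matched = tandem_current(19.5, 19.5)       # current-matched under a reference spectrum
blue_shifted = tandem_current(20.5, 18.0)  # more blue light: top cell gains, bottom loses
print(matched, blue_shifted, round(current_mismatch(20.5, 18.0), 3))
```

In this toy case a spectral shift that moves photocurrent from the bottom to the top subcell lowers the tandem current even though the top cell generates more, which is why the spectral distribution matters more for tandems than for single-junction cells.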
New routines to explore modern genomic data to assess ancient DNA records from the Last Ice Age The loss of megaherbivores, such as the mammoth, at the end of the last ice age more than ten thousand years ago had a strong impact on ecosystems worldwide and still represents an enigma. Our understanding is so restricted because the traditionally explored fossil record, including bones and pollen, is extremely scarce and strongly biased. During the last few years sedimentary ancient DNA (sedaDNA) has proven to be a valuable proxy for tracing the past local presence of organisms even where other fossils are absent. A significant advantage of sedaDNA over traditional palaeoecological methods is that it permits the analysis of past population dynamics within a species of interest from sediments, and not only its presence or absence, which makes it possible to identify the cause of taxa loss. However, to date the potential of ancient DNA analyses has not been fully utilized, although many sediment records have been explored for their ancient DNA content using next-generation sequencing. This is mainly because (1) routines to match the obtained ancient DNA data with modern genomic data are lacking or are computationally too expensive, and (2) suitable markers to identify specific taxa in the ancient DNA record have not yet been developed using the wealth of available genomic data. In this PhD project these shortcomings of state-of-the-art ancient DNA data analyses will be overcome. As a result, available ancient DNA data sets (such as those produced in the “Ancient DNA Lab” at the Alfred Wegener Institute) can be reanalysed and new markers can be developed to enhance our understanding of a major past biodiversity turnover in the Earth’s history. The starting point of the thesis is the recent advances in computing bidirectional distinguishing statistics for collections of DNA. Distinguishing words can form the basis of genetic markers to identify the specific taxa in ancient DNA.
Since the computation can now be done in linear time with the help of succinct data structures, we can use such approaches to develop methods to a) compute markers for sedaDNA and b) analyse the data available at the AWI.
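The idea of distinguishing words as marker candidates can be sketched with fixed-length words (k-mers): a word is a candidate if it occurs in every sequence of the target taxon and in no background sequence. The example below is a naive set-based illustration, not the linear-time approach based on succinct data structures mentioned above; the function names and the toy sequences are assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinguishing_kmers(targets, background, k):
    """k-mers shared by all target sequences and absent from every background sequence."""
    target_sets = [kmers(s, k) for s in targets]
    shared = set.intersection(*target_sets) if target_sets else set()
    seen_in_background = set().union(*(kmers(s, k) for s in background))
    return shared - seen_in_background


# Toy sequences (hypothetical): two target sequences and one background sequence.
targets = ["ACGTAC", "TACGTA"]
background = ["ACGGAC"]
markers = distinguishing_kmers(targets, background, k=3)
print(sorted(markers))  # ['CGT', 'GTA', 'TAC']
```

Materialising all k-mers as Python sets is only workable for small inputs, which is exactly the cost that index-based, linear-time methods are meant to avoid on genome-scale collections.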
Pattern identification and clustering of single cell RNA-sequencing data using concepts from data analytics and network science Single cell RNA-sequencing (scRNA-seq) allows massively parallel acquisition of gene expression profiles in heterogeneous cell populations such as dissociated tissues and organs. The measured single cell transcription profiles can be used to identify cell types, cell sub-types and continuous gene expression gradients, e.g. during developmental or disease processes. However, a key challenge in the analysis of scRNA-seq data is the highly discrete, sparse and variable nature of single-cell mRNA molecule counts. Specifically, high levels of sampling noise and missing data can obscure transcriptional measures of cell type similarity and make it difficult to identify co-regulated groups of genes. Moreover, it is currently not possible to systematically determine the origin of cell types in complex organisms based on single cell data. Extracting such information, however, might have a drastic impact, for example in preventing, diagnosing and treating a variety of physical and mental disorders. In brief, the focus of this project is on developing and adopting new analytical approaches to efficiently use single-cell data to solve the following problems: 1. Finding informative genes that allow clustering of cells and identification of cell types 2. Analysis of co-regulated gene modules 3. Integration of other data types, including lineage barcodes that allow cell origins to be traced
Corpus-wide inference of gene relationships using semantic word representations Current attempts to decipher the molecular basis of cellular processes and human diseases are based on quantitative or qualitative models of the complex interplay between molecules in the cell, for instance in gene regulation, cellular signaling, or metabolism. Obtaining such models in sufficient quality and breadth is a laborious task which today is predominantly based on human experts manually searching and reading the scientific literature with the aim of collecting the many dispersed pieces of knowledge necessary to arrive at a comprehensive picture. This work can be supported by text mining. However, current research in this area focuses on extracting information from isolated sentences, which often produces unsatisfactory results because important contextual information is ignored (such as the experimental evidence for a reported fact, the precise species in which a finding was experimentally observed, the strength of the observed effects, or possible previous treatments of the experimental system with certain drugs). In this PhD project, we follow a radically different approach. We use the entire corpus of available scientific publications (roughly 30 million abstracts, 1.5 million full texts, possibly patents) as the source of inference for single relationships. To this end, a machine learning setup will be designed in which models of valid relationships are learned from all mentions of their constituents and trained on a set of proven relationships. We use that approach to significantly expand the molecular network of several clinically relevant molecular pathways of which the PIs have comprehensive background knowledge, such as the NF-kB signaling pathway, which is critically involved in cell fate decisions and perturbed in a number of diseases including cancer and inflammatory diseases, and the p53 pathway, which is strongly perturbed in cancer.
The central aim of the PhD project is the extension of the currently available restricted pathway models. However, additional directions of expansion will also be investigated, such as the development of cell-type-specific models or the elucidation of cross-talk with other pathways. We also envision using the new method to study connections between signaling pathways and existing targeted cancer therapies, for which patent texts would be extremely useful. Results from such text mining algorithms will be rigorously assessed in terms of their quality and relevance for biomedical research by (a) qualitatively checking the results at the literature level, and (b) quantitatively evaluating the performance of the expanded or improved pathways in typical analysis settings using OMICS data, such as pathway enrichment analysis and predictive power for selected phenotypes. The approach would offer a new way of predicting treatments that ideally can be adapted and specified for subgroups harboring individual combinations of perturbations in the disease-relevant pathways.