Current attempts to decipher the molecular basis of cellular processes and human diseases rely on quantitative or qualitative models of the complex interplay between molecules in the cell, for instance in gene regulation, cellular signaling, or metabolism. Obtaining such models in sufficient quality and breadth is laborious: today, human experts predominantly search and read the scientific literature manually to collect the many dispersed pieces of knowledge needed to arrive at a comprehensive picture. Text mining can support this work. However, current research in this area focuses on extracting information from isolated sentences, which often produces unsatisfactory results because important contextual information is ignored, such as the experimental evidence for a reported fact, the precise species in which a finding was observed, the strength of the observed effects, or previous treatments of the experimental system (for example, with certain drugs). In this PhD project, we follow a radically different approach. We use the entire corpus of available scientific publications (roughly 30 million abstracts, 1.5 million full texts, and possibly patents) as the source of inference for single relationships. To this end, a machine learning setup will be designed in which models of valid relationships are learned from all mentions of their constituents, trained on a set of proven relationships. We will use this approach to significantly expand the molecular networks of several clinically relevant pathways of which the PIs have comprehensive background knowledge, such as the NF-kB signaling pathway, which is critically involved in cell fate decisions and perturbed in a number of diseases including cancer and inflammatory diseases, and the p53 pathway, which is strongly perturbed in cancer.
The central aim of the PhD project is to extend the currently available, restricted pathway models. Additional directions of expansion will also be investigated, such as the development of cell-type-specific models or the elucidation of cross-talk with other pathways. We also envision using the new method to study connections between signaling pathways and existing targeted cancer therapies, for which patent texts would be extremely useful. Results from these text mining algorithms will be rigorously assessed for their quality and relevance to biomedical research by (a) qualitatively checking the results at the literature level, and (b) quantitatively evaluating the performance of the expanded or improved pathways in typical analysis settings using omics data, such as pathway enrichment analysis and predictive power for selected phenotypes. The approach would open a new way of predicting treatments that ideally can be adapted and specified for subgroups harboring individual combinations of perturbations in the disease-relevant pathways.
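To make the quantitative evaluation in (b) more concrete, the following is a minimal sketch of one common form of pathway enrichment analysis, an over-representation test based on the hypergeometric distribution. It is an illustration under stated assumptions, not the evaluation protocol of the project: the function names `enrichment_p_value` and `pathway_enrichment` and the toy gene identifiers are hypothetical and chosen only for this example.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_hits, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    X is the number of pathway genes among n_hits genes drawn without
    replacement from a universe of n_universe genes, of which n_pathway
    belong to the pathway.
    """
    max_overlap = min(n_pathway, n_hits)
    total = comb(n_universe, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def pathway_enrichment(universe, pathway, hits):
    """Over-representation p-value of a pathway among a list of hit genes."""
    universe = set(universe)
    pathway = set(pathway) & universe  # ignore genes outside the universe
    hits = set(hits) & universe
    overlap = pathway & hits
    return enrichment_p_value(
        len(universe), len(pathway), len(hits), len(overlap)
    )


# Toy data (hypothetical identifiers, not real measurements).
universe = [f"gene{i}" for i in range(10)]
original_pathway = ["gene0", "gene1"]
expanded_pathway = ["gene0", "gene1", "gene2", "gene3"]
hits = ["gene1", "gene2", "gene3"]  # e.g. differentially expressed genes

p_original = pathway_enrichment(universe, original_pathway, hits)
p_expanded = pathway_enrichment(universe, expanded_pathway, hits)
print(f"original pathway: p = {p_original:.4f}")
print(f"expanded pathway: p = {p_expanded:.4f}")
```

In this toy setting, the original and the text-mining-expanded version of a pathway are tested against the same list of hit genes, and a lower p-value for the expanded version would indicate that the added members capture more of the observed signal. A real evaluation would additionally need multiple-testing correction and a carefully chosen gene universe.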
- Leon Weber, Jannes Münchmeyer, Tim Rocktäschel, Maryam Habibi and Ulf Leser (2019). HUNER: Improving Biomedical NER with Pretraining. Bioinformatics, 36(1), 295-302. doi:10.1093/bioinformatics/btz528
- Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser and Tim Rocktäschel (2019). NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6151-6161. doi:10.18653/v1/P19-1618
- Leon Weber, K. Thobe, O. A. M. Lozano, Jana Wolf and Ulf Leser (2020). PEDL: Extracting protein-protein associations using deep language models and distant supervision. Int. Conf. on Intelligent Systems in Molecular Biology
- Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser and Tim Rocktäschel. NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language. Poster presentation at the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July - 2 August, 2019.