Corpus-wide inference of gene relationships using semantic word representations (2018 - )

Current attempts to decipher the molecular basis of cellular processes and human diseases rely on quantitative or qualitative models of the complex interplay between molecules in the cell, for instance in gene regulation, cellular signaling, or metabolism. Obtaining such models in sufficient quality and breadth is a laborious task that today depends predominantly on human experts manually searching and reading the scientific literature to collect the many dispersed pieces of knowledge needed to arrive at a comprehensive picture. Text mining can support this work. However, current research in this area focuses on extracting information from isolated sentences, which often produces unsatisfactory results because important contextual information is ignored, such as the experimental evidence for a reported fact, the precise species in which a finding was experimentally observed, the strength of the observed effects, or previous treatments of the experimental system (for example with certain drugs). In this PhD project, we follow a radically different approach. We use the entire corpus of available scientific publications (roughly 30 million abstracts, 1.5 million full texts, and possibly patents) as the source of inference for single relationships. To this end, a machine learning setup will be designed in which models of valid relationships are learned from all mentions of their constituents, trained on a set of proven relationships. We use this approach to significantly expand the molecular networks of several clinically relevant pathways of which the PIs have comprehensive background knowledge, such as the NF-kB signaling pathway, which is critically involved in cell fate decisions and perturbed in a number of diseases including cancer and inflammatory diseases, and the p53 pathway, which is strongly perturbed in cancer.
The central aim of the PhD project is the extension of the currently available, restricted pathway models. However, additional directions of expansion will also be investigated, such as the development of cell-type-specific models or the elucidation of cross-talk with other pathways. We also envision using the new method to study connections between signaling pathways and existing targeted cancer therapies, for which patent texts would be extremely useful. Results from such text mining algorithms will be rigorously assessed in terms of their quality and relevance for biomedical research by (a) qualitatively checking the results at the literature level, and (b) quantitatively evaluating the performance of the expanded or improved pathways in typical analysis settings using OMICS data, such as pathway enrichment analysis and predictive power for selected phenotypes. The approach would allow a new way of predicting treatments that ideally can be adapted and specified for subgroups harboring individual combinations of perturbations in the disease-relevant pathways.
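As an illustration of the quantitative evaluation in (b), the following minimal sketch compares an original and an expanded pathway with a one-sided hypergeometric enrichment test. It is not the project's evaluation pipeline: the gene identifiers, the pathway memberships, the list of deregulated genes, and the function names are all hypothetical toy values chosen for the example.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, hits_size, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    X is the number of pathway genes among `hits_size` genes drawn
    without replacement from a universe of `universe_size` genes,
    of which `pathway_size` belong to the pathway.
    """
    max_overlap = min(pathway_size, hits_size)
    total = comb(universe_size, hits_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, hits_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy data (assumption, not taken from the project).
universe = {f"G{i}" for i in range(1, 21)}        # 20 measured genes
original_pathway = {"G1", "G2", "G3", "G4"}       # curated pathway model
expanded_pathway = original_pathway | {"G5", "G6"}  # model after text-mining expansion
deregulated = {"G1", "G2", "G5", "G6", "G17"}     # genes flagged in an OMICS experiment


def score(pathway):
    """Enrichment p-value of the deregulated genes in the given pathway."""
    overlap = len(pathway & deregulated)
    return enrichment_p_value(
        len(universe), len(pathway), len(deregulated), overlap
    )


p_original = score(original_pathway)
p_expanded = score(expanded_pathway)
print(f"original pathway: p = {p_original:.4f}")
print(f"expanded pathway: p = {p_expanded:.4f}")
```

In this toy setting the expanded pathway receives the smaller p-value because the added genes are among the deregulated ones. In a real evaluation the comparison would run over many data sets and be corrected for multiple testing.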
Peer-reviewed Publications (journal or conference)
- L. Weber, J. Münchmeyer, T. Rocktäschel, M. Habibi, and U. Leser (2019). HUNER: Improving biomedical NER with pretraining. Bioinformatics, 36(1), 295-302. 10.1093/bioinformatics/btz528
- L. Weber, P. Minervini, J. Münchmeyer, U. Leser, and T. Rocktäschel (2019). NLProlog: Reasoning with weak unification for question answering in Natural Language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6151-6161. 10.18653/v1/P19-1618
- L. Weber, K. Thobe, O.A.M. Lozano, J. Wolf, and U. Leser (2020). PEDL: Extracting protein-protein associations using deep language models and distant supervision. Bioinformatics, 36(Supplement_1), i490–i498. https://doi.org/10.1093/bioinformatics/btaa430
- W. D. Xing, L. Weber, and U. Leser (2020). Biomedical event extraction as multi-turn question answering. In Proceedings of the 11th Int. Workshop on Health Text Mining and Information Analysis, 88-96. 10.18653/v1/2020.louhi-1.10
- L. Weber, M. Sänger, J. Münchmeyer, M. Habibi, U. Leser, and A. Akbik (2021). HunFlair: An easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics, btab042, https://doi.org/10.1093/bioinformatics/btab042
- L. Weber, M. Sänger, S. Garda, F. Barth, Ch. Alt, and U. Leser (2021). Humboldt @ DrugProt: Chemical-protein relation extraction with pretrained transformers and entity descriptions. In Proceedings of the 7th BioCreative Challenge Evaluation Workshop.
- L. Weber, S. Garda, J. Münchmeyer, and U. Leser (2021). Extend, don’t rebuild: Phrasing conditional graph modification as autoregressive sequence labelling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 1213–1224.
- K. Singh, J. Münchmeyer, L. Weber, U. Leser, and A. Bande (2022). Graph Neural Networks for Learning Molecular Excitation Spectra. J. Chem. Theory Comp., 18(7), 4408-4417. 10.1021/acs.jctc.2c00255
- J.A. Fries, N. Seelam, G. Altay, L. Weber, M. Kang, D. Datta, R. Su, S. Garda, B. Wang, S. Ott, M. Samwald, and W. Kusa (2022). Dataset Debt in Biomedical Language Modeling. In Proceedings of the Workshop on Challenges & Perspectives in Creating Large Language Models, 137-145. https://doi.org/10.18653/v1/2022.bigscience-1.10
- X. Wang, U. Leser, and L. Weber (2022). BEEDS: Large-Scale Biomedical Event Extraction using Distant Supervision and Question Answering. In Proceedings of BioNLP, 298-309. 10.18653/v1/2022.bionlp-1.28
- L. Weber, M. Sänger, S. Garda, F. Barth, Ch. Alt, and U. Leser (2022). Chemical-Protein Relation Extraction with Ensembles of Carefully Tuned Pretrained Language Models. Database, 2022, baac098. https://doi.org/10.1093/database/baac098
Other (presentations at conferences or preprints)
- L. Weber, P. Minervini, J. Münchmeyer, U. Leser, and T. Rocktäschel. NLProlog: Reasoning with weak unification for question answering in Natural Language. (Poster presentation) 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July - 2 August, 2019.
- M. Saenger, L. Weber, and U. Leser. WBI at MEDIQA 2021: Summarizing Consumer Health Questions with Generative Transformers. BioNLP Workshop - MEDIQA, 11 June 2021. https://www.aclweb.org/anthology/2021.bionlp-1.9.pdf
- J.A. Fries, L. Weber, N. Seelam, G. Altay, et al. (2022). BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing. https://arxiv.org/abs/2206.15076 [Preprint]
- H. Laurençon, L. Saulnier, T. Wang, C. Akik, A. V. del Moral, T. Le Scao, ... L. Weber, ... et al. (2022). The BigScience Corpus: A 1.6 TB Composite Multilingual Dataset. https://openreview.net/forum?id=UoEw6KigkUn [Preprint]