Researchers now publish so many papers that it is impossible to keep up with new discoveries, even within a single field. Biomedical information extraction (IE) encompasses methods that aim to automatically collect biomedical knowledge from the scientific literature. These techniques are considered crucial for efficient access to published results at a scale that keeps pace with scientific progress. IE is essential in database curation, the construction of comprehensive models of pathways and cells, and fields such as Personalised Medicine. A key task for IE is the extraction of relationships between entities, such as drugs or proteins that interact with each other in a pathway or cell. While considerable progress in IE has been made over the past two decades, important deficits remain: almost all techniques have focused on extracting relationships from single sentences or single articles.
Sentence- and article-based methods suffer from severe disadvantages by design. A single record rarely provides enough evidence to establish the biological validity of a relationship, as the experimental evidence might be weak or limited to a very specific context. Moreover, statements in texts may be more speculative than confirmative, and different articles often contradict each other. Experts therefore usually (a) try to acquire a comprehensive picture of the published state of the art for any given question, and (b) draw on information from other sources to make informed decisions about relationships. There is no consensus on the best way to achieve this automatically. A solution will require finding suitable ways to encode the knowledge contained in large collections of texts and designing efficient approaches to integrate different kinds of information (e.g. textual, numerical, categorical and molecular data) that originate from various sources.
This PhD project addresses this question by examining, harnessing and combining multiple information sources, such as the entire corpus of literature available through PubMed and additional knowledge base information, with the aim of improving the extraction of biomedical relationships. Our approach is fundamentally different from traditional approaches: we classify relations on a global, corpus-based level instead of the sentence or article level currently in use. In particular, we want to explore representation learning techniques. Instead of explicitly and manually modelling the connections between biomedical concepts, we will apply methods capable of learning adequate representations for these concepts by exploiting correlations in large collections of (textual) data.
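To make the difference between sentence-level and corpus-level processing concrete, the following is a minimal sketch and not the project's actual method. It assumes a hypothetical toy corpus in which entity mentions are already annotated, and it substitutes simple pooled word counts for learned representations. The names `CORPUS`, `pair_representations` and `cosine` are introduced here for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy corpus: each record is (tokens, entities mentioned in the sentence).
CORPUS = [
    (["aspirin", "inhibits", "cox1", "in", "platelets"], {"aspirin", "cox1"}),
    (["cox1", "activity", "is", "blocked", "by", "aspirin"], {"aspirin", "cox1"}),
    (["ibuprofen", "inhibits", "cox2", "activity"], {"ibuprofen", "cox2"}),
    (["insulin", "binds", "the", "receptor", "insr"], {"insulin", "insr"}),
]


def pair_representations(corpus):
    """Pool context words over every sentence that mentions an entity pair.

    The result holds one vector per entity pair for the whole corpus,
    rather than one vector per sentence.
    """
    reps = defaultdict(Counter)
    for tokens, entities in corpus:
        # Context = all tokens of the sentence that are not entity mentions.
        context = [t for t in tokens if t not in entities]
        for pair in combinations(sorted(entities), 2):
            reps[pair].update(context)
    return dict(reps)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


reps = pair_representations(CORPUS)
drug_target_sim = cosine(reps[("aspirin", "cox1")], reps[("cox2", "ibuprofen")])
unrelated_sim = cosine(reps[("aspirin", "cox1")], reps[("insr", "insulin")])
print(f"similar pairs: {drug_target_sim:.2f}, unrelated pairs: {unrelated_sim:.2f}")
```

In this sketch the representation of the pair (aspirin, cox1) draws on both sentences that mention it, so evidence scattered over several records ends up in one vector. A classifier trained on such pair vectors would decide about a relation once per pair for the whole corpus; a representation learning approach would replace the count vectors with learned embeddings.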
- M. Sänger and U. Leser (2020) Large-scale entity representation learning for biomedical relationship extraction. Bioinformatics, btaa674. https://doi.org/10.1093/bioinformatics/btaa674
- M. Kittner, M. Lamping, D. Rieke, J. Götze, B. Bajwa, I. Jelas, G. Rüter, H. Hautow, M. Sänger, M. Habibi, et al. (2021). Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA Open, 4(2), ooab025. https://doi.org/10.1093/jamiaopen/ooab025
- L. Weber, M. Sänger, J. Münchmeyer, M. Habibi, U. Leser, and A. Akbik (2021) HunFlair: An easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics, btab042. https://doi.org/10.1093/bioinformatics/btab042
- L. Weber, M. Sänger, S. Garda, F. Barth, C. Alt, and U. Leser (2021). Humboldt @ DrugProt: Chemical-protein relation extraction with pretrained transformers and entity descriptions. In Proceedings of the 7th BioCreative Challenge Evaluation Workshop.
- L. Weber, M. Sänger, S. Garda, F. Barth, C. Alt and U. Leser (2022). Chemical-Protein Relation Extraction with Ensembles of Carefully Tuned Pretrained Language Models. Database (accepted).
- L. Weber, M. Sänger, J. Münchmeyer, M. Habibi, U. Leser and A. Akbik (2020) HunFlair: An Easy-to-Use Tool for State-of-the-Art Biomedical Named Entity Recognition arXiv:2008.07347. https://arxiv.org/abs/2008.07347
- J. Seva, M. Sänger and U. Leser. Language-independent ICD-10 Coding using Multi-lingual Embeddings and Recurrent Neural Networks. Oral presentation at CLEF eHealth 2018.
- M. Sänger, L. Weber, M. Kittner and U. Leser. Classifying German Animal Experiment Summaries with Multi-lingual BERT. Oral presentation at CLEF eHealth 2019.
- M. Sänger, L. Weber and U. Leser. WBI at MEDIQA 2021: Summarizing Consumer Health Questions with Generative Transformers. BioNLP Workshop - MEDIQA, 11 June 2021. https://www.aclweb.org/anthology/2021.bionlp-1.9.pdf