Biomedical information extraction (IE) aims to automatically extract biomedical knowledge from the large body of available scientific articles. It is considered a crucial technique for providing efficient access to published research results at a scale that can keep pace with scientific progress. It is important in many fields, such as database curation, the construction of comprehensive models of pathways and cells, and personalised medicine. One of the key tasks in IE is the extraction of relationships between (biomedical) entities, e.g. finding drugs or proteins that interact with each other. The field has been investigated intensively over the last two decades. However, almost all existing work has focused on extracting relationships from single sentences (sentence-based) or from single articles (article-based).
By design, all sentence- and article-based methods suffer from a number of severe disadvantages. First of all, a single record is rarely sufficient to judge the biological correctness of a relation, as experimental evidence might be weak or only valid in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore usually (a) form a comprehensive picture of the published state of the art and (b) include information from other knowledge sources in order to make an informed decision about a relationship. It is an open research question how to do this effectively in an automatic manner. Solving it requires finding suitable ways to encode the knowledge contained in large text corpora and designing efficient approaches to integrate different kinds of information (e.g. textual, numerical, categorical and molecular data) originating from various sources.
In this PhD project we contribute to this research question and examine how to harness and combine multiple information sources, i.e. the entire PubMed corpus and additional knowledge base information, for biomedical relation extraction. We thereby take a fundamentally different route than traditional approaches: we classify relations on a global, corpus-based level instead of on the sentence or article level as in prior art. In particular, we want to explore representation learning techniques, i.e. instead of explicitly modeling the connections between biomedical concepts manually, we will apply methods capable of learning adequate representations for these concepts by exploiting correlations in large (textual) data.
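The difference between sentence-level and corpus-level processing can be illustrated with a minimal sketch. It is not the method developed in the project: simple co-occurrence counts stand in for learned representations, and the toy corpus, the entity set and all function names (`entity_context_vectors`, `cosine`, `pair_features`) are hypothetical and chosen for illustration only. The point is that each entity receives one representation aggregated over every sentence that mentions it, and a relation candidate is described by combining the representations of its two entities.

```python
from collections import Counter
from math import sqrt

# Toy corpus; sentences and entity names are invented for illustration only.
CORPUS = [
    "aspirin inhibits cox1 in platelets",
    "ibuprofen inhibits cox1 and cox2",
    "aspirin reduces fever and pain",
    "ibuprofen reduces fever and inflammation",
    "insulin regulates glucose uptake",
]
ENTITIES = {"aspirin", "ibuprofen", "insulin", "cox1", "cox2", "glucose"}


def entity_context_vectors(corpus, entities):
    """Collect, per entity, the words co-occurring with it across ALL sentences."""
    vectors = {entity: Counter() for entity in entities}
    for sentence in corpus:
        tokens = sentence.split()
        for entity in entities.intersection(tokens):
            # Every other token in the sentence counts as context.
            vectors[entity].update(t for t in tokens if t != entity)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_features(vectors, first, second):
    """Corpus-level features for an entity pair (assumed input to a classifier)."""
    return {"similarity": cosine(vectors[first], vectors[second])}


vectors = entity_context_vectors(CORPUS, ENTITIES)
print(pair_features(vectors, "aspirin", "ibuprofen"))
print(pair_features(vectors, "aspirin", "insulin"))
```

In this sketch the two drugs that share contexts across several sentences obtain a positive similarity, whereas the pair without any shared context obtains zero, although no single sentence mentions both entities of either pair. A learned representation would replace the count vectors, and a trained classifier would replace the single similarity feature.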
1. Mario Sänger, Ulf Leser, Large-scale Entity Representation Learning for Biomedical Relationship Extraction, Bioinformatics, btaa674, https://doi.org/10.1093/bioinformatics/btaa674
1. Leon Weber, Mario Sänger, Jannes Münchmeyer, Maryam Habibi, Ulf Leser, and Alan Akbik (2020). HunFlair: An Easy-to-Use Tool for State-of-the-Art Biomedical Named Entity Recognition. arXiv:2008.07347.
2. Jurica Seva, Mario Sänger, Ulf Leser. Language-independent ICD-10 Coding using Multi-lingual Embeddings and Recurrent Neural Networks. Oral presentation at CLEF eHealth 2018.
3. Mario Sänger, Leon Weber, Madeleine Kittner, Ulf Leser. Classifying German Animal Experiment Summaries with Multi-lingual BERT. Oral presentation at CLEF eHealth 2019.