Biomedical Language Processing
Automated information extraction with natural language processing (NLP) tools is crucial for analyzing the volume of medical publications, which far exceeds what humans can read. A key challenge for NLP is the variability of medical terminology, especially for new diseases or fields. We present an NLP toolbox with extensive English synonym dictionaries for SARS-CoV-2 (including its variants) that are compatible with dictionary-based NLP tools. It also includes a silver standard corpus generated from these dictionaries and a gold standard corpus of manually annotated PubMed abstracts covering key medical terms. The toolbox, available on GitHub and Zenodo, supports various COVID-19 text analytics tasks, such as creating knowledge graphs and developing text mining tools.
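To illustrate how a synonym dictionary can drive a dictionary-based tagger, and how such a tagger could produce silver standard annotations, here is a minimal Python sketch. It is not the toolbox's implementation: the tiny `SYNONYMS` dictionary and the helper names `build_pattern` and `annotate` are hypothetical and stand in for the much larger published dictionaries and for whichever NLP tool is actually used.

```python
import re

# Hypothetical mini-dictionary (illustration only): concept -> synonyms.
SYNONYMS = {
    "SARS-CoV-2": [
        "SARS-CoV-2",
        "2019-nCoV",
        "severe acute respiratory syndrome coronavirus 2",
    ],
    "COVID-19": ["COVID-19", "coronavirus disease 2019"],
}


def build_pattern(synonyms):
    """Compile one case-insensitive regex covering every synonym."""
    # Map each lowercased synonym back to its concept.
    lookup = {
        term.lower(): concept
        for concept, terms in synonyms.items()
        for term in terms
    }
    # Try longer synonyms first so they win over shorter overlapping ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in alternatives)
    # The lookarounds keep matches from starting or ending inside a word.
    pattern = re.compile(r"(?<!\w)(" + body + r")(?!\w)", re.IGNORECASE)
    return pattern, lookup


def annotate(text, pattern, lookup):
    """Return (start, end, matched text, concept) for every dictionary hit."""
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    pattern, lookup = build_pattern(SYNONYMS)
    sentence = "The novel virus 2019-nCoV causes coronavirus disease 2019."
    for start, end, surface, concept in annotate(sentence, pattern, lookup):
        print(start, end, surface, "->", concept)
```

Running the tagger over a collection of abstracts and storing the resulting spans is one simple way to obtain silver standard annotations, which can then be compared against manually curated gold standard annotations.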
In our group, we have developed AI-based text mining tools that extract and combine information from the accessible medical scientific literature. With these tools, we have analyzed all accessible biomedical literature in PubMed (almost 40 million entries) to extract relevant biomedical entities, e.g. cells, diseases, chemicals, species, and genes/proteins, together with their relations. Based on this, a cell death knowledge graph and similar knowledge graphs for COVID-19 and other diseases are being assembled. The preprints and the associated AI tools are already published.
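As a rough illustration of how extracted entities can be assembled into a knowledge graph, the Python sketch below counts how often two entities are annotated in the same abstract and treats each count as an edge weight. This is an assumption made for the example: co-occurrence is only a simple stand-in for relation extraction, the function name `cooccurrence_edges` is hypothetical, and the two toy "abstracts" are made up for demonstration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents):
    """Count entity pairs that are annotated in the same document.

    `documents` is an iterable of entity lists; each entity is a
    (type, name) tuple such as ("disease", "COVID-19").
    """
    edges = Counter()
    for entities in documents:
        # Deduplicate and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for left, right in combinations(unique, 2):
            edges[(left, right)] += 1
    return edges


if __name__ == "__main__":
    # Toy input (illustration only): entities found in two abstracts.
    annotated_abstracts = [
        [("disease", "COVID-19"), ("species", "SARS-CoV-2")],
        [("disease", "COVID-19"), ("species", "SARS-CoV-2"), ("gene", "ACE2")],
    ]
    for (left, right), weight in cooccurrence_edges(annotated_abstracts).most_common():
        print(left, "--", right, "weight:", weight)
```

The resulting weighted edge list could be loaded into a graph library or database; a real pipeline would typically replace the co-occurrence counts with typed relations produced by a relation extraction model.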