A modular framework for biomedical concept recognition

doi:10.1186/1471-2105-14-281

BMC Bioinformatics. 2013 Sep 24;14:281.

David Campos¹, Sérgio Matos, José Luís Oliveira

PMID: 24063607
PMCID: PMC3849280
DOI: 10.1186/1471-2105-14-281

David Campos et al. BMC Bioinformatics. 2013.

Affiliation

¹ IEETA/DETI, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal. david.campos@ua.pt.

Abstract

Background: Concept recognition is an essential task in biomedical information extraction that presents several complex and unsolved challenges. Concept recognition solutions are typically developed ad hoc or with general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools.

Results: This article presents Neji, an open source framework optimized for biomedical concept recognition and built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation that supports overlapping concept names and the respective disambiguation techniques. The most popular input and output formats, namely PubMed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, genes and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching with the correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, genes and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast, multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification.
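To make the idea of a modular pipeline with dictionary matching more concrete, the following is a minimal sketch, not Neji's actual API. All names (`PipelineSketch`, `Module`, `Document`, `SentenceSplitter`, `DictionaryMatcher`) and the identifier `DEMO:0001` are hypothetical and chosen for illustration only. The matcher does plain case-insensitive substring search without word-boundary checks, which a real system would need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PipelineSketch {

    /** A recognized concept: character span [start, end) plus an identifier. */
    public static final class Annotation {
        public final int start, end;
        public final String text, id;
        public Annotation(int start, int end, String text, String id) {
            this.start = start; this.end = end; this.text = text; this.id = id;
        }
    }

    /** Shared state handed from module to module (hypothetical structure). */
    public static final class Document {
        public final String text;
        public final List<String> sentences = new ArrayList<>();
        public final List<Annotation> annotations = new ArrayList<>();
        public Document(String text) { this.text = text; }
    }

    /** One processing step; modules run in the order they are listed. */
    public interface Module { void process(Document doc); }

    /** Naive splitter: breaks after '.', '!' or '?' followed by whitespace. */
    public static final class SentenceSplitter implements Module {
        @Override public void process(Document doc) {
            for (String s : doc.text.split("(?<=[.!?])\\s+")) {
                if (!s.trim().isEmpty()) doc.sentences.add(s);
            }
        }
    }

    /** Case-insensitive dictionary lookup over the raw text (name -> identifier). */
    public static final class DictionaryMatcher implements Module {
        private final Map<String, String> dictionary;
        public DictionaryMatcher(Map<String, String> dictionary) { this.dictionary = dictionary; }
        @Override public void process(Document doc) {
            String lower = doc.text.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : dictionary.entrySet()) {
                String name = e.getKey().toLowerCase(Locale.ROOT);
                if (name.isEmpty()) continue; // an empty name would match everywhere
                for (int i = lower.indexOf(name); i >= 0; i = lower.indexOf(name, i + 1)) {
                    int end = i + name.length();
                    doc.annotations.add(new Annotation(i, end, doc.text.substring(i, end), e.getValue()));
                }
            }
        }
    }

    /** Runs every module in sequence over a fresh document. */
    public static Document run(String text, List<Module> modules) {
        Document doc = new Document(text);
        for (Module m : modules) m.process(doc);
        return doc;
    }

    public static void main(String[] args) {
        List<Module> modules = Arrays.asList(
                new SentenceSplitter(),
                new DictionaryMatcher(Collections.singletonMap("BRCA1", "DEMO:0001")));
        Document doc = run("BRCA1 is a gene. Mutations in brca1 matter.", modules);
        System.out.println(doc.sentences.size() + " sentences");
        for (Annotation a : doc.annotations) {
            System.out.println(a.start + "-" + a.end + " " + a.text + " " + a.id);
        }
    }
}
```

Because each step only depends on the `Module` interface, a new recognizer (for example a machine learning tagger) could be added to the list without touching the other steps, which is the kind of extensibility the abstract describes.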

Conclusions: Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.

Figures

**Figure 1**
Spectrum of existing solutions for biomedical concept recognition according to their specificity.

**Figure 2**
Illustration of the processing pipeline and modular architecture of Neji.

**Figure 3**
Interface diagram to model implementation of pipelines and respective modules.

**Figure 4**
Overview of the internal data structure to support processed data.

**Figure 5**
**Illustration of implemented concept tree.** Such structure automatically supports nested and intersected concepts, clearly exposing ambiguity problems (PRGE: Proteins and genes; DISO: Disorders; and ANAT: Anatomy).
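One way such a tree could be built is sketched below. This is an illustrative assumption, not Neji's implementation: the class `ConceptTreeSketch`, the `Node` fields and the rule used here (a span fully contained in another becomes its child, while partially overlapping spans stay siblings and are flagged) are all hypothetical. The example spans are invented character offsets over the phrase "breast cancer susceptibility gene".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ConceptTreeSketch {

    public static final class Node {
        public final int start, end;          // character span [start, end)
        public final String group;            // semantic group, e.g. PRGE, DISO, ANAT
        public final List<Node> children = new ArrayList<>();
        public boolean intersectsSibling;     // partially overlaps a node on the same level

        public Node(int start, int end, String group) {
            this.start = start; this.end = end; this.group = group;
        }
        boolean contains(Node o) { return start <= o.start && o.end <= end; }
        boolean intersects(Node o) { return start < o.end && o.start < end; }
    }

    /** Builds a tree under a root node that covers the whole sentence. */
    public static Node build(int sentenceLength, List<Node> concepts) {
        Node root = new Node(0, sentenceLength, "ROOT");
        List<Node> sorted = new ArrayList<>(concepts);
        // Outer spans first: start ascending, then end descending,
        // so a containing span is always inserted before the spans it contains.
        sorted.sort(Comparator.comparingInt((Node n) -> n.start)
                .thenComparingInt(n -> -n.end));
        for (Node n : sorted) insert(root, n);
        return root;
    }

    private static void insert(Node parent, Node n) {
        // Nested case: descend into the first child that fully contains the span.
        for (Node child : parent.children) {
            if (child.contains(n)) { insert(child, n); return; }
        }
        // Intersected case: keep as sibling and flag both nodes as ambiguous.
        for (Node child : parent.children) {
            if (child.intersects(n)) {
                child.intersectsSibling = true;
                n.intersectsSibling = true;
            }
        }
        parent.children.add(n);
    }

    private static void print(Node node, String indent) {
        System.out.println(indent + node.group + " [" + node.start + "," + node.end + ")"
                + (node.intersectsSibling ? " *intersected" : ""));
        for (Node c : node.children) print(c, indent + "  ");
    }

    public static void main(String[] args) {
        Node root = build(33, Arrays.asList(
                new Node(0, 6, "ANAT"),     // "breast"
                new Node(0, 13, "DISO"),    // "breast cancer"
                new Node(7, 33, "PRGE")));  // "cancer susceptibility gene"
        print(root, "");
    }
}
```

Flagged siblings mark the places where a later disambiguation step would have to choose between competing interpretations; nested children can simply be kept or dropped depending on the application.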

**Figure 6**
Example of the Neji output format.

**Figure 7**
Java code snippets to create a runnable processing pipeline and use it in a batch executor with context.

**Figure 8**
Evaluation results for named entity recognition, considering precision, recall, and F1-measure achieved on the CRAFT corpus, using exact (E), left (L), right (R), shared (S) and overlap (O) name matching. The evaluation covers concept names for species, cells, cellular components, genes and proteins, chemicals, and biological processes and molecular functions.

**Figure 9**
Evaluation results for normalization, considering precision, recall, and F1-measure achieved on the CRAFT corpus, using exact (E), left (L), right (R), shared (S) and overlap (O) name matching and ‘exact’ and ‘contains’ matching of identifiers. The evaluation covers concept names for species, cells, cellular components, genes and proteins, chemicals, and biological processes and molecular functions.

**Figure 10**
**Comparison of precision, recall, and F1-measure results achieved on AnEM and NCBI corpora for named entity recognition, considering exact (E), left (L), right (R), shared (S) and overlap (O) matching.** The various sub-classes from each corpus were merged into a single class, in order to evaluate the general ability to recognize disorder and anatomical concept names.
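The five matching criteria named in the evaluation captions can be sketched as follows. The definitions used here are the commonly used ones and are an assumption, since this page does not spell them out: exact requires both boundaries to agree, left and right require one specific boundary, shared accepts either boundary, and overlap accepts any common character. The class `MatchingSketch`, the scoring convention (precision counts matched predictions, recall counts matched gold spans) and the example spans are hypothetical and do not reproduce the reported results.

```java
import java.util.Arrays;
import java.util.List;

public class MatchingSketch {

    public enum Criterion { EXACT, LEFT, RIGHT, SHARED, OVERLAP }

    /** Character span [start, end). */
    public static final class Span {
        public final int start, end;
        public Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Decides whether a predicted span counts as a match for a gold span. */
    public static boolean matches(Span gold, Span pred, Criterion c) {
        boolean left = gold.start == pred.start;
        boolean right = gold.end == pred.end;
        switch (c) {
            case EXACT:   return left && right;
            case LEFT:    return left;
            case RIGHT:   return right;
            case SHARED:  return left || right;
            case OVERLAP: return gold.start < pred.end && pred.start < gold.end;
            default:      throw new IllegalArgumentException("unknown criterion");
        }
    }

    /** Returns {precision, recall, F1} for one criterion. */
    public static double[] score(List<Span> gold, List<Span> pred, Criterion c) {
        int matchedPred = 0; // predictions that match at least one gold span
        for (Span p : pred) {
            for (Span g : gold) {
                if (matches(g, p, c)) { matchedPred++; break; }
            }
        }
        int matchedGold = 0; // gold spans matched by at least one prediction
        for (Span g : gold) {
            for (Span p : pred) {
                if (matches(g, p, c)) { matchedGold++; break; }
            }
        }
        double precision = pred.isEmpty() ? 0.0 : (double) matchedPred / pred.size();
        double recall = gold.isEmpty() ? 0.0 : (double) matchedGold / gold.size();
        double f1 = (precision + recall == 0.0) ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        // Invented spans: one prediction is too short, one is exact, one is spurious.
        List<Span> gold = Arrays.asList(new Span(0, 13), new Span(20, 25));
        List<Span> pred = Arrays.asList(new Span(0, 6), new Span(20, 25), new Span(30, 35));
        for (Criterion c : Criterion.values()) {
            double[] s = score(gold, pred, c);
            System.out.printf("%-7s P=%.3f R=%.3f F1=%.3f%n", c, s[0], s[1], s[2]);
        }
    }
}
```

Relaxing the criterion from exact to overlap can only keep or raise the scores, which is why results are typically reported across all five criteria to show how much of the error comes from boundary disagreements.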

Cited by

An automatic hypothesis generation for plausible linkage between xanthium and diabetes.
Syafiandini AF, Song G, Ahn Y, Kim H, Song M. Sci Rep. 2022 Oct 20;12(1):17547. doi: 10.1038/s41598-022-20752-0. PMID: 36266295. Free PMC article.
Parallel sequence tagging for concept recognition.
Furrer L, Cornelius J, Rinaldi F. BMC Bioinformatics. 2022 Mar 24;22(Suppl 1):623. doi: 10.1186/s12859-021-04511-y. PMID: 35331131. Free PMC article.
MedTAG: a portable and customizable annotation tool for biomedical documents.
Giachelle F, Irrera O, Silvello G. BMC Med Inform Decis Mak. 2021 Dec 18;21(1):352. doi: 10.1186/s12911-021-01706-4. PMID: 34922517. Free PMC article.
Extraction of Family History Information From Clinical Notes: Deep Learning and Heuristics Approach.
Silva JF, Almeida JR, Matos S. JMIR Med Inform. 2020 Dec 29;8(12):e22898. doi: 10.2196/22898. PMID: 33372893. Free PMC article.
Gold-standard ontology-based anatomical annotation in the CRAFT Corpus.
Bada M, Vasilevsky N, Baumgartner WA, Haendel M, Hunter LE. Database (Oxford). 2017 Jan 1;2017:bax087. doi: 10.1093/database/bax087. PMID: 31725864. Free PMC article.
