Author Name Disambiguation for PubMed
- PMID: 28758138
- PMCID: PMC5530597
- DOI: 10.1002/asi.23063
Abstract
Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores that estimate the probability that a pair of papers belongs to the same author. This probability drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high-precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated by manual verification of random samples of pairs from the clustering results. Compared with a state-of-the-art system, our system lowers the lumping error rate among all pairs from 10.1% to 2.2%, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9% (2.2% + 7.7%), compared with 11.9% (10.1% + 1.8%) for the state-of-the-art method. Other evaluations based on gold-standard data also show an increase in the accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and to the clustering algorithm regulated by a name compatibility scheme that prefers precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
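To make the clustering step in the abstract more concrete, the following is a minimal Python sketch of agglomerative clustering gated by name compatibility and a probability threshold. It is an illustration under stated assumptions, not the system described in the paper: the `name_compatible` rule, the average-link scoring, the `threshold` value, and the toy pairwise probabilities are all hypothetical, and the machine-learned similarity model and the transitivity violation correction are not reproduced.

```python
from itertools import combinations


def name_compatible(a, b):
    """Hypothetical compatibility rule (an assumption, not the paper's scheme):
    same last name, and one first name is a prefix of the other
    (so the initial 'j' is compatible with 'john', but 'john' is not with 'jane')."""
    last_a, first_a = a
    last_b, first_b = b
    if last_a != last_b:
        return False
    return first_a.startswith(first_b) or first_b.startswith(first_a)


def cluster(papers, prob, threshold=0.5):
    """Greedy agglomerative clustering.

    papers: dict mapping paper id -> (last_name, first_name)
    prob:   dict mapping frozenset({id1, id2}) -> estimated probability that
            the two papers share an author (missing pairs count as 0.0)
    Two clusters may merge only if every cross-cluster name pair is compatible
    and their average pairwise probability exceeds `threshold`.
    """
    clusters = [{pid} for pid in papers]
    while True:
        best, best_score = None, threshold
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            # Factor 1: name compatibility across all cross-cluster pairs.
            if not all(name_compatible(papers[p], papers[q]) for p in a for q in b):
                continue
            # Factor 2: probability level (average link, an assumption here).
            total = sum(prob.get(frozenset((p, q)), 0.0) for p in a for q in b)
            score = total / (len(a) * len(b))
            if score > best_score:
                best, best_score = (i, j), score
        if best is None:
            return clusters
        i, j = best  # i < j, so deleting index j leaves index i valid
        clusters[i] |= clusters[j]
        del clusters[j]


# Toy data: four papers by authors named Smith (values are made up).
papers = {1: ("smith", "j"), 2: ("smith", "john"), 3: ("smith", "jane"), 4: ("smith", "j")}
prob = {
    frozenset((1, 2)): 0.9,
    frozenset((1, 3)): 0.2,
    frozenset((2, 3)): 0.8,  # high score, but 'john' vs 'jane' is incompatible
    frozenset((1, 4)): 0.3,
    frozenset((2, 4)): 0.1,
    frozenset((3, 4)): 0.95,
}
result = cluster(papers, prob)
print(sorted(sorted(c) for c in result))
```

In this toy run, papers 2 and 3 are never merged despite their high pairwise score, because the name compatibility gate blocks the pair. That mirrors the precision-first design the abstract describes: fewer lumping errors at the cost of more splitting.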
Figures
![FIG. 1](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/5530597/bin/nihms877886f1.gif)
![FIG. 2](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/5530597/bin/nihms877886f2.gif)
![FIG. 3](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/5530597/bin/nihms877886f3.gif)
![FIG. 4](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/5530597/bin/nihms877886f4.gif)
![FIG. 5](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/5530597/bin/nihms877886f5.gif)
Similar articles
- Aggregating large-scale databases for PubMed author name disambiguation. J Am Med Inform Assoc. 2021 Aug 13;28(9):1919-1927. doi: 10.1093/jamia/ocab095. PMID: 34180522. Free PMC article.
- Data-driven modeling and prediction of blood glucose dynamics: Machine learning applications in type 1 diabetes. Artif Intell Med. 2019 Jul;98:109-134. doi: 10.1016/j.artmed.2019.07.007. Epub 2019 Jul 26. PMID: 31383477. Review.
- A new approach and gold standard toward author disambiguation in MEDLINE. J Am Med Inform Assoc. 2019 Oct 1;26(10):1037-1045. doi: 10.1093/jamia/ocz028. PMID: 30958542. Free PMC article.
- Analyzing Medical Image Search Behavior: Semantics and Prediction of Query Results. J Digit Imaging. 2015 Oct;28(5):537-46. doi: 10.1007/s10278-015-9792-6. PMID: 25810317. Free PMC article. Review.
- Author Name Disambiguation in MEDLINE. ACM Trans Knowl Discov Data. 2009 Jul 1;3(3):11. doi: 10.1145/1552303.1552304. PMID: 20072710. Free PMC article.
Cited by
- Development and Validation of an Automated Tool to Retrieve and Curate Faculty Publications of Academic Departments. Cureus. 2023 Oct 30;15(10):e47976. doi: 10.7759/cureus.47976. eCollection 2023 Oct. PMID: 38034270. Free PMC article.
- Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2024 Jan 5;52(D1):D33-D43. doi: 10.1093/nar/gkad1044. PMID: 37994677. Free PMC article.
- Notes on the data quality of bibliographic records from the MEDLINE database. Database (Oxford). 2023 Nov 4;2023:baad070. doi: 10.1093/database/baad070. PMID: 37935584. Free PMC article.
- O-JMeSH: creating a bilingual English-Japanese controlled vocabulary of MeSH UIDs through machine translation and mutual information. Genomics Inform. 2021 Sep;19(3):e26. doi: 10.5808/gi.21014. Epub 2021 Sep 30. PMID: 34638173. Free PMC article.
- TrendyGenes, a computational pipeline for the detection of literature trends in academia and drug discovery. Sci Rep. 2021 Aug 3;11(1):15747. doi: 10.1038/s41598-021-94897-9. PMID: 34344904. Free PMC article.