J Assoc Inf Sci Technol. 2014 Apr;65(4):765-781.
doi: 10.1002/asi.23063. Epub 2013 Nov 21.

Author Name Disambiguation for PubMed

Wanli Liu et al. J Assoc Inf Sci Technol. 2014 Apr.

Abstract

Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by two factors: name compatibility and probability level. With transitivity violation correction, high-precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold standard data also show the increase in accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
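
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prefix-based name compatibility rule, the average-linkage merge criterion, the 0.9 probability threshold, and all function names and sample data are assumptions made for illustration. The learned similarity model and the transitivity violation correction are not reproduced; pairwise probabilities are simply supplied as input.

```python
from itertools import combinations


def names_compatible(a, b):
    """Assumed rule: same last name, and one first-name string is a
    prefix of the other (so 'W' is compatible with 'Wanli')."""
    (last_a, first_a), (last_b, first_b) = a, b
    if last_a.lower() != last_b.lower():
        return False
    fa, fb = first_a.lower(), first_b.lower()
    return fa.startswith(fb) or fb.startswith(fa)


def cluster_papers(papers, pair_prob, threshold=0.9):
    """Greedy agglomerative clustering of papers by author.

    papers:    dict mapping paper id -> (last_name, first_name)
    pair_prob: dict mapping frozenset({id1, id2}) -> estimated probability
               that both papers share an author (missing pairs count as 0.0)
    threshold: hypothetical probability level required for a merge
    """
    clusters = [{pid} for pid in papers]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            # Factor 1: every cross-cluster name pair must be compatible.
            if not all(names_compatible(papers[p], papers[q])
                       for p in a for q in b):
                continue
            # Factor 2: average pairwise probability must reach the threshold.
            probs = [pair_prob.get(frozenset((p, q)), 0.0)
                     for p in a for q in b]
            score = sum(probs) / len(probs)
            if score >= threshold and (best is None or score > best[0]):
                best = (score, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]   # i < j, so index i stays valid
        del clusters[j]


# Illustrative data only; the names and probabilities are made up.
papers = {1: ("Liu", "W"), 2: ("Liu", "Wanli"),
          3: ("Liu", "Wei"), 4: ("Liu", "W")}
pair_prob = {
    frozenset((1, 2)): 0.95, frozenset((1, 3)): 0.20,
    frozenset((1, 4)): 0.97, frozenset((2, 3)): 0.99,
    frozenset((2, 4)): 0.92, frozenset((3, 4)): 0.10,
}
result = cluster_papers(papers, pair_prob)
print(sorted(sorted(c) for c in result))
```

In this sketch, papers 2 and 3 are never merged despite their high pairwise score, because "Wanli" and "Wei" fail the assumed compatibility check. That mirrors the abstract's stated preference for precision, which accepts more splitting errors in exchange for fewer lumping errors.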

Figures

FIG. 1. User behavior statistics with different number of retrieved citations.
FIG. 2. Workflow of similarity computation and clustering.
FIG. 3. Weight functions of PubMed field features.
FIG. 4. PAV functions of Huber score.
FIG. 5. Coauthor pair proportion and name space size (x-axis shows the floor of natural logarithm of namespace size).
