
Author name disambiguation in MEDLINE

Published: 28 July 2009

    Abstract

    Background: We recently described “Author-ity,” a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE.

    Methods: Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals.

    Results: Pairwise comparisons were computed for all author names on all 15.3 million articles in MEDLINE (2006 baseline), that share last name and first initial, to create Author-ity 2006, a database that has each name on each article assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ∼98.8%. Lumping (putting two different individuals into the same cluster) affects ∼0.5% of clusters, whereas splitting (assigning articles written by the same individual to >1 cluster) affects ∼2% of articles.

    Impact: The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior.

    Availability: The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.
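
    The steps named in the abstract (blocking on last name and first initial, pairwise probabilities, a transitivity check, and agglomerative clustering) can be illustrated with a small Python sketch. It is not the Author-ity implementation: the function names, the 0.5 merge threshold, and the average-link merge rule are assumptions standing in for the paper's maximum likelihood procedure, and the violation test p(a,b) + p(b,c) - 1 > p(a,c) is one plausible reading of "transitivity violation" rather than a definition taken from the abstract. The pairwise probabilities are supplied by the caller; the sketch only detects violations and does not attempt the weighted least squares correction.

```python
from collections import defaultdict
from itertools import combinations


def block_key(last_name, first_name):
    """Blocking key: lowercased last name plus first initial."""
    return (last_name.strip().lower(), first_name.strip().lower()[:1])


def build_blocks(records):
    """Group (article_id, last_name, first_name) tuples by blocking key."""
    blocks = defaultdict(list)
    for article_id, last, first in records:
        blocks[block_key(last, first)].append(article_id)
    return dict(blocks)


def pair_prob(prob, a, b):
    """Look up a symmetric pairwise probability stored under either key order."""
    return prob[(a, b)] if (a, b) in prob else prob[(b, a)]


def transitivity_violations(ids, prob):
    """List triplets (a, b, c) where p(a,b) + p(b,c) - 1 exceeds p(a,c).

    Assumed formulation: if a matches b and b matches c with high
    probability, a and c should not receive a very low probability.
    """
    violations = []
    for a, c in combinations(ids, 2):
        for b in ids:
            if b in (a, c):
                continue
            if pair_prob(prob, a, b) + pair_prob(prob, b, c) - 1.0 > pair_prob(prob, a, c):
                violations.append((a, b, c))
    return violations


def agglomerate(ids, prob, threshold=0.5):
    """Greedy average-link clustering (a simplified stand-in).

    Repeatedly merges the two clusters with the highest mean pairwise
    probability and stops when no pair of clusters exceeds `threshold`.
    """
    clusters = [[i] for i in ids]
    while len(clusters) > 1:
        best_score, best_pair = threshold, None
        for x, y in combinations(range(len(clusters)), 2):
            scores = [pair_prob(prob, a, b) for a in clusters[x] for b in clusters[y]]
            score = sum(scores) / len(scores)
            if score > best_score:
                best_score, best_pair = score, (x, y)
        if best_pair is None:
            break
        x, y = best_pair  # x < y, so deleting y leaves index x valid
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Toy records and made-up probabilities, for illustration only.
    records = [(1, "Smith", "John"), (2, "Smith", "J"), (3, "Smith", "Jane"), (4, "Smyth", "John")]
    prob = {(1, 2): 0.95, (1, 3): 0.05, (2, 3): 0.6}
    for key, ids in build_blocks(records).items():
        if len(ids) > 1:
            print(key, transitivity_violations(ids, prob), agglomerate(ids, prob))
```

    Blocking keeps the number of pairwise comparisons manageable because only names within the same block are ever compared. In the toy data, article 2 looks similar to both 1 and 3 even though 1 and 3 look different, which is the kind of inconsistency a correction step would need to resolve before clustering.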



    Reviews

    Quinsulon Israel

    Solving the difficult problem of author name disambiguation will help greatly with social network analysis and with determining an individual author's "image." Partitioning articles along multiple dimensions (such as name derivation, email address, coauthor identification, self-citations, and research areas) has shown that articles can be accurately attributed to the correct person, even when the individual's name on those articles is derived or incomplete. As a byproduct of this research area, the links between these partitions (individual authors) and the links between the works within these partitions can be studied for further conjectures about an author's collaborative research and relationships.

    Torvik and Smalheiser provide readers with a thorough understanding of the problem, the solutions, and the issues in between, including a clear presentation of their methodology. They use many methods that are becoming fundamental, such as similarity analysis of multidimensional vectors. However, although the paper is very well organized, it suffers greatly from conceptual overload, so that the many statistics and discoveries are lost in the dense text. The paper needs more listings, tables, and diagrams to help readers better conceptualize the reported findings.

    The authors' ingenuity is apparent in their robust and valid dataset. The freely available dataset performs well against other author disambiguation resources. Unfortunately, the authors heavily cite their own work, with few references to other approaches and research. Despite the paper's weaknesses, I still recommend it to those who are interested in author name disambiguation, those who are looking for an excellent dataset, and those who are interested in similarity measures and experimenting with author metadata.

    Online Computing Reviews Service
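
    The "similarity analysis of multidimensional vectors" mentioned in the review can be pictured as comparing two article records feature by feature. The Python sketch below is an illustration only: the record layout (plain dictionaries), the field names, and the choice of counts and 0/1 flags are assumptions, loosely following the features listed in the abstract (title words, journal, coauthors, medical subject headings, language, email), and the step that turns such a profile into a match probability is left out.

```python
def similarity_profile(a, b):
    """Compare two article records (dicts) one feature at a time.

    Returns a small multidimensional profile: overlap counts for
    set-valued fields and 0/1 agreement flags for single-valued fields.
    Field names are hypothetical.
    """
    def shared(key):
        # Number of values the two records have in common for this field.
        return len(set(a.get(key, ())) & set(b.get(key, ())))

    def same(key):
        # 1 only if both records carry the same non-empty value.
        va, vb = a.get(key), b.get(key)
        return int(bool(va) and va == vb)

    return {
        "title_words": shared("title_words"),
        "coauthors": shared("coauthors"),
        "mesh": shared("mesh"),
        "same_journal": same("journal"),
        "same_language": same("language"),
        "same_email": same("email"),
    }


if __name__ == "__main__":
    # Made-up records for illustration.
    first = {
        "title_words": ["author", "name", "medline"],
        "coauthors": ["torvik vi"],
        "mesh": ["MEDLINE", "Algorithms"],
        "journal": "J Biomed Inform",
        "language": "eng",
        "email": "x@example.org",
    }
    second = {
        "title_words": ["author", "disambiguation", "medline"],
        "coauthors": ["torvik vi", "zhou w"],
        "mesh": ["MEDLINE"],
        "journal": "Bioinformatics",
        "language": "eng",
        "email": "",
    }
    print(similarity_profile(first, second))
```

    Treating a missing value as "no agreement" rather than "agreement" keeps sparse records from looking spuriously similar, which matters for fields such as email that are absent from many records.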


    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 3, Issue 3
    July 2009
    122 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/1552303
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 July 2009
    Accepted: 01 March 2009
    Revised: 01 February 2009
    Received: 01 July 2007
    Published in TKDD Volume 3, Issue 3


    Author Tags

    1. Name disambiguation
    2. bibliographic databases

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 83
    • Downloads (Last 6 weeks): 7

    Cited By

    • (2024) Bridging the gap in author names: building an enhanced author name dataset for biomedical literature system. Journal of the American Medical Informatics Association. DOI: 10.1093/jamia/ocae127. Online publication date: 25-Jun-2024
    • (2024) Rethinking the author name ambiguity problem and beyond: The case of the Chinese context. Accountability in Research (1-24). DOI: 10.1080/08989621.2024.2349115. Online publication date: 5-May-2024
    • (2024) The impact of gender diversity on junior versus senior biomedical scientists’ NIH research awards. Nature Biotechnology 42:5 (815-819). DOI: 10.1038/s41587-024-02234-y. Online publication date: 17-May-2024
    • (2024) ANDez: An open-source tool for author name disambiguation using machine learning. SoftwareX 26 (101719). DOI: 10.1016/j.softx.2024.101719. Online publication date: May-2024
    • (2024) Research on scientific knowledge evolution patterns based on ego-centered fine-granularity citation network. Information Processing & Management 61:4 (103766). DOI: 10.1016/j.ipm.2024.103766. Online publication date: Jul-2024
    • (2024) The dual dimension of scientific research experience acquisition and its development: a 40-year analysis of Chinese Humanities and Social Sciences Journals. Scientometrics 129:5 (2827-2853). DOI: 10.1007/s11192-024-05002-6. Online publication date: 1-May-2024
    • (2024) Author name disambiguation literature review with consolidated meta-analytic approach. International Journal on Digital Libraries. DOI: 10.1007/s00799-024-00398-1. Online publication date: 10-Apr-2024
    • (2023) Development and Validation of an Automated Tool to Retrieve and Curate Faculty Publications of Academic Departments. Cureus. DOI: 10.7759/cureus.47976. Online publication date: 30-Oct-2023
    • (2023) Science, interrupted: Funding delays reduce research activity but having more grants helps. PLOS ONE 18:4 (e0280576). DOI: 10.1371/journal.pone.0280576. Online publication date: 26-Apr-2023
    • (2023) Deterministic bibliometric disambiguation challenges in company names. 2023 IEEE 17th International Conference on Semantic Computing (ICSC) (239-243). DOI: 10.1109/ICSC56153.2023.00047. Online publication date: Feb-2023
