BMC Bioinformatics. 2007 Oct 30;8:423.
doi: 10.1186/1471-2105-8-423.

PubMed related articles: a probabilistic topic-based model for content similarity

Jimmy Lin et al. BMC Bioinformatics. 2007.

Abstract

Background: We present a probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance; rather, our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH in MEDLINE.

Results: The pmra retrieval model was compared against bm25, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of pmra over bm25 in terms of precision.

Conclusion: Our experiments suggest that the pmra model provides an effective ranking algorithm for related article search.
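The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea that "aboutness" can be inferred from term frequencies modeled as Poisson distributions. It assumes a two-Poisson mixture in which a term occurs at rate λ per word in documents that are about the topic (elite) and at rate μ per word otherwise (non-elite), and it applies Bayes' rule. The function names, the length scaling, and the `prior_elite` parameter are assumptions made for illustration, not the authors' implementation.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing k events under a Poisson with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def elite_posterior(k: int, doc_len: int, lam: float, mu: float,
                    prior_elite: float = 0.5) -> float:
    """Posterior probability that a document is 'about' a term's topic.

    k           -- observed count of the term in the document
    doc_len     -- document length in words (assumed to scale the rates)
    lam, mu     -- per-word Poisson rates for elite and non-elite documents
    prior_elite -- hypothetical prior probability of eliteness (an assumption)
    """
    p_elite = prior_elite * poisson_pmf(k, lam * doc_len)
    p_non_elite = (1.0 - prior_elite) * poisson_pmf(k, mu * doc_len)
    return p_elite / (p_elite + p_non_elite)


# Illustrative rates only; a 200-word document is an arbitrary choice.
p_low = elite_posterior(k=1, doc_len=200, lam=0.022, mu=0.013)
p_high = elite_posterior(k=10, doc_len=200, lam=0.022, mu=0.013)
```

Under these assumptions, and with λ greater than μ, a higher term count pushes the posterior toward the elite distribution, which is the behavior the eliteness idea relies on.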

Figures

Figure 1
A typical view in the PubMed search interface showing an abstract in detail. The "Related Links" panel on the right is populated with titles of articles that may be of interest.
Figure 2
P5 for the bm25 model given different settings of the parameters k1 and b. This plot was generated by exhaustively trying all k1 values 0.5 to 3.0 (in 0.1 increments) and b values 0.6 to 1.0 (in 0.05 increments). Notice that except for low values of k1 and b, P5 performance is relatively insensitive to parameter settings.
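The exhaustive sweep described in the Figure 2 caption can be sketched as follows. The grid bounds and step sizes come from the caption; `precision_at_k`, `grid_search`, and the `evaluate` callback are hypothetical names, and P5 is taken here to mean precision over the top five ranked results, which is an assumption about the metric.

```python
def param_grid():
    """All (k1, b) pairs: k1 in 0.5..3.0 step 0.1, b in 0.6..1.0 step 0.05."""
    # Build values from integer steps and round, to avoid float drift.
    k1_values = [round(0.5 + 0.1 * i, 1) for i in range(26)]
    b_values = [round(0.6 + 0.05 * i, 2) for i in range(9)]
    return [(k1, b) for k1 in k1_values for b in b_values]


def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked document ids that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def grid_search(evaluate):
    """Return the (k1, b) pair with the highest score.

    `evaluate` is a caller-supplied function (hypothetical here) that runs
    the ranker with the given parameters and returns a mean P5 value.
    """
    return max(param_grid(), key=lambda params: evaluate(*params))


grid = param_grid()
example_p5 = precision_at_k(["a", "b", "c", "d", "e", "f"], {"a", "c", "f"})
```

Plugging a real ranking run into `evaluate` would reproduce a surface like the one plotted, one point per grid cell.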
Figure 3
P5 for the pmra model given different settings of the parameters λ (Poisson parameter for the elite distribution) and μ (Poisson parameter for the non-elite distribution). Notice that the parameter settings resulting in high P5 values lie along a "ridge" in the parameter space.
Figure 4
Optimal μ (Poisson parameter for the non-elite distribution) for each λ value (Poisson parameter for the elite distribution) in the pmra model. Regression line shows a linear relationship between these two parameters, corresponding to the "ridge" in Figure 3.
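A regression line like the one in Figure 4 can be obtained with ordinary least squares. The sketch below fits μ as a linear function of λ. The data points are made up purely to exercise the code and are not values from the paper, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder (lambda, optimal mu) pairs, NOT results from the paper.
lambdas = [0.010, 0.020, 0.030]
optimal_mus = [0.006, 0.012, 0.018]
slope, intercept = fit_line(lambdas, optimal_mus)

# Interpolated mu for a lambda value that was not in the sweep.
interpolated_mu = slope * 0.025 + intercept
```

With a fitted line, μ can be interpolated for any λ instead of being tuned separately, which is the comparison Figure 5 reports.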
Figure 5
P5 at optimal and interpolated values of μ for each λ in the pmra model. Squares represent optimal μ at each λ, corresponding to the squares in Figure 4. Diamonds represent interpolated μ at each λ, corresponding to the regression line in Figure 4. P5 of the globally optimal parameter setting is shown as the dotted line. The filled square and diamond represent points at which P5 is significantly lower than the globally optimal setting.
Figure 6
Distribution of original ranks for reranked run: pmra (λ = 0.022, μ = 0.013). The bar graph divides the original rank positions into ten bins and tallies the fraction of hits that were brought into the top five by pmra; for example, approximately 80% of the top five pmra results came from the top ten results in the original ranked list. The line graph shows the cumulative distribution.
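The tally behind Figure 6 can be sketched as a histogram over original rank positions plus a running total. The caption states ten bins; the bin width of ten ranks (so the first bin covers original ranks 1 to 10) is an assumption inferred from the caption's example, and the input ranks below are invented for illustration.

```python
def rank_distribution(original_ranks, n_bins=10, bin_width=10):
    """Fraction of hits per original-rank bin, plus the cumulative fraction.

    original_ranks -- 1-based rank each top-five hit held before reranking
    bin_width      -- ranks per bin (assumed to be 10, not stated explicitly)
    """
    counts = [0] * n_bins
    for rank in original_ranks:
        index = (rank - 1) // bin_width
        if 0 <= index < n_bins:  # ignore ranks outside the binned range
            counts[index] += 1
    total = sum(counts)
    fractions = [c / total if total else 0.0 for c in counts]
    cumulative = []
    running = 0.0
    for fraction in fractions:
        running += fraction
        cumulative.append(running)
    return fractions, cumulative


# Invented original ranks for ten reranked hits, for illustration only.
fractions, cumulative = rank_distribution([1, 2, 3, 10, 11, 25, 100, 5, 7, 9])
```

The `fractions` list corresponds to the bar graph and `cumulative` to the line graph.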
