Nature. 2014 May 29;509(7502):575-81.
doi: 10.1038/nature13302.

A draft map of the human proteome

Min-Sik Kim, Sneha M Pinto, Derese Getnet, Raja Sekhar Nirujogi, Srikanth S Manda, Raghothama Chaerkady, Anil K Madugundu, Dhanashree S Kelkar, Ruth Isserlin, Shobhit Jain, Joji K Thomas, Babylakshmi Muthusamy, Pamela Leal-Rojas, Praveen Kumar, Nandini A Sahasrabuddhe, Lavanya Balakrishnan, Jayshree Advani, Bijesh George, Santosh Renuse, Lakshmi Dhevi N Selvan, Arun H Patil, Vishalakshi Nanjappa, Aneesha Radhakrishnan, Samarjeet Prasad, Tejaswini Subbannayya, Rajesh Raju, Manish Kumar, Sreelakshmi K Sreenivasamurthy, Arivusudar Marimuthu, Gajanan J Sathe, Sandip Chavan, Keshava K Datta, Yashwanth Subbannayya, Apeksha Sahu, Soujanya D Yelamanchi, Savita Jayaram, Pavithra Rajagopalan, Jyoti Sharma, Krishna R Murthy, Nazia Syed, Renu Goel, Aafaque A Khan, Sartaj Ahmad, Gourav Dey, Keshav Mudgal, Aditi Chatterjee, Tai-Chung Huang, Jun Zhong, Xinyan Wu, Patrick G Shaw, Donald Freed, Muhammad S Zahari, Kanchan K Mukherjee, Subramanian Shankar, Anita Mahadevan, Henry Lam, Christopher J Mitchell, Susarla Krishna Shankar, Parthasarathy Satishchandra, John T Schroeder, Ravi Sirdeshmukh, Anirban Maitra, Steven D Leach, Charles G Drake, Marc K Halushka, T S Keshava Prasad, Ralph H Hruban, Candace L Kerr, Gary D Bader, Christine A Iacobuzio-Donahue, Harsha Gowda, Akhilesh Pandey

Min-Sik Kim et al. Nature. 2014.

Abstract

The availability of the human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not yet exist. Here we present a draft map of the human proteome using high-resolution Fourier-transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples, including 17 adult tissues, 7 fetal tissues and 6 purified primary haematopoietic cells, resulted in the identification of proteins encoded by 17,294 genes, accounting for approximately 84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which include translated pseudogenes, non-coding RNAs and upstream open reading frames. This large human proteome catalogue (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.

Figures

Extended Data Figure 1. Summary of proteome analysis
a, Mass error in parts per million for precursor ions of all identified peptides. b, Number of peptides detected per gene, binned as shown. c, Distribution of sequence coverage of identified proteins. d–f, %FDR with a q value of <0.01 plotted against peptide length in number of amino acids, charge state of the peptide ion and number of cleavage sites missed by the enzyme. P values computed from a two-tailed t-test are shown. Error bars indicate s.d. calculated from FDRs of multiple fetal samples. g–h, A comparison of peptides identified in this study with PeptideAtlas and GPMDB. i, Mass error in parts per million for precursor ions identified from the proteogenomic analysis.
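The mass error in panels a and i is expressed in parts per million. As a minimal sketch of that unit, assuming the usual definition (observed minus theoretical m/z, divided by theoretical m/z, times one million), the calculation could look like the following. The function name and the example m/z values are hypothetical and are not taken from the study.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Return the signed mass error in parts per million (ppm)."""
    if theoretical_mz <= 0:
        raise ValueError("theoretical m/z must be positive")
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical (observed, theoretical) precursor m/z pairs, for illustration only.
precursors = [
    (500.0010, 500.0000),
    (750.3735, 750.3750),
    (1000.0000, 1000.0000),
]

for observed, theoretical in precursors:
    error = ppm_error(observed, theoretical)
    print(f"observed={observed:.4f} theoretical={theoretical:.4f} error={error:+.2f} ppm")
```

A positive value means the observed ion is heavier than expected, a negative value means it is lighter.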
Extended Data Figure 2. Tissue-wise gene expression and housekeeping proteins
a, A heat map shows a partial list of poorly characterized LOC genes. b, The bulk of protein mass is contributed by a small number of genes: only 2,350 ‘housekeeping genes’ account for ∼75% of proteome mass. c, The number of cell/tissue types in which each gene was observed. Some genes were restricted to a few samples, whereas others were observed in the majority of samples analyzed. For example, 1,537 genes were detected in only one sample, and 2,350 genes were found in all samples. The latter set of genes can be defined as highly abundant ‘housekeeping proteins’. d, Distribution of genes in the RefSeq database based on the number of protein isoforms resulting from their annotated transcripts (left). Distribution of the transcripts with two or more annotated protein isoforms based on the number of isoform-specific or shared peptides (right). e, A representative example of sequence coverage of the PSMB8 protein, along with the tissue distribution of all of its identified peptides. The MS/MS spectrum of one of the peptides is shown along with seven SRM transitions.
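The counting described in panel c can be sketched as follows: for each gene, count the samples in which it was detected, then separate genes seen in exactly one sample from genes seen in every sample. This is only an illustration under assumed inputs. The sample names, the placeholder gene names and the helper functions are hypothetical and do not reproduce the study's data or pipeline.

```python
from collections import Counter


def samples_per_gene(detections: dict[str, set[str]]) -> Counter:
    """Count, for each gene, the number of samples in which it was detected."""
    counts: Counter = Counter()
    for genes in detections.values():
        counts.update(set(genes))  # each sample contributes at most once per gene
    return counts


def classify(detections: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (genes seen in exactly one sample, genes seen in all samples)."""
    n_samples = len(detections)
    counts = samples_per_gene(detections)
    single = sorted(g for g, c in counts.items() if c == 1)
    everywhere = sorted(g for g, c in counts.items() if c == n_samples)
    return single, everywhere


# Hypothetical detection sets keyed by sample; placeholder gene names only.
detections = {
    "sample_1": {"GENE_A", "GENE_B", "GENE_C"},
    "sample_2": {"GENE_A", "GENE_B", "GENE_D"},
    "sample_3": {"GENE_A", "GENE_B", "GENE_E"},
}

single, everywhere = classify(detections)
print("restricted to one sample:", single)
print("found in all samples:", everywhere)
```

In this toy input, the genes found in all samples play the role of the 'housekeeping' set described in the caption.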
Extended Data Figure 3. Western blot analysis of select tissue-restricted proteins
a, Eight proteins showing tissue-restricted expression were tested using Western blot analysis in 17 adult tissues. GAPDH was used as a loading control. b, Four proteins were found to be expressed in a broad range of tissues, although bands that do not correspond to the expected molecular weight were also observed. CST, Cell Signaling Technology; SCB, Santa Cruz Biotechnology.
Extended Data Figure 4. Identification of novel genes/ORFs and translated non-coding RNAs
a, An example of a novel ORF in an alternate reading frame located in the 3’ UTR of the CHTF8 gene. The relative abundance of peptides from the CHTF8 protein and the protein encoded by the novel ORF is shown (bottom). b, An example of a translated non-coding RNA identified by searching a three-frame translated transcript database. The MS/MS spectrum of one of the five identified peptides (LEVASSPPVSEAVPR) is shown along with a similar fragmentation pattern observed from the corresponding synthetic peptide.
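To illustrate what a three-frame translation of a transcript involves, the sketch below translates a sequence in each of its three forward reading frames with the standard genetic code and reports the frame in which a given peptide appears. It is a simplified illustration, not the study's search pipeline: the toy transcript is hypothetical, built by hand so that it encodes the peptide named in panel b, and real searches match spectra against digested database entries rather than doing plain string lookups.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int) -> str:
    """Translate seq from offset `frame` (0, 1 or 2); '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq: str) -> dict[int, str]:
    """Return the translation of seq in each of the three forward frames."""
    return {frame: translate(seq, frame) for frame in range(3)}


def find_peptide(seq: str, peptide: str) -> list[tuple[int, int]]:
    """Return (frame, position in the translated sequence) for each frame containing peptide."""
    return [
        (frame, protein.find(peptide))
        for frame, protein in three_frame_translation(seq).items()
        if peptide in protein
    ]


# Hypothetical toy transcript: two leading bases, a start codon, one possible
# encoding of LEVASSPPVSEAVPR, a stop codon and two trailing bases.
peptide_codons = ["CTG", "GAG", "GTG", "GCC", "AGC", "AGC", "CCC", "CCC",
                  "GTG", "AGC", "GAG", "GCC", "GTG", "CCC", "CGG"]
toy_transcript = "GA" + "ATG" + "".join(peptide_codons) + "TAA" + "GC"

print(three_frame_translation(toy_transcript))
print(find_peptide(toy_transcript, "LEVASSPPVSEAVPR"))
```

Only forward frames are translated here because a transcript is already stranded. A six-frame search over genomic DNA would also translate the reverse complement.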
Extended Data Figure 5. Human genome annotation through proteogenomic analysis using GeneSpring
a, Four genome search-specific peptides (GSSPs; red boxes) map to an upstream ORF (denoted as black hashes) located in the 5’ UTR of the SLC35A4 gene (ORF shown as a blue rectangle). b, GSSPs mapping to the intergenic region between two RefSeq-annotated genes, NDUFV3 and PKNOX1. The ORF region of a human endogenous retroviral element (HERV) is depicted with dotted lines. c, GSSPs mapping to an annotated pseudogene, MAGEB6P1. The alignments of the parent gene and pseudogene are shown below the peptides.
Extended Data Figure 6. Frequency of nucleotides surrounding translational start sites
a, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for confirmed translational start sites. b, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for novel translational start sites identified in this study.
Figure 1. Overview of the workflow and comparison of data with public repositories
a, The adult/fetal tissues and hematopoietic cell types that were analyzed to generate a draft map of the normal human proteome are shown. b, The samples were fractionated, digested and analyzed on a high-resolution, high-accuracy Orbitrap mass analyzer as shown. Tandem mass spectrometry data were searched against a known protein database using the SEQUEST and MASCOT database search algorithms.
Figure 2. Landscape of the normal human proteome
a, Tissue-supervised hierarchical clustering reveals the landscape of gene expression across the analyzed cells and tissues. Selected tissue-restricted genes are highlighted in boxes to show some well-studied proteins (black) as well as hypothetical proteins of unknown function (red). The color key indicates the normalized spectral counts per gene detected across the tissues. b, A heat map showing tissue expression of fetal tissue-restricted genes ordered by average expression across fetal tissues (left) and a zoom-in of the top 40 most abundant genes (right). The color key indicates the spectral counts per gene. c, An ROC curve showing a comparison of the performance of the current dataset (blue, area under the curve = 0.762) with 111 individual gene expression datasets (orange) and a composite of the 111 individual datasets (red, area under the curve = 0.692). d, Developmental stage-specific differential expression of protein complexes in fetal and adult liver tissues. The heat map shows protein complexes with less than or equal to half of their subunits expressed in one of the tissue types. The darker the color, the greater the number of expressed subunits.
Figure 3. Isoform-specific expression
a, Exon structure of three known isoforms of FYN (left) along with the abundance of isoform-specific peptides detected in the indicated cells/tissues (right). The color key indicates relative expression based on the spectral counts of isoform-specific peptides detected. b, 20S constitutive proteasome and 20S immunoproteasome core complexes. Expression of their corresponding components is depicted by a heat map (red indicates higher expression) in the Human Proteome Map portal.
Figure 4. Proteogenomic analysis
a, An overview of the multiple databases used in the proteogenomic analysis. A subset of peptides corresponding to genome search-specific peptides were synthesized and analyzed by mass spectrometry. b, Overall summary of the results from the current study.
Figure 5. Translation of pseudogenes and identification of novel N-termini
a, A heat map shows the expression of pseudogenes across the analyzed cells/tissues. Some pseudogenes such as VDAC1P7 and GAPDHP1 were found to be globally expressed while others were more restricted in their expression or were detected only in a single cell/tissue as indicated. b, The distribution of novel N-termini detected with N-terminal acetylation is shown with respect to the location of the annotated translational start site. All sites in the 5’ UTR are labeled upstream while those located downstream of the annotated AUG start sites are labeled as 1st Met, 2nd Met and so on.
