Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci
- PMID: 31537640
- PMCID: PMC6886504
- DOI: 10.1101/gr.246462.118
Abstract
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.
© 2019 Mudge et al.; Published by Cold Spring Harbor Laboratory Press.
Figures
![Figure 1.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6886504/bin/2073f01.gif)
![Figure 2.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6886504/bin/2073f02.gif)
![Figure 3.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6886504/bin/2073f03.gif)
![Figure 4.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6886504/bin/2073f04.gif)