Abstract

The PRIDE ( http://www.ebi.ac.uk/pride ) database of protein and peptide identifications was previously described in the NAR Database Special Edition in 2006. Since this publication, the volume of public data in the PRIDE relational database has increased by more than an order of magnitude. Several significant public datasets have been added, including identifications and processed mass spectra generated by the HUPO Brain Proteome Project and the HUPO Liver Proteome Project. The PRIDE software development team has made several significant changes and additions to the user interface and tool set associated with PRIDE. The focus of these changes has been to facilitate the submission process and to improve the mechanisms by which PRIDE can be queried. The PRIDE team has developed a Microsoft Excel workbook that allows the required data to be collated in a series of relatively simple spreadsheets, with automatic generation of PRIDE XML at the end of the process. The ability to query PRIDE has been augmented by the addition of a BioMart interface allowing complex queries to be constructed. Collaboration with groups outside the EBI has been fruitful in extending PRIDE, including an approach to encode iTRAQ quantitative data in PRIDE XML.

INTRODUCTION

The PRIDE database has been developed to provide a standards-compliant repository for mass-spectrometry-based proteomics data comprising identifications of proteins, peptides and post-translational modifications, together with the mass spectra that provide evidence for these identifications. PRIDE has previously been described in both Proteomics ( 1 ) and the 2006 NAR Database Special Edition ( 2 ); the latter should be consulted for a description of the PRIDE data structure. PRIDE has been reviewed by Mead and co-workers as one of the most important proteomics data repositories in the field ( 3 ), and the specific infrastructure in PRIDE to support data privacy and anonymous peer reviewing has been well received by journals ( 4 ).

Several other databases exist for the purpose of capturing and disseminating proteomics data, some of which provide their own data analysis pipelines. The Global Proteome Machine Database (GPMDB) provides data from the GPM servers to support the validation of peptide MS/MS spectra and protein coverage patterns ( 5 ). PeptideAtlas provides a publicly accessible compendium of identifications from MS/MS that have been processed through PeptideProphet to provide a uniform score ( 6 ). The Human Proteinpedia ( http://www.humanproteinpedia.org ) is a portal for community annotation that is used as an addendum to the expert curated Human Protein Reference Database (HPRD) ( 7 ). The Open Proteomics Database (OPD) is a public database of mass spectrometry-based proteomics data ( 8 ). Tranche provides a secure distributed file system that is designed to handle the sharing of massive datasets ( http://www.proteomecommons.org/dev/dfs/ ). PRIDE makes use of Tranche to allow the sharing of massive data files, currently including search engine output files and binary raw data from mass spectrometers that can be accessed via a hyperlink from PRIDE. The public proteomics data repositories are now poised to focus on collaboration and data sharing, through membership of the ProteomExchange consortium ( 9 ). This will allow each repository to share its data in a collaborative fashion while remaining independent and able to focus its efforts as it sees fit.

This article describes the new data available in PRIDE, the infrastructure being developed to support submission to PRIDE and additions to the user interface that are intended to improve the utility of PRIDE as a query and analysis tool.

Some of the new developments in PRIDE address long-standing requirements that have been present since the inception of the PRIDE project, such as the need for more sophisticated and user-configurable queries that can return results in multiple file formats. Other developments, such as the PrideWizard developed at the University of Manchester, address requirements that are becoming increasingly important in proteomics, such as the use of quantitative labeling.

At the same time, proteomics journals are exerting increasing pressure to submit data supporting proteomics journal submissions to public repositories in standard formats ( 10 ).

DATABASE DESCRIPTION

New data content

A significant number of new public datasets are available in PRIDE. The majority of recent PRIDE submissions have accompanied corresponding journal submissions; indeed, submission to PRIDE has regularly been used to support manuscript submissions. The accepted practice is to maintain data privacy until publication of the manuscript. PRIDE supports this by providing a mechanism for private data submission with the option of generating a random username and password that will grant access to the private dataset. This anonymous login can then be passed to peer reviewers, allowing them confidential access to the details of the proteomics experiment supporting the manuscript under review.

At the time of writing, PRIDE comprises 2703 public experiments out of a total of 3185 submitted experiments (85% of the total being public). All of the experiments in PRIDE are organized into projects to improve the accessibility of the data, via both the PRIDE browse page and the PRIDE BioMart.
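The public fraction quoted above follows directly from the two experiment counts. As a quick arithmetic check (the variable names are illustrative only and are not part of any PRIDE tool):

```python
# Counts quoted in the text at the time of writing.
public_experiments = 2703
total_experiments = 3185

# Share of submitted experiments that are public, as a percentage.
public_percent = 100 * public_experiments / total_experiments

# Experiments still held privately, for example pending publication.
private_experiments = total_experiments - public_experiments

print(f"{public_percent:.1f}% public, {private_experiments} private")
```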

The complete set of data in PRIDE covers 25 different species including several model animal and plant species as well as fungi, bacteria and viruses ( Figure 1 ). Direct access to PRIDE data organized by species, tissue, sub-cellular location, disease state and project name can be obtained via the ‘Browse Experiments’ menu item. PRIDE makes use of the NEWT taxonomy ( 11 ) to annotate species and selected ontologies from the OBO Foundry ( http://obofoundry.org/ ) that cover specific domains to annotate other aspects of the sample type. These include the BRENDA tissue ontology ( 12 ), the Cell Type ontology ( 13 ) and the Gene Ontology ( 14 ) to annotate sub-cellular location. Disease state is annotated using the Disease Ontology ( http://diseaseontology.sourceforge.net/#overview ).

Figure 1. PRIDE includes data for 25 different species. This pie chart illustrates the representation of each species in terms of the total number of peptide identifications for each species.

As may be expected, the majority of identifications of proteins and peptides come from human samples, comprising 81% of the peptide identifications in PRIDE (48% of unique peptide identifications in PRIDE). Apart from species-specific studies, PRIDE also contains an interesting dataset describing the protein content of an environmental community sample, the acid mine drainage dataset described in Nature ( 15 ), in which a community genomic dataset was searched to generate strain-specific protein identifications for the resident acidophilic bacterial biofilm.

The inception of the PRIDE project was in part inspired, and indeed funded, by the HUPO Plasma Proteome Project (HPPP). Since the HPPP pilot phase has been completed, two other collaborative HUPO projects have also contributed significant datasets to PRIDE, including mass spectra. The European HUPO Brain Proteome Project (HBPP) ( 16 ) has submitted 5555 protein identifications to PRIDE, based upon 154 132 peptide identifications in both human and mouse brain tissue. The Beijing Proteome Research Center has submitted 32 421 separate protein identifications, based upon 299 869 peptide identifications in liver, as part of their contribution to the international HUPO Liver Proteome Project (HLPP) ( 17 , 18 ).

Part of the work of the PRIDE team has been to contribute to the development of data conversions into PRIDE XML format from commonly used proteomic analysis pipelines. A specific example of this is the large set of data supporting an analysis of the proteome of human cerebrospinal fluid ( 19 ). This dataset includes 890 identified proteins based upon 49 185 peptide identifications. The analysis for this dataset was performed using the Trans-Proteomic Pipeline (TPP) ( 20 ) originally developed by the Institute for Systems Biology. The output formats from the TPP (mzXML for mass spectra, pepXML for peptide identifications and protXML for protein identifications) were parsed and the publicly available PRIDE core API was used to convert this data into PRIDE XML format. This work has since served as the basis for importing other TPP-formatted data into PRIDE.
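The parsing step of such a conversion can be sketched as follows. This is a minimal illustration in Python, not the PRIDE core API (which is a Java library) and not the converter actually used. The inline fragment is a heavily simplified, namespace-free imitation of pepXML, and all spectrum names, peptide sequences and accessions in it are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment in the style of pepXML (assumption:
# real TPP output declares an XML namespace and carries many more attributes).
SAMPLE_PEPXML = """
<msms_pipeline_analysis>
  <msms_run_summary>
    <spectrum_query spectrum="run1.00012.00012.2" assumed_charge="2">
      <search_result>
        <search_hit hit_rank="1" peptide="PEPTIDEK" protein="PROT_A"/>
        <search_hit hit_rank="2" peptide="PEPTLDEK" protein="PROT_B"/>
      </search_result>
    </spectrum_query>
    <spectrum_query spectrum="run1.00047.00047.3" assumed_charge="3">
      <search_result>
        <search_hit hit_rank="1" peptide="SAMPLER" protein="PROT_A"/>
      </search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>
"""


def top_hits(pepxml_text):
    """Return one record per spectrum, keeping only the rank-1 search hit."""
    root = ET.fromstring(pepxml_text)
    records = []
    for query in root.iter("spectrum_query"):
        for hit in query.iter("search_hit"):
            if hit.get("hit_rank") == "1":
                records.append({
                    "spectrum": query.get("spectrum"),
                    "charge": int(query.get("assumed_charge")),
                    "peptide": hit.get("peptide"),
                    "protein": hit.get("protein"),
                })
    return records


def group_by_protein(records):
    """Group peptide sequences under the protein accession they support."""
    proteins = {}
    for rec in records:
        proteins.setdefault(rec["protein"], []).append(rec["peptide"])
    return proteins


identifications = group_by_protein(top_hits(SAMPLE_PEPXML))
```

The grouped structure mirrors the general shape of a PRIDE submission, in which each protein identification is supported by a list of peptide identifications; writing it out as PRIDE XML is a separate step not shown here.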

The large number and variety of bioinformatics data resources available from the EBI can be daunting; however, using and linking these resources can add value to complex datasets. This has been achieved with the recent submission from the Cellzome research team, reporting the interaction of protein kinases with small inhibitory molecules ( 21 ). For this dataset, bi-directional links to the IntAct database of molecular interactions ( http://www.ebi.ac.uk/intact ) have been included from PRIDE ( 22 ). PRIDE also links to ChEBI, the EBI's database of chemical entities of biological interest ( http://www.ebi.ac.uk/chebi/ ).

It is of note that as the quantity of data submitted to PRIDE grows, repeat identifications of the same unique peptide sequence are becoming increasingly frequent. Indeed, at the time of writing, each unique peptide identification in PRIDE is represented an average of 7 times, comprising repeat identifications both within individual experiments and across separate experiments ( Figure 2 ).

Figure 2. This graph illustrates the increasing redundancy of the peptide identifications submitted to PRIDE over the last year, as repeated identifications of the same peptides are performed. The total number of peptide identifications has increased 5.5-fold; however, the number of unique peptide identifications in PRIDE has only doubled.
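The redundancy measure discussed here is simply the total number of peptide identifications divided by the number of distinct peptide sequences. A minimal sketch of that calculation, using invented toy sequences chosen so that the total grows 5.5-fold while the unique count doubles:

```python
from collections import Counter


def redundancy(peptide_identifications):
    """Return (total, unique, mean identifications per unique sequence)."""
    counts = Counter(peptide_identifications)
    total = sum(counts.values())
    unique = len(counts)
    if unique == 0:
        return 0, 0, 0.0
    return total, unique, total / unique


# Toy data only: these sequences are invented for illustration.
before = ["AAK", "AAK", "CCR", "CCR"]           # 4 identifications, 2 unique
after = before + ["DDK", "EEK"] + ["AAK"] * 16  # 22 identifications, 4 unique

total_before, unique_before, mean_before = redundancy(before)
total_after, unique_after, mean_after = redundancy(after)

print("fold change in total:", total_after / total_before)
print("fold change in unique:", unique_after / unique_before)
print("mean identifications per unique peptide:", mean_before, "->", mean_after)
```

When the total grows faster than the unique count, the mean number of identifications per unique peptide rises, which is the trend shown in Figure 2.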

Supporting data submission

The submission format for the PRIDE database is necessarily complex, reflecting the complexity of the domain and technologies used in proteomics. This complexity is compounded by the need to add value to datasets with thorough annotation. As described above and in Ref. ( 2 ), PRIDE makes use of several controlled vocabularies and ontologies to support this annotation in a uniform manner.

Unfortunately, this complexity makes creating a complete submission a difficult task, especially where access to programming support is limited. To mitigate this, several strategies have been employed. For laboratories with good programming support, a comprehensive Java Application Programming Interface (API) can be used to generate PRIDE XML and mzData XML.

For laboratories with more limited bioinformatics resources, two avenues are available. The PRIDE team has expanded to include a data curator who provides direct support to submitting laboratories. This support may be limited to checking XML files that the laboratory has produced, or for a limited number of cases, it is possible for the PRIDE curator to provide direct programming support for the generation of PRIDE XML.

Finally, the PRIDE team has developed an interactive tool that runs in Microsoft Excel. The Proteome Harvest PRIDE Submission Spreadsheet ( http://www.ebi.ac.uk/pride/proteomeharvest ) is an Excel workbook that breaks down the complexity of a complete submission into several relatively simple spreadsheets. The ‘Peptides’ sheet is illustrated in Figure 3 . The workbook makes use of embedded Visual Basic for Applications (VBA) to assist the user in generating a PRIDE XML file directly from the data that they have entered into the spreadsheet. To assist with the problematic step of annotating various parts of the data with appropriate ontology or controlled vocabulary terms, the workbook includes a form giving direct access to the Ontology Lookup Service (OLS) ( 23 ).

Figure 3. A single sheet from the ProteomeHarvest PRIDE Submission Spreadsheet: Peptide Identification Data Entry.

Collaborative development of PRIDE submission tools

A welcome development in the evolution of PRIDE has been the increasing involvement of collaborating groups outside the EBI. A team led by Simon Hubbard from the Faculty of Life Sciences, University of Manchester, has developed a mechanism to allow quantitative proteomic data to be encoded in PRIDE XML and mzData XML ( 24 ) by using cross-referenced controlled vocabulary terms to describe the samples and their relative quantities, illustrating the extensibility of the PRIDE XML format. This mechanism has been successfully demonstrated for iTRAQ labeling, but has the scope to encompass other quantitative techniques. The same team has developed ‘PrideWizard’, a tool that parses mass spectrometry data and Mascot search engine output, converting the collated data to PRIDE XML. This tool is available from http://www.mcisb.org/software/PrideWizard .
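The general idea of carrying quantitative values as controlled vocabulary parameters can be sketched as below. This is an illustration under stated assumptions, not the encoding defined in Ref. ( 24 ): the container element name, the `EXAMPLE` vocabulary label, the accessions and the ratio values are all hypothetical placeholders. Only the `cvParam` attribute pattern (`cvLabel`, `accession`, `name`, `value`) follows the convention used in mzData-style XML.

```python
import xml.etree.ElementTree as ET


def add_cv_param(parent, cv_label, accession, name, value=None):
    """Append a cvParam element carrying a controlled vocabulary term."""
    attrs = {"cvLabel": cv_label, "accession": accession, "name": name}
    if value is not None:
        attrs["value"] = str(value)
    return ET.SubElement(parent, "cvParam", attrs)


# Hypothetical container for annotations attached to one peptide.
additional = ET.Element("additional")

# Invented relative intensities for four iTRAQ reporter channels.
reporter_ratios = {"114": 1.0, "115": 0.52, "116": 1.87, "117": 0.95}

for index, (channel, ratio) in enumerate(reporter_ratios.items(), start=1):
    add_cv_param(
        additional,
        cv_label="EXAMPLE",                  # placeholder vocabulary label
        accession=f"EXAMPLE:{index:07d}",    # placeholder accession
        name=f"iTRAQ reporter {channel} relative intensity",
        value=ratio,
    )

xml_text = ET.tostring(additional, encoding="unicode")
print(xml_text)
```

Because each quantity is an ordinary `cvParam`, no schema change is needed to carry it, which is the sense in which the format is extensible.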

IMPROVING QUERY ACCESS TO PRIDE

Providing query and visualization of the large and complex datasets describing complete proteomics experiments is challenging and problematic. To attempt to meet this challenge, several new facilities have been added to the PRIDE user interface over the last two years and PRIDE has also benefited from the re-engineering of the EBI website, including the new EB-eye search engine that incorporates indexed data from almost every resource at the EBI, including PRIDE.

Resolving the database accession problem

The PRIDE database is populated by submissions of proteomic data from a wide variety of laboratories around the world, each of which selects a protein sequence or genomic sequence database against which searches are performed. The criteria for selecting a database vary with the species and the group concerned, with the consequence that the identifications in PRIDE do not fit under a single sequence accession system. This has proved problematic in that searching PRIDE with an accession from one sequence database will not return results annotated with a different database, even though they may identify the same protein.

The PRIDE team has developed the Protein Identifier Cross-Reference Service (PICR) ( 25 ) ( http://www.ebi.ac.uk/Tools/picr/ ), which is able to map protein sequence identifiers from over 60 different databases via UniParc (the UniProt Archive) ( 26 ). These cross-references are now being included in the PRIDE database, which will enable users to successfully query PRIDE with their favored accession system. The accession mapping task is performed for new data within 24 h and is refreshed for the entire PRIDE database every week. A useful side-effect is that the latest active accession is available for all submitted identifications, irrespective of the time that has passed since submission. The submitted accession is maintained in the PRIDE database. The PRIDE team is now working on including this mapping data in all PRIDE query and reporting mechanisms.
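The effect of such cross-references on querying can be illustrated with a small sketch. It does not call PICR or reproduce its interface: the database names, accessions, sequence keys and the `find_experiments` helper are all hypothetical. It only shows how resolving accessions through a shared sequence-level key lets a query in one accession system find identifications submitted under another.

```python
# Hypothetical cross-reference table: each (database, accession) pair points
# to a shared sequence-level key, in the spirit of mapping through an archive.
XREFS = {
    ("DB_A", "A0001"): "SEQ_1",
    ("DB_B", "B-77"): "SEQ_1",   # same protein sequence as DB_A:A0001
    ("DB_A", "A0002"): "SEQ_2",
}

# Identifications are stored under whichever accession the submitter used.
IDENTIFICATIONS = [
    {"experiment": 1, "db": "DB_A", "accession": "A0001"},
    {"experiment": 2, "db": "DB_B", "accession": "B-77"},
    {"experiment": 3, "db": "DB_A", "accession": "A0002"},
]


def find_experiments(db, accession):
    """Return experiments identifying the same sequence as the query accession."""
    key = XREFS.get((db, accession))
    if key is None:
        return []  # unknown accession: nothing can be resolved
    return sorted(
        ident["experiment"]
        for ident in IDENTIFICATIONS
        if XREFS.get((ident["db"], ident["accession"])) == key
    )


# A query in one accession system also finds data submitted under another.
print(find_experiments("DB_B", "B-77"))
```

Without the mapping, the same query would match only the experiment submitted with the DB_B accession.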

The PRIDE BioMart query interface

The PRIDE database now includes a BioMart interface ( 27 ) that offers several advantages. This query interface can be accessed from the menu item ‘PRIDE BioMart’ on the left of the PRIDE home page or directly at ( http://www.ebi.ac.uk/pride/prideMart.do ).

The PRIDE BioMart provides access to public PRIDE data from a query-optimized data warehouse that is synchronized with the main PRIDE database at regular intervals. The BioMart interface allows simple or complex queries to be built. The user has control over how the data is filtered, to restrict which records are included, and is able to select the attributes, equivalent to columns in a spreadsheet, that are included in the results. This avoids the need to search through a large table of results, much of which may be irrelevant, allowing the user to focus specifically on the information that is important to them.

The user can specify how the results are formatted, choosing from an HTML table displayed in a browser, a comma- or tab-separated values file, or a Microsoft Excel spreadsheet. The results file can be compressed to speed up data retrieval. The latest version of BioMart allows asynchronous data access in which the user specifies an email address to which a link to large result sets can be sent.

The PRIDE BioMart provides programmatic web service access to public PRIDE data with the same flexibility as the web form.
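BioMart web services generally accept a query expressed as a small XML document naming a dataset, its filters and the attributes to return. The sketch below only builds such a document and does not contact any server. The dataset, filter and attribute names are hypothetical placeholders rather than the names exposed by the PRIDE BioMart, and the service endpoint is deliberately left out because it is not given in the text.

```python
import xml.etree.ElementTree as ET


def build_mart_query(dataset, filters, attributes, formatter="TSV"):
    """Build a BioMart-style XML query string (names supplied by the caller)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,   # e.g. TSV or CSV output
        "header": "1",            # include column headers in the result
        "uniqueRows": "1",        # collapse duplicate rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        # Filters restrict which records are included.
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        # Attributes select the columns returned.
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Hypothetical dataset, filter and attribute names, for illustration only.
query_xml = build_mart_query(
    dataset="example_pride_dataset",
    filters={"species_filter": "Homo sapiens"},
    attributes=["experiment_accession", "peptide_sequence"],
)
print(query_xml)
```

Under the usual BioMart convention, a string like this would be sent as the `query` parameter of an HTTP request to the mart service; the filters and attributes correspond to the choices made in the web form.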

The PRIDE BioMart is illustrated with the screen shot in Figure 4 , showing a summary of the results for a customized query.

Figure 4. The PRIDE BioMart: results summary view for a simple query. The filter and display attributes can be seen on the left.

DISCUSSION

The focus of the PRIDE team at the EBI over the last two years has been on improving the ability of proteomics scientists to submit their data to PRIDE and to query PRIDE in more powerful and flexible ways. Progress has been made in both respects with the development of the Proteome Harvest PRIDE Submission Spreadsheet, the support provided by the PRIDE data curator, and the development of new user interface elements such as the PRIDE BioMart. Usage of the BioMart represents ∼50% of the data volume downloaded from PRIDE. As BioMart queries return much more compact and customized result sets, this corresponds to the majority of queries to PRIDE now being made via the BioMart interface.

The Proteome Harvest PRIDE Submission Spreadsheet has been used extensively over the last year, contributing 17 PRIDE experiments including the Acid Mine Drainage environmental dataset described above.

It is recognized, however, that there is still work to be done. The PRIDE team continues to follow closely the development of the HUPO PSI data exchange formats ( http://psidev.info ). The goal of keeping PRIDE compatible with these standards will pay dividends in supporting data submission to PRIDE. The PRIDE query interface still requires further development, including the incorporation of the Dasty2 DAS client ( http://www.ebi.ac.uk/dasty ) to allow graphical visualization of peptide identifications, together with improvements to the main PRIDE query interface and the PRIDE BioMart. An interesting consequence of the incorporation of PICR protein cross-references is that it will be possible to link the PRIDE BioMart directly to other BioMarts, allowing federated queries across these resources. We intend to set up links to the BioMart services offered by Ensembl ( 28 ) and Reactome ( 29 ) over the next few months, potentially with similar links in the reverse direction.

ACKNOWLEDGEMENTS

The PRIDE team would like to thank all data submitters for their contributions. The authors would also like to thank Dr Rolf Apweiler for his support and Dr Matthieu Visser for his valuable suggestions. S.Y.C. was supported by a grant from the Korea Health 21 R&D Project, Ministry of Health & Welfare, Republic of Korea (A030003 to Y.-K. Paik). Biotechnology and Biological Sciences Research Council (BBS/B/17239, BB/E00573X/1). Funding to pay the Open Access publication charges for the article was provided by the European Union, “ProDaC” grant number LSHG-CT-2006-036814.

Conflict of interest statement . None declared.

REFERENCES

1. Martens L, Hermjakob H, Jones P, Adamski M, Taylor C, States D, Gevaert K, Vandekerckhove J, Apweiler R. PRIDE: the proteomics identifications database. Proteomics 2005, vol. 5 (pg. 3537-3545)
2. Jones P, Côté RG, Martens L, Quinn AF, Taylor CF, Derache W, Hermjakob H, Apweiler R. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006, vol. 34 (pg. D659-D663)
3. Mead JA, Shadforth IP, Bessant C. Public proteomic MS repositories and pipelines: available tools and biological applications. Proteomics 2007, vol. 7 (pg. 2769-2786)
4. Editorial. Democratizing proteomics data. Nat. Biotechnol. 2007, vol. 25 (pg. 262)
5. Craig R, Cortens JC, Fenyo D, Beavis RC. Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 2006, vol. 5 (pg. 1843-1849)
6. Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, et al. The PeptideAtlas project. Nucleic Acids Res. 2006, vol. 34 (pg. D655-D658)
7. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, vol. 13 (pg. 2363-2371)
8. Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM. The need for a public proteomics repository. Nat. Biotechnol. 2004, vol. 22 (pg. 471-472)
9. Hermjakob H, Apweiler R. The proteomics identifications database (PRIDE) and the ProteomExchange consortium: making proteomics data accessible. Expert Rev. Proteomics 2006, vol. 3 (pg. 1-3)
10. Editor. Mind the technology gap. Nat. Methods 2007, vol. 4 (pg. 765)
11. Phan IQH, Pilbout SF, Fleischmann W, Bairoch A. NEWT, a new taxonomy portal. Nucleic Acids Res. 2003, vol. 31 (pg. 3822-3823)
12. Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res. 2004, vol. 32 (pg. D431-D433)
13. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol. 2005, vol. 6 (pg. R21)
14. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000, vol. 25 (pg. 25-29)
15. Lo I, Denef VJ, Verberkmoes NC, Shah MB, Goltsman D, DiBartolo G, Tyson GW, Allen EE, Ram RJ, et al. Strain-resolved community proteomics reveals recombining genomes of acidophilic bacteria. Nature 2007, vol. 446 (pg. 537-541)
16. Hamacher M, Apweiler R, Arnold G, Becker A, Blüggel M, Carrette O, Colvis C, Dunn MJ, Fröhlich T, et al. HUPO brain proteome project: summary of the pilot phase and introduction of a comprehensive data reprocessing strategy. Proteomics 2006, vol. 6 (pg. 4890-4898)
17. He F. Human liver proteome project: plan, progress, and perspectives. Mol. Cell. Proteomics 2005, vol. 4 (pg. 1841-1848)
18. Zheng J, Gao X, Beretta L, He F. The human liver proteome project (HLPP) workshop during the 4th HUPO world congress. Proteomics 2006, vol. 6 (pg. 1716-1718)
19. Pan S, Zhu D, Quinn JF, Peskind ER, Montine TJ, Lin B, Goodlett DR, Taylor G, Eng J, et al. A combined dataset of human cerebrospinal fluid proteins identified by multi-dimensional chromatography and tandem mass spectrometry. Proteomics 2007, vol. 7 (pg. 469-473)
20. Keller A, Eng J, Zhang N, Li X, Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, vol. 1 (pg. 2005.0017)
21. Bantscheff M, Eberhard D, Abraham Y, Bastuck S, Boesche M, Hobson S, Mathieson T, Perrin J, Raida M, et al. Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors. Nat. Biotechnol. 2007, vol. 25 (pg. 1035-1044)
22. Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, et al. IntAct – open source resource for molecular interaction data. Nucleic Acids Res. 2007, vol. 35 (pg. D561-D565)
23. Côté RG, Jones P, Apweiler R, Hermjakob H. The ontology lookup service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics 2006, vol. 7 (pg. 97)
24. Siepen JA, Swainston N, Jones AR, Hart SR, Hermjakob H, Jones P, Hubbard SJ. An informatic pipeline for the data capture and submission of quantitative proteomic data using iTRAQ. Proteome Sci. 2007, vol. 5 (pg. 4)
25. Côté RG, Jones P, Martens L, Kerrien S, Reisinger F, Lin Q, Leinonen R, Apweiler R, Hermjakob H. The protein identifier cross-reference (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics 2007, vol. 8 (pg. 401)
26. Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R. UniProt archive. Bioinformatics 2004, vol. 20 (pg. 3236-3237)
27. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 2005, vol. 21 (pg. 3439-3440)
28. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, et al. Ensembl 2007. Nucleic Acids Res. 2007, vol. 35 (pg. D610-D617)
29. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005, vol. 33 (pg. D428-D432)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
