Confidence comes as standard
21 Feb 2018 by Evoluted New Media
As the era of proteomics becomes firmly established, reliable analytical standards are vital but lacking. A team from Munich has set about changing this by creating the most comprehensive synthetic protein library yet, known as ProteomeTools. Andreas Huhmer catches us up with this monumental effort.
Proteomics has emerged as a critical aspect of biological research given that proteins in a cell represent the functional molecules themselves, whereas genes provide instructions to build the proteins.
Integrating genomics and bioinformatics only goes so far in predicting the functions of a gene. Protein analysis provides an additional layer of information, delving into how things work and not just what they can do. Motivated by the potential that proteomics has to further current functional knowledge, integration of protein analysis into both research and clinical applications is gaining traction. Proteomic approaches will soon be routine in detecting protein biomarkers, identifying disease-specific proteins, or offering insights into disease mechanisms and drug-protein interactions.
Proteomics was developed by combining the aggregation of DNA and protein sequences, improvements in mass spectrometry (MS), and developments in specialised informatics solutions to discover the functional nature of proteins and how they are made. Typical proteomic approaches rely on comparing signature spectra of digested protein fragments – peptides – from mass spectrometry experiments with computationally generated peptide spectra predictions. Essentially, analysis is done by inference, predicting peptide sequences without actually knowing what the sequences are. While this method works well for many proteins, its application is limited with more complex protein mixtures and whole proteome analysis. The high molecular complexity of proteomes due to variation in gene expression, mRNA splicing, and post-translational modification of proteins can interfere with such simple comparative analysis.
Updated methods use higher analytical MS performance and error correction approaches to improve protein detection in complex mixtures, but they have not tipped the scales towards making error-free proteomics routine.
Building confidence
If proteome composition is largely unknown and computational prediction of content compromises sensitivity and specificity, the resulting proteome data sets may not be reliable sources of comparison because they may contain incomplete and potentially erroneous peptides. Reliable analytical standards overcome this challenge by verifying the identity of molecules, and this concept is just starting to catch on in proteomics. With a comprehensive library of synthetic peptides across all families of human proteins, protein discovery using standardised reference peptides could not only be faster but also instil more certainty in results.
This is just what a group at the Technical University of Munich, Germany has set out to do. Led by Professor Bernhard Kuster, the team has embarked on creating the largest, most comprehensive known synthetic protein library as part of a project called ProteomeTools. As is typical in analytical chemistry, reference standards verify a molecule’s identity. By applying this approach to proteomics, Kuster and his team are using mass spectrometry to generate a reference library of 1.4 million synthetic peptides that cover all human proteins. Currently, they have completed the synthesis and liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis of over 330,000 tryptic peptides, comprising all canonical human proteins. By creating synthetic peptide standards for general use, the project will improve on current methods used to study proteins, especially in complex biological samples, and support proteomics research in numerous ways.
Creating the tool
The team is using a batch approach to group peptides into easy-to-analyse collections. The first set of peptides was chosen on two main criteria: either peptides with experimentally determined proteotypicity, meaning the peptide maps uniquely to a single protein, or peptides from proteins with little to no experimental evidence, confined to the range typically optimal for MS. A third subset from the Human SRMAtlas, a resource for selected/multiple reaction monitoring (SRM/MRM)-based proteomic workflows, was combined with these groups, giving a total of 330,286 non-redundant peptides covering 19,840 human genes.
Further peptide sets for future synthesis and analysis have also been determined, and include an estimated 300,000 peptides generated by alternative proteases, 200,000 peptide sequence variants, 350,000 post-translationally modified peptides, and 200,000 additional peptides representing other biologies. These groupings reflect the many ways in which peptide libraries might be used by the scientific community and make up the 1.4 million total peptides expected to be generated and analysed for ProteomeTools.
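As a quick check on these figures, the planned sets can simply be added up. The short Python sketch below uses only the counts quoted in this article; the dictionary labels are illustrative shorthand, not official project names.

```python
# Peptide counts as quoted in the article; the labels are illustrative only.
planned_sets = {
    "tryptic (first set)": 330_286,
    "alternative proteases": 300_000,
    "sequence variants": 200_000,
    "post-translationally modified": 350_000,
    "other biologies": 200_000,
}

# Sum the individual sets and express the total in millions.
total = sum(planned_sets.values())
print(f"{total:,} peptides, or about {total / 1e6:.1f} million")
```

The sum comes to 1,380,286 peptides, which rounds to the 1.4 million quoted for the project as a whole.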
Once the peptides were synthesised, they were combined into pools of 1,000 peptides and spiked with non-naturally occurring isotope-labelled peptides to help with retention time calibration in LC-MS/MS measurements. Each pool was subjected to LC-MS/MS to assess successful synthesis, and determine peptide chromatographic retention times (RT).
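The article does not say how the retention time calibration is computed. One simple possibility, shown here purely as an illustration, is a straight-line least-squares fit between the retention times observed for the spiked reference peptides in a given run and their known reference values. The function names and the numbers in the usage lines are hypothetical.

```python
def fit_line(observed, reference):
    """Least-squares fit of reference = slope * observed + intercept."""
    if len(observed) != len(reference) or len(observed) < 2:
        raise ValueError("need at least two matched calibration points")
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    if sxx == 0:
        raise ValueError("observed retention times must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(rt, slope, intercept):
    """Map a retention time from one run onto the reference scale."""
    return slope * rt + intercept


# Hypothetical retention times (minutes) for three spiked reference peptides.
observed_rt = [10.0, 20.0, 30.0]
reference_rt = [12.0, 22.0, 32.0]
slope, intercept = fit_line(observed_rt, reference_rt)
print(calibrate(25.0, slope, intercept))
```

A real workflow would use more calibration points and might prefer a non-linear or robust fit; the straight line is only the simplest case.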
An inclusion list was generated for each pool to target peptides in subsequent LC-MS/MS experiments to collect fragmentation data for different fragmentation methods and collision energies.
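To make the idea of an inclusion list concrete, the sketch below builds one entry from a peptide's monoisotopic mass, charge state and measured retention time. The precursor m/z formula is the standard one for protonated ions; the field names, the two-minute window and the example peptide with its placeholder mass are assumptions for illustration, not details of the ProteomeTools workflow.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def precursor_mz(monoisotopic_mass, charge):
    """m/z of a peptide carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (monoisotopic_mass + charge * PROTON_MASS) / charge


def inclusion_entry(sequence, monoisotopic_mass, charge, rt_min, rt_window=2.0):
    """Build one inclusion-list entry; rt_window is the full width in minutes."""
    half = rt_window / 2
    return {
        "sequence": sequence,
        "mz": round(precursor_mz(monoisotopic_mass, charge), 4),
        "charge": charge,
        "rt_start": max(0.0, rt_min - half),  # never schedule before time zero
        "rt_end": rt_min + half,
    }


# Hypothetical peptide: the mass is a placeholder, not computed from the sequence.
entry = inclusion_entry("PEPTIDEK", 1000.0, 2, rt_min=25.0)
print(entry)
```

The instrument then only needs to look for each precursor m/z within its scheduled retention time window.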
Bioinformatics was used to estimate peptide yield by measuring how much of the total MS signal can be attributed to the desired product. This also enabled the identification of chemical by-products and truncated peptides present in the mixtures which assisted in verifying the correct full peptides and served to refine RT and fragmentation prediction. The outcome was an observed large library of tandem mass spectra of exquisite quality for more than 200,000 peptides.
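The yield estimate described here boils down to a ratio: the signal attributed to the intended full-length peptide divided by the total signal from all related species. A minimal sketch follows, with a hypothetical function name and made-up intensity values.

```python
def estimate_yield(signals, target="full_length"):
    """Fraction of the total MS signal attributed to the desired product.

    `signals` maps a species label to its summed MS intensity.
    """
    total = sum(signals.values())
    if total <= 0:
        return 0.0
    return signals.get(target, 0.0) / total


# Made-up intensities for one synthesised peptide and its side products.
signals = {"full_length": 80.0, "truncated": 15.0, "by_product": 5.0}
print(f"Estimated yield: {estimate_yield(signals):.0%}")
```

With these illustrative values, 80 of 100 signal units belong to the full-length product, giving an estimated yield of 80%.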
Detailed data, broad reach
Once completed, ProteomeTools will change the way protein detection is approached experimentally by offering an easy verification tool for protein identification. The project is estimated to generate 11.3 million peptide spectrum matches that map to 211,895 peptides, covering each gene in the human genome with nine peptides on average and creating an extensive set of searchable data within the library. This scope of information covers 19,735 of 20,036 human genes, generating extremely high quality reference spectra for 95% of the human proteome and providing reliable established standards to compare with experimental data.
While previous methods for protein identification were able to obtain valuable information for proteins, this resource is a detailed guide, providing researchers with true mass spectral data for every protein in the human proteome and allowing them to confirm peptide identities in simple and complex cases. The resource could provide additional impetus to move towards a spectral matching approach, supplementing data-independent acquisition and leaving behind conventional database searching methods.
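Spectral matching means scoring an experimental spectrum directly against a reference spectrum from a library. The article does not specify a scoring method; a widely used one is the normalised dot product (cosine similarity) of binned peak intensities, sketched below. The function names, the bin width of 1 m/z unit and the example peaks are all illustrative assumptions.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Normalised dot product of two binned spectra, between 0 and 1."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up peaks: an experimental spectrum and a library reference spectrum.
experimental = [(175.1, 40.0), (304.2, 100.0), (417.2, 65.0)]
reference = [(175.1, 35.0), (304.2, 100.0), (417.2, 70.0), (530.3, 10.0)]
print(round(cosine_similarity(experimental, reference), 3))
```

A score close to 1 indicates that the two spectra share the same peaks in similar proportions, while spectra with no peaks in common score 0.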
For example, a researcher can look up a protein or peptide of interest to complement incomplete experimental data. Candidates could include missing proteins that are predicted from the genetic code but have not been experimentally observed. These proteins have gone undetected, most likely because some genetic data is not built into proteins or because the proteins are too difficult to detect with current instrumentation. Synthetically building peptides for these proteins and testing detection with current methods will help determine whether they are present or undetectable. Novel information about unknown proteins revealed by using ProteomeTools demonstrates just one of the benefits of such a comprehensive database.
The project is expected to take three years, with a release of data for about 250,000 peptides every six months. With so much new data being generated, there is an opportunity to develop improved statistical methods for the assessment of large-scale proteomics experiments. Applying peptide data to the development of new software that can complement or even replace experimental data will help facilitate further analysis. To test this, the team built a prototype classifier based on multiple fragmentation spectra for the same peptide at a particular collision energy, which predicts the fragmentation intensity of MS/MS spectra for any peptide. If such software succeeds, it would remove the need for stable isotope-labelled peptides, since fragmentation spectra could be easily simulated from unlabelled spectra.
The availability of such a complete set of peptide standards can also be applied to a variety of basic research or clinical applications. With the addition of pre-built targeted assays for functional proteins such as kinases and phosphopeptides, representing the activation status of signalling pathways, researchers can use the exact retention time and spectrum image for each peptide in the library to investigate proteins of interest in cells and tissues. Once assays are developed, the resource will help researchers to confidently identify novel biomarkers and work to develop simpler and quicker diagnostic tools to improve the speed and accuracy with which diseases can be detected. The library could also enrich data for proteogenomics studies where proteomic data complements genomic analyses, aiding researchers in discovering genomic variants at the protein level.
The ProteomeTools project aims to strengthen our current ability to study proteins in biological samples using synthetic peptide standards, and expand our reach for what can be discovered. The molecular and digital tools that can arise from the ProteomeTools project will become valuable resources for the proteomics community.
Author:
Andreas Huhmer is global marketing director for mass spectrometry solutions (Proteomics) at Thermo Fisher Scientific