A quartet of proteomic surprises…
6 Nov 2014 by Evoluted New Media
The draft map of the human proteome is just the beginning of our understanding of the incredible complexity of the human machine. Here Paul Ko Ferrigno discusses some of the remaining challenges and highlights four surprises in the Human Proteome Map as we currently understand it.
Sequencing the human genome revolutionised biology and medicine – so much is now old hat. This year, two papers published in Nature [1,2] have given us a much broader view of Life (or at least human life) at the level of the protein molecules that perform biological processes – and whose malfunctions cause disease. Using new data from 30 different normal human tissues [1], or a combination of legacy data mined from public databases and newly generated data from more than 60 tissues, 13 body fluids and 147 cell lines [2], the two groups have compiled catalogues of protein expression that now rival in richness the data from genomics and transcriptomics. Twenty-seven years after the first draft map of the human genome [3], we have a draft map of the human proteome.
The two proteome map papers [1,2] build on work published by other groups just three years ago, showing that model human cells (HeLa [4] and U2OS [5] cells) express just 10,000-12,000 proteins. This is a refreshingly small number, one that can realistically be studied using mass spec or even, if the Human Protein Atlas has its way, using antibodies.
When reading the detail of the human proteome map papers [1,2], there are four surprises. The first is that this catalogue is not complete. The most widely used approach to this analysis is ‘bottom-up’ mass spec, where all the proteins in the sample are digested with the protease trypsin into peptides small enough that they will fly in the mass spectrometer. It turns out that there are (at least) 257 human proteins that do not contain a tryptic cleavage site, and so cannot be detected by this approach.
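To make the trypsin limitation concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the textbook rule that trypsin cuts after lysine (K) or arginine (R) unless the next residue is proline (P). The function names, the toy sequences and the 6-30 residue length window for peptides that "fly" are illustrative assumptions, not values taken from either paper.

```python
import re


def tryptic_peptides(sequence: str) -> list[str]:
    """Cut after every K or R, except when the next residue is P (textbook rule)."""
    # Zero-width split point: preceded by K or R, not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if p]


def flyable_peptides(sequence: str, min_len: int = 6, max_len: int = 30) -> list[str]:
    """Keep peptides inside an assumed detectable length window (hypothetical limits)."""
    return [p for p in tryptic_peptides(sequence) if min_len <= len(p) <= max_len]


def has_internal_tryptic_site(sequence: str) -> bool:
    """True if digestion yields more than one peptide."""
    return len(tryptic_peptides(sequence)) > 1


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(tryptic_peptides("MAGTLLVDEKSSPR"))
    print(flyable_peptides("MAGTLLVDEKSSPR"))
    print(has_internal_tryptic_site("AAAGGGLLLEEE"))
```

A protein with no internal K or R comes through such a digest as one intact chain, which is how a protein can be entirely invisible to a bottom-up experiment.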
The second surprise is that, despite the fact that these studies double the number of peptides detected by mass spec [1], this apparently comprehensive approach ‘saturates’ at 17,000-19,000 proteins – leaving 6,000 predicted proteins still unaccounted for [2]. At least some of these ‘missing proteins’ may actually be proteins whose existence is incorrectly predicted by sequence gazing – the ‘genes’ that encode these missing proteins could actually be pseudo-genes that have fallen silent during evolution, although the number seems rather large for this to be true. Other explanations are also proposed – the proteins may just be difficult to isolate, trypsinise and/or ionise. Another idea is that the proteins are simply expressed at very low levels – below the limit of detection of mass spec, at least in discovery mode, leading to the suggestion that people start looking for these missing proteins in the much more sensitive targeted mode available on many machines.
Conversely, and here is the third surprise, Kim et al [1] report 16 million MS/MS spectra that did not match currently annotated proteins, and suggest that at least 140 presumed pseudo-genes are in fact true, protein-encoding genes. They also suggest the existence of at least 44 novel ORFs and 106 new exons within previously (mis-)annotated genes [1]. This is consistent with a new bioinformatics study that suggests that 2,000 new alternative splice variants might exist, where exons that were thought to lead to nonsense (frame-shifted) transcripts are actually coupled to a new, alternative and in-frame start codon [6]. It is also possible that post-translational modifications and sample processing artefacts are changing the mass of many of the peptides being analysed to such an extent that they cannot be identified by scanning databases of predicted peptides derived simply from the DNA sequence of the human genome.
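The last point can be illustrated with a deliberately simplified sketch: a peptide is "identified" if its measured mass falls within a tolerance of a mass predicted from sequence alone. Real search engines score fragment spectra rather than a single precursor mass, so this is an assumption-laden toy. The two-peptide database, the 10 ppm tolerance and the function names are hypothetical. The residue masses are standard monoisotopic values, and the phosphate shift (HPO3) is about 79.966 Da.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056      # added once per peptide for the free termini
PHOSPHO = 79.96633    # mass added by one phosphate group (HPO3)


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified peptide predicted from its sequence."""
    return sum(MONO[aa] for aa in sequence) + WATER


def match_by_mass(observed: float, database: list[str], ppm: float = 10.0) -> list[str]:
    """Return database peptides whose predicted mass is within `ppm` of the observed mass."""
    hits = []
    for peptide in database:
        predicted = peptide_mass(peptide)
        if abs(predicted - observed) / predicted * 1e6 <= ppm:
            hits.append(peptide)
    return hits


if __name__ == "__main__":
    database = ["SSPR", "MAGTLLVDEK"]          # toy "predicted peptide" database
    plain = peptide_mass("SSPR")
    phosphorylated = plain + PHOSPHO           # same peptide carrying one phosphate
    print(match_by_mass(plain, database))           # unmodified peptide is found
    print(match_by_mass(phosphorylated, database))  # modified peptide matches nothing
    # Allowing for the modification explicitly recovers the match.
    print(match_by_mass(phosphorylated - PHOSPHO, database))
```

The peptide is the same in each case. It goes unmatched only when the search does not anticipate the modification, which is one way large numbers of spectra can end up unassigned.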
The fourth surprise is that the core protein set that is nearly ubiquitously expressed, and that probably accounts for the maintenance of basic cellular biological processes in all cells, may nonetheless be differentially expressed between different organs and fluids. There is a discrepancy here between the two papers: Kim et al [1] mention 2,350 proteins that are ubiquitously expressed but appear not to vary between tissues, while Wilhelm et al’s approach, which amalgamates data from multiple previous studies, suggests that these levels do vary [2]. According to Wilhelm et al, the differentiation is not a simple presence/absence, but a difference in level of expression, covering five orders of magnitude (a protein may be expressed 100,000 times more highly in one organ or fluid than in another). One wonders whether this discrepancy may be an artefact of Wilhelm et al’s approach, and whether it reflects differences in experimental approaches between the studies that they collate rather than true biological differences.
Less surprising, but comforting, is that regulatory and effector proteins that have been implicated in a particular function or process (e.g. immunity) can be seen to be preferentially expressed in cells or organs of the corresponding system (e.g. the immune system). This implies that if a protein is found to be preferentially expressed in a particular organ, it probably plays a role in the process in which that organ is involved.
As with the genome map, which was only really useful as a framework onto which to overlay the sequence data produced by the Human Genome Project, the proteome map only provides us with a framework that we can use to try and integrate data produced by other methods. The real question is what do all these proteins do – or more interestingly, how do they do their work, how are they regulated, what goes wrong in disease, and what can we do to correct the problem?
One of the key facts we know about many of the proteins we study is that they bind to other proteins. In attempts to catalogue these interactions, a yeast two-hybrid analysis of just over 5,500 clones expressing human proteins found that 1,705 of them could bind to at least one other protein in the same set, and that these 1,705 are capable of making 3,186 pair-wise interactions. Of these proteins, nearly half (47%) bind to only one partner, while 24 (1.4%) have at least 30 partners and are designated as ‘hubs’ that probably play essential roles in cell biology [7]. If we extrapolate these numbers to even just 10,000 proteins in a ‘typical’ human cell, that would mean that 140 of them are capable of interacting with any one (or more) of 30 other proteins each… which equates to over 140 billion potential protein complexes.
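The extrapolation above can be reproduced under one explicit counting assumption: each hub can form a complex with any non-empty subset of its 30 partners, giving 2^30 − 1 combinations per hub. That counting rule is an assumption made here for illustration, since the article does not spell out how the figure was reached, and complexes shared between hubs are not de-duplicated. Function names are illustrative, and the 1.4% hub fraction is expressed as 14 per thousand to keep the arithmetic in integers.

```python
def expected_hubs(n_proteins: int, hubs_per_thousand: int = 14) -> int:
    """Scale the observed hub fraction (24 of 1,705, about 1.4%) to a cell's proteome."""
    return n_proteins * hubs_per_thousand // 1000


def complexes_per_hub(n_partners: int = 30) -> int:
    """Assumed counting rule: every non-empty subset of partners is one potential complex."""
    return 2 ** n_partners - 1


def potential_complexes(n_proteins: int = 10_000, n_partners: int = 30) -> int:
    """Total potential hub-centred complexes, without de-duplicating across hubs."""
    return expected_hubs(n_proteins) * complexes_per_hub(n_partners)


if __name__ == "__main__":
    print(expected_hubs(10_000))             # 140
    print(complexes_per_hub(30))             # 1073741823
    print(f"{potential_complexes():,}")      # 150,323,855,220
```

On these assumptions, 140 hubs give roughly 150 billion combinations, consistent with the "over 140 billion" quoted above. The count includes no hub with more than 30 partners and none of the non-hub proteins.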
So we can infer that proteins do their work through a succession of protein interactions, on a time scale that is likely to be of seconds or fractions of a second. How would this be regulated? There are over 300 post-translational modifications known [8], with 47,673 experimentally-confirmed occurrences on 66,260 different proteins in the UniProt database. Of these modifications only a handful can claim to be well studied (protein phosphorylation, lysine methylation or acetylation, and protein ubiquitylation); the rest remain poorly characterised for lack of tools or technologies for their study – so there are plenty of ways in which protein complex assembly and disassembly can be driven, and protein function modulated.
Finally, looking at protein expression levels across the numerous tissues and fluids covered in their analysis, Wilhelm et al infer that the rate of mRNA translation is probably conserved across all tissues [2]. Using their new ProteomicsDB resource [2], they propose that it should be possible to calculate the protein/mRNA expression ratio for every protein/mRNA pair, and then use mRNA levels (which can be measured in high throughput using microarrays or multiplexed quantitative PCR) as a fairly reliable predictor of protein levels in biology. This is an interesting idea, but it has long been known that the presence of an mRNA cannot be used as a reliable predictor of the presence of the corresponding protein. It is possible that if the protein is already known to be expressed then this quantitative inference is acceptable, but much more work is needed before this proposition can be accepted. The other thing of course is that knowing the protein is present doesn’t tell you what it’s doing – for that, you’ll need to know about its post-translational modification state [8], which means that PCR will not yet be able to replace mass spec, or indeed western blots, ELISA and other antibody-based or antibody-like assays for the analysis of proteins.
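The protein/mRNA ratio proposal discussed above can be sketched in a few lines. This is an illustration of the idea, not Wilhelm et al's actual procedure. It assumes a single ratio per gene, estimated as the median across tissues where both quantities were measured. The tissue names, expression values and function names are all invented.

```python
from statistics import median


def protein_per_mrna_ratio(protein_by_tissue: dict[str, float],
                           mrna_by_tissue: dict[str, float]) -> float:
    """Median protein/mRNA ratio for one gene across tissues measured in both data sets."""
    ratios = [
        protein_by_tissue[tissue] / mrna_by_tissue[tissue]
        for tissue in protein_by_tissue
        if mrna_by_tissue.get(tissue, 0.0) > 0.0   # skip tissues with no usable mRNA value
    ]
    if not ratios:
        raise ValueError("no tissue has both a protein and an mRNA measurement")
    return median(ratios)


def predict_protein_level(ratio: float, mrna_level: float) -> float:
    """Predict protein abundance in a new tissue from its mRNA level alone."""
    return ratio * mrna_level


if __name__ == "__main__":
    # Invented numbers in arbitrary units, for one hypothetical gene.
    protein = {"liver": 200.0, "kidney": 100.0, "lung": 50.0}
    mrna = {"liver": 20.0, "kidney": 10.0, "lung": 5.0}
    ratio = protein_per_mrna_ratio(protein, mrna)
    print(ratio)
    print(predict_protein_level(ratio, mrna_level=8.0))
```

The prediction is only as good as the assumption that the ratio holds in the unmeasured tissue, and it says nothing about whether the protein is present at all or about its modification state. Those are the caveats raised above.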
There are a number of large-scale efforts to provide proteome-wide reagents for the analysis of individual proteins, of which the most successful is probably the Human Protein Atlas. A key challenge, though, will be to provide a source of renewable protein detection reagents (most of the antibodies made by the Human Protein Atlas are polyclonals) that are stable over time (a risk with any monoclonal antibody) and can be shared across the community. We believe that Affimers will provide the solution – not just because they can be used as a direct replacement for antibodies in established detection workflows (westerns, ELISAs etc) but also because they can be used in a microarray format [9], enabling the simultaneous analysis of many thousands of proteins to be coupled directly to the detailed singleplex analysis of proteins of interest, and putting some detail on that proteome map.
References
1. M-S Kim, SM Pinto, D Getnet, R Sekhar Nirujogi, SS Manda, R Chaerkady, AK Madugundu, DS Kelkar, R Isserlin, S Jain, JK Thomas, B Muthusamy, P Leal-Rojas, P Kumar, NA Sahasrabuddhe, L Balakrishnan, J Advani, B George, S Renuse, LDN Selvan, AH Patil, V Nanjappa, A Radhakrishnan, S Prasad, T Subbannayya, R Raju, M Kumar, SK Sreenivasamurthy, A Marimuthu, GJ Sathe, S Chavan, KK Datta, Y Subbannayya, A Sahu, SD Yelamanchi, S Jayaram, P Rajagopalan, J Sharma, KR Murthy, N Syed, R Goel, AA Khan, S Ahmad, G Dey, K Mudgal, A Chatterjee, T-C Huang, J Zhong, X Wu, PG Shaw, D Freed, MS Zahari, KK Mukherjee, S Shankar, A Mahadevan, H Lam, CJ Mitchell, S Krishna Shankar, P Satishchandra, JT Schroeder, R Sirdeshmukh, A Maitra, SD Leach, CG Drake, MK Halushka, TS Keshava Prasad, RH Hruban, CL Kerr, GD Bader, CA Iacobuzio-Donahue, H Gowda & A Pandey (2014). A draft map of the human proteome. Nature 509, 575-581. doi:10.1038/nature13302
2. M Wilhelm, J Schlegl, H Hahne, A Moghaddas Gholami, M Lieberenz, MM Savitski, E Ziegler, L Butzmann, S Gessulat, H Marx, T Mathieson, S Lemeer, K Schnatbaum, U Reimer, H Wenschuh, M Mollenhauer, J Slotta-Huspenina, J-H Boese, M Bantscheff, A Gerstmair, F Faerber & B Kuster (2014). Mass-spectrometry-based draft of the human proteome. Nature 509, 582-587
3. H Donis-Keller, P Green, C Helms, S Cartinhour, B Weiffenbach, K Stephens, TP Keith, DW Bowden, DR Smith, ES Lander, D Botstein, G Akots, KS Rediker, T Gravius, VA Brown, MB Rising, C Parker, JA Powers, DE Watt, ER Kauffman, A Bricker, P Phipps, H Muller-Kahle, TR Fulton, S Ng, JW Schumm, JC Braman, RG Knowlton, DF Barker, SM Crooks, SE Lincoln, MJ Daly, J Abrahamson (1987). A genetic linkage map of the human genome. Cell 51, 319–337
4. M Beck, A Schmidt, J Malmstroem, M Claasen, A Ori, A Szymborska, F Herzog, O Rinner, J Ellenberg and R Aebersold (2011). The quantitative proteome of a human cell line. Mol Syst Biol. 7:549. doi:10.1038/msb.2011.82
5. N Nagaraj, JR Wisniewski, T Geiger, J Cox, M Kircher, J Kelso, S Pääbo, M Mann (2011). Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol. 7:548. doi:10.1038/msb.2011.81
6. LOW Wilson, A Spriggs, JM Taylor, AM Fahrer (2014). A novel splicing outcome reveals more than 2000 new mammalian protein isoforms. Bioinformatics 30, 151-156
7. U Stelzl, U Worm, M Lalowski, C Haenig, FH Brembeck, H Goehler, M Stroedicke, M Zenkner, A Schoenherr, S Koeppen, J Timm, S Mintzlaff, C Abraham, N Bock, S Kietzmann, A Goedde, E Toksöz, A Droege, S Krobitsch, B Korn, W Birchmeier, H Lehrach, EE Wanker (2005). A Human Protein-Protein Interaction Network: A Resource for Annotating the Proteome. Cell 122, 957–968
8. GA Khoury, RC Baliban & CA Floudas (2011). Proteome-wide post-translational modification statistics: frequency analysis and curation of the swiss-prot database. Scientific Reports 1, Article number: 90. doi:10.1038/srep00090
9. Q Song, LK Stadler, J Peng, P Ko Ferrigno (2011). Peptide aptamer microarrays: bridging the bio-detector interface. Faraday Discuss. 149:79-92; discussion 137-57
More information
www.avactalifesciences.com
www.proteinatlas.org
Author
Paul Ko Ferrigno is chief scientific officer at Avacta. He obtained his PhD in Biochemistry at the MRC Protein Phosphorylation Unit. Additionally, he is a Visiting Professor in the Leeds Institute of Molecular Medicine and a Visiting Member of the Astbury Centre for Structural and Molecular Biology.