Turning data into knowledge - the rise of bioinformatics
30 Nov 2005 by Evoluted New Media
With the explosion of data that has arrived with the dawning of the post-genomic era, just how are scientists going to make sense of it all? The answer lies on a silicon chip, say the bioinformaticians.
From Gregor Mendel’s first experiments with pea plants to the raging debate on cloning, the study and application of genetics have revolutionised how we think about biology and, as in most scientific endeavours, advances have come in distinct leaps and bounds.
After the acceptance of Darwin’s work on the origin of species, our knowledge of inheritance rapidly expanded and developed. Then, with Watson and Crick’s finding that DNA exists in the form of a double helix, functional genetics was finally linked to molecular structure. It was the development of this new field of molecular biology that allowed the dissection of complex biological phenomena into specific molecules and their functions. Now, the advancement of genomic technologies such as high-throughput genome sequencing and whole-genome expression analysis has meant that complete genome sequences of complex organisms can be accessed easily by scientists.
Some say this is the end point of the main goal of molecular biology – to identify, clone and analyse specific gene products for given functions. Once all the genes are known, the game changes. Managing the data to give insights into the genetic blueprints being uncovered is surely the next leap for the science of genomics.
Data, data everywhere…
With the amount of genomic and proteomic data now well over a petabyte (a quadrillion bytes), the question becomes, what do you do with it all once you have stored it on various computers? The answer: use those same computers to manage, retrieve, organise and integrate the data. This is the growing discipline of Bioinformatics.
“Bioinformatics is the art of applying modern computational approaches to biological problems,” explains Dr Matt Wood of the Institute of Computational Biomedicine at Cornell University. “The wealth of data flowing from experimental life science research has huge potential to unlock some of the long-unanswered biological questions. However, extracting meaning from these vast bodies of data is challenging. Bioinformatics attempts to use methods from statistics, computer science and other informatic disciplines to turn this data into knowledge.”
The initial challenge for the bioinformaticians in the transformation of data into knowledge is the efficient storage of the data itself. However, on its own, the data is meaningless. Where before it was possible to analyse the results of an experiment with relatively simple lab-based techniques, sophisticated databases and visualisation techniques are now required just to store the data and begin to look at it. Incisive computer tools have therefore been developed to allow the extraction of meaningful biological information.
Although the development of these databases sits very much within the realm of computer scientists, it was vital that the biological questions remained at the centre of the problem. To ensure this, there are three central biological tenets upon which all bioinformatics developments are based: that DNA sequence determines protein sequence, that protein sequence determines protein structure, and that protein structure determines protein function.
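The first of these tenets is the one most easily shown on a computer. What follows is a minimal sketch, not a production tool: it assumes the standard genetic code, reads a DNA coding sequence three bases at a time and stops at the first stop codon. The function name `translate` and the example sequence are invented for illustration. The other two steps, from protein sequence to structure and from structure to function, are far harder to compute and are not attempted here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Hypothetical helper: reads from the first base, ignores any trailing
    bases that do not fill a codon, and stops at the first stop codon.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented example: ATG (M), GCC (A), AAG (K), then the stop codon TAA.
print(translate("ATGGCCAAGTAA"))
```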
The worldwide protein sequence repository contains 32,000 structures
From database to organism
Once created, biological databases are archives of consistent data that are stored in a uniform and efficient manner. A simple database might be a single file containing many records, each of which includes the same set of information. These databases can contain information from a broad spectrum of molecular biological areas. For example, a record associated with a nucleotide sequence typically contains the input sequence with a description of the type of molecule, the name of the source organism from which it was isolated and, often, literature citations associated with the sequence.
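To make the idea of a record concrete, here is a small sketch of how such a database might look in code. It is an illustration only: the class `SequenceRecord`, the `record_id` field, the helper `find_by_organism` and all of the sample values are invented, and real sequence databases use their own, much richer, formats.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record in a simple sequence database (hypothetical layout)."""
    record_id: str        # invented identifier, not a real accession number
    sequence: str         # the input nucleotide sequence
    molecule_type: str    # description of the type of molecule
    organism: str         # source organism the sequence was isolated from
    citations: list = field(default_factory=list)  # literature citations


# A "database" in the simplest sense: many records with the same fields.
# All values below are placeholders made up for this example.
database = [
    SequenceRecord("REC0001", "ATGGCCAAGTAA", "DNA",
                   "Saccharomyces cerevisiae", ["placeholder citation 1"]),
    SequenceRecord("REC0002", "AUGGCCAAGUAA", "mRNA", "Homo sapiens"),
    SequenceRecord("REC0003", "ATGAAATAG", "DNA", "Homo sapiens"),
]


def find_by_organism(records, organism):
    """Return every record isolated from the named organism."""
    return [record for record in records if record.organism == organism]


for record in find_by_organism(database, "Homo sapiens"):
    print(record.record_id, record.molecule_type, record.sequence)
```

Because every record carries the same fields, a query such as this one can be written once and applied to the whole archive, which is the practical benefit of storing data in a uniform way.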
Ultimately, however, all of this information must be combined to form a comprehensive picture of cellular activity. So has it been working so far? “Most definitely,” says Dr Conrad Bessant of the Department of Analytical Science and Informatics at Cranfield University. “Bioinformatics has been an integral part of the post-genomic era. If you just consider the example of the human genome, traditional ways of recording sequences such as lab books are just not practical when you're dealing with three billion base pairs. The same goes for analysis of the data - how would we have found all the genes in that data without the aid of a computer?”
Several bioinformatics software tools are now available for scientists to use, and most are available over the internet. One good source of free access to these tools is the European Bioinformatics Institute (EBI). Part of the European Molecular Biology Laboratory (EMBL), the EBI manages databases of biological data including nucleic acid sequences, protein sequences and macromolecular structures. The tools available include so-called Similarity Searching Tools, which allow scientists to examine the similarity of one sequence to another, revealing clues as to the structure of the sequence. Once the structure has been elucidated, the biochemical functions of a protein can also be probed using a Protein Function Analysis tool, which provides details on motifs, signatures and protein domains.
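The principle behind similarity searching can be sketched in a few lines. The example below is not how the EBI tools are implemented, which the article does not describe; it is a bare-bones version of Smith-Waterman local alignment, a textbook method swapped in here for illustration. The scoring values (2 for a match, -1 for a mismatch, -2 for a gap), the function names and the tiny "database" are all assumptions made for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b.

    Scoring values are arbitrary illustrative choices. Only the score is
    returned, not the alignment itself, so one row of the table is enough.
    """
    previous_row = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current_row = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous_row[j - 1] + (match if char_a == char_b else mismatch)
            score = max(
                0,                         # start a fresh local alignment
                diagonal,                  # align char_a with char_b
                previous_row[j] + gap,     # gap in b
                current_row[j - 1] + gap,  # gap in a
            )
            current_row.append(score)
            best = max(best, score)
        previous_row = current_row
    return best


def rank_by_similarity(query, sequences):
    """Sort named sequences from most to least similar to the query."""
    return sorted(
        sequences,
        key=lambda name: local_alignment_score(query, sequences[name]),
        reverse=True,
    )


# Invented sequences: "candidate_a" contains the query exactly.
toy_database = {"candidate_a": "TTACGTTT", "candidate_b": "GGGG"}
print(rank_by_similarity("ACGT", toy_database))
```

A high score only suggests that two sequences are related; as Dr Bessant notes below, the result is a hypothesis to be tested rather than an answer.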
Despite the many tools which can provide a lot of information for investigators, one of the main roles of bioinformatics, argues Dr Bessant, is to ask more questions than it answers. “It’s important to remember that because bioinformatics is a computational discipline, it generally provides hypotheses rather than answers, and most hypotheses ultimately need to be confirmed in the lab.”
It’s all part of the system
Many biologists see the next era of biology as being based increasingly on a systems approach, and indeed genomic and post-genomic technologies are making this more and more possible. For example, measuring the expression of all genes (as opposed to following specific genes) in the yeast genome every 20 minutes over a four-hour nutrient starvation experiment can now be achieved using gene-chip technologies. The difficulty of this kind of experiment is that it produces large amounts of data, a problem that is ideally solved with bioinformatics.
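A rough calculation shows how quickly the numbers grow. The sketch below assumes a round figure of about 6,000 yeast genes, which is not stated in the article, and counts a reading at the start of the experiment as well as one at the end of each 20-minute interval. The function name is invented for the example.

```python
def expression_matrix_shape(n_genes, duration_min, interval_min, include_start=True):
    """Return (genes, time points) for a time-course expression experiment."""
    n_timepoints = duration_min // interval_min
    if include_start:
        n_timepoints += 1  # count the reading taken at time zero
    return n_genes, n_timepoints


# Assumption: roughly 6,000 genes in yeast; four hours = 240 minutes.
genes, timepoints = expression_matrix_shape(6000, 240, 20)
print(f"{genes} genes x {timepoints} time points = {genes * timepoints} measurements")
```

Under these assumptions a single experiment yields tens of thousands of expression values before any replicates are run, which is why the analysis cannot be done by hand.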
Dr Wood explains: “Pooling different approaches to biological study and applying them at the cell or tissue level is extremely challenging; however, the field of systems biology attempts to do exactly that. By accurately modelling intra- and extracellular interactions – such as signal transduction pathways and gene regulatory networks – it will provide a huge amount of data that will require new bioinformatics tools to decipher it all. The requirements of systems biology will very much push bioinformatics research forward.”
This challenge will define bioinformatics as a science in the coming years. It will need to continue to provide the analysis tools for making sense of genomics data and proteomics data - including gene sequence, gene expression, gene polymorphism, and protein structure and functional data - and put it together to make cohesive models of biological systems.
“Post-genomic techniques such as microarrays and proteomics are generating even larger data sets,” explains Dr Bessant. “So bioinformatics is continually growing and developing to respond to the challenges of different and larger sets of data.”
The skills gap
It is clear that bioinformatics is a powerful - indeed vital - tool for the decoding of otherwise impenetrable data. But not all biologists are computer experts. Could the continuous development of more powerful and complex tools mean that the computing technology will outpace the biologist’s knowledge of how to use it?
“I think this is definitely happening,” says Dr Bessant. “Probably more with the analysis tools than with the data itself. There are so many bits of bioinformatics software being produced all the time that it is difficult for the laboratory biologist to keep track of which ones are useful for their research.”
Clear communication between biologists and bioinformaticians is the key to closing this gap between what the tools can do and how they can be used effectively to solve biological problems.
Dr Bessant, who runs a Master’s degree in bioinformatics at Cranfield, thinks that the answer lies in good training programmes. “This issue can be addressed in two ways,” he says. “Either through educating the biologists in bioinformatics, or by working closely with biologists to make sure we produce tools that are acutely useful to them.”
The availability of training courses like the one at Cranfield has increased over the last few years, but it is not enough. The supply of trained bioinformaticists is still smaller than the demand from academic, government and industrial laboratories. However, this looks set to change, says Peter Gwynne of ScienceCareers.org: “At present, trained bioinformaticists have a huge advantage, as it hasn’t been easy so far to obtain the necessary training in the field. That situation has started to change, however. In the past two years several universities have started to offer Bachelor’s, Master’s, and doctoral degrees in bioinformatics, many of them supported by government funding.”
The goal of Bioinformatics must now be to build connections between those who are actively producing data experimentally, and those with the skills to analyse it theoretically. Biologists must utilise new forms of analysis if they are to get the best out of their genomic and proteomic data, and bioinformatics must rise to the challenge of this new class of theoretical problem.
Genomic and proteomic advances have opened the book of life. Now, with the help of their hard drives and microchips, the bioinformaticians must learn its complex language so that we can read it.
By Phil Prime, Assistant Editor, Laboratory News