Seeing is believing
27 Dec 2012 by Evoluted New Media
With “big data” now a fundamental part of so many research projects, it’s often difficult for biologists to uncover all essential parts of their findings, says Carl-Johan Ivarsson. However, new data visualisation techniques are now making these vital results much easier to see.
One of the most pressing challenges facing scientists today is what’s known as “big data,” the name given to complex datasets that have grown so large that they become nearly impossible to analyse using traditional tools. Faced with such a large volume of data, it is often very difficult – if not impossible – for scientists to derive any real biological meaning from their findings with the naked eye alone, so software tools are required to help.
Within the field of molecular biology, the trend towards big data is accelerating rather than slowing down, since working with ever-larger datasets is a vital part of the fight against human disease. As a result, much of the computer software that has been designed for use in this area has focussed on being able to handle increasingly vast amounts of data.
Unfortunately, this shift in focus risks pushing biologists to one side, since a lot of data analysis is now being performed by specialist bioinformaticians and biostatisticians, especially when complicated algorithms are required. This model has several drawbacks, since it is typically the scientist who knows the most about the specific subject area being studied. A partnership between these two groups is vital for achieving the very best results.
To address this issue, a new generation of data visualisation tools has been designed to take full advantage of the most powerful pattern recogniser that exists: the human brain. Powerful software programs are already being used to help researchers visualise their data in graphical form, so that they can identify hidden structures and patterns, and therefore any interesting or significant results, by themselves, without having to rely exclusively on specialist bioinformaticians and biostatisticians.
The latest data visualisation techniques and presentation technologies are already making it much easier for researchers to examine enormous quantities of data, to test different hypotheses, and to explore alternative scenarios within seconds, since important findings can now be displayed in a graphical form that is much easier to interpret.
This approach to data visualisation works by projecting high-dimensional data down to lower dimensions, which can then be plotted in 3D on a computer screen, rotated manually or automatically, and examined by the naked eye. With the benefit of instant feedback on all of these actions, scientists studying human disease can analyse their findings in real time, directly on their computer screen, in a graphical form.
When used during research in this way, data visualisation is a very powerful tool for scientists, since the human brain is very good at detecting structures and patterns. The idea behind this approach is that highly complex data becomes easier to understand when it is given a graphical form, so that scientists can base their decisions on information that they can identify and interpret easily.
Dr Anna Andersson, a scientist studying childhood leukaemia in both the United States and Sweden, is one researcher using this new technology in a real-world setting.
“For me, the visualisation of data into images is an absolute requirement, since genome wide data simply contains too much information to interpret otherwise,” Dr Andersson says. “With data visualisation, however, it’s now possible for scientists to play a very active role in the analysis of key data, since it is much easier and faster to interpret the results.”
“Of course, I still think that a good collaboration between statisticians and biologists is vital. It is very useful, for example, to discuss cut-offs and statistical significance with a statistician in order to make sure that the data is not ‘over-interpreted’. However, with the latest data visualisation tools, a biologist is now able to query the data instantly, and to be perhaps more critical about the data, by looking at the information in a way that is different from a statistician. As a result, I believe that this model actually strengthens the relationship between biologists and statisticians, which means that much better results can be achieved.”
New visualisation methods included in the latest data analysis applications are currently allowing scientists to analyse very large data sets by using a combination of different visualisation techniques, such as heat maps and Principal Component Analysis (PCA). With visualisation tools like these, it is possible to investigate large and complex data sets without being a statistics expert, since visualising information reduces the time required to take in data, make sense of it, and draw conclusions from it.
The process begins by reducing high-dimensional data down to lower dimensions so that it can be plotted in 3D. PCA is often used for this purpose, as it uses a mathematical procedure to transform a number of possibly correlated variables into a smaller number of uncorrelated variables (called principal components).
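As a rough illustration of the projection step described above, the following is a minimal sketch of PCA, assuming Python with the NumPy library is available. It is not the method used by any particular software package: the function name pca_project and the simulated expression values are hypothetical, chosen only to show how samples measured on many variables can be reduced to three uncorrelated coordinates for a 3D plot.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project samples (rows) onto their top principal components.

    `data` is a samples x variables matrix. Hypothetical helper for
    illustration only.
    """
    data = np.asarray(data, dtype=float)
    # Centre each variable so that the components describe variation
    # around the mean rather than the mean itself.
    centred = data - data.mean(axis=0)
    # Singular value decomposition: the rows of vt are the principal axes,
    # ordered from most to least variance explained.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:n_components]

# Simulated data (an assumption, not real measurements):
# 20 samples x 50 variables, where the second group of 10 samples
# has raised values in the first 10 variables.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(10, 50))
group_b = rng.normal(0.0, 1.0, size=(10, 50))
group_b[:, :10] += 4.0
samples = np.vstack([group_a, group_b])

scores, explained = pca_project(samples, n_components=3)
print(scores.shape)   # one row of 3D coordinates per sample
print(explained)      # fraction of total variance captured by each component
```

Each row of scores can be plotted as a point in 3D. In this simulated case the two groups fall on opposite sides of the first component, which is the kind of structure a researcher would look for by eye.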
One of the key breakthroughs in the latest generation of bioinformatics software is the introduction of dynamic PCA, an innovative way of combining PCA with immediate user interaction. This feature allows scientists to manipulate different PCA plots – interactively and in real time – directly on the computer screen, and at the same time to work with all annotations and other links in a fully integrated way. With this approach, researchers are free to explore different versions of the presented view, and are therefore able to analyse a large dataset easily.
By using a tool known as a heat map alongside this dynamic PCA, Dr Andersson has yet another way of visualising her data, since a heat map takes the values of a variable laid out in a two-dimensional grid and represents them as different colours.
When data is obtained from DNA microarrays, heat maps in biology represent the level of expression of many genes across a number of comparable samples, such as cells in different states or samples from different patients.
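The colour mapping behind a heat map can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, the expression values, the blue-white-red colour scale and the helper functions row_zscores, z_to_rgb and heatmap_colours are all hypothetical, and real tools may scale and colour the data differently. Each gene (row) is standardised so that its colours show whether a sample lies above or below that gene's own average.

```python
from statistics import mean, pstdev

def row_zscores(row):
    """Standardise one gene's values to mean 0 and standard deviation 1."""
    m = mean(row)
    sd = pstdev(row)
    if sd == 0:
        # A gene with no variation carries no contrast: treat as average.
        return [0.0 for _ in row]
    return [(x - m) / sd for x in row]

def z_to_rgb(z, limit=2.0):
    """Map a z-score to a colour: blue (low), white (average), red (high)."""
    t = max(-limit, min(limit, z)) / limit   # clamp, then scale to [-1, 1]
    if t >= 0:
        fade = round(255 * (1 - t))
        return (255, fade, fade)
    fade = round(255 * (1 + t))
    return (fade, fade, 255)

def heatmap_colours(matrix):
    """Turn a genes x samples matrix into a grid of RGB colours."""
    return [[z_to_rgb(z) for z in row_zscores(row)] for row in matrix]

# Hypothetical expression values: three genes measured in six samples.
expression = {
    "gene_1": [2.1, 2.3, 2.0, 7.9, 8.2, 8.0],   # high in the last three samples
    "gene_2": [6.5, 6.8, 6.6, 1.2, 1.0, 1.4],   # high in the first three samples
    "gene_3": [4.0, 4.0, 4.0, 4.0, 4.0, 4.0],   # no variation
}

colours = heatmap_colours(list(expression.values()))
for gene, row in zip(expression, colours):
    print(gene, row)
```

The resulting grid of colours could be passed to any plotting library for display. Reading across a row, a block of red next to a block of blue shows at a glance that a gene behaves differently in two groups of samples.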
“Heatmaps provide an intuitive way of visualising data and are therefore easy for everyone to understand, not only people with a good knowledge about genomics,” says Dr Andersson. “The heat maps tell the viewer in an instant how strong the signature is without going into details. Without a doubt, data visualisation tools like these make it possible to achieve better results, since they make it much easier to notice something that might otherwise be hidden.”
When performing her own research, Dr Andersson typically begins her analysis by opening up a dataset and taking a look at the inherent structures that form, without performing any filtering. Usually, if there are any outliers or natural groups forming, she can see them at this stage. Next, she colours the samples based on their class, and then takes another look at the data to see if samples belonging to the same class form natural clusters.
Once this step is completed, Dr Andersson begins to filter the data by variance in order to reduce the noise levels. Usually, if natural classes were seen during her initial inspection, they will begin to form even tighter clusters when genes with no or very little variation are removed.
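The variance filter described above can be sketched as follows. This is a minimal illustration rather than the procedure used by any specific software: the function filter_by_variance, the cut-off value and the gene data are hypothetical, and in practice the threshold would be chosen interactively while watching how the clusters respond.

```python
from statistics import pvariance

def filter_by_variance(expression, min_variance):
    """Keep only genes whose variance across samples reaches min_variance.

    `expression` maps a gene name to its list of values, one per sample.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if pvariance(values) >= min_variance
    }

# Hypothetical values for four genes across four samples.
expression = {
    "gene_a": [5.0, 5.0, 5.0, 5.0],   # no variation: only adds noise
    "gene_b": [5.0, 5.1, 4.9, 5.0],   # very little variation
    "gene_c": [1.0, 9.0, 1.2, 8.8],   # varies strongly between samples
    "gene_d": [2.0, 2.5, 6.0, 6.4],   # varies strongly between samples
}

filtered = filter_by_variance(expression, min_variance=0.1)
print(list(filtered))   # genes that survive the filter
```

Genes that barely change between samples cannot help to separate the classes, so removing them leaves the informative genes to dominate the picture.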
Finally, Dr Andersson uses the multi-group comparison (an F-test based on ANOVA) to identify genes that significantly correlate with the classes. Once that process has been completed, she can export a variable (gene) list and a data matrix that will allow her to use data visualisation to investigate the genes even further.
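For readers curious about what a multi-group comparison computes, here is a minimal sketch of a one-way ANOVA F statistic, calculated for each gene and used to rank genes by how strongly they differ between classes. The function names, class labels and values are hypothetical, and the sketch stops at the F statistic itself. Turning it into a significance level requires the F distribution, which statistical libraries provide (for example, scipy.stats.f_oneway in SciPy).

```python
def split_by_class(values, labels):
    """Group one gene's values by the class label of each sample."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return list(groups.values())

def f_statistic(groups):
    """One-way ANOVA F statistic: between-class over within-class variation."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        # No spread inside any class: perfectly separated, or entirely flat.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(expression, labels):
    """Return (gene, F) pairs, strongest class association first."""
    scored = [
        (gene, f_statistic(split_by_class(values, labels)))
        for gene, values in expression.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical data: nine samples in three classes, two genes.
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
expression = {
    "gene_x": [1, 2, 3, 2, 3, 4, 5, 6, 7],   # class means differ
    "gene_y": [4, 5, 6, 5, 4, 6, 4, 6, 5],   # class means are identical
}

ranked = rank_genes(expression, labels)
for gene, f_value in ranked:
    print(gene, f_value)
```

A gene list exported from such a ranking would contain the genes with the largest F values, that is, those whose expression differs most between the classes relative to the variation within each class.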
The process is similar when trying to identify unknown subtypes, except that here Dr Andersson uses the software to create a new class before marking the samples of the potential new class, and then cross-validates them before again exporting a variable (gene) list and reviewing the images that are produced.
“The ability to change – and also get feedback on – a specific question in real time produces better data in the end, as it is now possible to query the data instantly,” says Dr Andersson. “With the latest generation of data analysis software, nothing is too difficult. As a result, you can just try something out if you want to, rather than deciding not to explore your data because you know it’s going to be too difficult or time-consuming.”
As computer technology improves – with greater processing power, better graphics applications and more sophisticated analysis software – data visualisation will continue to develop as well. As such, these new methods of visualising data are likely to make traditional forms of data presentation (such as spreadsheets and basic graphics) obsolete for the data analysis phase in the future.
Even though the exploration and analysis of large data sets can be challenging, the use of tools like PCA and heat maps can provide a powerful way of identifying important structures and patterns very quickly, especially as visualisation typically provides the user with instant feedback, and with results that present themselves as they are being generated.
“A lot of people are talking about the benefits of having a 3D view of the data, as this functionality can be extremely useful. However, a 2D image can also be very informative, when presented in the form of a heat map, for example. The most important thing is to have a graphical view of the data – that is what really makes the biggest difference, not only to the scientists that are performing the study, but also to the people who will be reading the research papers that will be produced as a result.”
Already, the latest technological advances in this area are making it much easier for scientists to compare the vast quantity of data generated by genomic research studies and to test different hypotheses very quickly. As a result, the latest generation of data analysis software is helping scientists to realise the true potential of the important research being conducted in this area.