Under a private cloud
19 Apr 2016 by Evoluted New Media
Dr Jacky Pallas discusses the need for cloud technology in a biomedical setting, describes the challenges of setting up a cloud HPC system and details the considerable benefits it can bring.
eMedLab was formed in 2014 with £8.9m funding from the Medical Research Council (MRC). Our vision, the same today as it was then, is to maximise the gains for patients and for medical research that will come from the explosion in data generated through large-scale genetics projects. Our exemplar projects focus on three disease domains: rare diseases, cardiovascular diseases and cancer. We also focus on three data types: genomic (primarily DNA sequence data), imaging (ranging in scale from whole organs to histopathological samples) and clinical information (patient records and deep phenotyping).
The original partnership was formed by leading scientists at University College London, Queen Mary University of London, the London School of Hygiene & Tropical Medicine, the Francis Crick Institute, the Wellcome Trust Sanger Institute and the EMBL European Bioinformatics Institute. Following staff moves, the partnership has extended to include King’s College London. Achieving our vision has been hampered in the past by several challenges: the necessary security around access to clinical data, the absence of large-scale genetics data from patients and the lack of a high-performance computing facility designed specifically for these types of analysis. Many biomedical projects want to access the same datasets. However, it is simply not practical – from a data transfer and data storage perspective – to have scientists replicating the same core datasets across their own, separate physical high-performance computing resources.

[Image: Large datasets in the cloud reduce the time needed for researchers to conduct work.]
By way of example, we have a cancer genomics dataset that is petabyte (PB) scale, and we really don’t want a dataset of that size duplicated and replicated across organisations. It could take weeks to transfer that amount of data between sites. Duplication of datasets is costly and presents both security and data protection risks. By hosting large datasets in one system, accessed by a number of users for their own projects, we reduce time and cost for the researchers while minimising the risk of data loss. Our award from the MRC provided funding for two important initiatives: creating a high-performance computing and data storage facility, and supporting four Fellowships for early career scientists to lead new research in personalised medicine. In 2014 we started a procurement process which set out the key requirements for the new infrastructure, such as the latest high-performance computing (HPC) systems and the ability to leverage cloud, so that the potential of HPC could be realised without being constrained by the single software stack of a typical HPC system. The major challenge was the funding requirement to design, procure and install the system within a 12-month period.
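A rough, hypothetical back-of-envelope calculation illustrates why petabyte-scale transfers take weeks rather than hours; the 1PB size, link speeds and efficiency factor below are illustrative assumptions, not eMedLab measurements:

# Back-of-envelope estimate of how long a petabyte-scale transfer takes.
# All figures here are illustrative assumptions, not eMedLab numbers.
DATASET_BYTES = 1 * 10**15          # 1 PB (decimal)

def transfer_days(link_gbits_per_s, efficiency=0.7):
    """Days to move the dataset over a link at the given nominal speed,
    allowing for protocol and contention overhead."""
    effective_bytes_per_s = link_gbits_per_s * 1e9 / 8 * efficiency
    return DATASET_BYTES / effective_bytes_per_s / 86400

for gbps in (1, 10, 40):
    print(f"{gbps:>3} Gbit/s link: ~{transfer_days(gbps):.0f} days")

Even on a well-utilised 10 Gbit/s link, a full petabyte takes on the order of a fortnight to move, before any checksumming or re-transfers.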
[Image: The new system can store 5.5PB of data.]
We made the decision to have a data-sharing infrastructure that could serve data to computing resources at a very high rate for rapid processing, while at the same time providing as much flexibility as possible, owing to the diverse nature of the research happening within the different institutions and our collaborative projects. Informatics skills are not as embedded in the biomedical and clinical community as they are in other science areas, although the day-to-day use of high-performance and high-throughput computing is increasing rapidly. We are having to learn very quickly as genome sequencing and imaging projects come out of initiatives like Genomics England, which are generating petabytes of data for analysis. The community is trying to grow its bioinformatics skills quite considerably! The partner institutions carried out an intensive international search to identify the finest early career researchers in medical bioinformatics, and have recruited four career development fellows to take up junior group leader positions in eMedLab institutions. These new research groups will form the eMedLab Research Academy, and we are already growing, with new PhD students working with the Fellows.
In January 2016, launched at the eMedLab 2016 symposium, we went live with a new private cloud, HPC environment and big data system. The system can store up to 5.5PB of data. This significant storage capacity means that data from eMedLab can stay in one secure place. The system has multi-tenancy features which enable different institutions and research groups to co-exist securely on the same hardware and share data when appropriate. Scientists can also use the system to ‘spin up’ virtual HPC clusters bespoke to their needs, selecting, for example, computer memory, processors, networking and storage, all orchestrated by a simple web-based user interface. Practically speaking, researchers can build virtual machines, run their analysis pipeline on their own computers, and then move it into the cluster. They can request as much computing power as the analysis requires and, as agreed by the resource allocation governance, researchers will be able to access up to 6,000 cores of processing power. We can support some very large analysis requirements with the very low latency and high bandwidth required for our high-throughput bioinformatics applications.

The system uses Red Hat Enterprise Linux OpenStack Platform with a combination of technology from Lenovo, IBM and Mellanox. It has been designed, integrated and configured by OCF, an HPC, big data and predictive analytics provider, and is hosted at a shared data centre for education and research offered by the digital technologies charity Jisc. The data centre has the capacity, technological capability and flexibility to future-proof and support all of eMedLab’s HPC needs, with its ability to accommodate multiple and varied research projects concurrently in a highly collaborative environment. It also hosts infrastructure from other biomedical informatics projects at King’s College London and Imperial College London, forming a unique hub of resources. The data centre is connected to the academic network, Janet, which makes it possible to host infrastructure off-site from institutional campuses.

We preferred a private cloud system because it gave us control over our infrastructure and we know it is secure. However, as with anything, a private cloud is not without its challenges. Private cloud infrastructure requires considerable technical expertise. Fortunately, we can access this from across our consortium. Our operations team is formed from staff across the partnership who work together to support the system and grow expertise at all of those partners. We also talk to some of the other academic cloud system providers to share knowledge.
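To make the idea of spinning up a bespoke virtual cluster more concrete, here is a minimal, hypothetical sketch of provisioning a single node against an OpenStack cloud using the openstacksdk Python library. The cloud, image, flavour and network names are placeholders, and eMedLab researchers would normally do this through the web interface rather than in code:

# Minimal sketch: provision one node of a virtual cluster via openstacksdk.
# Cloud, image, flavour and network names are hypothetical placeholders.
import openstack

conn = openstack.connect(cloud="emedlab")                      # credentials read from clouds.yaml

image = conn.compute.find_image("centos-7-bioinformatics")     # assumed image name
flavor = conn.compute.find_flavor("m1.xlarge")                 # CPU/RAM preset
network = conn.network.find_network("project-private-net")     # tenant network

server = conn.compute.create_server(
    name="analysis-node-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)                  # block until ACTIVE
print(server.status)

The same pattern, repeated across flavours and node counts, is essentially what the web-based orchestration does on the researcher’s behalf.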
During the procurement process, the private cloud/OpenStack approach was weighed against buying a more off-the-shelf HPC system, as well as against using public cloud. All the options were close in terms of quality, but there were various constraints in our requirements and the OCF system was a better fit. Use of commercial cloud providers, for example, was tricky because of questions around where the data would be located, together with the regulatory and security issues around access to it. We also questioned the speed of access to data held in a public cloud, and there were pricing considerations around how cloud providers charge for data egress. We’re keeping an eye on public cloud for academic research. We know there are opportunities developing, but they have to overcome hurdles in terms of cost and regulatory concerns, plus technical considerations: for many academic big data and big compute projects we need specialised, cutting-edge architectures. For the work we’re doing, private cloud is most appropriate at the moment.
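As a purely illustrative sketch of why egress pricing matters at this scale, the per-gigabyte rate below is a hypothetical example figure, not a quote from any provider or an eMedLab costing:

# Illustrative annual egress cost for repeatedly pulling a ~1PB dataset
# out of a public cloud. The per-GB price is an assumed example rate.
EGRESS_PRICE_PER_GB = 0.09        # hypothetical price, USD per GB
DATASET_GB = 1_000_000            # ~1 PB expressed in GB

def egress_cost(downloads_per_year):
    """Annual charge if the full dataset is downloaded this many times."""
    return DATASET_GB * EGRESS_PRICE_PER_GB * downloads_per_year

for n in (1, 4, 12):
    print(f"{n:>2} full downloads/year: ~${egress_cost(n):,.0f}")

Under those assumptions, even a handful of full downloads a year runs into hundreds of thousands of dollars, whereas data held on the private cloud can be re-read by hosted compute at no marginal charge.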
As a partnership, our next step is to ensure that eMedLab is properly coordinated with other institutes and organisations involved in eHealth. To this end, in January 2016 we launched the eMedLab Research Academy to the community, bringing together representatives from the MRC, eMedLab and its partner organisations, the Farr Institute, Genomics England, and the Alan Turing Institute. We listened to details of new research, exchanged ideas and ways of working and, of course, unveiled our new private cloud infrastructure.

Author: Dr Jacky Pallas, director of research platforms, University College London.