Where does all the data go?
1 Dec 2011 by Evoluted New Media
Modern science and huge data sets go hand in hand, but are you sure yours is being backed up correctly? Laurent Fanichet gives us some top tips on how to make data archiving work for your laboratory.

Scientific research facilities today face exponential growth of unstructured data that needs to be stored safely, often for the long term, while still remaining accessible for future use. Some scientific data and research methods need to be placed in a public archive so that fellow researchers can replicate and test the research. This is especially the case when the research deals with health issues or public policy.
Research typically generates hundreds of gigabytes, and in some cases terabytes, of valuable data per day at great expense. Some tests may take days to complete, and in some cases the circumstances and contributing factors are difficult and expensive to recreate. Preserving this scientific data and protecting it from loss, while at the same time optimising storage resources, can prove a major challenge.
Two research facilities that were facing this challenge were Genethon and the ALBA synchrotron facility. Genethon is a biotherapy research centre that was created by the French Muscular Dystrophy Association (AFM) and funded almost exclusively by donations from France's annual Telethon. Its goal is to deliver gene therapies to patients with rare diseases, with emphasis on neuromuscular diseases.
With over 200 scientists, physicians, engineers and regulatory affairs specialists, Genethon is a world-leading centre for preclinical and clinical research and development in the field of gene therapy. As pioneers in the global effort to decipher the human genome, Genethon researchers have deeply and permanently changed the landscape of global research in human genetics. Their research data must be protected and archived to preserve Genethon’s scientific heritage.
With data volumes growing at 30 to 40% per year, primarily from the thousands of high resolution pictures and videos from electron microscopes and DNA sequencers, the IT department of Genethon faced lengthening backup windows, and thus began investigating file-archiving solutions to free up space on primary storage. After a study of the potential solutions available on the market, Genethon chose Atempo Digital Archive (ADA).
The IT department manages a fleet of over 250 computers and PC notebooks, about 20 Windows and Linux servers and has two storage arrays for research data, a Dell CX300 with an eight terabyte capacity and a disk array Dell EqualLogic PS6000 Series with a 10 terabyte capacity. Data is backed up and stored on a Dell PowerVault tape library.
After running Atempo Meter, a free diagnostic tool, the IT department learned that over 50% of the images and other files stored on the primary storage, which is backed up on a regular basis, were not accessed daily. Using ADA’s Hierarchical Storage Management (HSM) capabilities, the IT department was able to migrate this scientific data to a secondary storage array and free up 1.5 terabytes of space on the primary storage. During the migration, data files were automatically replaced by small stub files so that researchers can transparently retrieve the archived data. Since the migration, backup windows have been shortened by 40%.
"We were able to free up more than 50% capacity on primary storage without any impact on our users. These results mean significantly improved performance on our tape-based backups, plus savings from reduced administration time and reduced tape supplies," said Tien-Dung Le Van, IT manager for Genethon. "In addition, the digital archive enables us to maintain our precious scientific assets in open formats while keeping these archives accessible and retrievable by our researchers."
Located near Barcelona, ALBA is a world-leading synchrotron light facility, accelerating electrons to produce x-rays that help scientists understand the inner structure of matter. Using its facilities, research groups from all over the world, including Nobel prize-winners, carry out cutting-edge experiments across all scientific fields, from material stress tests in manufacturing to measuring bone growth.
ALBA’s synchrotron can run up to 30 experiments. Seven are currently being conducted around the accelerator, with each detector on the two to three high-throughput beam lines producing up to 300MB of raw data per second. With each test lasting several minutes at a time, the amount of scientific data generated in the course of one day is a major challenge for the computing and controls team at ALBA.
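As a rough illustration of the scale involved, the quoted peak rate can be turned into a back-of-envelope daily volume. The 300MB per second figure comes from the text; the test duration and daily duty cycle below are illustrative assumptions, not ALBA's actual numbers:

```python
# Back-of-envelope estimate of raw-data volume for one high-throughput
# beam line. Peak rate is from the article; test duration ("several
# minutes") and tests per day are assumed, illustrative values.

PEAK_RATE_MB_S = 300        # peak detector output, quoted in the article
TEST_DURATION_S = 5 * 60    # assumption: a five-minute test
TESTS_PER_DAY = 20          # assumption: duty cycle for one beam line

per_test_gb = PEAK_RATE_MB_S * TEST_DURATION_S / 1000
per_day_tb = per_test_gb * TESTS_PER_DAY / 1000

print(f"~{per_test_gb:.0f} GB per test, ~{per_day_tb:.1f} TB per day per beam line")
```

Even with these conservative assumptions, a single beam line at peak can approach two terabytes of raw data per day, which is why tiered storage becomes unavoidable.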
To provide scientists with fast access to experiment data, including results for analysis, ALBA holds data on 250TB of primary storage: a clustered online Hitachi HNAS platform powered by BlueArc. ALBA needed a file-archiving solution to free up space on its primary storage and keep experiment data safe for future reference. To meet these demands, ALBA chose ADA. Using ADA’s storage management capabilities, data is automatically migrated after three months from ALBA’s primary online storage to an Overland NEO 8000 with eight LTO5 drives, a low-cost tape library. Archiving data to tape for long-term preservation frees valuable space on the primary storage for new experiment data, ensuring cost-effective, fast access to the new data as research continues.
In addition, the digital archive stores the specific and varying metadata recorded as each synchrotron experiment is carried out, such as the name of the project, the beam line used and the name of the researcher. By archiving this information, scientists can easily search for and retrieve experiment data based on these metadata criteria.
"Using Atempo, we can centrally manage and securely archive 250TB of precious scientific experiment data for long-term retention, so it can always be accessed. ADA allows us to easily and quickly search for and retrieve information,” said Joachim Metge, IT Manager at ALBA. “In addition, being able to transfer a large amount of data to a low-cost tape solution means ALBA can keep costs down while at the same time continuing to provide fast easy access to the scientists’ experiment data when they need it the most.”
As demonstrated by Genethon and ALBA, archiving represents an effective, efficient way to streamline primary storage usage and eliminate the problems associated with ballooning storage volumes, while preserving the scientific assets for the long-term. Apart from purchasing a powerful archiving software solution, how do you make archiving work in your research facility?
Step 1: Empower Content Owners
Users are in many ways the main culprits for a research facility’s storage struggles, and their participation is vital to getting storage in shape. Content creators know better than anyone else which files are fixed content and unlikely to change, and therefore suitable for archiving. By arming users with an effective mechanism for archiving and retrieving files themselves, significant capacity savings can be quickly realised.
Step 2: Make the Archive Habit Automatic
While user intervention can significantly reduce storage capacity, it’s not always enough. For many organisations, automation of many archival processes is critical. By creating archiving rules and schedules that run automatically, policies can be enforced more consistently to maximise the benefits of archiving.
Best-in-class archiving software allows users to create site-specific criteria for managing automatic archiving policies, based on attributes such as last access date and file type.
To ensure users can locate the files they need after they’re archived, archival solutions should offer stub files – small files that remain on primary storage and provide a link to the original archived file. With a simple double-click in the file browser, the file is retrieved from the archive.
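The policy-plus-stub mechanism described above can be sketched in a few lines. This is a minimal illustration of the general technique, not Atempo's implementation; the file types and the 90-day last-access threshold are assumed, site-specific values:

```python
import shutil
import time
from pathlib import Path

# Illustrative sketch of a rule-based archiving pass: files of selected
# types that have not been accessed within a threshold are moved to
# archive storage, and a small stub file pointing at the archived copy
# is left behind on primary storage so users can still find the data.

ARCHIVE_TYPES = {".tif", ".mp4", ".fastq"}   # assumed site-specific criteria
MAX_IDLE_DAYS = 90                           # assumed last-access threshold

def archive_pass(primary: Path, archive: Path) -> list:
    """Move cold files from `primary` to `archive`, leaving .stub pointers."""
    cutoff = time.time() - MAX_IDLE_DAYS * 86400
    moved = []
    for f in list(primary.rglob("*")):
        if f.is_file() and f.suffix in ARCHIVE_TYPES and f.stat().st_atime < cutoff:
            dest = archive / f.relative_to(primary)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(dest))
            # The stub records where the archived copy now lives, so a
            # retrieval tool (or double-click handler) can fetch it back.
            (f.parent / (f.name + ".stub")).write_text(str(dest))
            moved.append(dest)
    return moved
```

A real product would of course handle retrieval transparently at the file-system level rather than via visible stub text files, but the division of labour is the same: policy selects cold data, migration frees primary storage, and the stub preserves the user's view of the namespace.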
Step 3: Eliminate Unnecessary Duplication
Fighting duplication is one of the biggest challenges of containing storage growth. Today, an executive may deliver a 10MB presentation that gets routed to employees across any number of departments and teams, who may need to access that presentation at some point in the future. In the process, tens, hundreds, or even thousands of copies of that presentation may be stored. Archival solutions can minimise the toll redundancy takes on storage volumes.
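One common way such tools detect redundancy is content hashing: files with the same cryptographic digest are byte-identical, so only one copy needs to be archived while the rest become references to it. A minimal sketch of the detection step (the specific hashing approach is illustrative, not necessarily what any given product uses):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Illustrative duplicate detection by content hash: group files under a
# root directory by their SHA-256 digest; any group with two or more
# members holds byte-identical copies, of which only one must be kept.

def find_duplicates(root: Path) -> dict:
    """Map content digest -> list of paths, keeping only duplicate groups."""
    by_digest = defaultdict(list)
    for f in sorted(root.rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            by_digest[digest].append(f)
    return {d: fs for d, fs in by_digest.items() if len(fs) > 1}
```

In the 10MB-presentation example above, a thousand circulated copies would collapse into a single archived instance plus lightweight references.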
Step 4: Ensure Archived Files Can Be Found
For all its potential cost savings and benefits, an archive won’t be truly viable if users can’t easily locate and retrieve the files they need. To ensure fast, easy access to archived files, users need an archive that has the following capabilities:
- Enables users to retrieve a file without knowing its original location.
- Lets users search by basic file properties such as research name, creation date, researcher’s name and type of file.
- Offers metadata tags that users can use to create properties for searching groups of documents, such as tagging a set of files as “Genome project 2010.”
- Supports searching on the full text of the content. The file should be scanned and indexed when archived so that it doesn’t need to be retrieved for its content to be searched.
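The first three capabilities amount to a metadata catalogue that is queried instead of the file system. A minimal sketch of the idea, with field names that are purely illustrative (not any product's actual schema):

```python
from dataclasses import dataclass, field

# Illustrative metadata catalogue for archived files: each entry carries
# searchable properties plus free-form tags (e.g. "Genome project 2010"),
# so files can be found without knowing their location in the archive.

@dataclass
class ArchiveEntry:
    path: str                       # location in the archive, hidden from users
    researcher: str
    created: str                    # ISO date, e.g. "2010-05-01"
    tags: set = field(default_factory=set)

class Catalogue:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def search(self, researcher=None, tag=None):
        """Return entries matching every criterion that was supplied."""
        return [e for e in self.entries
                if (researcher is None or e.researcher == researcher)
                and (tag is None or tag in e.tags)]
```

Full-text search, the fourth capability, would extend this by indexing file contents at archive time so that a query never has to pull data back from tape just to inspect it.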
When it comes to the storage media behind the archive itself, there are several options:
- Tape: For many research facilities, traditional tape storage is ideal. It is inexpensive, offers large capacity and there are many proven tape vendors to choose from. While tape has some upfront latency, it can subsequently deliver strong performance for large files. Moreover, many research facilities already have tape systems on site for backup, and these can often be leveraged for archiving as well.
- Disk arrays: Lower-cost disk arrays can make great archives for applications that require low latency. In particular, massive array of idle disks (MAID) systems offer a high-performance yet more environmentally friendly alternative to traditional disk arrays: each drive spins only when the data stored on it is needed, which saves power.
- Content addressable storage (CAS): CAS stores information based on its content, not its storage location. CAS offers strict control over document access and modification, making it an ideal alternative for regulated industries.
- Cloud storage: Cloud storage offers near-unlimited scalability without the need to purchase additional hardware, making it an ideal alternative for certain environments.
The Author: Laurent Fanichet, Director, Field Marketing EMEA & Corporate Communications at Atempo