Data storage moves from disk to DNA

By Lauren Davis
Tuesday, 29 January, 2013

With an ever-expanding amount of digital information at our fingertips, researchers are looking for ways to store it all. This is particularly important when it comes to archiving information which, while important, may not be accessed very often. Now, researchers at the EMBL-European Bioinformatics Institute (EMBL-EBI) have created a way to store data in a somewhat unlikely place - DNA.

Current data storage methods are, according to the researchers, either too expensive and power-dependent (such as hard disks) or too prone to degradation (such as magnetic tape). DNA, on the other hand, is a material that lasts for tens of thousands of years - as has been shown through its extraction from woolly mammoth bones - as well as being small, dense and requiring no power for storage.

The researchers noted that the principle of DNA storage has been examined before, stating that “existing schemes used for DNA computing in principle permit large-scale memory”. However, they added, “data encoding in DNA computing is inextricably linked to the specific application or algorithm and no practical storage schemes have been realised”.

The main challenge the researchers faced was that long sequences of DNA cannot yet be synthesised (ie, created) to an exactly specified design; it is currently only possible to manufacture DNA in short strings. They worked around this by representing the information being stored as a hypothetical long DNA molecule and encoding this in vitro using shorter DNA fragments.

“This offers the benefits that isolated DNA fragments are easily manipulated in vitro, and that the routine recovery of intact fragments from samples that are tens of thousands of years old indicates that well-prepared synthetic DNA should have an exceptionally long lifespan in low-maintenance environments,” said the researchers.

Once the method was decided, the researchers selected and encoded a range of computer file formats for storage. These included all of Shakespeare’s sonnets (ASCII text), a scientific paper (PDF format), a colour photograph of the EMBL-EBI (JPEG 2000 format), an excerpt from Martin Luther King’s ‘I have a dream’ speech (MP3 format) and a Huffman code used in the study to convert bytes to base-3 digits (ASCII text) - a total of 739 KB. These were sent to Californian company Agilent Technologies, which used the files to synthesise what amounted to 153,335 strings of DNA - the result of which looked like “a tiny piece of dust”, according to Agilent’s Emily Leproust. The company then sent the sample back to EMBL-EBI for DNA sequencing and decoding.

The DNA sequences were designed to reduce the probability of systematic failure, errors and data loss. They therefore contained no homopolymers (runs of ≥2 identical bases), which are associated with high error rates in existing high-throughput sequencing technologies. Each sequence was split into overlapping segments and alternate segments were converted to their reverse complement. Indexing information added to each segment showed where it belonged in the overall code, and the coding scheme did not allow repeats. The researchers synthesised oligonucleotides (oligos) corresponding to their designed DNA strings using an updated version of Agilent’s OLS (oligo library synthesis) process.
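One way to see how base-3 digits can be written as DNA without ever repeating a base is the short Python sketch below. It is illustrative only: the lookup order, the starting base and the function names are assumptions made for this example, not the table or code used in the study. Each digit picks one of the three bases that differ from the previous base, so runs of identical bases cannot occur, and a second helper shows what taking a reverse complement means.

```python
# Illustrative sketch only. The candidate ordering and the starting base "A"
# are assumptions for this example, not the scheme published by the researchers.
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def trits_to_dna(trits, previous="A"):
    """Write base-3 digits as DNA so that no base equals the one before it."""
    out = []
    for t in trits:
        # Only the three bases that differ from the previous base are allowed.
        candidates = [b for b in BASES if b != previous]
        previous = candidates[t]
        out.append(previous)
    return "".join(out)


def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna by replaying the same choice at every position."""
    trits = []
    for base in dna:
        candidates = [b for b in BASES if b != previous]
        trits.append(candidates.index(base))
        previous = base
    return trits


def reverse_complement(dna):
    """Read the strand backwards and swap each base for its pairing partner."""
    return "".join(COMPLEMENT[b] for b in reversed(dna))


if __name__ == "__main__":
    digits = [0, 0, 1, 2, 2, 0, 1]
    strand = trits_to_dna(digits)
    print(strand)                      # CAGTGAG
    print(dna_to_trits(strand))        # [0, 0, 1, 2, 2, 0, 1]
    print(reverse_complement(strand))  # CTCACTG
```

Because every base depends only on the digit and the base before it, decoding simply replays the same choices in order.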

The DNA sequences representing the encoded files were reconstructed in silico for decoding. Four of the five sequences could be fully decoded without intervention - the fifth, however, contained two gaps, each a run of 25 bases, for which no segment was detected corresponding to the original DNA. The gaps were caused by the failure to sequence any oligo representing any of four overlapping segments. However, the researchers were able to hypothesise what the missing nucleotides should have been and so they manually inserted those 50 bases. The sequence could then be decoded, which resulted in all files being reconstructed with 100% accuracy.
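The link between losing four overlapping segments and a 25-base gap can be sketched in Python as follows. The segment length of 100 bases and the step of 25 bases are assumptions chosen to be consistent with the figures reported above; the article itself does not state them, and the function names are hypothetical. With these values every base is covered by up to four segments, so a gap appears only when four consecutive segments are all missing.

```python
# Illustrative sketch only. Segment length (100) and step (25) are assumed
# values, chosen so that four consecutive losses leave a 25-base gap.

def split_overlapping(seq, length=100, step=25):
    """Return (start, fragment) pairs for fixed-length overlapping segments."""
    return [(start, seq[start:start + length])
            for start in range(0, len(seq) - length + 1, step)]


def uncovered_runs(total_len, segments, lost):
    """Return (start, length) for each run of bases that no surviving segment covers."""
    covered = [False] * total_len
    for index, (start, fragment) in enumerate(segments):
        if index in lost:
            continue  # this segment was never sequenced
        for pos in range(start, start + len(fragment)):
            covered[pos] = True

    runs, run_start = [], None
    for pos, ok in enumerate(covered):
        if not ok and run_start is None:
            run_start = pos
        elif ok and run_start is not None:
            runs.append((run_start, pos - run_start))
            run_start = None
    if run_start is not None:
        runs.append((run_start, total_len - run_start))
    return runs


if __name__ == "__main__":
    sequence = "ACGT" * 100  # 400 bases of placeholder data
    segments = split_overlapping(sequence)
    # Losing three neighbouring segments still leaves every base covered.
    print(uncovered_runs(len(sequence), segments, lost={4, 5, 6}))     # []
    # Losing four neighbouring segments leaves one 25-base gap.
    print(uncovered_runs(len(sequence), segments, lost={4, 5, 6, 7}))  # [(175, 25)]
```

Under these assumptions the redundancy tolerates the loss of any three neighbouring segments, which is why a gap signals that four in a row went undetected.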

The researchers were therefore able to demonstrate that a small amount of data could be stored in DNA. As for larger applications, they showed that although the number of bases of synthesised DNA needed to encode information grows linearly with the amount of information to be stored, encoding efficiency decreases only slowly as that amount grows. The same was said of the increase in cost and rates of error.

“DNA-based storage remains feasible on scales many orders of magnitude greater than current global data volumes,” the researchers said.

The researchers suggest that the method might be economically viable for long-term archives with a low expectation of extensive access, such as government, historical and scientific records.

“As with any storage system, a large-scale DNA archive would need stable DNA management and physical indexing of depositions,” they concluded. “But whereas current digital schemes for archiving require active and continuing maintenance and regular transferring between storage media, the DNA-based storage medium requires no active maintenance other than a cold, dry and dark environment … yet remains viable for thousands of years even by conservative estimates.

“Existing technologies for copying DNA are highly efficient, meaning that DNA is an excellent medium for the creation of copies of any archive for transportation, sharing or security.”

The method has been published in the journal Nature and can be viewed at http://dx.doi.org/10.1038/nature11875.
