By Johanne Kloster Ellingsen
Digital data production is growing exponentially, but current storage technologies are not keeping up with demand. Some researchers are advocating for DNA-based data storage as an alternative. DNA can hold 9Tb per mm3 after considering practical system overheads, resulting in a storage density 115,000 times higher than current archival storage methods can provide. Additionally, DNA-based storage requires little to no maintenance and fewer resources than present storage technology, and it is unlikely to ever become obsolete. The DNA storage pipeline of going from bits to DNA bases and vice versa consists of the following steps: writing (encoding and DNA synthesis), storage, retrieval and reading (DNA sequencing and decoding). DNA synthesis is currently the major bottleneck for commercialising the technology due to its high costs and time consumption. This article discusses the principles of DNA-based storage, the current commercial position of DNA-mediated archival storage and technological improvements necessary for further upscaling.
As we move further into the information age, human life is becoming increasingly digital. Digitalisation in combination with a growing number of devices that monitor and log information has led to an exponential growth in data production. It is estimated that the annual digital data production will reach 180 zettabytes (Zb) by 2025 (figure 1), that is 180 billion terabytes (Tb) or 180 trillion gigabytes (Gb), which is 60 times the amount of data produced in 2010 and 6 times that in 2017 (Holst, 2021; Vitak, 2021). Digital data must be stored on some type of physical storage medium, but current data storage methods are struggling to keep up with demand due to lack of capacity as well as considerable maintenance and sustainability issues. As a result, some researchers are turning to nature’s information storage medium, DNA (Ceze, Nivala and Strauss, 2019). Here, it will be highlighted how synthetic DNA can act as an alternative for tackling the shortcomings of the present storage media and how the overarching DNA-based data storage system would operate.
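The unit conversions quoted above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, so that one zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Assumption: decimal (SI) prefixes, i.e. 1 GB = 10**9 bytes,
# 1 TB = 10**12 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21

projected_zb = 180  # projected annual data production for 2025 (Holst, 2021)

# Integer arithmetic keeps the results exact.
projected_tb = projected_zb * BYTES_PER_ZB // BYTES_PER_TB
projected_gb = projected_zb * BYTES_PER_ZB // BYTES_PER_GB

print(f"{projected_tb:,} TB")  # 180,000,000,000 TB (180 billion)
print(f"{projected_gb:,} GB")  # 180,000,000,000,000 GB (180 trillion)
```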
Figure 1 – Volume of data/information (Zb) created, captured, copied, and consumed worldwide from 2010 to 2025 (Holst, 2021).
2. The data storage problem
Current data storage methods rely on changing a physical state of a medium, as is the case for magnetic or solid-state recording technologies (Ceze, Nivala and Strauss, 2019; DDSA, 2021). Choice of medium is predominantly based on how frequently the data needs to be retrieved – in other words, how “hot” it is. “Hot” data, which is frequently retrieved, is stored on solid-state drives (SSDs), “warm” data on hard disk drives (HDDs) and “cold” data, which is retrieved infrequently or not at all, on magnetic tapes. SSDs and HDDs have a lower storage density and require more power and maintenance than magnetic tape, which leads to the “hotness” of the data being proportional to the price of storage per bit (DDSA, 2021).
The present storage technologies are struggling to keep up with the data volumes of the digital world (Ceze, Nivala and Strauss, 2019; DDSA, 2021). The primary issue is that improvements in storage density are being outcompeted by the exponential growth of data (DDSA, 2021). Secondly, storage maintenance is becoming an increasing problem. The physical state of all media degrades over time, meaning that they must be checked periodically and re-written to ensure data integrity (every 3-5 years for SSDs and HDDs, and every 7-10 years for tape) (Bornholt et al., 2016; DDSA, 2021). In addition, most digital storage technologies eventually become obsolete as new and better storage devices are developed. In parallel, technologies for reading data from obsolete devices also disappear over time, which risks the data being lost and necessitates rewriting it onto new devices. Furthermore, there are significant sustainability concerns that must be addressed. Some estimates suggest that data centres will account for up to 13% of the global energy consumption by 2030. In addition, several media currently on the market rely on rare metals, such as silver, gold, platinum and palladium, whose mining is unsustainable and often unethical (DDSA, 2021).
3. DNA-based digital data storage
The idea of using DNA for digital data storage dates back to the 1960s, and was first tested experimentally in 1981 (Ceze, Nivala and Strauss, 2019). After significant breakthroughs in the 2010s, a boost in the direction of making DNA data storage a reality came in 2020 with the formation of the DNA Data Storage Alliance (DDSA) (Vitak, 2021). The DDSA was formed as a collaboration of major companies in computing and biotechnology, such as Microsoft and Illumina, with the goal of building an ecosystem around the technology (DDSA, 2021). The association consists of over 30 members, including computing and biotechnology companies, research labs and foundations. As the body of research on why and how DNA-based data storage can be achieved grows, the hope is that the technology will gain momentum and shortly become commercialised.
3.1 Why DNA is an attractive alternative
The appeal of DNA as a storage medium comes from its potential to overcome the main challenges of present technologies. Firstly, DNA provides a significantly higher storage density than currently available technology. In its newly published white paper, the DDSA estimates that even considering practical system overheads, such as error correction codes and physical redundancy of the DNA, the storage density of DNA is as high as 9Tb per mm3 (DDSA, 2021). In the volume of one LTO-9 tape (235,000 mm3), which provides one of the highest storage densities currently on the market with 18Tb per cassette, DNA could hold up to 2,000,000Tb (figure 2) (DDSA, 2021; Ultrium LTO, n.d.).
Figure 2 – Schematic representation of the storage density of DNA compared to an LTO-9 cassette. 1mm3 of DNA holds 9Tb of encoded data, the same as half an LTO-9 tape, and provides a storage density which is 115,000 times the density of the tape (DDSA, 2021).
Thus, DNA-based storage has a storage density approximately 115,000 times higher than current state-of-the-art archival storage. Moreover, DNA is significantly more durable. Half-life simulations of single-stranded DNA encoding data stored at room temperature showed a 99.99% readability after 100 years at strand copy numbers as low as 10 (Bornholt et al., 2016). Optimal storage at lower temperatures would further increase the half-life. Additionally, DNA is unlikely to ever become obsolete as it encodes the information that makes known life possible. Lastly, there is no need for rare metals in DNA-based storage and energy consumption is minimal compared to current storage methods (DDSA, 2021). Hence, DNA could be a durable storage medium with high storage density that requires minimal maintenance and resources, making it an attractive alternative to current storage technologies.
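The density comparison in section 3.1 can be reproduced as a back-of-the-envelope calculation. The sketch below uses only the figures quoted in the text; the constant names are illustrative. With these inputs the ratio comes out at about 117,500, close to the quoted 115,000, and the small difference presumably reflects rounding in the source figures.

```python
# Figures quoted in the text (DDSA, 2021; Ultrium LTO, n.d.).
DNA_DENSITY_TB_PER_MM3 = 9   # DNA density after practical system overheads
LTO9_VOLUME_MM3 = 235_000    # volume of one LTO-9 cassette
LTO9_CAPACITY_TB = 18        # capacity of one LTO-9 cassette

# Data that DNA could hold in the volume of one cassette.
dna_capacity_tb = DNA_DENSITY_TB_PER_MM3 * LTO9_VOLUME_MM3

# Volumetric density of the tape, and the DNA-to-tape density ratio.
tape_density_tb_per_mm3 = LTO9_CAPACITY_TB / LTO9_VOLUME_MM3
density_ratio = DNA_DENSITY_TB_PER_MM3 / tape_density_tb_per_mm3

print(f"DNA in one cassette volume: {dna_capacity_tb:,} TB")
print(f"DNA-to-tape density ratio: {density_ratio:,.0f}")
```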
3.2 The DNA storage pipeline
The process of transitioning from digital data to DNA and vice versa can be divided into four main stages, namely writing, storage, retrieval and reading (steps 1-4, 5-9, 10-18 and 19-22 in figure 3, respectively) (Ceze, Nivala and Strauss, 2019). Here, the focus will be on writing, retrieval and reading, but it is important to mention that the storage architecture is also crucial to enable upscaling of DNA data storage.
Figure 3 – Process overview of the DNA storage pipeline. Steps 1-4 represent the writing of bits to DNA bases, steps 5-9 – the storage of physical DNA, steps 10-18 – the preparation and retrieval of the wanted sequence, and steps 19-22 – the reading of bases back to bits (Meiser et al., 2019)
3.2.1 Writing – encoding and synthesis
The first step in storing digital data in DNA is to build a system to go from binary ‘bits’ (1s and 0s) that the computer can read to DNA bases (A, T, C and G) (steps 1-2, figure 3). First, a DNA-encoding scheme must be decided. One approach is single base coding, where each base represents 2 bits (for instance A=00, T=11, G=10 and C=01). A second approach involves creating a standardised “alphabet” of short DNA sequences, oligonucleotides, where each oligonucleotide “letter” represents a piece of binary code based on a standard decided by the encoding scheme (DDSA, 2021). Error correction codes (ECCs) and a system for correctly aligning several physical DNA fragments must also be included (DDSA, 2021; Organick et al., 2020).
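Single base coding can be illustrated with a short sketch. It uses the example mapping given above (A=00, T=11, G=10, C=01); the function names `encode` and `decode` are hypothetical, and the sketch deliberately leaves out ECCs and fragment indexing, which a practical scheme would need.

```python
# Mapping taken from the example in the text: A=00, T=11, G=10, C=01.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a DNA sequence, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Convert a DNA sequence produced by encode() back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(message)
print(strand)  # CACACATGCAAC
assert decode(strand) == message
```

Each byte becomes four bases, so the three-byte message above yields a 12-base strand.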
The second step of writing is to produce the DNA molecule (steps 3-4, figure 3). Nearly all synthetic DNA is produced using a cyclic, chemical process called the phosphoramidite process (Sandahl et al., 2021). This process has been used for decades in life science, but a far more efficient and cost-effective synthesis technology is needed to enable large scale DNA-based data storage applicable for commercial use. Furthermore, the phosphoramidite process is problematic as some of the by-products are harmful chemicals (Ceze, Nivala and Strauss, 2019).
Research focused on enzymatic DNA synthesis is now emerging (Lee et al., 2019; Lee et al., 2020). Enzymatic DNA synthesis involves a type of DNA polymerase which does not require a template strand combined with strategies to control polymerisation (for instance cycles of chemical de-blocking of the bases followed by annealing to the growing DNA strand) to ensure correct base incorporation (DDSA, 2021; Lee et al., 2019; Lee et al., 2020). Enzymatic synthesis produces less waste and is predicted to be easier to speed up and automate than the phosphoramidite process, but it is still in its infancy (Ceze, Nivala and Strauss, 2019). Nevertheless, successful use of enzymatic synthesis in DNA data storage has been demonstrated (Lee et al., 2020). Regardless of which synthesis method is used, it is crucial to build a fully automated, parallel synthesis architecture to meet the speed and cost requirements of the large-scale production needed for data storage.
3.2.2 Retrieval
To make large scale DNA data storage a reality, it is crucial to ensure that the data can be retrieved correctly when needed (Ceze, Nivala and Strauss, 2019). Firstly, to achieve an acceptable error rate, two characteristics are central: physical redundancy of the DNA molecule and ECCs. One copy of each sequence would theoretically provide the highest data density, but this would result in unreliable recovery and high chances of losing data. There is, however, an upper limit where an increase in copy number does not provide a notable retrieval benefit and will only take up unnecessary space (Bornholt et al., 2016; Organick et al., 2020). Organick et al. (2020) showed that reliable data retrieval could be achieved with as few as 10 sequence copies in combination with ECCs.
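The benefit of physical redundancy can be illustrated with a toy simulation. The sketch below makes several strong assumptions: errors are independent base substitutions at a made-up rate of 5%, insertions and deletions are ignored, and a simple per-position majority vote stands in for the ECCs used in the cited studies. All names and parameters are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_copy(strand: str, error_rate: float, rng: random.Random) -> str:
    """Return a copy of strand with random substitution errors."""
    out = []
    for base in strand:
        if rng.random() < error_rate:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(reads: list[str]) -> str:
    """Majority vote at each position across equal-length reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


rng = random.Random(0)  # fixed seed so that runs are repeatable
original = "".join(rng.choice(BASES) for _ in range(60))
for copies in (1, 3, 10):
    reads = [noisy_copy(original, 0.05, rng) for _ in range(copies)]
    mismatches = sum(a != b for a, b in zip(consensus(reads), original))
    print(f"{copies:2d} copies -> {mismatches} mismatched bases after consensus")
```

Under these assumptions, a wrong base only survives the vote when several copies carry the same error at the same position, which becomes increasingly unlikely as the copy number grows.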
The second crucial aspect of data retrieval is the enabling of so-called “random access”, which means being able to selectively access a desired file out of the data pool without having to read the whole pool (Ceze, Nivala and Strauss, 2019; Organick et al., 2020). Selective extraction of DNA fragments is common practice in molecular biology where primer pairs unique to a specific DNA sequence are used during PCR amplification. This technique has also been used in random access retrieval of DNA encoded data (Ceze, Nivala and Strauss, 2019; Organick et al., 2020). Another approach involves using magnetic beads and sequences that include a unique identifier for each data piece (Ceze, Nivala and Strauss, 2019). A key challenge of random access retrieval in DNA-based storage is to create a storage architecture that can store enough DNA in one storage pool to include significant quantities of data, while at the same time being able to recognise the correct fragment quickly without the lag that comes with biological systems, for instance, due to the time it takes for enzymatic recognition and binding (Ceze, Nivala and Strauss, 2019).
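Primer-based random access can be sketched as a lookup over a pool of tagged strands. In the toy model below, string matching stands in for PCR amplification, and strand orientation, complementarity and primer cross-hybridisation are all ignored. The primer sequences, file names and function names are invented for illustration.

```python
# Hypothetical primer pairs, one per stored file. Real primers would need to be
# designed to avoid cross-hybridisation, which this sketch does not model.
PRIMERS = {
    "file_a": ("ACGTACGTAC", "TTGACCTTGA"),
    "file_b": ("GGATCCGGAT", "CATGCATGCA"),
}


def tag(file_id: str, payload: str) -> str:
    """Flank a payload with the primer pair assigned to file_id."""
    forward, reverse = PRIMERS[file_id]
    return forward + payload + reverse


def retrieve(pool: list[str], file_id: str) -> list[str]:
    """Return the payloads of strands flanked by the primers of file_id."""
    forward, reverse = PRIMERS[file_id]
    return [
        strand[len(forward):len(strand) - len(reverse)]
        for strand in pool
        if strand.startswith(forward) and strand.endswith(reverse)
    ]


# A mixed pool holding fragments of two files.
pool = [
    tag("file_a", "CACACATG"),
    tag("file_b", "AAAATTTT"),
    tag("file_a", "CAACCAAC"),
]
print(retrieve(pool, "file_b"))  # ['AAAATTTT']
```

Only strands carrying the requested primer pair are returned, so the rest of the pool never has to be read.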
3.2.3 Reading – sequencing and decoding
Once the correct DNA fragment has been retrieved, the next step is to read the physical DNA molecule and convert the sequence back into binary code. DNA sequencing is already a fundamental part of life science and is further developed than DNA synthesis technology. Commercially available techniques can be broadly divided into two categories: sequencing by synthesis and nanopore sequencing (DDSA, 2021). Nanopore sequencing, commercialised by Oxford Nanopore, deciphers the sequence by interpreting the fluctuations in the electric current as the DNA strand moves through a nanopore. Sequencing by synthesis has been heavily commercialised by Illumina and is based on reading the DNA sequence while copying the fragment using specially designed nucleotides that emit a detectable signal when they are incorporated into the complementary DNA strand (Ceze, Nivala and Strauss, 2019). A major advantage of nanopore sequencing is that it gives real-time readouts as the sequence moves through the pore. However, it is subject to a significantly higher error rate than sequencing by synthesis, although this could largely be neutralised by ECCs (Ceze, Nivala and Strauss, 2019). After sequencing, the base sequence is decoded into bits, and errors are corrected based on the coding scheme used.
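How an ECC lets the decoder repair a read error can be shown with a very small example. The sketch below uses a Hamming(7,4) code, chosen here only because it is compact: the text does not specify which ECC is used, and the cited studies may well rely on different, stronger codes. It corrects a single flipped bit per 7-bit block; the function names are hypothetical.

```python
def hamming74_encode(data_bits: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> list[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    # Each syndrome bit re-checks one parity group.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 0 means no error was detected
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
print(codeword)  # [0, 1, 1, 0, 0, 1, 1]

corrupted = list(codeword)
corrupted[4] ^= 1  # simulate a single read error
assert hamming74_decode(corrupted) == data
```

The price of this protection is overhead: seven bits are stored for every four bits of data, which is one reason practical density estimates fall below the theoretical maximum.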
3.3 The current state of DNA-based digital data storage
Using DNA as a digital storage medium can help solve the data storage problem. The primary challenges for commercial use of DNA as a storage medium are cost and efficiency. The main bottleneck is the writing step, as the current DNA synthesis methods are simply too slow and expensive for the technology to be commercially feasible. However, even though it is essential to keep the error rate low, the technology does not need to be as accurate as when used in life science, largely thanks to ECCs, and this tolerance can facilitate upscaling for data storage applications (Ceze, Nivala and Strauss, 2019). Additionally, a higher cost of writing and reading DNA compared to current media could be compensated by a comparably low DNA maintenance cost (DDSA, 2021). The supply chain is also shorter than for other media as there is no need to first produce a physical device, such as an empty disk or tape, and then fill it with data, since the encoding and physical production happen simultaneously during DNA synthesis. Several proof-of-concept studies going from bits to bases and back to bits have been published, in which research groups have used DNA to encode different kinds of media ranging from the Mario Bros theme song and the American constitution to an episode of the Netflix show “Biohackers” and even instructions for a 3D printed object (Vitak, 2021). The initial use of DNA as a storage medium is predicted to be archival storage, and the Boston-based company CATALOG already offers DNA-based archival storage commercially (CATALOG, 2021).
To conclude, DNA could provide a storage density several orders of magnitude higher than that of data storage technologies currently on the market, with high long-term readability, while requiring close to no maintenance and significantly fewer resources (Bornholt et al., 2016; Ceze, Nivala and Strauss, 2019). Even though unforeseen challenges might arise that reduce the storage density and durability, DNA as a data storage medium still appears to be superior to current media (Bornholt et al., 2016; DDSA, 2021). However, it is crucial to advance the technology and lower its cost if digital storage in DNA is to be commercialised on a large scale. The formation of the DDSA could be of huge importance for the field, standardising and accelerating the development of the technology. Current storage technologies are unlikely to become obsolete in the near future, but the addition of a new storage medium such as DNA is needed to keep up with demand and to lower the environmental impact of storing the ever-growing masses of data produced.
Bornholt, J., Lopez, R., Carmean, D., Ceze, L., Seelig, G. and Strauss, K., 2016. A DNA-Based Archival Storage System. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems.
Catalogdna. 2021. Catalog Homepage. [online] Available at: <https://www.catalogdna.com/> [Accessed 7 August 2021].
Ceze, L., Nivala, J. and Strauss, K., 2019. Molecular digital data storage using DNA. Nature Reviews Genetics, 20(8), pp.456-466.
DNA Data Storage Alliance (DDSA), 2021. Preserving our digital legacy: An introduction to DNA data storage. [online] DDSA. Available at: <https://dnastoragealliance.org/dev/wp-content/uploads/2021/06/DNA-Data-Storage-Alliance-An-Introduction-to-DNA-Data-Storage.pdf> [Accessed 6 August 2021].
Dnastoragealliance.org. 2021. Official site of the DNA Data Storage Alliance. [online] Available at: <https://dnastoragealliance.org/dev/> [Accessed 6 August 2021].
Holst, A., 2021. Total data volume worldwide 2010-2025 | Statista. [online] Statista. Available at: <https://www.statista.com/statistics/871513/worldwide-data-created/> [Accessed 9 August 2021].
Lee, H., Kalhor, R., Goela, N., Bolot, J. and Church, G., 2019. Terminator-free template-independent enzymatic DNA synthesis for digital information storage. Nature Communications, 10(1).
Lee, H., Wiegand, D., Griswold, K., Punthambaker, S., Chun, H., Kohman, R. and Church, G., 2020. Photon-directed multiplexed enzymatic DNA synthesis for molecular digital data storage. Nature Communications, 11(1).
Meiser, L., Antkowiak, P., Koch, J., Chen, W., Kohll, A., Stark, W., Heckel, R. and Grass, R., 2019. Reading and writing digital data in DNA. Nature Protocols, 15(1), pp.86-101.
Organick, L., Chen, Y., Dumas Ang, S., Lopez, R., Liu, X., Strauss, K. and Ceze, L., 2020. Probing the physical limits of reliable DNA data retrieval. Nature Communications, 11(1).
Sandahl, A., Nguyen, T., Hansen, R., Johansen, M., Skrydstrup, T. and Gothelf, K., 2021. On-demand synthesis of phosphoramidites. Nature Communications, 12(1).
Ultrium LTO. n.d. LTO-9 · New LTO Generation 9 Specifications | LTO Ultrium. [online] Available at: <https://www.lto.org/lto-9/> [Accessed 7 August 2021].
Vitak, S., 2021. Technology alliance boosts efforts to store data in DNA. Nature.