
VL-DNA: Enhance DNA Storage Capacity with Variable Payload (Strand) Lengths (2403.14204v1)

Published 21 Mar 2024 in cs.ET

Abstract: DNA storage is a promising archival data storage solution to today's big data problem. A DNA storage system encodes and stores digital data as synthetic DNA sequences and decodes the DNA sequences back to digital data via sequencing. To retrieve target data efficiently, existing Polymerase Chain Reaction (PCR) based DNA storage systems use primers as specific identifiers to tag different sets of DNA strands. However, PCR-based DNA storage systems suffer from primer-payload collisions, which significantly reduce storage capacity. This paper proposes using variable strand lengths, which take advantage of the inherent payload-cutting process, to split collisions and recover primers. The execution time of our scheme is linear in the number of primer-payload collisions. The scheme serves as a post-processing method for any DNA encoding scheme. An evaluation on three state-of-the-art encoding schemes shows that the scheme can recover thousands of usable primers and improve tube capacity by 18.27% to 19x.
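
To make the idea of "splitting collisions" concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a collision is simply an exact occurrence of a primer inside a payload (real systems typically use a similarity threshold rather than exact matching), and the function names `find_collisions` and `split_payload` are hypothetical. It only illustrates how cutting a payload inside each colliding region yields shorter, variable-length fragments that no longer contain the primer.

```python
def find_collisions(payload, primers):
    """Return sorted (start, primer) pairs for every exact primer occurrence.

    Assumption: a "collision" is an exact substring match. This is a
    simplification for illustration, not the paper's definition.
    """
    hits = []
    for primer in primers:
        start = payload.find(primer)
        while start != -1:
            hits.append((start, primer))
            start = payload.find(primer, start + 1)
    return sorted(hits)


def split_payload(payload, primers):
    """Cut the payload inside each primer occurrence.

    Each cut falls strictly inside a match (primers are assumed to have
    length >= 2), so no resulting fragment contains a whole primer.
    """
    cuts = sorted({start + len(primer) // 2
                   for start, primer in find_collisions(payload, primers)})
    bounds = [0] + cuts + [len(payload)]
    return [payload[a:b] for a, b in zip(bounds, bounds[1:])]


payload = "ACGTTTGACCAGTAA"   # toy payload, made up for this example
primers = ["TTGACC"]          # toy primer, made up for this example
fragments = split_payload(payload, primers)
print(fragments)
```

In this sketch, each collision adds one cut, so the splitting work grows with the number of collisions found. A real system would also need to record how the fragments are ordered and reassembled, which the sketch leaves out.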

Authors (5)
  1. Yixun Wei (3 papers)
  2. Wenlong Wang (77 papers)
  3. Huibing Dong (2 papers)
  4. Bingzhe Li (11 papers)
  5. David Du (6 papers)
