
A Combinatorial Perspective on Random Access Efficiency for DNA Storage (2401.15722v2)

Published 28 Jan 2024 in cs.IT, math.CO, and math.IT

Abstract: We investigate the fundamental limits of the recently proposed random access coverage depth problem for DNA data storage. Under this paradigm, it is assumed that the user information consists of $k$ information strands, which are encoded into $n$ strands via a generator matrix $G$. During the sequencing process, the strands are read uniformly at random, as each strand is available in a large number of copies. In this context, the random access coverage depth problem refers to the expected number of reads (i.e., sequenced strands) required to decode a specific information strand requested by the user. This problem heavily depends on the generator matrix $G$, and besides computing the expectation for different choices of $G$, the goal is to construct matrices that minimize the maximum expectation over all possible requested information strands, denoted by $T_{\max}(G)$. In this paper, we introduce new techniques to investigate the random access coverage depth problem, capturing its combinatorial nature and identifying the structural properties of generator matrices that are advantageous. We establish two general formulas to determine $T_{\max}(G)$ for arbitrary generator matrices. The first formula depends on the linear dependencies between columns of $G$, whereas the second formula takes into account recovery sets and their intersection structure. We also introduce the concept of recovery balanced codes and provide three sufficient conditions for a code to be recovery balanced. These conditions can be used to compute $T_{\max}(G)$ for various families of codes, such as MDS, simplex, Hamming, and binary Reed-Muller codes. Additionally, we study the performance of modified systematic MDS and simplex matrices, showing that the best results for $T_{\max}(G)$ are achieved with a specific combination of encoded strands and replication of the information strands.
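The quantity described in the abstract can be illustrated on toy instances. The following is a minimal sketch, not the paper's formulas: it assumes the code is over the binary field, represents each column of $G$ as an integer bitmask (bit $i$ set means information strand $i$ participates), and computes the exact expected number of uniformly random reads until the requested unit vector lies in the span of the columns read so far. The function names `in_span`, `expected_reads`, and `t_max` are hypothetical, and the subset recursion is only practical for small $n$.

```python
from fractions import Fraction
from functools import lru_cache


def in_span(vectors, target):
    """Return True if `target` is a GF(2) combination of `vectors` (int bitmasks)."""
    basis = []
    for v in vectors:
        # Reduce v by the basis: min(v, v ^ b) clears b's leading bit in v if set.
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    for b in basis:
        target = min(target, target ^ b)
    return target == 0


def expected_reads(columns, target):
    """Exact expected number of uniform reads (with replacement) until `target`
    is recoverable from the distinct columns seen so far.

    Assumption: binary code, columns given as bitmasks; exponential in len(columns).
    """
    n = len(columns)
    if not in_span(columns, target):
        raise ValueError("target cannot be recovered from the columns of G")

    @lru_cache(maxsize=None)
    def remaining(seen):
        chosen = [columns[j] for j in range(n) if (seen >> j) & 1]
        if in_span(chosen, target):
            return Fraction(0)
        # One read is spent; with probability |seen|/n nothing new arrives.
        # Solving E[S] = 1 + (|S|/n) E[S] + (1/n) * sum_j E[S + j] for E[S]:
        total = Fraction(n)
        for j in range(n):
            if not (seen >> j) & 1:
                total += remaining(seen | (1 << j))
        return total / (n - bin(seen).count("1"))

    return remaining(0)


def t_max(columns, k):
    """Maximum expected number of reads over the k information strands."""
    return max(expected_reads(columns, 1 << i) for i in range(k))
```

For example, with no coding at all (`columns = [0b001, 0b010, 0b100]`), retrieving any single strand takes 3 reads in expectation, and for the columns `[0b01, 0b10, 0b11]` with `k=2` the call `t_max(columns, 2)` evaluates to 2.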
