Two-Stage Block Orthogonalization to Improve Performance of $s$-step GMRES (2402.15033v1)

Published 23 Feb 2024 in math.NA, cs.DC, and cs.NA

Abstract: On current computer architectures, the performance of GMRES can be limited by the communication cost of generating orthonormal basis vectors for the Krylov subspace. To address this bottleneck, the $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, with a large step size $s$, the solver can generate extremely ill-conditioned basis vectors, so in practice a conservatively small step size is used to maintain stability, which limits the performance of the $s$-step solver. To improve performance with a small step size, in this paper we introduce a two-stage block orthogonalization scheme. Like the original scheme, the first stage of the proposed method operates on a block of $s$ basis vectors at a time, but its objective is to keep the generated basis vectors well-conditioned at a lower cost. Full orthogonalization of the basis vectors is delayed until the second stage, once enough basis vectors have been generated to obtain higher performance. Our analysis shows that the proposed two-stage scheme is stable. Performance improves because, while the same amount of computation is required as in the original scheme, most of the communication is performed in the second stage, reducing the overall communication requirements. Our performance results with up to 192 NVIDIA V100 GPUs on the Summit supercomputer demonstrate that, when solving a 2D Laplace problem, the two-stage approach reduces the orthogonalization time and the total time-to-solution by factors of up to $2.6\times$ and $1.6\times$, respectively, over the original $s$-step GMRES, which itself had achieved speedups of $2.1\times$ and $1.8\times$ over standard GMRES. Similar speedups were obtained for 3D problems and for matrices from the SuiteSparse Matrix Collection.
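The two-stage idea described in the abstract can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's algorithm: stage 1 conditions each $s$-vector block cheaply (here via Cholesky QR, a communication-light kernel used in this literature), and stage 2 performs one delayed, batched orthogonalization of the accumulated basis (here a dense QR stands in for the communication-heavy global step). The function names and block sizes are hypothetical.

```python
import numpy as np

def cholqr(V):
    # Cholesky QR: one Gram-matrix product and one triangular solve.
    # Cheap in communication, but squares the condition number of V,
    # which is why stage 1 only targets well-conditioning, not full
    # orthogonality of the whole basis.
    G = V.T @ V
    R = np.linalg.cholesky(G).T      # G = R^T R, R upper triangular
    Q = np.linalg.solve(R.T, V.T).T  # Q = V R^{-1}
    return Q, R

def two_stage_orthogonalize(blocks):
    # Stage 1: keep each s-vector block well-conditioned (local, cheap).
    stage1 = [cholqr(B)[0] for B in blocks]
    # Stage 2: delayed orthogonalization of the accumulated basis, done
    # once so most of the communication is batched into a single step.
    V = np.hstack(stage1)
    Q, _ = np.linalg.qr(V)
    return Q

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((100, 4)) for _ in range(3)]  # s = 4
Q = two_stage_orthogonalize(blocks)
assert np.allclose(Q.T @ Q, np.eye(Q.shape[1]))  # orthonormal basis
```

In the actual solver the stage-2 step must also be communication-avoiding and interleaved with the Krylov recurrence; the sketch only shows why delaying the expensive orthogonalization changes where the communication happens, not how much arithmetic is done.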

