Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 56 tok/s
Gemini 2.5 Pro 39 tok/s Pro
GPT-5 Medium 15 tok/s Pro
GPT-5 High 16 tok/s Pro
GPT-4o 99 tok/s Pro
Kimi K2 155 tok/s Pro
GPT OSS 120B 476 tok/s Pro
Claude Sonnet 4 38 tok/s Pro
2000 character limit reached

Gravitational octree code performance evaluation on Volta GPU (1811.02761v1)

Published 7 Nov 2018 in cs.MS, astro-ph.IM, cs.PF, and physics.comp-ph

Abstract: In this study, the gravitational octree code originally optimized for the Fermi, Kepler, and Maxwell GPU architectures is adapted to the Volta architecture. The Volta architecture introduces independent thread scheduling requiring either the insertion of the explicit synchronizations at appropriate locations or the enforcement of the same implicit synchronizations as do the Pascal or earlier architectures by specifying \texttt{-gencode arch=compute_60,code=sm_70}. The performance measurements on Tesla V100, the current flagship GPU by NVIDIA, revealed that the $N$-body simulations of the Andromeda galaxy model with $2{23} = 8388608$ particles took $3.8 \times 10{-2}$~s or $3.3 \times 10{-2}$~s per step for each case. Tesla V100 achieves a 1.4 to 2.2-fold acceleration in comparison with Tesla P100, the flagship GPU in the previous generation. The observed speed-up of 2.2 is greater than 1.5, which is the ratio of the theoretical peak performance of the two GPUs. The independence of the units for integer operations from those for floating-point number operations enables the overlapped execution of integer and floating-point number operations. It hides the execution time of the integer operations leading to the speed-up rate above the theoretical peak performance ratio. Tesla V100 can execute $N$-body simulation with up to $25 \times 2{20} = 26214400$ particles, and it took $2.0 \times 10{-1}$~s per step. It corresponds to $3.5$~TFlop/s, which is 22\% of the single-precision theoretical peak performance.

Citations (1)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-Up Questions

We haven't generated follow-up questions for this paper yet.

Authors (1)