6. GPU Acceleration

6.1. Overview

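This page describes GPU computing with Amber.
On TSUBAME4, each compute node has four NVIDIA H100 GPUs, so users can benefit from GPU acceleration.
A growing number of applications support GPU computing, and Amber is one of them.
PMEMD can use NVIDIA GPUs to accelerate explicit-solvent PME simulations in the NVE, NVT, and NPT ensembles, as well as Generalized Born simulations, which use a continuum dielectric (implicit solvent) model. Multi-GPU runs are also supported.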

Most of the information in this chapter is quoted and translated from the following URL. For more detail, or to check the latest information, see the ambermd.org page.

http://ambermd.org/GPUSupport.php

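GPU acceleration brings large benefits, but ambermd.org notes that the technology is still relatively young and that users should be careful. If you run into a problem, run the equivalent simulation on the CPU to determine whether the problem lies in the simulation setup.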
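The CPU cross-check described above can be scripted. The following is a minimal sketch, assuming `pmemd.cuda` and `pmemd` are both on the PATH; the input names `mdin`, `prmtop`, and `inpcrd` are placeholders, and the output file names and the `tag` parameter are hypothetical. It only builds the two command lines, so that everything except the executable and the output names is identical.

```python
import shlex


def pmemd_command(executable, tag, mdin="mdin", prmtop="prmtop", inpcrd="inpcrd"):
    """Build a pmemd-style command line; output names (hypothetical) carry the tag."""
    return [
        executable, "-O",        # overwrite existing output files
        "-i", mdin,              # control input
        "-p", prmtop,            # topology
        "-c", inpcrd,            # starting coordinates
        "-o", f"{tag}.out",      # mdout
        "-r", f"{tag}.rst",      # restart file
        "-x", f"{tag}.nc",       # trajectory
    ]


gpu_cmd = pmemd_command("pmemd.cuda", tag="gpu")
cpu_cmd = pmemd_command("pmemd", tag="cpu")

# Print the commands instead of running them; both use the same inputs.
print(shlex.join(gpu_cmd))
print(shlex.join(cpu_cmd))
```

Either list can be passed to `subprocess.run` on a compute node. Keeping `nstlim` small in the control input keeps the CPU run short.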

6.2. Differences between pmemd.cuda and pmemd

6.2.1. Functional differences

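Explicit-solvent PME simulations in the NVE, NVT, and NPT ensembles and implicit-solvent Generalized Born simulations are designed to behave almost identically to the standard pmemd, which does not use GPUs.
There are some limitations, however; for details, see the ambermd.org page given in the overview. As a precaution, we also recommend running a short simulation on the CPU and checking that the Ewald error estimate is a reasonable value.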
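The Ewald error estimate check can be automated with a small parser. The following is a minimal sketch, assuming the CPU mdout prints lines of the form `Ewald error estimate:   0.1234E-03`; the sample text and the `limit` argument are illustrative only, and what counts as a reasonable value depends on the system and the PME settings.

```python
import re

# Assumed line format in a CPU mdout: " Ewald error estimate:   0.1234E-03"
EWALD_RE = re.compile(r"Ewald error estimate:\s*([-+]?\d*\.?\d+(?:[Ee][-+]?\d+)?)")


def ewald_error_estimates(mdout_text):
    """Return every Ewald error estimate in the text, in order of appearance."""
    return [float(m.group(1)) for m in EWALD_RE.finditer(mdout_text)]


def exceeds_limit(mdout_text, limit):
    """True if any estimate is above the caller-supplied limit (no default on purpose)."""
    return any(value > limit for value in ewald_error_estimates(mdout_text))


# Made-up excerpt for illustration only; not taken from a real run.
SAMPLE = """
 NSTEP =       50 ...
 Ewald error estimate:   0.1234E-03
 NSTEP =      100 ...
 Ewald error estimate:   0.9870E-04
"""

values = ewald_error_estimates(SAMPLE)
print(values, exceeds_limit(SAMPLE, limit=1.0e-3))
```

The limit is left to the caller because this chapter does not specify a numeric threshold.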

6.2.2. Differences in output

The output file format differs in a few places. Compared with pmemd, the output of the GPU-enabled pmemd.cuda differs mainly in the following four points.

GPU authorship information is printed
GPU device information is printed
CUDA is listed under Conditional Compilation Defines Used
The Ewald error estimate is not printed
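The four differences listed above can be checked programmatically. The following is a minimal sketch, assuming the marker strings match those in the sample output in this section and that the compile-time defines are printed one per line, each prefixed with `|`, after the `Conditional Compilation Defines Used` header. The function names and the shortened sample texts are hypothetical.

```python
def cuda_in_compile_defines(text):
    """True if CUDA appears in the block after the compile-defines header.

    Assumes one define per line, each prefixed with '|'.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if "Conditional Compilation Defines Used" in line:
            for follow in lines[i + 1:]:
                token = follow.strip().lstrip("|").strip()
                if not token:          # blank line ends the block
                    break
                if token == "CUDA":
                    return True
            return False
    return False


def gpu_output_markers(text):
    """Report which of the four GPU-specific traits an mdout shows."""
    return {
        "gpu_banner": "GPU (CUDA) Version of PMEMD in use" in text,
        "gpu_device_info": "GPU DEVICE INFO" in text,
        "cuda_define": cuda_in_compile_defines(text),
        "no_ewald_error_estimate": "Ewald error estimate" not in text,
    }


# Shortened, illustrative excerpts (not complete mdout files).
GPU_SAMPLE = """| Conditional Compilation Defines Used:
| PUBFFT
| CUDA

| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
|------------------- GPU DEVICE INFO --------------------
"""
CPU_SAMPLE = """| Conditional Compilation Defines Used:
| PUBFFT

 Ewald error estimate:   0.1234E-03
"""

print(gpu_output_markers(GPU_SAMPLE))
print(gpu_output_markers(CPU_SAMPLE))
```

An output file for which all four entries are true was very likely written by pmemd.cuda; one for which all four are false was very likely written by the CPU pmemd.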

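For reference, the listing below compares a GPU run and a CPU run of the same system using the sdiff command. The two input files differ only in their title line, run length, and output intervals (nstlim, ntpr, ntwx, ntwr).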

[ Left: GPU run / Right: CPU run ]

          ---------------------------------------------------             ---------------------------------------------------
          Amber 24 PMEMD                              2024                Amber 24 PMEMD                              2024
          ---------------------------------------------------             ---------------------------------------------------

| PMEMD implementation of SANDER, Release 24                    | PMEMD implementation of SANDER, Release 24

|  Compiled date/time: Wed Apr  2 09:19:03 2025                 |  Compiled date/time: Wed Apr  2 09:19:03 2025
| Run on 06/03/2025 at 16:02:17                               | | Run on 06/03/2025 at 16:05:03

|   Executable path: /apps/t4/rhel9/isv/amber/24up03_ambertoo | |   Executable path: /apps/t4/rhel9/isv/amber/24up03_ambertoo
| Working directory:                                          | | Working directory: 
|          Hostname: r10n7                                      |          Hostname: r10n7

  [-O]verwriting output                                           [-O]verwriting output

File Assignments:                                               File Assignments:
|   MDIN: mdin.GPU                                            | |   MDIN: mdin.CPU
|  MDOUT: md.out                                                |  MDOUT: md.out
| INPCRD: inpcrd                                                | INPCRD: inpcrd
|   PARM: prmtop                                                |   PARM: prmtop
| RESTRT: md.r                                                  | RESTRT: md.r
|   REFC: refc                                                  |   REFC: refc
|  MDVEL: mdvel                                                 |  MDVEL: mdvel
|   MDEN: md.e                                                  |   MDEN: md.e
|  MDCRD: md.x                                                  |  MDCRD: md.x
| MDINFO: mdinfo                                                | MDINFO: mdinfo
|LOGFILE: logfile                                               |LOGFILE: logfile
|  MDFRC: mdfrc                                                 |  MDFRC: mdfrc


 Here is the input file:                                         Here is the input file:

 Production MD NVE (note, this has bad energy conservation)   |  Typical Production MD NVE with
                                                              >  GOOD energy conservation.
 &cntrl                                                          &cntrl
   ntx=5, irest=1,                                                 ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,                                     ntc=2, ntf=2, tol=0.000001,
   nstlim=100000,                                             |    nstlim=1000,
   ntpr=2500, ntwx=2500,                                      |    ntpr=50, ntwx=50,
   ntwr=100000,                                               |    ntwr=1000,
   dt=0.004, cut=9.,                                               dt=0.004, cut=9.,
   ntt=0, ntb=1, ntp=0,                                            ntt=0, ntb=1, ntp=0,
   ioutfm=1,                                                       ioutfm=1,
 &end                                                            &end


Note: ig = -1. Setting random seed to   182685 based on wallc | Note: ig = -1. Setting random seed to   305290 based on wallc
      microseconds and disabling the synchronization of rando         microseconds and disabling the synchronization of rando
      between tasks to improve performance.                           between tasks to improve performance.
| irandom = 1, using AMBER's internal random number generator   | irandom = 1, using AMBER's internal random number generator

|--------------------- INFORMATION ----------------------     <
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.      <
|                    Version 18.0.0                           <
|                                                             <
|                      03/25/2018                             <
|                                                             <
| Implementation by:                                          <
|                    Ross C. Walker     (SDSC)                <
|                    Scott Le Grand     (nVIDIA)              <
|                                                             <
| Version 18 performance extensions by:                       <
|                    David Cerutti     (Rutgers)              <
|                                                             <
| Precision model in use:                                     <
|      [SPFP] - Single Precision Forces, 64-bit Fixed Point   <
|               Accumulation. (Default)                       <
|                                                             <
|--------------------------------------------------------     <

<< omitted >>

|------------------- GPU DEVICE INFO --------------------     <
|                                                             <
|                         Task ID:      0                     <
|            CUDA_VISIBLE_DEVICES: not set                    <
|   CUDA Capable Devices Detected:      4                     <
|           CUDA Device ID in use:      0                     <
|                CUDA Device Name: NVIDIA H100                <
|     CUDA Device Global Mem Size:  95330 MB                  <
| CUDA Device Num Multiprocessors:    132                     <
|           CUDA Device Core Freq:   1.98 GHz                 <
|                                                             <
|                                                             <
|                         Task ID:      1                     <
|            CUDA_VISIBLE_DEVICES: not set                    <
|   CUDA Capable Devices Detected:      4                     <
|           CUDA Device ID in use:      1                     <
|                CUDA Device Name: NVIDIA H100                <
|     CUDA Device Global Mem Size:  95330 MB                  <
| CUDA Device Num Multiprocessors:    132                     <
|           CUDA Device Core Freq:   1.98 GHz                 <
|                                                             <
|                                                             <
|                         Task ID:      2                     <
|            CUDA_VISIBLE_DEVICES: not set                    <
|   CUDA Capable Devices Detected:      4                     <
|           CUDA Device ID in use:      2                     <
|                CUDA Device Name: NVIDIA H100                <
|     CUDA Device Global Mem Size:  95330 MB                  <
| CUDA Device Num Multiprocessors:    132                     <
|           CUDA Device Core Freq:   1.98 GHz                 <
|                                                             <
|                                                             <
|                         Task ID:      3                     <
|            CUDA_VISIBLE_DEVICES: not set                    <
|   CUDA Capable Devices Detected:      4                     <
|           CUDA Device ID in use:      3                     <
|                CUDA Device Name: NVIDIA H100                <
|     CUDA Device Global Mem Size:  95330 MB                  <
| CUDA Device Num Multiprocessors:    132                     <
|           CUDA Device Core Freq:   1.98 GHz                 <
|                                                             <
|--------------------------------------------------------     <
                                                              <
|---------------- GPU PEER TO PEER INFO -----------------     <
|                                                             <
|   Peer to Peer support: ENABLED                             <
|                                                             <
|   NCCL support: DISABLED                                    <
|                                                             <
|--------------------------------------------------------     <
                                                              <
| INFO:    Axis order optimization will be used.                | INFO:    Axis order optimization will be used.

<< omitted >>

|  Final Performance Info:                                      |  Final Performance Info:
|     -----------------------------------------------------     |     -----------------------------------------------------
|     Average timings for last   97500 steps:                 | |     Average timings for last     200 steps:
|     Elapsed(s) =      33.94 Per Step(ms) =       0.35       | |     Elapsed(s) =      16.66 Per Step(ms) =      83.29
|         ns/day =     992.82   seconds/ns =      87.02       | |         ns/day =       4.15   seconds/ns =   20822.53
|                                                               |
|     Average timings for all steps:                            |     Average timings for all steps:
|     Elapsed(s) =      34.81 Per Step(ms) =       0.35       | |     Elapsed(s) =      82.75 Per Step(ms) =      82.75
|         ns/day =     992.79   seconds/ns =      87.03       | |         ns/day =       4.18   seconds/ns =   20686.58
|     -----------------------------------------------------     |     -----------------------------------------------------

|  Master Setup CPU time:            4.10 seconds             | |  Master Setup CPU time:            1.49 seconds
|  Master NonSetup CPU time:        34.58 seconds             | |  Master NonSetup CPU time:        82.24 seconds
|  Master Total CPU time:           38.67 seconds     0.01 ho | |  Master Total CPU time:           83.73 seconds     0.02 ho
                                                              |
|  Master Setup wall time:           9    seconds             | |  Master Setup wall time:           4    seconds
|  Master NonSetup wall time:       35    seconds             | |  Master NonSetup wall time:       82    seconds
|  Master Total wall time:          44    seconds     0.01 ho | |  Master Total wall time:          86    seconds     0.02 ho
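The performance figures at the end of the listing can be compared directly. The following is a minimal sketch, assuming the `ns/day` value of interest is the first one after `Average timings for all steps:`. The excerpts are copied from the listing above (one side each), and the function name is hypothetical.

```python
import re

NS_PER_DAY_RE = re.compile(r"ns/day\s*=\s*([0-9.]+)")


def overall_ns_per_day(mdout_text):
    """Return ns/day from the 'Average timings for all steps' block, or None."""
    start = mdout_text.find("Average timings for all steps:")
    if start < 0:
        return None
    match = NS_PER_DAY_RE.search(mdout_text, start)  # search from the marker onward
    return float(match.group(1)) if match else None


GPU_TAIL = """|     Average timings for last   97500 steps:
|     Elapsed(s) =      33.94 Per Step(ms) =       0.35
|         ns/day =     992.82   seconds/ns =      87.02
|
|     Average timings for all steps:
|     Elapsed(s) =      34.81 Per Step(ms) =       0.35
|         ns/day =     992.79   seconds/ns =      87.03
"""
CPU_TAIL = """|     Average timings for last     200 steps:
|     Elapsed(s) =      16.66 Per Step(ms) =      83.29
|         ns/day =       4.15   seconds/ns =   20822.53
|
|     Average timings for all steps:
|     Elapsed(s) =      82.75 Per Step(ms) =      82.75
|         ns/day =       4.18   seconds/ns =   20686.58
"""

gpu = overall_ns_per_day(GPU_TAIL)
cpu = overall_ns_per_day(CPU_TAIL)
speedup = gpu / cpu
print(f"GPU {gpu} ns/day, CPU {cpu} ns/day, ratio {speedup:.1f}")
```

For the listing above, the ratio is about 237. The GPU side reports four GPU tasks and the listing does not show how many CPU processes the right-hand run used, so the ratio is indicative only.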