Issue causing GPUs to become unavailable during specific workloads

2026-04-06

(2026-04-07 added: Information on the emergency maintenance on Wednesday, April 8.)
(2026-04-08 added: Notice that the emergency maintenance has been completed.)
(2026-04-28 added: Update on the refund of TSUBAME points for jobs affected by the outage.)

We have confirmed an issue on TSUBAME where, when certain GPU-based workloads are executed, the affected GPU enters an error state and becomes unavailable.
We are currently investigating the root cause of this issue.
At this time, the GPU driver, rather than a hardware failure, is considered the most likely cause.
Additionally, re-running a job that has encountered this error may result in GPUs becoming unavailable on a large number of nodes.

We sincerely apologize for the inconvenience.
If a GPU-based job stops during execution, and particularly if the output of nvidia-smi after the error contains ERR! entries such as those shown below, please refrain from resubmitting that job or submitting similar jobs.
We will update this article once this issue has been resolved.

$ nvidia-smi | grep ERR
|ERR!   34C    P0            N/A  /  N/A  |   11637MiB /  95830MiB |     N/A      Default |
|                                         |                        |                 ERR! |
|ERR!   37C    P0            N/A  /  N/A  |   14457MiB /  95830MiB |     N/A      Default |
|                                         |                        |                 ERR! |
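As a rough aid for checking a node after a job stops, the sketch below wraps the same check in a shell function. The function name count_gpu_err_lines is hypothetical and is not provided by TSUBAME. It reads nvidia-smi output from standard input and prints the number of lines that contain ERR!, assuming only that a GPU in the error state is reported with that marker as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not provided by TSUBAME): count the lines of
# nvidia-smi output, read from standard input, that contain "ERR!".
count_gpu_err_lines() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function usable under "set -e".
    grep -c 'ERR!' || true
}

# Usage on a compute node:
#   nvidia-smi | count_gpu_err_lines
```

A result other than 0 suggests that at least one GPU on the node is in the error state. Note that one affected GPU can contribute more than one matching line, so the number is a line count, not a GPU count.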

(2026-04-07 added)

To address this issue, emergency maintenance of the compute nodes will be conducted from 11:00 to 17:00 on April 8.
Please note that compute node reservations will not be available during this period.
(Job submission is still possible; nodes will be released sequentially as maintenance is completed.)
In addition, MPS is currently unavailable because of this issue.
We will announce separately when this restriction will be lifted, once the schedule has been determined.

(2026-04-08 added)

Emergency maintenance on the compute nodes has been completed, and the restriction on MPS has been lifted.
Please note that the GPU driver version has been changed from 590.48.01 to 580.105.08.

Due to this GPU driver version change, applications built using the CUDA Library 13.1.1 provided by TSUBAME may behave differently.
If such applications do not function as expected or if any issues occur, please try rebuilding them.
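Before rebuilding, it may help to confirm which driver a node is actually running. The following is a minimal sketch: the helper name driver_matches_notice and the variable EXPECTED_DRIVER are hypothetical, and the expected value is simply the version stated in this notice. The query in the usage comment uses the standard nvidia-smi options --query-gpu and --format.

```shell
#!/bin/sh
# Driver version stated in this notice (assumption: it is still current
# when you run this check).
EXPECTED_DRIVER="580.105.08"

# Hypothetical helper: succeed if the given version string equals the
# version stated in this notice.
driver_matches_notice() {
    [ "$1" = "$EXPECTED_DRIVER" ]
}

# Usage on a compute node:
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   if driver_matches_notice "$ver"; then
#       echo "Driver $ver is installed; rebuild the application if it misbehaves."
#   fi
```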

(2026-04-28 added)

We have refunded TSUBAME points for the following jobs, which the system determined were affected by this outage.

7158917 7159092 7159152 7159258 7159282 7159285 7159324 7159339 7159438 7159498 7159956 7160023 7160027 7160036 7160042 7160049 7160079 7160209 7160218 7160266 7160307 7160309 7160310 7160325 7160508 7160530 7160533 7160553 7160571 7160578 7160599 7160607 7160615 7160619 7160632 7160633 7160634 7160635 7160637 7160639 7160641 7160642 7160648 7160660 7160661 7160664 7160666 7160668 7160672 7160673 7160684 7160692 7160700 7160707 7160713 7160715 7160724 7160733 7160736 7160737 7160738 7160739 7160740 7160741 7160745 7160748 7160749 7160750 7160751 7160752 7160754 7160756 7160758 7160761 7160762 7160780 7160786 7160787 7160789 7160793 7160795 7160796 7160799 7160800 7160804 7160810 7160813 7160816 7160819 7160827 7160830 7160831 7160837 7160838 7160839 7160840 7160841 7160842 7160843 7160844 7160845 7160846 7160847 7160855 7160856 7160859 7160862 7160866 7160870 7160879 7160882 7160887 7160894 7160895 7160900 7160901 7160906 7160909 7160910 7160912 7160914 7160915 7160922 7160929 7160941 7160947 7160948 7160950 7160951 7160953 7160954 7160955 7160961 7160964 7160970 7160972 7160973 7160974 7160977 7160979 7160980 7160993 7160997 7160998 7160999 7161001 7161004 7161012 7161014 7161021 7161025 7161027 7161029 7161033 7161039 7161040 7161041 7161042 7161043 7161044 7161045 7161046 7161048 7161051 7161053 7161055 7161057 7161060 7161063 7161071 7161082 7161113 7161121 7161124 7161126 7161139 7161146 7161155 7161168 7161182 7161188 7161191 7161192 7161194 7161197 7161198 7161199 7161200 7161201 7161202 7161203 7161205 7161206 7161207 7161213 7161214 7161215 7161216 7161217 7161219 7161220 7161221 7161222 7161230 7161231 7161233 7161236 7161244 7161273 7161295 7161296 7161298 7161305 7161307 7161313 7161318 7161324 7161326 7161328 7161329 7161336 7161337 7161338 7161339 7161340 7161343 7161345 7161346 7161347 7161353 7161355 7161356 7161357 7161358 7161363 7161364 7161365 7161366 7161371 7161372 7161373 7161374 7161375 7161388 7161394 7161395 7161452 7161454 7161477 7161570 
7161638 7161736 7161737 7161783 7161784 7161785