Issue causing GPUs to become unavailable during specific workloads

2026-04-06

(2026-04-08 added: a note that the emergency maintenance has been completed.)
(2026-04-07 added: information on the emergency maintenance on Wednesday, April 8.)

We have confirmed an issue on TSUBAME in which certain GPU-based workloads cause the affected GPU to enter an error state and become unavailable.
We are currently investigating the root cause of this issue.
At this time, the issue appears highly likely to be caused by the GPU driver rather than by a hardware failure.
Additionally, re-running a job that has encountered this error may result in GPUs becoming unavailable on a large number of nodes.

We sincerely apologize for the inconvenience.
If a GPU-based job stops during execution, please check the output of nvidia-smi. In particular, if it includes messages such as those shown below, we kindly ask that you refrain from submitting similar jobs.
We will update this article once this issue has been resolved.

$ nvidia-smi | grep ERR
|ERR!   34C    P0            N/A  /  N/A  |   11637MiB /  95830MiB |     N/A      Default |
|                                         |                        |                 ERR! |
|ERR!   37C    P0            N/A  /  N/A  |   14457MiB /  95830MiB |     N/A      Default |
|                                         |                        |                 ERR! |
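For batch jobs, this check can be automated. The following is a minimal sketch, assuming a POSIX shell job script; it stops before launching further GPU work if any field in the nvidia-smi output reads ERR!:

# Minimal sketch: abort the job script if any GPU field reads "ERR!".
if nvidia-smi | grep -q 'ERR!'; then
    echo "GPU error state detected on $(hostname); aborting" >&2
    exit 1
fi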

(2026-04-07 added)

To address this issue, emergency maintenance of the compute nodes will be conducted from 11:00 to 17:00 on April 8.
Please note that compute node reservations will not be available during this period.
(Job submission is still possible; nodes will be released sequentially as maintenance is completed.)
In addition, due to this issue, MPS (NVIDIA Multi-Process Service) is currently unavailable.
We will inform you separately once the schedule for lifting this restriction has been determined.
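As a way to check whether MPS is active on a node you have been allocated, one possible check (assuming the standard NVIDIA MPS control daemon name; TSUBAME may manage MPS differently) is:

$ pgrep -f nvidia-cuda-mps-control    # prints a PID only if the MPS control daemon is running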

(2026-04-08 added)

Emergency maintenance on the compute nodes has been completed. We are also lifting the restrictions on MPS functionality.
Please note that the GPU driver version has been changed from 590.48.01 to 580.105.08.
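To confirm the driver version on a node, the standard nvidia-smi query flags can be used:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
580.105.08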

Due to this GPU driver version change, applications built against the CUDA 13.1.1 libraries provided on TSUBAME may behave differently.
If such applications do not function as expected or if any issues occur, please try rebuilding them.
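As a minimal rebuild sketch (the module name and file names below are hypothetical; check module avail on TSUBAME for the exact CUDA module):

$ module load cuda/13.1.1     # hypothetical module name; confirm with `module avail`
$ nvcc -O2 -o myapp myapp.cu  # hypothetical source and binary names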