[Failure Report] 2025-01-13 : Job scheduler failure

2025-01-15

The job scheduler failure described below has been resolved. Details are as follows:

1. Date
 Failure time : 2025-01-13 19:00
 Recovery time : 2025-01-14 18:55

2. Impact and details of the failure
 The job scheduler failed, and jobs could not start.
 Job scheduler commands (qsub, qstat, etc.) were also unavailable during this period.
 Queues dedicated to interactive jobs (iqrsh, etc.) were not affected.

 To recover from this failure, some job submission information was deleted.
 Non-reserved jobs that were already running when the outage occurred were not affected and are still running normally.

3. Affected Jobs
 The jobs listed below were deleted because of the failure.
 If your job was deleted, please submit it again.
 The points consumed by deleted jobs will be refunded.

 [Applicable jobs]
 Job IDs 2600000-2649999 (jobs submitted between approximately 2025-01-13 17:49:39 and 2025-01-14 18:20)
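
 If you keep a list of your own job IDs, a small script can pick out the ones that need resubmitting. The following is only a minimal sketch: the range bounds are taken from this notice, while the function names and the sample IDs are hypothetical and are not part of any TSUBAME tool.

```python
# Inclusive range of deleted job IDs, as stated in this notice.
AFFECTED_FIRST = 2600000
AFFECTED_LAST = 2649999


def is_affected(job_id: int) -> bool:
    """Return True if job_id lies inside the deleted range (hypothetical helper)."""
    return AFFECTED_FIRST <= job_id <= AFFECTED_LAST


def jobs_to_resubmit(job_ids):
    """Return the deleted job IDs from job_ids, sorted ascending (hypothetical helper)."""
    return sorted(j for j in job_ids if is_affected(j))


# Sample IDs for illustration only; replace them with your own job IDs.
my_jobs = [2599999, 2600000, 2612345, 2649999, 2650000]
print(jobs_to_resubmit(my_jobs))
```

 The check treats both ends of the range as affected. It does not apply to the reservation IDs (AR-IDs) listed below.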

 For the following reservations (AR-IDs), whose reserved time overlapped the outage, we will contact the users individually regarding compensation of TSUBAME points.

 2053 2119 2120 2191 2192 2203 2209 2212 2213 2215 2217 2222 2223 2224 2227 2234 2235 2236 2238 2239 2240 2245 2247 2252 2253 2255

4. Prevention of recurrence and future actions
 To reduce the load on the system, we will set an upper limit on the number of jobs each user can submit simultaneously.
 Details will be announced separately.