A little background
Beocat runs GlusterFS for serving home directories, node images, etc. The home directories volume is distributed across two fileservers for better performance and greater storage capacity.
We also run SGE for job scheduling. This allows our users to submit jobs to the cluster, wait for them to finish and gather the results.
Normally, maintenance on the fileservers means reserving the cluster for a period of time about two weeks in the future. This can cause backlogs and keep long jobs from starting, because they won't finish before the maintenance period goes into effect. The other issue with these large maintenance periods is that we have to turn off, or at the very least reboot, all of the nodes while the fileservers are down.
Since maintenance periods for fileservers take so long to schedule, and rebooting the cluster takes so long, we thought we might try to do it "live" if we stopped all of the writing happening on the distributed (not-replicated) volumes. Think SIGSTOP and SIGCONT. The solution we came up with involved the following steps:
- "Pause" all of the running jobs. With SGE you can issue qmod -sq \*. Outside SGE you could issue SIGSTOP to the processes writing to the distributed volume.
- Reboot the fileservers one at a time, so that GlusterFS can preserve locks.
- "Resume" all of the paused jobs. With SGE you can issue qmod -usq \*. Outside SGE you could issue SIGCONT to the processes you SIGSTOP'd.
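For the non-SGE case, the pause/resume idea can be sketched in a few lines of Python. This is only an illustration of SIGSTOP and SIGCONT, not the SGE commands above. The helper names paused and process_state are made up for this example, the "writer" is a stand-in child process that just sleeps, and the state check assumes a Linux /proc filesystem.

```python
import os
import signal
import subprocess
import sys
import time
from contextlib import contextmanager


def process_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # The state field comes right after the parenthesised command name.
        return f.read().rsplit(")", 1)[1].split()[0]


@contextmanager
def paused(pids):
    """Send SIGSTOP to every pid, and SIGCONT on exit, even if the body fails."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        for pid in pids:
            os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Hypothetical stand-in for a job writing to the distributed volume.
    writer = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        with paused([writer.pid]):
            time.sleep(0.2)  # give the kernel a moment to stop the process
            print("during maintenance:", process_state(writer.pid))
            # ... the fileserver reboots would happen here ...
        time.sleep(0.2)
        print("after maintenance:", process_state(writer.pid))
    finally:
        writer.kill()
        writer.wait()
```

The context manager is a design choice for safety: if anything inside the maintenance block raises, the stopped processes still receive SIGCONT instead of staying frozen.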
The plan worked exactly as anticipated. The jobs stopped their writing, we rebooted the fileservers, and the jobs resumed. What would have been a maintenance period that slowed our researchers turned into a short (30-minute) pause.