Other Issues: Fault Tolerance
Consider a job running on 40 processors for a week, then there is a power outage, losing all the work.
Long-running jobs must be able to save a program’s state and then be able to restart from that state.
This is called check-pointing.