Performance Impact of Checkpoint-Restart on Applications
- Experimental Testbed: 128 Intel Westmere servers, each with 8 processing cores and a Mellanox MT26428 InfiniBand QDR adapter, running RHEL 6. The system uses the OpenFabrics OFED 1.5.3 InfiniBand subnet management stack. A Lustre parallel file system with 8 Object Storage Servers, running over native InfiniBand, stores the application checkpoints generated by the checkpointing library BLCR v0.8.5.
- The left side of the bar graph below shows the total runtime of the
High-Performance Linpack (HPL) application, run with 512 MPI ranks, for a
varying number of checkpoint snapshots. For the input size used (N = 64000), the
aggregate size of a single checkpoint is ~40GB. The right side shows a
breakdown of the time taken by the different phases of the Checkpoint/Restart
protocol. As these graphs show, the performance overhead of the checkpointing
protocol is minimal.
- The graph below illustrates the performance impact of the Checkpoint-Restart
mechanism provided with MVAPICH2, using the ENZO cosmology simulation as a
representative application. The Radiation Transport sample workload provided
with the application was executed using 512 MPI ranks. For the input parameters
used, the aggregate size of a single checkpoint is ~13GB. The graph compares the
default Checkpoint-Restart mechanism in MVAPICH2 with the SCR-assisted
multi-level checkpointing mechanism.
- The SCR library implements three redundancy schemes which trade off
performance, storage space, and reliability. The graph below compares the
checkpoint-writing time of these schemes against the default model of writing
to a parallel file system. The aggregate checkpoint size for each of these
runs, which used 512 MPI ranks, was ~50GB.
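
To put the aggregate checkpoint sizes quoted above in perspective, the following is a minimal arithmetic sketch, not part of the original measurements. It assumes decimal units (1 GB = 10^9 bytes), an even split of checkpoint data across the 512 MPI ranks, and that the HPL matrix is stored as 8-byte double-precision values; the names `per_rank_mb` and `hpl_matrix_gb` are hypothetical helpers introduced only for illustration.

```python
# Aggregate checkpoint sizes quoted in the text, in GB (decimal units assumed).
RANKS = 512
CHECKPOINT_GB = {"HPL": 40, "ENZO": 13, "SCR redundancy runs": 50}


def per_rank_mb(aggregate_gb, ranks=RANKS):
    """Average checkpoint data per MPI rank in MB, assuming an even split."""
    return aggregate_gb * 1000 / ranks


def hpl_matrix_gb(n):
    """Size in GB of a dense n x n matrix of 8-byte doubles (assumed layout)."""
    return n * n * 8 / 1e9


per_rank = {name: per_rank_mb(gb) for name, gb in CHECKPOINT_GB.items()}
matrix_gb = hpl_matrix_gb(64000)

for name, mb in per_rank.items():
    print(f"{name}: ~{mb:.1f} MB per rank")
print(f"HPL matrix (N = 64000): ~{matrix_gb:.1f} GB")
```

Under these assumptions, each rank writes on the order of tens of megabytes per checkpoint, and the HPL matrix alone accounts for roughly 33 GB of the ~40GB aggregate, with the remainder plausibly attributable to the rest of each process image that a process-level checkpoint captures.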

