Shared memory Allreduce on the TACC Ranger (10/29/09)
-
Experimental Testbed:
Each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in
all) on a single board, as an SMP unit. The core frequency is 2.3 GHz and each
node has 32 GB of memory. The memory subsystem has a 1.0 GHz HyperTransport
system bus and 2 channels with 667 MHz DDR2 DIMMs. Each socket possesses an
independent memory controller connected directly to an L3 cache. Ranger uses a
7-stage, full fat-tree topology with InfiniBand SDR interconnects and two large
Sun InfiniBand Datacenter switches.
The following plot compares the latency of the MPI_Bcast operation for
small messages between versions 1.2p1 and 1.4 on 1K processes. Version 1.2p1
used the conventional binomial-tree-based algorithm. Version 1.4 uses the
2-level k-nomial algorithm.
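
To illustrate the difference between the two tree shapes, the following is a
minimal sketch of how parent and child ranks can be derived in a k-nomial
broadcast tree rooted at rank 0. It is not the library's implementation: the
function names (knomial_parent, knomial_children, tree_depth) and the radix
parameter k are assumptions introduced here for illustration. The text above
does not spell out how the two levels are organized, so the sketch covers a
single k-nomial tree only. With k = 2 the tree reduces to the conventional
binomial tree.

```python
def knomial_parent(rank, k):
    """Parent of `rank` in a k-nomial tree rooted at 0 (None for the root)."""
    if rank == 0:
        return None
    weight = 1
    # Find the lowest non-zero base-k digit of the rank.
    while (rank // weight) % k == 0:
        weight *= k
    digit = (rank // weight) % k
    # Clearing that digit yields the parent.
    return rank - digit * weight


def knomial_children(rank, size, k):
    """Children of `rank` among `size` processes in a k-nomial tree."""
    children = []
    weight = 1
    while weight < size:
        # A rank only owns children at digit positions below its own
        # lowest non-zero digit (the root owns every position).
        if (rank // weight) % k != 0:
            break
        for digit in range(1, k):
            child = rank + digit * weight
            if child < size:
                children.append(child)
        weight *= k
    return children


def tree_depth(size, k):
    """Largest number of hops from any rank up to the root."""
    depth = 0
    for rank in range(size):
        hops = 0
        node = rank
        while node != 0:
            node = knomial_parent(node, k)
            hops += 1
        depth = max(depth, hops)
    return depth
```

For 1024 processes, tree_depth(1024, 2) gives 10 hops for the binomial tree,
while tree_depth(1024, 4) gives 5 hops for a radix-4 k-nomial tree. A larger
radix shortens the tree at the cost of each parent sending to more children,
which is one plausible reason a k-nomial tree can help small-message latency.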

