

## High Performance and Scalable MPI+X Library for Emerging HPC Clusters

#### Talk at Intel HPC Developer Conference (SC '16)

#### by

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: panda@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

#### **Khaled Hamidouche**

The Ohio State University

E-mail: hamidouc@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~hamidouc

#### High-End Computing (HEC): ExaFlop & ExaByte



#### ExaFlop & HPC

#### **ExaByte & BigData**

## Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)



Timeline

### **Drivers of Modern HPC Cluster Architectures**





**High Performance Interconnects:** InfiniBand (<1 usec latency, 100 Gbps bandwidth)

**Multi-core Processors**

**Accelerators / Coprocessors:** high compute density, high performance/watt, >1 TFlop DP on a chip

**SSD, NVMe-SSD, NVRAM**

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)



# **Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges**



## **Exascale Programming models**

- The community believes exascale programming model will be MPI+X
- But what is X?
  - Can it be just OpenMP?
- Many different environments and systems are emerging
  - Different 'X' will satisfy the respective needs



#### **MPI+X Programming model: Broad Challenges at Exascale**

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  - Scalable job start-up
- Scalable Collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  - Multiple end-points per node
- Support for efficient multi-threading (OpenMP)
- Integrated Support for GPGPUs and Accelerators (CUDA)
- Fault-tolerance/resiliency
- QoS support for communication and I/O
- Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)
- Virtualization
- Energy-Awareness

### **Additional Challenges for Designing Exascale Software Libraries**

- Extreme Low Memory Footprint
  - Memory per core continues to decrease
- D-L-A Framework
  - Discover
    - Overall network topology (fat-tree, 3D, ...), Network topology for processes for a given job
    - Node architecture, Health of network and node
  - Learn
    - Impact on performance and scalability
    - Potential for failure
  - Adapt
    - Internal protocols and algorithms
    - Process mapping
    - Fault-tolerance solutions
- Low overhead techniques while delivering performance, scalability and fault-tolerance

### **Overview of the MVAPICH2 Project**

- High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
  - MVAPICH2-X (MPI + PGAS), Available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
  - Support for Virtualization (MVAPICH2-Virt), Available since 2015
  - Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
  - Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  - Used by more than 2,690 organizations in 83 countries
  - More than 402,000 (> 0.4 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '16 ranking)
    - 1<sup>st</sup> ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    - 13<sup>th</sup> ranked 241,108-core cluster (Pleiades) at NASA
    - 17<sup>th</sup> ranked 519,640-core cluster (Stampede) at TACC
    - 40<sup>th</sup> ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
  - Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
  - <u>http://mvapich.cse.ohio-state.edu</u>
- Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3<sup>rd</sup> in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight at NSC, Wuxi, China (1<sup>st</sup> in Nov '16, 10,649,640 cores, 93 PFlops)

#### Intel HPC Dev Conf (SC '16)



#### Outline

- Hybrid MPI+OpenMP Models for Highly-threaded Systems
- Hybrid MPI+PGAS Models for Irregular Applications
- Hybrid MPI+GPGPUs and OpenSHMEM for Heterogeneous Computing with Accelerators

## **Highly Threaded Systems**

- Systems like KNL
- MPI+OpenMP is seen as the best fit
  - 1 MPI process per socket for Multi-core
  - 4-8 MPI processes per KNL
  - Each MPI process will launch OpenMP threads
- However, current MPI runtimes are not "efficiently" handling the hybrid model
  - Most applications use Funneled mode: only the MPI processes perform communication
  - Communication phases are the bottleneck
- Multi-endpoint based designs
  - Transparently use threads inside MPI runtime
  - Increase the concurrency

### **MPI and OpenMP**

- MPI-4 will enhance the thread support
  - Endpoint proposal in the Forum
  - Application threads will be able to efficiently perform communication
  - Endpoint is the communication entity that maps to a thread
    - Idea is to have multiple addressable communication entities within a single process
    - No context switch between application and runtime => better performance
- OpenMP 4.5 is more powerful than just traditional data parallelism
  - Supports task parallelism since OpenMP 3.0
  - Supports heterogeneous computing with accelerator targets since OpenMP 4.0
  - Supports explicit SIMD and thread affinity pragmas since OpenMP 4.0

## **MEP-based design: MVAPICH2 Approach**

- Lock-free Communication
  - Threads have their own resources
- Dynamically adapt the number of threads
  - Avoid resource contention
  - Depends on application pattern and system performance
- Both intra- and inter-node communication
  - Threads boost both channels
- New MEP-Aware collectives
- Applicable to the endpoint proposal in MPI-4
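The lock-free idea above (each thread owns its communication resources) can be illustrated with a minimal sketch. This is not MVAPICH2 code: `endpoint_t`, `endpoint_post`, `post_from_all_threads`, `NUM_ENDPOINTS`, and `QUEUE_DEPTH` are hypothetical names chosen for illustration, and a plain array stands in for real network queues.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ENDPOINTS 4   /* hypothetical: one endpoint per communicating thread */
#define QUEUE_DEPTH   8   /* hypothetical per-endpoint send-queue depth */

/* Hypothetical endpoint: every field is private to the owning thread,
 * so posting a message never touches state shared with other threads. */
typedef struct {
    int    queue[QUEUE_DEPTH];  /* pending message ids */
    size_t posted;              /* number of valid entries in queue */
} endpoint_t;

/* Post one message id on an endpoint; returns 0 on success, -1 if full. */
int endpoint_post(endpoint_t *ep, int msg_id)
{
    assert(ep != NULL);
    if (ep->posted == QUEUE_DEPTH)
        return -1;
    ep->queue[ep->posted++] = msg_id;
    return 0;
}

/* Each outer iteration stands in for one thread driving its own endpoint.
 * No lock is needed because iteration t only ever writes eps[t].
 * Returns the number of messages that were accepted. */
size_t post_from_all_threads(endpoint_t eps[NUM_ENDPOINTS], int msgs_per_thread)
{
    size_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int t = 0; t < NUM_ENDPOINTS; t++) {
        for (int m = 0; m < msgs_per_thread; m++) {
            if (endpoint_post(&eps[t], t * 100 + m) == 0)
                total++;
        }
    }
    return total;
}
```

Compiled with OpenMP enabled (for example `-fopenmp` with GCC), the outer loop iterations can run on separate threads; compiled without it, the pragma is ignored and the loop runs sequentially with the same result.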



M. Luo, X. Lu, K. Hamidouche, K. Kandalla and D. K. Panda, Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Applications on Multi-Core Systems. International Symposium on Principles and Practice of Parallel Programming (PPoPP '14).

## Performance Benefits: OSU Micro-Benchmarks (OMB) level



- Reduces the latency from 40 us to 1.85 us (21X)
- Achieves the same performance as processes
- 40% improvement on latency for Bcast on 4,096 cores
- 30% improvement on latency for Alltoall on 4,096 cores


## **Performance Benefits: Application Kernel level**



- 6.3% improvement for MG, 11.7% improvement for CG, and 12.6% improvement for LU on 4,096 cores.
- With P3DFFT, we are able to observe a 30% improvement in communication time and 13.5% improvement in the total execution time.

## **Enhanced Designs for KNL: MVAPICH2 Approach**

- On-load approach
  - Takes advantage of the idle cores
  - Dynamically configurable
  - Takes advantage of highly multithreaded cores
  - Takes advantage of MCDRAM of KNL processors
- Applicable to other programming models such as PGAS, Task-based, etc.
- Provides portability, performance, and applicability to runtime as well as applications in a transparent manner

## **Performance Benefits of the Enhanced Designs**



- New designs to exploit high concurrency and MCDRAM of KNL
- Significant improvements for large message sizes
- Benefits seen in varying message size as well as varying MPI processes

## **Performance Benefits of the Enhanced Designs**



Figure panels: CNTK MLP Training Time using MNIST (BS: 64); Multi-Bandwidth using 32 MPI processes

- Benefits observed on training time of a Multi-Layer Perceptron (MLP) model on the MNIST dataset using the CNTK Deep Learning Framework

Enhanced designs will be available in upcoming MVAPICH2 releases

Network Based Computing Laboratory


#### Outline

- Hybrid MPI+OpenMP Models for Highly-threaded Systems
- Hybrid MPI+PGAS Models for Irregular Applications
- Hybrid MPI+GPGPUs and OpenSHMEM for Heterogeneous Computing with Accelerators

## **Maturity of Runtimes and Application Requirements**

- MPI has been the most popular model for a long time
  - Available on every major machine
  - Portability, performance and scaling
  - Most parallel HPC code is designed using MPI
  - Simplicity: structured and iterative communication patterns
- PGAS Models
  - Increasing interest in community
  - Simple shared memory abstractions and one-sided communication
  - Easier to express irregular communication
- Need for hybrid MPI + PGAS
  - Application can have kernels with different communication characteristics
  - Porting only part of the applications to reduce programming effort

### Hybrid (MPI+PGAS) Programming

- Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
- Benefits:
  - Best of Distributed Computing Model
  - Best of Shared Memory Computing Model



## **Current Approaches for Hybrid Programming**

- Layering one programming model over another
  - Poor performance due to semantics mismatch
  - MPI-3 RMA tries to address this
- Separate runtime for each programming model
  - Need more network and memory resources
  - Might lead to deadlock!

## **The Need for a Unified Runtime**



- Deadlock when a message is sitting in one runtime, but application calls the other runtime
- Prescription to avoid this is to barrier in one mode (either OpenSHMEM or MPI) before entering the other
- Or runtimes require dedicated progress threads
- Bad performance!!
- Similar issues for MPI + UPC applications over individual runtimes

### **MVAPICH2-X for Hybrid MPI + PGAS Applications**



- Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++ available with MVAPICH2-X 1.9 onwards! (since 2012)
  - <u>http://mvapich.cse.ohio-state.edu</u>
- Feature Highlights
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  - Scalable inter-node and intra-node communication: point-to-point and collectives

### **OpenSHMEM Atomic Operations: Performance**



- OSU OpenSHMEM micro-benchmarks (OMB v4.1)
- MV2-X SHMEM performs up to 40% better compared to UH-SHMEM

### **UPC Collectives Performance**



J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and D. K. Panda, Optimizing Collective Communication in UPC (HiPS'14, in association with IPDPS'14)


#### **Performance Evaluations for CAF model**



- Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test-suite)
  - Put bandwidth: 3.5X improvement on 4KB; Put latency: reduce 29% on 4B
- Application performance improvement (NAS-CAF one-sided implementation)
  - Reduce the execution time by 12% (SP.D.256), 18% (BT.D.256)

J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Fortran support with MVAPICH2-X: Initial experience and evaluation, HiPS'15


### **UPC++ Collectives Performance**



Inter-node Broadcast (64 nodes 1:ppn)

- Full and native support for hybrid MPI + UPC++ applications
- Better performance compared to IBV and MPI conduits
- OSU Micro-benchmarks (OMB) support for UPC++
- Available since MVAPICH2-X 2.2RC1

J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)


## **Application Level Performance with Graph500 and Sort**





- Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
  - 8,192 processes
    - 2.4X improvement over MPI-CSR
    - 7.6X improvement over MPI-Simple
  - 16,384 processes
    - 1.5X improvement over MPI-CSR
    - 13X improvement over MPI-Simple

- Performance of Hybrid (MPI+OpenSHMEM) Sort Application
  - 4,096 processes, 4 TB Input Size
    - MPI 2408 sec; 0.16 TB/min
    - Hybrid 1172 sec; 0.36 TB/min
    - 51% improvement over MPI-design

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS'14

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013


## MiniMD – Total Execution Time



- Hybrid design performs better than MPI implementation
- 1,024 processes
  - 17% improvement over MPI version
- Strong Scaling
  - Input size: 128 * 128 * 128

## Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM

- MaTEx: MPI-based Machine learning algorithm library
- **k-NN:** a popular supervised algorithm for classification
- Hybrid designs:
  - Overlapped Data Flow; One-sided Data Transfer; Circular-buffer Structure



- Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
- For truncated KDD workload on 256 cores, execution time reduced by 27.6%
- For full KDD workload on 512 cores, execution time reduced by 9.0%

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda. Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015


#### Outline

- Hybrid MPI+OpenMP Models for Highly-threaded Systems
- Hybrid MPI+PGAS Models for Irregular Applications
- Hybrid MPI+GPGPUs and OpenSHMEM for Heterogeneous Computing with Accelerators

## **GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU**

- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from GPU with RDMA transfers



## CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

- Support for MPI communication from NVIDIA GPU device memory
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
- Unified memory

### Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)




### **Application-Level Evaluation (HOOMD-blue)**

Figure panels: 64K Particles and 256K Particles



- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HoomdBlue Version 1.0.5
  - GDRCOPY enabled: MV2\_USE\_CUDA=1 MV2\_IBA\_HCA=mlx5\_0 MV2\_IBA\_EAGER\_THRESHOLD=32768 MV2\_VBUF\_TOTAL\_SIZE=32768 MV2\_USE\_GPUDIRECT\_LOOPBACK\_LIMIT=32768 MV2\_USE\_GPUDIRECT\_GDRCOPY=1 MV2\_USE\_GPUDIRECT\_GDRCOPY\_LIMIT=16384


## Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland





- 2X improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: <u>http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/</u>

### On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16


# Need for Non-Uniform Memory Allocation in OpenSHMEM for Heterogeneous Architectures

- MIC cores have limited memory per core
- OpenSHMEM relies on symmetric memory, allocated using shmalloc()



- shmalloc() allocates same amount of memory on all PEs
- For applications running in symmetric mode, this limits the total heap size
- Similar issues for applications (even host-only) with memory load imbalance (Graph500, Out-of-Core Sort, etc.)
- How to allocate different amounts of memory on host and MIC cores, and still be able to communicate?
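The heap-size limitation described above can be made concrete with a small sketch. It assumes that a symmetric allocation must be the same size on every PE, so the PE with the least free memory bounds everyone. The function names and the per-PE memory figures are hypothetical and are not part of OpenSHMEM or MVAPICH2-X.

```c
#include <assert.h>
#include <stddef.h>

/* Largest size every PE can allocate at once: the minimum free memory
 * across PEs, since a symmetric allocation is the same size everywhere. */
size_t symmetric_alloc_limit(const size_t *free_mb, size_t npes)
{
    assert(free_mb != NULL && npes > 0);
    size_t limit = free_mb[0];
    for (size_t i = 1; i < npes; i++) {
        if (free_mb[i] < limit)
            limit = free_mb[i];
    }
    return limit;
}

/* Aggregate heap when every PE is held to the symmetric limit. */
size_t symmetric_total(const size_t *free_mb, size_t npes)
{
    return symmetric_alloc_limit(free_mb, npes) * npes;
}

/* Aggregate heap if each PE could contribute all of its own free
 * memory instead (the non-uniform case). */
size_t nonuniform_total(const size_t *free_mb, size_t npes)
{
    assert(free_mb != NULL);
    size_t total = 0;
    for (size_t i = 0; i < npes; i++)
        total += free_mb[i];
    return total;
}
```

With made-up figures of two host PEs at 2048 MB each and two MIC PEs at 256 MB each, the symmetric limit is 256 MB per PE (1024 MB in total), while non-uniform allocation could use up to 4608 MB.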

# **OpenSHMEM Design for MIC Clusters**

- Non-Uniform Memory Allocation:
  - Team-based Memory Allocation (Proposed Extensions)

```
/* Team management (proposed extensions) */
void shmem_team_create(shmem_team_t team, int *ranks, int size, shmem_team_t *newteam);
void shmem_team_destroy(shmem_team_t *team);
void shmem_team_split(shmem_team_t team, int color, int key, shmem_team_t *newteam);

/* Team queries */
int shmem_team_rank(shmem_team_t team);
int shmem_team_size(shmem_team_t team);

/* Team-based memory allocation */
void *shmalloc_team(shmem_team_t team, size_t size);
void shfree_team(shmem_team_t team, void *addr);
```

- Address Structure for non-uniform memory allocations




# **Proxy-based Designs for OpenSHMEM**



- Current generation architectures impose limitations on read bandwidth when HCA reads from MIC memory
  - Impacts both put and get operation performance
- Solution: Pipelined data transfer by proxy running on host using IB and SCIF channels
- Improves latency and bandwidth!
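The pipelined transfer through a host proxy can be sketched as a chunked copy through two alternating staging buffers. This is a simplified model, not the MVAPICH2-X implementation: `proxy_pipelined_copy`, `serial_time`, `pipelined_time`, and `CHUNK_BYTES` are hypothetical names, `memcpy` stands in for the SCIF and IB transfers, and the two stages run back to back here rather than overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES 4  /* hypothetical chunk size, kept tiny for illustration */

/* Copy len bytes from src to dst one chunk at a time through two
 * alternating staging buffers. Only the data path and the chunking are
 * modelled; a real proxy would overlap the two stages of neighbouring
 * chunks. Returns the number of chunks moved. */
size_t proxy_pipelined_copy(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    unsigned char staging[2][CHUNK_BYTES];
    size_t chunks = 0;

    for (size_t off = 0; off < len; off += CHUNK_BYTES, chunks++) {
        size_t n = (len - off < CHUNK_BYTES) ? (len - off) : CHUNK_BYTES;
        unsigned char *buf = staging[chunks % 2];  /* alternate buffers */
        memcpy(buf, src + off, n);                 /* stage in: source -> proxy */
        memcpy(dst + off, buf, n);                 /* stage out: proxy -> destination */
    }
    return chunks;
}

/* Cost without overlap: every chunk pays both stages in sequence. */
unsigned long serial_time(unsigned long chunks, unsigned long t_in, unsigned long t_out)
{
    return chunks * (t_in + t_out);
}

/* Cost with full overlap of the two stages: after the first chunk fills
 * the pipeline, each further chunk costs only the slower stage. */
unsigned long pipelined_time(unsigned long chunks, unsigned long t_in, unsigned long t_out)
{
    unsigned long slow = (t_in > t_out) ? t_in : t_out;
    if (chunks == 0)
        return 0;
    return t_in + t_out + (chunks - 1) * slow;
}
```

The cost model shows why pipelining helps large messages: once the pipeline is full, the slower of the two channels sets the per-chunk cost instead of the sum of both.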

# **OpenSHMEM Put/Get Performance**



- Proxy-based designs alleviate hardware limitations
- Put Latency of 4M message: Default: 3911us, Optimized: 838us
- Get Latency of 4M message: Default: 3889us, Optimized: 837us

# **Performance Evaluations using Graph500**



- Graph500 Execution Time (Native Mode):
  - 8 processes per MIC node
  - At 512 processes, Default: 5.17s, Optimized: 4.96s
  - Performance Improvement from MIC-aware collectives design
- Graph500 Execution Time (Symmetric Mode):
  - 16 processes on each Host and MIC node
  - At 1,024 processes, Default: 15.91s, Optimized: 12.41s
  - Performance Improvement from MIC-aware collectives and proxy-based designs

# **Graph500 Evaluations with Extensions**



- Redesigned Graph500 using MIC to overlap computation/communication
  - Data Transfer to MIC memory; MIC cores pre-process received data
  - Host processes traverse vertices and send out new vertices
- Graph500 Execution time at 1,024 processes:
  - 16 processes on each Host and MIC node
  - Host-Only: 0.33s, Host+MIC with Extensions: 0.26s
- Magnitudes of improvement compared to default symmetric mode
  - Default Symmetric Mode: 12.1s, Host+MIC Extensions: 0.16s

J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko and D. K. Panda, High Performance OpenSHMEM for Intel MIC Clusters: Extensions, Runtime Designs and Application Co-Design, IEEE International Conference on Cluster Computing (CLUSTER '14) (Best Paper Finalist)

# Looking into the Future ....

- Architectures for Exascale systems are evolving
- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data movement cost
  - Faults
- Programming Models, Runtimes and Middleware need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- High Performance and Scalable MPI+X libraries are needed
- Highlighted some of the approaches taken by the MVAPICH2 project
- Need continuous innovation to have the right MPI+X libraries for Exascale systems

# **Funding Acknowledgments**

## **Funding Support by**















## **Equipment Support by**

- Advanced Clustering Technologies, Inc.
- NVIDIA


# **Personnel Acknowledgments**

### **Current Students**

- A. Awan (Ph.D.)
- M. Bayatpour (Ph.D.)
- S. Chakraborthy (Ph.D.)
- C.-H. Chu (Ph.D.)
- S. Guganani (Ph.D.)
- J. Hashmi (Ph.D.)
- N. Islam (Ph.D.)
- M. Li (Ph.D.)
- M. Rahman (Ph.D.)
- D. Shankar (Ph.D.)
- A. Venkatesh (Ph.D.)
- J. Zhang (Ph.D.)

### **Current Research Scientists**

- K. Hamidouche
- X. Lu
- H. Subramoni

### **Current Research Specialist**

- J. Smith

### **Past Students**

- A. Augustine (M.S.)
- P. Balaji (Ph.D.)
- S. Bhagvat (M.S.)
- A. Bhat (M.S.)
- D. Buntinas (Ph.D.)
- L. Chai (Ph.D.)
- B. Chandrasekharan (M.S.)
- N. Dandapanthula (M.S.)
- V. Dhanraj (M.S.)
- T. Gangadharappa (M.S.)
- K. Gopalakrishnan (M.S.)
- W. Huang (Ph.D.)
- W. Jiang (M.S.)
- J. Jose (Ph.D.)
- K. Kandalla (Ph.D.)
- S. Kini (M.S.)
- M. Koop (Ph.D.)
- S. Krishnamoorthy (M.S.)
- K. Kulkarni (M.S.)
- R. Kumar (M.S.)
- P. Lai (M.S.)
- J. Liu (Ph.D.)
- M. Luo (Ph.D.)
- A. Mamidala (Ph.D.)
- G. Marsh (M.S.)
- V. Meshram (M.S.)
- A. Moody (M.S.)
- S. Naravula (Ph.D.)
- R. Noronha (Ph.D.)
- X. Ouyang (Ph.D.)
- S. Pai (M.S.)
- S. Potluri (Ph.D.)
- R. Rajachandrasekar (Ph.D.)
- G. Santhanaraman (Ph.D.)
- A. Singh (Ph.D.)
- J. Sridhar (M.S.)
- H. Subramoni (Ph.D.)
- S. Sur (Ph.D.)
- K. Vaidyanathan (Ph.D.)
- A. Vishnu (Ph.D.)
- J. Wu (Ph.D.)
- W. Yu (Ph.D.)

### **Past Research Scientist**

- S. Sur

### **Past Programmers**

- D. Bureddy
- M. Arnold
- J. Perkins

### **Past Post-Docs**

- D. Banerjee
- X. Besseron
- H.-W. Jin
- J. Lin
- M. Luo
- E. Mancini
- S. Marcarelli
- I. Vienne
- H. Wang



# **OSU Team Will be Participating in Multiple Events at SC '16**

- Three Conference Tutorials (IB+HSE, IB+HSE Advanced, Big Data)
- HP-CAST
- Technical Papers (SC main conference; Doctoral Showcase; Poster; PDSW-DISC, PAW, COMHPC, and ESPM2 Workshops)
- Booth Presentations (Mellanox, NVIDIA, NRL, PGAS)
- HPC Connection Workshop
- Will be stationed at Ohio Supercomputer Center/OH-TECH Booth (#1107)
  - Multiple presentations and demos
- More details at <u>http://mvapich.cse.ohio-state.edu/talks/</u>

# **Thank You!**



Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/



The MVAPICH Project <u>http://mvapich.cse.ohio-state.edu/</u>

panda@cse.ohio-state.edu, hamidouc@cse.ohio-state.edu
