MVAPICH :: Publications

Journals (34)
1	Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Understanding and Characterizing Communication Characteristics for Distributed Transformer Models, IEEE Micro, Jan 2025.
2	T. Tran, G. Kuncham, B. Ramesh, S. Xu, H. Subramoni, and DK Panda, OHIO: Enhancing RDMA Scalability in Alltoall with Optimized Communication Overlap, IEEE Micro, Jan 2025.
3	T. Tran, B. Ramesh, B. Michalowicz, M. Abduljabbar, H. Subramoni, A. Shafi, and DK Panda, Accelerating Communication with Multi-HCA Aware Collectives in MPI, Concurrency and Computation: Practice and Experience (CCPE), July 2023.
4	K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, IEEE Micro, Jan 2023.
5	K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, High Performance MPI over the Slingshot Interconnect, Special Issue of Journal of Computer Science and Technology (JCST), Feb 2023.
6	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
7	J. Hashmi, C. Chu, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, FALCON-X: Zero-copy MPI Derived Datatype Processing on Modern CPU and GPU Architectures, Journal of Parallel and Distributed Computing (JPDC), Volume 144, October 2020, Pages 1-13, doi.org/10.1016/j.jpdc.2020.05.008.
8	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.
9	A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, Efficient Design for MPI Asynchronous Progress without Dedicated Resources, Parallel Computing - Systems & Applications, Volume 85, July 2019, Pages 13-26, https://doi.org/10.1016/j.parco.2019.03.003.
10	Ammar Awan, K. Vadambacheri Manian, C. Chu, H. Subramoni, and DK Panda, Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2?, Volume 85, July 2019, Pages 141-152, https://doi.org/10.1016/j.parco.2019.03.005.
11	S. Chakraborty, Ignacio Laguna, Murali Emani, Kathryn Mohror, DK Panda, Martin Schulz, and H. Subramoni, EReinit: Scalable and Efficient Fault Tolerance for Bulk-Synchronous MPI Applications, Concurrency and Computation: Practice and Experience, 14 August 2018, https://doi.org/10.1002/cpe.4863.
12	S. Ramesh, A. Mahéo, S. Shende, A. Malony, H. Subramoni, A. Ruhela, and DK Panda, MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU, ISSN 0167-8191, Volume 77, Sep 2018.
13	H. Wang, S. Potluri, D. Bureddy, and DK Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation, IEEE Transactions on Parallel & Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
14	S. Sur, S. Potluri, K. Kandalla, H. Subramoni, K. Tomko, and DK Panda, Co-Designing MPI Library and Applications for InfiniBand Clusters, IEEE Computer, Nov 2011.
15	P. Lai, P. Balaji, R. Thakur, and DK Panda, ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures, Computer Science: Research and Development, Special Issue of Scientific Papers from ISC '09, Jun 2009.
16	A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, Topology Agnostic Hot-Spot Avoidance with InfiniBand, Concurrency and Computation: Practice and Experience, Special Issue of Best Papers from CCGrid '07, Jan 2008.
17	H. Jin, P. Balaji, C. Yoo, J. -Y. Choi, and DK Panda, Exploiting NIC Architectural Support for Enhancing IP based Protocols on High Performance Networks, OSU-CISRC-5/04-TR37, Nov 2005.
18	J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Performance Evaluation of InfiniBand with PCI Express, IEEE Micro, Jan 2005.
19	J. Liu, J. Wu, and DK Panda, High Performance RDMA-Based MPI Implementation over InfiniBand, Int'l Journal of Parallel Programming: Volume 32, Number 3, Jun 2004.
20	J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. Kini, P. Wyckoff, and DK Panda, Micro-Benchmark Performance Comparison of High-Speed Cluster Interconnects, IEEE Micro, Jan 2004.
21	A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Application-Bypass Reduction for Large-Scale Clusters, Int'l Journal of High Performance Computing and Networking, Cluster 2003 Special Issue, In Press, Dec 2003.
22	R. Sivaram, C. Stunkel, and DK Panda, HIPIQS: A High-Performance Switch Architecture using Input Queuing, IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 275-289, Mar 2002.
23	M. Banikazemi, B. Abali, L. Herger, and DK Panda, Design Alternatives for Virtual Interface Architecture (VIA) and an Implementation on IBM Netfinity NT Cluster, Journal of Parallel and Distributed Computing, Special Issue on Clusters, Volume 61, Number 11, pp. 1512-1545, Nov 2001.
24	M. Banikazemi, R. K. Govindaraju, R. Blackmore, and DK Panda, MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, pp. 1081-1093, Oct 2001.
25	B. Abali, C. B. Stunkel, J. Herring, M. Banikazemi, DK Panda, C. Aykanat, and Y. Aydogan, Adaptive Routing on the New Switch Chip for IBM SP Systems, Journal of Parallel and Distributed Computing, Special Issue on Routing in Computer and Communication Networks, Volume 61, Number 9, pp. 1148-1179, Sep 2001.
26	R. Kesavan, and DK Panda, Efficient Multicast on Irregular Switch-based Cut-Through Networks with Up-Down Routing, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 8, pp. 808-828, Aug 2001.
27	R. Sivaram, R. Kesavan, DK Panda, and C. Stunkel, Architectural Support for Efficient Multicasting in Irregular Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 5, pp. 489-513, May 2001.
28	R. Sivaram, C. Stunkel, and DK Panda, Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 8, pp. 794-812, Aug 2000.
29	R. Kesavan, and DK Panda, Multiple Multicast with Minimized Node Contention on Wormhole k-ary n-cube Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 4, pp. 371-393, Apr 1999.
30	D. Dai, and DK Panda, Exploiting the Benefits of Multiple-Path Network in DSM Systems: Architectural Alternatives and Performance Evaluation, IEEE Transactions on Computers, Special Issue on Cache Memory, Vol. 48, No. 2, pp. 236-244, Feb 1999.
31	R. Prakash, and DK Panda, Designing Communication Strategies for Heterogeneous Parallel Systems, Parallel Computing, Volume 24, pp. 2035-2052, Dec 1998.
32	R. Sivaram, DK Panda, and C. B. Stunkel, Efficient Broadcast and Multicast on Multistage Interconnection Networks using Multiport Encoding, IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 10, pp. 1004-1028, Oct 1998.
33	D. Basak, and DK Panda, Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements, IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 9, pp. 962-978, Sep 1996.
34	Srinivasan Ramesh, Aurele Maheo, Sameer Shende, Allen Malony, H. Subramoni, and DK Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, Sep 2018.

Book Chapter (2)
1	X. Lu, J. Zhang, and DK Panda, Building Efficient HPC Cloud with SR-IOV Enabled InfiniBand: The MVAPICH2 Approach, Book "Research Advances in Cloud Computing", edited by Sanjay Chaudhary, Gaurav Somani, and Rajkumar Buyya, Springer International Publishing, Aug 2017.
2	X. Lu, and DK Panda, Contribution on Multiple Chapters related to OpenStack, Virtualized HPC, HPC Network Fabric, and HPC Workload Management, Book "The Crossroads of Cloud and HPC: OpenStack for Scientific Research; Exploring OpenStack Cloud Computing for Scientific Workloads", edited by Stig Telfer, OpenStack Foundation Publishing (Invited Book Chapter), Nov 2016.

Conferences & Workshops (434)
1	Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication, C. Chen, J. Yao, H. Subramoni, and DK Panda, 54th International Conference on Parallel Processing, Sep 2025
2	Characterizing Communication Patterns in Distributed Large Language Model Inference, L. Xu, K. Suresh, Q. Anthony, N. Alnaasan, and DK Panda, IEEE Hot Interconnects Symposium 2025, Aug 2025
3	OMB-Compr: An Extension to OSU Micro Benchmarks for Collective Compression Error Measurement, J. Queiser, N. Contini, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing 2025, Jul 2025
4	Use of BlueField-SmartNICs in Offloading One-Sided Communication Primitives, B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster]
5	Design and Implementation of MPI Collective Operations for Large Message Communication on AMD GPUs, C. Chen, L. Xu, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster]
6	Design and Implementation of a GPU-Aware MPI Collective Library for Intel GPUs, C. Chen, G. Kuncham, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster]
7	Unified Designs of Multi-rail-aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems, C. Chen, J. Yao, L. Xu, H. Subramoni, and DK Panda, 39th IEEE International Parallel & Distributed Processing Symposium, Jun 2025
8	Training ultra long context language model with fully pipelined distributed transformer, J. Yao, S. Jacobs, M. Tanaka, O. Ruwase, H. Subramoni, and DK Panda, The Eighth Annual Conference on Machine Learning and Systems, May 2025
9	Effective and Efficient Offloading Designs for One-Sided Communication to SmartNICs, B. Michalowicz, K. Suresh, H. Subramoni, M. Abduljabbar, DK Panda, and S. Poole, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
10	Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods, K. Suresh, B. Michalowicz, N. Contini, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
11	Design and Implementation of Kernel-based MPI Reduction Operations for Intel GPUs, C. Chen, G. Kuncham, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
12	Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning, L. Xu, Q. Anthony, J. Hatef, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
13	HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems, N. Alnaasan, B. Ramesh, J. Yao, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024
14	HARVEST-2.0: High-Performance Vision Framework for End-to-end Preprocessing, Training, Inference, and Visualization, N. Alnaasan, A. Potlapally, T. Chen, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24), Nov 2024 [Research Poster]
15	Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models, N. Alnaasan, H. Huang, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024
16	OHIO: Improving RDMA Network Scalability in MPI_Alltoall through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design, T. Tran, G. Kuncham, B. Ramesh, S. Xu, H. Subramoni, M. Abduljabbar, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024
17	The Case for Co-Designing Model Architectures with Hardware, Q. Anthony, J. Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, A. Shafi, H. Subramoni, and DK Panda, 53rd International Conference on Parallel Processing, Aug 2024
18	Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs, C. Chen, G. Kuncham, P. Kousha, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
19	A Novel LLM-enabled Framework for Accelerating the Creation of Knowledge Graphs for HPC, P. Kousha, V. Sathu, H. M. Han, J. Jani, N. Alnaasan, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
20	OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL, N. Contini, M. Abduljabbar, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024
21	Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024
22	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters, Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024
23	Accelerating Large Language Model Training with Hybrid GPU-based Compression, L. Xu, Q. Anthony, Q. Zhou, N. Alnaasan, R. Gulhane, A. Shafi, H. Subramoni, and DK Panda, IEEE/ACM International Symposium on Cluster, Cloud, and Internet Computing 2024, May 2024
24	Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference, J. Yao, N. Alnaasan, T. Chen, A. Shafi, H. Subramoni, and DK Panda, 30th IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2023
25	MPI Allgather Utilizing CXL Shared Memory Pool in Multi-Node Computing Systems, H. Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, H. Ahn, T. Tran, B. Ramesh, H. Subramoni, and DK Panda, IEEE International Conference on Big Data, Dec 2023 [Dec 15-18, 2024 @ Washington DC, USA]
26	HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training, N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, 2023 IEEE International Conference on Big Data, Dec 2023
27	Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data, P. Kousha, Q. Zhou, H. Subramoni, and DK Panda, The 15th BenchCouncil International Symposium On Benchmarking, Measuring And Optimizing, Dec 2023
28	MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators, C. Chen, K. Khorassani, P. Kousha, Q. Zhou, J. Yao, H. Subramoni, and DK Panda, Sixth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2023
29	Democratizing HPC Access and Use with Knowledge Graphs, P. Kousha, V. Sathu, M. Lieber, H. Subramoni, and DK Panda, D-HPC 2023: The First International Workshop on Democratizing High-Performance Computing, Nov 2023
30	DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs, B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Practice and Experience in Advanced Research Computing 23, Jul 2023
31	Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication, N. Contini, B. Ramesh, K. Suresh, T. Tran, B. Michalowicz, M. Abduljabbar, H. Subramoni, and DK Panda, International Conference on Supercomputing 2023, Jun 2023
32	SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC, P. Kousha, A. Jain, A. Kolli, M. Lieber, M. Han, N. Contini, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2023, May 2023
33	A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs, K. Suresh, B. Michalowicz, B. Ramesh, N. Contini, J. Yao, S. Xu, A. Shafi, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
34	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication, Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
35	MCR-DL: Mix-and-Match Communication Runtime for Deep Learning, Q. Anthony, Ammar Awan, J. Rasley, Y. He, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
36	Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc, K. Khorassani, C. Chen, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
37	In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences, B. Michalowicz, K. Suresh, B. Ramesh, A. Shafi, H. Subramoni, M. Abduljabbar, and DK Panda, 25th Workshop on Advances in Parallel and Distributed Computational Models, May 2023 [Held in conjunction with IPDPS 2023]
38	Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences, C. Chen, K. Khorassani, G. Kuncham, R. Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, The 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023
39	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
40	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads, Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
41	Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI, K. Al Attar, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, IEEE Cluster '22, Sep 2022
42	Network-Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, and DK Panda, Hot Interconnects 29, Aug 2022
43	High Performance MPI over the Slingshot Interconnect: Early Experiences, K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2022 [Best Student Paper Award]
44	Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems, C. Chen, K. Khorassani, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, Heterogeneity in Computing Workshop (HCW 2022), May 2022 [held in conjunction with IPDPS'22]
45	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022
46	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters, A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022
47	Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters, Q. Zhou, P. Kousha, Q. Anthony, K. Khorassani, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022
48	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Research Poster] [Best Poster Award]
49	DistMILE: A Distributed Multi-Level Framework for Scalable Graph Embedding, Yuntian He, Saket Gurukar, P. Kousha, H. Subramoni, DK Panda, and Srinivasan Parthasarathy, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021
50	Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems, B. Ramesh, J. Hashmi, S. Xu, A. Shafi, M. Ghazimirsaeed, M. Bayatpour, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Best Paper Finalist]
51	Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs, A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE Hot Interconnects, Aug 2021
52	BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs, M. Bayatpour, N. Sarkauskas, H. Subramoni, J. Hashmi, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021
53	Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences, K. Khorassani, J. Hashmi, C. Chu, C. Chen, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021
54	SUPER: SUb-Graph Parallelism for TransformERs, A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, DK Panda, and B. Essen, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021
55	Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters, Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and DK Panda, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Best Paper Finalist]
56	Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems, K. Khorassani, C. Chu, Q. Anthony, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021
57	Efficient MPI-based Communication for GPU-Accelerated Dask Applications, A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021
58	Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications, A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, 27th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2020
59	GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training, A. Jain, Ammar Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, DK Panda, R. Machiraju, and A. Parwani, SC 2020, Nov 2020
60	Exploring Hybrid MPI+Kokkos Tasks Programming Model, Samuel Khuvis, K. Tomko, J. Hashmi, and DK Panda, The 3rd Annual Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM), Nov 2020 [held in conjunction with SC’20]
61	Design and Characterization of InfiniBand Hardware Tag Matching in MPI, M. Bayatpour, M. Ghazimirsaeed, S. Xu, H. Subramoni, and DK Panda, The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, Nov 2020
62	Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR, M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 6th Workshop on Machine Learning in HPC Environments, Nov 2020
63	Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters, C. Chu, K. Khorassani, Q. Zhou, H. Subramoni, and DK Panda, 22nd IEEE International Conference on Cluster Computing (IEEE Cluster 2020), Sep 2020
64	Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM, P. Kousha, S. D. Kamal Raj, H. Subramoni, DK Panda, H. Na, T. Dockendorf, and K. Tomko, Practice and Experience in Advanced Research Computing 2020, Jul 2020
65	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020
66	Communication-Aware Hardware-Assisted MPI Overlap Engine, M. Bayatpour, J. Hashmi, S. Chakraborty, K. Suresh, M. Ghazimirsaeed, B. Ramesh, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020
67	HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow, Ammar Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020
68	Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures, J. Hashmi, S. Xu, B. Ramesh, M. Bayatpour, H. Subramoni, and DK Panda, 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS '20), May 2020
69	High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems, C. Chu, J. Hashmi, K. Khorassani, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019
70	Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2, S. Xu, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019
71	Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast, A. Ruhela, B. Ramesh, S. Chakraborty, H. Subramoni, J. Hashmi, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019
72	OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks, K. Vadambacheri Manian, C. Chu, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, 10th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov 2019
73	Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera, A. Jain, Ammar Awan, H. Subramoni, and DK Panda, 3rd Deep Learning on Supercomputers Workshop (DLS) at SC19, Nov 2019
74	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters, A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019
75	Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects, Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, 26th Symposium on High-Performance Interconnects (HotI '19), Aug 2019
76	Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter, S. Chakraborty, S. Xu, H. Subramoni, and DK Panda, HOT Interconnects 26, Aug 2019
77	Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences, K. Khorassani, C. Chu, H. Subramoni, and DK Panda, International Workshop on OpenPOWER for HPC, held in conjunction with ISC'19, Jun 2019
78	Reduction Operations on Modern Supercomputers: Challenges and Solutions, M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2019, Jun 2019 [Best Poster Award]
79	FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019 [Best Paper Finalist]
80	C-GDR: High-Performance Container-aware GPUDirect MPI Communication Schemes on RDMA Networks, J. Zhang, X. Lu, C. Chu, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019
81	Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019
82	Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation, Ammar Awan, J. Bedorf, C. Chu, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019
83	Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures, K. Vadambacheri Manian, Ammar Awan, A. Ruhela, C. Chu, and DK Panda, 12th Workshop on General Purpose Processing Using GPU (GPGPU 2019) @ ASPLOS 2019, Apr 2019
84	OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, Ammar Awan, C. Chu, H. Subramoni, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018
85	Cooperative Rendezvous Protocols for Improved Performance and Overlap, S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, 2018 The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2018 [Best Student Paper Finalist]
86	Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
87	Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? Ammar Awan, C. Chu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
88	Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures, M. Li, X. Lu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
89	SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives, M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and DK Panda, IEEE Cluster 2018, Sep 2018 [Best Paper Award]
90	Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
91	Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors, J. Hashmi, K. Hamidouche, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
92	Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand, M. Li, X. Lu, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
93	MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI, S. Gugnani, X. Lu, F. Pestilli, C.F. Caiafa, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
94	Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? J. Zhang, X. Lu, and DK Panda, 10th IEEE/ACM International Conference on Utility and Cloud Computing, Dec 2017 [Best Student Paper Award]
95	An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures, Ammar Awan, H. Subramoni, and DK Panda, 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017
96	Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and DK Panda, SuperComputing 2017, Nov 2017
97	MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, DK Panda, 24th European MPI Users' Group Meeting, Sep 2017 [Best Paper]
98	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017 : International Conference on Parallel Processing, Aug 2017
99	MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, Davide Rossetti, and DK Panda, ICPP 2017 : International Conference on Parallel Processing, Aug 2017
100	Exploiting and Evaluating OpenSHMEM on KNL Architecture, J. Hashmi, M. Li, H. Subramoni, and DK Panda, Fourth Workshop on OpenSHMEM and Related Technologies, Aug 2017
101	Designing Dynamic and Adaptive MPI Point-to-point Communication Protocols for Efficient Overlap of Computation and Communication, H. Subramoni, S. Chakraborty, and DK Panda, International Supercomputing Conference (ISC ’17), Jun 2017 [Hans Meuer Award (Most Outstanding Research Paper)]
102	High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters, J. Zhang, X. Lu, and DK Panda, 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS '17), May 2017
103	Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand J. Zhang, X. Lu, and DK Panda, 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), Apr 2017 [Bib - Plain]
104	S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters Ammar Awan, K. Hamidouche, J. Hashmi, and DK Panda, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2017 [Slides] [Bib - Plain]
105	Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA M. Li, X. Lu, K. Hamidouche, J. Zhang, and DK Panda, 23rd IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2016 [Bib - Plain]
106	Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters D. Banerjee, K. Hamidouche, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016 [Bib - Plain]
107	Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models J. Hashmi, K. Hamidouche, and DK Panda, 18th IEEE International Conference on High Performance Computing and Communications (HPCC'16), Dec 2016 [Bib - Plain]
108	Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, First Workshop on Optimization of Communication in HPC runtime systems (COMHPC, SC Workshop), Nov 2016 [Bib - Plain]
109	OpenSHMEM NonBlocking Data Movement Operations with MVAPICH2-X: Early Experiences K. Hamidouche, J. Zhang, K. Tomko, and DK Panda, PGAS Applications Workshop, Nov 2016 [Bib - Plain]
110	Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and DK Panda, SuperComputing 2016, Nov 2016 [Bib - Plain]
111	Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), Oct 2016 [Bib - Plain]
112	Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Awan, K. Hamidouche, A. Venkatesh, and DK Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up] [Bib - Plain]
113	SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem J. Zhang, X. Lu, S. Chakraborty, and DK Panda, 22nd International European Conference on Parallel and Distributed Computing (Euro-Par '16), Aug 2016 [Bib - Plain]
114	High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters J. Zhang, X. Lu, and DK Panda, The 45th International Conference on Parallel Processing (ICPP '16), Aug 2016 [Bib - Plain]
115	INAM^2: InfiniBand Network Analysis & Monitoring with MPI H. Subramoni, A. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and DK Panda, International Supercomputing Conference, Jun 2016 [Slides] [Bib - Plain]
116	Performance Characterization of Hypervisor- and Container-based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters J. Zhang, X. Lu, and DK Panda, IPDRM '16 (IPDPS Workshop), May 2016 [Bib - Plain]
117	Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled System C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and DK Panda, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016 [Bib - Plain]
118	CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters C. Chu, K. Hamidouche, A. Venkatesh, Ammar Awan, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016 [Bib - Plain]
119	SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability S. Chakraborty, H. Subramoni, J. Perkins, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016 [Bib - Plain]
120	High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR M. Li, K. Hamidouche, X. Lu, J. Zhang, J. Lin, and DK Panda, HiPC '15, Dec 2015 [Bib - Plain]
121	A Case for Application-Oblivious Energy-Efficient MPI Runtime A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, DK Panda, D. Kerbyson, and A. Hoisie, Supercomputing 2015, Nov 2015 [Best Student Paper Finalist] [Bib - Plain]
122	GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks Ammar Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and DK Panda, EuroMPI 2015, Sep 2015 [Bib - Plain]
123	High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits M. Li, H. Subramoni, K. Hamidouche, X. Lu, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
124	Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
125	Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and DK Panda, 23rd International Symposium on High Performance Interconnects 2015, Aug 2015 [Bib - Plain]
126	High Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters M. Li, K. Hamidouche, X. Lu, J. Lin, and DK Panda, Euro-Par '2015, Aug 2015 [Bib - Plain]
127	A Case for Non-Blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X Ammar Awan, K. Hamidouche, C. Chu, and DK Panda, OpenSHMEM 2015 for PGAS Programming in the Exascale Era, Aug 2015 [Bib - Plain]
128	Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters H. Subramoni, Ammar Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and DK Panda, ISC '15, Jul 2015 [Bib - Plain]
129	On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI S. Chakraborty, H. Subramoni, J. Perkins, Ammar Awan, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
130	High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation J. Lin, K. Hamidouche, X. Lu, M. Li, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
131	Non-blocking PMI Extensions for Fast MPI Startup S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
132	MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds J. Zhang, X. Lu, M. Arnold, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
133	Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
134	Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
135	A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters A. Venkatesh, H. Subramoni, K. Hamidouche, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
136	High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
137	Scalable MiniMD Design with Hybrid MPI and OpenSHMEM M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko, and DK Panda, OUG '14 (Co-located with PGAS), Oct 2014 [Bib - Plain]
138	Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '14), Oct 2014 [Bib - Plain]
139	PMI Extensions for Scalable MPI Startup S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
140	Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
141	HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
142	Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters H. Subramoni, K. Kandalla, J. Jose, K. Tomko, K. Schulz, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
143	High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs, and Application Co-Design J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
144	Scalable Graph500 Design with MPI-3 RMA M. Li, X. Lu, S. Potluri, K. Hamidouche, J. Jose, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
145	Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? J. Zhang, X. Lu, J. Jose, R. Shi, and DK Panda, Euro-Par 2014 Parallel Processing, Aug 2014 [Bib - Plain]
146	MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W. Rahman, and DK Panda, International Symposium on High Performance and Distributed Computing (HPDC), Jun 2014 [Bib - Plain]
147	Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty, and DK Panda, IEEE International Supercomputing Conference (ISC ’14), Jun 2014 [Bib - Plain]
148	High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS’14), May 2014 [Bib - Plain]
149	Optimizing Collective Communication in UPC J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), May 2014 [Slides] [Bib - Plain]
150	A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and DK Panda, OpenSHMEM Workshop, Mar 2014 [Bib - Plain]
151	Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Programming Model on Multi-Core Systems M. Luo, X. Lu, K. Hamidouche, K. Kandalla, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP '14), Feb 2014 [Bib - Plain]
152	The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC DK Panda, K. Tomko, K. Schulz, and A. Majumdar, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, Nov 2013 [Bib - Plain]
153	MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and DK Panda, International Conference on Supercomputing (SC 2013), Nov 2013 [Bib - Plain]
154	A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-Blocking Alltoallv Collective on Multi-core Systems K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
155	Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
156	UPC on MIC: Early Experiences with Native and Symmetric Modes M. Luo, M. Li, A. Venkatesh, X. Lu, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
157	Optimizing Collective Communication in OpenSHMEM J. Jose, K. Kandalla, S. Potluri, J. Zhang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
158	Design of Network Topology Aware Scheduling Services for Large InfiniBand Clusters H. Subramoni, D. Bureddy, K. Kandalla, K. Schulz, B. Barth, J. Perkins, M. Arnold, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
159	A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
160	Efficient and Truly Passive MPI-3 RMA Using InfiniBand Atomics M. Li, S. Potluri, K. Hamidouche, J. Jose, and DK Panda, EuroMPI 2013, Sep 2013 [Slides] [Bib - Plain]
161	Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, and DK Panda, International Symposium on High-Performance Interconnects (HotI '13), Aug 2013 [Bib - Plain]
162	MVAPICH2-MIC: A High-Performance MPI Library for Xeon Phi Clusters with InfiniBand S. Potluri, K. Hamidouche, D. Bureddy, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
163	Optimized MPI Gather collective for Many Integrated Core (MIC) InfiniBand Clusters A. Venkatesh, K. Kandalla, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
164	A 1PB/s File System to Checkpoint Three Million MPI Tasks R. Rajachandrasekar, A. Moody, K. Mohror, and DK Panda, International Conference on High Performance Distributed Computing (HPDC '13), Jun 2013 [Slides] [Bib - Plain]
165	Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models J. Jose, S. Potluri, K. Tomko, and DK Panda, International Supercomputing Conference (ISC '13), Jun 2013 [Slides] [Bib - Plain]
166	MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and DK Panda, International Conference on Supercomputing (ICS '13), Jun 2013 [Bib - Plain]
167	Extending OpenSHMEM for GPU Computing S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '13), May 2013 [Slides] [Bib - Plain]
168	Evaluation of Energy Characteristics of MPI Communication Primitives with RAPL A. Venkatesh, K. Kandalla, and DK Panda, International Workshop on High Performance (High-Performance, Power-Aware Computing Workshop), May 2013 [Bib - Plain]
169	High Performance RDMA-Based Design of HDFS over InfiniBand N. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Slides] [Bib - Plain]
170	Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Bib - Plain]
171	Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand M. Luo, H. Wang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '12), Oct 2012 [Slides] [Bib - Plain]
172	Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation J. Jose, K. Kandalla, M. Luo, and DK Panda, International Conference on Parallel Processing (ICPP '12), Sep 2012 [Bib - Plain]
173	OMB-GPU: A Micro-benchmark suite for Evaluating MPI Libraries on GPU Clusters D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and DK Panda, EuroMPI 2012, Sep 2012 [Bib - Plain]
174	Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework R. Rajachandrasekar, J. Jaswani, H. Subramoni, and DK Panda, IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
175	Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and DK Panda, Int'l Workshop on Parallel Algorithm and Parallel Software (IWPAPS12), held in conjunction with IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
176	A Scalable InfiniBand Network-Topology-Aware Performance Analysis Tool for MPI H. Subramoni, J. Vienne, and DK Panda, International Workshop on Productivity and Performance (Proper '12), Aug 2012 [Bib - Plain]
177	Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing System J. Vienne, J. Chen, M. W. Rahman, N. Islam, H. Subramoni, and DK Panda, International Symposium on High-Performance Interconnects (HotI 2012), Aug 2012 [Bib - Plain]
178	Congestion Avoidance on Manycore High Performance Computing Systems M. Luo, DK Panda, C. Iancu, and K. Z. Ibrahim, International Conference on Supercomputing (ICS '12), Jun 2012 [Bib - Plain]
179	Redesigning MPI Shared Memory Communication for Large Multi-Core Architecture M. Luo, H. Wang, J. Vienne, and DK Panda, International Supercomputing Conference 2012, Jun 2012 [Bib - Plain]
180	Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and DK Panda, International Parallel and Distributed Processing Symposium 2012, May 2012 [Bib - Plain]
181	Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters S. P. Raikar, H. Subramoni, K. Kandalla, J. Vienne, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
182	Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI R. Rajachandrasekar, X. Besseron, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
183	Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication S. Potluri, H. Wang, D. Bureddy, A. Singh, C. Rosales, and DK Panda, International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2012 [Slides] [Bib - Plain]
184	Intra-MIC MPI Communication using MVAPICH2: Early Experience S. Potluri, K. Tomko, D. Bureddy, and DK Panda, TACC-Intel Highly-Parallel Computing Symposium, Apr 2012 [Slides] [Bib - Plain]
185	Multi-threaded UPC Runtime with Network Endpoints: Design Alternatives and Evaluation on Multi-core Architectures M. Luo, J. Jose, S. Sur, and DK Panda, International Conference on High Performance Computing (HiPC '11), Dec 2011 [Slides] [Bib - Plain]
186	UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters J. Jose, S. Potluri, M. Luo, S. Sur, and DK Panda, Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct 2011 [Slides] [Bib - Plain]
187	Can a Decentralized Metadata Service Layer benefit Parallel Filesystems? V. Meshram, X. Besseron, X. Ouyang, R. Rajachandrasekar, and DK Panda, Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS '11), held in conjunction with Cluster '11, Sep 2011 [Bib - Plain]
188	MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), Sep 2011 [Slides] [Bib - Plain]
189	Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and DK Panda, IEEE Cluster '11, Sep 2011 [Bib - Plain]
190	Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, and DK Panda, IEEE Cluster '11, Sep 2011 [Slides] [Bib - Plain]
191	Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters using Shared Memory Backed Windows S. Potluri, H. Wang, V. Dhanraj, S. Sur, and DK Panda, EuroMPI '11, Sep 2011 [Bib - Plain]
192	Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand S. Potluri, S. Sur, D. Bureddy, and DK Panda, EuroMPI '11, Sep 2011 [Slides] [Poster/Short Paper] [Bib - Plain]
193	CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart X. Ouyang, R. Rajachandrasekar, X. Besseron, H. Wang, J. Huang, and DK Panda, International Conference on Parallel Processing (ICPP '11), Sep 2011 [Slides] [Bib - Plain]
194	Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram, and DK Panda, Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 2011, held in conjunction with EuroPar, Aug 2011 [Bib - Plain]
195	INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur, DK Panda, and R. Brightwell, 4th International Workshop on Productivity and Performance (PROPER 2011), Aug 2011 [Slides] [Bib - Plain]
196	Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur, and DK Panda, Hot Interconnect '11, Aug 2011 [Bib - Plain]
197	High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Bib - Plain]
198	MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Slides] [Bib - Plain]
199	Efficient Intra-node Communication on Intel-MIC Clusters S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
200	SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience J. Jose, M. Li, X. Lu, K. Kandalla, M. Arnold, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
201	High Performance Pipelined Process Migration with RDMA X. Ouyang, R. Rajachandrasekar, X. Besseron, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
202	Beyond Block I/O: Rethinking Traditional Storage Primitives X. Ouyang, D. Nellans, R. Wipfel, D. Flynn, and DK Panda, 17th IEEE International Symposium on High Performance Computer Architecture (HPCA-17), Feb 2011 [Slides] [Bib - Plain]
203	Scalable Earthquake Simulation on Petascale Supercomputers Y. Cui, K. B. Olsen, T. H. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, DK Panda, A. Chourasia, J. Levesque, S. M. Day, and P. Maechling, SuperComputing 2010, Nov 2010 [Bib - Plain]
204	Unifying UPC and MPI Runtimes: Experience with MVAPICH J. Jose, M. Luo, S. Sur, and DK Panda, International Workshop on Partitioned Global Address Space (PGAS '10), Oct 2010 [Slides] [Bib - Plain]
205	RDMA-Based Job Migration Framework for MPI over InfiniBand X. Ouyang, S. Marcarelli, R. Rajachandrasekar, and DK Panda, IEEE International Conference on Cluster Computing 2010 (Cluster '10), Sep 2010 [Bib - Plain]
206	Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters H. Subramoni, P. Lai, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
207	Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters K. Kandalla, E. Mancini, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
208	High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2 M. Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '10), Sep 2010 [Bib - Plain]
209	Design and Evaluation of Generalized Collective Communication Primitives with Overlap using ConnectX-2 Offload Engine H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Symposium on High Performance Interconnects 2010, Aug 2010 [Bib - Plain]
210	Quantifying Performance Benefits of Overlap using MPI-2 in a Seismic Modeling Application S. Potluri, P. Lai, K. Tomko, S. Sur, Y. Cui, M. Tatineni, K. Schulz, W. Barth, A. Majumdar, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Bib - Plain]
211	Designing Truly One-Sided MPI-2 RMA Intra-node Communication on Multi-core Systems P. Lai, S. Sur, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Slides] [Bib - Plain]
212	High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand H. Subramoni, P. Lai, R. Kettimuthu, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'10), May 2010 [Slides] [Bib - Plain]
213	Enhancing Checkpoint Performance with Staging IO and SSD X. Ouyang, S. Marcarelli, and DK Panda, IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), May 2010 [Slides] [Bib - Plain]
214	Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather K. Kandalla, H. Subramoni, A. Vishnu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
215	Designing High-Performance and Resilient Message Passing on InfiniBand M. Koop, P. Shamis, I. Rabinovitz, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
216	Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
217	Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems X. Ouyang, K. Gopalakrishnan, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
218	CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems R. Gupta, P. Beckman, H. Park, E. Lusk, P. Hargrove, A. Geist, DK Panda, A. Lumsdaine, and J. Dongarra, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Bib - Plain]
219	Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand T. Gangadharappa, M. Koop, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '09), Sep 2009 [Bib - Plain]
220	Impact of Node Level Caching in MPI Job Launch Mechanisms J. Sridhar and DK Panda, EuroPVM/MPI '09, Sep 2009 [Slides] [Bib - Plain]
221	An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand A. Vishnu, M. Krishnan, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
222	Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters M. Koop, M. Luo, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
223	Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication on InfiniBand Clusters G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
224	RDMA over Ethernet - A Preliminary Study H. Subramoni, P. Lai, M. Luo, and DK Panda, International Workshop on High Performance Distributed Computing (HPI-DC '09), Sep 2009 [Slides] [Bib - Plain]
225	ProOnE: A General Purpose Protocol Onload Engine for Multi- and Many-Core Architectures P. Lai, P. Balaji, R. Thakur, and DK Panda, International Supercomputing Conference (ISC), Jun 2009 [Bib - Plain]
226	Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters K. Kandalla, H. Subramoni, G. Santhanaraman, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC'09), May 2009 [Slides] [Bib - Plain]
227	Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture X. Ouyang, K. Gopalakrishnan, and DK Panda, Int'l Conference on High Performance Computing 2009, Feb 2009 [Slides] [Bib - Plain]
228	ScELA: Scalable and Extensible Launching Architecture for Clusters J. Sridhar, M. Koop, J. Perkins, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
229	Designing High Performance pNFS With RDMA on InfiniBand R. Noronha, X. Ouyang, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Bib - Plain]
230	Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet P. Balaji, S. Bhagvat, R. Thakur, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
231	Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand H. Subramoni, G. Marsh, S. Narravula, P. Lai, and DK Panda, Workshop on High Performance Computational Finance (In conjunction with SC '08), Nov 2008 [OSU Technical Report Version (OSU-CISRC-10/08-TR51)] [Bib - Plain]
232	Scalable MPI Design over InfiniBand using eXtended Reliable Connection M. Koop, J. Sridhar, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
233	Efficient One-Copy MPI Shared Memory Communication in Virtual Machines W. Huang, M. Koop, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
234	IMCa: A High Performance Caching Frontend for GlusterFS on InfiniBand R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
235	Performance of HPC middleware over InfiniBand WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Bib - Plain]
236	Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems L. Chai, P. Lai, H. Jin, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
237	Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? A Case Study with MPI over InfiniBand M. Koop, R. Kumar, and DK Panda, 22nd ACM International Conference on Supercomputing (ICS '08), Jun 2008 [Bib - Plain]
238	Advanced RDMA-based Admission Control for Modern Data-Centers P. Lai, S. Narravula, K. Vaidyanathan, and DK Panda, CCGrid '08, May 2008 [Slides] [Bib - Plain]
239	Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, and S. Narravula, CCGrid '08, May 2008 [Slides] [Bib - Plain]
240	MPI Collectives on modern Multicore clusters: Performance Optimizations and Communication Characteristics A. Mamidala, R. Kumar, D. De, and DK Panda, CCGrid '08, May 2008 [Bib - Plain]
241	Scaling Alltoall Collective on Multi-core Systems R. Kumar, A. Mamidala, and DK Panda, International Workshop on Communication Architecture for Clusters, Apr 2008 [Slides] [Bib - Plain]
242	pNFS/PVFS2 over InfiniBand: Early Experiences L. Chai, X. Ouyang, R. Noronha, and DK Panda, Petascale Data Storage Workshop, Nov 2007 [Slides] [Bib - Plain]
243	Virtual Machine Aware Communication Libraries for High Performance Computing W. Huang, M. Koop, Q. Gao, and DK Panda, SuperComputing (SC'07), Nov 2007 [Slides] [Best Student Paper Finalist] [Bib - Plain]
244	Enhancing the Performance of NFSv4 with RDMA R. Noronha, L. Chai, S. Shepler, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI'07), Sep 2007 [Bib - Plain]
245	MPI-2 One Sided Usage and Implementation for Read Modify Write operations: A case study with HPCC G. Santhanaraman, S. Narravula, A. Mamidala, and DK Panda, EuroPVM/MPI 2007, Sep 2007 [Bib - Plain]
246	Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram M. Koop, S. Sur, and DK Panda, IEEE International Conference on Cluster Computing 2007, Sep 2007 [Bib - Plain]
247	High Performance Virtual Machine Migration with RDMA over Modern Interconnects W. Huang, Q. Gao, J. Liu, and DK Panda, IEEE International Conference on Cluster Computing 2007, Sep 2007 [Best Paper] [Bib - Plain]
248	Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OAT K. Vaidyanathan, L. Chai, W. Huang, and DK Panda, IEEE International Conference on Cluster Computing 2007, Sep 2007 [Bib - Plain]
249	Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand Q. Gao, W. Huang, M. Koop, and DK Panda, International Conference on Parallel Processing (ICPP'07), Sep 2007 [Slides] [Bib - Plain]
250	High Performance MPI over iWARP: Early Experiences S. Narravula, A. Mamidala, A. Vishnu, G. Santhanaraman, and DK Panda, Sep 2007 [Bib - Plain]
251	Designing NFS With RDMA For Security, Performance and Scalability R. Noronha, L. Chai, T. Talpey, and DK Panda, International Conference on Parallel Processing 2007, Sep 2007 [Bib - Plain]
252	Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms H. Subramoni, M. Koop, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
253	Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand M. Koop, W. Huang, K. Gopalakrishnan, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Bib - Plain]
254	Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms S. Sur, M. Koop, L. Chai, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
255	High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters M. Koop, S. Sur, Q. Gao, and DK Panda, 21st International ACM Conference on Supercomputing (ICS '07), Jun 2007 [Bib - Plain]
256	Nomad: Migrating OS-bypass Networks in Virtual Machines W. Huang, J. Liu, M. Koop, B. Abali, and DK Panda, Third International SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE), Jun 2007 [Bib - Plain]
257	High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2007), May 2007 [Slides] [Bib - Plain]
258	Design and Implementation of High Performance MVAPICH2: MPI2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2007), May 2007 [Bib - Plain]
259	Benefits of I/O Acceleration Technology (I/OAT) in Clusters K. Vaidyanathan, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2007 [Bib - Plain]
260	Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers K. Vaidyanathan, S. Narravula, P. Balaji, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2007 [Bib - Plain]
261	Improving Scalability of OpenMP Applications on MultiCore Systems Using Large Page Support R. Noronha, and DK Panda, International Workshop on Multithreaded Architectures and Applications (MTAAP), Mar 2007 [Bib - Plain]
262	High Performance MPI on IBM 12x InfiniBand Architecture A. Vishnu, B. Benton, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), Mar 2007 [Bib - Plain]
263	Automatic Path Migration over InfiniBand: Early Experience A. Vishnu, A. Mamidala, S. Narravula, and DK Panda, Third International Workshop on System Management Techniques, Mar 2007 [Bib - Plain]
264	Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT K. Vaidyanathan, W. Huang, L. Chai, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC), Mar 2007 [Bib - Plain]
265	Using Connection-Oriented and Connection-Less Transport on Performance and Scalability of Collective and One-sided operations: Trade-offs and Impact A. Mamidala, S. Narravula, A. Vishnu, G. Santhanaraman, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2007), Mar 2007 [Bib - Plain]
266	DDSS: A Low-Overhead Distributed Data Sharing Substrate for Cluster-Based Data-Centers over Modern Interconnects K. Vaidyanathan, S. Narravula, and DK Panda, International Conference on High Performance Computing (HiPC), Dec 2006 [Slides] [Bib - Plain]
267	Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements Q. Gao, F. Qin, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
268	Analyzing the Impact of Supporting Out-of-Order Communication on In-order Performance with iWARP P. Balaji, W. Feng, S. Bhagvat, DK Panda, R. Thakur, and W. Gropp, SuperComputing 2006, Nov 2006 [Bib - Plain]
269	High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis S. Sur, M. Koop, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
270	A Software Based Approach for Providing Network Fault Tolerance in Clusters Using the uDAPL Interface: MPI Level Design and Performance Evaluation A. Vishnu, P. Gupta, A. Mamidala, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
271	NemC: A Network Emulator for Cluster-of-Clusters H. Jin, S. Narravula, K. Vaidyanathan, and DK Panda, International Conf. on Computer Commn. and Networks, Oct 2006 [Bib - Plain]
272	Designing Efficient MPI Intra-node Communication Support for Modern Computer Architectures L. Chai, A. Hartono, and DK Panda, International Conference on Cluster Computing, Sep 2006 [Bib - Plain]
273	Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand A. Mamidala, A. Vishnu, and DK Panda, EuroPVM/MPI, Sep 2006 [Bib - Plain]
274	Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers K. Vaidyanathan, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2006 [Bib - Plain]
275	Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand M. Koop, W. Huang, A. Vishnu, and DK Panda, International Symposium on Hot Interconnect 2006 (HotI'06), Aug 2006 [Slides] [Bib - Plain]
276	Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Q. Gao, W. Yu, W. Huang, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Slides] [Bib - Plain]
277	High Performance Block I/O for Global File System (GFS) with InfiniBand RDMA S. Liang, W. Yu, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Bib - Plain]
278	A Case for High Performance Computing with Virtual Machines W. Huang, J. Liu, B. Abali, and DK Panda, International Conference on Supercomputing (ICS), Jun 2006 [Slides] [Bib - Plain]
279	High Performance VMM-Bypass I/O in Virtual Machines J. Liu, W. Huang, B. Abali, and DK Panda, USENIX Annual Technical Conference, Jun 2006 [Bib - Plain]
280	An MPI-Stream Hybrid Programming Model for Computational Clusters E. Mancini, G. Marsh, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Slides] [Bib - Plain]
281	Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
282	Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach M. Koop, T. Jones, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
283	Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System L. Chai, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
284	Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
285	Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks S. Narravula, H. Jin, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid 2006), May 2006 [Bib - Plain]
286	MPI over uDAPL: Can High Performance and Portability Exist Across Architectures? L. Chai, R. Noronha, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Bib - Plain]
287	Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters L. Chai, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Slides] [Bib - Plain]
288	Designing Next-Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. Jin, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2006 [Slides] [Bib - Plain]
289	Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters S. Sur, L. Chai, H. Jin, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Bib - Plain]
290	Adaptive Connection Management for Scalable MPI over InfiniBand W. Yu, Qi Gao, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Slides] [Bib - Plain]
291	Efficient SMP-Aware MPI-Level Broadcast over InfiniBand's Hardware Multicast A. Mamidala, L. Chai, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
292	Asynchronous Zero-Copy Communication for Synchronous Sockets Direct Protocol (SDP) over InfiniBand P. Balaji, S. Bhagvat, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
293	Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre W. Yu, R. Noronha, S. Liang, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
294	RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits S. Sur, L. Chai, H. Jin, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), Mar 2006 [Slides] [Bib - Plain]
295	A Case for UDP Offload Engines in LambdaGrids V. Vishwanath, P. Balaji, W. Feng, J. Leigh, and DK Panda, International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2006), Feb 2006 [Bib - Plain]
296	High Performance RDMA Based All-to-all Broadcast for InfiniBand Clusters S. Sur, U. Bondhugula, A. Mamidala, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
297	Supporting MPI-2 One Sided Communication on Multi-Rail InfiniBand Clusters: Design Challenges and Performance Benefits A. Vishnu, G. Santhanaraman, W. Huang, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
298	Supporting iWARP Compatibility and Features for Regular Network Adapters P. Balaji, H. Jin, K. Vaidyanathan, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2005 [Slides] [Bib - Plain]
299	Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines P. Balaji, W. Feng, Q. Gao, R. Noronha, W. Yu, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
300	Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device S. Liang, R. Noronha, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
301	Benefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous I/O W. Yu, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI) 2005, Sep 2005 [Slides] [Bib - Plain]
302	Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? S. Sur, A. Vishnu, H. Jin, W. Huang, and DK Panda, Hot Interconnect 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
303	Performance Characterization of a 10-Gigabit Ethernet TOE W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and DK Panda, Hot Interconnect 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
304	Performance Evaluation of MM5 on Clusters With Modern Interconnects: Scalability and Impact R. Noronha, and DK Panda, Euro-Par, Aug 2005 [Bib - Plain]
305	Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H. Jin, S. Narravula, K. Vaidyanathan, P. Balaji, and DK Panda, Workshop on High Performance Interconnects for Distributed Computing (HPI-DC); In conjunction with HPDC-14, Jul 2005 [Bib - Plain]
306	High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics W. Yu, S. Liang, and DK Panda, International Conference on Supercomputing (ICS '05), Jun 2005 [Bib - Plain]
307	LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. Jin, S. Sur, L. Chai, and DK Panda, International Conference on Parallel Processing (ICPP-05), Jun 2005 [Slides] [Bib - Plain]
308	Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 05), May 2005 [Slides] [Bib - Plain]
309	Can High Performance Software DSM Systems Designed With InfiniBand Features Benefit from PCI-Express? R. Noronha, and DK Panda, DSM Workshop, May 2005 [Bib - Plain]
310	Designing Multi-Level, Multi-Tier Data Center Architecture for Securing Distributed Infrastructure and Assets DK Panda, DHS Homeland Security Conference, Apr 2005 [Bib - Plain]
311	Analysis of Design Considerations for Optimizing Multi-Channel MPI over InfiniBand L. Chai, S. Sur, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Bib - Plain]
312	Scheduling of MPI-2 One Sided Operations over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Slides] [Bib - Plain]
313	Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM A. Vishnu, A. Mamidala, and H.- W, Workshop on System Management Tools on Large Scale Parallel Systems, Apr 2005 [Bib - Plain]
314	Design and Implementation of Open MPI over Quadrics/Elan4 W. Yu, T. S. Woodall, R. L. Graham, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 2005), Apr 2005 [Slides] [Bib - Plain]
315	On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand P. Balaji, S. Narravula, K. Vaidyanathan, H. Jin, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 05), Mar 2005 [Slides] [Bib - Plain]
316	Workload-driven Analysis of File Systems in Shared Multi-Tier Data-Centers over InfiniBand K. Vaidyanathan, P. Balaji, H. Jin, and DK Panda, Computer Architecture Evaluation using Commercial Workloads (in conjunction with HPCA), Feb 2005 [Slides] [Bib - Plain]
317	Scalable Startup of Parallel Programs over InfiniBand W. Yu, J. Wu, and DK Panda, International Conference on High Performance Computing (HiPC '04), Dec 2004 [Slides] [Bib - Plain]
318	Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation J. Liu, A. Vishnu, and DK Panda, SuperComputing 2004 Conference (SC 04), Nov 2004 [Slides] [Bib - Plain]
319	Reducing Diff Overhead in Software DSM Systems using RDMA Operations in InfiniBand R. Noronha, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
320	Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
321	Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck P. Balaji, H. V. Shah, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
322	Scalable and High Performance NIC-Based Allgather over Myrinet/GM W. Yu, D. Buntinas, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Slides] [Bib - Plain]
323	Efficient Barrier and Allreduce on IBA Clusters using Hardware Multicast and Adaptive Algorithms A. Mamidala, J. Liu, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
324	NIC-Based Offload of Dynamic User-Defined Modules for Myrinet Clusters A. Wagner, H. Jin, R. Riesen, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
325	Zero-Copy MPI Derived Datatype Communication over InfiniBand G. Santhanaraman, J. Wu, and DK Panda, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
326	Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters W. Jiang, J. Liu, H. Jin, DK Panda, D. Buntinas, R. Thakur, and W. Gropp, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
327	Performance Evaluation of InfiniBand with PCI Express J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Hot Interconnect 12 (HOTI 04), Aug 2004 [Bib - Plain]
328	Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-based Clusters S. Sur, H. Jin, and DK Panda, International Conference on Parallel Processing (ICPP '04), Aug 2004 [Bib - Plain]
329	Design and Implementation of MPICH2 over InfiniBand with RDMA Support J. Liu, W. Jiang, P. Wyckoff, DK Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
330	Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support J. Liu, A. Mamidala, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
331	High Performance Implementation of MPI Datatype Communication over InfiniBand J. Wu, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
332	Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand V. Tipparaju, G. Santhanaraman, J. Nieplocha, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
333	Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand J. Liu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
334	Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-Based Collective Message Passing Protocol W. Yu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
335	High Performance MPI-2 One-Sided Communication over InfiniBand W. Jiang, J. Liu, H. Jin, DK Panda, W. Gropp, and R. Thakur, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Slides] [Bib - Plain]
336	Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfiniBand J. Wu, P. Wyckoff, DK Panda, and R. Ross, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Bib - Plain]
337	Designing High Performance DSM Systems using InfiniBand Features R. Noronha, and DK Panda, International Workshop on Distributed Shared Memory Systems, Apr 2004 [Slides] [Bib - Plain]
338	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, International Symposium on Performance Analysis of Systems and Software, Apr 2004 [Bib - Plain]
339	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 04), Apr 2004 [Slides] [Bib - Plain]
340	Supporting Strong Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, SAN-03 Workshop (in conjunction with HPCA), Feb 2004 [Slides] [Bib - Plain]
341	Evaluating the Impact of RDMA on Storage I/O over InfiniBand J. Liu, DK Panda, and M. Banikazemi, SAN-03 Workshop (in conjunction with HPCA), Feb 2004 [Slides] [Bib - Plain]
342	Application-Bypass Reduction for Large-Scale Clusters A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
343	Supporting Efficient Noncontiguous Access in PVFS over InfiniBand J. Wu, P. Wyckoff, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
344	Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication V. Tipparaju, M. Krishnan, J. Nieplocha, G. Santhanaraman, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
345	Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and DK Panda, SuperComputing 2003, Nov 2003 [Bib - Plain]
346	Scalable NIC-based Reduction on Large-scale Clusters A. Moody, J. Fernandez, F. Petrini, and DK Panda, SuperComputing 2003, Nov 2003 [Bib - Plain]
347	High Performance Broadcast Support in LA-MPI over Quadrics W. Yu, S. Sur, DK Panda, R. T. Aulwes, and R. Graham, Los Alamos Computer Science Institute (LACSI) Symposium, Oct 2003 [Slides] [Bib - Plain]
348	High Performance and Reliable NIC-Based Multicast over Myrinet/GM-2 W. Yu, D. Buntinas, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Slides] [Bib - Plain]
349	PVFS over InfiniBand: Design and Performance Evaluation J. Wu, P. Wyckoff, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Bib - Plain]
350	Designing a Portable MPI-2 over Modern Interconnects using uDAPL Interface L. Chai, R. Noronha, P. Gupta, G. Brown, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
351	Efficient Hardware Multicast Group Management for Multiple MPI Communicators over InfiniBand A. Mamidala, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
352	Design Alternatives and Performance Trade-offs for Implementing MPI-2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
353	Fast and Scalable Barrier using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters S. Kini, J. Liu, J. Wu, P. Wyckoff, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
354	Demotion-Based Exclusive Caching through Demote Buffering: Design and Evaluations over Different Networks J. Wu, P. Wyckoff, and DK Panda, Workshop on Storage Network Architecture and Parallel I/O (SNAPI), Sep 2003 [Bib - Plain]
355	MIBA: A Micro-benchmark Suite for Evaluating InfiniBand Architecture Implementations B. Chandrasekaran, P. Wyckoff, and DK Panda, Performance TOOLS 2003, Sep 2003 [Bib - Plain]
356	Micro-Benchmark Level Performance Comparison of High-Speed Cluster Interconnects J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. P. Kini, P. Wyckoff, and DK Panda, Hot Interconnects 10, Aug 2003 [Bib - Plain]
357	High Performance RDMA-Based MPI Implementation over InfiniBand J. Liu, J. Wu, S. Kini, P. Wyckoff, and DK Panda, International Conference on Supercomputing (ICS '03), Jun 2003 [Bib - Plain]
358	QoS-aware Middleware for Cluster-based Servers to Support Interactive and Resource-Adaptive Applications S. Senapathi, B. Chandrasekharan, D. Stredney, H.-W. Shen, and DK Panda, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
359	Impact of High Performance Sockets on Data Intensive Applications P. Balaji, J. Wu, T. Kurc, U. Catalyurek, DK Panda, and J. Saltz, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
360	Application-Bypass Broadcast in MPICH over GM D. Buntinas, DK Panda, and R. Brightwell, Cluster Computing and Grid (CCGrid '03), May 2003 [Bib - Plain]
361	Optimizing Barrier and Lock Operations in ARMCI D. Buntinas, A. Saify, DK Panda, and Jarek Nieplocha, International Workshop on Communication Architecture for Clusters (CAC '03), Apr 2003 [Bib - Plain]
362	Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters R. Gupta, P. Balaji, DK Panda, and J. Nieplocha, International Parallel and Distributed Processing Symposium (IPDPS '03), Apr 2003 [Bib - Plain]
363	NIC-Based Reduction in Myrinet Clusters: Is It Beneficial? D. Buntinas, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
364	A Portable Client/Server Communication Middleware over SANs: Design and Performance Evaluation with InfiniBand J. Liu, M. Banikazemi, B. Abali, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
365	Impact of On-Demand Connection Management in MPI over VIA J. Wu, J. Liu, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
366	Efficient Barrier using Remote Memory Operations on VIA-Based Clusters R. Gupta, V. Tipparaju, J. Nieplocha, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
367	High Performance User-Level Sockets over Gigabit Ethernet P. Balaji, P. Shivam, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
368	A QoS Framework for Clusters to support Applications with Resource Adaptivity and Predictable Performance S. Senapathi, DK Panda, D. Stredney, and H.-W. Shen, International Workshop on Quality of Service (IWQoS), May 2002 [Bib - Plain]
369	Can User Level Protocols Take Advantage of Multi-CPU NICs? P. Shivam, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '02), Apr 2002 [Bib - Plain]
370	MPI/IO on DAFS Over VIA: Implementation and Performance Evaluation J. Wu, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, Apr 2002 [Bib - Plain]
371	Protocols and Strategies for Optimizing Remote Memory Operations on Clusters J. Nieplocha, V. Tipparaju, A. Saify, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, held in conjunction with IPDPS '02, Apr 2002 [Bib - Plain]
372	NIC-Based Atomic Operations on Myrinet/GM D. Buntinas, DK Panda, and W. Gropp, SAN-1 Workshop, Feb 2002 [Bib - Plain]
373	EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing P. Shivam, P. Wyckoff, and DK Panda, Supercomputing '01, Feb 2002 [Bib - Plain]
374	Implementing TreadMarks over GM on Myrinet: Challenges, Design Experiences and Performance Evaluation R. Noronha, and DK Panda, The Workshop on Communication Architecture for Clusters held in conjunction with IPDPS 2003, Sep 2001 [Slides] [Bib - Plain]
375	Implementing TreadMarks over VIA on Myrinet and Gigabit Ethernet: Challenges, Design Experience, and Performance Evaluation M. Banikazemi, J. Liu, DK Panda, and P. Sadayappan, International Conference on Parallel Processing 2001, Sep 2001 [Bib - Plain]
376	NIC-based Rate Control for Proportional Bandwidth Allocation in Myrinet Clusters A. Gulati, DK Panda, P. Sadayappan, and P. Wyckoff, International Conference on Parallel Processing 2001, Sep 2001 [Bib - Plain]
377	Performance Benefits of NIC-Based Barrier on Myrinet/GM D. Buntinas, DK Panda, and P. Sadayappan, Workshop on Communication Architecture for Clusters (CAC '01), Apr 2001 [Bib - Plain]
378	Fast NIC-Based Barrier over Myrinet/GM D. Buntinas, DK Panda, and P. Sadayappan, International Parallel and Distributed Processing Symposium, Apr 2001 [Bib - Plain]
379	Can Scatter Communication Take Advantage of Multidestination Message Passing? M. Banikazemi, and DK Panda, International Symposium on High Performance Computing (HiPC '00), Dec 2000 [Bib - Plain]
380	Characterization and Enhancement of Static Mapping Heuristics for Heterogeneous Systems Praveen Holenarsipur, V. Yarmolenko, J. Duato, DK Panda, and P. Sadayappan, International Symposium on High Performance Computing (HiPC '00), Dec 2000 [Bib - Plain]
381	Dynamic Mapping Heuristics in Heterogeneous Systems V. Yarmolenko, J. Duato, DK Panda, and P. Sadayappan, Workshop on Network-Based Computing, Aug 2000 [Bib - Plain]
382	Balancing Web Server Load for Adaptive Video Distribution A. Paul, W.-C. Feng, DK Panda, and P. Sadayappan, Workshop on Multimedia Computing, Aug 2000 [Bib - Plain]
383	Implementing TreadMarks on Virtual Interface Architecture (VIA): Design Issues and Alternatives M. Banikazemi, DK Panda, and P. Sadayappan, Ninth Workshop on Scalable Shared Memory Multiprocessors, Jun 2000 [Bib - Plain]
384	TupleQ: Fully-Asynchronous and Zero-Copy MPI over InfiniBand M. Koop, J. Sridhar, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Slides] [Bib - Plain]
385	MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand M. Koop, T. Jones, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Slides] [Bib - Plain]
386	Designing Passive Synchronization for MPI-2 One-Sided Communication to Maximize Overlap G. Santhanaraman, S. Narravula, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
387	VIBe: A Micro-benchmark Suite for Evaluating Virtual Interface Architecture (VIA) Implementations M. Banikazemi, J. Liu, S. Kutlug, A. Ramakrishna, P. Sadayappan, H. Sah, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
388	Efficient Multicast Algorithms for Heterogeneous Switch-based Irregular Networks of Workstations A. Singhal, M. Banikazemi, P. Sadayappan, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
389	Efficient Virtual Interface Architecture Support for the IBM SP Switch-Connected NT Clusters M. Banikazemi, V. Moorthy, L. Herger, DK Panda, and B. Abali, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
390	Adaptive Routing in RS/6000 SP-like Bidirectional Multistage Interconnection Networks M. Banikazemi, C. B. Stunkel, DK Panda, and B. Abali, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
391	Comparison and Evaluation of Design Choices for Implementing the Virtual Interface Architecture (VIA) M. Banikazemi, B. Abali, and DK Panda, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
392	Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages D. Buntinas, DK Panda, J. Duato, and P. Sadayappan, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
393	Fast Collective Communication Algorithms for Reflective Memory Network Clusters V. Moorthy, DK Panda, and P. Sadayappan, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
394	Implementing Efficient MPI on LAPI for the IBM-SP: Experiences and Performance Evaluation M. Banikazemi, R. Govindaraju, R. Blackmore, and DK Panda, International Parallel Processing Symposium (IPPS'99), Jan 2000 [Bib - Plain]
395	Low Latency Message Passing on Workstation Clusters using SCRAMNet V. Moorthy, M. Jacunski, M. Pillai, P. Ware, DK Panda, T. Page, P. Sadayappan, V. Nagarajan, and J. Daniel, International Parallel Processing Symposium (IPPS'99), Jan 2000 [Bib - Plain]
396	Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations M. Banikazemi, S. Prabhu, J. Sampathkumar, DK Panda, and P. Sadayappan, International Workshop on Heterogeneous Computing (HCW'99), Jan 2000 [Bib - Plain]
397	All-to-All Broadcast on Switch-Based Clusters of Workstations M. Jacunski, P. Sadayappan, and DK Panda, International Parallel Processing Symposium 1999, Apr 1999 [Bib - Plain]
398	Low Latency Message-Passing for Reflective Memory Networks M. Jacunski, V. Moorthy, P. Ware, M. Pillai, DK Panda, and P. Sadayappan, International Workshop on Communication, Jan 1999 [Bib - Plain]
399	Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch? R. Sivaram, R. Kesavan, DK Panda, and Craig B. Stunkel, International Conference on Parallel Processing, Aug 1998 [pp. 452-459] [Bib - Plain]
400	Experiences with Software MPEG-2 Video Decompression on an SMP PC A. Bala, D. Shah, W.-C. Feng, and DK Panda, ICPP Workshop, Aug 1998 [Bib - Plain]
401	HIPIQS: A High-Performance Switch Architecture using Input Queuing R. Sivaram, C. Stunkel, and DK Panda, International Parallel Processing Symposium (IPPS '98), Aug 1998 [Bib - Plain]
402	Prioritized Demand Multiplexing (PDM): A Low-Latency Virtual Channel Flow Control Framework for Prioritized Traffic A-H. Smai, DK Panda, and L-E. Thorelli, International Conference on High Performance Computing, Dec 1997 [Bib - Plain]
403	How Much Does Network Contention Affect Distributed Shared Memory Performance? D. Dai, and DK Panda, International Conference on Parallel Processing 1997, Dec 1997 [pp. 454-461] [Bib - Plain]
404	Optimal Multicast with Packetization and Network Interface Support R. Kesavan, and DK Panda, International Conference on Parallel Processing (ICPP'97), Dec 1997 [pp. 370-377] [Bib - Plain]
405	Multicasting on Switch-based Irregular Networks using Multi-drop Path-based Multidestination Worms R. Kesavan, and DK Panda, Parallel Computing, Routing, and Communication Workshop, Dec 1997 [Bib - Plain]
406	Multicasting in Irregular Networks with Cut-Through Switches using Tree-Based Multidestination Worms R. Sivaram, DK Panda, and C. B. Stunkel, Parallel Computing, Routing, and Communication Workshop, Dec 1997 [Bib - Plain]
407	How Can We Design Better Networks for DSM Systems? D. Dai, and DK Panda, Parallel Computing, Routing, and Communication Workshop, Dec 1997 [Bib - Plain]
408	Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact C. B. Stunkel, R. Sivaram, and DK Panda, International Symposium on Computer Architecture (ISCA'97), Jun 1997 [Bib - Plain]
409	A Reliable Hardware Barrier Synchronization Scheme R. Sivaram, C. B. Stunkel, and DK Panda, International Parallel Processing Symposium (IPPS'97), Apr 1997 [Bib - Plain]
410	Efficient Collective Communication on Heterogeneous Networks of Workstations M. Banikazemi, V. Moorthy, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
411	Impact of Adaptivity on the Behavior of Networks of Workstations under Bursty Traffic F. Silla, M. P. Malumbres, J. Duato, D. Dai, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
412	Designing Processor-cluster Based Systems: Interplay Between Cluster Organizations and Collective Communication Algorithms D. Basak, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
413	Reducing Cache Invalidation Overheads in Wormhole DSMs using Multidestination Message Passing D. Dai, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
414	Minimizing Node Contention in Multiple Multicast on Wormhole k-ary n-cube Networks R. Kesavan, and DK Panda, International Conference on Parallel Processing, Aug 1996 [Bib - Plain]
415	Hybrid Algorithms for Complete Exchange in 2D Meshes N. S. Sundar, D. N. Jayasimha, DK Panda, and P. Sadayappan, Proceedings of the International Conference on Supercomputing, May 1996 [Bib - Plain]
416	Multicast on Irregular Switch-based Networks with Wormhole Routing R. Kesavan, K. Bondalapati, and DK Panda, Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA-3), Feb 1996 [Bib - Plain]
417	Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms DK Panda, International Symposium on High Performance Computer Architecture, Jan 1995 [Bib - Plain]
418	Issues in Designing Scalable Systems with k-ary n-cube cluster-c organization DK Panda, and D. Basak, International Workshop on Parallel Processing, Dec 1994 [Bib - Plain]
419	Architectural Issues in Designing Heterogeneous Parallel Systems with Passive Star-Coupled Optical Interconnection R. Prakash, and DK Panda, International Symposium on Parallel Architectures, Dec 1994 [Bib - Plain]
420	Designing Large Hierarchical Multiprocessor Systems under Processor D. Basak, and DK Panda, International Parallel Processing Conference (ICPP '94), Aug 1994 [Bib - Plain]
421	Message-Ordering for Wormhole-Routed Multiport Systems with Link Contention and Routing Adaptivity DK Panda, and V. Dixit-Radiya, Scalable High Performance Computing Conference, May 1994 [Bib - Plain]
422	Complete Exchange in 2D Meshes N. S. Sundar, D. N. Jayasimha, DK Panda, and P. Sadayappan, Scalable High Performance Computing Conference, May 1994 [Bib - Plain]
423	Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme DK Panda, S. Singal, and P. Prabhakaran, Parallel Routing and Communication Workshop, May 1994 [Bib - Plain]
424	Scalable Architecture with k-ary n-cube cluster-c Organizations D. Basak, and DK Panda, Symposium on Parallel and Distributed Processing, Dec 1993 [Bib - Plain]
425	Task Assignment in Distributed-Memory Systems with Adaptive Wormhole Routing V. Dixit-Radiya, and DK Panda, Symposium on Parallel and Distributed Processing, Dec 1993 [Bib - Plain]
426	Optimal Phase Barrier Synchronization in k-ary n-cube Wormhole-routed Systems using Multirendezvous Primitives DK Panda, Workshop on Fine-Grain Massively Parallel Coordination, May 1993 [Bib - Plain]
427	Analysis of Routing in Pyramid Architectures T. Mzaik, S. Chandra, J. M. Jagadeesh, and DK Panda, IEEE National Aerospace and Electronics Conference (NAECON), May 1993 [Bib - Plain]
428	Benefits of Processor Clustering in Designing Large Parallel Systems: When and How? D. Basak, DK Panda, and M. Banikazemi, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
429	Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
430	An Efficient Scheme for Complete Exchange in 2D Tori Y.-C. Tseng, S. K. S. Gupta, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
431	Clustering and Intra-Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems V. Dixit-Radiya, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
432	Impact of Multiple Consumption Channels on Wormhole Routed k-ary n-cube Networks S. Balakrishnan, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
433	Barrier Synchronization in Distributed-Memory Multiprocessors using Rendezvous Primitives S. K. S. Gupta, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]
434	A Trip-based Multicasting Model for Wormhole-routed Networks with Virtual Channels Y. C. Tseng, and DK Panda, International Parallel Processing Symposium, Apr 1993 [Bib - Plain]

Technical Reports (8)
1	K. Vaidyanathan, P. Lai, S. Narravula, and DK Panda, Benefits of Dedicating Resource Sharing Services in Data-Centers for Emerging Multi-Core Systems, OSU-CISRC-8/07-TR53
2	K. Vaidyanathan, H. Jin, S. Narravula, and DK Panda, Accurate Load Monitoring for Cluster-based Web Data-Centers over RDMA-enabled Networks, OSU-CISRC-7/05-TR49
3	G. Marsh, A. Sampat, S. Potluri, and DK Panda, Scaling Advanced Message Queuing Protocol (AMQP) Architecture with Broker Federation and InfiniBand, OSU-CISRC-5/09-TR17
4	W. Huang, J. Liu, B. Abali, and DK Panda, InfiniBand Support in Xen Virtual Machine Environment, OSU-CISRC-2/06-TR18
5	P. Balaji, W. Feng, and DK Panda, The Convergence of Ethernet and Ethernot: A 10-Gigabit Ethernet Perspective, OSU-CISRC-1/06-TR10
6	H. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji, and DK Panda, Performance Evaluation of RDMA over IP: A Case Study with Ammasso Gigabit Ethernet NIC, OSU-CISRC-6/05-TR40
7	K. Vaidyanathan, P. Balaji, J. Wu, H. Jin, and DK Panda, An Architectural Study of Cluster-Based Multi-Tier Data-Centers
8	S. Krishnamoorthy, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, Dynamic Reconfigurability Support for providing Soft QoS Guarantees in Cluster-based Multi-Tier Data-Centers over InfiniBand

Ph.D. Dissertations (36)
1	M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021
2	C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
3	J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
4	Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020
5	S. Chakraborty, High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures, Jun 2019
6	J. Zhang, Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters, Jul 2018
7	M. Li, Designing High-Performance Remote Memory Access for MPI and PGAS Models with Modern Networking Technologies on Heterogeneous Clusters, Nov 2017
8	A. Venkatesh, High-Performance Heterogeneity/Energy-Aware Communication for MultiPetaflop HPC Systems, Dec 2016
9	R. Rajachandrasekar, Designing Scalable And Efficient I/O Middleware for Fault-Resilient High-performance Computing Clusters, Nov 2014
10	J. Jose, Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware, Aug 2014
11	S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014
12	K. Kandalla, High Performance Non-Blocking Collective Communication for Next Generation InfiniBand Clusters, Jul 2013
13	M. Luo, Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand and Heterogeneous System, Jul 2013
14	H. Subramoni, Topology-Aware MPI communication and Scheduling for High Performance Computing Systems, Jul 2013
15	X. Ouyang, Efficient Storage Middleware Design in InfiniBand Clusters for High-End Computing, Mar 2012
16	G. Santhanaraman, Designing Scalable And High Performance One Sided Communication Middleware For Modern Interconnects, Jun 2009
17	M. Koop, High-Performance Multi-Transport MPI Design For Ultra-Scale Infiniband Clusters, Jun 2009
18	L. Chai, High Performance And Scalable MPI Intra-Node Communication Middleware For Multi-Core Clusters, Mar 2009
19	W. Huang, High Performance Network I/O In Virtual Machines Over Modern Interconnects, Aug 2008
20	R. Noronha, Designing High-Performance and Scalable Clustered Network Attached Storage With InfiniBand, Aug 2008
21	S. Narravula, Designing High-Performance and Scalable Distributed Datacenter Services over Modern Interconnects, Aug 2008
22	A. Mamidala, Scalable and High Performance Collective Communication For Next Generation Multicore InfiniBand Clusters, May 2008
23	K. Vaidyanathan, High Performance and Scalable Soft Shared State for Next-Generation Datacenters, May 2008
24	A. Vishnu, High Performance and Network Fault Tolerant MPI with Multi-Pathing Over InfiniBand, Dec 2007
25	S. Sur, Scalable and High Performance MPI Design for Very Large InfiniBand Clusters, Aug 2007
26	W. Yu, Enhancing MPI with Modern Networking Mechanisms in Cluster Interconnects, Jun 2006
27	P. Balaji, High Performance Communication Support for Sockets Based Applications over High-Speed Networks, Jun 2006
28	J. Liu, Designing High Performance and Scalable MPI over InfiniBand, Sep 2004
29	J. Wu, Communication and Memory Management in Networked Storage Systems, Sep 2004
30	D. Buntinas, Improving Cluster Performance through the Use of Programmable Network Interfaces, Jun 2003
31	M. Banikazemi, Design and Implementation of High Performance Communication Subsystems for Clusters, Dec 2000
32	D. Dai, Designing Efficient Communication Subsystems for Distributed Shared Memory (DSM) Systems, Mar 1999
33	R. Kesavan, Communication Mechanisms and Algorithms for Supporting Scalable Collective Communication on Parallel Systems, Oct 1998
34	R. Sivaram, Architectural Support for Efficient Communication in Scalable Parallel Systems, Aug 1998
35	D. Basak, Designing High Performance Parallel Systems: A Processor-Cluster Based Approach, Jul 1996
36	V. Dixit-Radiya, Mapping on Wormhole-routed Distributed-Memory Systems: A Temporal Communication Graph-based Approach, Mar 1995

M.S. Theses (31)
1	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
3	Kamal Raj Sankarapandian, Profiling MPI Primitives in Real-time Using OSU INAM, Apr 2020
4	R. Biswas, Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems, Jul 2018
5	A. Augustine, Designing a Scalable Network Analysis and Monitoring Tool with MPI Support, Aug 2016
6	V. Dhanraj, Enhancement of LIMIC-Based Collectives for Multi-core Clusters, Aug 2012
7	A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012
8	S. Pai Raikar, Network Fault-Resilient MPI for Multi-Rail InfiniBand Clusters, Dec 2011
9	N. Dandapanthula, InfiniBand Network Analysis and Monitoring using OpenSM, Aug 2011
10	V. Meshram, Distributed Metadata Management for Parallel Systems, Aug 2011
11	G. Marsh, Evaluation of High Performance Financial Messaging on Modern Multi-core Systems, Mar 2010
12	K. Gopalakrishnan, Enhancing Fault Tolerance in MPI for Modern InfiniBand Clusters, Aug 2009
13	T. Gangadharappa, Designing Support For MPI-2 Programming Interfaces On Modern Interconnects, Jun 2009
14	J. Sridhar, Scalable Job Startup And Inter-Node Communication In Multi-Core Infiniband Clusters, Jun 2009
15	R. Kumar, Enhancing MPI Point-to-Point and Collectives for Clusters with Onloaded/Offloaded InfiniBand Adapters, Aug 2008
16	S. Bhagvat, Designing and Enhancing the Sockets Direct Protocol (SDP) over iWARP and InfiniBand, Aug 2006
17	S. Krishnamoorthy, Dynamic Re-Configurability Support to Provide Soft QoS Guarantees in Cluster-Based Multi-Tier Data-Centers over InfiniBand, Jun 2004
18	W. Jiang, High Performance MPICH2 One-Sided Communication Implementation over InfiniBand, Jun 2004
19	A. Wagner, Static and Dynamic Processing Offload on Myrinet Clusters with Programmable NIC Support, Jun 2004
20	A. Moody, NIC-based Reduction on Large-Scale Quadrics Clusters, Dec 2003
21	B. Chandrasekharan, Micro-benchmark Level Performance Evaluation and Comparison of High Speed Cluster Interconnects, Sep 2003
22	S. Kini, Efficient Collective Communication using Multicast and RDMA Operations for InfiniBand-based Clusters, Jun 2003
23	S. Senapathi, QoS-Aware Middleware to Support Interactive and Resource Adaptive Applications on Myrinet Clusters, Sep 2002
24	P. Shivam, High Performance User Level Protocol on Gigabit Ethernet, Aug 2002
25	R. Gupta, Efficient Collective Communication using Remote Memory Operations on VIA-Based Clusters, Aug 2002
26	A. Saify, Optimizing Collective Communication Operations in ARMCI, Jul 2002
27	S. Desai, Mechanisms for Implementing Efficient Collective Communication in Clusters with Application Bypass, Jun 2002
28	V. Tipparaju, Optimizing ARMCI Get/Put Operations on Myrinet/GM, Sep 2001
29	A. Gulati, A Proportional Bandwidth Allocation Scheme for Myrinet Clusters, Jun 2001
30	V. Kota, Designing Efficient Inter-Cluster Communication Layer for Distributed Computing, Jun 2001
31	S. Kutlug, Performance Evaluation and Analysis of User Level Networking Protocols in Clusters, Jun 2000

