OSU INAM Changelog
--------------------
This file briefly describes the changes to the OSU InfiniBand Network Analysis
and Monitoring Tool (OSU INAM) software package. The logs are arranged in the
"most recent first" order.

OSU INAM v1.1 (03/11/2024)

* Major Features & Enhancements (since 1.0):
    - Support for ClickHouse Database to support real-time querying and
      visualization of very large HPC clusters (20,000+ nodes)
    - Support for up to 64 parallel insertions for multiple sources of
      profiling data
    - Support for up to 64 concurrent users to access OSU INAM with
      sub-second latency by using ClickHouse
    - Improved stability of OSU INAM operation
    - Reduced disk space usage by using ClickHouse
    - Change Default Bulk Insertion Size based on Database used to improve
      real-time view of network traffic
    - Extending notifications to support multiple criteria

* Bug fixes
    - Fix issues loading certain switch nickname files
    - Fix a bug for showing link level information for live jobs

OSU INAM v1.0 (11/10/2022)

* Major Features & Enhancements (since 0.9.8):
    - Support for data loading progress bars on the UI for all charts
    - Enhanced the UI by making asynchronous calls for data loading
    - Support for detailed debugging levels for the INAM daemon

* Bug fixes
    - Fix issues for loading MPI process data for InfluxDB
    - Fix issues for loading historical MPI Performance Variables (PVAR) on
      the initialization of the chart
    - Fix the issue where the INAM daemon stops receiving MPI information
    - Fix the issue of refreshing the PVAR plot based on a metric change

OSU INAM v0.9.8 (08/11/2022)

* Major Features & Enhancements (since 0.9.7):
    - Support for MySQL and InfluxDB as database backends
    - Enhanced database insertion using InfluxDB
        - MPI_T Performance Variables (PVARs)
        - Fabric topology
        - InfiniBand port data counters and errors
    - Support for continuous queries to improve visualization performance
    - Support for SLURM multi-cluster configuration
    - Significantly improved database query performance when
      using InfluxDB resulting in improvements for the following:
        - Live switch page
        - Live jobs page
        - Live node page
        - Historic switch page
        - Historic node pages
        - Historic jobs pages
    - Support for automatic data retention policy when using InfluxDB

OSU INAM v0.9.7 (03/07/2022)

* Major Features & Enhancements (since 0.9.6):
    - Significantly improved page load performance for
        - Live switch page
        - Live jobs page
        - Live node page
        - Live link information page
        - Historic node pages
        - Historic jobs pages
    - Add support for refreshing charts when zooming in and out in historic
      switches page
    - Add support to show the number of data points found for InfiniBand
      switch counters and time taken to fetch them for each port in the
      historic switch page
    - Enhanced debugging, error reporting, and logging capabilities on OSU
      INAM web-based user interface
    - Display "no data" for fields that have no data to display
    - Enhanced stability and performance in emulator mode
    - Add support to display historic job information while in emulator mode

* Bug fixes
    - Fix issues when different time zones are used in the web UI and server
    - Fix issues with dates when collecting InfiniBand port counters
    - Fix issues on the web UI when date field is NULL or missing in database
    - Fix index out of bounds exception when selecting a link between a
      cluster and switch
    - Fix index out of bounds exception when phantomJS configuration is
      invalid

OSU INAM v0.9.6 (06/10/2020)

* Major Features & Enhancements (since 0.9.5):
    - Support to collect and visualize MPI_T based performance data
        - Ability to collect and display "most used" MPI primitives
            - Node, job, and cluster level granularities
            - Live and historical views
        - Ability to collect and display MPI_T based performance data for
          each MPI primitive for different message ranges
            - Node, job, and cluster level granularities
            - Live and historical views
        - Ability to classify blocking and non-blocking data transfers for
          different message ranges
            - Node, job, and cluster level
              granularities
            - Live and historical views
    - Ability to gather and display Lustre I/O for MPI jobs
        - Support node, job, and cluster level granularities
        - Support live and historical views
    - Enable emulation mode to allow users to test OSU INAM tool in a sandbox
      environment without actual deployment. Emulation mode supports:
        - Node, job, and cluster level granularities
        - Live and historical views
        - Collecting and reporting internal details of MPI jobs
    - Support to search nodes and jobs for historical node/job and network
      view page
    - Generate email notifications to alert users when user defined events
      occur
    - Ability to select PBS/SLURM job schedulers at runtime
    - Support for MOFED 4.5, 4.6, 4.7, and 5.0
    - Support for ARM and OpenPower architecture
    - Support for HDR InfiniBand adapters and switches
    - Showing interval of querying data on the web front charts
    - Improving functionality and stability of OSU INAM daemon
    - Redesigned and optimized web front charts with clear legends

* Bug Fixes (since 0.9.5):
    - Fixed issue for live node where HCA query is enabled
    - Handled missing slashes for phantomJS configuration parameters
    - Fix memory leaks

OSU INAM 0.9.5 (12/18/2019)

* Major Features & Enhancements (since 0.9.4):
    - Support for 64 bit InfiniBand port counters
    - Optimized port counters API to fetch minimal data
    - Support for PBS job scheduler
    - Support multiple job schedulers on the same fabric
        - Thanks to Trey Dockendorf @OSC for the feedback
    - Support for InfiniBand port counters in live jobs page, live nodes
      page, historical jobs, and historical nodes pages
    - Support to display Job-level and Node-level CPU, Virtual Memory, and
      Communication Buffer utilization information for historical jobs
        - Thanks to Heechang Na @OSC for the feedback
    - Support to search switches with name and lid in historical switches
      page
        - Thanks to Heechang Na @OSC for the feedback
    - Support to update charts when the user changes time frame in
      historical jobs, nodes pages
        - Thanks to Heechang
          Na @OSC for the feedback
    - Optimized historical replay of the network view to yield quicker
      results
    - Support for adding user-defined labels to switches for better
      readability and usability
        - Thanks to Trey Dockendorf @OSC for the feedback
    - Support to view connection information at port level granularity for
      each switch
        - Thanks to Heechang Na @OSC for the feedback
    - Support to view information about all jobs running on the cluster in
      live node page
    - Added information tooltips on various charts throughout OSU INAM
    - Added interval of querying and reading information to historical jobs,
      switches and nodes page
    - Support to configure refresh rates for network topology and links
    - Support authentication for accessing the OSU INAM webpage
    - Accelerated database purging capability
    - Stabilized rendering of live network view
    - Support for interpolation of process and port counters charts in live
      job page
    - Added logging to monitor topology refresh time
    - Added support to choose MPI ranks to visualize for Link Info page
    - Compatible with OFED v4.5.1

* Bug Fixes (since 0.9.4):
    - Fix out of index issues with very large database sizes
    - Fix issue with updating PhantomJS cache files when topology changes
    - Fix issue with rendering of Network view when PhantomJS is disabled
    - Fix issue with Aggregate mode for port counters
    - Fix issue with link info page for overlapping MPI ranks and height of
      charts
    - Fix issue with Phantom Read and read/write consistency on the website
        - Thanks to Heechang Na @OSC for reporting the issue
    - Fix issue with calculating link utilization based on querying interval
        - Thanks to Heechang Na @OSC for reporting the issue
    - Fix issue with pattern-based node name search
        - Thanks to Heechang Na @OSC for reporting the issue
    - Fix issue with correctly initializing data points for port counters
    - Fix issue with updating performance charts for different metrics
    - Fix issue with searching and filtering nodes and jobs on network view
    - Fixed labeling issues
      for charts
    - Fix issue with correctly handling changes in OSU INAM configuration
      file
    - Fix issue with displaying port data counters and port error counters
      for end compute nodes
        - Thanks to Trey Dockendorf @OSC for reporting the issue
    - Gracefully handle error when job scheduler component fails
    - Fix issue when searching for historical jobs
    - Handled jobs without job ID in Current Jobs page
    - Fix for search switch by name issue in Network page
        - Thanks to Trey Dockendorf @OSC for reporting the issue
    - Fix for link usage check box issue in Network page
        - Thanks to Heechang Na @OSC for reporting the issue
    - Fix for various issues related to port counters, process counters
        - Thanks to Heechang Na @OSC for reporting the issue
    - Fix for issues related to job-level, node-level and process-level CPU,
      Virtual Memory and Communication Buffer utilization in live job page

OSU INAM 0.9.4 (11/09/2018)

* Major Features & Enhancements (since 0.9.3):
    - Enhanced performance for fabric discovery using optimized OpenMP-based
      multi-threaded designs
    - Ability to gather InfiniBand performance counters at sub-second
      granularity for very large (>2,000 nodes) clusters
    - Redesign database layout to reduce database size
    - Enhanced fault tolerance for database operations
        - Thanks to Trey Dockendorf @ OSC for the feedback
    - OpenMP-based multi-threaded designs to handle database purge, read,
      and insert operations simultaneously
    - Improved database purging time by using bulk deletes
    - Automatically recreate tables if they are deleted or missing
    - Tune database timeouts to handle very long database operations
    - Improved debugging support by introducing several debugging levels
    - Safely exit osuinamd when no database has been created by the user
        - Thanks to Trey Dockendorf @ OSC for the feedback

* Bug Fixes (since 0.9.3):
    - Fix issue with web-based front end crashing and not being restarted
      automatically
    - Fix issue with locating node when searching network graph by node name
      on the web-based
      front end
    - Handle unexpected characters as input in the search boxes on the
      web-based front end
    - Handle negative and incorrect values for runtime parameters in the
      config file gracefully
    - Fix issue with marking MPI jobs as complete
    - Fix issue where osuinamd was not terminating properly after a crash
    - Gracefully handle error on Network view due to timeout
    - Gracefully handle error when database is full for osuinamd
    - Automatically reconnect to MySQL daemon if the connection is lost
    - Handle restarting MySQL service automatically
        - Thanks to Trey Dockendorf @ OSC for the feedback

OSU INAM 0.9.3 (02/20/2018)

* Major Features (since 0.9.2):
    - Enhance INAMD to query end nodes based on command line option
        - Thanks to Russell Auld @ Pratt & Whitney for the feedback
    - Add web page to display size of database in real-time
    - Enhance interaction between web application and SLURM job launcher for
      increased portability
    - Improve packaging of web application and daemon to ease installation
        - Single RPM containing web application, startup scripts, daemon,
          and configuration files
        - Include web server with jar file
        - Thanks to Trey Dockendorf @ OSC for the feedback
    - Enhance web interface to improve user experience
    - Update web application to use Java v1.8
    - Update web application to use Spring Boot v1.5.9
    - Improve debugging and logging support in daemon and web application
        - Add 'Debug' tab in web application to consolidate all debugging
          information
        - Flexibility to control verbosity of log messages

* Bug Fixes (since 0.9.2):
    - Fix compilation warnings and memory leaks

OSU INAM 0.9.2 (10/30/2017)

* Bug Fixes (since 0.9.1):
    - Fixes to make OSU INAM work with MVAPICH2-X 2.3b

OSU INAM 0.9.1 (05/13/2016)

* Major Features (since 0.9):
    - Add support to find routes starting from / ending on the selected node
      going through a user specified link
    - Capability to view link utilization at the network level for a user
      specified link
    - Enhanced load time by clustering for all compute
      nodes connected to a leaf switch
    - Support for using internal graph rendering libraries instead of Google
      chart APIs

* Bug Fixes (since 0.9):
    - Fix issue with addition / removal of nodes from network
        - Thanks to Doug Johnson@OSC for the report

OSU INAM 0.9 (03/30/2016)

* Major Features (since 0.8.5):
    - Significant enhancements to user interface to enable scaling to
      clusters with thousands of nodes
    - Improve database insert times by using 'bulk inserts'
    - Capability to look up list of nodes communicating through a network
      link
    - Capability to classify data flowing over a network link at job level
      and process level granularity in conjunction with MVAPICH2-X 2.2rc1
    - Capability to profile and report process to node communication matrix
      for MPI processes at user specified granularity in conjunction with
      MVAPICH2-X 2.2rc1

* Bug Fixes (since 0.8.5):
    - Fix memory leaks in the OSU INAM daemon

OSU INAM 0.8.5 (11/12/2015)

* Major Features (since 0.8):
    - Capability to profile and report the following parameters of MPI
      processes at node-level, job-level and process-level at user specified
      granularity in conjunction with MVAPICH2-X 2.2b
        - Memory Utilization
        - Inter-node communication buffer usage for RC transport
        - Inter-node communication buffer usage for UD transport
    - Improve network load time by clustering individual nodes
    - Introduce "Job Page" to display jobs in ascending/descending order of
      various performance metrics in conjunction with MVAPICH2-X 2.2b

* Bug Fixes (since 0.8):
    - Use "Big Integer" to handle GUIDs
        - Thanks to Michael Knox@Cray for the report
    - Fix issue caused due to recurring decimal generated while calculating
      link usage percentage
        - Thanks to Michael Knox@Cray for the report
    - Handle cases where special symbols appear in host-name
        - Thanks to Michael Knox@Cray for the report
    - Fix inconsistency with displaying data in graphs
        - Thanks to Michael Knox@Cray for the report

OSU INAM 0.8 (09/08/2015)

* Major Features
    - Analyze and profile network-level
      activities with many parameters (data and errors) at user specified
      granularity
    - Capability to analyze and profile node-level, job-level and
      process-level activities for MPI communication (Point-to-Point,
      Collectives and RMA) in conjunction with MVAPICH2-X
    - Remotely monitor CPU utilization of MPI processes at user specified
      granularity in conjunction with MVAPICH2-X
    - Visualize the data transfer happening in a "live" fashion for
        - Entire Network - Live Network Level View
        - Particular Job - Live Job Level View
        - One or multiple Nodes - Live Node Level View
    - Capability to visualize data transfer that happened in the network at
      a time duration in the past for
        - Entire Network - Historical Network Level View
        - Particular Job - Historical Job Level View
        - One or multiple Nodes - Historical Node Level View