1. Overview of the OSU INAM Project

As InfiniBand (IB) based High Performance Computing (HPC) installations grow in size and scale, predicting the behavior of the IB network in terms of link usage and performance becomes an increasingly challenging task. Further, as computing and networking technologies continue to evolve in HPC platforms, it becomes increasingly essential to understand the interactions between HPC middleware infrastructures and the high-performance communication fabric they rely on. The OSU InfiniBand Network Analysis and Monitoring tool (OSU INAM) monitors IB clusters in real time by querying various subnet management entities in the network. It can also interact with the MVAPICH2-X software stack to gain insights into the communication pattern of the application and classify the data transferred into Point-to-Point, Collective, and Remote Memory Access (RMA). In conjunction with MVAPICH2-X, OSU INAM can also remotely monitor the CPU utilization of MPI processes.

This document contains the necessary information for users to download, install, test, use, tune and troubleshoot OSU INAM v0.9.4. We continuously fix bugs and update this document as per user feedback. Therefore, we strongly encourage you to refer to our web page for updates and feel free to give us your feedback.

2. Features

OSU INAM supports profiling InfiniBand Network traffic. It also has support to introspect the communication pattern of pure MPI programs and MPI+OpenMP programs built with MVAPICH2-X 2.3rc1. High level features of OSU INAM v0.9.4 are listed below.

2.1. Performance and Scalability Features

  • Capability to analyze and profile network-level activities with many parameters (data and errors) at user specified granularity

  • Significant enhancements to the user interface to enable scaling to clusters with thousands of nodes

  • Ability to gather InfiniBand performance counters at sub-second granularity for very large (>2,000 nodes) clusters

  • Enhanced performance for fabric discovery using optimized OpenMP-based multi-threaded designs

  • Enhanced fault tolerance for database operations

  • Improved database insert times by using bulk inserts

  • Improved database purging times by using bulk deletes

  • Improved network load time by clustering individual nodes

  • Improved debugging support by introducing several debugging levels

  • Capability to look up the list of nodes communicating through a network link

  • Capability to visualize the data transfer happening in a ‘live’ fashion - Live View for

    • Entire Network - Live Network Level View

    • One or multiple Jobs - Live Job Level View

    • One or multiple Nodes - Live Node Level View

  • Capability to visualize data transfer that happened in the network at a time duration in the past - Historical View for

    • Entire Network - Historical Network Level View

    • One or multiple Jobs - Historical Job Level View

    • One or multiple Nodes - Historical Node Level View

2.2. MVAPICH2-X Specific Features

  • Capability to analyze and profile node-level, job-level, and process-level activities for MPI communication (Point-to-Point, Collectives and RMA) at user specified granularity

  • Capability to profile and report the following parameters of MPI processes at node-level, job-level, and process-level at user specified granularity

    • CPU Utilization

    • Memory Utilization

    • Inter-node communication buffer usage for RC transport

    • Inter-node communication buffer usage for UD transport

  • Capability to profile and report process to node communication matrix for MPI processes at user specified granularity

  • Capability to visualize utilization of a given network link in a live fashion - Live View for

    • Data transferred via a link at Job Level

    • Data transferred via a link at Process Level

  • Support for "Job Page" using data pushed by MVAPICH2-X if SLURM is not enabled

2.3. MVAPICH2-X + SLURM Specific Features

  • Support for "Job Page" to display jobs in ascending/descending order of various performance metrics using SLURM’s sacct command.

3. Download and Installation Instructions

The OSU INAM package can be downloaded from http://mvapich.cse.ohio-state.edu/downloads/#osu-inam. Select the link for your distro. All OSU INAM RPMs are relocatable.

In order to use job tracking, SLURM accounting is required. New with this release is the use of sacct directly, so no database credentials for SLURM are needed. For more information visit https://slurm.schedmd.com/accounting.html. In order to enable SLURM, set OSU_INAM_ENABLE_SLURM=1 in $OSU_INAM_INSTALL_PREFIX/etc/osu-inamd.conf. NOTE - sacct needs to be available on the same host as the osu inam daemon in order to work.

Note
Please note that SLURM is the default job scheduler.
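Because job tracking depends on sacct being reachable from the daemon host, a quick pre-flight check can save a debugging round trip. The following is a minimal sketch, not part of OSU INAM: the helper name command_available is hypothetical, and it only inspects the PATH of the process that runs it, which may differ from the PATH seen by the osu-inamd service.

```python
import shutil


def command_available(name: str) -> bool:
    """Return True if `name` resolves to an executable on this process's PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # Run this on the host where the OSU INAM daemon is installed.
    if command_available("sacct"):
        print("sacct found; SLURM job tracking can be enabled")
    else:
        print("sacct not found on PATH; check the SLURM installation on this host")
```

If the check fails on the daemon host, SLURM-based job tracking is unlikely to work until sacct is made available there.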

3.1. RHEL6/CentOS6 packages

3.1.1. The following packages are required to get the OSU INAM tool working

  • mysql

  • mysql-devel

  • java 1.8.0

yum install -y mysql mysql-devel mysql-server java-1.8.0-openjdk
Installation Instructions for INAM daemon
service mysqld start
export OSU_INAM_INSTALL_PREFIX=/opt/osu-inam

# Setup DB
mysql -uroot
CREATE DATABASE osuinamdb;
CREATE USER 'osuinamuser'@'localhost' IDENTIFIED BY 'osuinampassword';
GRANT ALL PRIVILEGES ON osuinamdb.* TO 'osuinamuser'@'localhost';
FLUSH PRIVILEGES;
exit

# Web server and daemon install steps
rpm -Uvh osu-inam-0.9.4-1.el6.x86_64.rpm

# Start the daemons (all prior steps need to have been run successfully)
service osu-inamd start
service osu-inamweb start

# Make them start at boot time
chkconfig osu-inamd on
chkconfig osu-inamweb on

3.2. RHEL7/CentOS7 packages

3.2.1. The following packages are required to get the OSU INAM tool working

  • mariadb-server (formerly mysql)

  • mariadb-devel

  • java 1.8.0

yum install -y  mariadb-server mariadb-devel java-1.8.0-openjdk
Installation Instructions for INAM daemon
systemctl start mariadb
export OSU_INAM_INSTALL_PREFIX=/opt/osu-inam

# Setup DB
mysql -uroot
CREATE DATABASE osuinamdb;
CREATE USER 'osuinamuser'@'localhost' IDENTIFIED BY 'osuinampassword';
GRANT ALL PRIVILEGES ON osuinamdb.* TO 'osuinamuser'@'localhost';
FLUSH PRIVILEGES;
exit

# Web server and daemon install steps
rpm -Uvh osu-inam-0.9.4-1.el7.x86_64.rpm

# Start the daemons (all prior steps need to have been run successfully)
systemctl start osu-inamd
systemctl start osu-inamweb

# Make them start at boot time
systemctl enable osu-inamd
systemctl enable osu-inamweb

3.3. Sample Configuration Files

These files are provided in the etc folder of the OSU INAM installation, usually /etc/osu-inam/etc/..

Warning
Please note that for osuinamd, the purge process has different configuration parameters than the other components. This design helps support running OSU INAM with low-frequency performance counter intervals.
Warning
Please remember that if you change a runtime parameter in a configuration file, you need to restart the component that reads that file. The following steps are recommended for a proper restart of OSU INAM.
Proper restart for OSU INAM
# delete the database in MySQL if needed
mysql -uroot
drop database osuinamdb;

# first restart osuinamd and then osuinamweb
systemctl restart osu-inamd
systemctl restart osu-inamweb
osu-inam.properties
# Global interval (in seconds) for refreshing information on different pages
osuinam.counterinterval=30
# Max Cluster Size (in number of nodes). For clusters larger than this,
# the leaf nodes will be collapsed by default to improve visual appeal and
# rendering time. Default value: 500
osuinam.clustering_threshold=500
osuinam.clustername=osuinamcluster

# Properties for opensm datasource configuration
osuinam.datasource.url=jdbc:mysql://localhost:3306/osuinamdb
osuinam.datasource.username=osuinamuser
osuinam.datasource.password=osuinampassword
# Control connection pool size
osuinam.datasource.initial-size=20
osuinam.datasource.max-active=50

#log file is rotated once it reaches the size of 10MB
logging.file=/var/log/osu-inam.log
logging.level.edu.osu.inam=WARN

#control server port number, default is 8080
#server.port = 8080

#phantomjs config
phantomjs.execdir=
phantomjs.runjs=
phantomjs.filedir=
phantomjs.cachefile=

#Specify the path to the inamd conf file, set to the default installation path
#osuinam.daemon.conf=/opt/osu-inam/etc/osu-inamd.conf
osu-inam.conf
MV2_TOOL_QPN=X
MV2_TOOL_LID=X
MV2_TOOL_COUNTER_INTERVAL=30
MV2_TOOL_REPORT_CPU_UTIL=0
MV2_TOOL_REPORT_MEM_UTIL=0
MV2_TOOL_REPORT_IO_UTIL=0
MV2_TOOL_REPORT_COMM_GRID=0

Please email us at mvapich-help@cse.ohio-state.edu if you experience any trouble installing the package on your system.

3.4. Upgrading from an older version

Upgrading from older versions involves a subset of steps from the complete installation. INAM v0.9.4 uses an embedded tomcat server and doesn’t expect tomcat server to be installed, unlike the older versions of INAM.

The embedded tomcat server uses the same port number 8080 as the default tomcat installation. It is recommended that the tomcat installation is uninstalled or stopped before installing the new version of INAM. If tomcat cannot be uninstalled, the port number used by INAM can be changed by using the server.port property in the osu-inam.properties file.
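Before starting the new version, it can help to confirm whether something is already listening on the port INAM will use. The sketch below is an illustration only: port_in_use is a hypothetical helper, it probes the loopback address by default, and a successful TCP connection is taken as the signal that the port is occupied.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8080):
        print("Port 8080 is busy; stop the other service or change server.port")
    else:
        print("Port 8080 appears to be free")
```

If the port is busy and the other service cannot be stopped, choose a different value for server.port in osu-inam.properties.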

Upgrade Steps
# Kill the current running inam daemon
pkill osu-inamd
# Stop and uninstall tomcat6
service tomcat6 stop
yum remove tomcat6
# Install the latest rpm, uninstallation of the old rpm may be necessary
rpm -Uvh osu-inam-0.9.4-1.el6.x86_64.rpm
# Start the osu-inam daemon again
service osu-inamd start
osu-inam.properties
# New properties since 0.9.3
#log file is rotated once it reaches size of 10MB
#logging.file=<name of log file>
logging.level.edu.osu.inam=WARN
#change the port number used by the server. 8080 is the default port number
server.port = 8080

4. Basic Usage Instructions

If the installation was successful and the service has been started, you should be able to see the OSU INAM homepage if you point your web browser to http://localhost:8080/ or http://<server_ip>:8080/, depending on where the server was installed. If the server is behind a firewall, look here for some pointers.

4.1. Using the Network View

The Network View provides an overview of the entire network fabric. The network topology is presented as an interactive display that can be moved, dragged or zoomed as required. The nodes are represented by blue circles and switches are represented by red circles. They are labeled by their respective LIDs. The interconnects are colored according to their current load as indicated in the legend.

4.1.1. Network Metrics

The ‘Network Metrics’ drop-down box lists a set of port counters available from the switch. By default, total traffic on the link (Transmitted + Received Bytes) is shown. For the full list of supported counters, refer to port-counters.

4.1.2. Live View

When Live View is selected, the display is refreshed every 30 seconds by default. This frequency can be changed through the runtime variable osuinam.counterinterval in the Java-side configuration file, usually /etc/osu-inam.properties. Please note that you should adjust the page refresh interval based on the intervals on the daemon side. For example, if you have OSU_INAM_PERF_COUNTER_QUERY_INTRVL=10000, then selecting anything less than 10 seconds for osuinam.counterinterval is not recommended.
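The relationship between the two intervals can be checked mechanically. The sketch below is a hedged illustration: parse_key_values and refresh_interval_ok are hypothetical helpers, and it assumes, following the example in the text, that osuinam.counterinterval is in seconds while OSU_INAM_PERF_COUNTER_QUERY_INTRVL is in milliseconds.

```python
def parse_key_values(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def refresh_interval_ok(web_properties: str, daemon_conf: str) -> bool:
    """True if the page refresh interval (seconds) is no shorter than the
    daemon's port-counter query interval (assumed to be milliseconds)."""
    web = parse_key_values(web_properties)
    daemon = parse_key_values(daemon_conf)
    refresh_s = float(web["osuinam.counterinterval"])
    query_ms = float(daemon["OSU_INAM_PERF_COUNTER_QUERY_INTRVL"])
    return refresh_s * 1000 >= query_ms


if __name__ == "__main__":
    web = "osuinam.counterinterval=5"
    daemon = "OSU_INAM_PERF_COUNTER_QUERY_INTRVL=10000"
    if not refresh_interval_ok(web, daemon):
        print("Page refresh is faster than the daemon query interval")
```

In practice the two strings would be read from osu-inam.properties and osu-inamd.conf respectively.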

The view can be updated manually by selecting a node or switch, right-clicking on it, and selecting ‘Update Network’. To get a live view of the switch (red circle) or the node (blue circle), right-click on the appropriate circle and select "Open Node Info". This will open up a new tab / window for the respective element.

4.1.3. Historical View

Looking at the past behavior of a network is often useful while investigating an issue. The Historical View shows the condition of the network from the ‘Start Time’ to the ‘End Time’. The Play/Pause button can be used to start and stop the display. By default, the snapshots are shown in real-time, but playback can be sped up to 2x, 4x, or 8x speed. The display can also be restarted by clicking the Rewind button.

By using the check-boxes under ‘Link Usage’, only the links with a certain range of traffic can be included in the view. For example, idle links can be excluded by unchecking the 0-5% checkbox. For metrics indicating errors, the links with or without that error can be selected.

4.1.5. Node Information

Right-clicking on a node presents a context menu. Selecting ‘Open Node Info’ will show detailed information about that node. If the node is running MVAPICH2-X, aggregate CPU usage and usage by each rank will be available.

4.1.6. Switch Information

Detailed information about a switch can be obtained by right-clicking on a switch followed by ‘Open Switch Info’. Clicking on a port will show the port counter information for that particular port.

4.1.7. Route Information

Multiple nodes can be selected on the display by CTRL+Clicking (CMD+Click for OS X) on them. Once multiple nodes are selected, right-click followed by ‘Find Routes Between Selected Nodes’ will highlight the available routes between them.

Detailed information about the utilization of a link can be viewed by right-clicking on the link followed by ‘Open Link Info’. For jobs using MVAPICH2-X, the data transferred via that link is available. Also, by selecting a job id, process level link utilization is available.

Left-click on the link of choice to select it. Then right-click and select the ‘Find routes going through this link’ option. This will display all the routes between hosts that use the selected link.

4.1.10. Using the Job Level View

OSU INAM can work with the resource manager to show information pertaining to a single job instead of the entire cluster. The Job Level View can be activated by selecting ‘Job Id’ in ‘Filter By’ and entering a job id. In this view, only the nodes and switches participating in that job are displayed. The features present in Network Level View (See network-view) like Live View, Historical View, Link Usage etc. are supported in the Job Level View as well.

In Historical View, the start and end time of the job are automatically populated. For a running job, the end time is populated to the current time. The user can select specific start and end time as well.

4.1.11. Using the Node View

For supported MPI libraries, OSU INAM can display process level CPU and network utilization information. This mode can be selected by choosing ‘Node Id’ and selecting one or more nodes from the list of nodes. If an MPI job is running on that node, OSU INAM will display aggregate or per core CPU usage. The list of MPI ranks is also shown, and each of the ranks can be selected to view their network usage over a period of time.

4.2. Using the Live Jobs Tab

The Live Jobs tab allows the user to see various selectable metrics (defined below) for all jobs using MVAPICH2-X 2.3rc1 on a cluster.

4.2.1. CPU User Usage

This displays the aggregated CPU utilization (percentage) for a specific job.

4.2.2. Virtual Memory Usage

This displays the total virtual memory utilization (in bytes) for a specific job.

4.2.3. Total I/O

This displays the total I/O data read and written (in bytes) for a specific job.

4.2.4. Total Communication

This displays the sum of all inter and intra node communication performed by this job.

4.2.5. Total Intra Node Communication

This displays the number of bytes exchanged by the job between processes running on the same node.

4.2.6. Total Inter Node Communication

This displays the number of bytes exchanged by the job between processes running on different nodes.

4.2.7. Total Collective

This displays the number of bytes exchanged by processes (for the specific job) during collective communication (e.g. MPI_Bcast) only.

4.2.8. RMA Sent

This displays the number of bytes sent for one-sided communication (e.g. MPI_Put) by processes of a specific job.

4.2.9. Total Pt-to-pt

This displays the number of bytes sent and received by point to point operations (e.g. MPI_Send or MPI_Recv) for a specific job.

4.2.10. Inter-node Communication Buffers Allocated

This displays the number of buffers allocated for communication across nodes.

4.2.11. Inter-node Communication Buffers Used

This displays the number of buffers actually used for communication across nodes.

5. Using MVAPICH2-X INAM

5.1. Running Example

Note
Users should be using the appropriate version of the MVAPICH2-X RPM, built with support for advanced features, to use this feature.

In this section, we detail how one should enable MVAPICH2-X to work in conjunction with OSU INAM.

Please note that MVAPICH2-X must be launched with support for on-demand connection management when running in conjunction with OSU INAM. One can achieve this by setting the MV2_ON_DEMAND_THRESHOLD environment variable to a value less than the number of processes in the job.

Tip
Please use the following configure flags for MVAPICH2-X configuration
CFLAGS=-pipe --enable-ucr --enable-mpit-tool --enable-g=all --enable-fast=none

The following command launches test on nodes n0 and n1, two processes per node, with support for sending the process and node level information to the OSU INAM daemon.

MVAPICH2 Running Example
$ mpirun_rsh -rsh -np 4 n0 n0 n1 n1 MV2_ON_DEMAND_THRESHOLD=1 \
    MV2_TOOL_INFO_FILE_PATH=/opt/inam/.mv2-tool-mvapich2.conf \
    MV2_TWO_LEVEL_COMM_THRESHOLD=1 MV2_USE_RDMA_CM=0 ./test
$ cat /opt/inam/.mv2-tool-mvapich2.conf
MV2_TOOL_QPN=473             #UD QPN at which OSU INAM is listening.
MV2_TOOL_LID=208             #LID at which OSU INAM is listening.
MV2_TOOL_COUNTER_INTERVAL=30 #The interval at which MVAPICH2-X should
                             #report node, job and process level information.
MV2_TOOL_REPORT_CPU_UTIL=1   #Specifies whether MVAPICH2-X should report
                             #process level CPU utilization information.
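The launch line above can also be assembled programmatically, which makes it easy to enforce the rule that MV2_ON_DEMAND_THRESHOLD stays below the process count. This is a minimal sketch under stated assumptions: build_launch_command is a hypothetical helper, it uses only the options and variables shown in the example, and it mirrors the example's layout of one host entry per process.

```python
import shlex


def build_launch_command(hosts, procs_per_node, tool_conf, binary,
                         on_demand_threshold=1):
    """Assemble an mpirun_rsh argument list shaped like the running example."""
    nprocs = len(hosts) * procs_per_node
    if not on_demand_threshold < nprocs:
        raise ValueError(
            "MV2_ON_DEMAND_THRESHOLD must be less than the number of processes")
    # Repeat each host once per process, as in "n0 n0 n1 n1".
    host_list = [h for h in hosts for _ in range(procs_per_node)]
    env = [
        f"MV2_ON_DEMAND_THRESHOLD={on_demand_threshold}",
        f"MV2_TOOL_INFO_FILE_PATH={tool_conf}",
        "MV2_TWO_LEVEL_COMM_THRESHOLD=1",
        "MV2_USE_RDMA_CM=0",
    ]
    return ["mpirun_rsh", "-rsh", "-np", str(nprocs), *host_list, *env, binary]


if __name__ == "__main__":
    cmd = build_launch_command(
        ["n0", "n1"], 2, "/opt/inam/.mv2-tool-mvapich2.conf", "./test")
    # shlex.join (Python 3.8+) renders the list as a single shell command line.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command suitable for subprocess.run without shell quoting concerns.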

6. Runtime Parameters for osuinamd

All runtime parameters supported by OSU INAM v0.9.4 are listed below. All these parameters can be set in the configuration file for OSU INAM, usually called "osu-inamd.conf". An example configuration file is provided in INAM_PATH/etc/osu-inamd.conf.example. If you tune any of these values, note that the daemon must be restarted for the change to take effect. All of the parameters listed here are for the daemon only.

6.1. General Parameters

In this section, runtime parameters for fabric and port data (counters and errors) are presented. These parameters are used to adjust query interval, the number of threads for each component, and enabling some features.

6.1.1. OSU_INAM_FABRIC_QUERY_INTRVL

  • Default: 3600 seconds

  • Specifies the interval in seconds at which OSU INAM should query the fabric to identify a change in state for switches, nodes, links, and routes.

6.1.2. OSU_INAM_PERF_COUNTER_QUERY_INTRVL

  • Default: 30 seconds

  • Specifies the interval in milliseconds at which OSU INAM should query the switches to obtain port counter and port error information.

6.1.3. OSU_INAM_ENABLE_HCA_QUERY

  • Default: 0 (disabled)

  • Specifies if port counter and port error data should be fetched from all host connected nodes on the network in addition to the switches on the network. This is not required since the switch port data are already fetched. Enabling it will increase the time taken to gather port data.

6.1.4. OSU_INAM_FABRIC_DISC_NUM_OMP_THREADS

  • Default: 8

  • Specifies the number of OMP threads for performing fabric discovery.

6.1.5. OSU_INAM_NUM_OMP_THREADS_FOR_SWITCHES

  • Default: 8

  • Specifies the number of OMP threads used to gather performance counters across switches. Must be greater than 0.

6.1.6. OSU_INAM_NUM_OMP_THREADS_FOR_SWITCH_PORTS

  • Default: 1

  • Specifies the number of OMP threads used to gather performance counters across the ports of a switch. It cannot be zero.

6.1.7. OSU_INAM_USE_OMP_THREADS_FOR_SWITCHES

  • Default: 1

  • Enables OMP threading for the switches. This should be set even if you only want to gather port information in parallel within each switch and not across switches.

6.1.8. OSU_INAM_USE_OMP_THREADS_FOR_SWITCH_PORT

  • Default: 0

  • Enable OMP threading for the ports of a switch.

6.1.9. OSU_INAM_ENABLE_PARALLEL_PERF_COUNTER_DATA_WRITE

  • Default: 0

  • Enable concurrent writes for performance counters info into the database.

6.1.10. OSU_INAM_ENABLE_ROUTE_DISCOVERY

  • Default: 1 (enabled)

  • Specifies if HCA nodes should be scanned for route information.

6.2. MVAPICH2-X Specific Parameters

Runtime parameters for osuinamd related to running MVAPICH2-X are presented in this section, including the interval for querying process information and the metrics that MVAPICH2-X should report.

6.2.1. OSU_INAM_PROC_COUNTER_QUERY_INTRVL

  • Default: 30 seconds

  • Specifies the interval at which MVAPICH2-X should report node, job and process level information.

6.2.2. OSU_INAM_TOOL_REPORT_CPU_UTIL

  • Default: 1

  • Specifies whether MVAPICH2-X should report process level CPU utilization information.

6.2.3. OSU_INAM_TOOL_REPORT_MEM_UTIL

  • Default: 1

  • Specifies whether MVAPICH2-X should report process-level memory utilization information.

6.2.4. OSU_INAM_TOOL_REPORT_IO_UTIL

  • Default: 1

  • Specifies whether MVAPICH2-X should report process level IO information.

6.2.5. OSU_INAM_TOOL_REPORT_COMM_GRID

  • Default: 1

  • Specifies whether MVAPICH2-X should report process communication grid information.

6.2.6. OSU_INAM_JOB_COMPLETION_TIMEOUT

  • Default: 100 seconds

  • Specifies the time (in seconds) after which a job is marked as complete if no update from MVAPICH2-X is received for that job.

6.3. OSU INAM Database Configuration Parameters

This section presents the runtime parameters related to MySQL database setting for osuinamd. Some parameters must be set by the user to have OSU INAM working.

6.3.1. OSU_INAM_DATABASE_HOST

  • Default: Unset (Must be set by user)

  • Specifies the name of the host where the MySQL database daemon is running.

6.3.2. OSU_INAM_DATABASE_PORT

  • Default: Unset (Must be set by the user)

  • Specifies the port on OSU_INAM_DATABASE_HOST at which the MySQL database daemon is listening for incoming connections.

6.3.3. OSU_INAM_DATABASE_NAME

  • Default: Unset (Must be set by the user)

  • Specifies the name of MySQL database OSU INAM should use to store data.

6.3.4. OSU_INAM_DATABASE_USER

  • Default: Unset (Must be set by the user)

  • Specifies the name of user who has privileges to enter data into the MySQL database with name OSU_INAM_DATABASE_NAME.

6.3.5. OSU_INAM_DATABASE_PASSWD

  • Default: Unset (Must be set by the user)

  • Specifies the password associated with user id OSU_INAM_DATABASE_USER.

6.3.6. OSU_INAM_DATA_RETENTION_PERIOD

  • Default: 7 days

  • Specifies the duration in days the profiling data should be stored in the MySQL database. Any data older than that will be purged.

6.3.7. OSU_INAM_PURGE_QUERY_INTERVAL

  • Default: 3600 seconds

  • Specifies the interval between two purge queries used to delete profiling information from the database.

6.3.8. OSU_INAM_DATABASE_BULK_ACTIVE

  • Default: 1 (enable)

  • If enabled, the insertion queries will insert data in a bulk manner to MySQL.

6.3.9. OSU_INAM_DATABASE_BULK_SIZE

  • Default: 100

  • Specifies the number of records inserted in a bulk insert.
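To make the effect of the bulk size concrete, the sketch below groups records into batches before writing them. It illustrates the batching idea only and is not INAM's actual SQL: Python's built-in sqlite3 stands in for MySQL, and the port_counters table, its columns, and the helper names are hypothetical.

```python
import sqlite3


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def bulk_insert(conn, rows, bulk_size=100):
    """Insert rows in batches of `bulk_size`; return the number of batches."""
    batches = 0
    for batch in chunked(rows, bulk_size):
        conn.executemany(
            "INSERT INTO port_counters (lid, xmit_bytes) VALUES (?, ?)", batch)
        conn.commit()  # one commit per batch rather than one per record
        batches += 1
    return batches


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE port_counters (lid INTEGER, xmit_bytes INTEGER)")
    rows = [(lid, lid * 1024) for lid in range(250)]
    print(bulk_insert(conn, rows, bulk_size=100), "batches written")
```

A larger bulk size means fewer round trips to the database at the cost of larger individual statements.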

6.3.10. OSU_INAM_DB_RECONNECT

  • Default: 1 (enabled)

  • Enables reconnection for MySQL. If the connection to the server is lost, OSU INAM automatically tries to reconnect three times.

6.3.11. OSU_INAM_DB_READ_TIMEOUT

  • Default: 30 seconds

  • Specifies the number of seconds to wait for more data from a MySQL connection before aborting the read.

6.3.12. OSU_INAM_DB_CONNECT_TIMEOUT

  • Default: 10 seconds

  • Specifies the number of seconds that the MySQL server waits for a connect packet before ending the connection due to bad handshake.

6.3.13. OSU_INAM_DB_WRITE_TIMEOUT

  • Default: 60 seconds

  • Specifies the number of seconds to wait for a block to be written to a connection before aborting the write.

6.3.14. OSU_INAM_DB_WAIT_TIMEOUT

  • Default: 3 times the interval of each component. For example: if fabric discovery interval is every 8 hours then wait_timeout for fabric connection will be set to 24 hours.

  • The number of seconds MySQL database server waits for an activity on a connection before closing it.
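The default rule is simple arithmetic, sketched here with a hypothetical helper name. It assumes intervals are expressed in seconds, as they are for OSU_INAM_FABRIC_QUERY_INTRVL.

```python
def default_wait_timeout(component_interval_s: int) -> int:
    """Default wait timeout: three times the component's query interval."""
    return 3 * component_interval_s


if __name__ == "__main__":
    eight_hours = 8 * 3600
    # Fabric discovery every 8 hours gives a 24 hour wait timeout.
    print(default_wait_timeout(eight_hours) // 3600, "hours")
```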

6.3.15. OSU_INAM_BULK_PURGE_SIZE

  • Default: 100000

  • Specifies the number of rows deleted per batch by the purge procedure.

6.3.16. OSU_INAM_DB_PURGE_WAIT_TIMEOUT

  • Default: 3 * OSU_INAM_PURGE_QUERY_INTERVAL

  • Specifies the number of seconds the MySQL database server waits for activity on the purge connection before closing it.

6.3.17. OSU_INAM_DB_PURGE_READ_TIMEOUT

  • Default: 28800 seconds (8 hours)

  • Specifies the number of seconds to wait for more data from purge connection before aborting the read.

6.3.18. OSU_INAM_DB_PURGE_WRITE_TIMEOUT

  • Default: 28800 seconds (8 hours)

  • Specifies the number of seconds to wait for a block to be written to the purge connection before aborting the write.

6.3.19. OSU_INAM_DELETE_INTERVAL

  • Default: 1 second

  • Specifies the time between successive batch deletes in the purge function.
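The purge parameters in this section interact: rows past the retention period are removed in batches, with a pause between batches. The sketch below shows one way such a loop can look. It is an assumption-laden illustration rather than INAM's implementation: sqlite3 stands in for MySQL, the port_counters table and ts column are hypothetical, and the arguments batch_size and delete_interval_s merely play the roles of OSU_INAM_BULK_PURGE_SIZE and OSU_INAM_DELETE_INTERVAL.

```python
import sqlite3
import time


def purge_old_rows(conn, cutoff_ts, batch_size=100000, delete_interval_s=1.0):
    """Delete rows with ts < cutoff_ts in batches; return total rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM port_counters WHERE rowid IN ("
            "SELECT rowid FROM port_counters WHERE ts < ? LIMIT ?)",
            (cutoff_ts, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing older than the cutoff remains
        total += cur.rowcount
        # Pause so other database clients are not starved by the purge.
        time.sleep(delete_interval_s)
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE port_counters (ts INTEGER)")
    conn.executemany("INSERT INTO port_counters (ts) VALUES (?)",
                     [(t,) for t in range(10)])
    deleted = purge_old_rows(conn, cutoff_ts=7, batch_size=3, delete_interval_s=0)
    print(deleted, "rows purged")
```

With a retention period expressed in days, the cutoff would be the current time minus that many days, in whatever units the timestamp column uses.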

6.4. OSU INAM Job Scheduler Configuration Parameters

This section presents the runtime parameters related to job scheduler for osuinamd including the query interval of job scheduler and timeout for jobs.

6.4.1. OSU_INAM_ENABLE_SLURM

  • Default: 1

  • Specifies if SLURM should be used to get live jobs information. The sacct command is run on the system where the inamd daemon is running to get the jobs information.

6.4.2. OSU_INAM_SLURM_QUERY_INTERVAL

  • Default: 30 seconds

  • Specifies how often the jobs information must be pulled in from SLURM.

6.4.3. OSU_INAM_SQUEUE_CMD_PATH

  • No Default

  • Specifies the path to the directory that contains squeue command.

6.5. Debug Parameters

This section lists the debugging parameters and the verbosity levels of each one. Choosing the right debugging level for each component helps narrow down a problem.

6.5.1. INAM_DEBUG_INIT_VERBOSE

  • Default: 1

  • Prints the given arguments for OSU INAM and the status of each thread.

        Level 0
                - All debugging is disabled
        Level 1
                - General launch information
                - Creation and exiting of Pthreads for components

                    - Examples of debugging information at this level
                        - creation of Fabric, Performance Counter, Slurm, and Network
                          threads
                        - prints the argument passed to osuinamd

6.5.2. INAM_DEBUG_SM_VERBOSE

  • Default: 0

  • Verbosity level for tracking state machines operations.

        Level 0

            - All debugging is disabled

        Level 1

            - Displays the transition of the state machine for fabric discovery,
              performance counter and MPI_T network threads

6.5.3. INAM_DEBUG_DB_VERBOSE

  • Default: 1

  • Verbosity level for the database operations.

        Level 0
            - All debugging is disabled

        Level 1
            - Verification of creating MySQL connections for different components
            - Debugging information related to interactions with SLURM
            - Debugging information related to purging database and database related
              faults


        Level 2
            - Basic debugging information related to creating and altering database
              tables
            - Basic debugging information related to MySQL connections

        Level 3
            - Advanced debugging information related to MySQL connections
            - Advanced debugging information related to MySQL insertions and deletions

6.5.4. INAM_DEBUG_NW_VERBOSE

  • Default: 1

  • Verbosity level for the network operations.

        Level 0

            - All debugging is disabled

        Level 1

            - Displays basic debugging information
                * Device information
                * Details about `osu-inam.conf` file
                * Details about querying and gathering process information from MPI
                  processes

             Examples of debugging information at this level
                - Allocated QPN and LID for listening to MPI info
                - Insertion of Process info into the database
                - The path to `osu-inam.conf` file and the content
                - Start and end of network initialization

        Level 2

            - Acknowledgments for receiving info from MPI jobs

        Level 3

            - Rank information for MPI jobs

6.5.5. INAM_DEBUG_FB_VERBOSE

  • Default: 1

  • Verbosity level for the fabric discovery and performance counters sweep operations.

        Level 0

            - All debugging is disabled

        Level 1

            - Displays basic debugging information for the following features
                * Querying and gathering InfiniBand fabric data
                * Querying and gathering InfiniBand performance counter data
                * High-level details about multi-threading

                Examples of debugging information at this level
                - Starting and finalizing various threads
                - Verification of information gathered from InfiniBand fabric
                - Number of OpenMP threads used for Fabric
                - Number of Nodes detected by INAM

        Level 2

            - Advanced debugging for querying and gathering InfiniBand fabric data
              and InfiniBand performance counter data

            - Details about querying and gathering performance counters for InfiniBand
              switches in the network

             Examples of debugging information at this level
                - Name and number of ports for each switch

        Level 3

            - Debugging information about optimized OpenMP-based multi-threading design

             Examples of debugging information at this level
                - Detailed information for links and node for fabric and performance
                  counter threads
                - Information related to gathering route data

        Level 4

            - Details about route discovery for Fabric thread
            - Reports encountered errors while discovering Fabric

             Examples of debugging information at this level
                - Information about GUID and endpoint nodes of the routes
                - Information about bad forwarding tables

7. Runtime Parameters for osuinamweb (website)

The most commonly used parameters supported by OSU INAM v0.9.4 for osuinamweb are listed below. osuinamweb uses Apache Tomcat. You can set all of the parameters listed in this section in the configuration file located at /etc/osu-inam.properties.

Note
If you choose to tune any of these values, you should restart osuinamweb for the change to take effect.

7.1. General Parameters

Below, general runtime parameters for the website are listed, including the graphs' update time and the server port.

7.1.1. osuinam.counterinterval

  • Default: 30 seconds

  • The interval at which the website graphs are refreshed.

7.1.2. osuinam.clustering_threshold

  • Default: 500

  • Max cluster size (in number of nodes). For clusters larger than this size, the leaf nodes will be collapsed by default to improve visual appeal and rendering time.

7.1.3. osuinam.clustername

  • Default: Unset

  • The cluster name for osuinamweb

7.1.4. server.port

  • Default: 8080

  • Controls server port number.

7.1.5. osuinam.daemon.conf

  • Default: Unset

  • Location of the osuinamd configuration file, so that it can be viewed in the Debug tab.

7.2. Data Source Parameters

The runtime parameters for configuring the data source (MySQL here) through the Spring Boot framework are presented here. Some values, such as the data source login information and the database name, are common to osuinamd and osuinamweb. Please make sure they match the values set in the osuinamd configuration file and the values used when creating the database in MySQL.

Tip
Data source configuration is controlled by configuration properties of the form osuinam.datasource.*.

7.2.1. osuinam.datasource.url

  • Default: Unset (must be set)

  • The JDBC URL of the MySQL database that osuinamweb connects to.

7.2.2. osuinam.datasource.username

  • Default: Unset (must be set)

  • The login username for the database.

7.2.3. osuinam.datasource.password

  • Default: Unset (must be set)

  • The login password for the database.

7.2.4. osuinam.datasource.driver-class-name

  • Default: Unset

  • Fully qualified name of the JDBC driver. Auto-detected based on the URL by default.
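
As a quick check before restarting osuinamweb, the sketch below reports which of the three mandatory data source keys are missing or empty in a properties file. It is only an illustration and not part of OSU INAM: the function names are hypothetical, the parser handles only simple key=value lines and # comments, and the JDBC URL in the sample is a placeholder.

```python
# Hypothetical helper, not shipped with OSU INAM.
REQUIRED_KEYS = (
    "osuinam.datasource.url",
    "osuinam.datasource.username",
    "osuinam.datasource.password",
)


def parse_properties(text):
    """Parse simple key=value lines; skip blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_datasource_keys(text):
    """Return the required data source keys that are absent or empty."""
    props = parse_properties(text)
    return [key for key in REQUIRED_KEYS if not props.get(key)]


# Placeholder values for illustration only.
sample = """
osuinam.datasource.url=jdbc:mysql://localhost:3306/inamdb
osuinam.datasource.username=
"""
print(missing_datasource_keys(sample))
# prints ['osuinam.datasource.username', 'osuinam.datasource.password']
```

In practice the text would be read from /etc/osu-inam.properties instead of the sample string.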

7.3. Apache Tomcat Parameters

Runtime parameters related to the Apache Tomcat server are listed below.

Tip
osuinamweb uses Apache Tomcat. The Tomcat connection pool documentation lists the parameters and features that users can customize for additional settings.

Note
To set additional Tomcat connection pool settings, use the osuinam.datasource prefix. A few examples are given below.

7.3.1. osuinam.datasource.initial-size

  • Default: 10

  • The initial number of connections that are created when the Tomcat connection pool is started.

7.3.2. osuinam.datasource.max-active

  • Default: 100

  • The maximum number of active connections that can be allocated from this pool at the same time.

7.3.3. osuinam.datasource.remove-Abandoned

  • Default: false

  • Flag to remove abandoned connections if they exceed the removeAbandonedTimeout.

7.3.4. osuinam.datasource.removeAbandonedTimeout

  • Default: 60 seconds

  • Timeout in seconds before an abandoned (in use) connection can be removed.

7.4. Logging Parameters

Logging parameters for osuinamweb are listed below.

7.4.1. logging.file

  • Default: system messages.

  • The name of the log file for osuinamweb.

7.4.2. logging.path

  • The directory to which log files are written.

7.4.3. logging.level.edu.osu.inam

  • Default: INFO

  • The logging level for osuinamweb.

7.5. PhantomJS Parameters

As mentioned in Advanced Usage Instructions, osuinamweb uses PhantomJS to accelerate rendering of the network. The runtime parameters for PhantomJS are presented here.

Note
All PhantomJS configuration variables are unset by default.

7.5.1. phantomjs.execdir

  • execdir is the path where you placed the phantomjs binary.

7.5.2. phantomjs.runjs

  • runjs should be the explicit path to the inam.js that is provided in the lib folder of the OSU INAM installation, usually /opt/osu-inam/lib/inam.js.

7.5.3. phantomjs.filedir

  • filedir is the location of the phantomjs output for the pre-rendering. It’s better to make sure that this directory exists.

7.5.4. phantomjs.cachefile

  • cachefile is the location of the file used to cache the final phantomjs output. On the next restart, the web application will use the cached data and skip the rendering.

8. List of Supported Network Metrics

The Network Metrics supported by OSU INAM v0.9.4 are listed below. These metrics can be broadly divided into three sets. The descriptions for InfiniBand port and error counters have been obtained from the InfiniBand Specification Release 1.2.1 by the InfiniBand Trade Association.

8.1. Switch Counters

The following node-level counters are queried from the InfiniBand Switches:

  • Xmit Data

    • The total number of data octets, divided by 4, transmitted on all VLs from the port. This includes all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. Excludes all link packets.

  • Rcv Data

    • The total number of data octets, divided by 4, received on all VLs at the port. This includes all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. Excludes all link packets.

  • Max [Xmit Data/Rcv Data]

    • Maximum of the two values above
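
Because Xmit Data and Rcv Data are reported in units of four octets, a difference between two counter samples has to be multiplied by 4 to obtain bytes. The sketch below shows that conversion for a pair of samples. The function name is hypothetical, and the code assumes the counter did not wrap or reset between the two samples.

```python
def counter_delta_to_bytes_per_sec(previous, current, interval_seconds):
    """Convert two samples of an Xmit Data or Rcv Data counter to bytes/s.

    Each counter unit stands for 4 octets (bytes). The function assumes the
    counter did not wrap or reset between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if current < previous:
        raise ValueError("counter went backwards (wrap or reset)")
    return (current - previous) * 4 / interval_seconds


# Two samples taken 30 seconds apart: 750,000 units = 3,000,000 bytes.
print(counter_delta_to_bytes_per_sec(1_000_000, 1_750_000, 30))
# prints 100000.0
```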

8.2. Process Level Counters

MVAPICH2-X collects additional information about the process’s network usage which can be displayed by OSU INAM. The following counters are currently supported:

  • Xmit Data

    • Total number of bytes transmitted as part of the MPI application

  • Rcv Data

    • Total number of bytes received as part of the MPI application

  • Max [Xmit Data/Rcv Data]

    • Maximum of the two values above

  • Point to Point Send

    • Total number of bytes transmitted as part of MPI point-to-point operations

  • Point to Point Rcvd

    • Total number of bytes received as part of MPI point-to-point operations

  • Max [Point to Point Sent/Rcvd]

    • Maximum of the two values above

  • Coll Bytes Sent

    • The total number of bytes transmitted as part of MPI collective operations

  • Coll Bytes Rcvd

    • The total number of bytes received as part of MPI collective operations

  • Max [Coll Bytes Sent/Rcvd]

    • Maximum of the two values above

  • RMA Bytes Sent

    • Total number of bytes transmitted as part of MPI RMA operations. Note that due to the nature of the RMA operations, bytes received for RMA operations cannot be counted

  • RC VBUF

    • The number of internal communication buffers used for reliable connection (RC)

  • UD VBUF

    • The number of internal communication buffers used for unreliable datagram (UD)

  • VM Size

    • Total number of bytes used by the program for its virtual memory

  • VM Peak

    • Maximum number of virtual memory bytes for the program

  • VM RSS

    • The number of bytes resident in the memory (Resident set size)

  • VM HWM

    • The maximum number of bytes that have been resident in memory (Peak resident set size or High watermark)

8.3. Error Counters

The following error counters are available both at switch and process level:

  • SymbolErrors

    • The total number of minor link errors detected on one or more physical lanes

  • LinkRecovers

    • The total number of times the Port Training state machine has successfully completed the link error recovery process

  • LinkDowned

    • The total number of times the Port Training state machine has failed the link error recovery process and downed the link

  • RcvErrors

    • The total number of packets containing an error that were received on the port. These errors include:

      • Local physical errors

      • Malformed data packet errors

      • Malformed link packet errors

      • Packets discarded due to buffer overrun

  • RcvRemotePhysErrors

    • The total number of packets marked with the EBP delimiter received on the port.

  • RcvSwitchRelayErrors

    • The total number of packets received on the port that were discarded because they could not be forwarded by the switch relay

  • XmtDiscards

    • The total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include:

      • The output port is not in the active state

      • Packet length exceeded NeighborMTU

      • Switch Lifetime Limit exceeded

      • Switch HOQ Lifetime Limit exceeded

    • This may also include packets discarded while in VLStalled State.

  • XmtConstraintErrors

    • The total number of packets not transmitted from the switch physical port for the following reasons:

      • FilterRawOutbound is true and the packet is raw

      • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

  • RcvConstraintErrors

    • The total number of packets not received from the switch physical port for the following reasons:

      • FilterRawInbound is true and the packet is raw

      • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

  • LinkIntegrityErrors

    • The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors

  • ExcBufOverrunErrors

    • The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error

  • VL15Dropped

    • The number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port
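
Error counters are cumulative, so the value that usually matters is how much a counter grew between two samples. The sketch below compares two samples and returns only the counters that increased. How the samples are obtained is out of scope here: the dictionaries and the function name are hypothetical, and only the counter names come from the list above.

```python
def increased_error_counters(previous, current):
    """Return {counter name: increase} for counters that grew between samples.

    A counter that is missing from the previous sample is treated as 0.
    """
    increases = {}
    for name, value in current.items():
        delta = value - previous.get(name, 0)
        if delta > 0:
            increases[name] = delta
    return increases


previous = {"SymbolErrors": 10, "LinkDowned": 1, "XmtDiscards": 0}
current = {"SymbolErrors": 14, "LinkDowned": 1, "XmtDiscards": 3}
print(increased_error_counters(previous, current))
# prints {'SymbolErrors': 4, 'XmtDiscards': 3}
```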

9. Advanced Usage Instructions

9.1. Making OSU INAM visible outside of a firewalled environment

The following snippets should work in basic scenarios where the OSU INAM server is sitting behind a firewalled or NAT’d environment. Please exercise caution as this could expose the server to larger, less secure networks or otherwise upset your network administrators.

Iptables
-A PREROUTING -p tcp -d <external ip> --dport 8080 -j DNAT --to <tomcat server>:8080
-A POSTROUTING -p tcp -d <tomcat server> --dport 8080 -j SNAT --to-source <external ip>
Apache
ProxyPass /inam/ http://<tomcat server>:8080/
ProxyPassReverse /inam/ http://<tomcat server>:8080/
Nginx
server {
    listen 8080 default_server;
    server_name X;

    location /inam {
        rewrite ^/inam(.*)$ $1 break;
        proxy_pass http://<tomcat server>:8080;
    }
}

9.2. Speed up the network map rendering by using PhantomJS

PhantomJS is a headless WebKit that allows OSU INAM to pre-render the network graph so it loads much quicker. The modifications necessary to do this are minimal.

9.2.1. Required Packages

Add the following parameters to your osu-inam.properties file

/etc/osu-inam.properties
#phantomjs
#execdir is the path you placed the phantomjs bin
phantomjs.execdir=/path/to/phantomjs/bin/
#runjs should be the explicit path to the inam.js that is provided in the root of the download tarball
phantomjs.runjs=/path/to/inam.js
#filedir is the location of the phantomjs output for the pre-rendering
phantomjs.filedir=/path/to/phantomjs/working/dir
#cachefile is the location of the file to cache the final phantomjs
#output. On the next restart, the web application would use the cached data and not
#perform the rendering
phantomjs.cachefile=/path/to/cachefile
Note
Be sure to make the PhantomJS binary executable, the runjs file readable, and the filedir writeable by your web server. Place the vis.js from the root of the tarball in the same directory as inam.js.

After the positions are calculated by PhantomJS, the cachefile will be generated by the web application.

Once finished, restart the webserver to pull in the new settings and on the next visit to /network/, the view should be rendered nearly instantly.

PhantomJS renders the network graph during the web application’s deployment, which might increase the deployment load time. However, this is a ONE TIME COST. For subsequent deployments, the web application loads the network information from the cache file. The time PhantomJS takes to render the network for the first time depends on the complexity of the network and the number of nodes.

Projected ONE TIME Web Application Deployment Time with PhantomJS

These estimates are based on testing with PhantomJS 2.0.0 on a dual socket Intel E5630 with 12GB of memory.

Number of Nodes   Number of Switches   Network Topology   Approximate Time
178               20                   Full Fat-Tree      1 min
1879              212                  Hybrid Fat-Tree    30 mins

10. Best Practices with OSU INAM

10.1. Deployment Recommendations

Due to the multithreaded design of the OSU INAM daemon, for large clusters consisting of thousands of nodes and hundreds of switches, we recommend dedicating 4 cores on a node to the daemon and one core to the database daemon processes. For smaller clusters consisting of fewer than 500 nodes, the daemon can be run on a non-dedicated node (a head node/login node for instance).

Based on our experience and feedback we have received from our users, here we include some of the best practices for deploying OSU INAM. If you have any of your own best practices related to OSU INAM, please feel free to contact us by sending an email to mvapich-help@cse.ohio-state.edu

10.2. MySQL Tuning Parameters

For the database, the following parameters can be tuned for better performance at different cluster sizes:

MySQL Tuning Parameter           Significance
innodb_flush_log_at_trx_commit   Controls the balance between strict ACID compliance for commit operations and higher performance
innodb_buffer_pool_size          The size in bytes of the buffer pool, the memory area where InnoDB caches table and index data
innodb_log_buffer_size           The size in bytes of the buffer that InnoDB uses to write to the log files on disk
innodb_log_file_size             The size in bytes of each log file in a log group

10.2.1. Additional Steps Required Before Changing Number or Size of InnoDB Redo Log Files

  • Set innodb_fast_shutdown to 1

mysql> SET GLOBAL innodb_fast_shutdown = 1;
  • Stop the MySQL server and ensure it shuts down without errors

  • Back up the old log files if you want to be able to restore the previous state

  • Delete old log files

  • Edit my.cnf file and add the lines listed below depending on your cluster size

  • Start MySQL server

10.2.2. Proposed Additions to OSU INAM and MySQL Configuration File for Clusters of Different Sizes

We list some recommended values to be set in my.cnf file for clusters of different sizes.

Additions to my.cnf file for small clusters (<100 nodes)
innodb_flush_log_at_trx_commit=2
Additions to my.cnf file for medium sized clusters (100-500 nodes)
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_size=4G
innodb_log_buffer_size=16M
innodb_log_file_size=256M
Additions to my.cnf file for large clusters (>500 nodes)
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_size=16G
innodb_log_buffer_size=32M
innodb_log_file_size=512M
Additions to osu-inamd.conf file for all cluster sizes
#The number of records to be inserted together during bulk insert.
OSU_INAM_DATABASE_BULK_SIZE=100
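
The cluster size thresholds above can be written as a small lookup. The sketch below returns the my.cnf additions recommended in this section for a given node count. The function name is hypothetical, and the code only restates the values listed above. It does not edit any file.

```python
def recommended_mycnf_additions(num_nodes):
    """Return the my.cnf lines recommended above for a cluster of num_nodes."""
    additions = ["innodb_flush_log_at_trx_commit=2"]
    if num_nodes < 100:
        # Small clusters (<100 nodes): only the flush setting is recommended.
        return additions
    if num_nodes <= 500:
        # Medium sized clusters (100-500 nodes).
        return additions + [
            "innodb_buffer_pool_size=4G",
            "innodb_log_buffer_size=16M",
            "innodb_log_file_size=256M",
        ]
    # Large clusters (>500 nodes).
    return additions + [
        "innodb_buffer_pool_size=16G",
        "innodb_log_buffer_size=32M",
        "innodb_log_file_size=512M",
    ]


for line in recommended_mycnf_additions(1879):
    print(line)
```

Remember that changing innodb_log_file_size requires the additional steps described in the previous subsection.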

11. FAQ and Troubleshooting with OSU INAM

Based on our experience and feedback we have received from our users, here we include some of the problems a user may experience and the steps to resolve them. If you are experiencing any other problem, please feel free to contact us by sending an email to mvapich-help@cse.ohio-state.edu

11.1. General Questions and Troubleshooting

11.1.1. Install OSU INAM to a specific location

OSU INAM RPMs are relocatable. Please use the --prefix option during RPM installation to install OSU INAM into a specific location. An example is shown below:

rpm -ivh --prefix <specific-location or $OSU_INAM_INSTALL_PREFIX> osu-inam-0.9.4.el6.x86_64.rpm

11.1.2. Where can I find the log messages generated by OSU INAM?

OSU INAM writes all the log messages it generates to ‘/var/log/messages’.

11.1.3. Why is the web server taking a long time to load?

OSU INAM uses PhantomJS for caching the rendered network graph with the aim of speeding up subsequent deployments. This caching happens when the web application is deployed for the first time. Please refer to Speed up the network map rendering by using PhantomJS for more details.

11.1.4. I have installed PhantomJS, but my webpage is still rendering very slowly

Here we list some possible reasons why the webpage rendering can take more time than expected even though PhantomJS has been installed correctly.

  • Incorrect permissions to the directories

    • The user running the web app should be able to write to and read from the directory pointed to by phantomjs.filedir

  • Using incorrect inam.js file

    • The phantomjs.runjs variable in /etc/osu-inam.properties file should point to the inam.js file included in the tarball

  • vis.js and inam.js not present in the same directory

    • The vis.js file and the inam.js file should be in the same directory

Please refer to the Speed up the network map rendering by using PhantomJS section for more details on how to correctly setup PhantomJS for use with OSU INAM.
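
The checks listed above can be automated with a short script. The sketch below is a hypothetical helper, not part of OSU INAM. It assumes the binary inside phantomjs.execdir is named phantomjs, and it must be run as the same user that runs the web application for the permission checks to be meaningful.

```python
import os


def phantomjs_setup_problems(execdir, runjs, filedir, binary_name="phantomjs"):
    """Return a list of human-readable problems; an empty list means none found.

    binary_name is an assumption about the file name inside execdir.
    """
    problems = []
    binary = os.path.join(execdir, binary_name)
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        problems.append("PhantomJS binary is missing or not executable: " + binary)
    if not (os.path.isfile(runjs) and os.access(runjs, os.R_OK)):
        problems.append("runjs file is missing or not readable: " + runjs)
    # vis.js is expected next to inam.js.
    vis_js = os.path.join(os.path.dirname(runjs), "vis.js")
    if not os.path.isfile(vis_js):
        problems.append("vis.js is not in the same directory as inam.js: " + vis_js)
    if not (os.path.isdir(filedir) and os.access(filedir, os.R_OK | os.W_OK)):
        problems.append("filedir is missing or not readable and writable: " + filedir)
    return problems


# Placeholder paths; replace them with the values from /etc/osu-inam.properties.
for problem in phantomjs_setup_problems(
    "/path/to/phantomjs/bin/", "/path/to/inam.js", "/path/to/phantomjs/working/dir"
):
    print(problem)
```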

11.1.5. Does OSU INAM support any other job scheduler besides SLURM?

At present, OSU INAM only supports SLURM. We have plans to bring in support for other job launchers like PBS/Torque in the future.

11.1.6. Will OSU INAM work without a supported scheduler?

OSU INAM has been designed so that features that do not depend on the job scheduler (e.g., viewing the network counters) will work even without a supported job scheduler.

11.1.7. In what order should OSU INAM and related services be started and stopped?

Please use the following order while starting OSU INAM and related services

  • Create the database

  • Start up the OSU INAM daemon

  • Once the nodes and links tables are populated by the OSU INAM daemon, deploy the web application

Please use the following order while stopping OSU INAM and related services

  • Stop the web application

  • Stop the OSU INAM daemon

  • Destroy the database

    • This step is only required if you do not want to use OSU INAM again; otherwise, you can skip it.

11.1.8. How can I control the size of the database?

OSU INAM can automatically purge data older than a user-defined period from the database. The parameter OSU_INAM_DATA_RETENTION_PERIOD controls this period and can be set to any desired value. By default, it is set to seven days. You can reduce it to a lower value, such as one day.

There is another parameter OSU_INAM_PURGE_QUERY_INTERVAL that tells the daemon how frequently it should check for older data. The default value for this is 3600 seconds. You can modify this as well.

Once you’ve changed the value, please restart the daemon so that it takes effect.
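
The two parameters interact: because the daemon looks for old data only once per purge interval, a record may stay in the database somewhat longer than the retention period. The sketch below works through that arithmetic with the default values. The function names are hypothetical, and the model (a purge exactly once per interval that removes everything older than the cutoff) is an assumption based on the description above, not on the daemon's source.

```python
from datetime import datetime, timedelta


def purge_cutoff(now, retention):
    """Records with a timestamp older than this value are eligible for purging."""
    return now - retention


def worst_case_record_age(retention, purge_interval_seconds):
    """Upper bound on a record's age before a purge removes it.

    Assumes purges run exactly once per interval.
    """
    return retention + timedelta(seconds=purge_interval_seconds)


# Defaults from this section: seven days retention, 3600 second purge interval.
print(purge_cutoff(datetime(2024, 1, 8, 12, 0, 0), timedelta(days=7)))
# prints 2024-01-01 12:00:00
print(worst_case_record_age(timedelta(days=7), 3600))
# prints 7 days, 1:00:00
```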

11.1.9. Why does INAMD not exit after being shut down?

INAMD checks for exit signals at fixed intervals specified by OSU_INAM_PERF_COUNTER_QUERY_INTERVAL (default value: 30 seconds). Thus, a shutdown command may not take effect immediately.

11.1.10. I have errors on different pages with MySQL incompatibility with sql_mode=only_full_group_by.

If you are getting error messages saying "Expression #N of SELECT list is not in GROUP BY … this is incompatible with sql_mode=only_full_group_by" on any of the web pages, it means that the MySQL server is configured to reject GROUP BY queries that have non-aggregated columns in the select list. OSU INAM requires the MySQL server to be configured without this mode. More information about changing the mode can be found in the MySQL 5.7 documentation on server SQL modes.
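
One way to build the new mode value is to take the current setting (for example, the result of SELECT @@GLOBAL.sql_mode;) and drop ONLY_FULL_GROUP_BY from it while keeping the other modes. The sketch below does this on the mode string and prints a matching SET GLOBAL statement. The function names are hypothetical and the sample mode string is only an example. Note that SET GLOBAL does not survive a server restart, so the same value would also need to be set as sql_mode in my.cnf.

```python
def sql_mode_without_full_group_by(current_mode):
    """Drop ONLY_FULL_GROUP_BY from a comma-separated sql_mode string."""
    modes = [mode.strip() for mode in current_mode.split(",") if mode.strip()]
    kept = [mode for mode in modes if mode.upper() != "ONLY_FULL_GROUP_BY"]
    return ",".join(kept)


def set_global_sql_mode_statement(current_mode):
    """Build a SET GLOBAL statement that keeps every mode except ONLY_FULL_GROUP_BY."""
    return "SET GLOBAL sql_mode = '{}';".format(
        sql_mode_without_full_group_by(current_mode)
    )


# Example mode string; use the value reported by your own server.
print(set_global_sql_mode_statement(
    "ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE"
))
# prints SET GLOBAL sql_mode = 'STRICT_TRANS_TABLES,NO_ZERO_IN_DATE';
```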