InfiniScope™

InfiniBand connectivity and performance analysis tool

Microway, Inc.

Version 3.2.4

Copyright 2006-2009

Prerequisites

You must have a Linux-based cluster. You must have InfiniBand HCAs on your nodes, and one or more properly-functioning InfiniBand switches connecting the HCAs.

You must have the OFED (OpenFabrics Enterprise Distribution) software installed on the cluster. InfiniScope has been tested with versions 1.0 through 1.3 of OFED. Three things from the OFED distribution are required:

  1. The OpenSM subnet manager.
  2. The include files, typically found in /usr/local/ofed/include/infiniband.
  3. The libraries, typically found in /usr/local/ofed/lib64.

Previous versions of InfiniScope required that Python and wxPython be installed, but this is no longer the case.

To run flop, the Fabric Loading Program, you must have MPI installed. Any version should work, but of course it must be an InfiniBand-capable version like mvapich if you want to exercise the InfiniBand fabric.

Installation

InfiniScope is distributed as a VMware image. After installation, it consists of a Python/wxPython-based interface program. The interface program runs under VMware, and is preinstalled in the VMware image.

A separate InfiniScope collector program (named iscollect) must be installed on your cluster. It is distributed as obfuscated C source code, which is compiled on your cluster with a C compiler. You do not have to modify the source.

A separate “Fabric LOading Program” (named flop) is included in the InfiniScope distribution. It is a C/MPI program distributed as obfuscated source code, which is compiled on your cluster with an mpicc wrapper around a C compiler.

Installation procedure

InfiniScope is distributed as a VMware image on a DVD. The DVD contains a README file with complete installation and configuration instructions. In summary:

  1. If the license key file microway.key is not provided on the DVD, compile and run mpilc on your cluster and submit the results to Microway. We will send you a license key file.
  2. Install and run VMware player, and open the VMware image.
  3. Copy the license key file into /usr/local/microway/microway.key on the VMware image.
  4. Install the iscollect part of InfiniScope on your cluster, with appropriate protections.
  5. Edit /usr/local/microway/infiniscope.conf .
  6. Create an ssh public key on the VMware image, and append it to ~/.ssh/authorized_keys on the head node of your cluster.

Installation notes

Make sure that /usr/local/microway/iscollect is available on all nodes of the cluster. This is not an MPI program, but it must run on all nodes of the cluster to collect data from the InfiniBand ports. It must be owned by root, with group "infiniscope", and protection 4750. Setuid root is required to read and write the port registers on the InfiniBand HCAs and switches.
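The ownership and mode requirements above can be checked mechanically on each node. The following is a minimal sketch and is not part of the InfiniScope distribution: the function names collector_mode_problems and check_collector are hypothetical, and only the owner, group, mode, and path come from the text above.

```python
import grp
import os
import pwd
import stat
from typing import List

REQUIRED_MODE = 0o4750  # setuid bit plus rwxr-x---, as stated above


def collector_mode_problems(st_mode: int, owner: str, group: str) -> List[str]:
    """Return a list of problems; an empty list means the file looks right."""
    problems = []
    if owner != "root":
        problems.append("owner is %r, expected 'root'" % owner)
    if group != "infiniscope":
        problems.append("group is %r, expected 'infiniscope'" % group)
    mode = stat.S_IMODE(st_mode)  # permission bits, including setuid
    if mode != REQUIRED_MODE:
        problems.append("mode is %o, expected %o" % (mode, REQUIRED_MODE))
    return problems


def check_collector(path: str = "/usr/local/microway/iscollect") -> List[str]:
    """Stat the file on this node and report any problems (Linux only)."""
    st = os.stat(path)
    owner = pwd.getpwuid(st.st_uid).pw_name
    group = grp.getgrgid(st.st_gid).gr_name
    return collector_mode_problems(st.st_mode, owner, group)
```

Running check_collector() on every node and printing any non-empty result is one way to confirm that the collector was installed consistently across the cluster.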

If you didn't do so during the running of the installation script, add users to the infiniscope group by editing /etc/group .

Put /usr/local/microway in the path for all users who are going to be using InfiniScope.

If you didn't do so during the running of the installation script, modify the startup file for the OpenSM subnet manager, /etc/opensm.conf, so that the VERBOSE option contains the string "-D 0x40".

# -D 0x40 writes out the forwarding tables for InfiniScope
VERBOSE="-D 0x40"

This will cause the subnet manager to write two files every time it runs: /tmp/subnet.lst (the subnet topology file) and /tmp/osm.fdbs (the forwarding table listing). (Depending on the OFED version, the files may be in /var/log instead of /tmp and may have different names.)

Restart the subnet manager after changing the startup file.

service opensmd restart
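
Before restarting, it can be useful to confirm that the startup file really contains the required flag. This is a minimal sketch under the assumption that /etc/opensm.conf uses plain shell-style NAME=value lines, as in the VERBOSE example above; the function name verbose_has_dump_flag is hypothetical.

```python
def verbose_has_dump_flag(conf_text: str, flag: str = "-D 0x40") -> bool:
    """Return True if the last VERBOSE= assignment contains the flag."""
    found = False
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == "VERBOSE":
            found = flag in value  # a later assignment overrides an earlier one
    return found
```

A typical use would be to read the file with open("/etc/opensm.conf").read() and pass the text to the function.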

Uninstalling

To uninstall InfiniScope, delete the VMware image. Also, remove the installation directory /usr/local/microway and its contents on all nodes. (Don't do this if you still want to keep using our other cluster diagnostic tool, MPI Link-Checker.)

Starting InfiniScope

On the VMware image, make sure /usr/local/microway is in your path. Start InfiniScope by typing "is".

Several command line options may be used. They may be combined and used in any order, although some combinations do not make sense. A space between a single-letter abbreviation and its value is optional. An extended option name and its value may be separated by either a space or an equal sign.

Note that most of the command line options may be specified in the configuration file. For options that change only rarely, using the configuration file will probably be more convenient.

Option summary

The command syntax is as follows:
is -h | --help 
or:
is -v | --version 
or:
is [-i | --init-file <configuration file name>]
    [-d | --data-file <input data file name>]
    [-o | --output <output data file name>]
    [-A | --ascii]
    [-m | --map <name of file containing displayed name map>]
    [-g | --gui-hostname <GUI host name>]
    [-C | --collector-ssh-command "<command for starting InfiniScope data collector>"]
    [-F | --feeder-ssh-command "<command for starting InfiniScope data feeders>"]
    [-p | --collector-program-name <collector program name>]
    [-s | --single-collector]
    [-x | --excluded-nodes "<node node ... (nodes not running collector)>"]
    [-a | --allow-multiples]
    [-e | --expansion <fabric display box expansion factor>]
    [-n | --no-logging]
    [-L | --looping]

Here is more detail on each of the options:

--help (or -h)
    Displays a program usage message.
--version (or -v)
    Displays the InfiniScope version number.
--init-file (or -i)
    Specifies the name of an alternate configuration file. On startup, the program looks for this file if it is specified; otherwise, it looks for /usr/local/microway/infiniscope.conf. The options --init-file, --data-file, --output, --help, and --version may be specified only on the command line. All other options may be specified either in the configuration file or on the command line. If an option appears both on the command line and in the configuration file, the value on the command line is used.
--data-file (or -d)
    Specifies the name of the data file if you are going to display existing data.
--output (or -o)
    Specifies the name of the output file for all data for this run.
--ascii (or -A)
    Produces data file output in ASCII instead of binary.
--map (or -m)
    Specifies the name of the file that maps GUIDs, Microway HCA names, or node names to abbreviated host names in the fabric display.
--gui-hostname (or -g)
    Specifies the name or IP address of the host on which the GUI is running.
--collector-ssh-command (or -C)
    Specifies the command used to access the node on which the collector will run.
--feeder-ssh-command (or -F)
    Specifies the command used to remotely start data feeders on fabric nodes.
--collector-program-name (or -p)
    Specifies the name of the collector program that reports the performance measurements.
--single-collector (or -s)
    Indicates that all performance queries should be made on the GUI node.
--excluded-nodes (or -x)
    Specifies the names of nodes that should not be queried or displayed.
--allow-multiples (or -a)
    Does not terminate other instances of InfiniScope that are currently running.
--expansion (or -e)
    Specifies the factor by which to expand or contract the size of boxes in the fabric display.
--no-logging (or -n)
    Turns off logging of console messages to a file.
--looping (or -L)
    Causes the display of recorded data to loop.
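
The option spelling rules and the command-line-over-configuration-file precedence described above can be illustrated with a small stand-in. This sketch uses Python's argparse, not InfiniScope's own parser, and handles only three of the options; the configuration keys (expansion, map, no_logging) are hypothetical, since the format of infiniscope.conf is not described here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # add_help=False keeps this stand-in limited to the three options below.
    parser = argparse.ArgumentParser(prog="is", add_help=False)
    # default=None lets us tell "not given" apart from a real value.
    parser.add_argument("-e", "--expansion", type=float, default=None)
    parser.add_argument("-m", "--map", default=None)
    parser.add_argument("-n", "--no-logging", action="store_true", default=None)
    return parser


def effective_options(config: dict, argv: list) -> dict:
    """Start from the configuration file values, then let the command line win."""
    given = vars(build_parser().parse_args(argv))
    merged = dict(config)
    merged.update({name: value for name, value in given.items() if value is not None})
    return merged
```

With this stand-in, the argument lists ["-e2"], ["-e", "2"], ["--expansion", "2"], and ["--expansion=2"] all yield the same expansion value, and any value given this way replaces the one taken from the configuration dictionary.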

The display

Top part of display

The top row of boxes shows the color and shape scale for the port bandwidth display. The next row of boxes (or more than one row if there are more than 24 HCA ports) shows the measured short term bandwidth for each port of each host. The remaining rows of boxes show the bandwidth for each port of each switch, one switch per line.

The boxes at both ends of a DDR connection are outlined in blue. The boxes at the ends of an SDR connection are outlined in red. (QDR connections, when they become available, will be outlined in green.)

The boxes at both ends of a 1X connection and the line between them will flash yellow and black.

If a port has seen any hardware errors, there will be a red "E" over its box. Errors already present in the hardware port registers when you start InfiniScope are suppressed, as well as many errors that occur when the fabric topology changes.

By moving the mouse over a box, you select the port corresponding to that box for the graph at the bottom. Instead of selecting a single port, you may select a switch by moving the mouse over its label on the left, or you may select "all hosts" by moving the mouse over the "Hosts" label on the left. The selected box (port, switch, or all hosts) will be outlined with a colored square or rectangle. If a selected port has seen hardware errors, details of the errors will be shown in the graph in the lower part of the display (see below).

If you click a box, that box will be locked as the selection. Clicking again anywhere unlocks the selection.

If a switch is selected, there will be colored lines showing all connections to that switch. Unless the fabric is too big, the colors of the lines change randomly so that they can be distinguished from each other more easily.

Similarly, if "All hosts" is selected, there will be colored lines showing all connections to HCAs. Again, the colors of the lines change randomly unless the fabric is too big.

If a single port is selected, there will be a colored line showing the connection to that port, and the port at the other end of the connection will be outlined with a gray square.

If an HCA port is selected, all switch ports on paths to the selected port (not from the port: almost every port beyond the first switch can occur on paths from a port) will be outlined with a gray square. This is just some extra information that might be helpful in determining the cause of unexpectedly poor behavior.

If an HCA port is locked as the selection, you can move the mouse over a different HCA port. Doing so will cause the path through the fabric from the selected port to the port under the mouse to be displayed as a set of colored lines.

Bottom part of display

The bottom part of the display shows the average transmit bandwidth from the selected port or group of ports. Measurements are made as often as specified, at intervals from 1 millisecond to 1 second. The bandwidth can be shown as measured, or averaged in groups of 2, 4, 8, ..., 512, 1024 measurements. This means that the full time scale of the graph can range from about one second to nearly two weeks. The default measurement interval is 100 ms, displayed as measured with no averaging, leading to a full time scale of one or two minutes. If the full time scale is less than an hour, the time axis will be labeled with minutes and/or seconds into the past. If the time scale is an hour or more, the time axis will be labeled with time-of-day (and day-of-week for previous days).
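
The range of time scales quoted above follows from multiplying the measurement interval by the averaging group size and by the number of points across the graph. The number of points is not stated in this manual; the sketch below assumes about 1000, a value chosen only because it reproduces the three figures given above (about one second, one to two minutes, and nearly two weeks).

```python
ASSUMED_POINTS_PER_GRAPH = 1000  # assumption, not stated in this manual


def full_time_scale_seconds(interval_s, group_size, points=ASSUMED_POINTS_PER_GRAPH):
    """Time spanned by the graph: interval x measurements per point x points."""
    return interval_s * group_size * points


shortest = full_time_scale_seconds(0.001, 1)   # 1 ms, no averaging
default = full_time_scale_seconds(0.1, 1)      # 100 ms, no averaging
longest = full_time_scale_seconds(1.0, 1024)   # 1 s, groups of 1024
longest_days = longest / 86400.0               # seconds converted to days
```

Under that assumption the shortest scale is about 1 second, the default is about 100 seconds, and the longest is 1,024,000 seconds, or a little under 12 days.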

If you click on the graph, the data for the selected port will be cleared and the graph will be restarted. This can be useful to rescale the graph. If you double-click on the graph, the data for all ports will be cleared.

If you right-click on the graph, it will be frozen. (Data collection will continue.) Right-clicking again, or moving the mouse outside the graph area, will unfreeze the graph.

If a port is selected and the port has seen hardware errors, the type and number of errors will be shown at the bottom of the graph, using different colored text for different error types. The indication will be something like “Sym:12/561”, indicating that a total of 561 symbol errors were observed during 12 different queries.

The times of the errors will be indicated on the graph with colored marks, the color of a mark corresponding to the color of the text for the corresponding error count. Full details of errors will also be written to the console and the log file.
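
If you copy error indicators from the display or from notes into a script, the “queries/total” form is easy to split apart. This is a minimal sketch that assumes every indicator has the same shape as the “Sym:12/561” example above; the function name parse_error_indicator is hypothetical, and no error labels other than "Sym" are confirmed by this manual.

```python
import re
from typing import NamedTuple


class ErrorIndicator(NamedTuple):
    label: str    # error type abbreviation, e.g. "Sym"
    queries: int  # number of queries in which errors were observed
    total: int    # total number of errors observed


_INDICATOR = re.compile(r"^(?P<label>[A-Za-z]+):(?P<queries>\d+)/(?P<total>\d+)$")


def parse_error_indicator(text: str) -> ErrorIndicator:
    """Split an indicator such as 'Sym:12/561' into its three parts."""
    match = _INDICATOR.match(text.strip())
    if match is None:
        raise ValueError("not an error indicator: %r" % text)
    return ErrorIndicator(
        match.group("label"),
        int(match.group("queries")),
        int(match.group("total")),
    )
```

For the example above, the result has label "Sym", queries 12, and total 561.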

Menus

Control menu

Reset error indicators

You can clear the "E" error markings over port boxes by clicking on the "Reset error indicators" item in the Control menu. The errors are reported to the console and also appended to the log file, so you don't have to worry about losing the error reports. You can also clear the error indicators by typing Control-R.

Annotate log file

You can add an annotation to the log file by clicking on the "Annotate log file" item in the Control menu. A dialog window will pop up; just enter your annotation and click "OK". The annotation will be timestamped and placed in the log file. You can also bring up the annotation dialog by typing Control-A.

Stopping InfiniScope

You can stop InfiniScope by clicking on the "Exit" item in the Control menu, by clicking on the X at the top right of the window frame, or by typing Control-C over the display or in the console window where you started the program.

Save data menu

Record and End recording

You can begin recording all measurements by clicking the "Record" item of the Save data menu. Data will be saved for all ports, both bandwidth and error data. The data will be saved in the specified output file (or /tmp/infiniscope.data if no file was specified).

To stop recording, click the "End recording" item of the Save data menu. You can resume recording later if you wish. The recorded data will not show a gap for the period when recording was stopped, so the times on the graph scale may be wrong.

Control buttons

Reset

Clicking the Reset button resets the measurement interval, graph scale, and scan time to their default initial values, 100 ms, between 1 and 2 minutes, and 0.5 sec respectively.

Freeze/Gather

Clicking the Freeze button stops the gathering of data. You can look at the graph for any port or switch or for all hosts, you can change the graph time scale, and you can start or stop scanning or change the scan rate. When you click it, the Freeze button becomes a Gather button.

Clicking the Gather button unfreezes the display and resumes gathering data. All the data history will be reset. When you click it, the Gather button returns to being a Freeze button.

Changing the measurement interval or clicking the Reset button also unfreezes the display, resumes data gathering, and changes the Gather button to a Freeze button.

Measurement time

You can specify how often you want InfiniScope to query the ports in the fabric by clicking the measurement time '+' or '-' buttons. The measurement interval can be any of a large set of predetermined values between 1 millisecond and 1 second. The '-' button shortens the time between measurements, and the '+' button lengthens it. Changing the measurement interval discards all measurements made with a different measurement interval. When you change the measurement interval, the graph scale will change accordingly.

If you set a very short interval (a few milliseconds), it may be impossible to query the fabric often enough. In this case, some measurements will be skipped to try to keep up with the measurement clock. The bandwidth graphs may appear jumpy in this case.

Graph scale

You can specify how much averaging you want the bandwidth graphs to represent by clicking the graph scale '+' or '-' buttons. The graph scale is shown as the approximate time represented by the full graph scale, but internally it is just the number of data points (some power of 2) that are averaged to produce each point displayed on the graph. The '-' button halves the number of points being averaged and shortens the time represented by the full graph scale; the '+' button doubles the number of points being averaged and lengthens the time represented by the full graph scale. Changing the graph scale by itself will not reset the data.

If you change the measurement interval, the graph scale label will change accordingly, maintaining the same degree of averaging. The data will be reset.
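
The averaging behind the graph scale can be pictured as follows. This is only a sketch of the idea described above, not InfiniScope's code: it averages consecutive groups of measurements, and it assumes that an incomplete group at the end is simply left out, which this manual does not specify. The function name average_in_groups is hypothetical.

```python
def average_in_groups(samples, group_size):
    """Average consecutive groups of group_size measurements.

    group_size must be a power of 2 between 1 and 1024, matching the
    range of averaging described above. A trailing incomplete group is
    dropped (an assumption made for this sketch).
    """
    if not 1 <= group_size <= 1024 or group_size & (group_size - 1):
        raise ValueError("group_size must be a power of 2 from 1 to 1024")
    usable = len(samples) - len(samples) % group_size
    return [
        sum(samples[start:start + group_size]) / group_size
        for start in range(0, usable, group_size)
    ]
```

Doubling group_size, as the '+' button does, halves the number of displayed points produced from the same measurements, which is why the full graph then represents twice as much time.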

Scan ports/Reverse/Stop scanning

If you click on the Scan ports button, the port selector will cycle through all the ports, spending a short time displaying the bandwidth graph for each one. The Scan ports button becomes a Reverse button. Scanning starts at the most recently locked port.

If you are scanning through the ports, clicking on the Reverse button will cause the scan to change direction and move backward through the ports. The Reverse button becomes a Stop scanning button.

If you are scanning the ports in reverse order, clicking on the Stop scanning button stops the scanning. The Stop scanning button becomes a Scan ports button again.

If you are scanning or reverse-scanning, all the ports' boxes will have small black check marks in them. If you right-click on any port box, the check mark will become a small red 'X', and the corresponding port will be excluded from the scan. You may exclude as many ports as you wish, except that there will always be at least one port being scanned. Right-clicking again will change the 'X' back to a check mark and will reengage scanning.

If you are scanning or reverse-scanning, right-clicking the "Hosts" box will toggle scanning for all HCA ports. Similarly, if you are scanning or reverse-scanning, right-clicking a switch's box will toggle scanning for all ports on the switch.

If you are scanning or reverse-scanning, two buttons will appear next to the scan time buttons. Clicking the upper "SCAN ALL" button will cause all ports to be included in the scan. Clicking the lower "TOGGLE" button will toggle scanning (off-to-on and on-to-off) for each port, leaving at least one port being scanned.

Scan time

You can change the amount of time that each port stays selected during forward scanning by clicking the Scan time '+' or '-' buttons. The '-' button shortens the time between port selection changes, down to 0.1 second, and the '+' button lengthens it, up to 15 seconds.

The time for reverse-scanning is fixed at 0.75 second.

Messages

Messages are printed at the terminal when InfiniScope is started or stopped and when fabric errors occur. Whenever the fabric is redrawn because the subnet manager detected a change in the fabric, the connections from all switch ports are shown. In addition, a notice is printed once every hour, so that if the fabric dies, you will know approximately when it died. The same messages are appended to the log file /tmp/infiniscope.log.

Known restrictions

Sometimes when you start InfiniScope right after rebooting the cluster, the host names are not correct: some of them may appear as "25204", or InfiniScope may say that the subnet manager could not determine the correct names of all the hosts. If this happens, simply restart the subnet manager. You may also be able to get around the problem by using the name map (-m) command line option.

With a very short measurement interval (1 ms), the label on the Scan/Reverse/Stop scanning button may not be displayed, although it still works properly.

Troubleshooting

Occasionally InfiniScope gets into an incorrect state. To get it working again, stop it. Stop iscollect if any instances of it are still running on the cluster nodes. Restart the subnet manager, then restart InfiniScope.

If you have to restore a node (for example, because of a hard disk failure), be sure that the current versions of iscollect and flop are copied onto the new disk.

Flop - the Fabric LOading Program

Flop is an MPI application that generates traffic to load the fabric in various ways. It must be compiled using mpicc and run using mpirun in the usual way. The mode of operation is specified using a command-line option.

The usage for flop is:

mpirun -np <n> /usr/local/microway/flop [-f|-h|-r|-a|-c|-s|-l]
   [ranks for -s option] [message_size]

If you normally start MPI programs in a different way, start flop the same way. The options are:
 -f Full-duplex round robin
 -h Half-duplex round robin
 -r Ring
 -a All-to-all
 -c Cycle
 -s Select send/receive rank pairs
 -l List ranks and hosts

In all tests, you can specify the message size in bytes as the last argument. The default message size for all tests except the all-to-all test is 5000000 bytes. Each message is repeated 1000 times in each round.

In the full-duplex round robin test, each node is both sending and receiving at the same time. (One node is inactive in each round if there are an odd number of nodes.)

In the half-duplex round robin test, half the nodes are sending and half are receiving in each round. (One node is inactive in each round if there are an odd number of nodes.)

In the ring test, each node is receiving from one neighbor and sending to another, in a ring covering the entire cluster. The bandwidth of this test is limited to that of the slowest connection.

In the all-to-all test, there is just a single MPI_Alltoall() covering the entire cluster. This test requires more memory than the others, so the default message size is 1000000 bytes. You may have to make it even smaller for a large cluster.

In the cycle test, only one node is sending to only one other node in each round, but eventually every node will send to and receive from every other node.
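
To make the traffic patterns above concrete, the sketch below lists which ranks would talk to each other in the ring, cycle, and round robin tests. It is an illustration only: flop is distributed as obfuscated source, so its real schedule is not known here. The ring direction, the order of the cycle rounds, and the use of the standard circle method for round robin pairing are all assumptions, and the function names are hypothetical.

```python
from itertools import permutations


def ring_neighbors(rank, size):
    """Return (receive_from, send_to) for one rank in a ring of size ranks."""
    return ((rank - 1) % size, (rank + 1) % size)


def cycle_rounds(size):
    """Every ordered (sender, receiver) pair, one pair per round."""
    return list(permutations(range(size), 2))


def round_robin_rounds(size):
    """Pair up ranks so that each pair meets once (circle method).

    With an odd number of ranks, a placeholder (None) is added and the
    rank paired with it sits out that round.
    """
    ranks = list(range(size))
    if size % 2:
        ranks.append(None)
    n = len(ranks)
    rounds = []
    for _ in range(n - 1):
        pairs = [(ranks[i], ranks[n - 1 - i]) for i in range(n // 2)]
        rounds.append([pair for pair in pairs if None not in pair])
        # Keep the first entry fixed and rotate the others by one place.
        ranks = [ranks[0], ranks[-1]] + ranks[1:-1]
    return rounds
```

In the full-duplex test both members of a pair would send and receive at once, while in the half-duplex test one member of each pair would send and the other receive.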

In the selected ranks test, you can specify precise send/receive pairs, according to their MPI ranks. The ranks are specified in pairs, alternating between send rank and receive rank. Of course, there must be an even number of ranks listed. A node can be sending and receiving at the same time, but not safely more than one of each. To find the MPI rank of each node, run flop with the "-l" option. If you want to specify the message size, put it after the list of ranks.
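
The rules for the rank list can be checked before launching a long run. The sketch below validates the arguments that would follow "-s". It is not flop's own parser: the function name parse_selected_ranks is hypothetical, and treating an odd number of values as "rank pairs followed by a message size" is an assumption based on the description above.

```python
def parse_selected_ranks(args, world_size):
    """Return (pairs, message_size) from the values that follow -s.

    pairs is a list of (send_rank, receive_rank) tuples. message_size is
    None when it was not given, in which case flop's default applies.
    """
    values = [int(arg) for arg in args]
    message_size = None
    if len(values) % 2:
        # Assumption: an odd count means the last value is the message size.
        message_size = values.pop()
    if not values:
        raise ValueError("no send/receive rank pairs given")
    pairs = list(zip(values[0::2], values[1::2]))
    for rank in values:
        if not 0 <= rank < world_size:
            raise ValueError("rank %d is outside 0..%d" % (rank, world_size - 1))
    senders = [send for send, _ in pairs]
    receivers = [recv for _, recv in pairs]
    # A rank may both send and receive, but only one of each.
    if len(set(senders)) != len(senders):
        raise ValueError("a rank is listed as sender more than once")
    if len(set(receivers)) != len(receivers):
        raise ValueError("a rank is listed as receiver more than once")
    return pairs, message_size
```

For example, the values 0 1 2 3 1000000 on a four-rank job would be read as rank 0 sending to rank 1 and rank 2 sending to rank 3, with a message size of 1000000 bytes.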

flop doesn't do anything useful; it just loads the fabric by sending random data around in the specified pattern. It doesn't report anything either, except for the name of the test it is performing.

About Microway

Microway designs, builds, and services high quality clusters, workstations, and networks for High Performance Computing. Our internet address is microway.com. Our telephone number is 1-508-746-7341.

Microway provides our customers with leading edge technologies for high performance computing solutions. We establish and maintain industry recognized products and expertise for cluster interconnect, cluster management and HPC storage solutions.

Microway's reputation as the world leader in innovative solutions for High Performance Technical Computing has been unchallenged since 1982, when our software made it possible to use the 8087 math coprocessor in the IBM PC. Our products consistently receive excellent reviews, our prices are competitive, and our service and technical support are outstanding. Microway's top notch Research and Development Staff keeps you on the leading edge of technology with timely, powerful new products. At Microway our customers are treated as our most valuable resource, which is why our customer base remains strong and continues to grow.

Microway's products include clusters, silent workstations, InfiniBand-based switches, multi-function HCAs and storage solutions. Our clusters incorporate AMD Quad-Core Opterons, Intel Quad-Core Xeons, NVIDIA Tesla GPU products, and Mellanox silicon.

Designed and developed in-house, Microway software includes MCMS cluster management tools; InfiniScope™ InfiniBand diagnostic software; and MPI Link-Checker™ MPI diagnostic tool. Microway's Linux-based clusters and data solutions are used by customers in life sciences, academia, enterprise and government research laboratories.

Our industry recognized trademarks include Microway, FasTree, InfiniScope, MPI Link-Checker, Navion, TriCom, WhisperStation, NumberSmasher, NodeWatch, ServaStor, and Quadputer.

The technical staff at Microway is qualified to assist you in benchmarking and speeding up your existing code and enhancing your present software and hardware investment. Our staff has over 50 years combined experience in designing Linux cluster configurations. We offer white papers on this web site, as well as technical documentation of the hardware and software we design and integrate. To design your next custom system or cluster, please call our Sales Department at 1-508-746-7341. Our Technical Support Department can be reached at the same number or via email at tech@microway.com.

For more than twenty-six years, the employees at Microway have earned our reputation for excellence. We are proud of this reputation and totally committed to designing innovative products that provide state of the art solutions required to keep our customers on the leading edge of technology.

Microway ... technology you can count on, since 1982.
