Microway® MPI Link-Checker™

Latency, Bandwidth, and Data Integrity Analyzer for HPC Clusters

Version 3.2.4

Copyright 2005-2009 by Microway, Inc.

Introduction

MPI Link-Checker™ from Microway® is a tool to help you measure and understand the performance of your Beowulf cluster running MPI. It provides a baseline set of measurements for a properly running cluster and, more importantly, enables you to detect any anomalous behavior.

MPI Link-Checker operates by measuring latency, transfer time, bandwidth, and data integrity over a wide range of message sizes between every pair of nodes in your cluster. The data is displayed graphically in a variety of views, enabling you to quickly spot any problems.

Abbreviated summary of instructions

Reading the documentation is important for effective use of MPI Link-Checker. For the impatient, however, here is the bare minimum that you need to use the program.

First, complete the program installation.

  1. Copy /usr/local/microway/mpilc.c from the VMware image to /usr/local/microway/mpilc.c on the head node of your cluster. On the head node, compile mpilc.c into mpilc, and if necessary, distribute mpilc to all nodes (not necessary if you are using NFS).
  2. Look at /usr/local/microway/link-checker.conf, and if necessary edit it, specifying how to start an MPI program on a remote system. You may want to set some other values too, such as the name of your cluster and the number of nodes.
  3. Run MPI Link-Checker:
    lc -n <process_count>
    
    For the process count, use the number of nodes in your cluster.

    Click on the magenta box at the bottom to call up the help screen.
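
For example, on a 16-node cluster the installation and first run might look like this (the exact mpicc invocation, output path, and process count are only illustrations; adjust them for your compiler and cluster):

    mpicc -o /usr/local/microway/mpilc /usr/local/microway/mpilc.c
    lc -n 16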

What MPI Link-Checker measures

MPI Link-Checker measures both transfer time (latency) and bandwidth between every pair of nodes in your cluster. Depending on the logic of your application, the performance of a cluster may be limited by the performance of its worst-performing connection, so it is important to find slow connections.

Data integrity problems

If there are data integrity problems, nothing else matters. They must be resolved. It doesn't matter how fast the data is being transmitted if it's not being sent and received correctly. Fortunately, such problems are very rare; modern network interconnects are highly reliable, and have failsafe mechanisms to guarantee correct delivery of all messages when viewed from the application level. MPI Link-Checker does check data integrity, and gives you an unmistakable notification when something is wrong.

Contention

The most important everyday effect that you will see in a cluster with more than one switch is contention. If any switch-to-switch links are oversubscribed (that is, if multiple messages can be scheduled to traverse the link at the same time), then there may be contention for the link and congestion in the network. Host-to-host connections using those links will not be able to operate at full bandwidth. The effect is that messages can be delayed, and overall system performance can be degraded. The effect may or may not be significant, depending on the particulars of your cluster and application. MPI Link-Checker computes a simple measure of contention by comparing average-case and best-case bandwidth for each connection as the network handles traffic from a number of hosts simultaneously.

Problem nodes

Incorrect BIOS settings, faulty connectors, loose connections, and defective motherboards, network controller cards, cables, and switches can all cause degraded performance. Note that in many cases your cluster will still operate without data errors, but with greatly reduced performance, as either the retry rate rises or the network throttles back to a less demanding protocol. These kinds of problems are often limited to a single node. MPI Link-Checker can quickly pinpoint any problematic nodes.

Miscellaneous variability

In normal cluster operation, you are going to observe some variability in latency and bandwidth measurements. You may observe some effects of the network topology: extra hops within the network will lead to increased latency. Some nodes may have faster hardware (CPU, memory bus, PCI bus) than others, leading to bandwidth variability. Networks are complex systems, leading to random misbehavior. Finally, the measurements we make in MPI Link-Checker are optimized for speed rather than precision, so there is going to be some measurement error. All this variability is interesting, but it is usually nothing to worry about.

MPI implementation issues

Every MPI implementation incorporates a number of design decisions to optimize performance. Often there are different protocols, involving more or less handshaking, for different message sizes. The crossover points between protocols can be set at MPI build time, or in some cases at run time. Incorrect settings can degrade performance (especially bandwidth) if your MPI-based application happens to use messages of a size near the crossover points. MPI Link-Checker shows graphs of performance vs. message size, highlighting any suboptimally set crossover points, so you can rebuild MPI, change the appropriate parameters at run time, or avoid troublesome message sizes in your application.

How MPI Link-Checker collects and displays data

Multiple views break the curse of dimensionality

MPI Link-Checker gathers a huge volume of data. In an n-node cluster, there are n² connections. For each connection, by default the program makes transfer time measurements for 37 different message sizes and bandwidth measurements for 57 different message sizes. (If you want to, you can specify any number of different transfer time and bandwidth measurements; these numbers are just the defaults.) Each measurement is made repeatedly, and the average, best, worst, and most recent values are all available and potentially useful. For a 32-node cluster, this amounts to 32²×(37+57)×4 = 385,024 potentially meaningful data values; the number increases quadratically with cluster size!

MPI Link-Checker makes all this data visible through multiple, interconnected views. Two grids give an overview of the latency and bandwidth data for all connections for a selected message size. Two graphs show the variation of transfer time and bandwidth with message size for a selected connection. Finally, a small table shows all available statistics for a selected connection and message size. Each of these elements also acts as a controller for the others: the grids to select a connection, the graphs to select message sizes, and the table to select the displayed statistic.

Which connections are measured?

When you start MPI Link-Checker, you specify the number of processes. The program makes bandwidth and transfer time measurements for every pair of processes, that is, for every possible connection. Each connection is measured independently in both directions, since connections are not necessarily symmetrical.

The connections are measured in parallel. At any time, roughly half the processes are sending and the other half are receiving. The senders and receivers are constantly changing in a complicated dance, randomized but precise, so eventually all combinations will be measured.

Since MPI permits a process to send a message to itself, MPI Link-Checker measures the transfer time and bandwidth for each process sending to itself. These measurements depend more on the MPI implementation than on the hardware, since the implementation may play some tricks to avoid data movement when everything happens within one process.

Similarly, a process may send a message to a different process running on the same node. Bandwidth measurements in this case measure both the quality of the MPI implementation and the memory bandwidth. Note that this case will not occur if you run MPI Link-Checker with one process per node. Also note that if you run MPI Link-Checker on a standalone multiway SMP, this and the sending-to-self case are the only cases that will occur!

Terminology: transfer time and latency

In this documentation, we use the term transfer time to refer to the one-way transfer time for a message of a specific size. When speaking precisely, we reserve the term latency for the one-way transfer time for a 0-byte message, although when speaking informally we may use latency in the broader sense, referring to the transfer time for any size message.

Units

Throughout MPI Link-Checker, latencies and transfer times are expressed in microseconds. One microsecond is 10⁻⁶ second, as usual.

Bandwidths are expressed in megabytes per second. In accordance with customary usage in the industry when discussing bandwidth, one megabyte per second is 10⁶ = 1,000,000 bytes per second. This is a change from earlier versions of MPI Link-Checker, which considered one megabyte per second to be 2²⁰ = 1,048,576 bytes per second.

Message sizes are expressed in bytes, kilobytes, or megabytes. For message sizes, one kilobyte is 2¹⁰ = 1,024 bytes and one megabyte is 2²⁰ = 1,048,576 bytes.

Detailed installation instructions

MPI Link-Checker is distributed as a VMware image. After installation, it consists of two programs, a Python/wxPython-based interface program and a C program with MPI calls to do the measurements. The interface program runs under VMware, and is preinstalled in the VMware image. The C/MPI program runs on your cluster. It is distributed as (obfuscated) source code, which is compiled on your cluster with a C compiler using an mpicc wrapper. You do not have to modify the source at all.

Requirements

To run MPI Link-Checker you must be running the Linux operating system with the X Window System. You must have some version of MPI installed. You must also have a C compiler whose library provides the gettimeofday() system call, and gettimeofday() must have microsecond accuracy.
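
If you are unsure about the timer on your nodes, a quick sanity check (a minimal sketch, not part of MPI Link-Checker) is to spin on gettimeofday() until the reported time changes and look at the size of the step, which should be on the order of a microsecond:

    /* Rough check of gettimeofday() resolution (illustration only). */
    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval start, now;
        long delta;

        gettimeofday(&start, NULL);
        do {
            gettimeofday(&now, NULL);
            delta = (now.tv_sec - start.tv_sec) * 1000000L
                  + (now.tv_usec - start.tv_usec);
        } while (delta == 0);

        printf("smallest observed time step: %ld microseconds\n", delta);
        return 0;
    }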

When you install the program, the system date should be correct and the system time should be fairly accurate (not off by more than an hour or so).

Installation procedure

MPI Link-Checker is distributed as a VMware image on a DVD. The DVD contains a README file with complete installation and configuration instructions. In summary:

  1. If the license key file microway.key is not provided on the DVD, compile and run mpilc on your cluster and submit the results to Microway. We will send you a license key file.
  2. Install and run VMware player, and open the VMware image.
  3. Copy the license key file into /usr/local/microway/microway.key on the VMware image.
  4. Install the MPI part of MPI Link-Checker on your cluster.
  5. Edit /usr/local/microway/link-checker.conf.
  6. Create an ssh public key on the VMware image, and append it to ~/.ssh/authorized_keys on the head node of your cluster.
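
For step 6, a typical sequence run on the VMware image looks like this (the user and host names are only placeholders for your own):

    ssh-keygen -t rsa
    cat ~/.ssh/id_rsa.pub | ssh user@headnode 'cat >> ~/.ssh/authorized_keys'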

Uninstalling

To uninstall MPI Link-Checker, delete the VMware image. Also, remove the installation directory /usr/local/microway and its contents on all nodes. (Don't do this if you still want to keep using our other cluster diagnostic tool, InfiniScope.)

Running the program

MPI Link-Checker is started by issuing a command with options, typically from a shell command line. The program name is 'lc'.

Command line options

The usual way to start MPI Link-Checker is with the 'lc' command on a shell command line, followed by one or more options. Each option may be given in single-letter form or long form. In single-letter form, an option consists of a single dash, a single letter, optional white space, and the value to be assigned to the option. For example, you can specify that you want to run 24 MPI processes with:

lc -n 24
or
lc -n24
In long form, an option consists of two dashes, a multi-letter option name, an equal sign or white space (but not both), and the value to be assigned to the option. For example, you can specify that you want to run 24 MPI processes with:
lc --np 24
or
lc --np=24
The command syntax is as follows:
lc -h | --help 
or:
lc -v | --version 
or:
lc [-i | --init-file <configuration file name>]
    [-d | --data-file <input data file name>]
    [-o | --output-file <output file name>]
    [-G | --gui-hostname <GUI host name>]
    [-R | --ssh-command <command for remotely starting MPI job>]
    [-t | --mpirun-template "<mpi command template>"]
    [-p | --mpilc-program-name <mpilc program name>]
    [-n | --np <number of processes>]
    [-c | --cluster-name "<cluster name>"]
    [-g | --graphics-off] (no graphic display, data gathering only)
    [-l | --min-latency <smallest message size for latency measurements>]
    [-L | --max-latency <largest message size for latency measurements>]
    [-b | --min-bandwidth <smallest message size for bandwidth measurements>]
    [-B | --max-bandwidth <largest message size for bandwidth measurements>]
    [-s | --test-sequence-file <test sequence file name>]
    [-S | --single-measurement "<single performance measurement>"]
    [-A | --ascii]
Here is more detail on each of the options:
--help (or -h)
displays a program usage message
--version (or -v)
displays the MPI Link-Checker version number
--init-file (or -i)
specifies an alternate configuration file. On startup, the program will look for this file if specified. Otherwise, it will look for /usr/local/microway/link-checker.conf. The options --init-file, --data-file, --output-file, --graphics-off, --help, and --version may be specified only on the command line. All other options may be specified either in the configuration file or on the command line. If an option appears both on the command line and in the configuration file, the value on the command line will be used.
--data-file (or -d)
specifies the name of the data file if you are going to display existing data
--output-file (or -o)
specifies the name of the file in which measurement data will be stored. If not specified, data will not be stored anywhere. May be used with or without --graphics-off option.
--gui-hostname (or -G)
specifies the name or IP address of the host on which the GUI is running.
--ssh-command (or -R)
specifies how to access the node on which the MPI measurement job will be started
--mpirun-template (or -t)
specifies a template for starting an MPI program. It will normally be enclosed in double quotation marks because it contains spaces, and will contain the strings "<np>" and "<program>", with angle brackets but without quotation marks. The default is "mpirun -np <np> <program>"
--mpilc-program-name (or -p)
specifies the name of the MPI program that performs the performance measurements
--np (or -n)
specifies the number of processes to be run in the MPI measurement job
--cluster-name (or -c)
specifies the cluster name to appear in the data file and on the display. Double quotation marks are needed around the name if it contains any spaces.
--graphics-off (or -g)
specifies that data will not be displayed on the screen. Use with --output-file option.
--min-latency (or -l)
specifies the size (in bytes) of the smallest messages to be used for transfer time measurements. May be overridden by commands in the test sequence file if the --test-sequence-file option is used.
--max-latency (or -L)
specifies the size (in bytes) of the largest messages to be used for transfer time measurements. May be overridden by commands in the test sequence file if the --test-sequence-file option is used.
--min-bandwidth (or -b)
specifies the size (in bytes) of the smallest messages to be used for bandwidth measurements. May be overridden by commands in the test sequence file if the --test-sequence-file option is used.
--max-bandwidth (or -B)
specifies the size (in bytes) of the largest messages to be used for bandwidth measurements. May be overridden by commands in the test sequence file if the --test-sequence-file option is used.
--test-sequence-file (or -s)
specifies an alternate sequence of measurements to be performed. (See the "Measurements" section for more details.)
--single-measurement (or -S)
specifies a single measurement, to be executed repeatedly
--ascii (or -A)
produces data file output in ASCII instead of binary
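
As an illustration of combining several of these options, the following command (the process count, cluster name, template, and file name are only examples) runs 32 processes, labels the cluster, uses mpiexec instead of the default mpirun template, and saves the measurements:

    lc -n 32 -c "Production Cluster" -t "mpiexec -n <np> <program>" -o baseline.dat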

Modes of operation

When you run MPI Link-Checker, you can gather data in real time and display it on your monitor, save it in a data file, or both. Or you can display the results from a previously saved run.

You choose the desired mode by specifying appropriate command line options:

  1. Run tests and display results in real time: no options needed.
  2. Run tests, display results in real time, and also save results to a file: use --output-file option.
  3. Run tests and save results to a file: use --output-file and --graphics-off options.
  4. Display results from a previously saved run: use --data-file option.
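
For example, with 16 processes and a hypothetical data file named monday.dat, the four modes correspond to:

    lc -n 16
    lc -n 16 --output-file monday.dat
    lc -n 16 --output-file monday.dat --graphics-off
    lc --data-file monday.dat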

Choosing the number of processes

MPI Link-Checker uses an MPI program to gather data. The MPI program can run with any number of processes, subject to the limits imposed by your license; you tell it how many to use with the --np option. To find out as much as possible about your interconnect as quickly as possible, you should specify the process count to be the same as the number of nodes in your cluster. (Of course, you should also make sure that the MPI startup program assigns one process to each node.)

It is possible to run with one process per processor, as you would typically do with a real MPI application. If you do so, you will get data for each process communicating with each other process. Thus you will get lots of duplicate data. For example, if you run one process per node, then MPI Link-Checker will collect data for one connection in each direction for each pair of nodes, but if you run two processes per node, then MPI Link-Checker will collect data for four connections in each direction for each pair of nodes. Even with full parallelism it will take at least twice as long to gather the same effective amount of data, and in fact it may take even longer, perhaps much longer, because of contention on the hosts and on the network.

Occasionally the two processes on a node may have different performance, for example if there are contention and cache issues. It might be worthwhile to check whether this is the case by running one process per processor, but probably not on a regular basis.

Your MPI Link-Checker license is valid for a specific maximum number of processes. In fact, we add 8 to that number, giving you eight bonus processes. This will enable you to look at a few intranode connections without modifying your MPI hosts file.

For best results

The measurements made by MPI Link-Checker will be the most accurate if you are not running any other processes on any of the nodes in the cluster. Otherwise you may get CPU, memory, and cache contention. This holds especially for processes that use the same interconnection network as the mpilc MPI program, since they will also contend for the interconnect itself. Note that the graphical part of MPI Link-Checker itself uses some resources, so measurements involving the node on which it is running may be somewhat inaccurate.

The bottom line is: don't run another MPI job at the same time as MPI Link-Checker. MPI Link-Checker will not allow you to run two instances of the mpilc program itself, with the same name, at the same time.

Stopping the program

To stop MPI Link-Checker, close its window. The mpilc MPI data collection program will complete one or two connection measurements, which should take only a short time even for large messages on a large cluster.

In rare cases, you may have to kill the mpilc program manually. If you get unexpectedly poor performance on some nodes, check the status with a 'ps ax' command from the shell, and if necessary issue a 'killall mpilc' on each node.

Help

If you run the 'lc' program with the --help option, a usage message will be displayed.

If you have started the program and are showing the graphical interface, clicking on the large magenta rectangular panel (the “quick help” panel) in the bottom middle of the display will bring up the online help screen (equivalent to this manual). Pressing the F1 function key will also bring up the help screen. In addition, you can view the help screen with any web browser: it is /usr/local/microway/link-checker.html.

Measurements

When MPI Link-Checker is gathering data, it makes a sequence of tests. Most tests involve sending messages of a specified size between all pairs of processes. At any given time, all or almost all of the processes are active, half sending and half receiving.

There are three kinds of tests: transfer time (including latency), bandwidth, and accuracy. Although the transfer time and bandwidth tests both make timing measurements, they do not make exactly the same measurement. The transfer time test is a ping-pong test. The bandwidth test calculates the average bandwidth for a number of simultaneous messages of the desired size. The accuracy test compares a received message with a calculated message, byte by byte.
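
For readers curious about the underlying technique, here is a minimal ping-pong sketch in C with MPI. It is not the mpilc source, only an illustration of how a one-way transfer time can be estimated as half of a measured round trip:

    /* Minimal ping-pong sketch (illustration only, not the mpilc code):
     * rank 0 sends a fixed-size message to rank 1, which echoes it back.
     * Half of the average round-trip time approximates the one-way
     * transfer time for that message size.  Run with at least 2 processes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, reps = 1000, bytes = 1024;
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = (char *) malloc(bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("approximate one-way transfer time for %d bytes: %.2f microseconds\n",
                   bytes, (t1 - t0) * 1.0e6 / (2.0 * reps));

        free(buf);
        MPI_Finalize();
        return 0;
    }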

Specifying the measurement sequence

MPI Link-Checker is designed to perform a specific sequence of measurements. By default, it will make transfer time measurements for message sizes from 0 bytes to 4096 bytes, and bandwidth measurements for 256 bytes to 4 megabytes (4194304 bytes). You can change the lower and upper limits for both the transfer time message size and the bandwidth message size by entering command line options or by changing the configuration file.
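
For example, the following hypothetical invocation restricts transfer time measurements to messages of at most 16 kilobytes and bandwidth measurements to messages between 1 kilobyte and 1 megabyte (all sizes are given in bytes):

    lc -n 16 --min-latency 0 --max-latency 16384 --min-bandwidth 1024 --max-bandwidth 1048576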

If you wish, you can choose a completely different sequence by specifying it in a test sequence file. You might wish to do this to speed up data acquisition, to limit the program to doing only latency measurements or only bandwidth measurements, or to do a detailed investigation over a number of closely spaced message sizes.

You tell MPI Link-Checker where the test sequence file is with the "--test-sequence-file" (or "-s") option.

The file itself specifies one measurement per line. Each measurement consists of a test name and a message size. The valid test names are “latency”, “bandwidth”, and “accuracy”. Message sizes may be any multiple of 4 bytes, ranging from 0 bytes to 4 megabytes (4194304 bytes), except that bandwidth tests require a message size of at least 4 bytes and accuracy tests require a message size of at least 8 bytes. Typical lines would be “latency 0” or “bandwidth 131072”. The words “latency”, “bandwidth”, and “accuracy” may be abbreviated “lat”, “bw”, and “acc”. The white space between the test name and the message size is optional. Multiples of one kilobyte (1024 bytes) may be abbreviated “k” or “K”. Multiples of one megabyte (1048576 bytes) may be abbreviated “m” or “M”.

You can cause a group of measurements to be repeated by putting a “repeat” command (followed by an integer count) on a line before the group and an “end” command on a line after the group. The word “repeat” may be abbreviated “rep”. “repeat”/“end” groups may be nested to any reasonable depth in the obvious way. When all the tests specified in the file have completed, the program will stop.
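
For example, a hypothetical test sequence file that repeats a short mix of measurements ten times could look like this (compare it with the sample files described below):

repeat 10
lat 0
lat 1K
bw 128K
bw 4M
acc 1M
end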

There are three test sequence files included in the distribution in /usr/local/microway. They are named “baseline-sequence”, “checkup-sequence”, and “quick-sequence”.

If you want just a single measurement to be made repeatedly, you don't need to create a separate file. Specify the measurement using the "--single-measurement" (or "-S") option. Just put the measurement name and message size in the configuration file, or inside quotation marks on the command line. (If there is no white space in the command, the quotation marks may be omitted.) A bandwidth test using repeated 4-megabyte messages may be specified by “-Sbw4m”.

The display

The most interesting feature of MPI Link-Checker is its display, designed to quickly pinpoint any abnormalities in cluster performance. The sections of the display each give a different view into the data, and give you control over what is shown in the other sections.

The panels on the display

There are five main sections of the display: two grids, two graphs, and a table. In addition, the quick help panel below the table displays a short help message, and serves as a button to invoke the help screen. Click anywhere on the magenta quick help panel to call up the help system.

Grids

There are two grids, one for latency or transfer time in the upper left corner of the display and one for bandwidth in the upper right corner. Each cell shows data for one connection, from a sender to a receiver. The senders are listed across the top, and the receivers are listed down the side. When a group of senders are combined in a column or a group of receivers are combined in a row, the column or row label corresponds to the first node in the group.

The displayed statistic is the average value, the best value, the worst value, the most recent value, or the contention index. The statistic and message size as well as the measurement units are shown in the title line above the grid.

The cells in a grid are color coded, using up to three different scales.

Within each scale, the pale colors are the best, the saturated colors are bad, and black is the worst.

A legend below the grid shows the best and worst values for each scale, along with the associated colors.

The cell of the currently selected connection for the graphs and table is outlined in cyan (if locked) or magenta (if not locked), and the sender and receiver for the connection are indicated with cyan or magenta labels at the top and left of each grid. Use the mouse to change the selected connection, or to lock or unlock it. (See “Controlling the display” below.) Both grids always have the same selected connection.

If there have been any data integrity errors on a connection, the corresponding cell will be yellow.

If the cells are large enough, transfer time measurement values are shown in microseconds and bandwidth measurements in megabytes per second. The contention index is shown in percent. If the cells are too small to show the full values, the numbers in the cells show two significant digits, with no decimal point or other indication of the order of magnitude of the values. You can see the full value for any cell by mousing over the cell and looking in the table. If the cells are even smaller, the cell colors are correct, but no data values are shown. Again, you can see the full value for any cell by mousing over the cell and looking in the table.

Contention

If you select the “Contention” statistic, the bandwidth grid will show a contention index for each connection (except self-to-self connections) for the selected message size. The latency grid will be blank. The contention index for a connection is the percentage by which the average bandwidth for the connection is worse than the best bandwidth, under specific load conditions: random parallel data transmission with each node either sending or receiving. (Note that the term “contention index” is not standard terminology.)
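
Read that way, a reasonable formula for the contention index is 100 × (best bandwidth − average bandwidth) / best bandwidth: a value near 0% means the connection almost always achieved its best observed bandwidth, while larger values mean that more of the bandwidth was lost when the network was loaded.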

Graphs

There are two graphs, one for transfer time in the lower left corner of the display and one for bandwidth in the lower right corner. Both graphs refer to the same connection, the connection currently selected in the grids.

The horizontal axis on both graphs is message size, in bytes. The scale is logarithmic.

The vertical axis on the transfer time graph is one-way transfer time, in microseconds. The scale is either linear or logarithmic, depending on the range of values being displayed. If the plot is logarithmic, a measured transfer time of zero will not be plotted, both because it cannot be plotted and because it is probably a measurement artifact.

The vertical axis on the bandwidth graph is bandwidth, in megabytes per second. The scale is linear.

The currently selected message size for the latency grid and the transfer time column in the table is shown with a cyan line (if locked) or a magenta line (if unlocked). You can change it with the mouse. The currently selected message size for the bandwidth grid and the bandwidth column in the table is also shown with a cyan line (if locked) or a magenta line (if unlocked). You can change it with the mouse too. (See “Controlling the display” below.)

Half power point shown in bandwidth graph

In the bandwidth graph, the half power point is displayed. This is the message size for which the bandwidth is half its maximum value. It's an indicator of the shape of the bandwidth graph. The smaller the value of the half power point, the better the performance, since it means that the small-message overhead is less significant even for relatively small messages.
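
As a purely hypothetical illustration, if a connection peaks at 900 megabytes per second and first reaches 450 megabytes per second at a message size of 8 kilobytes, its half power point is 8 kilobytes; a connection that needs 64-kilobyte messages to reach half of the same peak has a much larger half power point and will perform comparatively poorly on small and medium-sized messages.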

The half power point is marked on the graph with a small box with crosshairs, and its value (in bytes) is labeled “N 1/2”. (The usual notation is “N1/2”, in which the “1/2” is a subscript.)

Especially early in the measurement sequence, the estimated value of the half power point may be unreliable, or even completely unavailable. Also note that different connections often have different values for the half power point. Because of this variability, it can be difficult to get a reliable overall value, but it can still be useful in comparing interconnects.

Table

The table in the lower middle part of the display shows all available statistics for the currently selected connection at the currently selected message sizes.

Either the “Latency” or the “Transfer Time” column label is highlighted in cyan or magenta, to indicate which measurement is being shown in the grid at the upper left. Cyan indicates that the selection is locked, magenta that it is unlocked. The latency is just the transfer time for 0-byte messages. You can choose which one is shown and lock or unlock your selection with the mouse. (See “Controlling the display” below.)

One of the row labels “Last”, “Best”, “Worst”, “Average”, or “Contention” is highlighted in cyan (if locked) or magenta (if unlocked), to indicate which statistic is being shown in both grids and both graphs. You can choose which one is shown and lock or unlock your selection with the mouse. (See “Controlling the display” below.)

The row labeled “Passes” is the number of measurements made so far.

Controlling the display

An important feature of MPI Link-Checker is that the different views of the data are interrelated. Each panel has selectors that control what is shown in the other panels.

What each panel shows and controls

The grid panels show data for all connections at once, but for only one message size and one statistic. By mousing over a cell in the grid, you can focus on one connection to be shown in more detail in the graph and table panels.

The graph panels show data for a number of message sizes at once, but for only one connection and one statistic. By moving the mouse in a graph, you can focus on one message size to be shown in more detail in the corresponding grid panel and in the table panel.

The table panel shows all available statistics, but for only one connection and one message size. By clicking on a row label in the table, you can focus on one statistic to be used in the graph and grid displays. In addition, you can click on a column header in the table to choose whether the upper left grid shows latencies or more general transfer times.

Using the mouse for display selection

There are five selectors: 1) connection, 2) message size for transfer time measurements, 3) message size for bandwidth measurements, 4) displayed statistic, and 5) whether to show latency or transfer time on the upper left grid.

The currently selected values are always highlighted in cyan if the selection is locked or in magenta if the selection is not locked.

Each of the five selectors can be in a locked or unlocked state. If a selector is in the unlocked state, the selection will follow the mouse if it's in the corresponding panel, with no clicking. If a selector is in the locked state, mousing in the panel has no effect on that selector, but it can be unlocked by left-clicking.

Mouseovers work in the unlocked state

In the unlocked state you can select different connections by mousing over the corresponding cells in either grid. The magenta highlighting will follow your selection. (The connection selectors in the two grids are always linked together.) There may be a short delay before the new selection takes effect.

Similarly, you can select a message size for the transfer time grid (if it's being displayed) and the transfer time column of the table by moving the mouse over the transfer time graph. The selected message size will be marked with a vertical magenta (unlocked) or cyan (locked) line. The bandwidth message size is selected the same way with a vertical magenta (unlocked) or cyan (locked) line in the bandwidth graph.

To select statistics and to make the latency/transfer time selection, you can mouse over the appropriate row and column labels in the table, which are highlighted in cyan when selected and locked, or in magenta when selected but not locked.

Right-click to lock, left-click to unlock

If you want to lock a selector, right-click the desired selection. Mouseovers will be inhibited for that selector. If you want to unlock a locked selector, left-click on a new selection. The new selection will not be locked.

Interpreting the display

Usually the displayed data is self-explanatory. Certain patterns do crop up repeatedly, though, and it's worthwhile getting to know them.

Uniform pale grid colors

Uniform pale colors in a grid mean that nothing is wrong. Minor variations in the pale colors probably don't mean anything, nor do stray darker colors, especially for connections involving the root node.

Bright or dark cross in grid

A bright red, brown, or black cross in the transfer time or bandwidth grid is the most common symptom of a problem. Usually both the row and the column of the cross refer to the same node. This node has some problem. MPI Link-Checker can't tell you what it is, but it may be a defective network card, a bad connector (at either end), a loose connector (at either end), or a bad port on a switch. Or it may be an incorrect BIOS setting, for example causing the node to use 100 megabit Ethernet instead of Gigabit Ethernet. Often the cross will be more prominent for some message sizes than for others.

Yellow cells in grids

Yellow cells in the grids mean that you have a data integrity problem that must be addressed! Data is getting through, but sometimes the data is incorrect. This means that you may be getting garbage results from your application without knowing it. Usually the yellow cells will all be in the same row or column, or in one row and the corresponding column, which quickly identifies the offending node. You can check connections and cables and so on, but the problem may be deeper in the system, such as a bad motherboard. If you can't fix the problem, at least get the offending node out of the cluster.

Blocks of slightly different colors in latency grid

Large blocks of slightly different light colors in the latency grid make up a common pattern, and simply show extra hops through the network. Each hop adds about a microsecond or more to the latency. This does show some potential data paths to avoid if low latency is important in your application.

Small blocks of slightly different light colors in the latency grid, making a somewhat distorted checkerboard pattern, are essentially the same symptom, except that the nodes are probably connected to switches in a haphazard fashion. To maintain good control over your cluster, you may want to recable your network in a systematic manner, such as connecting consecutive nodes to consecutive switch ports. However, this may not be necessary if your application cannot take advantage of your network topology. In fact, sometimes a random topology may be better!

Blocks of different colors in bandwidth grid when measuring contention

Large blocks of different colors in the bandwidth grid when you are showing the contention index are another common pattern. The blocks will show up when the host-to-host path passes through two or more switches, and when there is not enough available switch-to-switch bandwidth.

Dark areas in bandwidth grid

Large dark areas in the bandwidth grid, especially for small message sizes, may simply indicate that other MPI jobs (perhaps an earlier run of MPI Link-Checker) are running. Use the “ps x” command to check this, and either wait until the other jobs have finished, or kill them.

Random dark cells in bandwidth grid

Random dark cells in the bandwidth grid at small message sizes are usually nothing to worry about. They usually reflect the difficulty of making accurate bandwidth measurements for small message sizes. Some connections may be reporting incorrectly high bandwidths, so the remaining connections look bad in comparison.

Discontinuities in bandwidth graph, inter-node

Discontinuities in the bandwidth graph below about 16K bytes for connections between different nodes often mean that your MPI implementation is switching from an Eager protocol (sending short messages all at once) to a Rendezvous protocol (sending large messages in two steps, a handshaking message followed by the data) at an inopportune point. You may be able to adjust this point at run time, and you very likely can do so when building MPI. Remember, though, that MPI Link-Checker is measuring only a subset of possible point-to-point communication patterns. Collective communications may show different behavior. In addition, using a larger small-message threshold will require more memory for buffers. If you are not sending messages of the particular size, you won't see any improvement anyway.
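
As one hedged example (assuming Open MPI; the parameter names differ between MPI implementations, transports, and versions, so check your MPI documentation), the eager/rendezvous crossover for the TCP transport can usually be inspected and adjusted at run time with MCA parameters:

    ompi_info --param btl tcp | grep -i eager
    mpirun --mca btl_tcp_eager_limit 65536 -np <np> <program>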

Discontinuities in the bandwidth graph for larger messages for connections between different nodes may reflect a problem with the drivers for the interconnect. You may have to live with this, or wait for upgraded drivers.

Discontinuities in bandwidth graph, intra-node

Discontinuities in the bandwidth graph for messages from a process to itself or to a process on the same node are hard to interpret, but are probably either a reflection of how your MPI implementation handles these special cases, or the result of a cache effect. This is probably nothing to worry about, and there is probably nothing you can do about it anyway.

License renewal

When your license expires, you may purchase a new one from Microway. You will be given a new activation string. To update your license, run MPI Link-Checker from the command line with the "--keying" option. You will be prompted for an activation key. Enter your new license key (40 digits plus text describing your license) and you'll be all set. You do not have to be logged in as root to renew your license.

Support

We want you to get MPI Link-Checker running as quickly and easily as possible. Please send reports to tech@microway.com so that our technical support team can work with you to resolve any problems.

Technical Support

Microway's formal support for MPI Link-Checker consists of getting the product to work on your cluster. We don't anticipate any problems with compiling and running the MPI program on your system, because this portion of the product has been built and tested on many different clusters.

If you are having difficulty interpreting the results of your measurements, we are willing to look at your data and offer advice.

Our Technical Support Department can be reached at 1-508-746-7341 or by email at tech@microway.com.

MPI Consulting Services

Microway has been developing and writing parallel codes for over 15 years. We have a group of in-house Linux and MPI experts who can help you overcome problems with your MPI codes and the clusters you are running them on.

About Microway

Microway designs, builds, and services high quality clusters, workstations, and networks for High Performance Computing. Our internet address is microway.com. Our telephone number is 1-508-746-7341.

Microway provides our customers with leading edge technologies for high performance computing solutions. We establish and maintain industry recognized products and expertise for cluster interconnect, cluster management and HPC storage solutions.

Microway's reputation as the world leader in innovative solutions for High Performance Technical Computing has been unchallenged since 1982, when our software made it possible to use the 8087 math coprocessor in the IBM PC. Our products consistently receive excellent reviews, our prices are competitive, and our service and technical support are outstanding. Microway's top notch Research and Development Staff keeps you on the leading edge of technology with timely, powerful new products. At Microway our customers are treated as our most valuable resource, which is why our customer base remains strong and continues to grow.

Microway's products include clusters, silent workstations, InfiniBand-based switches, multi-function HCAs and storage solutions. Our clusters incorporate AMD Quad-Core Opterons, Intel Quad-Core Xeons, NVIDIA Tesla GPU products, and Mellanox silicon.

Designed and developed in-house, Microway software includes MCMS cluster management tools; InfiniScope™ InfiniBand diagnostic software; and MPI Link-Checker™ MPI diagnostic tool. Microway's Linux-based clusters and data solutions are used by customers in life sciences, academia, enterprise and government research laboratories.

Our industry recognized trademarks include Microway, FasTree, InfiniScope, MPI Link-Checker, Navion, TriCom, WhisperStation, NumberSmasher, NodeWatch, ServaStor, and Quadputer.

The technical staff at Microway is qualified to assist you in benchmarking and speeding up your existing code and enhancing your present software and hardware investment. Our staff has over 50 years combined experience in designing Linux cluster configurations. We offer white papers on this web site, as well as technical documentation of the hardware and software we design and integrate. To design your next custom system or cluster, please call our Sales Department at 1-508-746-7341. Our Technical Support Department can be reached at the same number or via email at tech@microway.com.

For more than twenty-six years, the employees at Microway have earned our reputation for excellence. We are proud of this reputation and totally committed to designing innovative products that provide state of the art solutions required to keep our customers on the leading edge of technology.

Microway ... technology you can count on, since 1982.

Microway: Technology you can count on

Revised March 2009