Microway, Inc.
Version 3.2.4
Copyright 2006-2009
You must have the OFED (Open Fabrics Enterprise Distribution) software installed on the cluster. InfiniScope has been tested with versions 1.0 through 1.3 of OFED. Three things from the OFED distribution are required:
Previous versions of InfiniScope required that Python and wxPython be installed, but this is no longer the case.
To run flop, the Fabric Loading Program, you must have MPI
installed. Any version should work, but of course it must be an
InfiniBand-capable version like mvapich if you want to exercise the
InfiniBand fabric.
Installation
InfiniScope is distributed as a VMware image. After installation, it consists of a Python/wxPython-based interface program. The interface program runs under VMware, and is preinstalled in the VMware image.
A separate InfiniScope collector program (named iscollect) must be installed on your cluster. It is distributed as obfuscated C source code, which is compiled on your cluster with a C compiler. You do not have to modify the source.
A separate “Fabric LOading Program” (named flop) is included in the InfiniScope distribution. It is a C/MPI program distributed as obfuscated source code, which is compiled on your cluster with an mpicc wrapper around a C compiler.
InfiniScope is distributed as a VMware image on a DVD. The DVD contains a README file with complete installation and configuration instructions. In summary:
Make sure that /usr/local/microway/iscollect is available on all nodes of the cluster. This is not an MPI program, but it must run on all nodes of the cluster to collect data from the InfiniBand ports. It must be owned by root, with group "infiniscope", and protection 4750. Setuid root is required to read and write the port registers on the InfiniBand HCAs and switches.
If you did not do so while running the installation script, add users to the infiniscope group by editing /etc/group.
Put /usr/local/microway in the path for all users who are going to be using InfiniScope.
If you did not do so while running the installation script, modify the startup file for the OpenSM subnet manager, /etc/opensm.conf, so that the VERBOSE option contains the string "-D 0x40".
# -D 0x40 writes out the forwarding tables for InfiniScope
VERBOSE="-D 0x40"
This will cause the subnet manager to write two files every time it runs, /tmp/subnet.lst (the subnet topology file) and /tmp/osm.fdbs (the forwarding table listing). (The files may be in /var/log instead of /tmp, and may have different names, for different versions of OFED.)
Restart the subnet manager after changing the startup file.
service opensmd restart
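The installation steps above can be checked afterwards with a few small shell functions. This is only a sketch: the function names (check_collector, in_path, has_fdb_option) are ours and are not part of the InfiniScope distribution, and GNU stat and grep are assumed.

```sh
#!/bin/sh
# Post-install checks for the summary steps above.
# Assumes GNU stat and a grep that supports -E and -q.

# check_collector FILE [OWNER] [GROUP] [MODE]
# Succeeds if FILE has the given owner, group, and octal mode.
# The defaults are the values the text requires for iscollect.
check_collector() {
    file=$1
    want="${2:-root} ${3:-infiniscope} ${4:-4750}"
    # %U = owner name, %G = group name, %a = mode in octal (GNU stat)
    have=$(stat -c '%U %G %a' "$file") || return 2
    if [ "$have" = "$want" ]; then
        echo "ok: $file"
    else
        echo "mismatch: $file is '$have', expected '$want'"
        return 1
    fi
}

# in_path DIR
# Succeeds if DIR is one of the entries in $PATH.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# has_fdb_option FILE
# Succeeds if FILE has an uncommented VERBOSE= line containing "-D 0x40".
has_fdb_option() {
    grep -Eq '^[[:space:]]*VERBOSE=.*-D 0x40' "$1"
}
```

On each node you might then run "check_collector /usr/local/microway/iscollect" and "in_path /usr/local/microway", and on the subnet manager host "has_fdb_option /etc/opensm.conf". A mismatch in the first check is normally repaired as root with chown root:infiniscope followed by chmod 4750.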
To uninstall InfiniScope, delete the VMware image. Also, remove the
installation directory /usr/local/microway and its contents on all
nodes. (Don't do this if you still want to keep using our other
cluster diagnostic tool, MPI Link-Checker.)
Starting InfiniScope
On the VMware image, make sure /usr/local/microway is in your
path. Start InfiniScope by typing "is".
Several command line options may be used. They may be combined and used in any order, although some combinations do not make sense. You may, but need not, put a space between a single-letter abbreviation and a following value. You may put either a space or an equal sign between an extended option name and a following value.
Note that most of the command line options may be specified in the configuration file. For options that change only rarely, using the configuration file will probably be more convenient.
is -h | --help
or: is -v | --version
or: is [-i | --init-file <configuration file name>]
       [-d | --data-file <input data file name>]
       [-o | --output <output data file name>]
       [-A | --ascii]
       [-m | --map <name of file containing displayed name map>]
       [-g | --gui-hostname <GUI host name>]
       [-C | --collector-ssh-command "<command for starting InfiniScope data collector>"]
       [-F | --feeder-ssh-command "<command for starting InfiniScope data feeders>"]
       [-p | --collector-program-name <collector program name>]
       [-s | --single-collector]
       [-x | --excluded-nodes "<node node ... (nodes not running collector)>"]
       [-a | --allow-multiples]
       [-e | --expansion <fabric display box expansion factor>]
       [-n | --no-logging]
       [-L | --looping]
Here is more detail on each of the options:
The top row of boxes shows the color and shape scale for the port bandwidth display. The next row of boxes (or more than one row if there are more than 24 HCA ports) shows the measured short term bandwidth for each port of each host. The remaining rows of boxes show the bandwidth for each port of each switch, one switch per line.
The boxes at both ends of a DDR connection are outlined in blue. The boxes at the ends of an SDR connection are outlined in red. (QDR connections, when they become available, will be outlined in green.)
The boxes at both ends of a 1X connection and the line between them will flash yellow and black.
If a port has seen any hardware errors, there will be a red "E" over its box. Errors already present in the hardware port registers when you start InfiniScope are suppressed, as well as many errors that occur when the fabric topology changes.
By moving the mouse over a box, you select the port corresponding to that box for the graph at the bottom. Instead of selecting a single port, you may select a switch by moving the mouse over its label on the left, or you may select "all hosts" by moving the mouse over the "Hosts" label on the left. The selected box (port, switch, or all hosts) will be outlined with a colored square or rectangle. If a selected port has seen hardware errors, details of the errors will be shown in the graph in the lower part of the display (see below).
If you click a box, that box will be locked as the selection. Clicking again anywhere unlocks the selection.
If a switch is selected, there will be colored lines showing all connections to that switch. Unless the fabric is too big, the colors of the lines change randomly so that they can be distinguished from each other more easily.
Similarly, if "All hosts" is selected, there will be colored lines showing all connections to HCAs. Again, the colors of the lines change randomly unless the fabric is too big.
If a single port is selected, there will be a colored line showing the connection to that port, and the port at the other end of the connection will be outlined with a gray square.
If an HCA port is selected, all switch ports on paths to the selected port will be outlined with a gray square. (Only paths to the port are marked, not paths from it, because almost every port beyond the first switch can occur on some path from a port.) This is just some extra information that might be helpful in determining the cause of unexpectedly poor behavior.
If an HCA port is locked as the selection, you can move the mouse over a different HCA port. Doing so will cause the path through the fabric from the selected port to the port under the mouse to be displayed as a set of colored lines.
If you click on the graph, the data for the selected port will be cleared and the graph will be restarted. This can be useful to rescale the graph. If you double-click on the graph, the data for all ports will be cleared.
If you right-click on the graph, it will be frozen. (Data collection will continue.) Right-clicking again, or moving the mouse outside the graph area, will unfreeze the graph.
If a port is selected and the port has seen hardware errors, the type and number of errors will be shown at the bottom of the graph, using different colored text for different error types. The indication will be something like “Sym:12/561”, indicating that a total of 561 symbol errors were observed during 12 different queries.
The times of the errors will be indicated on the graph with colored marks, the color of a mark corresponding to the color of the text for the corresponding error count. Full details of errors will also be written to the console and the log file.
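To make the notation concrete, the following sketch splits an indication such as "Sym:12/561" into its three parts. It only illustrates how to read the label shown on the graph. The function name describe_error_count is hypothetical, and nothing here assumes that the console or log file uses the same format.

```sh
# describe_error_count TEXT
# TEXT has the form Label:Queries/Total, for example "Sym:12/561".
describe_error_count() {
    label=${1%%:*}         # part before the colon, e.g. "Sym"
    counts=${1#*:}         # part after the colon, e.g. "12/561"
    queries=${counts%%/*}  # number of queries in which errors were seen
    total=${counts#*/}     # total number of errors observed
    echo "$label: $total errors seen in $queries queries"
}
```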
To stop recording, click the "End recording" item of the Save data
menu. You can resume recording later if you wish. The recorded data
will not show a gap, and the times on the graph scale may be wrong.
Control buttons
Clicking the Gather button unfreezes the display and resumes gathering data. All the data history will be reset. When you click it, the Gather button returns to being a Freeze button.
Changing the measurement interval or clicking the Reset button also unfreezes the display, resumes data gathering, and changes the Gather button to a Freeze button.
If you set a very short interval (a few milliseconds), it may be impossible to query the fabric often enough. In this case, some measurements will be skipped to try to keep up with the measurement clock. The bandwidth graphs may appear jumpy in this case.
If you change the measurement interval, the graph scale label will change accordingly, maintaining the same degree of averaging. The data will be reset.
If you are scanning through the ports, clicking on the Reverse button will cause the scan to change direction and move backward through the ports. The Reverse button becomes a Stop scanning button.
If you are scanning the ports in reverse order, clicking on the Stop scanning button stops the scanning. The Stop scanning button becomes a Scan ports button again.
If you are scanning or reverse-scanning, all the ports' boxes will have small black check marks in them. If you right-click on any port box, the check mark will become a small red 'X', and the corresponding port will be excluded from the scan. You may exclude as many ports as you wish, except that there will always be at least one port being scanned. Right-clicking again will change the 'X' back to a check mark and will reengage scanning.
If you are scanning or reverse-scanning, right-clicking the "Hosts" box will toggle scanning for all HCA ports. Similarly, if you are scanning or reverse-scanning, right-clicking a switch's box will toggle scanning for all ports on the switch.
If you are scanning or reverse-scanning, two buttons will appear next to the scan time buttons. Clicking the upper "SCAN ALL" button will cause all ports to be included in the scan. Clicking the lower "TOGGLE" button will toggle scanning (off-to-on and on-to-off) for each port, leaving at least one port being scanned.
The time for reverse-scanning is fixed at 0.75 second.
Messages
Messages are printed at the terminal when InfiniScope is
started or stopped and when fabric errors occur. Whenever the fabric
is redrawn because the subnet manager detected a change in the fabric,
the connections from all switch ports are shown. In addition, a
notice is printed once every hour, so that if the fabric dies, you
will know approximately when it died. The same messages are appended
to the log file /tmp/infiniscope.log.
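Because a notice is written once every hour, the age of the log file gives a rough liveness check. Below is a minimal sketch, assuming GNU stat and the default log path named above. The function name log_age_seconds is ours, not part of InfiniScope.

```sh
# log_age_seconds [FILE]
# Prints the number of seconds since FILE was last modified.
# FILE defaults to the InfiniScope log named in the text.
log_age_seconds() {
    log=${1:-/tmp/infiniscope.log}
    now=$(date +%s)
    # %Y = last modification time in seconds since the epoch (GNU stat)
    mtime=$(stat -c %Y "$log") || return 1
    echo $((now - mtime))
}
```

While InfiniScope is running with logging enabled, a result much larger than 3600 suggests that the hourly notices have stopped.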
Known restrictions
Sometimes when you start InfiniScope right after rebooting the
cluster, the host names are not correct: some of them may appear as
"25204", or InfiniScope may say that the subnet manager could not
determine the correct names of all the hosts. If this happens, simply
restart the subnet manager. You may also be able to get around the
problem by using the name map (-m) command line option.
With a very short measurement interval (1 ms), the label on the
Scan/Reverse/Stop scanning button may not be displayed, although it
still works properly.
Troubleshooting
Occasionally InfiniScope gets into an incorrect state.
To get it working again, stop it. Stop iscollect if any
instances of it are still running on the cluster nodes. Restart the
subnet manager, then restart InfiniScope.
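The recovery sequence above can be written out as a list of commands for review before anything is run. This is a sketch under several assumptions: the node names are placeholders, ssh access to the nodes is available, pkill is installed there, you are allowed to signal the iscollect processes, and the subnet manager is restarted from the host where it runs. The function name recovery_commands is hypothetical.

```sh
# recovery_commands NODE...
# Prints, without executing them, the commands that stop leftover
# iscollect instances on each NODE and then restart the subnet manager.
recovery_commands() {
    for node in "$@"; do
        # pkill -x matches the process name exactly
        echo "ssh $node pkill -x iscollect"
    done
    echo "service opensmd restart"
}
```

For example, "recovery_commands node01 node02" prints two ssh lines followed by the restart command. Stop InfiniScope before running them, and restart it afterwards.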
If you have to restore a node (for example, because of a hard disk
failure), be sure that the current versions of iscollect and
flop are copied onto the new disk.
Flop - the Fabric LOading Program
Flop is an MPI application that generates traffic to load the fabric
in various ways. It must be compiled using mpicc and run using mpirun
in the usual way. The mode of operation is specified using a
command-line option.
The usage for flop is:
mpirun -np <n> /usr/local/microway/flop [-f|-h|-r|-a|-c|-s|-l] [ranks for -s option] [message_size]
If you normally start MPI programs in a different way, start flop the same way. The options are:
-f  Full-duplex round robin
-h  Half-duplex round robin
-r  Ring
-a  All-to-all
-c  Cycle
-s  Select send/receive rank pairs
-l  List ranks and hosts
In all tests, you can specify the message size in bytes as the last argument. The default message size for all tests except the all-to-all test is 5000000 bytes. Each message is repeated 1000 times in each round.
In the full-duplex round robin test, each node is both sending and receiving at the same time. (One node is inactive in each round if there is an odd number of nodes.)
In the half-duplex round robin test, half the nodes are sending and half are receiving in each round. (One node is inactive in each round if there is an odd number of nodes.)
In the ring test, each node is receiving from one neighbor and sending to another, in a ring covering the entire cluster. The bandwidth of this test is limited to that of the slowest connection.
In the all-to-all test, there is just a single MPI_Alltoall() covering the entire cluster. This test requires more memory than the others, so the default message size is 1000000 bytes. You may have to make it even smaller for a large cluster.
In the cycle test, only one node is sending to only one other node in each round, but eventually every node will send to and receive from every other node.
In the selected ranks test, you can specify precise send/receive pairs, according to their MPI ranks. The ranks are specified in pairs, alternating between send rank and receive rank. Of course, there must be an even number of ranks listed. A node can be sending and receiving at the same time, but not safely more than one of each. To find the MPI rank of each node, run flop with the "-l" option. If you want to specify the message size, put it after the list of ranks.
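The rules for the rank list can be checked before launching a job. The sketch below tests that the list has an even number of entries and that no rank sends more than once or receives more than once. The function name check_rank_pairs is ours and is not part of flop. Pass it only the ranks, not the message size.

```sh
# check_rank_pairs SEND RECV [SEND RECV ...]
# Prints "ok" if the list is a valid set of send/receive pairs,
# otherwise prints an error and returns 1.
check_rank_pairs() {
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "error: odd number of ranks"
        return 1
    fi
    senders=" "
    receivers=" "
    while [ $# -gt 0 ]; do
        s=$1
        r=$2
        shift 2
        case "$senders" in
            *" $s "*) echo "error: rank $s sends more than once"; return 1 ;;
        esac
        case "$receivers" in
            *" $r "*) echo "error: rank $r receives more than once"; return 1 ;;
        esac
        senders="$senders$s "
        receivers="$receivers$r "
    done
    echo "ok"
}
```

For instance, the list "0 1 2 3" passes, and following the usage line above it would be run as "mpirun -np 4 /usr/local/microway/flop -s 0 1 2 3 1000000": rank 0 sends to rank 1 and rank 2 sends to rank 3, with 1000000-byte messages. The list "0 1 0 2" is rejected because rank 0 would send twice.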
flop doesn't do anything useful; it just loads the fabric by
sending random data around in the specified pattern. It doesn't report
anything either, except for the name of the test it is performing.
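The ring and cycle patterns described above can be pictured with a short sketch that prints which rank talks to which. This is an illustration only, not flop's source code: the direction of the ring and the order of the cycle rounds are assumptions, and the function names are hypothetical.

```sh
# ring_neighbors N
# For N ranks in a ring, prints the rank each one sends to and the rank
# it receives from (assumed direction: rank i sends to rank i+1).
ring_neighbors() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "rank $i: sends to $(( (i + 1) % n )), receives from $(( (i + n - 1) % n ))"
        i=$((i + 1))
    done
}

# cycle_pairs N
# Prints one "sender -> receiver" line per round, so that every rank
# eventually sends to every other rank (assumed order: by sender).
cycle_pairs() {
    n=$1
    s=0
    while [ "$s" -lt "$n" ]; do
        r=0
        while [ "$r" -lt "$n" ]; do
            if [ "$s" -ne "$r" ]; then
                echo "$s -> $r"
            fi
            r=$((r + 1))
        done
        s=$((s + 1))
    done
}
```

With N ranks the cycle pattern needs N*(N-1) rounds, one for each ordered pair, which is why it takes much longer than the ring test on a large cluster.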
About Microway
Microway designs, builds, and services high quality clusters, workstations, and networks for High Performance Computing. Our internet address is microway.com. Our telephone number is 1-508-746-7341.
Microway provides our customers with leading edge technologies for high performance computing solutions. We establish and maintain industry recognized products and expertise for cluster interconnect, cluster management and HPC storage solutions.
Microway's reputation as the world leader in innovative solutions for High Performance Technical Computing has been unchallenged since 1982, when our software made it possible to use the 8087 math coprocessor in the IBM PC. Our products consistently receive excellent reviews, our prices are competitive, and our service and technical support are outstanding. Microway's top notch Research and Development Staff keeps you on the leading edge of technology with timely, powerful new products. At Microway our customers are treated as our most valuable resource, which is why our customer base remains strong and continues to grow.
Microway's products include clusters, silent workstations, InfiniBand-based switches, multi-function HCAs and storage solutions. Our clusters incorporate AMD Quad-Core Opterons, Intel Quad-Core Xeons, NVIDIA Tesla GPU products, and Mellanox silicon.
Designed and developed in-house, Microway software includes MCMS cluster management tools; InfiniScope™ InfiniBand diagnostic software; and MPI Link-Checker™ MPI diagnostic tool. Microway's Linux-based clusters and data solutions are used by customers in life sciences, academia, enterprise and government research laboratories.
Our industry recognized trademarks include Microway, FasTree, InfiniScope, MPI Link-Checker, Navion, TriCom, WhisperStation, NumberSmasher, NodeWatch, ServaStor, and Quadputer.
The technical staff at Microway is qualified to assist you in benchmarking and speeding up your existing code and enhancing your present software and hardware investment. Our staff has over 50 years combined experience in designing Linux cluster configurations. We offer white papers on this web site, as well as technical documentation of the hardware and software we design and integrate. To design your next custom system or cluster, please call our Sales Department at 1-508-746-7341. Our Technical Support Department can be reached at the same number or via email at tech@microway.com.
For more than twenty-six years, the employees at Microway have earned our reputation for excellence. We are proud of this reputation and totally committed to designing innovative products that provide state of the art solutions required to keep our customers on the leading edge of technology.
Microway ... technology you can count on, since 1982.