Monitoring an Email Installation

Monitoring is a critical part of maintaining a stable email installation. When done properly, it can provide the site operators with:

  1. Early detection of problems in the site.
  2. Information that can be used to diagnose a problem and isolate failures.
  3. Warnings of email abuse from internal users or external sites.
  4. Information that can be used for capacity planning, thereby avoiding unexpected system overloads.

This document describes some of the basics of monitoring Sun ONE Messaging Server email systems.

Example Email Deployment

While each email deployment is slightly different, the diagram below represents a basic email deployment.

               +---------+
               |         |
               |   ISP   |
               |Customers|                             .++++++++++.
               |         |                           .++++++++++++++.
               +------^--+                          ++++          ++++
                      |                             +++  INTERNET  +++
                 .____:                          _. ++++          ++++
                 v                              .'|  ++++++++++++++++
            --------:--------                 .'       ++++++++++++
                    |                       .'               |
                    v                     .'                 |
          +-------------------+         .'        ___________v___________
          |   load balancer   |       .'               |           |
          +-------------------+     .'                 |           |
                    |             .'                   |           |
____________________|___________.'_____________    +--------+ +--------+
      |          |          |          |           |        | |        |
      |          |          |          |           |  mta3  | |  mta4  |
 +--------+ +--------+ +--------+ +--------+       |        | |        |
 |        | |        | |        | |        |       +--------+ +--------+
 |  mmp1  | |  mmp2  | |  mta1  | |  mta2  |           |           |
 |        | |        | |        | |        |           |           |
 +--------+ +--------+ +--------+ +--------+           |           |
      |          |          |          |               |           |
______|__________|__________|__________|_______________|___________|______
        |           |           |                  |             |
        |           |           |                  |             |
        |           |           |             ,---------.   ,---------.
        |           |           |            |           | |           |
    ,-------.   ,-------.   ,-------.        |`---------'| |`---------'|
   |         | |         | |         |       |           | |           |
   |`-------'| |`-------'| |`-------'|       |  Store1   | |  Store2   |
   |  ldap1  | |  ldap2  | |  ldap3  |       |           | |           |
   |         | |         | |         |       |           | |           |
    `-------'   `-------'   `-------'         `---------'   `---------'

In this example, the customers of the ISP submit and retrieve email through the load balancer. The load balancer shields the email user from the underlying site architecture and helps provide a highly available email service. Email submitted to the site takes three potential routes:

  1. Mail from ISP customer to another ISP customer:

    load balancer -> mta[1,2] -> store[1,2]

  2. Mail from ISP customer to external recipient:

    load balancer -> mta[1,2] -> Internet

  3. Mail from external sender to ISP customer:

    mta[3,4] -> store[1,2]

Mail retrieval via POP or IMAP always goes through the same path:

    load balancer <-> mmp[1,2] <-> store[1,2]

This document describes how to monitor this basic deployment. It also suggests additional site monitoring that is essential to maintaining a well-behaved email installation.

Monitoring the Health of the Email Services

The basic technique presented here for monitoring the health of the site is to use the site much as the customers do: send and retrieve email. To do this, at least one test email account must be set up on each of the message store machines. A system must then be put into place to exercise and measure all of the potential message flow paths through the systems. When a failure occurs at the customer-visible entry points (mta1-4 or mmp1-2), an automated attempt to isolate the failure to a particular system should take place and an alert should be sent. Ideally, the alert should not be sent using any of the site's email infrastructure, so that a system failure cannot prevent the alert from going out.

The next two sections describe a collection of scripts and programs that have been developed and deployed to monitor (Sun ONE) Messaging Server email installations. Many of the scripts are generic and only rely on the existence of the basic protocols used in all email installations (POP, SMTP, and LDAP).

These scripts and programs are based on material from a variety of sources. Some were adapted from scripts that were used to monitor SIMS installations; others were developed on site during customer escalations. They are supplied on an as-is basis.

Support Scripts

There are two types of scripts: top-level scripts and support scripts. This section describes the support scripts. Under normal circumstances, these scripts are not called directly from the command line or from the crontab entries set up to monitor the site; they are called from the top-level scripts. They are described here because they implement functions that may be useful if the top-level scripts are rewritten, reconfigured, or enhanced. The support scripts are found in the lib/ subdirectory or the etc/ subdirectory.

etc/alarms.cfg

This is really a configuration file. More global configuration could be moved here, but for now it defines the following variables that are used by one or more scripts:

etc/sequence.cfg

The sequence.cfg file is a configuration file that tells the runmonitor program how to roll through the message stores, MTA routers, and MMPs on a regular basis. The basic strategy is to use as many combinations of the three as possible in order to cover all the message flow paths. It may not be obvious at first, but the message store being tested is determined by the test user used in any given test.

The format of the entries in this file is:

#
<minute> DMSG -u <store> -s <MTA Relay> -p <MMP>
05 DMSG -u ims1 -s ssmtp01 -p smux01
10 DMSG -u ims2 -s ssmtp01 -p smux01
...
55 DMSG -u ims4 -s ssmtp04 -p smux02
#
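
As an illustration of the entry format above, the following Python sketch parses such lines into structured records and looks up the entry for a given minute. It is not the parser used by runmonitor or check_rndtrip; the names SequenceEntry, parse_sequence, and entry_for_minute are hypothetical, and treating lines that begin with # as comments is an assumption made for this example.

```python
import re
from typing import NamedTuple, Optional


class SequenceEntry(NamedTuple):
    minute: int       # minute of the hour at which the test runs
    store_user: str   # test user, which also selects the message store
    mta: str          # MTA relay that receives the test message
    mmp: str          # MMP used to retrieve the test message


# Matches lines such as: 05 DMSG -u ims1 -s ssmtp01 -p smux01
ENTRY_RE = re.compile(
    r"^(\d{1,2})\s+DMSG\s+-u\s+(\S+)\s+-s\s+(\S+)\s+-p\s+(\S+)\s*$"
)


def parse_sequence(text: str) -> list:
    """Return the entries found in text, skipping anything that does not match."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and # lines carry no entry
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # unrecognized line; a stricter parser might raise here
        minute, user, mta, mmp = match.groups()
        entries.append(SequenceEntry(int(minute), user, mta, mmp))
    return entries


def entry_for_minute(entries, minute: int) -> Optional[SequenceEntry]:
    """Return the entry scheduled for the given minute, or None."""
    for entry in entries:
        if entry.minute == minute:
            return entry
    return None
```

A caller could read etc/sequence.cfg, pass its contents to parse_sequence, and use entry_for_minute with the current minute to decide which store, MTA, and MMP combination to test.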

lib/sendreport

This script is used to send out the alarms and to mail out the nightly reports. It avoids using the local email infrastructure by sending email directly to the MX of each recipient's domain. It does this by examining the email address of each recipient, looking up the MX records for the domain part of the address, prioritizing the listed MXs, and attempting to send to each host in this prioritized list until it succeeds in sending the message.
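
The prioritize-and-retry behavior described above can be sketched in Python as follows. This is not the sendreport implementation. The MX lookup itself is left out because the Python standard library has no MX resolver, so the sketch assumes the records are already available as (preference, host) pairs. The function names and the send callable are hypothetical.

```python
def order_mx(records):
    """Sort (preference, host) pairs so that lower preference values come first."""
    return [host for _, host in sorted(records)]


def deliver_via_mx(records, send):
    """Try each MX host in priority order.

    send is a caller-supplied function (hypothetical here) that delivers the
    message to one host and raises OSError on failure. smtplib errors are
    subclasses of OSError, so an smtplib-based sender fits this contract.
    Returns the host that accepted the message, or None if every host failed.
    """
    for host in order_mx(records):
        try:
            send(host)
            return host
        except OSError:
            continue  # this MX is unreachable or refused; try the next one
    return None
```

For example, with the records [(20, "mx2.example.com"), (10, "mx1.example.com")], delivery is attempted to mx1.example.com first and falls back to mx2.example.com only if the first attempt raises an error.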

The following options are supported:

lib/rotatelogs

This script rotates the logs. It is usually called when the nightly report is mailed out. Calling this script allows a monitoring installation to keep a history of the daily logs for a configurable number of days before they are removed. The script will likely require some customization to match where the logs are kept on each system. The number of old log files to keep is a configuration variable defined in the script and defaults to 30.
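
A minimal Python sketch of this kind of rotation is shown below. It is not the rotatelogs script itself: the numbered-suffix naming (systest.log.1, systest.log.2, and so on) and the function name rotate are assumptions made for illustration. Only the default of 30 retained files comes from the description above.

```python
import os


def rotate(path, keep=30):
    """Rotate path to path.1, path.1 to path.2, and so on.

    At most keep old copies are retained; the oldest one is removed.
    """
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    # Shift the existing copies up by one, starting from the oldest.
    for n in range(keep - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{n + 1}")
    # Finally move the current log out of the way.
    if os.path.exists(path):
        os.replace(path, f"{path}.1")
```

Shifting from the oldest copy downward means no file is overwritten before it has been moved.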

The following options are supported:

Top Level Scripts

These scripts are the top-level scripts used to monitor the site. Some of the scripts assess the health of the system, attempt to isolate the failure to a particular machine and protocol, and send out alerts. The results are logged to a file and sent to a list of email addresses each day. Other scripts parse through the SMTP logs and report message rate and size information.

bin/check_files

The check_files script warns if checkpoint log files are accumulating in the mboxlist directory. It sends a warning-level or panic-level message if the buildup is too large.

bin/check_mconn

The check_mconn script verifies that the SMTP server is responsive and measures the response times of each step of the SMTP dialogue.
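
To illustrate what timing an SMTP dialogue can look like, here is a small Python sketch built on the standard smtplib module. It is not the check_mconn script. The choice of steps (connect, HELO, NOOP, QUIT), the function name, the local host name monitor.invalid, and the optional client parameter (which allows a stub to be substituted) are all assumptions of this example.

```python
import smtplib
import time


def time_smtp_dialogue(host, port=25, timeout=30.0, client=None):
    """Return a list of (step, seconds, reply_code) for a short SMTP dialogue."""
    if client is None:
        # Created without a host so that the connect step can be timed separately.
        client = smtplib.SMTP(timeout=timeout, local_hostname="monitor.invalid")
    timings = []

    def timed(step, action):
        start = time.monotonic()
        code = action()
        timings.append((step, time.monotonic() - start, code))

    try:
        # Each smtplib call returns a (code, message) pair; keep only the code.
        timed("connect", lambda: client.connect(host, port)[0])
        timed("helo", lambda: client.helo()[0])
        timed("noop", lambda: client.noop()[0])
        timed("quit", lambda: client.quit()[0])
    finally:
        client.close()  # harmless if quit() already closed the connection
    return timings
```

A monitoring wrapper could compare each step's elapsed time against a site-chosen limit and treat an exception (connection refused, timeout) as an unresponsive server.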

bin/check_queues

The check_queues script logs the total number of messages in the active and deferred queues, plus a number of system statistics including available swap, scan rate, and CPU statistics. If the number of messages in any of the queues exceeds a threshold (queuewarnlimit or queuepaniclimit) specified in alarms.cfg, an alert is sent to the list of users (queuenotifyaddress) in alarms.cfg. The results of each invocation are logged to a file. When the script is invoked with the -r option, the file is mailed to the list of email addresses (reportrcpts) specified in alarms.cfg.
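
The threshold check described above can be sketched in a few lines of Python. This is an illustration only, not the check_queues code: the two limits stand in for queuewarnlimit and queuepaniclimit from alarms.cfg, while the function names, the level names, and the numeric values in the usage note are hypothetical.

```python
LEVELS = ["ok", "warning", "panic"]  # ordered from least to most severe


def classify_queue(count, warn_limit, panic_limit):
    """Return the alert level for one queue's message count."""
    if count > panic_limit:
        return "panic"
    if count > warn_limit:
        return "warning"
    return "ok"


def worst_level(queue_counts, warn_limit, panic_limit):
    """Return the most severe level across all queues (for example active, deferred)."""
    levels = [
        classify_queue(count, warn_limit, panic_limit)
        for count in queue_counts.values()
    ]
    return max(levels, key=LEVELS.index, default="ok")
```

With limits of 500 and 2000, for instance, worst_level({"active": 3, "deferred": 2500}, 500, 2000) returns "panic", which a caller would use to decide whether to notify the addresses in queuenotifyaddress.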

The options supported are:

Below is an example of the qmonitor.log file:

DATE COLLECTED: 10 Jan 2001
instance name = msg-smtp1
+---------------------------------------------------------------------------------------------------+
| System and Queue monitoring data                                                                  |
+-----+----------------+------------+---------+----------------------------+------------------------+
|     | Msgs in Queue  | Available  |         |            CPU             |   TCP/IP Connections   |
|Time | act   |  held  |   swap     |Scanrate |  us  |  sys |  wt  |   id  |smtp|imap|pop3|ldap|http|
+-----+-------+--------+------------+---------+------+------+------+-------+----+----+----+----+----+
|00:10|     0 |      0 |   2005656k |     0   |    0 |    0 |    2 |    97 |   0|  50|   0|  55|   0|
|00:15|     1 |      0 |   2003760k |     0   |    1 |    2 |    3 |    94 |   0|  51|   0|  55|   0|
|00:20|     0 |      0 |   1992640k |     0   |    0 |    0 |    3 |    96 |   0|  52|   0|  56|   0|
|00:25|     0 |      0 |   2004280k |     0   |    0 |    0 |    6 |    94 |   0|  50|   0|  55|   0|
|00:30|     0 |      0 |   2004552k |     0   |    0 |    1 |    5 |    94 |   0|  50|   0|  55|   0|
|00:35|     1 |      0 |   1992216k |     0   |    0 |    0 |    4 |    96 |   0|  48|   0|  56|   0|
|00:40|     0 |      0 |   2018200k |     0   |    0 |    1 |    5 |    94 |   0|  48|   0|  54|   0|
|00:45|     0 |      0 |   2005648k |     0   |    1 |    1 |    5 |    93 |   0|  48|   0|  55|   0|
|00:50|     0 |      0 |   2004264k |     0   |    0 |    0 |    4 |    95 |   0|  48|   0|  55|   0|
|00:55|     1 |      0 |   2005064k |     0   |    0 |    0 |    3 |    96 |   0|  49|   0|  55|   0|
|01:00|     0 |      0 |   1993824k |     0   |    6 |    4 |   46 |    44 |   0|  50|   0|  55|   0|
|01:05|     1 |      0 |   1981784k |     0   |    1 |    3 |    2 |    95 |   0|  51|   0|  56|   0|
|01:10|     0 |      0 |   1985232k |     0   |    1 |    4 |    5 |    91 |   0|  49|   0|  55|   0|
|01:15|     0 |      0 |   1972688k |     0   |    1 |    2 |    3 |    94 |   0|  50|   0|  56|   0|
|01:20|     0 |      0 |   2003048k |     0   |    1 |    4 |   45 |    50 |   0|  50|   0|  54|   0|
|01:25|     0 |      0 |   1994032k |     0   |    1 |    2 |   49 |    48 |   0|  50|   0|  55|   0|
|01:30|     0 |      0 |   2003240k |     0   |    2 |    2 |    5 |    92 |   0|  51|   0|  54|   0|
|01:35|     0 |      0 |   2005624k |     0   |    1 |    5 |   45 |    49 |   0|  51|   0|  54|   0|
|01:40|     1 |      0 |   1979776k |     0   |    1 |    3 |   35 |    61 |   0|  51|   0|  56|   0|
|01:45|     1 |      0 |   1976952k |     0   |   28 |   20 |   20 |    32 |   0|  52|   0|  56|   0|
|01:50|     0 |      0 |   1983448k |     0   |    4 |    9 |   24 |    62 |   0|  58|   0|  55|   0|
|01:55|     0 |      0 |   1990520k |     0   |    1 |    4 |   36 |    59 |   0|  51|   0|  55|   0|
|02:00|     0 |      0 |   1987624k |     0   |    9 |    8 |   39 |    44 |   0|  59|   0|  55|   0|
|02:05|     0 |      0 |   1994216k |     0   |    3 |   12 |   12 |    72 |   1|  49|   0|  55|   0|
|02:10|     1 |      0 |   1994408k |     0   |    7 |    9 |   14 |    70 |   0|  57|   0|  55|   0|
|02:15|     0 |      0 |   2004864k |     0   |    0 |    3 |   46 |    50 |   0|  49|   0|  57|   0|

bin/check_procs

The check_procs script verifies that certain key processes are in fact running on the Messaging Server machine. It sends an alarm if any of the required processes is not running.

bin/check_rndtrip

The check_rndtrip script is a convenience script that allows the round-trip monitoring to occur with a single entry in crontab. It reads etc/sequence.cfg to determine which combination of stores, MTA routers, and MMPs it should cycle through.

bin/runmonitor

The runmonitor script tests the email site by sending a message to a specified MTA and retrieving it from a specified MMP. In the event of a failure, it attempts to diagnose and isolate the failure. The results of each test are logged to a file (systest.log), which is mailed to a list of email addresses specified in alarms.cfg when the script is invoked with the -r option.

Here is a summary of the functionality of this script when invoked in test mode (without the "-r" option).

The following options are supported:

Below is an example of the systest.log contents:

DATE COLLECTED: 10 Jan 2001
+--------------------------------------------------------+-----------------+
|                 Message Delivery time                  |                 |
+----------+-----------+------------+------------+-------+-----------------+
| Test run | Username: | Msg submit | Msg retr   | Total | Error           |
| at time: |           | SMTP:      | POP:       | Time: |                 |
+----------+-----------+------------+------------+-------+-----------------+
| 00:05:04 |      ims2 |    ssmtp02 |     smux01 |  2083 |                 |
| 00:10:03 |      ims3 |    ssmtp03 |     smux02 |   576 |                 |
| 00:15:03 |      ims4 |    ssmtp04 |     smux02 |  1715 |                 |
| 00:20:04 |      ims1 |    ssmtp02 |     smux02 |  1886 |                 |
| 00:25:03 |      ims2 |    ssmtp01 |     smux02 |  1914 |                 |
| 00:30:03 |      ims3 |    ssmtp04 |     smux01 |   614 |                 |
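
For sites that want to post-process this log, the following Python sketch extracts the data rows from the table layout shown above. It assumes the six-column layout of the sample and is not part of the monitoring scripts. The function names are hypothetical, and how a failed test is recorded in the Total column is not shown in the sample, so a non-numeric total is simply kept as None.

```python
import re

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}$")


def parse_systest_rows(text):
    """Return one dict per data row; header and border lines are skipped."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 6 or not TIME_RE.match(cells[0]):
            continue  # not a data row
        rows.append({
            "time": cells[0],
            "user": cells[1],
            "smtp": cells[2],
            "pop": cells[3],
            "total": int(cells[4]) if cells[4].isdigit() else None,
            "error": cells[5],
        })
    return rows


def slowest(rows):
    """Return the row with the largest total time, or None if there is none."""
    timed = [row for row in rows if row["total"] is not None]
    return max(timed, key=lambda row: row["total"], default=None)
```

Run against the sample above, the slowest row is the 00:05:04 test for user ims2, with a total of 2083.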

Putting it Together

Monitoring an email site such as the typical one presented in this document requires four steps:

  1. Create the dummy users (at least one per message store).
  2. Install the scripts on each system in the message site.
  3. Configure the alarms.cfg variables on each system.
  4. Add entries to crontab.

Only one system in the installation should run the check_rndtrip script. This system should have access to all of the systems in the site. All of the systems in the site should be configured to run the check_queues, check_files, check_ldap, and check_sys scripts.

Setup required for check_rndtrip

Assuming the site architecture outlined in this doc, the check_rndtrip monitoring could be configured using the following steps:

  1. Create 2 users, one on each message store machine. For example, user ims1 with mailhost of ms1 and user ims2 with mailhost of ms2.

  2. Unpack the scripts on mta1. This will be the system that executes the check_rndtrip script to monitor the health of the entire email site.

  3. Configure the attributes in the alarms.cfg. At a minimum, the following attributes must be set in order to run check_rndtrip for the site:

  4. Modify the sequence.cfg configuration file to rotate through the MTAs (mta1, mta2, mta3, and mta4), MMPs (mmp1 and mmp2), and users (ims1 and ims2).

  5. Add two crontab entries:
    0,5,10,15,20,25,30,35,40,45,50,55 * * * * <path to script>/check_rndtrip
    4 0 * * * <path to script>/check_rndtrip -r
    

Setup required for check_queues, check_procs, check_ldap, check_sys and check_files

Assuming the typical site architecture outlined in this doc, the check_queues and other scripts could be configured using the steps below. Because these scripts parse through log files, they must be installed and run on each machine.

  1. Install the scripts on all of the machines: mta1, mta2, mta3, mta4, mmp1, mmp2, ms1, and ms2.
  2. Configure the attributes in alarms.cfg. At a minimum, the following attributes must be set for each system in the site.

Setup using Big Brother System and Network Monitor

The default reporting style is "bigbrother". Edit etc/alarms.cfg to set the location of the BBHOME variable. There is no need to create an ext script unless you want the job scheduling to be run completely by Big Brother. In that case, consult the ~/plugins/plugin4bb ext example script for hints on how to use rsh or ssh (as mailsrv) to run these probes.

If you use a different reporting style by default, you can define that new reporting style in the plugins/report_formats.cfg file and at the same time modify your etc/alarms.cfg file to reflect that change.

Other Site Monitoring

The scripts described in this paper should help a sysadmin or support individual monitor and maintain a Messaging Server email infrastructure. However, other tools are also available to monitor a Messaging Server installation. For more information on some of the tools that come with iMS, please read Chapter 15 of the Messaging Server admin guide (http://docs.sun.com/source/816-6009-10/monitor.htm).