Link state issues and routing issues in Exchange 2000 Server and in Exchange Server 2003 (832281)

INTRODUCTION

This article discusses common link state issues and common routing issues that you may experience in Microsoft Exchange 2000 Server and in Microsoft Exchange Server 2003.

The purpose of a routing group

The routing group is the smallest unit of servers that are likely to always be connected to one another. The routing group can be assumed to be one node on the graph of connector paths, with multiple possible connectors between routing groups.

To configure the way that messages are routed between servers so that point-to-point connections between servers are always made, the servers must be grouped in routing groups, and the Routing Group connectors must be defined.

In a routing group, link state information updates and routing information updates are pushed between master nodes and member nodes through a persistent port 691 Transmission Control Protocol (TCP) connection. Between two routing groups, servers advertise the X-LINK2STATE verb to exchange link state information by comparing the MD5 digest in the Exchange organization information packet of the two routing group bridgeheads. A mismatch triggers an exchange of link state information between the two servers through SMTP port 25.
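
The digest comparison described above can be sketched as follows. This is an illustration only: the real format of the Exchange organization information packet is not described in this article, so the dictionary of connector states, the serialization, and the function names are assumptions.

```python
import hashlib


def link_state_digest(link_states):
    """Return an MD5 hex digest over a canonical serialization of the table."""
    # Sort the entries so that two servers that hold the same data
    # always produce the same bytes (hypothetical serialization).
    canonical = "\n".join(
        f"{connector}={state}" for connector, state in sorted(link_states.items())
    )
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_link_state_exchange(local_states, remote_digest):
    """Return True when the remote bridgehead's digest differs from the local one."""
    return link_state_digest(local_states) != remote_digest


local = {"RGC-East-West": "UP", "SMTP-Internet": "UP"}
remote = {"RGC-East-West": "DOWN", "SMTP-Internet": "UP"}

print(needs_link_state_exchange(local, link_state_digest(local)))   # False
print(needs_link_state_exchange(local, link_state_digest(remote)))  # True
```

Comparing a short digest first means that the full link state table has to be transferred only when the two bridgeheads actually disagree.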

The role of a routing group master

The routing group master coordinates changes to link states that are learned by servers in its routing group and retrieves updates from the directory service. By having a single server coordinate the changes, you can treat a routing group as a single entity for the purposes of computing a least-cost path between routing groups in an organization.
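
To illustrate what computing a least-cost path between routing groups means, the following sketch treats each routing group as one node and each connector as a directed, weighted edge, and then runs Dijkstra's algorithm. It is a simplified model, not the Exchange routing engine: the tuple layout and the routing group names are hypothetical, and real routing also considers message type and connector restrictions.

```python
import heapq


def least_cost_path(connectors, source, target):
    """Dijkstra over routing groups.

    connectors: list of (from_group, to_group, cost, state) tuples.
    Returns (total_cost, [groups on the path]) or None if no route exists.
    """
    graph = {}
    for src, dst, cost, state in connectors:
        if state == "UP":  # connectors marked DOWN are not candidates
            graph.setdefault(src, []).append((dst, cost))

    queue = [(0, source, [source])]
    visited = set()
    while queue:
        total, group, path = heapq.heappop(queue)
        if group == target:
            return total, path
        if group in visited:
            continue
        visited.add(group)
        for neighbor, cost in graph.get(group, []):
            if neighbor not in visited:
                heapq.heappush(queue, (total + cost, neighbor, path + [neighbor]))
    return None


connectors = [
    ("RG-A", "RG-B", 10, "UP"),
    ("RG-B", "RG-C", 10, "UP"),
    ("RG-A", "RG-C", 50, "UP"),
    ("RG-A", "RG-D", 5, "DOWN"),
]
print(least_cost_path(connectors, "RG-A", "RG-C"))  # (20, ['RG-A', 'RG-B', 'RG-C'])
print(least_cost_path(connectors, "RG-A", "RG-D"))  # None
```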

What occurs when the routing group master stops responding

All servers in the routing group continue to operate on the same information that they had at the time that they lost contact with the master.

When the routing group master comes back up, it examines the status of all other servers, reconstructs the link state information, processes the State Change Queue (SCQ), and then updates members in the routing group.

Common issues

The following sections present several routing issues that you may experience. Additionally, the following sections suggest methods that you can use to troubleshoot the issues.

Routing member node is not connected to master

When you use the WinRoute tool (Winroute.exe) to view Exchange organization routing, you may see the words "connected to master - NO" and a red X next to the organization's name. These words and the red X indicate that the routing member node is not connected to the master.

Image: WinRoute displays "connected to master - NO" and a red X after the organization's name.

In a routing group, the routing group nodes, including the master, must be connected to the master node on Transmission Control Protocol (TCP) port 691 to propagate routing information and link state information to and from the master node.

Note The Microsoft Exchange Server 2003 WinRoute tool (Winroute.exe), which you can use to troubleshoot routing in an Exchange 2000 and Exchange 2003 mail-handling environment, is available for download from the Microsoft Download Center.

For more information about how to download Microsoft Support files, click the following article number to view the article in the Microsoft Knowledge Base:

119591 How to obtain Microsoft support files from online services

Microsoft scanned this file for viruses. Microsoft used the most current virus-detection software that was available on the date that the file was posted. The file is stored on security-enhanced servers that help prevent any unauthorized changes to the file.

To resolve this issue, follow these steps:

Make sure that the Exchange Routing Engine service (RESvc) is started on all affected servers in the routing group and that it remains in a stable state. If the service is in an unstable state, the server may not connect to master nodes. Investigate the root cause of any unstable services before you go to the next step.
Verify that a firewall does not restrict TCP port 691. To do this, initiate a Telnet session to port 691 on the affected servers and on the master node. A Microsoft Routing Engine banner indicates an active state.
At the command prompt, run the netstat -a -n command. The output of this command reveals all member nodes and the master itself connecting to TCP port 691 on the master node.
In Event Viewer, check the application logs for any events that indicate a failure to authenticate by using the computer account, such as Domain\serverName$. Events such as Transport events 962 and 961 indicate a failure of the RESvc service to connect.
Verify that the affected servers or the Exchange Domain Server group that they belong to do not have the SendAs right missing, denied, or denied from a nested membership of another group. To do this, run the Exchange Trace Utility (Regtrace.exe), and then restart the RESvc service. For more information about RegTrace setup on Exchange 2000, click the following article number to view the article in the Microsoft Knowledge Base:
238614 How to set up Regtrace for Exchange 2000
Note For additional information about tools and processes that you can use to troubleshoot and to diagnose transport issues and routing issues in Exchange 2003, download the Exchange Server 2003 Transport and Routing Guide online book. To download this book, visit the following Microsoft Web site:
http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/extransrout.mspx
Verify that the affected servers can generate a ServicePrincipalName (SPN) for authentication. To verify this, check the Network address attribute of the affected servers by using the ADSI Edit tool (Adsiedit.exe) or by using the Lightweight Directory Access Protocol (LDAP) tool (Ldp.exe).

Nodes in a routing group have to mutually authenticate with the routing group master to be connected. To do this, they use the ncacn_ip_tcp value in the Network address attribute of the Exchange Server computer to generate the SPN for the master node by using Kerberos authentication. Make sure that this value is a Fully Qualified Domain Name (FQDN) instead of a NetBIOS name or an IP address. Restart the RESvc service.
Check the application log and the system log on all the affected servers for any Kerberos authentication errors. Kerberos authentication errors may be caused by an expired domain computer account password. To gain additional information about this issue, run the NLTEST utility with debug flags. For more information about how to run the NLTEST utility with debug flags, click the following article number to view the article in the Microsoft Knowledge Base:
109626 Enabling debug logging for the Net Logon service
Important If the domain computer account password has apparently expired, you must contact Microsoft Product Support Services (PSS) to confirm and to correct the issue. For a complete list of Microsoft Product Support Services phone numbers and information about support costs, visit the following Microsoft Web site:
http://support.microsoft.com/default.aspx?scid=fh;[LN];CNTACTMS
Verify that the FQDN of the virtual server matches the FQDN in Domain Name System (DNS).
If the membership of the routing group spans multiple domains, make sure that DNS is correctly designed and implemented between the domains.
Look for any third-party applications that use Group Policy objects to restrict permissions or to restrict security settings.
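
Two of the checks in the preceding steps are easy to script: the reachability test for TCP port 691 (instead of a manual Telnet session) and the requirement that the ncacn_ip_tcp value is an FQDN and not a NetBIOS name or an IP address. The following is a minimal sketch. The function names and the host names are hypothetical, and the FQDN test is only a shape check; it does not query DNS or Active Directory.

```python
import ipaddress
import socket


def can_reach_port(host, port=691, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def looks_like_fqdn(value):
    """Return True for a dotted host name; False for an IP address or a single label."""
    try:
        ipaddress.ip_address(value)
        return False  # an IP address is not an FQDN
    except ValueError:
        pass
    labels = value.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


print(looks_like_fqdn("exch01.corp.example.com"))  # True
print(looks_like_fqdn("EXCH01"))                   # False
print(looks_like_fqdn("192.168.0.10"))             # False
```

For example, calling can_reach_port("exch01.corp.example.com") from a member node returns False if a firewall blocks port 691 or if nothing is listening on that port on the target server. A True result shows only that the port accepts connections; it does not confirm that the Routing Engine banner was returned.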

Routing group master wars

In a routing group, the first server installed in the routing group is automatically elected as the master node. As other servers are installed, the administrator has the option to appoint another server as master.

When the new routing group master is elected, only one server should be assigned the master role at a time. This rule is enforced by an algorithm that is based on the formula "(N/2) +1" (where N denotes the number of servers in the routing group). The algorithm calculates the number of servers in the routing group that must agree and that must acknowledge the master. Therefore, the member nodes send link state ATTACH data (information about the routing group) to the master.
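
As a worked example of the "(N/2) +1" formula, the following sketch computes how many servers must acknowledge the master for several routing group sizes. It assumes that the division is an integer division that rounds down; the article does not state how odd values of N are handled.

```python
def master_quorum(server_count):
    """Number of servers that must acknowledge the master: (N/2) + 1."""
    if server_count < 1:
        raise ValueError("a routing group needs at least one server")
    # Assumption: N/2 is rounded down for odd N.
    return server_count // 2 + 1


for n in (1, 4, 5, 10):
    print(n, master_quorum(n))
# 1 1
# 4 3
# 5 3
# 10 6
```

Because the required number is always more than half of the servers, two different servers cannot both collect enough acknowledgments at the same time.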

It is not uncommon for two or more servers to have erroneous information about which server is the current routing group master. For example, if a routing group master was moved or was deleted, and another master node was not chosen, the MsExchRoutingMasterDN attribute may point to a non-existent server.

This issue may also occur when an old master does not detach as master, or when a problematic node keeps sending incorrect link state ATTACH information.

Note In Microsoft Exchange Server 2003, if a routing group points to a deleted object, the master node gives up its role as master and initiates a shutdown.

To resolve this issue, use one of the following methods:

Look for link state data propagation through TCP port 691, for firewall hindrances such as firewall blocking of TCP port 691, and for SMTP filters.
Look for Active Directory replication latencies.
Look for network problems and latencies.
Look for deleted routing group masters or servers that no longer exist. If this is the case, a Transport event 958 that references a routing group master distinguished name that no longer exists is logged in the application log. Use the Lightweight Directory Access Protocol (LDAP) tool (Ldp.exe) or the ADSI Edit tool (Adsiedit.exe) to verify that this is the case.

Deleted routing groups are followed by [object_not_found_in_DS]

When servers are moved between routing groups, and when the routing groups are subsequently deleted, if you use Winroute.exe you may see the text [object_not_found_in_DS] next to the object name.

Image: Winroute.exe displays [object_not_found_in_DS] after the object name.

This issue may occur if the routing engine service tries to correlate an object that still exists in a dynamic routing library that is maintained by the server with objects in Active Directory, where the object no longer exists.

To resolve this issue, use one of the following methods:

Restart all servers in the organization at the same time. This action updates routing information. Additionally, this action removes deleted routing groups and deleted connectors.
Use the Remonitor.exe tool in injection mode.

Note Contact Microsoft Product Support Services for information about the Remonitor.exe tool in injection mode. For a complete list of Microsoft Product Support Services phone numbers and information about support costs, visit the following Microsoft Web site:
http://support.microsoft.com/default.aspx?scid=fh;[LN];CNTACTMS
Make sure that the servers are on a recent build of Exchange Server and that they have the Exchange Server service pack rollups installed.

Note Applying the hotfix that is described in the following Knowledge Base article is no longer necessary if your servers are on a recent build of Exchange Server and have the current Exchange Server service pack rollups installed. If you cannot install the most recent Exchange Server service pack rollups, apply the hotfix that is described in the following Knowledge Base article:
330279 Deleted routing groups are listed in the WinRoute tool; fix requires Exchange 2000 SP3
Restart all Exchange Server services and Windows Management Instrumentation (WMI) services on all Exchange Server computers in the organization. This resolution is effective only if all servers are restarted at the same time.

Note Contact Microsoft Product Support Services for information about restarting all servers at the same time. For a complete list of Microsoft Product Support Services phone numbers and information about support costs, visit the following Microsoft Web site:
http://support.microsoft.com/default.aspx?scid=fh;[LN];CNTACTMS
Make sure that the account that is logged on to the server has sufficient permissions. To do this, run Winroute.exe under the System Account.

Note The lack of sufficient read permissions may cause Winroute.exe to incorrectly report [object_not_found_in_DS].

Connectors are not reported to be marked as "DOWN"

When you use the Winroute.exe tool to view the Exchange routing topology, you may see that connectors that are unavailable are reported as being available (they are marked as "UP"). This behavior may occur for the following connectors:

Connectors that use DNS to route. For example, this behavior may occur for SMTP connectors that use DNS instead of smart host.
Microsoft Exchange 5.5 Server connectors or Exchange Development Kit (EDK) connectors. These connectors do not use link state routing.
Routing group connectors with source bridgeheads of the "any" type.
Any connectors where one bridgehead is an Exchange 5.5 Server computer.
Connectors that use smart host settings and recently changed smart hosts.

Link state oscillations: connectors are repeatedly marked as "UP" and then as "DOWN"

This common scenario involves connectors being marked as "UP" and then as "DOWN" repeatedly. It causes excessive link state updates between servers. These excessive link state updates cause a very expensive and frequent recalculation of routes within the server. This is also indicated by Event 4005 Reset Routes. This issue may occur in the following scenarios:

Network problems. Use a network trace to diagnose this scenario.
A reaction to link status notification calls from underlying protocol services, such as SMTP/AQ and message transfer agent (MTA). This behavior is caused by interference on the X.400 protocol levels or on the SMTP protocol levels by third-party applications.

In this scenario, only a network monitor capture can reveal the issues that are involved. Additionally, if you notice very frequent changes of the major versions, of the minor versions, and of the user versions in the WinRoute tool, this may also indicate a link state problem (see the WinRoute routing version changes section).
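
If you record connector states at intervals, for example by noting what the WinRoute tool displays, a small script can count how often each connector changes state. The following sketch assumes a hypothetical list of (timestamp in seconds, connector name, state) observations in time order. This format is an illustration only and is not produced by any Exchange tool.

```python
from collections import Counter


def count_flaps(events, window_seconds, now):
    """Count UP/DOWN transitions per connector within the trailing window.

    events: iterable of (timestamp_seconds, connector, state), in time order.
    """
    last_state = {}
    flaps = Counter()
    for timestamp, connector, state in events:
        previous = last_state.get(connector)
        last_state[connector] = state
        changed = previous is not None and previous != state
        if changed and now - timestamp <= window_seconds:
            flaps[connector] += 1
    return flaps


events = [
    (0, "RGC-East-West", "UP"),
    (60, "RGC-East-West", "DOWN"),
    (120, "RGC-East-West", "UP"),
    (180, "RGC-East-West", "DOWN"),
    (200, "SMTP-Internet", "UP"),
]
print(count_flaps(events, window_seconds=600, now=300))
# Counter({'RGC-East-West': 3})
```

A connector with a high count in a short window is a candidate for the network trace that is described earlier in this section.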

To reduce link state oscillations, apply the hotfix that is described in the following article in the Microsoft Knowledge Base:

825314 Link state traffic saturates slow links between servers

After the hotfix has been applied, you must set the AttachedTimeout registry value to make sure that the hotfix works as expected.

Warning Serious problems might occur if you modify the registry incorrectly by using Registry Editor or by using another method. These problems might require that you reinstall your operating system. Microsoft cannot guarantee that these problems can be solved. Modify the registry at your own risk.

To enable the AttachedTimeout registry value, follow these steps:

Click Start, click Run, type regedit, and then click OK.
Locate the HKLM\SYSTEM\CurrentControlSet\Services\RESvc\Parameters subkey.
Right-click the Parameters subkey, point to New, and then click DWORD value.
Name the new value AttachedTimeout.
Double-click AttachedTimeout, click Decimal under Base, and then type a value from 1 to 604800 in the Value data box.

Note The AttachedTimeout value represents time in seconds. The valid range for this value is 1 second to 604,800 seconds (7 days).
Click OK, and then quit Registry Editor.
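
As an alternative to typing the value by hand, the same setting can be prepared as a .reg file. The following sketch only builds the file text and validates the range that is stated in the preceding note; it does not modify the registry. The function name is hypothetical, and the registry warning earlier in this section applies equally to importing a .reg file.

```python
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RESvc\Parameters"
MAX_SECONDS = 604800  # 7 days


def attached_timeout_reg(seconds):
    """Return .reg file text that sets AttachedTimeout to the given number of seconds."""
    if not isinstance(seconds, int) or not 1 <= seconds <= MAX_SECONDS:
        raise ValueError(f"AttachedTimeout must be a whole number from 1 to {MAX_SECONDS}")
    # .reg files store DWORD data as eight hexadecimal digits.
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{KEY_PATH}]\n"
        f'"AttachedTimeout"=dword:{seconds:08x}\n'
    )


print(attached_timeout_reg(3600))  # one hour; the data line reads dword:00000e10
```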

Note Contact Microsoft Product Support Services for more information about the AttachedTimeout registry value. For a complete list of Microsoft Product Support Services phone numbers and information about support costs, visit the following Microsoft Web site:

http://support.microsoft.com/default.aspx?scid=fh;[LN];CNTACTMS

How connector states affect link states

A connector can be located anywhere in any routing group in the Exchange organization. A specific connector that is frequently marked as "UP" and as "DOWN" may seriously affect the possible routes that a message can take through the organization. Such a connector may even lead to mail loops.

Exchange routing chooses the optimal path, based on variables such as cost, message type, and restrictions. Exchange routing locates the next server for a message to make the next hop to, and then Exchange routing gives the name of the next server to the Advanced Queuing engine (AQ). Because the oscillating state of a connector causes link state changes, Exchange has to repeatedly recalculate the optimal path. This recalculation process involves queries to the directory service.

How link states affect connector states

When the Advanced Queuing engine (AQ) detects that a link to the bridgehead server on a connector failed, it calls into routing by using a method that is named LinkStateNotify(). Routing then suppresses this information for up to 10 minutes to prevent connector state fluctuation, and then routing relays this information to the routing group master. If routing decides to mark the connector as "DOWN," this change is propagated to all computers in the organization, including the computer where the original failure occurred. This behavior leads to a very expensive process that is named "reset routes." Thereafter, the routing engine no longer recommends that AQ connect to the "failed" next-hop computer. The reverse is true for a connector that is marked as "UP."
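
The effect of suppressing a state report before it is relayed can be modeled with a simple hold-down timer. The following sketch is a conceptual illustration under stated assumptions: it uses a fixed 600-second hold (the article says "up to 10 minutes"), it assumes that connectors start in the "UP" state, and the class and method names are hypothetical. It is not the routing engine's actual logic.

```python
class LinkStateDamper:
    """Hold connector state reports and relay only those that outlast the hold time."""

    def __init__(self, hold_seconds=600):
        self.hold_seconds = hold_seconds
        self.relayed = {}  # connector -> state last relayed to the master
        self.pending = {}  # connector -> (reported state, time first reported)

    def notify(self, connector, state, now):
        """Record a report; a report that matches the relayed state cancels a pending change."""
        if self.relayed.get(connector, "UP") == state:
            self.pending.pop(connector, None)
        elif self.pending.get(connector, (None, None))[0] != state:
            self.pending[connector] = (state, now)

    def due(self, now):
        """Return and commit the changes that have been pending for the full hold time."""
        ready = {
            connector: state
            for connector, (state, since) in self.pending.items()
            if now - since >= self.hold_seconds
        }
        for connector, state in ready.items():
            self.relayed[connector] = state
            del self.pending[connector]
        return ready


damper = LinkStateDamper()
damper.notify("RGC-East-West", "DOWN", now=0)
damper.notify("RGC-East-West", "UP", now=120)  # recovered inside the hold window
damper.notify("SMTP-Internet", "DOWN", now=0)
print(damper.due(now=600))  # {'SMTP-Internet': 'DOWN'}
```

In this model, a connector that fails and recovers inside the hold window never generates an organization-wide update, which is the purpose of the suppression.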

WinRoute routing version changes

The WinRoute tool reports routing versions in the following format: "RoutingGroup (d5.2.3)." The three numbers that are separated by periods that follow the routing group name are the major version, the minor version, and the user version.

Major version changes are typically changes in directory service that involve routing and connectors. If there is a frequent change here, monitor it by using the Remonitor.exe tool, and then investigate it for a probable root cause. For example, an administrator may make significant changes in directory service. A major version of zero is shown for isolated routing groups with no routing and no link state exchange with other nodes. Additionally, a major version of zero is shown for Microsoft Exchange 5.5 Server-based sites because they do not use link state information.

A minor version change may indicate changes to the state of a connector. Frequent changes may be caused by faulty links or by links that fluctuate between the "UP" state and the "DOWN" state. AQ tries to send a message over a connector. If AQ fails, it sends a notification to routing to mark the connector as "DOWN." Then, AQ initiates retry pings to the connector. After AQ detects that the connector is up, AQ notifies routing by calling the LinkStateNotify() method.

User version changes may occur in the following situations:

Servers attach to or detach from master nodes.
WMI services send data to the routing group master.
There is callback registration by routing clients such as MTA or SMTP.
There are routing group membership changes.
You rename the routing group.
A new master node is elected.
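
To track which of the three version numbers changes between two observations, the label that the WinRoute tool displays can be parsed and compared. The following sketch assumes the "RoutingGroup (d5.2.3)" format that is quoted earlier in this section. The meaning of the leading "d" is not explained in this article, so the pattern simply skips any non-digit prefix. The function names are hypothetical.

```python
import re
from typing import NamedTuple


class RoutingVersion(NamedTuple):
    major: int
    minor: int
    user: int


# Matches the three dot-separated numbers inside the parentheses. Any
# non-digit prefix, such as the "d" in "(d5.2.3)", is skipped.
_VERSION = re.compile(r"\(\D*(\d+)\.(\d+)\.(\d+)\)")


def parse_version(label):
    """Extract (major, minor, user) from a label such as 'RoutingGroup (d5.2.3)'."""
    match = _VERSION.search(label)
    if match is None:
        raise ValueError(f"no routing version in {label!r}")
    return RoutingVersion(*map(int, match.groups()))


def changed_fields(before, after):
    """Names of the version components that differ between two snapshots."""
    return [
        name
        for name in RoutingVersion._fields
        if getattr(before, name) != getattr(after, name)
    ]


old = parse_version("RoutingGroup (d5.2.3)")
new = parse_version("RoutingGroup (d5.4.3)")
print(old)                       # RoutingVersion(major=5, minor=2, user=3)
print(changed_fields(old, new))  # ['minor']
```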

Base-level callbacks

Routing base-level callbacks are updates that occur after a routing group object is modified. The updates are then propagated throughout the organization. Major version changes in Winroute.exe may be triggered by the following events:

Renaming a routing group
Electing a new routing group master
Removing a routing group member
Adding a routing group member

One-level callbacks

One-level callbacks are typically updates to routing when changes that are one level below the routing group object are detected. Examples include deleting a connector in the routing group and adding a connector to the routing group.

DNS

Incorrect configuration of Domain Name System (DNS) may cause several routing issues. These issues are addressed in the following sections.

The DNS Resolver sink event on the SMTP virtual server

The DNS Resolver sink event is primarily for resolving external SMTP domains. Your internal Active Directory servers and DNS servers still have to be able to resolve all Exchange Server computers internally.

The SMTP virtual server DNS Resolver sink event is synchronous and can affect performance on a heavily used server. To slightly improve the situation, increase the number of threads that are used for DNS lookups.

The DNS Resolver sink event is used only when a server is not in the Exchange organization. Exchange Server determines this by querying Active Directory directory service.

Windows 2000 DNS API

If you use the Windows 2000 DNS API for name resolution, the lookups are asynchronous and are much faster than lookups that use the default settings of the external DNS Resolver sink event.

Exchange DNS that uses the Windows DNS API or the Exchange DNS Resolver sink event has to be able to resolve an Internet Protocol address (IP address) in the following ways:

mail exchanger resource record (MX record)-to-IP address
MX record-to-A record-to-IP address
MX record-to-CNAME record-to-A record-to-IP address
CNAME record-to-A record-to-IP address
A record-to-IP address

DNS records that are incorrectly configured, especially MX records and CNAME records, may seriously affect mail flow.
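
The resolution chains in the preceding list can be modeled with a small in-memory record table. The following sketch makes no network queries, and the record layout, the function names, and the hop limit are assumptions for illustration. It also simplifies real DNS behavior: it handles only one MX record per domain and ignores MX preference values.

```python
import ipaddress

MAX_CNAME_HOPS = 5  # arbitrary limit for this sketch


def is_ip(value):
    """Return True if value is an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def resolve_host(name, records):
    """Follow CNAME records to an A record and return its IP address, or None."""
    for _ in range(MAX_CNAME_HOPS + 1):
        if is_ip(name):
            return name
        if ("A", name) in records:
            return records[("A", name)]
        name = records.get(("CNAME", name))
        if name is None:
            return None
    return None  # the CNAME chain is too long or is circular


def resolve_mail_domain(domain, records):
    """Try the MX record first, and then fall back to CNAME or A records for the domain."""
    target = records.get(("MX", domain), domain)
    return resolve_host(target, records)


records = {
    ("MX", "example.com"): "mail.example.com",
    ("CNAME", "mail.example.com"): "mx1.example.net",
    ("A", "mx1.example.net"): "192.0.2.25",
    ("A", "example.org"): "192.0.2.80",
}
print(resolve_mail_domain("example.com", records))      # 192.0.2.25
print(resolve_mail_domain("example.org", records))      # 192.0.2.80
print(resolve_mail_domain("example.invalid", records))  # None
```

The first lookup follows the MX record-to-CNAME record-to-A record-to-IP address chain, and the second follows the A record-to-IP address chain. A missing or misdirected record anywhere in a chain produces None, which corresponds to mail that cannot be delivered.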

Note Although Microsoft Exchange Server 2003 does provide limited support for chained CNAME records, we do not recommend implementing this configuration.

In Microsoft Exchange Server 2003, the external DNS Resolver sink event has been improved. Additionally, you can use the DNS Diagnostic utility (DNSdiag.exe) from the Windows Server 2003 Resource Kit to troubleshoot DNS issues that involve the external SMTP resolver and the Windows TCP/IP DNS. DNSdiag.exe shows the asynchronous queries and the synchronous queries to Global DNS servers or to the DNS servers that are called by the DNS sink event. Additionally, DNSdiag.exe shows any corresponding failures or errors.

Note The DNS Diagnostic utility is also known as the DNS Resolver tool. They are the same file, DNSdiag.exe. The file is available for download from the Microsoft Download Center.

For more information about how to download Microsoft Support files, click the following article number to view the article in the Microsoft Knowledge Base:

119591 How to obtain Microsoft support files from online services
