10    Memory Channel Application Programming Interface Library

The Memory Channel Application Programming Interface (API) implements highly efficient memory sharing between Memory Channel API cluster members, with automatic error handling, locking, and UNIX-style protections. This chapter contains information to help you develop applications based on the Memory Channel API library. It explains the differences between Memory Channel address space and traditional shared memory, and describes how programming using Memory Channel as a transport differs from programming using shared memory as a transport.

This chapter also contains examples that show how to use the Memory Channel API library functions in programs. You will find these code files in the /usr/examples/cluster/ directory. Each file contains compilation instructions.

The chapter discusses the following topics:

  -  Initializing the Memory Channel API library (Section 10.1)

  -  The Memory Channel multirail model (Section 10.2)

  -  Tuning your Memory Channel configuration (Section 10.3)

  -  Troubleshooting (Section 10.4)

  -  Initializing the Memory Channel API library for a user program (Section 10.5)

  -  Accessing Memory Channel address space (Section 10.6)

10.1    Initializing the Memory Channel API Library

To run applications that are based on the Memory Channel API library, the library must be initialized on each host in the Memory Channel API cluster. The imc_init command initializes the Memory Channel API library and allows applications to use the API. Initialization of the Memory Channel API library occurs either by automatic execution of the imc_init command at system boot time, or when the system administrator invokes the command from the command line after the system boots.

Initialization of the Memory Channel API library at system boot time is controlled by the IMC_AUTO_INIT variable in the /etc/rc.config file. If the value of this variable is set to 1, the imc_init command is invoked at system boot time. When the Memory Channel API library is initialized at boot time, the values of the -a maxalloc and -r maxrecv flags are set to the values that are specified by the IMC_MAX_ALLOC and IMC_MAX_RECV variables in the /etc/rc.config file. The default value for the maxalloc parameter and the maxrecv parameter is 10 MB.

If the IMC_AUTO_INIT variable is set to zero (0), the Memory Channel API library is not initialized at system boot time. The system administrator must invoke the imc_init command to initialize the library. The parameter values in the /etc/rc.config file are not used when the imc_init command is manually invoked.

The imc_init command initializes the Memory Channel API library the first time it is invoked, whether this happens at system boot time or after the system has booted. The value of the -a maxalloc flag must be the same on all hosts in the Memory Channel API cluster. If different values are specified, the maximum value that is specified for any host determines the clusterwide value that applies to all hosts.

After the Memory Channel API library has been initialized on the current host, the system administrator can invoke the imc_init command again to reconfigure the values of the maxalloc and maxrecv resource limits, without forcing a reboot. The system administrator can increase or decrease either limit, but the new limits cannot be lower than the current usage of the resources. Reconfiguring the cluster from the command line does not read or modify the values that are specified in the /etc/rc.config file. The system administrator can use the rcmgr(8) command to modify the parameters and have them take effect when the system reboots.

You must have root privileges to execute the imc_init command.

10.2    The Memory Channel Multirail Model

The Memory Channel multirail model supports the concept of physical rails and logical rails. A physical rail is defined as a Memory Channel hub with its cables and Memory Channel adapters, and the Memory Channel driver for the adapters on each node. A logical rail is made up of one or two physical rails.

A cluster can have one or more logical rails, up to a maximum of four. Logical rails can be configured in the following styles:

  -  Single-rail (Section 10.2.1)

  -  Failover pair (Section 10.2.2)

10.2.1    Single-Rail Style

If a cluster is configured in the single-rail style, there is a one-to-one relationship between physical rails and logical rails. This configuration has no failover properties; if the physical rail fails, the logical rail fails.

A benefit of the single-rail configuration is that applications can access the aggregate address space of all logical rails and utilize their aggregate bandwidth for maximum performance.

Figure 10-1 shows a single-rail Memory Channel configuration with three logical rails, each of which is also a physical rail.

Figure 10-1:  Single-Rail Memory Channel Configuration

10.2.2    Failover Pair Style

If a cluster is configured in the failover pair style, a logical rail consists of two physical rails, with one physical rail active and the other inactive. If the active physical rail fails, a failover takes place and the inactive physical rail is used, allowing the logical rail to remain active after the failover. This failover is transparent to the user.

The failover pair style can only exist in a Memory Channel configuration consisting of two physical rails.

The failover pair configuration provides availability in the event of a physical rail failure, because the second physical rail is redundant. However, only the address space and bandwidth of a single physical rail are available at any given time.

Figure 10-2 shows a multirail Memory Channel configuration in the failover pair style. The illustrated configuration has one logical rail, which is made up of two physical rails.

Figure 10-2:  Failover Pair Memory Channel Configuration

10.2.3    Configuring the Memory Channel Multirail Model

When you implement the Memory Channel multirail model, all nodes in a cluster must be configured with an equal number of physical rails, which are configured into an equal number of logical rails, each with the same failover style.

The system configuration parameter rm_rail_style, in the /etc/sysconfigtab file, is used to set multirail styles. The rm_rail_style parameter can be set to one of the following values:

  -  0, for the single-rail style

  -  1, for the failover pair style

The default value of the rm_rail_style parameter is 1.

The rm_rail_style parameter must have the same value for all nodes in a cluster, or configuration errors may occur.

To change the value of the rm_rail_style parameter to zero (0) for a single-rail style, change the /etc/sysconfigtab file by adding or modifying the following stanza for the rm subsystem:

rm:
        rm_rail_style=0
 

Note

We recommend that you use sysconfigdb(8) to modify or to add stanzas in the /etc/sysconfigtab file.

If you change the rm_rail_style parameter, you must halt the entire cluster and then reboot each member system.

Note

A cluster will fail if any logical rail fails. See Section 10.4.3 for more information.

Error handling for the Memory Channel multirail model is implemented for specified logical rails. See Section 10.6.6 for a description of Memory Channel API library error-management functions and code examples.

Note

The Memory Channel multirail model does not facilitate any type of cluster reconfiguration, such as the addition of hubs or Memory Channel adapters. For such reconfiguration, you must first shut down the cluster completely.

10.3    Tuning Your Memory Channel Configuration

The imc_init command initializes the Memory Channel API library with certain resource defaults. Depending on your application, you may require more resources than the defaults allow. In some cases, you can change certain Memory Channel parameters and virtual memory resource parameters to overcome these limitations. The following sections describe these parameters and explain how to change them.

10.3.1    Extending Memory Channel Address Space

The amount of total Memory Channel address space that is available to the Memory Channel API library is specified using the maxalloc parameter of the imc_init command. The maximum amount of Memory Channel address space that can be attached for receive on a host is specified using the maxrecv parameter of the imc_init command. The default limit in each case is 10 MB. (Section 10.1 describes how to initialize the Memory Channel API library using the imc_init command.)

You can use the rcmgr(8) command to change the value that is used during an automatic initialization by setting the variables IMC_MAX_ALLOC and IMC_MAX_RECV. For example, you can set the variables to allow a total of 80 MB of Memory Channel address space to be made available to the Memory Channel API library clusterwide, and to allow 60 MB of Memory Channel address space to be attached for receive on the current host, as follows:

   rcmgr set IMC_MAX_ALLOC 80
   rcmgr set IMC_MAX_RECV 60
 

If you use the rcmgr(8) command to set new limits, they will take effect when the system reboots.

You can use the Memory Channel API library initialization command, imc_init, to change both the amount of total Memory Channel address space available and the maximum amount of Memory Channel address space that can be attached for receive, after the Memory Channel API library has been initialized. For example, to allow a total amount of 80 MB of Memory Channel address space to be made available clusterwide, and to allow 60 MB of Memory Channel address space to be attached for receive on the current host, use the following command:

imc_init -a 80 -r 60

If you use the imc_init command to set new limits, they will be lost when the system reboots, and the values of the IMC_MAX_ALLOC and IMC_MAX_RECV variables will be used as limits.

10.3.2    Increasing Wired Memory

Every page of Memory Channel address space that is attached for receive must be backed by a page of physical memory on your system. This memory is nonpageable; that is, it is wired memory. The amount of wired memory on a host cannot be increased without limit; the system configuration parameter vm_syswiredpercent imposes a ceiling. You can change the vm_syswiredpercent parameter in the /etc/sysconfigtab file.

For example, if you want to set the vm_syswiredpercent parameter to 80, the vm stanza in the /etc/sysconfigtab file must contain the following entry:

vm:
        vm_syswiredpercent=80
 

If you change the vm_syswiredpercent parameter, you must reboot the system.

Note

The default amount of wired memory is sufficient for most operations. We recommend that you exercise caution in changing this limit.

10.4    Troubleshooting

The following sections describe error conditions that you may encounter when using the Memory Channel API library functions, and suggest solutions.

10.4.1    IMC_NOTINIT Return Code

The IMC_NOTINIT status is returned when the imc_init command has not been run, or when the imc_init command has failed to run correctly.

The imc_init command must be run on each host in the Memory Channel API cluster before you can use the Memory Channel API library functions. (Section 10.1 describes how to initialize the Memory Channel API library using the imc_init command.)

If the imc_init command does not run successfully, see Section 10.4.2 for suggested solutions.

10.4.2    Memory Channel API Library Initialization Failure

The Memory Channel API library may fail to initialize on a host; if this happens, an error message is displayed on the console and is written to the messages log file in the /usr/var/adm directory. Use the following list of error messages and solutions to eliminate the error:

10.4.3    Fatal Memory Channel Errors

Sometimes the Memory Channel API fails to initialize because of problems with the physical Memory Channel configuration or interconnect. Error messages that are displayed on the console in these circumstances do not mention the Memory Channel API. The following sections describe some of the more common reasons for such failures.

10.4.3.1    Logical Rail Failure

If any logical rail fails, a system panic occurs on one or more hosts in the cluster, and the following error message is displayed on the console:

panic (cpu 0): rm_delete_context: fatal MC error
 

To solve this problem, ensure that the hub is powered up and that all cables are connected properly; then halt the entire cluster and reboot each member system.

10.4.3.2    Logical Rail Initialization Failure

If the logical rail configuration for a logical rail on this node does not match that of a logical rail on other cluster members, a system panic occurs on one or more hosts in the cluster, and error messages of the following form are displayed on the console:

rm_slave_init
rail configuration does not match cluster expectations for logical rail 0
logical rail 0 has failed initialization
rm_delete_context: lcsr = 0x2a80078, mcerr = 0x20001, mcport = 0x72400001
panic (cpu 0): rm_delete_context: fatal MC error
 

This error can occur if the configuration parameter rm_rail_style is not identical on every node.

To solve this problem, follow these steps:

  1. Halt the system.

  2. Boot /genvmunix.

  3. Modify the /etc/sysconfigtab file as described in Section 10.2.3.

  4. Reboot the kernel with Memory Channel API cluster support (/vmunix).

10.4.4    IMC_MCFULL Return Code

The IMC_MCFULL status is returned if there is not enough Memory Channel address space to perform an operation.

The amount of total Memory Channel address space that is available to the Memory Channel API library is specified by using the maxalloc parameter of the imc_init command, as described in Section 10.1.

You can use the rcmgr(8) command or the Memory Channel API library initialization command, imc_init, to increase the amount of Memory Channel address space that is available to the library clusterwide. See Section 10.3.1 for more details.

10.4.5    IMC_RXFULL Return Code

The IMC_RXFULL status is returned by the imc_asattach function if receive mapping space is exhausted when an attempt is made to attach a region for receive.

Note

The default amount of receive space on the current host is 10 MB.

The maximum amount of Memory Channel address space that can be attached for receive on a host is specified using the maxrecv parameter of the imc_init command, as described in Section 10.1.

You can use the rcmgr(8) command or the Memory Channel API library initialization command, imc_init, to extend the maximum amount of Memory Channel address space that can be attached for receive on the host. See Section 10.3.1 for more details.

10.4.6    IMC_WIRED_LIMIT Return Code

The IMC_WIRED_LIMIT return value indicates that an attempt has been made to exceed the maximum quantity of wired memory.

The system configuration parameter vm_syswiredpercent specifies the wired memory limit; see Section 10.3.2 for information on changing this limit.

10.4.7    IMC_MAPENTRIES Return Code

The IMC_MAPENTRIES return value indicates that the maximum number of virtual memory map entries has been exceeded for the current process.

10.4.8    IMC_NOMEM Return Code

The IMC_NOMEM return status indicates a malloc function failure while performing a Memory Channel API function call.

This will happen if the process virtual memory limit has been exceeded. You can remedy it by using the usual techniques for extending process virtual memory limits; that is, by using the limit command and the unlimit command for the C shell, and by using the ulimit command for the Bourne shell and the Korn shell.
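For example, under the Bourne or Korn shell you can inspect the current limits and raise the soft data-segment limit to the hard maximum, as sketched below (the C shell equivalents are the limit and unlimit commands):

```shell
# Display all current per-process resource limits.
ulimit -a

# Display the soft data-segment (heap) limit, in kilobytes.
ulimit -S -d

# Raise the soft data-segment limit to the hard maximum for this session.
ulimit -S -d "$(ulimit -H -d)"
```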

10.4.9    IMC_NORESOURCES Return Code

The IMC_NORESOURCES return value indicates that there are insufficient Memory Channel data structures available to perform the required operation. The number of Memory Channel data structures is fixed, and cannot be increased by changing a parameter. To solve this problem, amend the application to use fewer regions or locks.

10.5    Initializing the Memory Channel API Library for a User Program

The imc_api_init function is used to initialize the Memory Channel API library in a user program. Call the imc_api_init function in a process before any other Memory Channel API function is called. If a process forks, the child process must call the imc_api_init function before calling any other API function; otherwise, undefined behavior will result.

10.6    Accessing Memory Channel Address Space

The Memory Channel interconnect provides a form of memory sharing between Memory Channel API cluster members. The Memory Channel API library is used to set up the memory sharing, allowing processes on different members of the cluster to exchange data using direct read and write operations to addresses in their virtual address space. When the memory sharing has been set up by the Memory Channel API library, these direct read and write operations take place at hardware speeds without involving the operating system or the Memory Channel API library software functions.

When a system is configured with Memory Channel, part of the physical address space of the system is assigned to the Memory Channel address space. The size of the Memory Channel address space is specified by the imc_init command. A process accesses this Memory Channel address space by using the Memory Channel API to map a region of Memory Channel address space to its own virtual address space.

Applications that want to access the Memory Channel address space on different cluster members can allocate part of the address space for a particular purpose by calling the imc_asalloc function. The key parameter associates a clusterwide key with the region. Other processes that allocate the same region also specify this key. This allows processes to coordinate access to the region.

To use an allocated region of Memory Channel address space, a process maps the region into its own process virtual address space, using the imc_asattach function or the imc_asattach_ptp function. When a process attaches to a Memory Channel region, an area of virtual address space that is the same size as the Memory Channel region is added to the process virtual address space. When attaching the region, the process indicates whether the region is mapped to receive or transmit data, as follows:

  -  If the region is attached for transmit, data that the process writes to the mapped virtual address space is transmitted on the Memory Channel interconnect.

  -  If the region is attached for receive, the region is backed by physical memory on the host, and data that arrives on the Memory Channel interconnect for that region is written to this memory.

A process can attach to a Memory Channel region in broadcast mode, point-to-point mode, or loopback mode. These methods of attach are described in Section 10.6.1.

Memory sharing using the Memory Channel interconnect is similar to conventional shared memory in that, after it is established, simple accesses to virtual address space allow two different processes to share data. However, there are two differences between these memory-sharing mechanisms that you must allow for, as follows:

  -  A process cannot read from a transmit pointer or write to a receive pointer; to read the data that it writes, a process must attach the region for both transmit and receive (see Section 10.6.3).

  -  Updates to a region are not instantaneously visible on all hosts; you must allow for the initial coherency and latency-related coherency behavior described in Section 10.6.2 and Section 10.6.5.

10.6.1    Attaching to Memory Channel Address Space

The following sections describe the three ways in which a process can attach to Memory Channel address space:

  -  Broadcast attach (Section 10.6.1.1)

  -  Point-to-point attach (Section 10.6.1.2)

  -  Loopback attach (Section 10.6.1.3)

This section also explains initial coherency, reading and writing Memory Channel regions, latency-related coherency, and error management, and includes some code examples.

10.6.1.1    Broadcast Attach

When one process maps a region for transmit and other processes map the same region for receive, the data that the transmit process writes to the region is transmitted on Memory Channel to the receive memory of the other processes. Figure 10-3 shows how the address spaces are mapped in a three-host Memory Channel implementation.

Figure 10-3:  Broadcast Address Space Mapping

With the address spaces that are mapped as shown in Figure 10-3, note the following:

  1. Process A allocates a region of Memory Channel address space. Process A then maps the allocated region to its virtual address space when it attaches the region for transmit using the imc_asattach function.

  2. Process B and Process C both allocate the same region of Memory Channel address space as Process A. However, unlike Process A, Process B and Process C both attach the region to receive data.

  3. When data is written to the virtual address space of Process A, the data is transmitted on Memory Channel.

  4. When the data from Process A appears on Memory Channel, it is written to the physical memory on Hosts B and C that backs the virtual address spaces of Processes B and C that were allocated to receive the data.

10.6.1.2    Point-to-Point Attach

An allocated region of Memory Channel address space can be attached for transmit in point-to-point mode to the virtual address space of a process on another node. This is done by calling the imc_asattach_ptp function with a specified host as a parameter. This means that writes to the region are sent only to the host that is specified in the parameter, and not to all hosts in the cluster.

Regions that are attached using the imc_asattach_ptp function are always attached in transmit mode, and are write-only. Figure 10-4 shows point-to-point address space mapping in a two-host Memory Channel implementation.

Figure 10-4:  Point-to-Point Address Space Mapping

With the address spaces mapped as shown in Figure 10-4, note the following:

  1. Process 1 allocates a region of Memory Channel address space. It then maps the allocated region to its virtual address space when it attaches the region point-to-point to Host B using the imc_asattach_ptp function.

  2. Process 2 allocates the region and then attaches it for receive using the imc_asattach function.

  3. When data is written to the virtual address space of Process 1, the data is transmitted on Memory Channel.

  4. When the data from Process 1 appears on Memory Channel, it is written to the physical memory that backs the virtual address space of Process 2 on Host B.

10.6.1.3    Loopback Attach

A region can be attached for both transmit and receive by processes on a host. Data that is written by the host is written to other hosts that have attached the region for receive. However, by default, data that is written by the host is not also written to the receive memory on that host; it is written only to other hosts. If you want a host to see data that it writes, you must specify the IMC_LOOPBACK flag to the imc_asattach function when attaching for transmit.

The loopback attribute of a region is set up on a per-host basis, and is determined by the value of the flag parameter to the first transmit attach on that host.

If you specify the value IMC_LOOPBACK for the flag parameter, two Memory Channel transactions occur for every write, one to write the data and one to loop the data back.

Because of the nature of point-to-point attach mode, looped-back writes are not permitted.

Figure 10-5 shows a configuration in which a region of Memory Channel address space is attached both for transmit with loopback and for receive.

Figure 10-5:  Loopback Address Space Mapping

10.6.2    Initial Coherency

When a Memory Channel region is attached for receive, the initial contents are undefined. This situation can arise because a process that has mapped the same Memory Channel region for transmit might update the contents of the region before other processes map the region for receive. This is referred to as the initial coherency problem. You can overcome this in two ways:

  -  Retransmit the data after all mappings of the region for receive have been completed.

  -  Specify at allocation time that the region is to be coherent.

10.6.3    Reading and Writing Memory Channel Regions

Processes that attach a region of Memory Channel address space can only write to a transmit pointer, and can only read from a receive pointer. Any attempt to read a transmit pointer will result in a segmentation violation.

Apart from explicit read operations on Memory Channel transmit pointers, segmentation violations will also result from operations that cause the compiler to generate read-modify-write cycles; for example:

10.6.4    Address Space Example

Example 10-1 shows how to initialize, allocate, and attach to a region of Memory Channel address space, and also shows two of the differences between Memory Channel address space and traditional shared memory:

The sample program shown in Example 10-1 executes in master or slave mode, as specified by a command-line parameter. In master mode, the program writes its own process identifier (PID) to a data structure in the global Memory Channel address space. In slave mode, the program polls a data structure in the Memory Channel address space to determine the PID of the master process.

Note

Make sure that your programs are flexible in their use of keys to prevent problems resulting from key clashes. We recommend that you use meaningful, application-specific keys.

Example 10-1:  Accessing Regions of Memory Channel Address Space

/* /usr/examples/cluster/mc_ex1.c */
 
#include <c_asm.h>
#include <sys/types.h>
#include <sys/imc.h>
#define VALID 756
 
main (int argc, char *argv[])
{
    imc_asid_t   glob_id;
    typedef struct {
                    pid_t     pid;
                    volatile int valid;  [1]
                   } clust_pid;
 
    clust_pid   *global_record;
    caddr_t     add_rx_ptr = 0, add_tx_ptr = 0;
    int         status;
    int         master;
    int         logical_rail=0;
 
    /* check for correct number of arguments */
 
    if (argc != 2) {
                    printf("usage: mcpid 0|1\n");
                    exit(-1);
                   }
 
    /* test if process is master or slave */
 
    master  = atoi(argv[1]);  [2]
 
 
    /* initialize Memory Channel API library */
 
    status  = imc_api_init(NULL);  [3]
 
 
    if (status < 0) {
                     imc_perror("imc_api_init::",status);  [4]
                     exit(-2);
                    }
 
    imc_asalloc(123, 8192, IMC_URW, 0, &glob_id,
                         logical_rail);  [5]
 
    if (master) {
                 imc_asattach(glob_id, IMC_TRANSMIT, IMC_SHARED,
                              0, &add_tx_ptr);  [6]
 
                 global_record = (clust_pid*)add_tx_ptr;  [7]
                 global_record->pid = getpid();
                 mb();  [8]
                 global_record->valid = VALID;
                 mb();
                }
 
    else        {  /* secondary process */
 
                 imc_asattach(glob_id, IMC_RECEIVE, IMC_SHARED,
                              0, &add_rx_ptr);  [9]
 
                 global_record = (clust_pid *)add_rx_ptr;
 
                 while ( global_record->valid != VALID)
                       ;  /* continue polling */  [10]
 
                 printf("pid of master process is %d\n",
                        global_record->pid);
                }
   imc_asdetach(glob_id);
   imc_asdealloc(glob_id);  [11]
}
 

  1. The valid flag is declared as volatile to prevent the compiler from performing any optimizations that might prevent the code from reading the updated value of the flag from memory. [Return to example]

  2. The first argument on the command line indicates whether the process is a master (argument equal to 1) or a slave process (argument not equal to 1). [Return to example]

  3. The imc_api_init function initializes the Memory Channel API library. Call it before calling any of the other Memory Channel API library functions. [Return to example]

  4. All Memory Channel API library functions return a zero (0) status if successful. The imc_perror function decodes error status values. For brevity, this example ignores the status from all functions other than the imc_api_init function. [Return to example]

  5. The imc_asalloc function allocates a region of Memory Channel address space with the following characteristics:

       -  The clusterwide key for the region is 123.

       -  The size of the region is 8192 bytes.

       -  The permissions on the region are user read/write (IMC_URW).

       -  The identifier for the region is returned in the glob_id variable.

       -  The region is allocated on logical rail 0.

    [Return to example]

  6. The master process attaches the region for transmit by calling the imc_asattach function and specifying the glob_id identifier, which was returned by the call to the imc_asalloc function. The imc_asattach function returns add_tx_ptr, a pointer to the address of the region in the process virtual address space. The IMC_SHARED value signifies that the region is shareable, so other processes on this host can also attach the region. [Return to example]

  7. The program points the global record structure to the region of virtual memory in the process virtual address space that is backed by the Memory Channel region, and writes the process ID in the pid field of the global record. Note that the master process has attached the region for transmit; therefore, it can only write data in the field. An attempt to read the field will result in a segmentation violation; for example:

    pid_t x = global_record->pid;
     
    

    [Return to example]

  8. The program uses memory barrier instructions to ensure that the pid field is forced out of the Alpha CPU write buffer before the VALID flag is set. [Return to example]

  9. The slave process attaches the region for receive by calling the imc_asattach function and specifying the glob_id identifier, which was returned by the call to the imc_asalloc function. The imc_asattach function returns add_rx_ptr, a pointer to the address of the region in the process virtual address space. On mapping, the contents of the region may not be consistent on all processes that map the region. Therefore, start the slave process before the master to ensure that all writes by the master process appear in the virtual address space of the slave process. [Return to example]

  10. The slave process overlays the region with the global record structure and polls the valid flag. The earlier declaration of the flag as volatile ensures that the flag is immune to compiler optimizations, which might result in the field being stored in a register. This ensures that the loop will load a new value from memory at each iteration and will eventually detect the transition to VALID. [Return to example]

  11. At termination, the master and slave processes explicitly detach and deallocate the region by calling the imc_asdetach function and the imc_asdealloc function. In the case of abnormal termination, the allocated regions are automatically freed when the processes exit. [Return to example]

10.6.5    Latency Related Coherency

As described in Section 10.6.2, the initial coherency problem can be overcome by retransmitting the data after all mappings of the same region for receive have been completed, or by specifying at allocation time that the region is coherent. However, when a process writes to a transmit pointer, several microseconds can elapse before the update is reflected in the physical memory that corresponds to the receive pointer. If the process reads the receive pointer during that interval, the data it reads might be incorrect. This is known as the latency-related coherency problem.

Latency problems do not arise in conventional shared memory systems. Memory and cache control ensure that store and load instructions are synchronized with data transfers.

Example 10-2 shows two versions of a program that decrements a global process count and detects the count reaching zero (0). The first program uses System V shared memory and interprocess communication. The second uses the Memory Channel API library.

Example 10-2:  System V IPC and Memory Channel Code Comparison

/* /usr/examples/cluster/mc_ex2.c */
 
/****************************************
 *********  System V IPC example  *******
 ****************************************/
 
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
main()
{
   typedef struct {
                   int  proc_count;
                   int  remainder[2047];
                  } global_page;
   global_page  *mypage;
   int  shmid;
 
   shmid = shmget(123, 8192, IPC_CREAT |  SHM_R |  SHM_W);
 
   mypage = (global_page *)shmat(shmid, 0,  0);  /* attach the
                                                    global region     */
 
   mypage->proc_count ++;               /* increment process
                                           count                */
 
   /* body of program goes here */
           .
           .
           .
   /* clean up */
 
   mypage->proc_count --;               /* decrement process
                                           count                */
   if (mypage->proc_count == 0 )
                printf("The last process is exiting\n");
           .
           .
           .
}
 
/****************************************
 *******  Memory Channel example  *******
 ****************************************/
 
#include <sys/types.h>
#include <sys/imc.h>
main()
{
   typedef struct {
                   int  proc_count;
                    int  remainder[2047];
                  } global_page;
   global_page  *mypage_rx, *mypage_tx;  [1]
   imc_asid_t   glob_id;
   int          logical_rail=0;
   int          temp;
 
   imc_api_init(NULL);
 
   imc_asalloc(123, 8192, IMC_URW | IMC_GRW, 0, &glob_id,
               logical_rail);  [2]
 
   imc_asattach(glob_id, IMC_TRANSMIT, IMC_SHARED,
                IMC_LOOPBACK, &(caddr_t)mypage_tx);  [3]
 
   imc_asattach(glob_id, IMC_RECEIVE, IMC_SHARED,
                0, &(caddr_t)mypage_rx);  [4]
 
   /* increment process count */
 
   mypage_tx->proc_count = mypage_rx->proc_count + 1;  [5]
 
   /* body of program goes here */
           
.
.
.
   /* clean up */

   /* decrement process count */

   temp = mypage_rx->proc_count - 1;  [6]
   mypage_tx->proc_count = temp;

   /* wait for MEMORY CHANNEL update to occur */

   while (mypage_rx->proc_count != temp)
           ;

   if (mypage_rx->proc_count == 0 )
           printf("The last process is exiting\n");
.
.
.
}    

  1. The process must be able to read the data that it writes to the Memory Channel global address space. Therefore, it declares two addresses, one for transmit and one for receive. [Return to example]

  2. The imc_asalloc function allocates a region of Memory Channel address space. The characteristics of the region are as follows:

     -  The key that identifies the region is 123.

     -  The size of the region is 8192 bytes.

     -  The IMC_URW and IMC_GRW flags grant read/write access to the owner and to the group.

     -  The region is allocated on logical rail 0.

    [Return to example]

  3. This call to the imc_asattach function attaches the region for transmit at the address that is pointed to by the mypage_tx variable. The value of the flag parameter is set to IMC_LOOPBACK, so that any time the process writes data to the region, the data is looped back to the receive memory. [Return to example]

  4. This call to the imc_asattach function attaches the region for receive at the address that is pointed to by the mypage_rx variable. [Return to example]

  5. The program increments the global process count by adding 1 to the value in the receive pointer, and by assigning the result into the transmit pointer. When the program writes to the transmit pointer, it does not wait to ensure that the write instruction completes. [Return to example]

  6. After the body of the program completes, the program decrements the process count and tests that the decremented value was transmitted to the other hosts in the cluster. To ensure that it examines the decremented count (rather than some transient value), the program stores the decremented count in a local variable, temp. It writes the decremented count to the transmit region, and then waits for the value in the receive region to match the value in temp. When the match occurs, the program knows that the decremented process count has been written to the Memory Channel address space. [Return to example]

In this example, the use of the local variable ensures that the program compares the value in the receive memory with the value that was transmitted. An attempt to use the value in the receive memory before ensuring that the value had been updated may result in erroneous data being read.
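The decrement idiom can be reduced to a few lines of plain C. This sketch does not use the Memory Channel API; tx_count, rx_count, and hardware_tick are invented stand-ins, with hardware_tick modeling the moment at which the hardware delivers the pending write to receive memory:

```c
#include <assert.h>

/* Toy stand-ins (not the real API) for the transmit and receive
 * mappings of one region. hardware_tick() models one step of the
 * Memory Channel hardware delivering the pending update. */
static volatile int tx_count;
static volatile int rx_count;

static void hardware_tick(void)
{
    rx_count = tx_count;   /* the write finally reaches receive memory */
}

/* The decrement idiom from Example 10-2: work from a local copy,
 * write through transmit, then wait until receive memory matches. */
int decrement_and_wait(void)
{
    int temp = rx_count - 1;
    tx_count = temp;

    while (rx_count != temp)
        hardware_tick();   /* a real program simply spins here */

    return rx_count;
}
```

In a real program the loop body is empty; the loop spins until the Memory Channel hardware performs the update on its own.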

10.6.6    Error Management

In a shared memory system, the process of reading and writing to memory is assumed to be error-free. In a Memory Channel system, the error rate is of the order of three errors per year. This is much lower than the error rates of standard networks and I/O subsystems.

The Memory Channel hardware reports detected errors to the Memory Channel software. The Memory Channel hardware provides two guarantees that make it possible to develop applications that can cope with errors:

These guarantees simplify the process of developing reliable and efficient messaging systems.

The Memory Channel API library provides the following functions to help applications implement error management:

  -  imc_rderrcnt_mr, which reads the clusterwide error count

  -  imc_ckerrcnt_mr, which checks for errors on a specified logical rail

The operating system maintains a count of the number of errors that occur on the cluster. The system increments the value whenever it detects a Memory Channel hardware error in the cluster, and when a host joins or leaves the cluster.

The task of detecting and processing an error takes a small, but finite, amount of time. This means that the count that is returned by the imc_rderrcnt_mr function might not be up-to-date with respect to an error that has just occurred on another host in the cluster. On the local host, the count is always up-to-date.

Use the imc_rderrcnt_mr function to implement a simple and effective error-detection mechanism by reading the error count before transmitting a message, and including the count in the message. The receiving process compares the error count in the message body with the local value that is determined after the message arrives. The local value is guaranteed to be up-to-date, so if this value is the same as the transmitted value, then it is certain that no intervening errors occurred. Example 10-3 shows this technique.
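Stripped of the Memory Channel calls, the technique is a simple compare. The sketch below stubs the clusterwide error counter with an ordinary variable; cluster_error_count, build_message, and message_is_valid are invented names, and a real sender would obtain the count from imc_rderrcnt_mr instead:

```c
#include <assert.h>

/* Toy message format: the sender embeds the error count it read just
 * before transmitting. */
typedef struct {
    int send_count;   /* sender's error count at transmit time */
    int payload;
} message;

/* Stub standing in for the clusterwide error counter. */
static int cluster_error_count = 0;

message build_message(int payload)
{
    message m;
    m.send_count = cluster_error_count;  /* imc_rderrcnt_mr() in real code */
    m.payload = payload;
    return m;
}

/* Receiver side: the local count is always up to date, so a match
 * guarantees that no error intervened. */
int message_is_valid(const message *m)
{
    return m->send_count == cluster_error_count;
}
```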

Example 10-3:  Error Detection Using the imc_rderrcnt_mr Function

/* /usr/examples/cluster/mc_ex3.c */
 
/*****************************************
*********  Transmitting Process  *********
******************************************/
 
#include <sys/imc.h>
#include <c_asm.h>
#define mb() asm("mb")
main()
{
        typedef struct {
                volatile        int     msg_arrived;
                int     send_count;
                int     remainder[2046];
        } global_page;
        global_page     *mypage_rx, *mypage_tx;
        imc_asid_t      glob_id;
        int             i;
        volatile        int err_count;
 
 
        imc_api_init(NULL);
 
        imc_asalloc (1234, 8192, IMC_URW, 0, &glob_id,0);
        imc_asattach (glob_id, IMC_TRANSMIT, IMC_SHARED, IMC_LOOPBACK,
                      &(caddr_t)mypage_tx);
        imc_asattach (glob_id, IMC_RECEIVE, IMC_SHARED, 0,
                      &(caddr_t)mypage_rx);
 
        /*  save the error count  */
        while (  (err_count = imc_rderrcnt_mr(0) ) < 0 )
                ;
 
        mypage_tx->send_count = err_count;
 
        /* store message data */
        for (i = 0; i < 2046; i++)
                mypage_tx->remainder[i] = i;
 
        /* now mark as valid */
        mb();
 
        do {
                mypage_tx->msg_arrived = 1;
        } while (mypage_rx->msg_arrived != 1);  /* ensure no error on
                                                 valid flag          */
 
}
 
 
/*****************************************
***********  Receiving Process  **********
******************************************/
 
#include <sys/imc.h>
main()
{
        typedef struct {
                volatile        int     msg_arrived;
                int     send_count;
                int     remainder[2046];
        } global_page;
        global_page     *mypage_rx, *mypage_tx;
        imc_asid_t      glob_id;
        int             i;
        volatile        int err_count;
 
        imc_api_init(NULL);
 
        imc_asalloc (1234, 8192, IMC_URW, 0, &glob_id,0);
        imc_asattach (glob_id, IMC_RECEIVE, IMC_SHARED, 0,
                      &(caddr_t)mypage_rx);
 
        /* wait for message arrival */
        while ( mypage_rx->msg_arrived == 0 )
                ;
 
        /* get this system's error count */
        while ( (err_count = imc_rderrcnt_mr(0) ) < 0 )
                ;
 
        if (err_count == mypage_rx->send_count) {
                /* no error, process the body */
                            .....
        }
        else {
                /* do error  processing */
                        ......
        }
}
 

As shown in Example 10-3, the imc_rderrcnt_mr function can be safely used to detect errors at the receiving end of a message. However, it cannot be guaranteed to detect errors at the transmitting end, because there is a small, but finite, possibility that the transmitting process will read the error count before the transmitting host has been notified of an error that occurred on the receiving host. In Example 10-3, the program must rely on a higher-level protocol to inform the transmitting host of the error.

The imc_ckerrcnt_mr function provides guaranteed error detection for a specified logical rail. This function takes a user-supplied local error count and a logical rail number as parameters, and returns an error in the following circumstances:

If the function returns successfully, no errors have been detected between when the local error count was stored and the imc_ckerrcnt_mr function was called.

The imc_ckerrcnt_mr function reads the Memory Channel adapter hardware error status for the specified logical rail; this is a hardware operation that takes several microseconds. Therefore, the imc_ckerrcnt_mr function takes longer to execute than the imc_rderrcnt_mr function, which reads only a memory location.

Example 10-4 shows an amended version of the send sequence shown in Example 10-3. In Example 10-4, the transmitting process performs error detection.

Example 10-4:  Error Detection Using the imc_ckerrcnt_mr Function

/* /usr/examples/cluster/mc_ex4.c */
 
/**********************************************/
/*  Transmitting Process With Error Detection */
/**********************************************/
 
#include <c_asm.h>
#define mb() asm("mb")
 
#include <sys/imc.h>
main()
{
        typedef struct {
                volatile        int     msg_arrived;
                int     send_count;
                int     remainder[2046];
        } global_page;
        global_page     *mypage_rx, *mypage_tx;
        imc_asid_t      glob_id;
        int             i, status;
        volatile        int     err_count;
 
        imc_api_init(NULL);
 
        imc_asalloc (1234, 8192, IMC_URW, 0, &glob_id,0);
        imc_asattach (glob_id, IMC_TRANSMIT, IMC_SHARED, IMC_LOOPBACK,
                      &(caddr_t)mypage_tx);
        imc_asattach (glob_id, IMC_RECEIVE, IMC_SHARED, 0,
                      &(caddr_t)mypage_rx);
 
        /*  save the error count  */
        while (  (err_count = imc_rderrcnt_mr(0) ) < 0 )
                ;
 
        do {
                mypage_tx->send_count = err_count;
 
                /* store message data */
                for (i = 0; i < 2046; i++)
                        mypage_tx->remainder[i] = i;
 
                /* now mark as valid */
                mb();
 
                mypage_tx->msg_arrived = 1;
 
        /*  if error occurs, retransmit */
 
        } while ( (status = imc_ckerrcnt_mr(&err_count,0)) != IMC_SUCCESS);
}
 

10.7    Clusterwide Locks

In a Memory Channel system, the processes communicate by reading and writing regions of the Memory Channel address space. The preceding sections contain sample programs that show arbitrary reading and writing of regions. In practice, however, a locking mechanism is sometimes needed to provide controlled access to regions and to other clusterwide resources. The Memory Channel API library provides a set of lock functions that enable applications to implement access control on resources.

The Memory Channel API library implements locks by using mapped pages of the global Memory Channel address space. For efficiency reasons, locks are allocated in sets rather than individually. The imc_lkalloc function allows you to allocate a lock set. For example, if you need 20 locks, it is more efficient to create one set of 20 locks than five sets of four locks.

To facilitate the initial coordination of distributed applications, the imc_lkalloc function allows a process to atomically (that is, in a single operation) allocate the lock set and acquire the first lock in the set. This feature allows the process to determine whether or not it is the first process to allocate the lock set. If it is, the process is guaranteed access to the lock and can safely initialize the resource.

Instead of allocating the lock set and acquiring the first lock atomically, a process can call the imc_lkalloc function and then the imc_lkacquire function. In that case, however, there is a risk that another process might acquire the lock between the two function calls, and the first process will not be guaranteed access to the lock.
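The value of the atomic allocate-and-acquire operation can be illustrated with a C11 compare-and-swap, which collapses the same check-then-claim step into a single operation. The lockset_state variable and try_become_creator function are invented names; this sketch does not call the Memory Channel API:

```c
#include <assert.h>
#include <stdatomic.h>

/* 0 = lock set not yet allocated, 1 = allocated. */
static atomic_int lockset_state;

/* One atomic step: "allocate the set and take the first lock".
 * Returns 1 for the single caller that wins the creator role,
 * 0 for every later caller. */
int try_become_creator(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&lockset_state, &expected, 1);
}
```

The first caller wins the creator role and can safely initialize the shared resource; every later caller sees that the set already exists, much as a second call to imc_lkalloc returns IMC_EXISTS. Performing the check and the claim as two separate steps would reopen exactly the race described above.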

Example 10-5 shows a program in which the first process to lock a region of Memory Channel address space initializes the region, and the processes that subsequently access the region simply update the process count.

Example 10-5:  Locking Memory Channel Regions

/* /usr/examples/cluster/mc_ex5.c */
 
#include <sys/types.h>
#include <sys/imc.h>
 
main ( )
{
    imc_asid_t   glob_id;
    imc_lkid_t   lock_id;
    int          locks = 4;
    int          status;
 
    typedef struct {
                    int    proc_count;
                    int    pattern[2047];
                   } clust_rec;
 
    clust_rec *global_record_tx, *global_record_rx;  [1]
    caddr_t      add_rx_ptr = 0, add_tx_ptr = 0;
    int     j;
 
    status  = imc_api_init(NULL);
 
    imc_asalloc(123, 8192, IMC_URW, 0, &glob_id, 0);
 
    imc_asattach(glob_id, IMC_TRANSMIT, IMC_SHARED,
                 IMC_LOOPBACK, &add_tx_ptr);
 
    imc_asattach(glob_id, IMC_RECEIVE, IMC_SHARED,
                 0, &add_rx_ptr);
 
    global_record_tx = (clust_rec*) add_tx_ptr;  [2]
    global_record_rx = (clust_rec*) add_rx_ptr;
 
 
    status = imc_lkalloc(456, &locks, IMC_LKU,  IMC_CREATOR,
                         &lock_id);  [3]
    if (status == IMC_SUCCESS)
    {
 
    /* This is the first process. Initialize the global region  */
 
        global_record_tx->proc_count = 0;  [4]
        for (j = 0; j < 2047; j++)
                global_record_tx->pattern[j] = j;
 
        /* release the lock */
        imc_lkrelease(lock_id, 0);  [5]
 
    }
 
 
   /* This is a secondary process */
 
    else if (status == IMC_EXISTS)
    {
       imc_lkalloc(456, &locks, IMC_LKU, 0, &lock_id);  [6]
 
       imc_lkacquire(lock_id, 0, 0, IMC_LOCKWAIT);  [7]
 
       /* wait for access to region */
 
       global_record_tx->proc_count = global_record_rx->proc_count+1;  [8]
 
       /* release the lock */
 
       imc_lkrelease(lock_id, 0);
 
    }
 
    /* body of program goes here */
           
.
.
.
    /* clean up */

    imc_lkdealloc(lock_id);  [9]
    imc_asdetach(glob_id);
    imc_asdealloc(glob_id);
}

  1. To read the data that it writes to the Memory Channel global address space, the process maps the region for both transmit and receive. See Example 10-2 for a detailed description of this procedure. [Return to example]

  2. The program overlays the transmit and receive pointers with the global record structure. [Return to example]

  3. The process tries to create a lock set that contains four locks and a key value of 456. The call to the imc_lkalloc function also specifies the IMC_CREATOR flag. Therefore, if the lock set is not already allocated, the function will automatically acquire lock zero (0). If the lock set already exists, the imc_lkalloc function fails to allocate the lock set and returns the value IMC_EXISTS. [Return to example]

  4. The process that creates the lock set (and consequently holds lock zero (0)) initializes the global region. [Return to example]

  5. When the process finishes initializing the region, it calls the imc_lkrelease function to release the lock. [Return to example]

  6. Secondary processes that execute after the region has been initialized, having failed in the first call to the imc_lkalloc function, now call the function again without the IMC_CREATOR flag. Because the value of the key parameter is the same (456), this call allocates the same lock set. [Return to example]

  7. The secondary process calls the imc_lkacquire function to acquire lock zero (0) from the lock set. [Return to example]

  8. The secondary process updates the process count and writes it to the transmit region. [Return to example]

  9. At the end of the program, the processes release all Memory Channel resources. [Return to example]

When a process acquires a lock, other processes executing on the cluster cannot acquire that lock.

Waiting for a lock to become free entails busy spinning, which can have a significant effect on performance. Therefore, in the interest of overall system performance, make your applications acquire locks only when they are needed and release them promptly.

10.8    Cluster Signals

The Memory Channel API library provides the imc_kill function to allow processes to send signals to specified processes executing on a remote host in a cluster. This function is similar to the UNIX kill(2) function. When the kill function is passed a negative pid value in a cluster, the signal is sent to all processes whose process group number is equal to the absolute value of that pid, even if a process is on another cluster member. The PID is guaranteed to be unique across the cluster.

The main differences between the imc_kill function and the kill function are that the imc_kill function allows the sending of signals across cluster members, but it does not support the sending of signals to multiple processes.

10.9    Cluster Information

The following sections discuss how to use the Memory Channel API functions to access cluster information, and how to access status information from the command line.

10.9.1    Using Memory Channel API Functions to Access Memory Channel API Cluster Information

The Memory Channel API library provides the imc_getclusterinfo function, which allows processes to get information about the hosts in a Memory Channel API cluster. The function returns one or more of the following:

  -  The names of the hosts in the Memory Channel API cluster

  -  A bitmask of the active Memory Channel logical rails

The function does not return information about a host unless the Memory Channel API library is initialized on the host.

The Memory Channel API library provides the imc_wait_cluster_event function to block a calling thread until a specified cluster event occurs. The following Memory Channel API cluster events are valid:

  -  IMC_CC_EVENT_HOST, which occurs when a host joins or leaves the Memory Channel API cluster

  -  IMC_CC_EVENT_RAIL, which occurs when the state of the active logical rails changes

The imc_wait_cluster_event function examines the current representation of the Memory Channel API cluster configuration item that is being monitored, and returns the new Memory Channel API cluster configuration.

Example 10-6 shows how you can use the imc_getclusterinfo function with the imc_wait_cluster_event function to request the names of the members of the Memory Channel API cluster and the active Memory Channel logical rails bitmask, and then wait for an event change on either.

Example 10-6:  Requesting Memory Channel API Cluster Information; Waiting for Memory Channel API Cluster Events

/* /usr/examples/cluster/mc_ex6.c */
 
#include <sys/imc.h>
 
main ( )
{
 
    imc_railinfo    mask;
    imc_hostinfo    hostinfo;
 
    int             status;
    imc_infoType    items[3];
    imc_eventType   events[3];
 
 
    items[0] = IMC_GET_ACTIVERAILS;
    items[1] = IMC_GET_HOSTS;
    items[2] = 0;
 
    events[0] = IMC_CC_EVENT_RAIL;
    events[1] = IMC_CC_EVENT_HOST;
    events[2] = 0;
 
    imc_api_init(NULL);
 
    status = imc_getclusterinfo(items,2,mask,sizeof(imc_railinfo),
                                    &hostinfo,sizeof(imc_hostinfo));
 
    if (status != IMC_SUCCESS)
        imc_perror("imc_getclusterinfo:",status);
 
    status = imc_wait_cluster_event(events, 2, 0,
                                    mask, sizeof(imc_railinfo),
                                    &hostinfo, sizeof(imc_hostinfo));
 
    if ((status != IMC_HOST_CHANGE) && (status != IMC_RAIL_CHANGE))
        imc_perror("imc_wait_cluster_event didn't complete:",status);
 
}   /*main*/
 

10.9.2    Accessing Memory Channel Status Information from the Command Line

The Memory Channel API library provides the imcs command to report on Memory Channel status. The imcs command writes information to the standard output about currently active Memory Channel facilities. The output is displayed as a list of regions or lock sets, and includes the following information:

10.10    Comparison of Shared Memory and Message Passing Models

There are two models that you can use to develop applications that are based on the Memory Channel API library:

  -  The shared memory model

  -  The message passing model

At first, the shared memory approach might seem more suited to the Memory Channel features. However, developers who use this model must deal with the latency, coherency, and error-detection problems that are described in this chapter. In some cases, it might be more appropriate to develop a simple message-passing library that hides these problems from applications. The data transfer functions in such a library can be implemented completely in user space. Therefore, they can operate as efficiently as implementations based on the shared memory model.
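As a sketch of what the core of such a library might look like, the single-slot mailbox below hides the ordering details behind a send/receive pair. It runs entirely in ordinary memory with C11 atomics (the mailbox type and mbx_* names are invented); the release store on the valid flag plays the role of the mb()-then-set-flag sequence in the earlier examples:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Single-slot mailbox: the payload is written first, then the flag is
 * published with release semantics so a receiver that observes the flag
 * is guaranteed to observe the payload. */
typedef struct {
    int        payload[16];
    atomic_int valid;          /* 0 = empty, 1 = message present */
} mailbox;

void mbx_send(mailbox *m, const int *data, int n)
{
    memcpy(m->payload, data, n * sizeof(int));
    /* Release ordering: payload becomes visible before the flag. */
    atomic_store_explicit(&m->valid, 1, memory_order_release);
}

/* Returns 1 and copies the payload out if a message is present,
 * 0 if the mailbox is still empty. */
int mbx_recv(mailbox *m, int *out, int n)
{
    if (!atomic_load_explicit(&m->valid, memory_order_acquire))
        return 0;
    memcpy(out, m->payload, n * sizeof(int));
    return 1;
}
```

A caller never touches the flag or the ordering primitives directly, which is precisely how a message-passing layer shields applications from the latency and coherency issues described earlier in this chapter.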

10.11    Frequently Asked Questions

This section contains answers to questions that are asked by programmers who use the Memory Channel API to develop programs for TruCluster systems.

10.11.1    IMC_NOMAPPER Return Code

Question: An attempt was made to do an attach to a coherent region using the imc_asattach function. The function returned the value IMC_NOMAPPER. What does this mean?

Answer: This return value indicates that the imc_mapper process is missing from a system in your Memory Channel API cluster.

The imc_mapper process is automatically started in the following cases:

  -  When the imc_init command runs automatically at system boot time

  -  When the system administrator invokes the imc_init command after the system boots

To solve this problem, reboot the system from which the imc_mapper process is missing.

This error may occur if you shut down a system to single-user mode from init level 3, and then return the system to multi-user mode without doing a complete reboot. If you want to reboot a system that runs TruCluster Server software, you must do a full reboot of that system.

10.11.2    Efficient Data Copy

Question: How can data be copied to a Memory Channel transmit region in order to obtain maximum Memory Channel bandwidth?

Answer: The Memory Channel API imc_bcopy function provides an efficient way of copying aligned or unaligned data to Memory Channel. The imc_bcopy function has been optimized to make maximum use of the buffering capability of a Compaq Alpha CPU.

You can also use the imc_bcopy function to copy data efficiently between two buffers in standard memory.

10.11.3    Memory Channel Bandwidth Availability

Question: Is maximum Memory Channel bandwidth available when using coherent Memory Channel regions?

Answer: No. Coherent regions use the loopback feature to ensure local coherency. Therefore, every write data cycle has a corresponding cycle to loop the data back; this halves the available bandwidth. See Section 10.6.1.3 for more information about the loopback feature.

10.11.4    Memory Channel API Cluster Configuration Change

Question: How can a program determine whether a Memory Channel API cluster configuration change has occurred?

Answer: You can use the new imc_wait_cluster_event function to monitor hosts that are joining or leaving the Memory Channel API cluster, or to monitor changes in the state of the active logical rails. You can write a program that calls the imc_wait_cluster_event function in a separate thread; this blocks the caller until a state change occurs.

10.11.5    Bus Error Message

Question: When a program tries to set a value in an attached transmit region, it crashes with the following message:

Bus error (core dumped)

Why does this happen?

Answer: The data type of the value may be smaller than 32 bits (in C, an int is a 32-bit data item, and a short is a 16-bit data item). The Compaq Alpha processor, like other RISC processors, reads and writes data in 64-bit units or 32-bit units. When you assign a value to a data item that is smaller than 32 bits, the compiler generates code that loads a 32-bit unit, changes the bytes that are to be modified, and then stores the entire 32-bit unit. If such a data item is in a Memory Channel region that is attached for transmit, the assignment causes a read operation to occur in the attached area. Because transmit areas are write-only, a bus error is reported.

You can prevent this problem by ensuring that all accesses are done on 32-bit data items. See Section 10.6.3 for more information.
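One way to follow this rule when 16-bit values must be placed in a transmit region is to assemble them in an ordinary variable and issue a single 32-bit store. The helper names below are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Build the two 16-bit fields in an ordinary variable rather than
 * assigning through a short * into the transmit region. */
uint32_t pack_shorts(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* The single aligned 32-bit store that the compiler emits for this
 * assignment never reads the write-only transmit region, so no
 * load-modify-store sequence (and no bus error) occurs. */
void store_pair(volatile uint32_t *slot, uint16_t lo, uint16_t hi)
{
    *slot = pack_shorts(lo, hi);
}
```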