WASD VMS Hypertext Services - Technical Overview

21 - Server Performance

21.1 - Simple File Request Turn-Around
21.2 - Scripting
21.3 - SSL
21.4 - Suggestions

The server has a single-process, multi-threaded, asynchronous I/O design. On a single-processor system this is the most efficient approach. On a multi-processor system it is limited by the single process context (with scripts executing within their own context). For I/O constrained processing (the most common in general Web environments) the AST-driven approach is quite efficient.

The test system was a lightly-loaded AlphaServer 4100 4/400 (4 x 400MHz CPUs), VMS v7.3-2 and DEC TCP/IP 5.4. No Keep-Alive: functionality was employed so each request required a complete TCP/IP connection and disposal. DNS (name resolution) and access logging were disabled. The server and test-bench utility were located on separate systems with 100 Mbps Fast-Ethernet interconnection.

As of v7.1 the performance data is collected using the "ApacheBench" utility (23.5 - Apache Bench). DCL procedures with sets of ApacheBench calls are used to benchmark requests. These procedures and the generated output from benchmark runs (collected via $@procedure/OUTPUT=filename) are available in the HT_ROOT:[EXERCISE] directory.

These results are indicative only!

On a clustered, multi-user system too many things vary slightly all the time. Hence accesses are batched and interleaved between servers in an attempt to provide a representative result.


OSU/Apache Comparison

Until v5.3 a direct comparison of performance between OSU and WASD had not been made (even to satisfy the author's own occasional curiosity). After a number of users with experience in both environments commented that WASD seemed faster (was it?), it was decided to make and provide comparisons using the same metrics used on WASD for some time.

Every endeavour has been made to ensure the comparison is as equitable as possible (e.g. each server executes at the same process priority, has a suitable cache enabled, and runs on the same machine in the same relatively quiescent environment). Each test run was interleaved between the servers to try and distribute any environment variations. Tests showing port 7080 were to WASD, port 7777 to the OSU server, and port 8888 to Apache. All servers were configured "out-of-the-box" with minimal changes (generally just path mappings), WASD executing via the [INSTALL]DEMO.COM procedure.

Of course performance is just one of a number of considerations in any software environment (otherwise we wouldn't be using VMS now would we? ;-) No specific conclusions are promoted by the author. Readers may draw their own from the results recorded below.

For this document the results were derived using the WASD v9.0, CSWS V1.3 (based on Apache 1.3.26), and OSU 3.10 servers. CSWS V1.3 still seems to be the most widely deployed Apache on VMS, perhaps due to some widely discussed deployment issues with SWS V2.0 (based on Apache 2.0.47), and so it has remained the baseline for VMS Apache comparison.


21.1 - Simple File Request Turn-Around

A series of tests using batches of accesses. The first test returned an empty file, measuring response and file access time without any actual transfer. The second requested a file of 64K characters, testing performance with a more realistic load. All were done using one and ten concurrent requests. Note that the Apache measurement is "out-of-the-box" - the author could find no hint of a file cache, let alone how to enable/disable one.

Cache Disabled - Requests/Second
Response  Concurrent  WASD  OSU  Apache
0K                 1   200  117      45
0K                10   252  125      47
64K                1    78   43      43
64K               10    93   54      27

Cache Enabled - Requests/Second
Response  Concurrent  WASD  OSU  Apache
0K                 1   521  415      34
0K                10   831  522      38
64K                1   102   43      28
64K               10   134   55      32

Result file:

HT_ROOT:[EXERCISE]PERF_FILES_NOCACHE_AB_V90.TXT
HT_ROOT:[EXERCISE]PERF_FILES_AB_V90.TXT

With WASD, both cached and non-cached throughput actually improves at ten concurrent requests (undoubtedly due to the latency of serial TCP/IP connection/disconnection when performed one-by-one, compared to several happening concurrently).

Note that the response and transfer benefits decline noticeably with file size (transfer time). The difference between cached and non-cached with the zero file size (no actual data transfer involved) gives some indication of the raw difference in response latency, some 250-300% improvement. This is a fairly crude analysis, but does give some indication of cache efficiencies.

One other indicative metric is the CPU time consumed during the file measurement runs. The value for Apache was not measured as it would be distributed over an indeterminate number of child processes.

CPU Time Consumed (Seconds)
Cache     WASD   OSU  Apache
Disabled  11.9  48.7       -
Enabled    4.6  38.1       -


File Transfer Rate

Under similar conditions results indicate a potential transfer rate well in excess of 1 Mbyte per second. This serves to demonstrate that server architecture should not be the limiting factor in file throughput.

Transfer Rate - MBytes/Second
Response             Concurrent  WASD  OSU  Apache
3.9MB (7700 blocks)           1   8.5  5.5     8.7
3.9MB (7700 blocks)          10   7.4  5.9     8.3

Result file:

HT_ROOT:[EXERCISE]PERF_XFER_AB_V90.TXT

The results for Apache indicate one occasion where a collection of child processes performs very well (with assistance from generous VCC_... cache settings).


File Record Format

The server can handle STREAM, STREAM_LF, STREAM_CR, FIXED and UNDEFINED record formats very much more efficiently than VARIABLE or VFC files.

With STREAM, FIXED and UNDEFINED files the assumption is that HTTP carriage-control is within the file itself (i.e. at least the newline (LF), which is all that browsers require) and needs no additional processing. With VARIABLE record files the carriage-control is implied, and so each record requires additional processing by the server to supply it. Although the HTTPd improves efficiency by buffering multiple records from variable-record files and writing them collectively to the network, stream and binary file reads are by Virtual Block and are written to the network immediately, making the transfer of these very efficient indeed!


21.2 - Scripting

Persistent subprocesses are probably the most efficient solution for child-process scripting under VMS. See the "Scripting Environment" document. The I/O still needs to be relayed to the client by the server.

A simple performance evaluation shows the relative merits of the four WASD scripting environments available, plus a comparison with OSU and Apache. The scripts used were HT_ROOT:[SRC.CGIPLUS]CGIPLUSTEST.C, which executes in both standard CGI and CGIplus environments, and an ISAPI example DLL, HT_ROOT:[SRC.CGIPLUS]ISAPIEXAMPLE.C, which provides equivalent output. A series of accesses was made. The first test returned only the HTTP header, evaluating raw request turn-around time. The second test requested a body of 64K characters, again testing performance with a more realistic load.

DECnet-based scripting was tested using essentially the same environment as subprocess-based CGI, assessing the performance of the same script being executed using DECnet to manage the processes. Three separate environments have been evaluated, WASD-DECnet-CGI, WASD-OSU-emulation and OSU. The OSU script used the WASD CGISYM.C utility to generate the required CGI symbols (also see WASD/OSU Comparison). DECnet-Plus T5.0.3 was in use.

CGI Scripting - Requests/Second
Response  Concurrent  CGI  CGIplus  ISAPI  DECnet-CGI  OSU-emul  OSU  Apache
0KB                1   25      254    249          16        15   12       4
0KB               10   63      473    351          36        30   25       5
64KB               1   21       95     85          15        14    9       4
64KB              10   27       46     45          32        27   18       5

Result file:

HT_ROOT:[EXERCISE]PERF_SCRIPTS_AB_V90.TXT


Scripting Observations

Although these results are indicative only, they do show CGIplus and ISAPI to have a potential for improvement over standard CGI from a factor of 5 (500%) up to factors in excess of 10 (1000%) - a not inconsiderable improvement. Of course this test generates the output stream very simply and efficiently and so excludes any actual processing time that may be required by a "real" application. If the script/application has a large activation time the reduction in response latency could be even more significant (e.g. Perl scripts and RDBMS access languages).

CGIplus under V7.2 has seen a dramatic increase in throughput over previous version benchmarks ... in excess of a factor of 2 (100%)! This is entirely due to the new "struct" mode available. See the Scripting Overview for further detail.


DECnet Observations

This section comments on non-persistent scripts (i.e. those that must run-up and run-down with each request - general CGI behaviour). Although not shown here, measurements of connection reuse show significant benefits in reduced response times, consistency of response times and overall throughput, with a difference of some 200% over non-reuse (similar improvements were reported with the OSU 3.3a server).

With ten simultaneous and back-to-back scripts and no connection reuse, many more network processes are generated than just ten. This is due to NETSERVER maintenance tasks (log creation and purging, activating and deactivating the task, etc.) adding latency to this script environment. The throughput was generally still lower than with subprocess-based scripting.

While earlier versions of this document cautioned on the use of DECnet-based scripting, that caution has been relaxed somewhat through connection reuse.


WASD/OSU Comparison

A direct comparison of CGI performance between WASD and OSU scripting is biased in favour of WASD, as OSU scripting is based on its own protocol with CGI behaviour layered-in above scripts that require it. Therefore a non-CGI comparison was devised. The script is designed to favour neither environment, merely return the plain-text string "Hello!" as quickly as possible. Data for Apache is also included, although this type of scripting is not really its forte.

  $! OSU and WASD scripting face-to-face in a script that favours neither unduly
  $ if f$type(WWWEXEC_RUNDOWN_STRING) .nes. ""
  $ then
  $    write net_link "<DNETTEXT>"
  $    write net_link "200 Success"
  $    write net_link "Hello!"
  $    write net_link "</DNETTEXT>"
  $ else
  $    write sys$output "Content-Type: text/plain"
  $    write sys$output ""
  $    write sys$output "Hello!"
  $ endif

Face-to-Face - Requests/Second
           Concurrent  CGI  CGIplus  ISAPI  DECnet-CGI  OSU-emul  OSU  Apache
"Hello!"            1   50      n/a    n/a         n/a       n/a   29       5
"Hello!"           10  123      n/a    n/a         n/a       n/a   60       6

Result file:

HT_ROOT:[EXERCISE]PERF_SCRIPTS_AB_V90.TXT


WASD/Apache Scripting Comparison

CGI scripting is notoriously slow (as illustrated above), hence the effort expended by designers in creating persistent scripting environments - those where the scripting engine (and perhaps other state) is maintained between requests. Both WASD and Apache implement these as integrated modules, the former as CGIplus/RTE and the latter as loadable modules.

The following comparison uses two of the most common scripting environments and engines shared between WASD and Apache, Perl and PHP. The engines used in both server environments were identical: WASD 9.0 with the PHPWASD123 and PERLRTE121 packages, and CSWS 1.3 with the CSWS_PHP-V0101 and PERL-V0506-1-1 packages.

A simple script for each engine is used as a common test-bench for the two servers.

  <!-- face2face.php -->
  <?php
  echo "<B>Hello!</B>"
  ?>
 
  # face2face.pl
  print "Content-Type:  text/html\n\n
  <B>Hello!</B>
  ";

These are designed to measure the script environment and its activation latencies, rather than the time required to process script content (which should be consistent considering they are the same engines). In addition, the standard php_info.php is used to demonstrate with a script that actually performs some processing. No data is provided for the OSU package.

Persistent Scripting - Requests/Second
               Concurrent  WASD  Apache
face2face.pl            1    60      15
face2face.pl           10   108      29
face2face.php           1    58      32
face2face.php          10   140      57
php_info.php            1    43      27
php_info.php           10    94      46

Result file:

HT_ROOT:[EXERCISE]PERF_PERSIST_AB_V90.TXT


Persistent Scripting Observations

These results demonstrate the efficiency and scalability of the WASD CGIplus/RTE technology used to implement its persistent scripting environments. Most site-specific scripts can also be built using the libraries, code fragments, and example scripts provided with the WASD package, and obtain similar efficiencies and low latencies. See "Scripting Environment" document.


21.3 - SSL

At this time there are no definitive measurements of SSL performance (18 - Secure Sockets Layer). One might expect that, because of the CPU-intensive cryptography employed in SSL requests, performance would be significantly lower, particularly where concurrent requests are in progress. In practice SSL seems to provide more-than-acceptable responsiveness.


21.4 - Suggestions

Here are some suggestions for improving the performance of the server, listed in approximate order of significance. Note that these will have proportionally less impact on an otherwise heavily loaded system.

  1. Disable host name resolution (configuration parameter [DNSLookup]). DNS latency can slow request processing significantly! Most log analysis tools can convert literal addresses so DNS resolution is often an unnecessary burden.

  2. Later versions of TCP/IP Services for OpenVMS seem to have large default values for socket send and receive buffers. MultiNet and TCPware are reported to improve transfer of large responses by increasing low default values for send buffer size. The WASD global configuration directives [SocketSizeRcvBuf] and [SocketSizeSndBuf] allow default values to be adjusted. WATCH can be used to report network connection buffer values.

  3. Enable caching (configuration parameter [Cache]).

  4. Ensure served files are not VARIABLE record format (see above). Enable STREAM-LF conversion using a value such as 250 (configuration parameter [StreamLF], and SET against required paths using mapping rules).

  5. Use persistent-subprocess DCL/scripting (configuration parameter [ZombieLifeTime]).

  6. Ensure script processes are given every possible chance to persist (configuration parameter [DclBitBucketTimeout]).

  7. Use the persistent scripting capabilities of CGIplus or ISAPI whenever possible.

  8. Ensure the server account's WSQUO and WSEXTENT quotas are adequate. A constantly paging server is a slow server!

  9. Tune the network and DCL output buffer sizes to the Maximum Transfer Unit (MTU) of the server's network interface. Using Digital TCP/IP Services (a.k.a. UCX), display the MTU:
      TCPIP> SHOW INTERFACE
                                                                 Packets
      Interface   IP_Addr         Network mask          Receive          Send     MTU
     
       SE0        203.127.158.3   255.255.255.0          376960        704345    1500
       LO0        127.0.0.1       255.0.0.0                 306           306       0
    

    In this example the MTU of the ethernet interface is 1500 (bytes). Set the [BufferSizeNetWrite] configuration directive to be some multiple of this. In the case of 1500, say 3000, 4500 or 6000. Also set the [BufferSizeDclOutput] to the same value. Rationale: always use completely filled network packets when transmitting data.

  10. Disable logging (configuration parameter [Logging]).

  11. Set the HTTP server process priority higher, say to 6 (use startup qualifier /PRIORITY=). Do this after due consideration. It will only improve response time if the system is also used for other, lower priority purposes. It will not help if Web-serving is the sole activity of the system.

  12. Reduce to as few as possible the number of mapping and authorization rules, particularly those that have conditions that require additional evaluation. Also see 14 - Request Processing Configuration.

  13. Use a pre-defined log format (e.g. "common", configuration parameter [LogFormat]). User-specified formats require more processing for each entry.

  14. Disable request history (configuration parameter [RequestHistory]).

  15. Disable activity statistics (configuration parameter [ActivityDays]).

