W3C Lib Architecture

Libwww Architecture

Under construction. Any suggestions or ideas are welcome at libwww@w3.org.

The W3C Reference Library, a.k.a. Library of Common Code, is a general code base written in portable C that can be used as a basis for building clients, servers, and many other Web applications. It contains reference code for accessing HTTP, FTP, Gopher, News, WAIS, and Telnet servers, and the local file system. Furthermore, it provides modules for parsing, managing, and presenting hypertext objects to the user, as well as a wide spectrum of generic programming utilities. The Library is the basis for many World-Wide Web applications, and the W3C reference applications are built on top of it.

This document describes the architecture of the Library in generic terms, without referring too much to the code itself. It is meant to give an overview of the design, which is required reading if you intend to enhance the Library. If you are looking for a specific description of the API, please read the User's Guide.

NOTE This document is also available as one big HTML file intended for printout. Please note that not all links in this version work!

Table of Contents

  • Introduction
  • Basic Design Model

  • The Core

  • Application Modules


    Henrik Frystyk, libwww@w3.org, December 1995

    Introduction

    The W3C Reference Library is a general code base that can be used as a basis for building a large variety of World-Wide Web applications. Its main purpose is to provide services to transmit data objects rendered in many different media types either to or from a remote server, using the most common Internet access methods or the local file system. It provides plain C reference implementations of these specifications and is especially designed to be used on a large set of different platforms. Version 3.1 supports more than 20 Unix flavors, VMS, and Windows NT, and ongoing work is extending the set of platforms.

    Even though plain C does not support an object-oriented model but merely enables the concept, many of the data structures in the Library are derived from a class notation. This leads to situations where forced type casting is required in order to use a reference to a subclass where a superclass is expected. The forced type casting problem, and inheritance in general, would be solved if an object-oriented programming language were used instead of C, but the current standardization and deployment level of object-oriented languages implies that part of the portability would be lost in the transition. Several intermediate solutions are under consideration in which one or more object-oriented APIs built on top of the Library provide the application programmer with a cleaner interface.
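    The casting pattern described above can be sketched in a few lines of C; the structure names here are illustrative, not the Library's actual declarations:

```c
#include <assert.h>
#include <string.h>

/* The "superclass" holds the fields common to all anchors. */
typedef struct _Anchor {
    const char *address;            /* URI of the object */
} Anchor;

/* A "subclass" embeds the superclass as its FIRST member, so a pointer
   to the subclass may be forcibly cast to a pointer to the superclass. */
typedef struct _ParentAnchor {
    Anchor base;                    /* must come first */
    const char *content_type;      /* subclass-specific field */
} ParentAnchor;

/* A method written against the superclass. */
const char *Anchor_address (Anchor *me)
{
    return me->address;
}
```

    The cast `(Anchor *) &p` is exactly the forced type casting the text refers to: it is legal because the subclass starts with the superclass layout, but the compiler cannot check it.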

    Many of the features of the Library are demonstrated in the Line Mode Browser, a text terminal client built directly on top of the Library. Even though this application is usable as an independent Web application, its main purpose is to provide a working example of how the Library can be used. It is important to note, however, that the Line Mode Browser represents only one way of using the Library, and many other applications may want to use it differently.

    The development of the W3C Reference Library was started by Tim Berners-Lee in 1990, and today the Library is a multi-functional code base with a large amount of knowledge about network programming and portability built into it, with help from Ari Luotonen, Jean-Francois Groff, Håkon W. Lie, and a large number of people on the Internet.



    Basic Design Model

    The main criterion behind the design of the W3C Reference Library was to make it easily extensible as new Internet standards evolve for the transportation and representation of data objects. The philosophy was to make it possible to dynamically "plug in" new modules without touching the inner parts of the Library. On platforms that support dynamic linking, this can be used to change the functionality of an application completely at runtime, and eventually the Library can be extended to support some of the new concepts of mobile code, where new modules are downloaded from the network at runtime as they are needed by the application. The result of this concept is a Library architecture consisting of five main parts, as illustrated in the figure below:

    [Figure: the five main parts of the Library architecture]

    The figure is similar to a protocol stack where the lower layers provide a set of services to the upper layers. This is also the case in the Library where the "layering" is as follows:

    Generic Utilities
    The Library provides a large set of generic utility modules such as container classes, string utilities, network utilities, etc. They serve the important function of separating the upper-layer code from platform-specific implementations, using a large set of macros that make the Library more portable. The modules are used throughout the Library itself and can easily be employed in many applications.
    Core
    This is the fundamental part of the Library. The size of the core is deliberately kept small, and it is important to note that it can do nothing on its own; all the functionality for accessing the network, parsing data objects, handling user interaction, logging, etc. resides in the upper modules in the figure. The core provides a standard interface through which the application program requests a service, but most often the handling of the request itself takes place outside the core.
    Stream Modules
    All data is transported back and forth between the application and the network using streams. Streams are objects that accept blocks of characters, much as ANSI C FILE streams do. A block can be as small as one character, but large blocks are normally preferred for better performance. Often, though not always, a stream has an output to which it directs outgoing data. An example of a stream with no output is one that acts like a black hole: it absorbs data without ever sending it out again. The typical stream, however, has an output and performs some kind of data conversion on the incoming data before redirecting it to that output.
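    As a sketch of this idea (with hypothetical names, not the Library's actual stream API), a stream can be modeled as a structure holding a method pointer and an optional target:

```c
#include <assert.h>
#include <stddef.h>

/* A stream accepts blocks of characters and may forward the result to
   an output stream, called its target. */
typedef struct _Stream Stream;
struct _Stream {
    int (*put_block) (Stream *me, const char *b, int len);
    Stream *target;                 /* NULL for, e.g., a black hole */
    long count;                     /* characters seen so far */
};

/* A black hole: absorbs data without ever sending it out again. */
static int blackhole_put_block (Stream *me, const char *b, int len)
{
    (void) b;
    me->count += len;
    return 0;                       /* 0 means OK */
}

/* A converter: transforms incoming data, then forwards it. */
static int upcase_put_block (Stream *me, const char *b, int len)
{
    char buf[256];
    int i;
    if (len > (int) sizeof(buf)) return -1;
    for (i = 0; i < len; i++)
        buf[i] = (b[i] >= 'a' && b[i] <= 'z') ? b[i] - 32 : b[i];
    me->count += len;
    return me->target->put_block(me->target, buf, len);
}
```

    Cascading the converter into the black hole forms a minimal two-element stream chain of the kind described later in this document.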
    Access Modules
    The Access modules are protocol-specific modules that make the application capable of communicating with a wide range of Internet services. The Library comes with a wide set of protocol access modules, including HTTP, FTP, Gopher, WAIS, NNTP, Telnet, rlogin, TN3270, and the local file system, and new ones can easily be added to the list.
    Application Modules
    The application modules are often specific to client applications, including functions that require user interaction, management of history lists, callback functions, logging, etc. The reference implementations of these modules are often intended for character-based applications like the Line Mode Browser. More advanced clients can override them; that is, a module with an identical interface is provided by the application, and the loading of the default module is suppressed.
    When writing an application, most of the code interacting with the Library will consist of access modules, stream modules, and application modules. These modules can either provide additional functionality or override existing functionality in the Library in order to make use of more platform-dependent implementations. The latter is typically the case with the application modules, which must be adjusted to a given graphics platform.

    The User's Guide explains more on how to set up and use the Access modules and the Stream modules in an application and how to use the application modules. The rest of this document on the architecture of the W3C Reference Library is devoted to describing the Core of the Library.



    Overview of the Core

    The main concept in the Library is a "request/response" model in which an application issues a request for a URI (URL). The Library then tries to fulfill the request as efficiently as possible, either by requesting the URL from the origin server, a proxy server, or a gateway, directly from the local file system, or from a locally cached version. Data is delivered back to the application as soon as it becomes available, which guarantees minimum access delay for the application. From version 3.0, the Library supports threads, including its own platform-independent thread model called "libwww threads". This allows multiple requests to be handled simultaneously without blocking the application while waiting for data.

    Requests and Responses

    The "request/response" model is illustrated in the control/data diagram shown below. The diagram shows only the core modules; the other modules are "pasted in" later. Note that the Library code is to the right of the thick vertical line (green), and that the application to the left can be any type of application, for example a proxy or a client. The architecture of the Library supports clients and servers in pretty much the same way, as it makes little difference to the Library: a client has a user interface whereas a server has a network interface. It is a good idea to study the Line Mode Browser and the httpd as reference implementations using the Library to see this duality.

    Another thing to note is that from version 3.1 the Library supports large-scale data flow from the application to the network as well as from the network to the application. This has an important impact on the functionality that can be put into applications, for example allowing collaborative authoring via the Web. The architecture behind this is described in the section "Post Webs - an API for PUT and POST".

    [Figure: control and data flow between the core modules]

    The thin lines (red) are control flow, the thick lines (blue) are data flow, and the "lightning" (magenta) is control flow resulting from events handled by the Library. Let's see what happens when an application issues a request. The description assumes an event loop, which can either be the one provided by the Library or an external event loop provided by the application. The section on libwww threads explains how this can be set up. The numbers refer to the figure above.

    1. The event manager waits for an event from the application, for example the user clicking the mouse on a link or typing a number on the keyboard. When an event arrives, the event manager calls the user event handler provided by the application.
    2. The user event handler creates a request object and uses one of the load methods.
    3. The Request object creates a new Net object.
    4. The Net object calls any callback functions registered to be called before the request is actually started. These can, for example, map the URL to another destination, check the cache, look for proxy servers and gateways, etc.
    5. If the request has to access the net, the Net object passes it to the protocol object.
    6. The "after" callback functions are called when the request is terminated. Typical operations here are, for example, logging, history updates, etc. If the "before" callback functions establish that no net access is required, the protocol object is not used at all.
    7. The event callback function is now called to actually get the document.
    8. When data arrives, the Format manager is contacted to build a stream stack.
    9. The converted data is handed either from the network to the application or from the application to the network as it becomes available. If no data is ready, control is given back to the event manager.
    10. If an error occurs, a dialog callback function is called to notify the user.
    This is the "macro" description of how the core modules interact; in the rest of this document we shall see more of the details of what goes on inside the core modules and which objects are involved. Note that by using a threaded model, the Library can handle multiple requests simultaneously. An example of how to do this is described in the section "Libwww Threads".
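    The first steps of this sequence can be sketched as follows. The names are modeled on the Library's style but are hypothetical, and the load call completes synchronously here purely for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the request/response entry point: the
   application creates a request object, binds it to a URI, and hands
   it to a load method; termination is reported via an "after" callback. */
typedef struct _Request {
    const char *uri;
    int status;                                   /* final status */
    void (*on_terminate) (struct _Request *req);  /* "after" callback */
} Request;

static Request *Request_new (const char *uri)
{
    Request *req = (Request *) calloc(1, sizeof(Request));
    req->uri = uri;
    return req;
}

/* In the real Library the load call returns at once and the event loop
   drives the transfer; here we complete synchronously for illustration. */
static int Request_load (Request *req)
{
    req->status = 200;              /* pretend the retrieval succeeded */
    if (req->on_terminate) req->on_terminate(req);
    return 0;
}
```

    The essential point is that the application only ever creates a request object and hands it over; everything after that happens behind the core's standard interface.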
    Request Object
    The access manager is the main entry point for requesting a data object pointed to by a URI. It has a set of methods that allow the application to request different services, for example to get a URI, post a URI, or search a URI.
    Protocol Object
    The protocol manager is invoked by the access manager in order to access a document not found in memory or in the cache. The manager consists of a set of protocol modules handling the access schemes HTTP, FTP, NNTP, Gopher, WAIS, Telnet, and access to the local file system. The protocol modules are registered dynamically (using static linking), and the User's Guide describes how modules can be registered. Each protocol module is responsible for establishing the connection to the remote server (or the local file system) and extracting information using a specific access method. When data arrives from the network, it is passed on to the format manager.
    Format Manager
    The stream format manager takes care of the transportation of streams of data from the network to the application and vice versa. It also performs any parsing and data format conversion requested, based on a set of registered format converters and a simple algorithm for selecting the best conversion. Like the protocol modules, data format converters can be registered dynamically, and the current set of streams includes, among others, MIME, SGML, HTML, and LaTeX.
    Error Object
    This module manages an information stack containing information about all errors that have occurred during the communication with a remote server, or simply information about the current state. Using a stack for this kind of information makes nested error messages possible, where each message can be classified and filtered according to its impact on the current request, for example "Fatal", "Non-Fatal", "Warning", etc. The filtering can be used to decide which level of messages is passed back to the user.
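    A minimal sketch of such an error stack, with illustrative names and severity levels, might look like this:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Nested error stack with severity filtering.  New errors are pushed
   on top, so the most recent message comes first. */
typedef enum { ERR_WARNING, ERR_NON_FATAL, ERR_FATAL } ErrSeverity;

typedef struct _ErrInfo {
    ErrSeverity severity;
    const char *message;
    struct _ErrInfo *next;          /* the error this one wraps */
} ErrInfo;

static ErrInfo *Error_push (ErrInfo *stack, ErrSeverity s, const char *msg)
{
    ErrInfo *e = (ErrInfo *) malloc(sizeof(ErrInfo));
    e->severity = s;
    e->message = msg;
    e->next = stack;
    return e;
}

/* Count messages at or above the given level -- the kind of filtering
   used to decide what is shown to the user. */
static int Error_count (ErrInfo *stack, ErrSeverity threshold)
{
    int n = 0;
    for (; stack; stack = stack->next)
        if (stack->severity >= threshold) n++;
    return n;
}
```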
    Net Object
    The net manager provides an interface for handling asynchronous sockets, which are an integral part of the Library.
    Event Manager
    The event manager is a "session layer" that decides which thread should be the active thread. A thread can be either an internal libwww thread or an external thread, for example a POSIX thread, and the event manager itself can be either the internal Library manager or an external event manager. Currently the internal event manager uses a select function call to decide which thread should be made the active one; an external event manager can use another decision model. One of the design ideas behind the event manager is that it can be extended to a full session-layer manager handling, for example, the control of an HTTP-NG connection. The event manager is described together with the internal thread model in the section "Libwww Threads".
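    The dispatch step of such an event manager can be sketched as below; readiness is abstracted behind a predicate so that a real implementation could plug in a select() call, and all names are hypothetical:

```c
#include <assert.h>

/* Each libwww thread is represented by a socket and an event handler.
   The manager makes active whichever thread's socket is ready. */
typedef struct _EventReg {
    int sockfd;
    int (*handler) (int sockfd);    /* the thread's event callback */
} EventReg;

/* Scan the registered sockets and call the handler of the first ready
   one; a real manager would use select() to find ready descriptors. */
static int Event_dispatch (EventReg *regs, int n, int (*ready) (int sockfd))
{
    int i;
    for (i = 0; i < n; i++)
        if (ready(regs[i].sockfd))
            return regs[i].handler(regs[i].sockfd);  /* active thread */
    return -1;                       /* nothing ready; keep waiting */
}

/* Demo predicate and handler for illustration and testing only. */
static int demo_ready (int fd)   { return fd == 7; }
static int demo_handler (int fd) { return fd * 10; }
```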



    Core Objects and Managers

    The Library core contains a set of objects central to the Library. Each of the core modules explained in the section "Control and Data Flow" relies on one or more of these objects. This section describes the relationship between the core modules and the core objects, and the relationships among the core objects themselves.

    The figure below is very similar to the one in section "Control and Data Flow", but it also introduces the associated core objects.

    [Figure: core modules and their associated objects]

    R HTRequest
    The HTRequest object contains the information necessary to handle a request issued by the application: the method to be used (for example "GET" or "PUT"), user preferences (language, content type, etc.) specific to this request, where the output data should go, etc. An HTRequest object exists until the request reaches a final state, either success or failure, after which it can be discarded. Normally, an HTRequest object is created by the application, but the Library is capable of creating them on its own under certain circumstances. An example is when the Library creates a "Post Web", as explained in the section "Building a POST Web, an API for PUT and POST".
    A HTAnchor
    HTAnchor objects represent any document which may be the source or destination of hypertext links. The HTAnchor object contains all information about the object: whether it has been loaded, metainformation like language, media type, etc., and any relations to other objects. The Library defines two anchor classes: a parent anchor and a child anchor. The former contains information about whole data objects, and the latter contains information about subparts of a data object. The HTAnchor object is a generic superclass of both parent anchors and child anchors. Section "Anchor Objects" describes anchors and their relations in more detail.
    N HTNet
    The HTNet object is a network object which contains all information required to read from and write to the network: the current socket descriptor (or ANSI C file descriptor) used for reading and writing, which input buffer to use, and where to put the data once it is read. It also contains timing information about how long it takes to connect to a remote host and how many times connection has been attempted. This information is used by the DNS cache in order to optimize access to multi-homed hosts. The HTNet object is also a key element in the libwww thread model, where it is used to identify a "thread". The libwww thread model is explained in "Description of libwww Threads".
    E HTError
    The HTError object contains information about errors that occurred while a request was handled by the Library. Errors can be nested, and the object is independent of the natural language used to pass information to the user. The presentation of the messages may be handled by the application, or it can ignore them altogether.
    S HTStream
    The stream object accepts sequences of characters. It is a destination of data which can be thought of much like an output stream in C++ or an ANSI C file stream for writing data to a disk or another peripheral device. This broad definition makes streams very flexible, and they are used as the main method of transporting data from the application to the network and vice versa. The Library defines two stream classes: a generic stream class and a specialized stream class for structured data using SGML lexical tokens. The contents of the two classes are described in detail in the section "Stream Objects".
    The following figure illustrates the relations between the core objects themselves.

    [Figure: relations between the core objects]

    1. When an application issues a request, the access manager binds the anchor corresponding to a URL together with a request object. The binding exists until the request reaches a final state, after which the application can discard the request object. Normally the anchor objects stay in memory for the whole lifetime of the application, as the set of anchors represents the part of the Web that the application has been in touch with, including metainformation, etc.
    2. The application can make a binding between the request object and the desired destination for the data when it arrives, typically from the network. By default the request object is bound to a presentation stream which presents a hypertext object to the user on the screen, but the data can also be written to a file, presented as source text, etc.
    3. If the file cache is enabled a cache object is created and linked to the anchor object by the cache manager so that the access manager on any future requests can use the cached version (if not stale). As mentioned, the cache manager is yet to be fully designed, and the current approach may change.
    4. If the data object is not found in the cache or in memory, the protocol manager is called by the access manager. The protocol manager then executes a specific protocol module, which creates an HTNet object and binds it to the request object. The HTNet object is maintained uniquely by the protocol module and is removed by the protocol module as soon as the communication with the remote server reaches a final state.
    5. The request object also has a link to any error information related to it. At the end of the request this information is handled by the error manager and an error message may be generated and passed to the user.
    6. When data starts arriving, typically from the network, it is directed down the stream chain, which either already exists or is created as the data arrives (stream chains are described in the section "Stream Objects"). In the case where the application is transmitting a data object to a remote server, there are two stream chains directed in opposite directions: one from the application to the network and one from the network to the application.
    7. The end of the stream chain is the stream that the user may have defined when the request was first issued, or it can be the default destination, which presents the information on the screen. Between the first and the last stream in the chain there can be any number of other stream objects performing operations either directly on the data or on the stream flow itself. A T-stream is an example of the latter, where the stream flow is divided in two.
    8. The application receives the data arriving from the network via the "HText" object (or any of the other stream interfaces as explained in section The HTML Parser in the User's Guide).



    The Anchor Object

    Anchors represent any references to data objects which may be the sources or destinations of hypertext links. This section contains a general description of the model used to bind anchors together in an internal representation in the W3C Reference Library. The anchors are organized into a sub-web which represents the part of the Web that the application (often the user) has been in touch with. In this sub-web, any anchor can be the source of zero, one, or many links, and it may be the destination of zero, one, or many links. That is, any anchor can point to, and be pointed to by, any number of links. An anchor being the source of many links often occurs with the POST method, where for example the same data object is to be posted to a news group, a mailing list, and an HTTP server. This is explained in the section "Building a POST Web, an API for PUT and POST".

    Every data object has an anchor associated with it. Anchors exist throughout the lifetime of the application, but as this is generally not the case for data objects, it is possible to have an anchor without a data object. If the data object is stored in the file cache or in memory, the parent anchor contains a link to it so the application can access it either directly or through the file cache manager. There are two types of anchors in the Library:

    parent anchors
    Represent whole data objects. That is, the destination of a link pointing to a parent anchor is the full contents of the data object. Parent anchors are used to store all information about a data object, for example its content type, language, and length.
    child anchors
    Represent a subpart of a data object. A subpart is declared by putting a NAME attribute in the anchor declaration, and a child anchor is the destination of a link if the HREF link declaration contains a "#" and a tag appended to the URI. Child anchors do not contain any information about the data object itself; they only keep a handle (or "tag") pointing into the data object kept by the corresponding parent anchor.
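    The addressing scheme for child anchors can be illustrated with a small helper that splits a URI reference at the "#" character; this is a sketch, not the Library's actual parser:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the parent part of `uri` into `parent` (capacity `len`) and
   returns a pointer to the tag after '#', or NULL if there is no tag.
   The tag is the handle that identifies the child anchor within the
   parent's data object. */
static const char *split_fragment (const char *uri, char *parent, size_t len)
{
    const char *hash = strchr(uri, '#');
    size_t n = hash ? (size_t) (hash - uri) : strlen(uri);
    if (n >= len) n = len - 1;      /* truncate defensively */
    memcpy(parent, uri, n);
    parent[n] = '\0';
    return hash ? hash + 1 : NULL;
}
```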
    Both types of anchors are subclasses of a generic anchor class which defines a set of outgoing links to the destinations the anchor points to. Every parent anchor points to an address which may or may not exist; in the case of posting an anchor to a remote server, the address pointed to is yet to be created. The client can assign an address for the object, but it might be overridden (or denied completely) by the remote server. The relationship between parent anchors and child anchors is illustrated in the figure.

    [Figure: relations between parent and child anchors]

    1. Parent anchors keep a list of their children, which is used to avoid having multiple copies of the same child and in the garbage collection of anchors.
    2. All child anchors have a pointer to their parent, as only the parent anchors keep information about the data object itself. Parent anchors simply have a pointer to themselves.
    3. Every parent anchor has an address, which is a URL pointing to a resource that may or may not exist.
    4. Parents can have a data object associated with them via the HyperDoc object. In this figure, anchors B and C have a data object but A does not, either because the anchor has not yet been requested or because the data object has been discarded from memory by the application.
    5. Any anchor can have any number of links pointing to a set of destinations. In most situations there is only one destination, but multiple destinations are typical when posting data objects to a remote server.
    6. This anchor has two destinations. By default the main destination will be the one selected.
    7. Parent anchors keep a list of the other anchors pointing to them. This information is required if a single parent anchor (and its children) is removed from the sub-web.
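    The relations in points 1-3 can be sketched with two structures; the names and the lookup function are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parents keep a list of children, children point back to their
   parent, and a parent's parent pointer refers to itself. */
typedef struct _AnchorP AnchorP;
typedef struct _AnchorC AnchorC;

struct _AnchorP {
    const char *address;            /* URL; the resource may not exist yet */
    AnchorC *children;              /* list head of named subparts */
    AnchorP *parent;                /* points to itself for parents */
};

struct _AnchorC {
    const char *tag;                /* handle into the parent's data */
    AnchorP *parent;
    AnchorC *next;
};

/* Find or create a child -- the lookup that avoids having multiple
   copies of the same child. */
static AnchorC *Anchor_findChild (AnchorP *p, const char *tag)
{
    AnchorC *c;
    for (c = p->children; c; c = c->next)
        if (strcmp(c->tag, tag) == 0) return c;
    c = (AnchorC *) calloc(1, sizeof(AnchorC));
    c->tag = tag;
    c->parent = p;
    c->next = p->children;
    p->children = c;
    return c;
}
```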



    The Request Object

    Under construction. Any suggestions or ideas are welcome at libwww@w3.org.



    Net Objects and Libwww Threads

    In a single-process, single-threaded environment, any request to, for example, the I/O interface blocks further progress in the process. Any combination of a multi-process or multi-threaded implementation of the Library makes it possible for the application to request several independent documents at the same time without being blocked by slow I/O operations. As a Web application can be expected to spend much of its time doing I/O, such as "connect" and "read", a high degree of optimization can be obtained if multiple threads can run at the same time.

    Library version 3.0 was designed to be thread compatible. It can be used either with conventional threads or with the "libwww thread" concept, which allows an application to handle requests in a constrained asynchronous manner using non-blocking I/O and an event loop based on a select system call. As a result, I/O operations such as establishing a TCP connection to a remote server and reading from the network can be handled without making the user wait until the operation has terminated. Instead the user can issue new requests, interrupt ongoing requests, scroll a document, etc.

    Version 3.1 of the Library has an enhanced libwww thread model, as it supports writing large amounts of data from the application to the network, also using non-blocking I/O operations. The main purpose of Library 3.1 was to provide basic support for remote collaborative work through the HTTP methods PUT and POST.

    As libwww threads are not real threads but rather a notion of using non-blocking I/O for accessing data objects from the network (or the local file system), they can be used on any number of platforms with or without native support for threads. This section describes the model behind libwww threads and how it affects applications.

    The Net object

    The Net object contains all the state information required to suspend and resume the execution of a request using asynchronous I/O. The use of asynchronous I/O has important implications for the implementation of the access modules in the Library, for example the HTTP module, which is explained later:

    • Global variables can be used only if they are at all times independent of the current state of the active Net object.
    • Automatic variables can be used only if they are initialized on every entry to the function and stay independent of the state of the current request throughout their lifetime.
    • All information necessary for completing a request must be kept in an autonomous data object that is passed around via the stack.

    The main reason for keeping the Net object separate from the Request object is that some requests require more than one Net object; for example, FTP has a Net object for the control TCP connection and a Net object for each data TCP connection. In the case of HTTP/1.0 and HTTP/1.1, there is a 1:1 correspondence between a Net object and a request object. In HTTP/1.2, a Net object may live longer than a single request, as persistent connections can handle a set of requests over the same TCP socket. Net objects can be used in three different ways:

    1. All requests are preemptive and all I/O is blocking
    2. Requests are non-preemptive managed by an internal event loop
    3. Requests are non-preemptive managed by an external event loop
    The three modes are described in more detail in the section on Internal and External Events. In mode 2, the Net object is used to make the binding between the socket-based internal event loop (using a select() call) and a request, so that a socket ready for an I/O action can make the corresponding libwww thread active. In modes 1 and 3, Net objects represent the socket interface of a request.
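    The way a libwww thread resumes from where it left off can be sketched as a state machine kept in the Net object; the states and the event function below are hypothetical:

```c
#include <assert.h>

/* The Net object records the current protocol state, and the event
   handler is simply re-entered with that state each time the socket
   becomes ready -- this is what makes the "thread" resumable. */
typedef enum { NET_BEGIN, NET_CONNECTED, NET_READING, NET_DONE } NetState;

typedef struct _Net {
    NetState state;
    int sockfd;
} Net;

/* One step of a (hypothetical) protocol: returns 1 while more events
   are needed and 0 when the request has reached a final state. */
static int HTProtocol_event (Net *net)
{
    switch (net->state) {
    case NET_BEGIN:     net->state = NET_CONNECTED; return 1;
    case NET_CONNECTED: net->state = NET_READING;   return 1;
    case NET_READING:   net->state = NET_DONE;      return 0;
    default:            return 0;
    }
}
```

    Because all state lives in the Net object rather than in global or long-lived automatic variables, the function can be left and re-entered freely, which is exactly the constraint listed above for the access modules.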

    Creation and Termination of a Net object

    A Net object is created by the Net manager from within the Request manager every time a request is passed to the Library. A request can be issued either by the application or by the Library itself, for example as a result of redirection, access authentication, or when a new data connection is created in an FTP request. All new Net objects are automatically associated with a group, which might already exist or be created together with the new Net object.

    When a Net object has been created, the Request manager returns immediately to the caller and does not see the request again before it has terminated with either a success or an error as the result. The request can either be started immediately by the Net manager or be put into a queue if the maximum number of open TCP connections has been reached. When a request is terminated, there is typically a set of tasks that the application would like to perform:

    • Update the history list
    • Report the result to the log manager
    • Update the display
    • etc.
    Handling the termination of a request is based on callback functions that can be registered in the Net manager dynamically at run time. Multiple callback functions can be registered, in which case they are all called from the Net manager in the sequence in which they were registered. As an example, the Request manager registers a callback function to handle the status of the request with regard to some internal actions; this function is registered at initialization time of the Library. The application can add its own callback functions to be called on termination of a request.
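    A minimal sketch of such a callback registry, with hypothetical names, could look like this:

```c
#include <assert.h>

/* Termination callbacks are kept in registration order, and all of
   them are called when a request terminates. */
#define MAX_CBF 8
typedef void (*TerminateCbf) (int status);

static TerminateCbf cbf_table[MAX_CBF];
static int cbf_count = 0;

static int Net_registerAfter (TerminateCbf cbf)
{
    if (cbf_count >= MAX_CBF) return -1;
    cbf_table[cbf_count++] = cbf;
    return 0;
}

static void Net_callAfter (int status)
{
    int i;
    for (i = 0; i < cbf_count; i++)   /* in the sequence registered */
        cbf_table[i](status);
}

/* Demo callbacks (a "log" and a "history update") used only to show
   that registration order is preserved. */
static int log_total = 0;
static void demo_log (int status)     { log_total += status; }
static void demo_history (int status) { log_total *= 2; (void) status; }
```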



    The Error Object

    The error object contains error information about a request. Errors may occur at any instant during a request, and often an error results in a cascade of errors. This is explained in the User's Guide.



    Data Transportation using Streams

    A stream is an object which accepts sequences of characters. It is a destination of data which can be thought of much like an output stream in C++ or an ANSI C file stream for writing data to a disk or another peripheral device. It can be anything that accepts data, for example another stream, an ANSI C file stream, or even a black hole which absorbs data without ever sending it out again. Streams are used to transport data internally in the Library between the application, the network, and the local file system. Streams can be cascaded into a stream chain by directing the output of a stream, often called the sink or target, into another stream. This means that the processing of data can be done as the total effect of several cascaded streams.

    From version 3.1 of the Library, streams are used to transport data both from the application to the network and vice versa, which enables applications to send data objects to the remote server - a requirement for doing collaborative work using HTTP as the transport carrier. The stream-based architecture allows the Library to be event driven in the sense that data is put down a stream as it gets ready, for example from the network, and any necessary actions then cascade off this event. An event can also be data arriving from the application, which would be the case when an application is posting a data object to a remote server.

    The Library has two fundamental stream classes which are described in the following:

    • A generic superclass
    • A structured stream subclass
    Apart from these classes, many stream modules have their own subclass definitions of either the generic stream class or the structured class. These definitions can be found in the individual stream modules.

    The Generic Stream Class

    The generic stream class is a superclass of all other streams and it provides a uniform interface to all stream objects regardless of what stream sub-class they originate from. The generic stream class is defined with the following set of methods.

    Stream
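    The method table of the generic stream class can be sketched along these lines (names and the exact method set are illustrative; the Library's own definitions live in the stream modules). A small counting stream shows how an instance binds private state to the class:

```c
#include <stddef.h>

/* Every stream object points to a method table (its class) and carries
 * private state; chains are built by pointing "target" at another stream. */
typedef struct _Stream Stream;

typedef struct {
    const char *name;
    int (*put_character)(Stream *me, char c);
    int (*put_string)(Stream *me, const char *s);
    int (*put_block)(Stream *me, const char *b, int len);
    int (*flush)(Stream *me);
    int (*free_stream)(Stream *me);   /* normal termination of the stream */
    int (*abort_stream)(Stream *me);  /* abnormal termination */
} StreamClass;

struct _Stream {
    const StreamClass *isa;  /* uniform interface to any stream subclass */
    Stream *target;          /* next stream in a chain, or NULL for a sink */
    long count;              /* private state of this example subclass */
};

/* A tiny example subclass: a stream that counts the bytes it absorbs. */
static int count_block(Stream *me, const char *b, int len)
{ (void)b; me->count += len; return 0; }
static int count_char(Stream *me, char c)
{ (void)c; me->count++; return 0; }
static int count_string(Stream *me, const char *s)
{ int n = 0; while (s[n]) n++; return count_block(me, s, n); }
static int count_nop(Stream *me) { (void)me; return 0; }

static const StreamClass CounterClass = {
    "Counter", count_char, count_string, count_block,
    count_nop, count_nop, count_nop
};

Stream make_counter(void)
{
    Stream s;
    s.isa = &CounterClass;
    s.target = NULL;
    s.count = 0;
    return s;
}
```

    Callers only ever go through `isa`, which is what makes every stream subclass interchangeable.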

    The Structured Stream Class

    A structured stream is a subclass of a stream, but instead of just accepting data, it also accepts the SGML methods "begin element", "end element", and "put entity". The conversion from a generic stream to a structured stream is done by the SGML tokenizer which recognizes basic SGML markup like "<", ">", entities etc.

    StrucStream

    A structured stream therefore represents a structured document. The elements and entities in the stream are referred to by numbers, rather than strings. A DTD contains the mapping between element names and numbers, so each instance of a structured stream is associated with a corresponding DTD. The only DTD currently in the Library is an extended version of the HTML DTD level 1, but work is currently being done to update this to comply with the emerging HTML level 3 specification.

    As for generic streams, it is not required that the stream actually has an output - it can, for example, be a stream writing to a file, where no output is required.
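    A sketch of what the structured stream interface adds on top of the generic one, together with a toy DTD lookup that maps element names to numbers (the element set and all names here are invented for illustration):

```c
#include <string.h>

/* Elements are referred to by number; a DTD holds the mapping from
 * names to numbers. This three-element "DTD" is invented for the sketch. */
enum { EL_HTML, EL_TITLE, EL_P, EL_MAX };
static const char *dtd_names[EL_MAX] = { "HTML", "TITLE", "P" };

/* Look an element name up in the DTD; returns -1 if unknown. */
int dtd_element_number(const char *name)
{
    int i;
    for (i = 0; i < EL_MAX; i++)
        if (strcmp(dtd_names[i], name) == 0)
            return i;
    return -1;
}

/* A structured stream adds the SGML methods to the generic interface;
 * the numeric arguments index into the associated DTD. */
typedef struct _StructuredStream StructuredStream;
struct _StructuredStream {
    int (*put_character)(StructuredStream *me, char c);
    int (*begin_element)(StructuredStream *me, int element_number);
    int (*end_element)(StructuredStream *me, int element_number);
    int (*put_entity)(StructuredStream *me, int entity_number);
    int depth;  /* example private state: current element nesting */
};
```

    The SGML tokenizer would call `dtd_element_number` once per tag and then emit the numeric begin/end element events down the structured stream.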

    Cascaded Streams

    Streams are often cascaded into a stream chain, but before explaining why a stream chain is a flexible construction for data transportation, let's have a look at what kinds of streams the Library provides. The stream modules can be divided into groups depending on their behavior:

    Protocol Streams
    Internal streams that parse or generate protocol specific information in order to communicate with remote servers.
    Converters
    Streams that can be used to convert data from one media type to another or create a data object and present it to the user.
    Presenters
    These are streams that save the data to a local file and call an external program, for example a PostScript viewer.
    I/O Streams
    Streams that can write data to a socket or an ANSI C FILE object. These can be used when redirecting a request to a local file or when saving a document in the cache.
    Basic Streams
    A set of basic utility streams with no or little internal contents but required in order to cascade streams.
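    As an example of a basic stream, a black hole can be sketched in a few lines: it accepts any amount of data and discards it (the byte counter exists only so the behavior can be observed; a real black hole would keep nothing):

```c
/* A black hole accepts any amount of data and never emits anything. */
typedef struct {
    long absorbed;  /* for illustration only */
} BlackHole;

int blackhole_put_block(BlackHole *me, const char *data, int len)
{
    (void)data;          /* the data is simply discarded */
    me->absorbed += len;
    return 0;            /* a black hole never fails */
}
```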
    The first four stream classes often fall into a natural order in a stream chain, which is indicated in the figure below. Here two typical stream pipes are shown for data flowing from the network to the application and vice versa:

    Stream Chains

    As a more specific example, the figure below shows how streams are cascaded when data from a remote HTTP server is handled by the Library. In this case, the stream chain is built as data arrives at the Library from the network: the first stream can decide from the first line of the response whether it is a 0.9 or a 1.0 response; the HTTP header parser stream can decide the format of the body once the header part is parsed; and so forth. In other situations the stream chain can be set up before data arrives, if the format is known prior to the data acquisition.

    The ground symbol symbolizes that all data goes into a black hole from which nothing is radiated. The two stream outputs going to the application from each of the converters symbolize that error information is separated from other data objects. This allows the application to direct the body part of an error message, for example from a "401 Unauthorized" HTTP status code, to a separate "debug" window where it can be displayed without affecting the current document view.



    The Protocol Object

    The Core Library doesn't know anything about protocol modules. This has the big advantage that an application can register any number of protocols, including access to the local file system. The Library core is independent of whether you make an application capable of speaking all known Internet protocols or one that can only access, for example, an HTTP server - to the Library it's all a set of callback functions.

    The Protocol object handles two types of callback functions: one for a client application and one for a server application. This means that you can register not only any type of protocol clients but also their server counterparts. Whether it is a client module or a server module, a protocol module is identified by an access scheme which is identical to the access scheme known from the URL syntax, for example

    	http://www.w3.org
    	ftp://ftp.w3.org
    	etc.
    
    The User's Guide describes in detail how you can set up the protocol modules in your application.
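    A sketch of such a registry of protocol modules keyed by access scheme (the names `Protocol_add` and `Protocol_find` are illustrative, not the Library's actual API):

```c
#include <string.h>

#define MAX_PROTOCOLS 16

typedef int (*ProtocolCallback)(void *request);

typedef struct {
    const char *scheme;       /* "http", "ftp", "news", ... */
    ProtocolCallback client;  /* client side of the protocol */
    ProtocolCallback server;  /* server side; may be NULL */
} Protocol;

static Protocol protocols[MAX_PROTOCOLS];
static int protocol_count = 0;

/* Register a protocol module under its URL access scheme. */
int Protocol_add(const char *scheme, ProtocolCallback client,
                 ProtocolCallback server)
{
    if (protocol_count >= MAX_PROTOCOLS) return -1;
    protocols[protocol_count].scheme = scheme;
    protocols[protocol_count].client = client;
    protocols[protocol_count].server = server;
    protocol_count++;
    return 0;
}

/* Find the protocol registered for a scheme; NULL if none. */
Protocol *Protocol_find(const char *scheme)
{
    int i;
    for (i = 0; i < protocol_count; i++)
        if (strcmp(protocols[i].scheme, scheme) == 0)
            return &protocols[i];
    return NULL;
}
```

    When a request is issued, the core would extract the scheme from the URL, look it up with `Protocol_find`, and dispatch to the registered callback.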



    The Data Format Manager

    This is explained in the User's Guide.



    Internal and External Events

    This section describes what happens when an event arrives to the Library - either from the application or from the network. The Library provides three different ways of handling events, and it is necessary to be aware of these modes in the design phase of an application as they have an impact on the architecture of the application. The Library can be used in multiple modes simultaneously and an application can change mode as a function of the action requested by the user. The three different modes are described in the following:

    Base Mode (preemptive)
    In this mode all requests are handled in a preemptive way that does not allow for any events to pause the execution of a thread or kill it. This mode is in other words strictly single threaded and the major difference between this mode and the next two modes is that all sockets are made blocking instead of non-blocking. This mode can either be used in forking applications or in threaded applications using an external thread model where non-blocking I/O is not a requirement.
    Active Mode (Internal Event Loop)
    In this mode the event loop is placed in the Library, in the HTEvntrg module. The mode can either be used by character-based applications with a limited capability of user interaction, or it can be used by more advanced GUI clients where the window widget allows redirection of user events to one or more sockets that can be recognized by a select() call. It should be noted that even though all sockets are non-blocking, the select() call itself blocks, so if no actions are pending on any socket, the application will be put to sleep in select().

    The HTNet module contains a thread scheduler which gives highest priority to the redirected user events, allowing smooth operation of GUI applications with a fast response time. This mode has a major impact on the design of the application, as much of the application code may find itself within call back functions. As an example, this mode is currently used by the Arena client and the Line Mode Browser.

    Passive mode (External Event Loop)
    This mode is intended for applications where user events can not be redirected to a socket, or where there is already an event loop that can not work together with the event loop in the Library. The major difference from the Active mode is that instead of using the event loop defined in the HTEvntrg module, this module is overridden by the application as described in the User's Guide. The Passive mode has the same impact on the application architecture as the Active mode, except for the event loop, as all Library interactions with the application are based on call back functions.
    One important limitation in the libwww thread model is that the behavior is undefined if an external preemptive scheduler is used together with the internal threads in the Library. The reason for this is that the Library is "libwww thread safe" when using one stack and one set of registers, as in Active mode, only when a change of active thread happens as the result of a blocking I/O operation. Using an external thread model, however, this problem does not exist.
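    One pass of an Active-mode style event loop can be sketched with select() as follows (a POSIX-only illustration under assumed names, not the HTEvntrg implementation):

```c
#include <sys/select.h>
#include <unistd.h>

/* One pass of an Active-mode style event loop: wait for any registered
 * socket to become readable, with a timeout so hanging requests can be
 * reaped. Returns the number of ready sockets, 0 on timeout, -1 on error. */
int event_loop_once(const int *sockets, int nsockets, long timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int i, maxfd = -1;

    FD_ZERO(&readfds);
    for (i = 0; i < nsockets; i++) {
        FD_SET(sockets[i], &readfds);
        if (sockets[i] > maxfd) maxfd = sockets[i];
    }
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    /* select() blocks even though the sockets themselves are
     * non-blocking; a return of 0 means the timeout expired. */
    return select(maxfd + 1, &readfds, NULL, NULL, &tv);
}
```

    A real event loop would run this in a `while` loop and dispatch each ready socket to the registered user event handler or to the Library's thread scheduler.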

    Providing Call Back Functions

    The thread model in the Library is foreseen to work with native thread interfaces but can also be used in a non-threaded environment. In the latter case, the Library handles the creation and termination of its internal threads without any interaction required by the application. The thread model is based on call back functions, of which at least one user event handler and an event terminator must be supplied by the application. However, the application is free to register as many additional user event handlers as it wants.

    Callback

    The dashed lines from the event loop to some of the access modules symbolize that the access method is not yet implemented using non-blocking I/O, but the event loop is still a part of the call stack. In this situation the Library will automatically use blocking sockets, which is equivalent to the Base Mode.

    User Event Handlers
    An application can register a set of user event handlers to handle events on sockets defined by the application to contain actions taken by the user. This can for example be interrupting a request, starting a new request, or scrolling a page. However, this requires that the window manager supports redirection of events on sockets.
    Event Termination
    This function is called from the Library every time a request is terminated. It passes the result of the request so that the application can update the history list etc. depending on the result. From the Library's point of view there is little difference between a user event handler and this function, as both are call back functions.
    Timeout Handler
    In Active mode, the select() function in the Library event loop is blocking even though the sockets are non-blocking. This means that if no actions are pending on any of the registered sockets then the application will block in the select() call. However, in order to avoid sockets hanging around forever, a timeout is provided so that hanging threads can be terminated.

    Returning from a Call Back Function

    Often an event handler needs to return information about a change of state as a result of an action executed by the handler, for example if a new request is issued, an ongoing request is interrupted, or the application is to be terminated. This information can be handed back to the Library using the return values of the call back function.

    There are several situations where a thread has to be killed before it has terminated normally. This can either be done internally by the Library or the application. The application indicates that a thread is to be interrupted, for example if the user has requested the operation to stop, by using a specific return value from one of the user event handlers. The Library then kills the thread immediately and the result is returned to the application.



    Post Webs - a Generic Model for Posting on the Web

    The HTTP PUT and POST methods are required features when extending the Web to a fully collaborative tool with features like remote authoring, annotations, update of data bases etc. Many Web applications are currently capable of transferring data from HTML forms to an HTTP server. However, form data is typically a small amount of text-based data, and a more generic mechanism is needed for transmitting an arbitrary data object to any kind of remote server. This document describes how this functionality can be provided by the "Post Web" model and how this model interacts with the user, the application, and the W3C Reference Library. One of the advantages of this model is that it requires no modification, neither to the HTTP/1.0 specification nor to the HTML form definition.

    What is a Post Web?

    A "Post Web" is used as an abstraction mechanism for enabling the user to perform multiple operations (methods) on a data object rendered in multiple representations determined for multiple destinations. This may seem complicated, but the Post Web is in fact a very simple model, as will become clear in the following sections. The purpose of the Post Web is to take a set of common situations from the world of email and news, merge them with the features of HTTP, and put the result into the Web model. This leads to the following set of requirements:
    • A post operation can involve one source and multiple destinations.
    • The source can either be a URL referencing a local or a remote data object, or it can be any object internally managed by the application, for example a memory buffer containing a document created by the user.
    • Any of the destinations can be a URL referencing either a local or a remote data object. The object may or may not exist by the time the posting is initiated.
    • The model must not be limited to use HTTP but should be a generic mechanism for any kind of access scheme supported by the Web model.
    • The model must provide possibility for data format conversion from one media type to another on the fly when the data object is moved from the source to one or more of the destinations.
    • The user must be able to specify a relation between a source and any of the destinations, for example "Written by". This is equivalent to the "<LINK>" element in HTML and the "Link:" header in HTTP and is used to incorporate semantics into the Web topology.
    • It must be possible to specify individual operations used for each destination where an operation can be any non-idempotent operation (or method) defined by HTTP/1.0. For example, if three destinations are specified then one can use PUT, another POST, and the third can use LINK. In the following, post written in lower case refers to any non-idempotent HTTP method whereas POST written in uppercase refers to a specific HTTP method.
    The Post Web model provides a homogeneous interface to a post operation regardless of the destination, the specific method, and the data format used. It describes the full operation, from defining the source and destinations to actually transferring the data over the network. This process involves three players: the user, the application, and the W3C Reference Library. Each of these uses the Post Web model but on different levels of abstraction:
    The user
    To the user, the Post Web is a way of defining a source object and one or more destinations to where the object is to be posted. The model allows the user to describe relations between the source and any of the destinations and also what method should be used.
    The application
    To the application, the Post Web is a set of bindings between a source and any of the destinations describing a request for changing the current Web topology. A binding is described by the link itself, a link relation, the method (operation) to be performed, and if any data format conversion has to be performed.
    The Library
    The Library interprets the Post Web as a set of related requests specifying the access scheme, the operation to be done, the data flow between them, and the data formats in this data flow.
    The following paragraphs describe the three layers of abstraction and how they are interconnected, thus defining the Post Web model.

    The User Builds a Post Web

    For all the possible destinations in a Post Web, the user can specify what method should be applied, any relations between the source and any of the destinations, and whether any data format conversion should be performed. The relations are semantically identical to the HTML "LINK" element and the HTTP "Link" header, and can for example describe authorship, relations to other data objects etc.

    The description of the Post Web model includes a basic example in which a user wants to post the same data object, or variations thereof, to two mailing lists and a news group, and at the same time store the data object on a remote HTTP server. This scenario can be represented graphically as a Post Web consisting of five nodes: one source and four destinations:

    User's View

    This document does not specify the user interface for building a Post Web as this is tightly connected to the platform involved, but obviously it should take advantage of any graphic features etc. Typically a GUI-client could use drag-and-drop icons for building the Web. For example, the Post Web could be visualized using a collection of icons representing commonly used recipients and then let the user drag lines between the data object to be posted and the recipients.

    When the user has finished specifying the source, the destinations, the methods, and any relations between them, the user's version of the Post Web is ready to be submitted and the application can take the information and convert it to a lower abstraction level.

    The Application Generates a Request

    While the description of the user's view of a Post Web is fairly abstract, an actual application must transform the information into a specific representation supported by the Library. To the application, the Post Web is a request for change in the topology of the Web. The application can describe this change using anchor objects, which are the Library's representation of the Web, where each node represents a data object, or a subpart of a data object, that the application has been in contact with while browsing the Web.

    In the figure below, each of the four anchors has a data object and a URL related to it. Any of the addresses or data objects may or may not exist when the Post Web is submitted by the application. If the source does not exist then this results in an error, but if a destination data object already exists, then committing the post operation might result in replacement, deletion, update, or any other outcome, depending on the method applied.

    Application's View

    The Library provides an API for handling anchor objects including how to link the objects together as indicated in the figure above. This is explained in more detail in the User's Guide.

    The Library Serves the Request

    When the application has bound the source anchor to the destination anchors with the appropriate methods and link relations, the Post Web can be handed over to the Library in order to transfer the data object from the source to the destinations. The Library is responsible for handling the actual protocol communication, and hence this part of the Post Web model is the lowest layer of abstraction. Therefore the design goals for this layer of the Post Web is somewhat more technical than the first two layers:
    • Posting to multiple destinations must be compatible with libwww threads and external thread implementations. In the case of libwww threads, it must use non-blocking, interruptible I/O.
    • The Library must be capable of handling concurrent write and read operations to and from the network.
    • There must be no timing requirements that can lead to race conditions between any of the destinations and the source or between destinations.
    • Redirections and access authentication must be handled on both the source side and any of the destinations.
    Internally, the Library represents a Post Web in two different ways: a static and a dynamic binding between the source and the destinations. The static binding is created when the application issues the request, and it exists until all the sub-requests in the Post Web have reached a final state. The dynamic binding depends on the data flow and exists only as long as data is passed through the Post Web. The dynamic binding can be set up and taken down independently of the static binding, and often this happens multiple times during the handling of a request.

    As described in the section "Core Objects ", the HTRequest object is one of the core objects used to describe a request from the application. This object is used in the static binding between the source and the destinations and it is initialized as soon as the request is passed to the Library from the application.

    libwww's Request View

    At this point no information is known about the data object itself, so the static binding only contains information about who the source and the destinations are. The dynamic binding carries information about data format, content length, and other essential metainformation about the object. The dynamic binding is basically a stream chain that is established as this information becomes available from the source server:

    libwww's Stream View

    1. As soon as the source server (which might be the local file system or a remote HTTP server) is ready to accept a request, it is sent off by the Library.
    2. The Library then waits until the source server starts sending back a response. In the mean time, the application can issue other requests, as the model is based on non-blocking I/O.
    3. As soon as data arrives and the data format is identified, the dynamic bindings between the source and the destinations can be set up. The binding is basically a connection between the target of the source request and the input of any of the destination requests. In the case of multiple destinations, T-streams can be added to supply the required number of outgoing data flows.
    4. The destination is now ready for transmitting a request. In the case of HTTP, the destination request can not be transmitted before the full header is known, which is when the meta information from the source data object is parsed.
    5. A response will arrive to each of the destination requests determining whether the posting can continue or not.
    6. When the dynamic binding is established, any data format conversion can be inserted between the target of the source request and the input of any of the destination requests. A converter can be placed either directly at the target or on any of the inputs, so that all destinations can have different renditions of the data object. As the content length will often change when a converter other than a pass-through is used, it may be necessary to insert a content-length counting stream which buffers the data object before it is emitted from the stream.
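    The T-stream mentioned in step 3 can be sketched as follows; for self-containment this toy version copies into two plain buffers rather than into two downstream stream objects:

```c
#include <string.h>

/* A "T" stream copies everything it receives to two targets, so one
 * source request can feed two destination requests. */
typedef struct {
    char *left;   /* first target buffer */
    char *right;  /* second target buffer */
    int pos;      /* bytes written so far */
    int cap;      /* capacity of each target buffer */
} Tee;

/* Duplicate a block of data into both targets; -1 on overflow. */
int tee_put_block(Tee *me, const char *data, int len)
{
    if (me->pos + len > me->cap) return -1;
    memcpy(me->left + me->pos, data, len);
    memcpy(me->right + me->pos, data, len);
    me->pos += len;
    return 0;
}
```

    Cascading several tees yields any required number of outgoing data flows.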

    Updating the Web Topology

    The application can use the result of the operation returned from the Library to either regard the change in the topology of the Web as successful, erroneous, or any degree in between. The application can use this information to for example update any graphical visualization of the part of the Web that the user has traversed.

    The result of posting a data object varies from protocol to protocol. Typically, transaction oriented protocols can provide an immediate result whereas relayed protocols can not. As a general rule in the design of the Library, protocols other than HTTP should be supported but not extended beyond their individual limitations. This means that the Library has to be flexible enough to handle more than one result from a posting transaction, depending on the protocol used. As an example, an immediate result from a post transaction is available using NNTP or HTTP, whereas the result from SMTP might be delayed several days. In practice there is no way that the application can await a response for that amount of time, and the result should therefore be treated as "Accepted" with no guarantee of completeness.

    The Library handles the update of the internal anchor representation of the Web by registering the outcome of each post operation and binding it to the link between the source and the destination. This allows the application to query how two anchors are related and what the outcome was of the operation that caused the link to be established.



    DNS Cache and Host Name Canonicalization

    Excessive communication with remote Domain Name Servers (DNS) can produce a significant time overhead when requesting a document from a remote server, resulting in degraded performance of the application. This is often the case in spite of DNS's own caching, as the lookup still has to cross the network. In order to prevent this, the Library has its own internal memory cache of host names which is updated every time a host name is looked up in the DNS. Once the host name has been resolved into an IP-address, it is stored in the cache. The entry stays in the cache until either an error occurs when connecting to the remote host or it is removed during garbage collection. However, as the information kept in the cache is fairly small, it can contain a large set of elements.

    Multi-homed hosts are treated specially: all available IP-addresses returned from the DNS are stored in the cache. Every time a request is made to the host, the time to connect is measured and a weight function is calculated to indicate how fast the IP-address was. The weight function used is

    Weight function

    where alpha indicates the sensitivity of the function and Delta is the connect time. If one IP-address is not reachable, a penalty of x seconds is added to the weight, where the penalty is a function of the error returned from the "connect" call. The next time a request is initiated to the remote host, the IP-address with the smallest weight is used.
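    The exact formula is given in the figure above; a form consistent with the description - an exponentially weighted average of connect times, with a penalty added on a failed connect - can be sketched as follows (the averaging formula itself is an assumption, not necessarily the Library's):

```c
/* Update a host's weight from a new connect time Delta (seconds).
 * alpha in (0,1] sets how strongly recent measurements dominate.
 * ASSUMPTION: the Library's actual formula is given in its figure;
 * this exponentially weighted average merely matches the prose. */
double update_weight(double old_weight, double delta, double alpha)
{
    return (1.0 - alpha) * old_weight + alpha * delta;
}

/* On a failed connect, add a penalty instead of a measurement. */
double penalize_weight(double old_weight, double penalty_sec)
{
    return old_weight + penalty_sec;
}

/* Pick the index of the IP-address with the smallest weight. */
int best_address(const double *weights, int n)
{
    int i, best = 0;
    for (i = 1; i < n; i++)
        if (weights[i] < weights[best]) best = i;
    return best;
}
```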

    A problem with both the host name cache and the data object cache is detecting when two URLs are equivalent. The only way this can be done internally in the Library is to canonicalize the URLs before they are compared. This has for some time been done by looking at the path segment of the URLs and removing redundant information, converting URLs like

    	foo/./bar/ = foo/redundant/../bar/ = foo/bar/
    
    The method has been optimized and expanded so that host names are also canonicalized. Hence the following URLs are all recognized to be identical:
    	http://www/ = http://www.w3.org:80/ = http://Www.W3.Org/ =
    	http://www.w3.org./ = http://www.w3.org/
    
    However, the canonicalization does not recognize alias host names, which would require that this information is stored in the cache. In order to do this, a separate resolver library must be provided, as this information is normally not returned by the default resolver libraries. Also, these libraries do not support non-blocking sockets, and hence delays can not be avoided when resolving a host name. The solution is of course to write a resolver library which handles these features, and this is under consideration.
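    A sketch of the host-name part of such canonicalization, covering the cases shown above (lower-casing, a trailing dot, and an explicit default port):

```c
#include <ctype.h>
#include <string.h>

/* Canonicalize a host[:port] string in place: lower-case it, drop a
 * trailing dot, and drop an explicit HTTP default port ":80". */
void canonicalize_host(char *host)
{
    size_t len;
    char *p;

    for (p = host; *p; p++)
        *p = (char)tolower((unsigned char)*p);

    len = strlen(host);
    if (len && host[len - 1] == '.')    /* "www.w3.org." -> "www.w3.org" */
        host[--len] = '\0';
    if (len > 3 && strcmp(host + len - 3, ":80") == 0)
        host[len - 3] = '\0';           /* ":80" is the default port */
}
```

    With this, `Www.W3.Org`, `www.w3.org:80`, and `www.w3.org.` all collapse to the same string; alias resolution, as noted, still needs resolver support.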



    Application modules

    Until now we have described the architecture of the Library Core. Even though an application can be built using only the Core and the Utilities, some of the application modules are worth looking at. This section describes the architecture behind the main application modules.



    The Cache Manager

    Caching is a required part of any efficient Internet access application, as it saves bandwidth and improves access performance significantly in almost all types of accesses. This section describes the architecture behind the cache management in the Library. The cache management is intended to be used both as a proxy cache and a client cache, or simply as a cache relay. It does not include the interaction between an application and a proxy server, as this is regarded as an external access and hence outside the scope of the local cache. The basic structure of the cache is illustrated in the figure below.

    Cache Organization

    The figure describes the cache hierarchy from left to right; it does not describe the data flow. Any of the three cache handlers can be left out, in which case a cache request will fall through to the next handler in the hierarchy and finally be passed to the protocol manager, which issues a request to either the origin server, a proxy server, or a gateway. Any of the handlers can also be short circuited by using a set of cache directives which are explained in the User's Guide. In the following, each part will be described in more detail.

    Memory Cache

    The memory cache is completely handled by the application and is only consulted by the Library when servicing a request. It is considered private to a specific instance of an application and is not intended to be shared between instances. Handling the memory cache includes the following tasks: object storage, garbage collection, and object retrieval. The application can initiate a memory cache handler by registering a call back function that is called from within the Library on each request. The details of this registration are described in the User's Guide.

    Traditionally, the memory cache is based on handling the graphic objects described by the HyperDoc object in memory as the user keeps requesting new documents. The HyperDoc object is only declared in the Library - the real definition is left to the application, as it is for the application to handle graphic objects. For example, the Line Mode Browser has its own definition of the HyperDoc object called HText, which describes a fully parsed HTML object with enough information to display itself to the user. However, the memory cache handler can handle objects other than HTML, for example images, audio clips etc. It is important to note that the Library does not impose any limitations on the usage of the memory cache.

    The memory cache must define its own garbage collection algorithm, which can be based on available memory etc. Again, the Line Mode Browser has a very simple memory management policy for how long objects stay in memory: it is determined by a constant in the GridText module, by default set to 5 documents. This approach can be made much more advanced, and the memory garbage collection can be determined by the size of the graphic objects, when they expire, etc., but the API is the same no matter how the garbage collector is implemented.
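    A toy version of such a fixed-size document cache, garbage collecting the oldest entry once more than a constant number of documents are held (mirroring the Line Mode Browser's default of 5, but not its actual code):

```c
#include <stddef.h>

#define CACHE_KEEP 5  /* like the Line Mode Browser's default */

/* A toy memory cache that keeps only the CACHE_KEEP most recently
 * loaded documents; the oldest entry is collected on overflow. */
typedef struct {
    const void *docs[CACHE_KEEP];
    int next;   /* slot to overwrite next */
    int count;  /* how many slots are in use */
} MemCache;

/* Insert a document; returns the evicted document or NULL if none. */
const void *memcache_add(MemCache *c, const void *doc)
{
    const void *evicted =
        (c->count == CACHE_KEEP) ? c->docs[c->next] : NULL;
    c->docs[c->next] = doc;
    c->next = (c->next + 1) % CACHE_KEEP;
    if (c->count < CACHE_KEEP) c->count++;
    return evicted;
}
```

    A smarter collector would weigh object size and expiry instead of pure recency, but the registration API stays the same.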

    Private File Cache

    The private file cache is to be regarded as a direct extension of the memory cache, intended for intermediate-term storage of data objects. Like the memory cache, it is intended to be private to a single instance of an application as long as the instance is running. However, as a file cache is persistent, it can be shared between several instances of various applications as long as exactly one instance owns the private cache at any one time. The single ownership of a private cache means that the cache can be accessed via the local file system by only one instance of an application at a time.

    There are two purposes of the private file cache:

    1. To maintain a persistent cache for applications that do not have a shared cache.
    2. To maintain a private persistent cache for specific groups of documents that are not to be shared with other applications. Examples are documents with the HTTP header Pragma: Private, which will be introduced in HTTP/1.1.
    Often an important difference between the memory cache and the file cache is the format of the data. As mentioned above, in the memory cache the cached objects can be pre-parsed objects ready to be displayed to the user. In a file cache, the data objects are always stored along with their metainformation, so that important header information like Expires, Last-Modified, Language, etc. is a part of the stored object, together with any unknown metainformation that might be part of the object.
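    One simple way to store an object together with its metainformation is to write the headers in front of the body, separated by a blank line, much as they arrive on the wire. This is only a sketch of the idea; the function name and on-disk layout are assumptions, not the Library's actual cache format:

```c
#include <stdio.h>

/* Write a cached entry: the metainformation (HTTP-style headers)
   is stored in front of the body, separated by a blank line, so
   Expires, Last-Modified, etc. survive with the object. */
int cache_write(const char *path,
                const char *headers, const char *body)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;                      /* could not create cache file */
    fprintf(f, "%s\r\n%s", headers, body);
    fclose(f);
    return 0;
}
```

    Reading an entry back is then a matter of parsing headers until the blank line and treating the rest as the body, so unknown metainformation is preserved without the cache having to understand it.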

    Shared File Cache

    A shared file cache which can be accessed by several independent applications requires its own cache manager in order to ensure a consistent cache and to handle garbage collection. A shared file cache can in many ways be regarded as similar to a proxy cache, as a single application does not know when a cached object is discarded or refreshed in the shared cache area.

    If a shared cache manager does exist then the only remaining purpose of a private file cache is to store explicitly private objects. All other objects will be stored in the shared cache.

    As for the private file cache, the data objects are always stored along with their metainformation so that any metainformation associated with an object can be returned to the requesting application.



    Protocol Modules as State machines

    A part of the libwww thread model is keeping track of the current state of the communication interface to the network. As an example, this section describes the current implementation of the HTTP module and how it has been implemented as a state machine. The HTTP module is based on the HTTP/1.0 specification but is backwards compatible with version 0.9. The major difference from the implementation before version 3.0 of the Library is that this version is a state machine based on the state diagram illustrated below. This implementation has several advantages even though the HTTP protocol is stateless by nature.

    The individual states and the transitions between them are explained in the following sections.

    BEGIN State
    This state is the idle state or initial state where the HTTP module awaits a new request passed from the application.
    NEED_CONNECTION State
    The HTTP module is now ready for setting up a connection to the remote host. The connection is always initiated by a connect system call. In order to minimize the access to the Domain Name Server, the host names of all previously visited hosts are stored in a local host cache, as explained in section "DNS Cache and Host Name Canonicalization". The cache handles multihomed hosts in a special way in that it measures the time it takes to actually make a connection to one of the IP addresses. This time is stored together with the specific IP address and the host name in the cache, and on the next connection to the same host the IP address with the fastest connect time is chosen.
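    The fastest-address selection can be sketched as follows. The structure and function names are invented for illustration (the real cache lives in the Library's DNS module), and the preference for as-yet-unmeasured addresses is an assumption of this sketch:

```c
/* Host cache entry for one address of a multihomed host: the IP
   address is stored with the connect time measured on a previous
   visit.  A negative time means the address has never been tried. */
typedef struct {
    const char *ip;
    double connect_time;   /* seconds; < 0 if never measured */
} HostAddr;

/* Pick the index of the address with the fastest recorded connect
   time, trying unmeasured addresses first so each one gets timed. */
int host_pick_fastest(const HostAddr *addrs, int n)
{
    for (int i = 0; i < n; i++)      /* prefer untried addresses */
        if (addrs[i].connect_time < 0)
            return i;
    int best = 0;
    for (int i = 1; i < n; i++)
        if (addrs[i].connect_time < addrs[best].connect_time)
            best = i;
    return best;
}
```

    After each successful connect, the measured time would be written back into the entry so later requests keep choosing the fastest route.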
    NEED_REQUEST State
    The HTTP request is what the application sends to the remote HTTP server just after the connection is established. The request consists of an HTTP header line, a set of HTTP headers, and possibly a data object to be posted to the server. The header line has the following format:
    	<METHOD> <URI> <HTTP-VERSION> CRLF
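    For example, the header line for a GET request could be composed like this. The function name is hypothetical; the Library composes the real request internally:

```c
#include <stdio.h>

/* Build the HTTP header line: <METHOD> <URI> <HTTP-VERSION> CRLF.
   Returns the number of characters written (excluding the NUL). */
int http_request_line(char *buf, size_t len,
                      const char *method, const char *uri)
{
    return snprintf(buf, len, "%s %s HTTP/1.0\r\n", method, uri);
}
```

    Calling it with method "GET" and URI "/pub/WWW/TheProject.html" yields the line "GET /pub/WWW/TheProject.html HTTP/1.0" terminated by CRLF.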
    
    SENT_REQUEST State
    When the request is sent, the module waits until a response is given from the server or the connection times out in case of an error. As the module does not know whether the remote server is an HTTP/0.9 server or an HTTP/1.0 server, it must look at the first part of the response to figure out which version of HTTP is returned. The reason is that the HTTP/0.9 protocol does not contain an HTTP header line in the response; it simply starts sending the requested data object as soon as the GET request is handled.
    NEED_ACCESS_AUTHORIZATION State
    If a 401 Unauthorized status code is returned, the module asks the user for a user ID and a password; see also the "HTTP Basic Access Authorization Scheme". The connection is closed before the user is asked for the user ID and password, so any new request initiated upon a 401 status code causes a new connection to be established. This is done in order to avoid having the connection hang around while the application is waiting for user input.
    REDIRECTION State
    The remote server returns a redirection status code if the URI has been moved either temporarily or permanently to another location, possibly on another HTTP server or any other service, for example FTP or Gopher. The HTTP module supports both a temporary and a permanent redirection code returned from the server:
    301 Moved
    The load procedure is recursively called on a 301 redirection code. The new URI is passed back to the user as information via the Error and Information module, and a new request is generated. The new request can use any access scheme accepted in a URI. An upper limit on redirections has been defined (defaulting to 10) in order to avoid infinite loops.
    302 Found
    The functionality is the same as for a 301 Moved return status. A clever application can use the returned URI to change the document in which the URI originates so that the URI points to the new location.
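    The redirection limit described above can be sketched as a simple guarded loop. For clarity this sketch walks a precomputed chain of status codes rather than re-issuing real requests; the function name is invented, and only the default limit of 10 comes from the text:

```c
#define HTTP_MAX_REDIRECTIONS 10   /* documented default limit */

/* Walk a chain of response status codes, following 301/302 up to
   the limit.  Returns the final status, or -1 if too many
   redirections occur (the infinite-loop guard). */
int http_follow(const int *statuses, int n)
{
    int redirections = 0;
    for (int i = 0; i < n; i++) {
        if (statuses[i] != 301 && statuses[i] != 302)
            return statuses[i];          /* final answer */
        if (++redirections > HTTP_MAX_REDIRECTIONS)
            return -1;                   /* abort: redirection loop */
    }
    return -1;
}
```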
    NO_DATA State
    When a return code indicates that no data object or resource follows the HTTP headers the HTTP module can terminate the request and pass control back to the application.
    NEED_BODY State
    If a body is included in the response from the server, the module must prepare to read the data from the network and direct it to the destination set up by the application. This is done by setting up a stream stack with the required conversions.
    GOT_DATA State
    When the data object has been parsed through the stream stack, the HTTP module terminates the request and hands control back to the application.
    ERROR or FAILURE State
    If at any point in the request handling a fatal error occurs, the request is aborted and the connection closed. All information about the error is passed back to the application via the Error and Information Module. As the HTTP protocol is stateless, all errors between the client and the server are fatal. If the erroneous request is to be repeated, the request starts in the initial state.
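    The states above can be condensed into an enum plus a transition function. The identifiers here are illustrative, not the Library's actual symbols, and only the transition out of the SENT_REQUEST state is shown:

```c
/* A condensed sketch of the HTTP module's state machine. */
typedef enum {
    HTTP_BEGIN, HTTP_NEED_CONNECTION, HTTP_NEED_REQUEST,
    HTTP_SENT_REQUEST, HTTP_NEED_ACCESS_AUTHORIZATION,
    HTTP_REDIRECTION, HTTP_NO_DATA, HTTP_NEED_BODY,
    HTTP_GOT_DATA, HTTP_ERROR
} HTTPState;

/* The transition taken in SENT_REQUEST once the status code of the
   server's response has been read. */
HTTPState http_after_response(int status)
{
    if (status == 401)
        return HTTP_NEED_ACCESS_AUTHORIZATION;
    if (status == 301 || status == 302)
        return HTTP_REDIRECTION;
    if (status == 204 || status == 304)
        return HTTP_NO_DATA;             /* no body follows */
    if (status >= 200 && status < 300)
        return HTTP_NEED_BODY;           /* set up the stream stack */
    return HTTP_ERROR;
}
```

    Keeping each transition in one place like this is what lets the module suspend a request at any state and resume it later, which is the point of the libwww thread model.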

