Site Valet Daemon

The Valet Daemon and Agents

The Site Valet Daemon, valetd, is a control program. It runs a set of individual Agent programs as required, and facilitates communication between them.

valetd is $VALET_BASE/sbin/valetd. The agents it controls are in $VALET_BASE/libexec.

valetd

valetd is the Site Valet master daemon. It serves to start up the various Agents as required, and to signal or restart agents as appropriate when new data are available to process.

valetd should be started at system boot time, but late in the boot sequence, after all networking services are up and running.

The Valet Agents

The Valet Agents work according to a common pattern:
  1. Read a list of URLs ready for processing by the agent from the database.
  2. Process the URLs as a batch.
  3. Write the results to the database.

This approach helps to minimise the load on the database, but means that data may be lost if an agent is terminated unexpectedly. Under normal shutdown, including the TERM signal, results are flushed to the database before exit. Note that loss of unflushed data is not a cause for concern, as any work lost will simply be redone next time the agent runs.
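
The read, process and write cycle above can be sketched as follows. This is a minimal illustration and not the Valet source: BatchAgent, MemoryStore and their methods are hypothetical names, and an in-memory list stands in for the database. It shows why an unflushed batch is harmless: URLs stay marked as ready until their results are written.

```python
import signal


class MemoryStore:
    """Hypothetical stand-in for the database."""

    def __init__(self, urls):
        self.ready = list(urls)   # URLs waiting to be processed
        self.written = []         # results that have been flushed

    def read_ready_urls(self):
        return list(self.ready)

    def write_results(self, results):
        self.written.extend(results)
        done = {url for url, _ in results}
        # A URL stops being "ready" only once its result is stored.
        self.ready = [u for u in self.ready if u not in done]


class BatchAgent:
    """Hypothetical agent skeleton following the three-step pattern."""

    def __init__(self, store):
        self.store = store
        self.results = []         # held in memory until flushed
        self.stopping = False

    def handle_term(self, signum, frame):
        # Normal shutdown: stop taking new URLs, then flush.
        self.stopping = True

    def process(self, url):
        # Placeholder for the agent-specific work.
        return (url, "done")

    def flush(self):
        self.store.write_results(self.results)
        self.results = []

    def run_once(self):
        for url in self.store.read_ready_urls():      # step 1
            if self.stopping:
                break
            self.results.append(self.process(url))    # step 2
        self.flush()                                  # step 3


def install_term_handler(agent):
    # Route the TERM signal to the agent so results are flushed before exit.
    signal.signal(signal.SIGTERM, agent.handle_term)
```

If the process is killed before flush() runs, the store is unchanged, so the same URLs are read again on the next run.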

Spidering Agents

Spidering Agents are those which make HTTP requests to webservers, and which may make more than one request to a server. If they simply processed all their URLs in turn, they would subject servers to rapid-fire (many HTTP requests in quick succession), which is not acceptable robot behaviour. It could also be inefficient.

They therefore introduce an additional layer into the above pattern. They process one URL per target server as quickly as possible, but no server is revisited immediately: an agent always waits for a time Robot.poll (specified in the general configuration; default 60 seconds) before making the next request to the same server.

The actual order and timing in which URLs are processed is indeterminate, and depends primarily on the speed of responses from the various servers concerned.
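
One way to picture the extra scheduling layer is a function that selects, for each round, at most one URL per server whose poll interval has elapsed. This is a sketch under assumptions, not the agents' actual scheduler: pick_round and its arguments are hypothetical, and only the 60 second default is taken from the description of Robot.poll above.

```python
from urllib.parse import urlsplit

ROBOT_POLL = 60.0  # seconds; the documented default for Robot.poll


def pick_round(pending, last_visit, now, poll=ROBOT_POLL):
    """Choose at most one URL per server whose poll interval has elapsed.

    pending    -- URLs waiting to be requested, in any order
    last_visit -- dict mapping host name to the time of its last request
    now        -- current time, in the same units as last_visit
    """
    chosen = []
    seen_hosts = set()
    for url in pending:
        host = urlsplit(url).netloc.lower()
        if host in seen_hosts:
            continue  # already have one URL for this server in this round
        if now - last_visit.get(host, float("-inf")) < poll:
            continue  # server was visited too recently
        seen_hosts.add(host)
        chosen.append(url)
    return chosen
```

URLs that are skipped stay in the pending list and become eligible again once the interval for their server has passed.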

HeadAgent

HeadAgent checks HTTP information for start URLs specified in the domains configuration, and for all URLs that are the target of any link from a page at a site being monitored, except where prohibited by robot rules.

This determines, among other things, the MIME Type and last-modified date of each URL visited, which are used to determine when URLs should be fetched for analysis.
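
As an illustration of how a last-modified date might feed that decision, the sketch below compares an HTTP Last-Modified header with the time a page was last fetched. The function needs_fetch and its fallback rules (fetch when no usable date is given) are assumptions made for this example; the source does not describe the exact logic used.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_fetch(last_modified_header: Optional[str],
                last_fetched: Optional[datetime]) -> bool:
    """Return True if the page should be fetched (again) for analysis."""
    if last_fetched is None or not last_modified_header:
        return True  # never fetched, or the server supplied no date
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True  # unparseable date: assume the page may have changed
    if modified.tzinfo is None:
        # Treat a date without zone information as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_fetched
```

Here last_fetched is expected to be a timezone-aware datetime.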

After flushing to the database, HeadAgent notifies valetd that new data may be available for GetAgent.

GetAgent

GetAgent fetches pages from the domains monitored, where these are of a MIME Type of interest. If a page has previously been analysed and has not been updated, GetAgent simply updates the last checked timestamp without fetching it. GetAgent also records a hash of each page fetched, to determine when the page contents have changed in cases where that information is not provided in the HTTP headers.
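
The hash comparison can be sketched as below. The source does not say which hash function GetAgent uses; SHA-256 from the Python standard library is an assumption chosen for the example, and page_digest and has_changed are hypothetical names.

```python
import hashlib
from typing import Optional


def page_digest(body: bytes) -> str:
    """Return a hex digest of the page contents (SHA-256 assumed)."""
    return hashlib.sha256(body).hexdigest()


def has_changed(body: bytes, stored_digest: Optional[str]) -> bool:
    """True if the page is new or differs from the stored digest."""
    return stored_digest is None or page_digest(body) != stored_digest
```

Storing only the digest lets a change be detected without keeping a copy of the previous page body.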

After flushing to the database, GetAgent notifies valetd that new data may be available for ParseAgent.

Spider Support Agents

Spider support agents are those that are not concerned with polling HTTP servers and the associated scheduling issues, but which must nevertheless process data promptly whenever they become available, as the spidering agents depend on their results.

These agents may be signalled or (re)started at any time by valetd, and for speed and efficiency do not exit for some time after a batch of processing is completed.

ParseAgent

ParseAgent parses HTML and XHTML documents for links and metadata.
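
A rough idea of this kind of extraction, using the Python standard library rather than ParseAgent's own parser, is shown below. LinkCollector is a hypothetical class; it collects only a href links and named meta elements, which is an assumption about what "links and metadata" covers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets and named <meta> values."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML-style empty tags such as <meta ... />.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
```

Usage: create a LinkCollector with the page URL, call feed() with the document text, then read the links and meta attributes.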

After flushing to the database, ParseAgent notifies valetd that new data may be available for HeadAgent and RobotAgent.

RobotAgent

RobotAgent fetches and parses robots.txt files that control what URLs a server permits Valet to access.
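
The effect of a robots.txt file on what may be requested can be illustrated with the Python standard library's urllib.robotparser. This is not RobotAgent's implementation; build_rules is a hypothetical helper, and the user agent name "ExampleBot" is a placeholder, since the name Valet announces is not given here.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Example rule set: everything is allowed except paths under /private/.
EXAMPLE_RULES = build_rules("User-agent: *\nDisallow: /private/\n")
```

A caller would then ask rules.can_fetch("ExampleBot", url) before requesting a URL.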

After flushing to the database, RobotAgent notifies valetd that new data may be available for HeadAgent and GetAgent.
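
The notifications described for the four agents above form a small graph, summarised here as a lookup table. The agent names and edges come from this document; the NOTIFIES table and agents_to_wake function are hypothetical and say nothing about how valetd actually delivers the signals.

```python
# Which agents may have new data once a given agent has flushed.
NOTIFIES = {
    "HeadAgent": ["GetAgent"],
    "GetAgent": ["ParseAgent"],
    "ParseAgent": ["HeadAgent", "RobotAgent"],
    "RobotAgent": ["HeadAgent", "GetAgent"],
}


def agents_to_wake(finished_agent):
    """Return the agents to signal or restart after finished_agent flushes."""
    return NOTIFIES.get(finished_agent, [])
```

Agents that send no notifications, such as the transient agents, simply have no entry in the table.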

Transient Agents

Transient agents run infrequently and exit on completion. Nothing time-sensitive depends on them, so there is no requirement for communication other than start/stop with valetd.

AccessAgent

AccessAgent performs automatic accessibility analysis on new HTML and XHTML documents. It records an entry in the audit trail under the name htnorm.

HValAgent

HValAgent validates documents of HTML and SGML MIME Types.

XValAgent

XValAgent validates documents of XML and XHTML MIME Types.