Box UK - be creative. be innovative. be bold.

Intelligent information sharing across the web

Several methods for sharing resources and information across the web are currently evolving. Metadata allows information on remote sites to be described using global standards, so that it can be harvested automatically and intelligently. In particular, harvesting from remote sites can be carried out in one of three ways:

  1. HTML Spider – a typical search-engine-like application that downloads standard HTML web pages from external sites, parses them for content and metadata (implicit or explicit, such as meta-tags), and records the results locally.

    Pros – Any site can join the system, no extra development needed.

    Cons – Implicit metadata is inaccurate; explicit metadata has no set structure and is limited.

  2. Web Services – each site participating in a search implements a common API that can be queried over HTTP with SOAP (or some other message-exchange language). On request, the external site queries its own internal database and returns the results in a specific format. The central ‘search’ engine can then collate results from multiple external sites and present the combined results to the user.

    Pros – Each site retains control over each search.  Searches are farmed out to each site on a per-request basis, therefore results/content is always up-to-date.

    Cons – Difficult to implement, slow results for end-user.

  3. RDF Spider – a similar ‘spider’ application to that used in search engines, but one that only downloads and parses RDF files; these files contain explicit metadata descriptions about the resources at each site.

    Pros – Explicit metadata, which can be structured without being limited in scope.

    Cons – Each participating site must produce an RDF file.

Box UK has experience of implementing each type of solution. For the Virtual Teacher Centre (VTC), a best-of-both-worlds solution was developed, with a spider that could collect and parse both standard HTML files and RDF metadata files.

A recent project - Port Cities - consists of five geographically dispersed museums and archives implementing separate Content Management solutions. Box UK was responsible for three of the sites, together with the main umbrella site. Amaxus was deployed at each of the three Box UK sites, where it automatically publishes an RDF file. The remaining two sites were given the metadata specification (Dublin Core), together with simple RDF implementation guidelines and examples.

The umbrella site utilizes a spider for the periodic collection and parsing of each partner RDF file, facilitating a fast and accurate metadata-based central search.
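The parsing step can be sketched with the XML parser that ships with Java. This is a minimal illustration, not Box UK's implementation: the class name `DublinCoreRdfReader`, the method `readTitles` and the sample document are hypothetical, and it assumes each partner file uses the common `rdf:Description` layout with a `dc:title` child element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads Dublin Core titles out of an RDF/XML document.
public class DublinCoreRdfReader {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns a map of resource URI to dc:title for every rdf:Description that has a title. */
    public static Map<String, String> readTitles(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            // Partner files come from remote sites, so enable the parser's secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> titles = new LinkedHashMap<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                String about = description.getAttributeNS(RDF_NS, "about");
                NodeList titleNodes = description.getElementsByTagNameNS(DC_NS, "title");
                if (titleNodes.getLength() > 0) {
                    titles.put(about, titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative document only; the URI and title are made up.
        String sample =
            "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\" xmlns:dc=\"" + DC_NS + "\">"
            + "<rdf:Description rdf:about=\"http://example.org/collection/1\">"
            + "<dc:title>Example title</dc:title>"
            + "</rdf:Description>"
            + "</rdf:RDF>";
        System.out.println(readTitles(sample));
    }
}
```

A full spider would read the other Dublin Core elements in the same way and store them in the central search database.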

Standards Compliance

XML, v 1.0   (http://www.w3.org/TR/REC-xml)
RDF    (http://www.w3.org/TR/REC-rdf-syntax/)
Dublin Core   (http://dublincore.org/usage/terms/dc/current-elements/)
Dublin Core RDF (http://dublincore.org/documents/2002/07/31/dcmes-xml/)

Interoperability

Box UK’s Content Management System, Amaxus, uses XML as a native data exchange format. The system can automatically produce RDF files in XML format. The metadata fields inside the RDF files are Dublin Core by default, but can also follow any other metadata specification (e.g. IMS LOM).

Authentication

Web Service based applications can utilize key transaction authentication, or secure HTTP for privacy. RDF and HTML spiders normally use little or no authentication or security, as the information is public. Secure HTTP transactions or encrypted RDF files could be used if deemed necessary.

Application Architecture

Box UK’s HTML and RDF spider is a multi-threaded Java application.  Standard web requests are made when collecting content from partner sites, i.e. HTTP over port 80.  Web service applications can be configured to use a variety of protocols, but usually employ HTTP over port 80.

Customisation Potential

The spider can be administered through a web interface.

Application Accessibility

The spider application will run on a single central server, but the HTML administration page for the spider can be accessed from any compatible web browser (given the correct security details).

Application Performance

The spider is multithreaded and can make multiple simultaneous connections to different web sites.  Hence, a slow web site will have minimal impact on the overall time it takes the spider to run as it can continue spidering other sites whilst waiting for the slow one to respond.
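The effect of simultaneous connections can be sketched with a fixed-size thread pool from `java.util.concurrent`. This is only an illustration under stated assumptions: the class `ConcurrentSpider`, the method `fetchAll` and the `fetcher` parameter (a stand-in for the real HTTP download) are hypothetical names, not part of the actual spider.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: download several sites at once so a slow site does not block the rest.
public class ConcurrentSpider {

    /**
     * Fetches every site using at most maxConcurrentSites worker threads.
     * Sites whose fetch fails are left out of the returned map.
     */
    public static Map<String, String> fetchAll(List<String> siteUrls,
                                               int maxConcurrentSites,
                                               Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentSites);
        try {
            // Submit every site up front; the pool runs as many at once as it has threads.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : siteUrls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.put(url, pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException failedSite) {
                    // The fetch threw an exception: skip this site and carry on with the others.
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

While one worker thread waits on a slow site, the remaining threads keep working through the other sites, which is why a single slow site adds little to the overall run time.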

The spider makes use of a database connection pool to minimise the overhead associated with creating and closing database connections.  In addition, the spider may have several connections in use at any time and so does not need to wait for one site's data to be entered into the database before starting to insert another's.
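The idea behind a connection pool can be shown with a small generic pool built on a blocking queue. This is a simplified sketch, not the pool the spider uses: `SimplePool` is a hypothetical class, and a real deployment would normally pool JDBC connections through an established pooling library.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch: create a fixed number of resources once, then lend them out repeatedly.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    /** Creates all pooled resources up front, so later borrowers pay no creation cost. */
    public SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Takes an idle resource, blocking until one is released if all are in use. */
    public T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled resource", e);
        }
    }

    /** Returns a borrowed resource to the pool so another thread can reuse it. */
    public void release(T resource) {
        idle.offer(resource);
    }

    /** Number of resources currently idle. */
    public int available() {
        return idle.size();
    }
}
```

Because several resources can be on loan at once, one thread can insert data for one site while another thread inserts data for a different site.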

The spider makes the best use of whatever bandwidth is available to it and will still run over a slow internet connection.

The spider is I/O bound and is not particularly CPU intensive. The bulk of the work performed by the spider involves downloading web pages and extracting the hyperlinks they contain. The time taken to extract the hyperlinks from a web page is generally negligible, so the speed of the internet connection available to the spider has a direct influence on the time it takes to run.
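Hyperlink extraction of the kind described can be approximated with a regular expression and `java.net.URI`. This is a rough sketch with hypothetical names (`LinkExtractor`, `extractLinks`); it only handles quoted `href` attributes on `<a>` tags, and a production spider would more likely use a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull absolute link targets out of a downloaded HTML page.
public class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the distinct links found in html, resolved against baseUrl, in page order. */
    public static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String href = matcher.group(2).trim();
            String lower = href.toLowerCase(Locale.ROOT);
            // Skip targets that do not lead to another page.
            if (href.isEmpty() || href.startsWith("#")
                    || lower.startsWith("javascript:") || lower.startsWith("mailto:")) {
                continue;
            }
            try {
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException malformed) {
                // Ignore links that are not valid URIs.
            }
        }
        return new ArrayList<>(links);
    }
}
```

The matching itself is cheap compared with the time spent waiting on the network, which is consistent with the spider being I/O bound.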

The resources consumed and the quantity of data brought back by the spider can be controlled by specifying the total number of sites to spider, the number of sites to spider simultaneously and the number of web pages to bring back from each site.
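These three limits can be pictured as a small settings object. The sketch below is illustrative only: `SpiderLimits`, its field names and the `linksOf` parameter (a stand-in for downloading a page and extracting its links) are assumptions, and the real spider's settings are managed through its web interface.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: the three limits that bound how much work the spider does.
public class SpiderLimits {
    public final int maxSites;           // total number of sites to spider
    public final int maxConcurrentSites; // sites spidered at the same time (thread pool size)
    public final int maxPagesPerSite;    // pages brought back from each site

    public SpiderLimits(int maxSites, int maxConcurrentSites, int maxPagesPerSite) {
        this.maxSites = maxSites;
        this.maxConcurrentSites = maxConcurrentSites;
        this.maxPagesPerSite = maxPagesPerSite;
    }

    /** Keeps only the first maxSites candidate sites. */
    public List<String> selectSites(List<String> candidates) {
        return new ArrayList<>(candidates.subList(0, Math.min(maxSites, candidates.size())));
    }

    /** Breadth-first crawl of one site that stops after maxPagesPerSite pages. */
    public List<String> crawlSite(String startUrl, Function<String, List<String>> linksOf) {
        List<String> fetched = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty() && fetched.size() < maxPagesPerSite) {
            String url = queue.poll();
            fetched.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) { // add() is false for pages already queued or fetched
                    queue.add(link);
                }
            }
        }
        return fetched;
    }
}
```

In this sketch `maxConcurrentSites` is only carried as a setting; it would be passed to whatever thread pool runs the per-site crawls.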


Glossary

RDF – Resource Description Framework
XML – Extensible Markup Language
Java – The programming language in which the spider is written.
Amaxus – XML Content Management System
HTML – HyperText Markup Language
Dublin Core – A recommended set of fields for describing a resource.
Metadata – Structured data about data.
CMS – Content Management System
Web Services – APIs that can be queried over HTTP with SOAP (or some other message-exchange language).

About This Page

Published: 13th Mar 2003
Tech: XML
Tech: Java