Resource Crawling

How can information available across multiple services be effectively queried?

Problem

The service consumer of a single service can rely on specific service capabilities to query relevant data, but data distributed across multiple services can be difficult to consistently query.

Solution

Develop program logic to "crawl" through the services within a service inventory by invoking fetch capabilities and following links from resource to resource. Collate the retrieved data in a centralized indexing service to support queries.
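
The crawl-and-collate idea can be sketched in a few lines. The following is a minimal illustration, not a definitive implementation: the in-memory INVENTORY, the resource identifiers, and the fetch, crawl, and query functions are all hypothetical stand-ins for real services and a real indexing service.

```python
from collections import deque

# Hypothetical stand-in for a service inventory: each resource identifier
# maps to a representation holding some text and links to other resources.
INVENTORY = {
    "/customers/1": {"text": "Alice gold customer", "links": ["/orders/7"]},
    "/orders/7": {"text": "order for gold widget",
                  "links": ["/products/3", "/customers/1"]},
    "/products/3": {"text": "widget", "links": []},
}


def fetch(uri):
    """Stand-in for a read-only fetch capability; returns None if unknown."""
    return INVENTORY.get(uri)


def crawl(seeds, fetch=fetch):
    """Breadth-first crawl from the seeds; returns a term -> set of URIs index."""
    index = {}
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        uri = queue.popleft()
        resource = fetch(uri)
        if resource is None:
            continue
        # Collate the retrieved data into the central index.
        for term in resource["text"].lower().split():
            index.setdefault(term, set()).add(uri)
        # Follow links from resource to resource, visiting each only once.
        for link in resource["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def query(index, term):
    """Answer a query from the index alone, without contacting any service."""
    return sorted(index.get(term.lower(), set()))
```

Calling crawl(["/customers/1"]) reaches all three resources by following links, after which query answers lookups from the index without touching the services again.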

Application

This pattern is typically applied together with Entity Linking [SDP], Reusable Contract [SDP], and Lightweight Endpoint [SDP] to create a system of navigable resources.

Fetch methods included in the reusable contract are typically defined as "safe", so services can guarantee that these fetches are read-only and have no unanticipated side effects.
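
One way a crawler might enforce this is to refuse any request whose method is not declared safe. The sketch below assumes the reusable contract is HTTP, where GET, HEAD, OPTIONS, and TRACE are the methods defined as safe. The crawler_request function and its send callback are hypothetical names introduced for illustration.

```python
# Assumption: the reusable contract is HTTP, which defines these methods as safe.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE"})


def crawler_request(method, uri, send):
    """Forward a request through `send` only if its method is declared safe."""
    method = method.upper()
    if method not in SAFE_METHODS:
        raise ValueError(f"crawler refuses unsafe method {method} on {uri}")
    return send(method, uri)
```

Placing the check in the crawler means a bug in link handling cannot accidentally trigger a state-changing request against a service.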

Crawling occurs only over resources the indexing service considers relevant to its consumers. The crawling activity is seeded with known resource identifiers so that it can locate all relevant resources. Services may provide explicit lists of resources that should be included in the crawl to ensure adequate indexing coverage. Services may also publish information that excludes particular sets of resources from being indexed and returned by queries.
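
The interplay of seeds, service-provided listings, and exclusions can be sketched as follows. This is an illustration under stated assumptions: exclusions are modeled as identifier prefixes (the pattern does not prescribe a format), and build_frontier and is_excluded are hypothetical names.

```python
def is_excluded(uri, excluded_prefixes):
    """Assumption: a service expresses exclusions as identifier prefixes."""
    return any(uri.startswith(prefix) for prefix in excluded_prefixes)


def build_frontier(seeds, listed, excluded_prefixes):
    """Merge seeds with service-provided listings, dropping duplicates and
    excluded resources, and preserving the order in which they were given."""
    frontier = []
    seen = set()
    for uri in list(seeds) + list(listed):
        if uri in seen or is_excluded(uri, excluded_prefixes):
            continue
        seen.add(uri)
        frontier.append(uri)
    return frontier
```

The same is_excluded check would also need to be applied to links discovered during the crawl, so that an excluded resource is not reached indirectly.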

Impacts

Service consumers are able to go to a single indexing or search service to locate information about any resource in a service inventory. Services are decoupled from query processing load and implementation details.

Indexed information may not always be current and will therefore need to be periodically refreshed. Fetches used to index resource data can place additional load on services.
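
The trade-off between freshness and load can be made concrete with a small refresh scheduler. The sketch below is one possible approach, not a prescribed one: IndexEntry, due_for_refresh, max_age, and batch_size are hypothetical names, and timestamps are plain numbers of seconds.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    uri: str
    fetched_at: float  # time of the last successful fetch, in seconds


def due_for_refresh(entries, now, max_age, batch_size):
    """Return up to `batch_size` URIs whose index entry is at least `max_age`
    seconds old, oldest first, so each refresh cycle adds bounded load."""
    stale = [e for e in entries if now - e.fetched_at >= max_age]
    stale.sort(key=lambda e: e.fetched_at)
    return [e.uri for e in stale[:batch_size]]
```

A smaller max_age keeps the index more current at the cost of more fetches, while batch_size caps how many fetches a single cycle sends to the services.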

Data with security requirements must be treated as such by the indexing service, or must be excluded from the scope of the crawl.
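
As a rough illustration of the first option, the indexing service could record an access requirement alongside each indexed resource and filter query results accordingly. The role-based model below is an assumption made for the example. The pattern does not prescribe a security mechanism, and visible_results is a hypothetical name.

```python
def visible_results(entries, consumer_roles):
    """Filter (uri, required_role) pairs down to the URIs this consumer may see.

    A required_role of None marks a resource with no security requirement.
    """
    roles = set(consumer_roles)
    return [uri for uri, required in entries
            if required is None or required in roles]
```

The second option, exclusion from the crawl, avoids holding protected data in the index at all and is the simpler choice when the indexing service cannot enforce the same rules as the originating service.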

Crawling techniques can also be used to pre-cache information that a consumer is likely to need next after processing the current resource. This can reduce the latency consumers experience for safe requests.
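
A minimal sketch of such pre-caching follows, assuming each fetched resource is a dictionary with an optional "links" list. The get_with_prefetch function, the fetch callback, and the cache dictionary are hypothetical. For simplicity the linked resources are fetched synchronously, whereas a real consumer would more likely fetch them in the background.

```python
def get_with_prefetch(uri, fetch, cache):
    """Return the resource at `uri`, then pre-cache the resources it links to.

    Assumption: `fetch` is a safe, read-only call, so fetching a linked
    resource that is never used has no side effects beyond extra load.
    """
    resource = cache.get(uri)
    if resource is None:
        resource = fetch(uri)
        cache[uri] = resource
    for link in resource.get("links", []):
        if link not in cache:
            cache[link] = fetch(link)
    return resource
```

When the consumer later requests one of the linked resources, it is served from the cache without another call to the service.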

Architecture

Inventory, Composition, Service

Status

Under Review

Contributors

Raj Balasubramanian, Benjamin Carlyle, Cesare Pautasso

Crawling is a common technique to support loosely coupled indexing of resources on the Web. Search engines in particular periodically fetch resources they consider relevant and index them for inclusion in search results.

Related Patterns in This Catalog

Entity Linking, Lightweight Endpoint, Reusable Contract

Related Service-Oriented Computing Goals

Increased Organizational Agility, Reduced IT Burden