Web crawler

From IAB Wiki

A web crawler (also known as an automatic indexer, bot, Web spider, or Web robot) is a software program that visits Web pages in a methodical, automated manner.

This process is called Web crawling or spidering, and the resulting data is used for various purposes, including building indexes for search engines, validating that ads are being displayed in the appropriate context, and detecting malicious code on compromised web servers.

Many web crawlers politely identify themselves via their user-agent string, which provides a reliable way of excluding a significant amount of non-human traffic from advertising metrics. The IAB (in conjunction with ABCe) maintains a list of known user-agent strings, the Spiders and Bots list. However, web crawlers that look for malicious code must often pose as human traffic, so detecting them requires secondary, behavioral filtering.
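
As a rough illustration of user-agent filtering, the Python sketch below drops self-identified crawlers from a set of impression records before counting. The token list, the record layout, and the function names are all assumptions made for this example; the actual Spiders and Bots list is far longer and is not reproduced here.

```python
# Hypothetical, deliberately tiny token list for illustration only.
# It is NOT the IAB/ABCe Spiders and Bots list.
KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "crawler", "spider")


def is_declared_bot(user_agent: str) -> bool:
    """Return True if the user-agent string contains a known bot token."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)


def count_human_impressions(impressions: list[dict]) -> int:
    """Count impressions whose user-agent does not self-identify as a bot.

    Each impression is assumed to be a dict with a "user_agent" key.
    """
    return sum(
        1 for imp in impressions
        if not is_declared_bot(imp.get("user_agent", ""))
    )


# Made-up sample records: one declared crawler, one browser.
sample = [
    {"user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"},
]
human_count = count_human_impressions(sample)
```

A filter of this kind only catches crawlers that declare themselves; a crawler sending a browser-like user-agent string passes through and has to be caught by behavioral filtering.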

Most web crawlers respect a file called robots.txt, hosted in the root of a web site. This file tells the crawler which directories should and shouldn't be indexed, but it does not enforce any actual access restrictions.
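
To show how a well-behaved crawler might consult robots.txt before fetching a page, here is a minimal Python sketch using the standard library's urllib.robotparser module. The robots.txt content, the "ExampleBot" name, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the root of the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The crawler asks before each request; the answer is advisory only.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
```

Nothing here stops a crawler from requesting the disallowed URL anyway; the check is purely voluntary, which is why robots.txt cannot serve as an access control.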

Technically, a web crawler is a specific type of bot, or software agent.

See bot and intelligent agents.
