Web crawler


Latest revision as of 11:20, 25 September 2012

A web crawler (also known as an automatic indexer, bot, Web spider, or Web robot) is a software program which visits Web pages in a methodical, automated manner.

This process is called Web crawling or spidering, and the resulting data is used for various purposes, including building indexes for search engines, validating that ads are being displayed in the appropriate context, and detecting malicious code on compromised web servers.
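The crawling process described above boils down to fetching a page, extracting its links, and queuing those links for further visits. A minimal sketch of the link-extraction step, using only the Python standard library (the example HTML and base URL are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so they can be queued for crawling.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Example: a crawler would feed each extracted URL back into its fetch queue.
page = '<a href="/about">About</a> <a href="http://other.example/">Elsewhere</a>'
print(extract_links(page, "http://www.example.com/"))
```

A real crawler adds a frontier queue, a visited-URL set to avoid loops, and politeness delays between requests to the same host.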

Many web crawlers will politely identify themselves via their user-agent string, which provides a reliable way of excluding a significant amount of non-human traffic from advertising metrics. The IAB (in conjunction with ABCe) maintains a list of known user-agent strings as the Spiders and Bots list. However, web crawlers attempting to discover malicious code often must masquerade as human traffic, which then requires secondary, behavioral filtering to detect.
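Filtering on the user-agent string amounts to matching each request's declared identity against a list of known crawler signatures. A sketch of that check, assuming a hypothetical list of substrings (the actual IAB/ABCe Spiders and Bots list is maintained separately and is far more extensive):

```python
# Hypothetical sample of known-crawler signatures for illustration only;
# a production filter would load the full IAB/ABCe Spiders and Bots list.
KNOWN_BOT_SUBSTRINGS = ["googlebot", "bingbot", "crawler", "spider"]

def is_known_bot(user_agent):
    """Return True if the user-agent string matches a known crawler signature."""
    ua = user_agent.lower()
    return any(signature in ua for signature in KNOWN_BOT_SUBSTRINGS)

print(is_known_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # a polite crawler
print(is_known_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # looks human
```

Because this check only sees what the client chooses to declare, it catches polite crawlers but not those deliberately spoofing a browser user-agent, which is why behavioral filtering is needed as a second layer.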

Most web crawlers will respect a file called robots.txt, hosted in the root of a web site. This file informs the web crawler which directories should and shouldn't be indexed, but does not itself enforce any access restrictions.
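A well-behaved crawler consults robots.txt before each fetch. Python's standard library ships a parser for the format; a minimal sketch, using an inline example rules file rather than fetching one from a live site:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a crawler would normally fetch this from
# http://<host>/robots.txt before crawling that host.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The crawler checks each candidate URL against the rules before fetching.
print(rp.can_fetch("MyBot/1.0", "http://www.example.com/private/data.html"))
print(rp.can_fetch("MyBot/1.0", "http://www.example.com/public/page.html"))
```

Note that this is purely advisory, as the article says: the server does nothing to stop a crawler that simply ignores the file.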

Technically, a web crawler is a specific type of bot, or software agent.

See bot and intelligent agents.
