Web crawlers are used to extract information from websites for many purposes.
The simple example given here accepts a URL and exposes some functions to query the content of the page.
The source code for this example is available at https://github.com/nadeeth/crawler
If you are going to make any improvements to this code, I recommend following TDD and using the unit test class included in the repository.
Step 1 : Create the class, the init() function, and the required attributes
In this example, XPath is used to query the given web page. There is an attribute to hold the page URL, and another to hold the XPath object of the loaded page. The init() function initializes the XPath object for the page assigned to the url attribute.

    class Crawler {

        public $url = false;
        protected $xpath = false;

        public function init() {
            $xmlDoc = new DOMDocument();
            @$xmlDoc->loadHTML(file_get_contents($this->url));
            $this->xpath = new DOMXPath($xmlDoc);
        }
    }

In the next two steps, you will add some simple functions to this class for querying data.
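As a side note, loadHTML() emits warnings on malformed HTML, which the @ operator above silences. A variation I sometimes prefer (my own sketch, not part of the repository) lets libxml buffer those warnings instead:

    public function init() {
        $xmlDoc = new DOMDocument();
        libxml_use_internal_errors(true);   // buffer HTML parse warnings instead of printing them
        $xmlDoc->loadHTML(file_get_contents($this->url));
        libxml_clear_errors();              // discard the buffered warnings
        $this->xpath = new DOMXPath($xmlDoc);
    }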
Step 2 : A method to query the content of tags
This function returns an array of contents for the given tag name, up to the given limit (the default limit is 10).

    public function getInnerText($tag, $limit = 10) {
        $content = Array();
        $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");
        foreach ($searchNode as $node) {
            $content[] = $node->nodeValue;
        }
        return $content;
    }
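For example, assuming $crawler has already been initialized as in Step 4 below, pulling the first five paragraph texts might look like this (the tag name and limit are just for illustration):

    $paragraphs = $crawler->getInnerText('p', 5);
    print_r($paragraphs); // Array ( [0] => "first paragraph text", ... )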
Step 3 : A method to extract tag attributes (images, links, etc.)
This function returns an array of attribute values for the given tag on the page, up to the given limit (the default limit is 10).

    public function getTagAtrributes($tag, $attr, $limit = 10) {
        $attributes = Array();
        $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");
        foreach ($searchNode as $node) {
            $attributes[] = $node->getAttribute($attr);
        }
        return $attributes;
    }
Step 4 : Usage
Given below is the basic way of initializing and querying; please check the unit test class in the repository for more details.

    $crawler = new Crawler();
    $crawler->url = "crawlerTestVictim.html";
    $crawler->init();

    $images = $crawler->getTagAtrributes("img", "src"); // Get the images
    $headings = $crawler->getInnerText('h1');           // Get the headings
    $links = $crawler->getTagAtrributes("a", "href");   // Get the links
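Since init() loads the page with file_get_contents(), the url attribute can just as well point at a live page, assuming allow_url_fopen is enabled in php.ini (the URL below is only a placeholder):

    $crawler = new Crawler();
    $crawler->url = "http://example.com/";  // any reachable URL, not just a local file
    $crawler->init();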
You can use the two functions given above to write another function that crawls all the linked pages recursively by maintaining a queue of links, as sketched below.
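Here is a minimal sketch of such a function, assuming the Crawler class above; the crawlSite() name, the $maxPages cap, and the visited list are my own additions, and a production crawler would also resolve relative URLs and respect robots.txt:

    function crawlSite($startUrl, $maxPages = 10) {
        $queue = array($startUrl);   // URLs waiting to be crawled
        $visited = array();          // URLs already crawled
        $pages = array();            // collected data, keyed by URL

        while (!empty($queue) && count($visited) < $maxPages) {
            $url = array_shift($queue);
            if (isset($visited[$url])) {
                continue;            // skip pages seen before
            }
            $visited[$url] = true;

            $crawler = new Crawler();
            $crawler->url = $url;
            $crawler->init();
            $pages[$url] = $crawler->getInnerText('h1');

            // Enqueue every link found on the page.
            foreach ($crawler->getTagAtrributes('a', 'href', 100) as $link) {
                if (!isset($visited[$link])) {
                    $queue[] = $link;
                }
            }
        }
        return $pages;
    }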