Showing posts from December, 2013

How to create a simple Web Crawler

Web crawlers are used to extract information from web sites for many purposes.

The simple example given here accepts a URL and exposes some functions to query the content of the page.

The source code for this example is available at: https://github.com/nadeeth/crawler

If you are going to make any improvements to this code, I recommend following TDD and using the unit test class included in the code.

Step 1: Create the class, init function and required attributes

In this example, xpath is used for querying the given web page. There is an attribute to hold the page url, and another to hold the xpath object of the loaded page.
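As a rough illustration of what querying a page with xpath looks like, here is a minimal sketch that is separate from the Crawler class. It assumes an inline HTML string (a made-up page) instead of a fetched URL, and the variable names $html, $title and $links are hypothetical, chosen only for this example.

```php
<?php
// Hypothetical inline page, used here in place of a page fetched from a URL.
$html = '<html><head><title>Example page</title></head>'
      . '<body><a href="/about">About</a><a href="/contact">Contact</a></body></html>';

$doc = new DOMDocument();

// Real-world HTML is often malformed; collect parser warnings internally
// instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList of the nodes matching the expression.
$title = $xpath->query('//title')->item(0)->textContent;

// Collect the href attribute of every link on the page.
$links = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $links[] = $attr->nodeValue;
}

echo $title, "\n";
echo implode(', ', $links), "\n";
```

This sketch uses libxml_use_internal_errors() to silence parser warnings, which is one alternative to the @ operator; either approach should work for a small crawler like this one.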

The init() function initializes the xpath object for the page URL assigned to the url attribute.

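class Crawler {

    public $url = false;      // URL of the page to crawl
    protected $xpath = false; // DOMXPath object of the loaded page

    public function init() {
        $xmlDoc = new DOMDocument();
        // @ suppresses the warnings loadHTML() raises on malformed markup
        @$xmlDoc->loadHTML(file_get_contents($this->url));
        $this->xpath = new DOMXPath($xmlDoc);
    }
}

In the next two ste…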