How to create a simple Web Crawler

Web crawlers are used to extract information from web sites for many purposes.

The simple example given here accepts a URL and exposes a few functions for querying the content of the page.

The source code for this example is available at https://github.com/nadeeth/crawler

If you are going to improve this code, I recommend following TDD and using the unit test class included with the code.

Step 1 : Create the class, init function and required attributes

In this example, XPath is used for querying the given web page. The class has one attribute to hold the page URL, and another to hold the XPath object of the loaded page.

The init() function initializes the XPath object for the page URL assigned to the url attribute.

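class Crawler {

    public $url = false;
    protected $xpath = false;

    public function init() {
        // file_get_contents() returns false when the page cannot be read.
        $html = @file_get_contents($this->url);
        if ($html === false || $html === '') {
            throw new RuntimeException("Unable to load {$this->url}");
        }

        // Real-world HTML is rarely well formed, so collect the parser
        // warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $xmlDoc = new DOMDocument();
        $xmlDoc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $this->xpath = new DOMXPath($xmlDoc);
    }
}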
In the next two steps, you will add some simple functions to this class for querying data.
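
Before adding the query methods, it may help to see what the XPath object gives you on its own. The following is a minimal standalone sketch, not part of the repository code: it loads a hypothetical inline HTML string (instead of a real URL) and runs two queries against it.

```php
<?php
// Hypothetical inline page, used here instead of fetching a real URL.
$html = '<html><body><h1>Hello</h1><p>First</p><p>Second</p></body></html>';

// Collect parser warnings internally rather than printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList that can be counted and iterated.
$paragraphs = $xpath->query('//p');
$paragraphCount = $paragraphs->length;

// item(0) is the first matching node; nodeValue is its text content.
$firstHeading = $xpath->query('//h1')->item(0)->nodeValue;

echo $paragraphCount, ' paragraphs, heading: ', $firstHeading, PHP_EOL;
```

The methods in the next steps are thin wrappers around this same query() call.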

Step 2 : A method to query the content of tags

This function returns an array containing the text content of each element with the given tag name, up to the given limit (the default limit is 10).

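public function getInnerText($tag, $limit = 10) {

    $content = [];
    $limit = (int) $limit; // keep the XPath expression well formed
    // The parentheses make position() count across the whole document.
    $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");

    foreach ($searchNode as $node) {
        $content[] = $node->nodeValue;
    }

    return $content;
}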
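
The parentheses around //{$tag} in the query matter. As a small illustration (the HTML string and the nodeValues() helper are made up for this sketch), compare the grouped expression with the ungrouped one on a page that has paragraphs under two different parents:

```php
<?php
// Hypothetical page with three paragraphs in each of two containers.
$html = '<html><body>'
      . '<div><p>a</p><p>b</p><p>c</p></div>'
      . '<div><p>d</p><p>e</p><p>f</p></div>'
      . '</body></html>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Illustrative helper: collect the text of every node in a list.
function nodeValues(DOMNodeList $nodes): array {
    $values = [];
    foreach ($nodes as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Grouped: position() counts over all <p> elements in document order.
$grouped = nodeValues($xpath->query('(//p)[position() <= 2]'));

// Ungrouped: position() restarts for each parent element.
$ungrouped = nodeValues($xpath->query('//p[position() <= 2]'));

echo implode(',', $grouped), PHP_EOL;   // a,b
echo implode(',', $ungrouped), PHP_EOL; // a,b,d,e
```

Without the parentheses the limit applies per parent element, so the result can contain more items than the limit.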




Step 3 : A method to extract tag attributes (images, links etc.)

This function returns an array with the value of the given attribute for each element of the given tag type on the page, up to the given limit (the default limit is 10).

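public function getTagAtrributes($tag, $attr, $limit = 10) {

    $attributes = [];
    $limit = (int) $limit; // keep the XPath expression well formed
    $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");

    foreach ($searchNode as $node) {
        // getAttribute() returns an empty string when the attribute is missing.
        $attributes[] = $node->getAttribute($attr);
    }

    return $attributes;
}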
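
One thing to be aware of: an element that lacks the requested attribute still matches the query, and getAttribute() returns an empty string for it. If that is not what you want, one possible variation (a sketch on a made-up HTML string, not code from the repository) is to filter on the attribute inside the XPath expression:

```php
<?php
// Hypothetical page: the second anchor has no href attribute.
$html = '<html><body>'
      . '<a href="/one">1</a><a name="anchor">2</a><a href="/two">3</a>'
      . '</body></html>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Same expression as in the method above: every <a> is returned.
$all = [];
foreach ($xpath->query('(//a)[position() <= 10]') as $node) {
    $all[] = $node->getAttribute('href');
}

// Variation: the [@href] predicate keeps only anchors that have an href.
$withHref = [];
foreach ($xpath->query('(//a[@href])[position() <= 10]') as $node) {
    $withHref[] = $node->getAttribute('href');
}

var_dump($all);      // contains an empty string for the second anchor
var_dump($withHref); // only "/one" and "/two"
```

With the predicate in place, the limit also counts only the elements that actually carry the attribute.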

Step 4 : Usage

Given below is the basic way of initializing and querying. Please check the unit test class in the repository for more details.

$crawler = new Crawler();
$crawler->url = "crawlerTestVictim.html";
$crawler->init();

$images = $crawler->getTagAtrributes("img", "src"); // Get the image sources
$headings = $crawler->getInnerText('h1'); // Get the headings
$links = $crawler->getTagAtrributes("a", "href"); // Get the link targets

You can use the two functions given above to write another function that crawls all the linked pages recursively by maintaining a queue of links.
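
As a rough sketch of that idea (not part of the repository), the function below keeps a queue of links and a set of visited URLs. The names crawlLinks(), $fetchHtml and $maxPages are assumptions made for this example. The fetcher is passed in as a callable so the sketch can run against an in-memory set of pages, and links are used exactly as they appear in the href attribute, without resolving relative URLs.

```php
<?php
/**
 * Breadth-first crawl. $fetchHtml is a hypothetical callable that takes a
 * URL and returns the page HTML as a string, or false on failure.
 * Returns the URLs in the order they were visited.
 */
function crawlLinks(string $startUrl, callable $fetchHtml, int $maxPages = 20): array {
    $queue = [$startUrl]; // links waiting to be fetched
    $visited = [];        // URL => true, so each page is fetched only once

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = $fetchHtml($url);
        if ($html === false || $html === '') {
            continue; // unreachable or empty page: nothing to parse
        }

        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Queue every link on the page that has not been visited yet.
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//a[@href]') as $anchor) {
            $link = $anchor->getAttribute('href');
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    return array_keys($visited);
}

// In-memory "site" standing in for real pages (hypothetical data).
$pages = [
    'a.html' => '<html><body><a href="b.html">B</a><a href="c.html">C</a></body></html>',
    'b.html' => '<html><body><a href="a.html">A</a><a href="c.html">C</a></body></html>',
    'c.html' => '<html><body><p>No links here</p></body></html>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};

$order = crawlLinks('a.html', $fetch);
print_r($order); // a.html, b.html, c.html
```

To crawl real pages you could pass a callable that wraps file_get_contents(), but you would then also need to resolve relative links against the current page URL and probably restrict the crawl to a single host. The $maxPages cap is there so the crawl always stops.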

