
How to create a simple Web Crawler

Web crawlers are used to extract information from web sites for many purposes.

The simple example given here accepts a URL and exposes some functions to query the content of the page.

To check out the source code of this example :

If you are going to make any improvements to this code, I recommend following TDD (test-driven development) and using the unit test class included with the code.

Step 1 : Create the class, init function and required attributes

In this example, XPath is used for querying the given web page. There is one attribute to hold the page URL, and another to hold the XPath object of the loaded page.

The init() function loads the page at the URL assigned to the url attribute and initializes the XPath object for it.

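class Crawler {

    public $url = false;
    protected $xpath = false;

    public function init() {

        $xmlDoc = new DOMDocument();
        // Load the page from the url attribute (loadHTMLFile is assumed here);
        // @ suppresses the warnings libxml raises for imperfect HTML.
        @$xmlDoc->loadHTMLFile($this->url);
        $this->xpath = new DOMXPath($xmlDoc);
    }
}

In the next two steps, you will add some simple functions to this class for querying data.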

Step 2 : A method to query the content of tags

This function returns an array of contents for the given tag name within the given limit (default limit is 10).

public function getInnerText($tag, $limit = 10) {

    $content = Array();
    // The parentheses make position() count over all matches in the document
    $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");

    foreach ($searchNode as $node) {
        $content[] = $node->nodeValue;
    }

    return $content;
}
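
To see what the limit in the XPath expression does, here is a minimal standalone sketch that runs the same kind of query outside the class. The sample markup and the variable names are made up for illustration only; they are not part of the original example.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = '<html><body><h1>First</h1><p>text</p><h1>Second</h1><h1>Third</h1></body></html>';

$doc = new DOMDocument();
// @ suppresses the warnings libxml raises for imperfect HTML.
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Same expression shape as in getInnerText(): wrapping //h1 in parentheses
// makes position() count over all matches in the document, not per parent.
$limit = 2;
$nodes = $xpath->query("(//h1)[position() <= {$limit}]");

$headings = array();
foreach ($nodes as $node) {
    $headings[] = $node->nodeValue;
}

print_r($headings); // contains "First" and "Second"; "Third" is cut off by the limit
```

Without the parentheses, an expression such as //h1[position() <= 2] would apply the limit separately under each parent element, which is usually not what you want here.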

Step 3 : A method to extract tag attributes (images, links etc.)

This function returns an array of attribute values for the given tag type on the page within the given limit (default limit is 10).

public function getTagAtrributes($tag, $attr, $limit = 10) {

    $attributes = Array();
    $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");

    foreach ($searchNode as $node) {
        // getAttribute() returns an empty string when the attribute is missing
        $attributes[] = $node->getAttribute($attr);
    }

    return $attributes;
}
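
One thing to be aware of: getAttribute() returns an empty string for a tag that does not carry the requested attribute, so the result can contain blanks. If that is a problem, one possible variation (my assumption, not part of the original class) is to require the attribute in the XPath expression itself. The sample markup below is hypothetical.

```php
<?php
// Hypothetical sample markup: the first <a> has no href attribute.
$html = '<html><body><a name="top">anchor</a><a href="/one">1</a><a href="/two">2</a></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$attr = 'href';
$limit = 10;

// [@href] keeps only the tags that actually have the attribute,
// so the limit is spent on useful matches only.
$nodes = $xpath->query("(//a[@{$attr}])[position() <= {$limit}]");

$links = array();
foreach ($nodes as $node) {
    $links[] = $node->getAttribute($attr);
}

print_r($links); // contains "/one" and "/two"
```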

Step 4 : Usage

Given below is the basic way of initializing and querying. Please check the unit test class in the repository for more details.

$crawler = new Crawler();
$crawler->url = "crawlerTestVictim.html";
$crawler->init(); // Load the page and build the XPath object before querying

$images = $crawler->getTagAtrributes("img", "src"); // Get the images
$headings = $crawler->getInnerText('h1'); // Get the headings
$links = $crawler->getTagAtrributes("a", "href"); // Get the links

You can use the two functions given above to write another function that crawls all the linked pages by maintaining a queue of links.
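
A minimal sketch of that idea could look like the code below. It is only one way to do it, under a few assumptions: the function name crawlAll, the $getLinks callback and the $maxPages cap are all hypothetical. In real use the callback would create a Crawler, set its url, call init() and return getTagAtrributes("a", "href"); here it reads from an in-memory array so that the sketch runs without any network access.

```php
<?php
/**
 * Breadth-first crawl driven by a queue of links.
 * $getLinks is a hypothetical callback: given a URL, it returns the links
 * found on that page (for example by wrapping the Crawler class).
 */
function crawlAll($startUrl, callable $getLinks, $maxPages = 50) {
    $queue = array($startUrl); // links waiting to be visited
    $visited = array();        // URL => true, prevents visiting a page twice

    while (!empty($queue) && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled through another path
        }
        $visited[$url] = true;

        foreach ($getLinks($url) as $link) {
            // Skip empty href values and pages that were already crawled
            if ($link !== '' && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    return array_keys($visited);
}

// In-memory stand-in for real pages, used only for this illustration.
$pages = array(
    'a.html' => array('b.html', 'c.html'),
    'b.html' => array('a.html', 'c.html'),
    'c.html' => array(),
);

$order = crawlAll('a.html', function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : array();
});

print_r($order); // a.html, b.html, c.html
```

The visited list is what stops the crawler from looping when pages link back to each other, and the page cap keeps it from running forever on a large site. Relative links would also need to be resolved against the current page URL before they are queued, which this sketch leaves out.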


