
How to create a simple Web Crawler

Web crawlers are used to extract information from websites for many purposes.

The simple example given here accepts a URL and exposes some functions to query the content of the page.

To check out the source code of this example:

If you are going to make any improvements to this code, I recommend following TDD and using the unit test class included in the code.

Step 1 : Create the class, the init() function, and the required attributes

In this example, XPath is used to query the given web page. There is an attribute to hold the page URL, and another to hold the DOMXPath object of the loaded page.

The init() function loads the page assigned to the url attribute and initializes the DOMXPath object for it.

class Crawler {

    public $url = false;
    protected $xpath = false;

    public function init() {

        $xmlDoc = new DOMDocument();
        // Load the page; @ suppresses warnings from malformed HTML
        @$xmlDoc->loadHTMLFile($this->url);
        $this->xpath = new DOMXPath($xmlDoc);
    }
In the next two steps, you will add some simple functions to this class for querying data.

Step 2 : A method to query the content of tags

This function returns an array containing the text content of the tags matching the given tag name, within the given limit (the default limit is 10).

public function getInnerText($tag, $limit = 10) {

    $content = array();
    $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");

    foreach ($searchNode as $node) {
        $content[] = $node->nodeValue;
    }

    return $content;
}

Step 3 : A method to extract tag attributes (images, links etc.)

This function returns an array of attribute values for the given tag type on the page, within the given limit (the default limit is 10).

public function getTagAtrributes($tag, $attr, $limit = 10) {

    $attributes = array();
    $searchNode = $this->xpath->query("(//{$tag})[position() <= {$limit}]");

    foreach ($searchNode as $node) {
        $attributes[] = $node->getAttribute($attr);
    }

    return $attributes;
}

Step 4 : Usage

Given below is the basic way of initializing and querying; please check the unit test class in the repository for more details.

$crawler = new Crawler();
$crawler->url = "crawlerTestVictim.html";
$crawler->init();

$images = $crawler->getTagAtrributes("img", "src"); // Get the images
$headings = $crawler->getInnerText('h1'); // Get the headings
$links = $crawler->getTagAtrributes("a", "href"); // Get the links

You can use the two functions given above to write another function that crawls all the linked pages by maintaining a queue of links.
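One possible sketch of such a crawl is given below. This is not from the repository: the function name crawlAll, the $maxPages limit, and the direct use of DOMDocument/DOMXPath (instead of the Crawler class, to keep the snippet self-contained) are all my own assumptions.

```php
<?php
// Sketch: crawl linked pages breadth-first using a queue of links.
// crawlAll() and $maxPages are hypothetical names, not from the repo.
function crawlAll($startUrl, $maxPages = 20) {
    $queue = array($startUrl); // links waiting to be visited
    $visited = array();        // URLs already crawled, to avoid loops

    while (!empty($queue) && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $doc = new DOMDocument();
        if (!@$doc->loadHTMLFile($url)) {
            continue; // skip pages that fail to load
        }
        $xpath = new DOMXPath($doc);

        // Collect every link on the page and enqueue unseen ones
        foreach ($xpath->query('//a') as $node) {
            $href = $node->getAttribute('href');
            if ($href !== '' && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }

    return array_keys($visited);
}
```

Note that a real crawler would also need to resolve relative links against the current page's URL and respect robots.txt; those details are left out of this sketch.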


