Crawlers are everywhere, visiting many webpages every second. The biggest of them is Google’s, which has already crawled almost 90% of the web and is still crawling. If you’re like me and want to build a more advanced crawler with options and features, this post will help you.
When I created my search engine test project, it needed an awesome crawler. I knew that developing one on my own would be hard, especially since a crawler has to follow robots.txt and other rules. So I googled and found a great library, PHPCrawl, which is 10 years old!
I tested it and it was great. It has all the cool features: it follows robots.txt and it acts like a browser when visiting sites, so the site owner can’t easily tell that a crawler just visited.
Download & Install
First, download the library from the project’s website. Version 0.82 or later is recommended. After downloading, extract the main folder PHPCrawl_{version_number} to your site’s source root. Rename this folder to “PHPCrawl” so that the folder name stays the same when you extract a newer version.
We’ll use the files in this extracted folder to create our crawler.
Here’s the directory tree:
PHPCrawl/
    ... Crawler Library Files Here
index.php
We will make our crawler in the index.php file.
Customize Crawler
As I said before, we’ll write the code for the crawler in the index.php file. You can put it in another file if you want.
First, we include the crawler library file inside PHP tags:
<?php
// Path is relative to this file, not to the filesystem root
include __DIR__ . "/PHPCrawl/libs/PHPCrawler.class.php";
Then we extend the PHPCrawler class to suit our needs. The code below overrides the default method that is called whenever the crawler loads a document:
class SBCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $p){
        $pageurl = $p->url;
        $status = $p->http_status_code;
        $source = $p->source;
        if($status == 200 && $source != ""){
            // Page Successfully Got
            echo $pageurl."<br/>";
        }
    }
}
Note that the code above checks that the HTTP status code of the page is 200 (OK) and that the source (the page’s HTML code) is not empty. If both conditions are met, the URL of the crawled page is printed.
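The same check can be pulled out into a small helper so it is easy to try on its own. This is only a minimal sketch: the function name isUsablePage is hypothetical and not part of PHPCrawl.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): true when the page answered
// with HTTP 200 and delivered a non-empty body.
function isUsablePage($status, $source) {
    return (int) $status === 200 && (string) $source !== "";
}

var_dump(isUsablePage(200, "<html></html>")); // bool(true)
var_dump(isUsablePage(404, "<html></html>")); // bool(false)
var_dump(isUsablePage(200, ""));              // bool(false)
```

Inside handleDocumentInfo() you could then write if(isUsablePage($status, $source)){ ... } instead of repeating the condition.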
If you want to manipulate or extract contents from the DOM, you can use PHP libraries like Simple HTML DOM. Here is an example of using Simple HTML DOM to get the contents of the <title> tag:
if($status == 200 && $source != ""){
    // Page Successfully Got
    $html = str_get_html($source);
    if(is_object($html)){
        $title = ""; // Fallback for pages without a <title> tag
        $t = $html->find("title", 0);
        if($t){
            $title = $t->innertext;
        }
        echo $title." - ".$pageurl."<br/>";
        // Free the memory used by the parsed document
        $html->clear();
        unset($html);
    }
}
Note that you have to include the Simple HTML DOM library, and that the above code goes inside the handleDocumentInfo() method of the SBCrawler class we created before.
If you want to do more with the document info returned by the crawler, see the PHPCrawlerDocumentInfo class reference in the PHPCrawl documentation for the full list of properties. It’s recommended that you put your code inside the status code check we did in the SBCrawler class.
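If you would rather not add another dependency, the same title lookup can be sketched with PHP’s bundled DOM extension instead of Simple HTML DOM. This is a different technique from the one above, and extractTitle() is a hypothetical helper name used only for illustration.

```php
<?php
// Sketch: read the <title> text with PHP's built-in DOMDocument.
// extractTitle() is a hypothetical helper, not part of PHPCrawl or Simple HTML DOM.
function extractTitle($source) {
    if ($source === "") {
        return ""; // loadHTML() rejects empty input
    }
    $dom = new DOMDocument();
    // Real-world HTML is often invalid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $dom->getElementsByTagName("title");
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : "";
}

echo extractTitle("<html><head><title> Hello </title></head><body></body></html>");
```

Keep in mind that loadHTML() guesses the character encoding when the page does not declare one, so non-ASCII titles may need extra handling.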
Create Crawler
Now we write the crawl() function, which creates the crawler object and sets the options for crawling:
function crawl($u){
    $C = new SBCrawler();
    $C->setURL($u);
    $C->addContentTypeReceiveRule("#text/html#");
    /* We don't want to crawl non HTML pages. The "\." makes sure only real file extensions match */
    $C->addURLFilterRule('#\.(jpg|gif|png|pdf|jpeg|svg|css|js)$#i');
    $C->setTrafficLimit(2000 * 1024); /* Stop after about 2 MB of traffic */
    $C->obeyRobotsTxt(true); /* Should we follow robots.txt */
    $C->go();
}
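To see which URLs a filter rule like this would exclude, you can try the regular expression with preg_match() outside the crawler. The sketch below uses a variant of the pattern with an escaped dot before the extension; the isFiltered() helper and the example.com URLs are made up for illustration.

```php
<?php
// Pattern in the style passed to addURLFilterRule():
// "\." requires a real file extension, "$" anchors at the end, "i" ignores case.
$filter = '#\.(jpg|jpeg|gif|png|svg|pdf|css|js)$#i';

// Hypothetical helper: true when the URL matches the filter pattern.
function isFiltered($url, $pattern) {
    return preg_match($pattern, $url) === 1;
}

$urls = array(
    "http://example.com/logo.PNG",
    "http://example.com/style.css",
    "http://example.com/about",
    "http://example.com/nodejs",
);

foreach ($urls as $url) {
    echo (isFiltered($url, $filter) ? "skip  " : "crawl ") . $url . "\n";
}
```

Without the escaped dot, a URL that merely ends in the letters "js", such as /nodejs, would be filtered out too.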
You can set more options too. Here are some of the main options that I think you’ll need :
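Code | Description
$C->obeyRobotsTxt(true) | Whether the crawler should obey robots.txt
$C->setFollowMode(2) | Which links to follow; 2 means only links on the same host
$C->setPageLimit(0) | Maximum number of pages to crawl; 0 means no limit
$C->setUserAgentString("PHPCrawl") | The User-Agent string sent with each request
$C->enableCookieHandling(true) | Whether the crawler should handle cookies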
The possibilities are endless. As you can see, you can customize many aspects of how the crawler behaves.
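If you end up with many options, one way to keep them in a single place is an array that maps method names to values. The sketch below is only an illustration: applyOptions() and RecordingCrawler are hypothetical names, and RecordingCrawler is a stand-in that just records calls so the example runs without PHPCrawl installed. With the real library you would pass an SBCrawler object instead.

```php
<?php
// Hypothetical stand-in for the crawler: records every method call it receives.
class RecordingCrawler {
    public $calls = array();
    public function __call($name, $args) {
        $this->calls[$name] = $args;
        return true;
    }
}

// Hypothetical helper: calls each option method with its configured value.
function applyOptions($crawler, array $options) {
    foreach ($options as $method => $value) {
        $crawler->$method($value);
    }
    return $crawler;
}

// Method names and values taken from the options listed above.
$options = array(
    "obeyRobotsTxt"        => true,
    "setFollowMode"        => 2,
    "setPageLimit"         => 0,
    "setUserAgentString"   => "PHPCrawl",
    "enableCookieHandling" => true,
);

$C = applyOptions(new RecordingCrawler(), $options);
print_r(array_keys($C->calls));
```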
Start Crawling
It’s time to give a URL to crawl() and start the crawling process. Here’s how you start crawling the site “http://subinsb.com”:
crawl("http://subinsb.com");
Crawling can be time-consuming, so don’t worry if the script takes a long time to load. If you want to display each result as soon as a page is crawled, call flush() after the echo in the SBCrawler class. Here’s a sample:
if($status == 200 && $source != ""){
    // Page Successfully Got
    echo $pageurl."<br/>";
    flush();
}
You can see the demo of this flush() here.
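One thing to watch for: if PHP output buffering is turned on, flush() by itself does not send what is still sitting in PHP’s own buffer. A minimal sketch of a helper that handles both cases is shown here; emitLine() is a hypothetical name, and whether the browser really shows the line immediately also depends on your web server’s own buffering and compression settings.

```php
<?php
// Hypothetical helper: print one result line and push it out right away.
function emitLine($line) {
    echo $line . "<br/>\n";
    // When an output buffer is active, hand its contents on first.
    if (ob_get_level() > 0) {
        ob_flush();
    }
    // Then ask PHP to send whatever it holds to the web server or terminal.
    flush();
}

emitLine("http://subinsb.com/");
```

In the SBCrawler class you would call emitLine($pageurl) in place of the echo and flush() pair.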
The PHPCrawl library will be updated over time and its features will keep growing. I hope you got through this tutorial with a great result. If you have a problem with something, feel free to comment and I’ll be here to help you out. 😀