A web crawler is a program that crawls through the sites on the Web and indexes their URLs. Search engines use crawlers to index URLs on the Web. Google’s original crawler was written in Python; other search engines use different types of crawlers.
In this post I’m going to tell you how to create a simple Web Crawler in PHP.
I wrote the code shown here. It took me 2 days to create a simple crawler, so how much time would it take to create a perfect one? Creating a crawler is a very hard task. It’s like creating a robot. Let’s start building a crawler.
To parse the web page at a URL, we are going to use the Simple HTML DOM class, which can be downloaded from SourceForge. Include the file “simple_html_dom.php” and declare the variables we are going to use:
include "simple_html_dom.php";
$crawled_urls = array();
$found_urls = array();
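Then add the functions we are going to use. The following function converts relative URLs to absolute URLs:
function rel2abs($rel, $base) {
    // An empty link refers to the base document itself
    if ($rel == '') {
        return $base;
    }
    // Already absolute: return unchanged
    if (parse_url($rel, PHP_URL_SCHEME) != '') {
        return $rel;
    }
    // Fragments and query strings are appended to the base
    if ($rel[0] == '#' || $rel[0] == '?') {
        return $base . $rel;
    }
    // Import $scheme, $host and $path from the base URL ($path defaults to '')
    $path = '';
    extract(parse_url($base));
    // Remove the non-directory part of the path
    $path = preg_replace('#/[^/]*$#', '', $path);
    // A link starting with "/" is relative to the site root
    if ($rel[0] == '/') {
        $path = '';
    }
    // Rough absolute URL, not yet cleaned up
    $abs = "$host$path/$rel";
    // Replace "//", "/./" and "/foo/../" with "/" until nothing changes
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
    // Drop any "../" that could not be resolved
    $abs = str_replace('../', '', $abs);
    return $scheme . '://' . $abs;
}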
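To see what the cleanup loop in rel2abs does, here is a small sketch that runs the same two patterns on their own. The helper name collapse_dot_segments is hypothetical and only for illustration; it is not part of the crawler.

```php
<?php
// Hypothetical helper: applies the two patterns used in rel2abs over and
// over until a pass makes no replacement.
function collapse_dot_segments(string $path): string {
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    do {
        // $n receives the number of replacements made in this pass
        $path = preg_replace($re, '/', $path, -1, $n);
    } while ($n > 0);
    return $path;
}

echo collapse_dot_segments('example.com/a/b/../c/./d.html'), PHP_EOL;
// example.com/a/c/d.html
echo collapse_dot_segments('example.com//a/x/y/../../b'), PHP_EOL;
// example.com/a/b
```

The second call needs more than one pass: the first pass removes "//" and "/y/../", and only then does "/x/../" become visible to the pattern. That is why the replacement runs in a loop.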
The following function turns the URLs found while crawling into absolute URLs:
function perfect_url($u, $b) {
    $bp = parse_url($b);
    $path = isset($bp['path']) ? $bp['path'] : '';
    // Unless the base is already the site root, reduce it to scheme://host/
    if ($path != '/') {
        if (empty($bp['scheme'])) {
            $scheme = 'http';
        } else {
            $scheme = $bp['scheme'];
        }
        $b = $scheme . '://' . $bp['host'] . '/';
    }
    // Protocol-relative URL ("//host/path"): assume http
    if (substr($u, 0, 2) == '//') {
        $u = 'http:' . $u;
    }
    // Anything that does not start with "http" is resolved against the base
    if (substr($u, 0, 4) != 'http') {
        $u = rel2abs($u, $b);
    }
    return $u;
}
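perfect_url always puts "http:" in front of protocol-relative URLs. If you would rather keep the scheme of the page the link was found on, something like the following could work. This is only a sketch: with_base_scheme is a hypothetical name, and the example.com URLs are made up.

```php
<?php
// Hypothetical helper: gives a protocol-relative URL ("//host/path") the
// scheme of the page it was found on.
function with_base_scheme(string $url, string $base): string {
    if (substr($url, 0, 2) !== '//') {
        return $url; // not protocol-relative, leave it untouched
    }
    $scheme = parse_url($base, PHP_URL_SCHEME);
    if (!$scheme) {
        $scheme = 'http'; // same fallback that perfect_url uses
    }
    return $scheme . ':' . $url;
}

echo with_base_scheme('//cdn.example.com/a.js', 'https://example.com/page'), PHP_EOL;
// https://cdn.example.com/a.js
```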
This code is the core of the crawler :
function crawl_site($u) {
    global $crawled_urls, $found_urls;
    $uen = urlencode($u);
    // Crawl the page if it is new or was last crawled more than 25 seconds ago
    if (!array_key_exists($uen, $crawled_urls) || $crawled_urls[$uen] < date('YmdHis', strtotime('-25 seconds', time()))) {
        $html = file_get_html($u);
        $crawled_urls[$uen] = date('YmdHis');
        // file_get_html() returns false when the page is empty or cannot be loaded
        if (!$html) {
            return;
        }
        foreach ($html->find('a') as $li) {
            $url = perfect_url($li->href, $u);
            $enurl = urlencode($url);
            // Skip empty, mailto: and javascript: links and URLs already found
            if ($url != '' && substr($url, 0, 4) != 'mail' && substr($url, 0, 4) != 'java' && !array_key_exists($enurl, $found_urls)) {
                $found_urls[$enurl] = 1;
                echo $url . PHP_EOL;
            }
        }
    }
}
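If you don’t want to depend on Simple HTML DOM, the link extraction step can probably also be done with PHP’s own DOM extension. The sketch below is an alternative to the find('a') loop above, not the method this crawler uses. It assumes the DOM extension is installed, and extract_links is a hypothetical name.

```php
<?php
// Sketch: collect the href values of all <a> tags in an HTML string using
// PHP's DOM extension instead of Simple HTML DOM.
function extract_links(string $html): array {
    if ($html === '') {
        return array();
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so keep parser warnings quiet
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $links[$href] = true; // array keys remove duplicates
        }
    }
    return array_keys($links);
}

$sample = '<html><body><a href="/about">About</a> <a href="/about">Again</a> '
        . '<a name="x">No href</a> <a href="http://example.com/">Home</a></body></html>';
print_r(extract_links($sample));
// Prints an array with "/about" and "http://example.com/"
```

The returned hrefs are still raw, so each one would have to go through perfect_url before it is stored or crawled.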
Finally, we will call the crawl_site function to crawl a URL. I’m going to use http://subinsb.com for crawling.
crawl_site("http://subinsb.com");
When you run the PHP crawler now, you will get all the URLs in the page. You can crawl the URLs found in turn to discover more, but you would need a fast server and a high-speed Internet connection.
A supercomputer and an Internet connection of 10 GB/second would be perfect for that. If you think that your computer is fast enough to crawl many URLs, then change the following line in the code:
echo $url . PHP_EOL;
to :
crawl_site($url);
Note: The code isn’t perfect, and there may be errors when crawling some URLs. I don’t recommend crawling the found URLs again unless you have a supercomputer and a high-speed Internet connection. Feel free to make the crawler better, more awesome and faster on GitHub.
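If you still want to follow the found URLs, one way to keep the crawl under control is a queue with a page limit instead of unlimited recursion. The following is only a sketch under assumptions: crawl_bounded and $get_links are hypothetical names, and the small in-memory "site" stands in for real pages so that the example runs without a network connection. In the real crawler, $get_links would wrap file_get_html() and perfect_url().

```php
<?php
// Sketch of a breadth-first crawl that stops after $max_pages pages.
// $get_links is a hypothetical callable returning the absolute URLs on a page.
function crawl_bounded(string $start, callable $get_links, int $max_pages = 50): array {
    $queue = array($start);        // URLs waiting to be crawled
    $seen = array($start => true); // URLs already queued, to avoid repeats
    $visited = array();            // URLs crawled, in order

    while ($queue && count($visited) < $max_pages) {
        $url = array_shift($queue);
        $visited[] = $url;
        foreach ($get_links($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// A made-up site: each URL maps to the links found on that page
$site = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/', 'http://example.com/c'),
    'http://example.com/b' => array(),
    'http://example.com/c' => array('http://example.com/a'),
);
$get_links = function (string $url) use ($site): array {
    return $site[$url] ?? array();
};

print_r(crawl_bounded('http://example.com/', $get_links, 3));
// Visits "/", "/a" and "/b", then stops at the 3-page limit
```

The $seen array plays the same role as $found_urls in the crawler: a page that links back to one already queued does not get crawled twice.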
If you have any problems / suggestions / feedback, echo it in the comments. Your Feedback is my happiness.