Create An Advanced Crawler In PHP


Crawlers are everywhere. They move from webpage to webpage every second. The biggest of them are Google's, which are said to have crawled most of the web and are still crawling. If you're like me and want to create a more advanced crawler with options and features, this post will help you.

When I created my search engine test project, it needed a good crawler. I knew that developing one on my own would be hard, especially since it has to follow robots.txt and other rules. So I googled and found this great library, which is 10 years old!

I tested it and it was great. It has all the important features: it follows robots.txt, and it can present itself like a browser when visiting sites, so the site owner won't necessarily know a crawler just visited.

Download & Install

First, download the library from the project's website. Version 0.82 or later is recommended. After downloading, extract the main folder PHPCrawl_{version_number} to your site's source root. Rename this folder to "PHPCrawl", so that when a new version is extracted, the folder name stays the same.

We’ll use the files in this extracted folder to create our crawler.

Here's the directory tree:

PHPCrawl
    ... Crawler Library Files Here
index.php

We will make our crawler in the index.php file.

Customize Crawler

As I said before, we'll write the crawler code in the index.php file. You can put it in another file if you want.

First, we include the crawler library file inside PHP tags:

<?php
include("PHPCrawl/libs/PHPCrawler.class.php");

Then, we extend the PHPCrawler class to suit our needs. With this code, we override the method that is called each time the crawler has loaded a document:

class SBCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $p){
        $pageurl = $p->url;
        $status  = $p->http_status_code;
        $source  = $p->source;
        if($status == 200 && $source != ""){
            // Page successfully received
            echo $pageurl."<br/>";
        }
    }
}

Note that in the above code, we check whether the HTTP status code of the page is 200 (OK) and whether the source (the page's HTML code) is not empty. If both conditions are met, the URL just crawled is printed.
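If you'd rather not pull in an extra library for simple extraction, PHP's built-in DOMDocument can do the job too. Here's a minimal, self-contained sketch; the $source string is a made-up stand-in for the $p->source the crawler hands us:

```php
<?php
// Self-contained sketch: extract the <title> of a page with DOMDocument.
// $source stands in for $p->source from the crawler.
$source = '<html><head><title>Example Page</title></head><body></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($source); // @ silences warnings from messy real-world HTML
$titles = $dom->getElementsByTagName('title');
$title = $titles->length > 0 ? $titles->item(0)->textContent : '';
echo $title; // prints "Example Page"
```

Inside SBCrawler you would run this on $p->source, after the status code check we did above.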

If you want to manipulate or extract contents from the DOM, you can use PHP libraries like Simple HTML DOM. Here is an example of using Simple HTML DOM to get the contents of the <title> tag:

if($status == 200 && $source != ""){
    // Page successfully received
    $html = str_get_html($source);
    if(is_object($html)){
        $t = $html->find("title", 0);
        if($t){
            $title = $t->innertext;
            echo $title." - ".$pageurl."<br/>";
        }
        $html->clear();
        unset($html);
    }
}

Note that you have to include the Simple HTML DOM library, and that the above code goes inside the custom SBCrawler class we created before.

If you want to make use of the other document info the crawler returns, see the full list of fields you can get here: http://phpcrawl.cuab.de/classreferences/PHPCrawlerDocumentInfo/overview.html. It's recommended that you put your code inside the status code check we did in the SBCrawler class.

Create Crawler

Now, we write the function crawl, which creates the crawler object and sets the options for crawling:

function crawl($u){
    $C = new SBCrawler();
    $C->setURL($u);
    $C->addContentTypeReceiveRule("#text/html#");
    $C->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i"); /* We don't want to crawl non-HTML pages */
    $C->setTrafficLimit(2000 * 1024);
    $C->obeyRobotsTxt(true); /* Should we follow robots.txt? */
    $C->go();
}

You can set more options too: http://phpcrawl.cuab.de/classreferences/PHPCrawler/overview.html
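By the way, the argument to addURLFilterRule() in crawl() above is a plain PCRE pattern, so you can test it on its own without the library. A standalone check (the URLs are made up), remembering that PHPCrawl skips any URL that matches a filter rule:

```php
<?php
// Standalone check of the extension-filter pattern used in crawl().
// PHPCrawl skips any URL matching an addURLFilterRule() pattern.
$pattern = "#(jpg|gif|png|pdf|jpeg|svg|css|js)$#i";

$image = preg_match($pattern, "http://example.com/logo.png"); // matches: skipped
$page  = preg_match($pattern, "http://example.com/about");    // no match: crawled

echo $image . " " . $page; // prints "1 0"
```

Note that the pattern only looks at how the URL ends, so an image URL with a query string (like logo.png?v=2) would slip past it.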
Here are some of the main options that I think you'll need:

$C->obeyRobotsTxt(true)
    Whether the crawler should obey the domain's robots.txt file.

$C->setFollowMode(2)
    The follow mode. Default is 2 (crawl only URLs in the same domain). Use 0 to follow URLs on every domain.

$C->setPageLimit(0)
    How many pages the crawler should crawl. Default is 0 (no limit).

$C->setUserAgentString("PHPCrawl")
    The User-Agent string the crawler sends when visiting sites, so that the owner knows what visited.

$C->enableCookieHandling(true)
    Whether the crawler should act like a browser and store cookies.

The possibilities are endless. As you can see, you can customize how the crawler behaves in many ways.

Start Crawling

It's time to give a URL to crawl() and start the crawling process.
Here's how you start crawling the site "http://subinsb.com":

crawl("http://subinsb.com");

The crawling process can be time consuming, so don't worry if the script takes a long time to load. If you want to display the results immediately after each page is crawled, call flush() after the echo in the SBCrawler class. Here's a sample:

if($status == 200 && $source != ""){
    // Page successfully received
    echo $pageurl."<br/>";
    flush();
}

You can see a demo of this flush() technique here: http://demos.subinsb.com/php/advanced-crawler/

The PHPCrawl library will be updated over time, and its features will keep growing. I hope you made it through this tutorial and got a great result. If you had a problem with something, feel free to comment and I'll help you out.
😀