Create An Advanced Crawler In PHP



Crawlers are everywhere. They move from page to page, fetching many webpages every second. The biggest of them all is Google’s: it has already crawled almost 90% of the web and is still crawling. If you’re like me and want to create a more advanced crawler with options and features, this post will help you.

When I created my Search Engine test project, it needed an awesome crawler. I knew that developing one on my own would be hard, especially when you have to follow robots.txt and other rules. So, I googled and found this great library, which is already 10 years old!

I tested it and it was great. It has all the cool features: it follows robots.txt, and it acts like a browser when visiting sites, so the site owner won’t even know a crawler just visited.

Download & Install

You first have to download the library from the project’s website. The latest version (0.82 or newer) is recommended. After downloading, extract the main folder PHPCrawl_{version_number} to your site’s source root. Rename this folder to “PHPCrawl”, so that when a newer version’s code is extracted, the folder name stays the same.

We’ll use the files in this extracted folder to create our crawler.

Here’s a directory tree:

PHPCrawl
    ... Crawler Library Files Here
index.php

We will make our crawler in the index.php file.

Customize Crawler

As I said before, we’ll write the code for the crawler in the index.php file. You can put it in another file if you want.

First, we include the crawler library file inside PHP tags:

<?php
include("PHPCrawl/libs/PHPCrawler.class.php");

Then, we extend the PHPCrawler class to suit our needs. With this code, we override the function that is called whenever a document is loaded by the crawler:

class SBCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $p){
        $pageurl = $p->url;
        $status = $p->http_status_code;
        $source = $p->source;
        if($status == 200 && $source != ""){
            // Page successfully received
            echo $pageurl."<br/>";
        }
    }
}

Note that in the above code, we check whether the HTTP status code of the page is 200 (OK) and whether the source (the page’s HTML code) is not empty. If both conditions are met, the URL that was just crawled is printed.
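You can of course do something more useful than echoing here. For example, here’s a minimal sketch that appends each successfully crawled URL to a log file instead (the file name crawled.log is just an example, not something the library requires); it goes inside handleDocumentInfo() in place of the echo:

if($status == 200 && $source != ""){
    // Append the successfully crawled URL to a log file
    // ("crawled.log" is just an example file name)
    file_put_contents("crawled.log", $pageurl."\n", FILE_APPEND);
}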

If you want to manipulate or extract contents from the DOM, you can use a PHP library like Simple HTML DOM. Here is an example of using Simple HTML DOM to get the contents of the “title” tag:

if($status == 200 && $source != ""){
    // Page successfully received
    $title = "";
    $html = str_get_html($source);
    if(is_object($html)){
        $t = $html->find("title", 0);
        if($t){
            $title = $t->innertext;
        }
        echo $title." - ".$pageurl."<br/>";
        $html->clear();
        unset($html);
    }
}

Note that you have to include the Simple HTML DOM library, and that the above code goes inside the custom SBCrawler class we created before.

If you want to do more with the document info the crawler hands you, see the full list of fields you can get from it here: phpcrawl.cuab.de/classreferences/PHPCrawlerDocumentInfo/overview.html. It’s recommended that you put your code inside the status code check we did in the SBCrawler class.

Create Crawler

Now, we write the crawl() function, which creates the crawler object and sets the options for crawling:

function crawl($u){
    $C = new SBCrawler();
    $C->setURL($u);
    $C->addContentTypeReceiveRule("#text/html#");
    $C->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i"); /* We don't want to crawl non-HTML pages */
    $C->setTrafficLimit(2000 * 1024);
    $C->obeyRobotsTxt(true); /* Should we follow robots.txt? */
    $C->go();
}

You can set more options too: http://phpcrawl.cuab.de/classreferences/PHPCrawler/overview.html. Here are some of the main options that I think you’ll need:

$C->obeyRobotsTxt(true)
    Whether the crawler should obey the robots.txt file of the domain.

$C->setFollowMode(2)
    The follow mode. Default is 2 (crawl only URLs on the same domain); 0 follows URLs to all domains.

$C->setPageLimit(0)
    How many pages the crawler should crawl. Default is 0 (no limit).

$C->setUserAgentString("PHPCrawl")
    The user agent string. The crawler visits sites with this string so that the owner knows what visited.

$C->enableCookieHandling(true)
    Whether the crawler should act like a browser (storing cookies).

The possibilities are endless. As you can see, you can customize many aspects of how the crawler behaves.
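As an illustration, here’s a sketch of the crawl() function from above with some of these options set explicitly (the page limit and user agent string are just example values, not recommendations):

function crawl($u){
    $C = new SBCrawler();
    $C->setURL($u);
    $C->addContentTypeReceiveRule("#text/html#");
    $C->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i");
    $C->obeyRobotsTxt(true);             // Obey robots.txt
    $C->setFollowMode(2);                // Stay on the same domain
    $C->setPageLimit(50);                // Example: stop after 50 pages
    $C->setUserAgentString("SBCrawler"); // Example user agent string
    $C->enableCookieHandling(true);      // Store cookies like a browser
    $C->go();
}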
Start Crawling

It’s time to give a URL to crawl() and start the crawling process. Here’s how you start crawling the site http://subinsb.com:

crawl("http://subinsb.com");

The crawling process can be time consuming, so don’t worry if the script takes a long time to load. If you want to display the results immediately after each page is crawled, call flush() after the echo in the SBCrawler class. Here’s a sample:

if($status == 200 && $source != ""){
    // Page successfully received
    echo $pageurl."<br/>";
    flush();
}

You can see a demo of this flush() here: http://demos.subinsb.com/php/advanced-crawler/

The PHPCrawl library will be updated over time and its features will keep growing. I hope you made it through this tutorial and got a great result. If you had a problem with something, feel free to comment and I’ll be here to help you out. 😀
src="//subinsblog.disqus.com/count.js" async></script> <footer id="colophon" class="site-footer" role="contentinfo"> <div class="footer-nav"> <a href="/about#profiles"> <blockquote>സുബിന്‍ സിബി<br/><span title="a.k.a SubinSiby">@subins2000</span><br/>/bin/su</blockquote> </a> <div style="text-align: center;"> <div class="side"> <ul> <li><a href="/about">About</a></li> <li><a href="/search">Search</a></li> <li><a target="_blank" href="/sitemap.xml">Sitemap</a></li> <li><a target="_blank" href="/index.xml ">RSS Feed</a></li> <li><a href="/about#subscribe">Subscribe</a></li> </ul> </div> <div class="side"> <ul> <li><a target="_blank" href="https://github.com/subins2000">GitHub</a></li> <li><a target="_blank" href="https://gitlab.com/subins2000">GitLab</a></li> <li><a target="_blank" href="https://t.me/SubinSiby">Telegram</a></li> <li><a target="_blank" href="https://aana.site/@subins2000">Mastodon</a></li> <li><a target="_blank" href="https://twitter.com/SubinSiby">Twitter</a></li> </ul> </div> </div> </div> <div class="site-info"> <p style="text-align: center;"> This blog is created, written and maintained by Subin Siby. It is built with <u><a target="_blank" href="https://gohugo.io/">Hugo</a></u> and hosted by <u><a target="_blank" href="https://gitlab.com">GitLab</a></u>. </p> <div class="smilie" title="This is just a text transformed into a smiley face by CSS. Check the source !">:-)</div> <span></span><span></span><span></span><span></span><span></span><span></span><span></span> </div> </footer> <svg style="display:none;"> <defs> <path id="shape-tab" d="M100,25C79.568,25,84.815,0,59.692,0H11.149C5.027,0,0,4.634,0,10.385V25"></path> <path id="shape-tab-right" d="M0,25C20.432,25,15.185,0,40.308,0h48.543C94.973,0,100,4.634,100,10.385V25"></path> <path id="shape-search" d="M3.327,96.684C5.534,98.895,8.434,100,11.331,100s5.797-1.105,8.004-3.316l21.321-21.322 c5.83,3.188,12.393,4.897,19.223,4.897c10.721,0,20.798-4.172,28.379-11.752c15.646-15.644,15.646-41.105,0.002-56.755 C80.677,4.171,70.598,0,59.877,0C49.159,0,39.08,4.171,31.504,11.752c-7.581,7.576-11.756,17.655-11.756,28.376 c0,6.832,1.71,13.396,4.9,19.226L3.327,80.675C-1.096,85.094-1.096,92.266,3.327,96.684z M59.879,68.938 c-7.695,0-14.93-2.996-20.371-8.435c-5.443-5.442-8.439-12.677-8.439-20.375c0-7.695,2.996-14.933,8.439-20.372 c5.441-5.44,12.676-8.436,20.369-8.436c7.698,0,14.936,2.997,20.378,8.436c11.231,11.236,11.231,29.515,0,40.747 C74.812,65.941,67.575,68.938,59.879,68.938z"></path> </defs> </svg> <script type="text/javascript" src="https://cdn.jsdelivr.net/gh/google/code-prettify@master/loader/run_prettify.js?lang=perl&skin=sunburst&autoload=true" async="async"></script> <script type="text/javascript" src="https://s7.addthis.com/js/300/addthis_widget.js#pubid=ra-4fb122922591215b" async="async"></script> <script type="application/javascript"> var doNotTrack = false; if (!doNotTrack) { (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-33042168-1', 'auto'); ga('send', 'pageview'); } </script> <script data-goatcounter="https://subinsbdotcom.goatcounter.com/count" async src="//gc.zgo.at/count.js"></script> </body> </html>