Create A Search Engine In PHP, MySQL | Part 1


Whenever we search the web with Google, Yahoo or Bing, we get curious about how the search engine works and how it gathers all the information on the web. When we programmers started coding, many of us wanted to create a search engine. I attempted to create one three years ago and ultimately failed. Since then I have improved my coding skills, knowledge and ideas, so I decided to create a new search engine with automatic crawling and indexing.

It works great and I made it open source. You can see the source code on GitHub, or follow along as we write the code step by step in this series. This is Part 1 of “How To Create A Search Engine In PHP”. You can see the finished product or download the finished source code.

Features

  • No JavaScript, minimal CSS, lightweight
  • Simple, fast and easy to use
  • Has its own crawler, which obeys robots.txt
  • Only indexes HTML pages (no JS files, CSS files or images)
  • Guards against SQL injection (queries use bound parameters)
  • Guards against XSS (the query is escaped before it is printed)
  • HTML5 (SVG images)
  • Uses PDO for Database Queries
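
The SQL injection, XSS and PDO points above can be illustrated with a small sketch. This is not the search engine's actual code: it assumes an in-memory SQLite database (so it runs without a MySQL server), and the pages table and the searchTitles() helper are hypothetical names made up for the example. It only shows how a bound PDO parameter and htmlspecialchars() keep user input from being treated as SQL or HTML.

```php
<?php
// Illustrative sketch only: SQLite in memory stands in for MySQL,
// and the "pages" table and its rows are invented for this example.
$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, url TEXT)");
$dbh->exec("INSERT INTO pages (title, url) VALUES ('PHP Manual', 'http://example.com/php')");
$dbh->exec("INSERT INTO pages (title, url) VALUES ('MySQL Guide', 'http://example.com/mysql')");

// Hypothetical helper: search the titles using a bound parameter.
function searchTitles(PDO $dbh, string $query): array {
    $sql = $dbh->prepare("SELECT title FROM pages WHERE title LIKE :q ORDER BY id");
    // The user's text is sent as data and is never concatenated into the SQL string.
    $sql->bindValue(':q', '%' . $query . '%');
    $sql->execute();
    return $sql->fetchAll(PDO::FETCH_COLUMN);
}

// A normal query, and a classic injection attempt that is treated as plain text.
$normal    = searchTitles($dbh, 'php');
$injection = searchTitles($dbh, "' OR '1'='1");

// Escape user input before echoing it back into an HTML page.
$userInput = '<script>alert(1)</script>';
$safe = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

echo count($normal), " result(s) for the normal query\n";
echo count($injection), " result(s) for the injection attempt\n";
echo $safe, "\n";
```

Because the query text travels separately from the SQL statement, the quote characters in the injection attempt have no special meaning to the database; they are just part of the string being matched.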

Here’s a summary of the search engine we’re going to create:

  • The crawler runs in the background
  • The crawler gets the title and meta[name=description] from each page and inserts them into the database.</li>
<li>When a user submits a query, MySQL searches for it in the title, URL and description, and the results are displayed.</li>
<li>Links in the search results actually point to the search engine’s own URL, so users go through it when they click an external result.</li>
<li>We display the stats of the search engine (URLs crawled, last indexed URLs).</li>
</ul>
<p>In this part of the tutorial, we make the base files and add code to them. Here is the directory tree:</p>
<ul>
<li>about</li>
<li>    bot.php</li>
<li>    index.php</li>
<li>    stats.php</li>
<li>cdn</li>
<li>    css</li>
<li>        all.css</li>
<li>        index.css</li>
<li>        search.css</li>
<li>crawler</li>
<li>    PHPCrawl</li>
<li>        – A lot of files inside this</li>
<li>    bgCrawl.php</li>
<li>    crawl.php</li>
<li>    crawlStatus.txt</li>
<li>    runCrawl.php</li>
<li>    simple_html_dom.php</li>
<li>inc</li>
<li>    config.php</li>
<li>    error.php</li>
<li>    functions.php</li>
<li>    track.php</li>
<li>.htaccess</li>
<li>index.php</li>
<li>robots.txt</li>
<li>search.php</li>
<li>url.php</li>
</ul>
<p>As you can see, there are a lot of files in the search engine we are going to create.</p>
<h2 id="crawler">crawler</h2>
<p>We use two external libraries in our search engine. Download them and place them inside the <strong>crawler</strong> folder.</p>
<table class="table">
<tr> <th> Library </th> <th> Website </th> <th> Location </th> </tr>
<tr> <td> PHPCrawl </td> <td> <a href="http://phpcrawl.cuab.de" target="_blank">phpcrawl.cuab.de</a> </td> <td> crawler/PHPCrawl </td> </tr>
<tr> <td> SimpleHTMLDom </td> <td> <a href="http://simplehtmldom.sourceforge.net" target="_blank">simplehtmldom.sourceforge.net</a> </td> <td> crawler/simple_html_dom.php </td> </tr>
</table>
<p>Simple HTML DOM consists of a single file: <strong>simple_html_dom.php</strong>.
Place it directly in the <strong>crawler</strong> folder.</p>
<p>There is a bug in the PHPCrawl library that triggers PHP errors when a site has an invalid <strong>robots.txt</strong> file. It can be fixed by hand. Open the <strong>crawler/PHPCrawl/libs/PHPCrawlerRobotsTxtParser.class.php</strong> file and search for:</p>
<pre class="prettyprint"><code>// First, get all "Disallow:"-paths</code></pre>
<p>You will see a function named “buildRegExpressions”. In it, replace the code:</p>
<pre class="prettyprint"><code>$disallow_pathes[] = trim($match[1]);</code></pre>
<p>with the following code:</p>
<pre class="prettyprint"><code>if(isset($match[1])){
  $disallow_pathes[] = trim($match[1]);
}</code></pre>
<p>There is another bug in the same file. Search for:</p>
<pre class="prettyprint"><code>$non_follow_path_complpete</code></pre>
<p>At the first match, replace the line of code:</p>
<pre class="prettyprint"><code>$non_follow_path_complpete = $normalized_base_url.substr($disallow_pathes[$x], 1); // "http://www.foo.com/bla/"</code></pre>
<p>with:</p>
<pre class="prettyprint"><code>$non_follow_path_complpete = $normalized_base_url."/".substr($disallow_pathes[$x], 1); // "http://www.foo.com/bla/"</code></pre>
<p>Those are all the bug fixes.</p>
<h2 id="cdn">cdn</h2>
<p><strong>cdn</strong>, short for <strong>Content Delivery Network</strong>, is the folder where we store our <strong>CSS</strong> and <strong>JS</strong> files. Since our search engine doesn’t have any JS files, we don’t have to create the <strong>js</strong> folder. But we should create the <strong>css</strong> folder inside <strong>cdn</strong>.
In the <strong>css</strong> directory, create three files named <strong>all.css</strong>, <strong>index.css</strong> and <strong>search.css</strong>.</p>
<h3 id="strong-all-css-strong"><strong>all.css</strong></h3>
<pre class="prettyprint"><code>*{
  margin: 0px;
  border: 0px;
  padding: 0px;
  font-family: Ubuntu;
}
body{
  font-size: 14px;
  line-height: 20px;
}
.header{
  position: absolute;
  top: 0px;
  left: 0px;
  right: 0px;
  background: #EEE;
  padding: 5px 20px;
}
.header form{
  display: inline-block;
}
.header .logo{
  margin-right: 10px;
  text-decoration: none;
  font-size: 22px;
  display: inline-block;
  vertical-align: middle;
  color: black;
}
.header .searchForm #query{
  width: 180px;
  -webkit-transition: 1s;
  transition: 1s;
  height: 30px;
  font-size: 15px;
  margin-left: 10px;
}
.header .searchForm #query:focus{
  width: 300px;
}
.container{
  margin: 50px auto 30px auto;
  display: table;
}
.footer{
  position: fixed;
  bottom: 0px;
  left: 0px;
  right: 0px;
  background: #EEE;
  padding: 5px 20px;
}
.footer a{
  margin-right: 5px;
}
/* Default Styles */
a[href]{
  text-decoration: none;
}
a[href]:hover{
  text-decoration: underline;
}
h1,h2,h3,h4{
  margin: 10px 0px;
}
input[type="text"]{
  padding: 3px 5px;
  outline: none;
  border: 1px solid #EEE;
  border-radius: 2px;
  display: inline-block;
  vertical-align: middle;
}
input[type="text"]:hover{
  border: 1px solid #DDD;
}
input[type="text"]:active, input[type="text"]:focus{
  border: 1px solid #4585F1;
}
button, .button{
  padding: 5px 10px;
  display: inline-block;
  vertical-align: middle;
  background-color: rgb(77, 144, 254);
  background-image: -webkit-gradient(linear,left top,left bottom,from(rgb(77, 144, 254)),to(rgb(71, 135, 237)));
  background-image: -webkit-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
  background-image: -moz-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
  background-image: -ms-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
  background-image: -o-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
  /* The unprefixed syntax takes a "to" direction */
  background-image: linear-gradient(to bottom,rgb(77, 144, 254),rgb(71, 135, 237));
  cursor: pointer;
  color: white;
  border-radius: 2px;
}
button:hover, .button:hover{
  box-shadow: inset 0 2px 2px rgba(0,0,0,0.1);
}
.searchForm .shape-search{
  height: 14px;
  padding: 4px 10px;
  width: 35px;
  fill: white;
}
.searchForm button{
  padding: 0px;
  height: 30px;
}</code></pre>
<p>We use the <strong>Ubuntu</strong> font because I like it; I hope you like it too. We will add the <strong>fonts.googleapis.com</strong> stylesheet later so that the font loads properly.</p>
<h3 id="index-css">index.css</h3>
<pre class="prettyprint"><code>.container .searchForm{
  margin-top: 20px;
  padding: 5px;
}
.container .searchForm #query{
  width: 400px;
  padding: 4px 5px;
  font-size: 15px;
}
.container .searchForm .shape-search{
  width: 100px;
  height: 15px;
}
.container .searchForm div, .container .searchForm p{
  margin-top: 10px;
}</code></pre>
<p>This file styles the index page.</p>
<h3 id="search-css">search.css</h3>
<pre class="prettyprint"><code>.header .searchForm #query{
  width: 400px !important;
}
.header{
  text-align: center;
}
.container{
  width: 500px;
}
.container .info{
  color: gray;
}
.results{
  width: 500px;
  margin-top: 25px;
}
.result{
  margin: 20px 0px;
}
.result .title{
  margin-bottom: 0px;
  font-size: 17px;
}
.result .url{
  font-size: 13px;
  color: #006621;
  overflow: hidden;
  height: 20px;
}
.pages{
  width: 500px;
  text-align: center;
  margin: 0px auto 10px auto;
}
.pages .button{
  margin: 0px 3px 3px;
}
.pages .button.current{
  background: black;
}</code></pre>
<p>This file styles the search results page (search.php).</p>
<p>Everything’s finished for the <strong>cdn</strong> folder. Let’s move on to the <strong>inc</strong> folder.</p>
<h2 id="inc">inc</h2>
<p>This folder contains the includable files and plays a major role in our search engine. The main file is <strong>functions.php</strong>, which is used by all the static pages (which are themselves generated by dynamic PHP pages).</p>
<h3 id="functions-php">functions.php</h3>
<pre class="prettyprint"><code><?php
include("config.php");
session_start();

/* Query and page number. PHP already URL-decodes $_GET values. */
$GLOBALS['q'] = isset($_GET['q']) ? htmlspecialchars($_GET['q']) : "";
$GLOBALS['displayQ'] = $GLOBALS['q'];
$GLOBALS['q'] = strtolower($GLOBALS['q']);
$GLOBALS['p'] = isset($_GET['p']) && is_numeric($_GET['p']) ? max(1, (int) $_GET['p']) : 1;
$GLOBALS['dbh'] = $dbh;

/* Escape angle brackets so they are shown as text */
function htmlFilt($s){
  $s = str_replace("<", "&lt;", $s);
  $s = str_replace(">", "&gt;", $s);
  return $s;
}

function head($title="", $IncOtherCss=array()){
  $title = $title=="" ? "Web Search" : $title." - Web Search";
  /* Display The <title> tag */
  echo "<title>$title</title>";
  /* The Stylesheets */
  $cssFiles = array_merge(
    array(
      "all",
      "http://fonts.googleapis.com/css?family=Ubuntu"
    ),
    $IncOtherCss
  );
  foreach($cssFiles as $css){
    $url = preg_match("/http/", $css) ? $css : HOST."/cdn/css/$css.css";
    echo "<link href='".$url."' async='async' rel='stylesheet' />";
  }
  echo "<meta name='description' content=\"Search the world's information, webpages, problems and more. Find exactly what you're looking for easily without any ads and other distractions\"/>";
}

function headerElem(){
  // header() is already a function in PHP
  $header = "<div class='header'><a class='logo' href='".HOST."'><strong>Web Search</strong></a><form method='GET' action='".HOST."/search.php' class='searchForm'><input id='query' type='text' placeholder='Your Query' autocomplete='off' name='q' value=\"".$GLOBALS['displayQ']."\"/><button><svg viewBox='0 0 100 100' class='shape-search'><use xlink:href='#shape-search'></use></svg></button></form></div>";
  echo $header;
}

function footer(){
  include("track.php");
  $footer = "<div class='footer'><a href='".HOST."/about'>About</a><a href='".HOST."/about/stats.php'>Stats</a><a href='".HOST."/about/bot.php'>Dingo</a><div style='float:right;'>© Copyright Subin ".date("Y")."</div></div>";
  $footer .= '
  <svg style="display:none;">
    <defs>
      <path id="shape-search" d="m 85.160239,99.375807 c -0.828634,-0.2952 -6.785463,-5.7653 -13.237403,-12.1558 l -11.730795,-11.6193 -6.6207,2.1766 C 33.39036,84.411907 12.627177,75.515007 3.6984912,56.407007 -5.6131124,36.479667 3.2485677,12.852077 23.649685,3.2119175 29.682607,0.36117746 31.404851,0.01130746 39.459783,5.746345e-5 50.03976,-0.01474254 56.477126,1.9699875 63.781566,7.4987375 77.935087,18.211537 83.541599,36.335507 77.964788,53.348307 l -2.173424,6.6304 11.744957,11.7927 c 9.455968,9.4945 11.857728,12.4888 12.323668,15.3642 1.319521,8.1432 -6.925821,15.008903 -14.69975,12.2402 z m -33.083916,-33.2366 c 5.656943,-2.5459 11.702601,-8.5732 14.216739,-14.1737 8.683318,-19.34281 -5.230473,-40.9032 -26.331076,-40.80178 -26.510022,0.12741 -38.6174499,32.4025 -18.836563,50.21308 2.774148,2.4979 7.069057,5.1647 9.656546,5.9963 5.992636,1.9257 15.497206,1.375 21.294354,-1.2339 z"></path>
    </defs>
  </svg>';
  echo $footer;
}

/* Results */
function getResults(){
  $q = $GLOBALS['q'];
  $p = $GLOBALS['p'];
  $start = ($p-1)*10; // Offset of the first result on this page
  if($q!=null){
    $starttime = microtime(true);
    $sql = $GLOBALS['dbh']->prepare("SELECT `title`, `url`, `description` FROM search WHERE `title` LIKE :q OR `url` LIKE :q OR `description` LIKE :q ORDER BY id");
    $sql->bindValue(":q", "%$q%");
    $sql->execute();
    $endtime = microtime(true);
    // No matches, or the requested page lies beyond the last result
    if($sql->rowCount()==0 || $start>=$sql->rowCount()){
      return 0;
    }else{
      $duration = $endtime - $starttime;
      $res = array();
      $res['count'] = $sql->rowCount();
      $res['time'] = round($duration, 4);
      $limitedResults = $GLOBALS['dbh']->prepare("SELECT `title`, `url`, `description` FROM search WHERE `title` LIKE :q OR `url` LIKE :q OR `description` LIKE :q ORDER BY id LIMIT :start,:limit");
      $limitedResults->bindValue(":q", "%$q%");
      $limitedResults->bindValue(":start", $start, PDO::PARAM_INT);
      $limitedResults->bindValue(":limit", 10, PDO::PARAM_INT);
      $limitedResults->execute();
      while($r=$limitedResults->fetch()){
        $res["results"][] = array($r['title'], $r['url'], $r['description']);
      }
      return $res;
    }
  }
}
?></code></pre>
<p><strong>functions.php</strong> builds the contents of the <strong>head</strong> tag, the <strong>header</strong> and the <strong>footer</strong>.
It also fetches the search results and sets up the global variables.</p>
<h3 id="config-php">config.php</h3>
<p>The configuration file. It contains the information needed to connect to the database.</p>
<pre class="prettyprint"><code><?php
/* Configuration */
ini_set("display_errors", "on"); // Do you want to see the errors ?
define("HOST", "http://search.subinsb.com"); // No '/' at the end
$host = "localhost"; // Hostname
$port = "3306"; // MySQL Port; Default : 3306
$user = "username"; // Username Here
$pass = "password"; // Password Here
$db = "search"; // Database Name
$dbh = new PDO('mysql:dbname='.$db.';host='.$host.';port='.$port, $user, $pass);
/* End Configuration */
?></code></pre>
<p>Set the <strong>HOST</strong> constant to the URL of your search engine, so that you don’t need to change the domain on any other page.</p>
<h3 id="error-php">error.php</h3>
<p>The page that is displayed if an error occurs.</p>
<pre class="prettyprint"><code><html>
  <head></head>
  <body>
    <h1>404 Not Found</h1>
    <p>
      The requested file was not found on this server.
    </p>
  </body>
</html></code></pre>
<p>The next part will contain the code of <strong>index.php</strong> and other files.
It will be published shortly.</p>
</article> </div> </div> </div> </div> </div> </body> </html>