Anyone who searches the web with Google, Yahoo or Bing gets curious about how it works and how it gathers all that information. When we programmers started coding, many of us wanted to create a search engine. I attempted one too, about 3 years back, and ultimately failed. Since then I have improved my coding skills, knowledge and ideas, so I decided to create a new search engine with automatic crawling and indexing.

It works great and I made it open source. You can see the source code on GitHub, or we can go through the code together in this series. This is Part 1 of “How To Create A Search Engine In PHP”. You can see the finished product or download the finished source code.

Features

  • No JS, Less CSS, Light
  • Simple, Fast & Easy to Use
  • Has Its Own Crawler & Obeys robots.txt
  • Only Indexes HTML Pages (No JS Files, CSS Files, Images)
  • Not Vulnerable to SQL Injection
  • XSS Attack Not Possible
  • HTML 5 (SVG Images)
  • Uses PDO for Database Queries

Here’s a summary of the search engine we’re going to create:

  • The crawler runs in the background.
  • The crawler gets the <title> and meta[name=description] from each page and inserts them into the database.
  • When a user searches for a query, MySQL looks for it in the title, URL and description and displays the results.
  • Result links point to a URL on the search engine itself, so a click on an external result goes through the search engine first.
  • We display the stats of the search engine (URLs crawled, last indexed URLs).
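
To make the crawler step a bit more concrete, here is a minimal sketch of pulling the <title> and the meta description out of a page. It uses PHP’s built-in DOMDocument as a stand-in (the series bundles Simple HTML DOM for the crawler), and the function name extractPageInfo is made up for this example.

```php
<?php
// Hypothetical helper, not part of the tutorial's source: reads the <title>
// and the meta description from an HTML string with PHP's DOM extension.
function extractPageInfo(string $html): array {
    $info = array("title" => "", "description" => "");
    if (trim($html) === "") {
        return $info; // nothing to parse
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNodes = $doc->getElementsByTagName("title");
    if ($titleNodes->length > 0) {
        $info["title"] = trim($titleNodes->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName("meta") as $meta) {
        if (strtolower($meta->getAttribute("name")) === "description") {
            $info["description"] = trim($meta->getAttribute("content"));
            break;
        }
    }
    return $info;
}

$info = extractPageInfo(
    "<html><head><title>Example</title>" .
    "<meta name='description' content='A sample page'></head><body></body></html>"
);
echo $info["title"], " | ", $info["description"], "\n";
```

The two values returned here correspond to the title and description columns that the search query reads later on.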

In this part of the tutorial, we make the base files and add code to them. Here is the directory tree :

  • about
  •     bot.php
  •     index.php
  •     stats.php
  • cdn
  •     css
  •         all.css
  •         index.css
  •         search.css
  • crawler
  •     PHPCrawl
  •         (a lot of files inside this folder)
  •     bgCrawl.php
  •     crawl.php
  •     crawlStatus.txt
  •     runCrawl.php
  •     simple_html_dom.php
  • inc
  •     config.php
  •     error.php
  •     functions.php
  •     track.php
  • .htaccess
  • index.php
  • robots.txt
  • search.php
  • url.php

As you can see, there are a lot of files in the search engine we are going to create.

crawler

We use two external libraries in our search engine. Download them and place them inside the crawler folder.

Library         Website                          Location
PHPCrawl        phpcrawl.cuab.de                 crawler/PHPCrawl
SimpleHTMLDom   simplehtmldom.sourceforge.net    crawler/simple_html_dom.php

Simple HTML DOM has only one file: simple_html_dom.php. Place it directly in the crawler folder.

There is a bug in the PHPCrawl library that causes PHP errors when a site has an invalid robots.txt file. There’s a way to fix that. Open the file crawler/PHPCrawl/libs/PHPCrawlerRobotsTxtParser.class.php and search for :

// First, get all "Disallow:"-paths

You will see a function named “buildRegExpressions”. In it, replace the code :

$disallow_pathes[] = trim($match[1]);

with the following code :

if(isset($match[1])){
 $disallow_pathes[] = trim($match[1]);
}
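
Why does the isset() check help? When a capture group at the end of a pattern does not take part in the match, PHP leaves that index out of the matches array, so reading $match[1] raises a notice. The sketch below reproduces this with a simplified loop; the regular expression and the function name collectDisallowPaths are assumptions for illustration, not PHPCrawl’s actual code.

```php
<?php
// Simplified stand-in for the library's parsing loop. The pattern is an
// assumption chosen for illustration, not PHPCrawl's real expression.
function collectDisallowPaths(array $lines): array {
    $paths = array();
    foreach ($lines as $line) {
        if (preg_match('/^Disallow:\s*(\S+)?/i', trim($line), $match)) {
            // A bare "Disallow:" line matches, but $match[1] is not set,
            // so the guard skips it instead of triggering a PHP notice.
            if (isset($match[1])) {
                $paths[] = trim($match[1]);
            }
        }
    }
    return $paths;
}

$paths = collectDisallowPaths(array("User-agent: *", "Disallow: /private/", "Disallow:"));
print_r($paths); // contains only "/private/"
```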

There is also another bug in the same file. Search for :

$non_follow_path_complpete

In the first match, replace the line of code :

$non_follow_path_complpete = $normalized_base_url.substr($disallow_pathes[$x], 1); // "http://www.foo.com/bla/"

with :

$non_follow_path_complpete = $normalized_base_url."/".substr($disallow_pathes[$x], 1); // "http://www.foo.com/bla/"
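
Here is a small worked example of what this second fix changes. It assumes the normalized base URL has no trailing slash, which is what the comment in the library code suggests; the sample values and the two function names are made up for this example.

```php
<?php
// Original behaviour: substr() cuts off the leading "/" of the path
// and nothing puts it back.
function joinBeforeFix(string $baseUrl, string $disallowPath): string {
    return $baseUrl . substr($disallowPath, 1);
}

// Patched behaviour: the "/" separator is restored.
function joinAfterFix(string $baseUrl, string $disallowPath): string {
    return $baseUrl . "/" . substr($disallowPath, 1);
}

// Assumed sample values
$base = "http://www.foo.com";
$path = "/bla/";

echo joinBeforeFix($base, $path), "\n"; // http://www.foo.combla/
echo joinAfterFix($base, $path), "\n";  // http://www.foo.com/bla/
```

Without the separator, the disallowed URL never matches a real page, so the crawler could end up ignoring the rule.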

Those are all the bug fixes.

cdn

cdn, short for Content Delivery Network, is the folder where we store our CSS and JS files. Since our search engine doesn’t have any JS files, we don’t have to create the js folder, but we do need a css folder inside cdn. In the css directory, create 3 files named all.css, index.css and search.css.

all.css

*{
 margin:0px;
 border:0px;
 padding:0px;
 font-family: Ubuntu;
}
body{
 font-size:14px;
 line-height: 20px;
}
.header{
 position: absolute;
 top: 0px;
 left: 0px;
 right: 0px;
 background: #EEE;
 padding: 5px 20px;
}
.header form{
 display: inline-block;
}
.header .logo{
 margin-right:10px;
 text-decoration: none;
 font-size:22px;
 display: inline-block;
 vertical-align:middle;
 color:black;
}
.header .searchForm #query{
 width:180px;
 -webkit-transition:1s;
 transition:1s;
 height:30px;
 font-size:15px;
 margin-left:10px;
}
.header .searchForm #query:focus{
 width:300px;
}
.container{
 margin: 50px auto 30px auto;
 display:table;
}
.footer{
 position: fixed;
 bottom: 0px;
 left: 0px;
 right: 0px;
 background: #EEE;
 padding: 5px 20px;
}
.footer a{
 margin-right:5px;
}
/* Default Styles */
a[href]{
 text-decoration:none;
}
a[href]:hover{
 text-decoration:underline;
}
h1,h2,h3,h4{
 margin:10px 0px;
}
input[type="text"]{
 padding:3px 5px;
 outline: none;
 border: 1px solid #EEE;
 border-radius: 2px;
 display: inline-block;
 vertical-align:middle;
}
input[type="text"]:hover{
 border: 1px solid #DDD;
}
input[type="text"]:active, input[type="text"]:focus{
 border: 1px solid #4585F1;
}
button, .button{
 padding:5px 10px;
 display: inline-block;
 vertical-align:middle;
 background-color: rgb(77, 144, 254);
 background-image: -webkit-gradient(linear,left top,left bottom,from(rgb(77, 144, 254)),to(rgb(71, 135, 237)));
 background-image: -webkit-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
 background-image: -moz-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
 background-image: -ms-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
 background-image: -o-linear-gradient(top,rgb(77, 144, 254),rgb(71, 135, 237));
 background-image: linear-gradient(to bottom,rgb(77, 144, 254),rgb(71, 135, 237)); /* the unprefixed syntax takes "to bottom", not "top" */
 cursor: pointer;
 color:white;
 border-radius:2px;
}
button:hover, .button:hover{
 box-shadow: inset 0 2px 2px rgba(0,0,0,0.1);
}
.searchForm .shape-search{
 height: 14px;
 padding:4px 10px;
 width: 35px;
 fill:white;
}
.searchForm button{
 padding: 0px;
 height:30px;
}

Yeah, we use the Ubuntu font because I like it. Hope you like it too. We will add the fonts.googleapis.com stylesheet later so that the font loads properly.

index.css

.container .searchForm{
 margin-top:20px;
 padding:5px;
}
.container .searchForm #query{
 width: 400px;
 padding:4px 5px;
 font-size:15px;
}
.container .searchForm .shape-search{
 width: 100px;
 height:15px;
}
.container .searchForm div, .container .searchForm p{
 margin-top:10px;
}

This file styles the index page.

search.css

.header .searchForm #query{
 width:400px !important;
}
.header{
 text-align:center;
}
.container{
 width:500px;
}
.container .info{
 color: gray;
}
.results{
 width: 500px;
 margin-top:25px;
}
.result{
 margin:20px 0px;
}
.result .title{
 margin-bottom: 0px;
 font-size: 17px;
}
.result .url{
 font-size: 13px;
 color: #006621;
 overflow:hidden;
 height:20px;
}
.pages{
 width: 500px;
 text-align: center;
 margin: 0px auto 10px auto;
}
.pages .button{
 margin: 0px 3px 3px;
}
.pages .button.current{
 background:black;
}

This file styles the search results page (search.php).

Everything’s finished for cdn folder. Let’s move on to inc folder.

inc

Contains the includable files. This folder plays a major role in our search engine. The main file is functions.php, which every page of the search engine includes.

functions.php

<?php
include("config.php");

session_start();
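// $_GET values are already URL-decoded; decoding again would mangle queries containing "+" or "%"
$GLOBALS['q']=isset($_GET['q']) ? htmlspecialchars($_GET['q']):"";
$GLOBALS['displayQ']=$GLOBALS['q'];
$GLOBALS['q']=strtolower($GLOBALS['q']);
// Page number: a positive integer, defaulting to 1
$GLOBALS['p']=isset($_GET['p']) && is_numeric($_GET['p']) ? max(1, (int)$_GET['p']):1;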
$GLOBALS['dbh']=$dbh;
function htmlFilt($s){
 $s=str_replace("<", "&lt;", $s);
 $s=str_replace(">", "&gt;", $s);
 return $s;
}
function head($title="", $IncOtherCss=array()){
 $title=$title=="" ? "Web Search" : $title." - Web Search";
 /* Display The <title> tag */
 echo "<title>$title</title>";
 /* The Stylesheets */
 $cssFiles = array_merge(
  array(
   "all",
   "http://fonts.googleapis.com/css?family=Ubuntu"
  ),
  $IncOtherCss
 );
 foreach($cssFiles as $css){
  $url=preg_match("/http/", $css) ? $css : HOST."/cdn/css/$css.css";
  echo "<link href='".$url."' async='async' rel='stylesheet' />";
 }
 echo "<meta name='description' content=\"Search the world's information, webpages, problems and more. Find exactly what you're looking for easily without any ads and other distractions\"/>";
}
function headerElem(){ // header() is already a function in PHP
 $header = "<div class='header'><a class='logo' href='".HOST."'><strong>Web Search</strong></a><form method='GET' action='".HOST."/search.php' class='searchForm'><input id='query' type='text' placeholder='Your Query' autocomplete='off' name='q' value=\"".$GLOBALS['displayQ']."\"/><button><svg viewBox='0 0 100 100' class='shape-search'><use xlink:href='#shape-search'></use></svg></button></form></div>";
 echo $header;
}
function footer(){
include("track.php");
 $footer = "<div class='footer'><a href='".HOST."/about'>About</a><a href='".HOST."/about/stats.php'>Stats</a><a href='".HOST."/about/bot.php'>Dingo</a><div style='float:right;'>&copy; Copyright Subin ".date("Y")."</div></div>";
 $footer.='
 <svg style="display:none;">
  <defs>
  <path id="shape-search" d="m 85.160239,99.375807 c -0.828634,-0.2952 -6.785463,-5.7653 -13.237403,-12.1558 l -11.730795,-11.6193 -6.6207,2.1766 C 33.39036,84.411907 12.627177,75.515007 3.6984912,56.407007 -5.6131124,36.479667 3.2485677,12.852077 23.649685,3.2119175 29.682607,0.36117746 31.404851,0.01130746 39.459783,5.746345e-5 50.03976,-0.01474254 56.477126,1.9699875 63.781566,7.4987375 77.935087,18.211537 83.541599,36.335507 77.964788,53.348307 l -2.173424,6.6304 11.744957,11.7927 c 9.455968,9.4945 11.857728,12.4888 12.323668,15.3642 1.319521,8.1432 -6.925821,15.008903 -14.69975,12.2402 z m -33.083916,-33.2366 c 5.656943,-2.5459 11.702601,-8.5732 14.216739,-14.1737 8.683318,-19.34281 -5.230473,-40.9032 -26.331076,-40.80178 -26.510022,0.12741 -38.6174499,32.4025 -18.836563,50.21308 2.774148,2.4979 7.069057,5.1647 9.656546,5.9963 5.992636,1.9257 15.497206,1.375 21.294354,-1.2339 z"></path>
  </defs>
 </svg>';
 echo $footer;
}
/* Results */
function getResults(){
 $q=$GLOBALS['q'];
 $p=$GLOBALS['p'];
 $start=($p-1)*10;
 if($q!=null){
  $starttime = microtime(true);
  $sql=$GLOBALS['dbh']->prepare("SELECT `title`, `url`, `description` FROM search WHERE `title` LIKE :q OR `url` LIKE :q OR `description` LIKE :q ORDER BY id");
  $sql->bindValue(":q", "%$q%");
  $sql->execute();
  $endtime = microtime(true);
  if($sql->rowCount()==0 || $start>=$sql->rowCount()){ // >= also catches a page that starts right after the last row
   return 0;
  }else{
   $duration = $endtime - $starttime;
   $res=array();
   $res['count']=$sql->rowCount();
   $res['time']=round($duration, 4);
   $limitedResults=$GLOBALS['dbh']->prepare("SELECT `title`, `url`, `description` FROM search WHERE `title` LIKE :q OR `url` LIKE :q OR `description` LIKE :q ORDER BY id LIMIT :start,:limit");
   $limitedResults->bindValue(":q", "%$q%");
   $limitedResults->bindValue(":start", $start, PDO::PARAM_INT);
   $limitedResults->bindValue(":limit", 10, PDO::PARAM_INT);
   $limitedResults->execute();
   while($r=$limitedResults->fetch()){
    $res["results"][]=array($r['title'], $r['url'], $r['description']);
   }
   return $res;
  }
 }
}
?>

 

functions.php outputs the contents of the <head> tag, the header and the footer. It also fetches the search results and sets up the global variables.
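
The paging arithmetic inside getResults() can be sketched on its own. The helper below is hypothetical (the name pageWindow and its return keys are not part of the search engine); it only mirrors the idea of 10 results per page with pages numbered from 1.

```php
<?php
// Hypothetical helper mirroring the paging arithmetic of getResults():
// 10 results per page, pages numbered from 1.
function pageWindow(int $page, int $totalRows, int $perPage = 10): array {
    $page = max(1, $page);            // never go below the first page
    $offset = ($page - 1) * $perPage; // first row of the requested page
    return array(
        "offset" => $offset,
        "totalPages" => (int) ceil($totalRows / $perPage),
        "hasResults" => $offset < $totalRows,
    );
}

print_r(pageWindow(3, 25)); // offset 20, totalPages 3, hasResults true
print_r(pageWindow(2, 10)); // offset 10, totalPages 1, hasResults false
```

The second call shows the edge case worth checking: with exactly 10 rows, page 2 starts at offset 10 and has nothing to show.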

config.php

The configuration file. Contains information about the database.

<?php
/* Configuration */
ini_set("display_errors", "on"); // Do you want to see the errors ?
define("HOST", "http://search.subinsb.com"); // No '/' at the end

$host = "localhost"; // Hostname
$port = "3306"; // MySQL Port; Default : 3306
$user = "username"; // Username Here
$pass = "password"; // Password Here
$db = "search"; // Database Name
$dbh = new PDO('mysql:dbname='.$db.';host='.$host.';port='.$port, $user, $pass);

/* End Configuration */
?>

Set the HOST constant to the domain of your search engine, so that you don’t need to change the domain on any other page.
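
If you want to see how the connection string is put together, here is a small sketch. The buildDsn function is hypothetical, and the charset parameter and the exception error mode are additions of mine that are not in the config.php above; treat them as optional.

```php
<?php
// Hypothetical helper: builds the same DSN string as config.php, plus a
// charset parameter (an addition, not part of the original file).
function buildDsn(string $db, string $host, string $port, string $charset = "utf8mb4"): string {
    return "mysql:dbname=$db;host=$host;port=$port;charset=$charset";
}

$dsn = buildDsn("search", "localhost", "3306");
echo $dsn, "\n"; // mysql:dbname=search;host=localhost;port=3306;charset=utf8mb4

// With real credentials, the connection could then be opened like this.
// PDO::ERRMODE_EXCEPTION makes failed queries throw instead of failing silently.
// $dbh = new PDO($dsn, $user, $pass, array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
```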

error.php

The page that is displayed when an error occurs.

<html>
 <head></head>
 <body>
  <h1>404 Not Found</h1>
  <p>
  The requested file was not found on this server.
  </p>
 </body>
</html>

The next part will contain the code of index.php and other files. It will be published shortly.