Create a Search Engine In PHP, MySQL | Part 2


This is Part 2 of “How To Create A Search Engine In PHP”. In this part, we’ll add code to the files that are displayed to users, create the database table, and build the crawler.

index.php

The main page of our Web Search is index.php. It has a simple design, nothing as advanced as Yahoo or the other big search engines.

<?include("inc/functions.php");?>
<html>
 <head>
  <?head("", array("index"));?>
 </head>
 <body>
  <?headerElem();?>
  <div class="container">
   <center>
    <h1>Web Search</h1>
    <form class="searchForm" action="search.php" method="GET">
     <input type="text" autocomplete="off" name="q" id="query"/>
     <div>
      <button>
       <svg class='shape-search' viewBox="0 0 100 100"><use xlink:href='#shape-search'></use></svg>
      </button>
     </div>
     <p>Free, Open Source &amp; Anonymous</p>
    </form>
   </center>
  </div>
 <?footer();?>
 </body>
</html>

We use simple helper functions in every file. The head() function accepts two parameters. The first parameter is the title of the web page. The main site title is automatically appended to the title you give in the first parameter when it’s displayed in the <title> tag.
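As a rough illustration of the title handling, here is a minimal sketch. The helper name pageTitle(), the site title “Web Search” and the “ - ” separator are assumptions made up for this example; the real head() lives in functions.php and may work differently.

```php
<?php
// Hypothetical helper: builds the text that would go inside the <title> tag.
// The default site title and the " - " separator are assumptions.
function pageTitle($pageTitle, $siteTitle = "Web Search") {
    $pageTitle = trim($pageTitle);
    if ($pageTitle === "") {
        // No page title given: show only the main site title.
        return $siteTitle;
    }
    // Otherwise the site title is appended to the page title.
    return $pageTitle . " - " . $siteTitle;
}

echo pageTitle("") . "\n";              // Web Search
echo pageTitle("php tutorials") . "\n"; // php tutorials - Web Search
```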

We also use SVG images to speed up the site. The “#shape-search” shape is defined in the functions.php file, which adds it dynamically below the footer.

url.php

When a user clicks a link in the search results, they actually go to this page first, which then redirects them to the original URL. When the site owner looks at their stats log, they will see our search engine URL as the referrer, and our site gets promoted.

<?
$url=isset($_GET['u']) ? $_GET['u']:""; // PHP has already URL-decoded $_GET values, so no urldecode() is needed
if(filter_var($url, FILTER_VALIDATE_URL) === FALSE || $url==""){
 header("Location: http://".$_SERVER['HTTP_HOST'], 302);
 exit;
}else{
 header("Location: ".$url);
 exit;
}
?>

We also check that the URL is valid, because we don’t want anyone abusing the redirect to mess with our site.
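If you want a stricter check, the validation could also be limited to http and https links. The following is only a sketch of that idea; safeRedirectTarget() is a hypothetical helper name and is not part of the files from this tutorial.

```php
<?php
// Hypothetical helper: returns $url when it is an absolute http(s) URL,
// otherwise returns $fallback.
function safeRedirectTarget($url, $fallback) {
    // Reject anything that is not a syntactically valid URL.
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return $fallback;
    }
    // Only allow web links; other schemes such as ftp are refused.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if ($scheme !== "http" && $scheme !== "https") {
        return $fallback;
    }
    return $url;
}

echo safeRedirectTarget("http://example.com/page?x=1", "/") . "\n";
echo safeRedirectTarget("not a url", "/") . "\n";
```

In url.php, the value returned by such a helper could be passed to header("Location: ...") with the home page as the fallback.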

search.php

We show the search results on this page. The results are obtained from functions.php and only displayed here.

<?include("inc/functions.php");?>
<html>
 <head>
 <?head($GLOBALS['displayQ'], array("search"));?>
 </head>
 <body>
  <?headerElem();?>
  <div class="container">
   <script>document.getElementById('query').focus();</script>
   <?
   if($GLOBALS['q']==""){
    echo "A query please...";
   }else{
    require "inc/spellcheck.php";
    $SC=new SpellCheck();
    $corSp=$SC->check($GLOBALS['q']);
    if($corSp!=""){
     echo "<p style='color:red;font-size:15px;margin-bottom:10px'>Did you mean ? <br/><a href='?q=$corSp'>".$corSp."</a></p>";
    }
    $res=getResults();
    if($res==0){
     echo "<p>Sorry, no results were found</p><h3>Search Suggestions</h3>";
     echo "<ul>";
     echo "<li>Check your spelling</li>";
     echo "<li>Try more general words</li>";
     echo "<li>Try different words that mean the same thing</li>";
     echo "</ul>";
    }else{
   ?>
    <div class="info">
     <strong><?echo $res['count'];?></strong>
     <?echo $res['count']==1 ? "result" : "results";?> found in <?echo $res['time'];?> seconds. Page <?echo $GLOBALS['p'];?>
    </div>
    <div class="results">
     <?
     foreach($res['results'] as $re){
      $t=htmlFilt($re[0]);
      $u=htmlFilt($re[1]);
      $d=htmlFilt($re[2]);
      if(strlen($GLOBALS['q']) > 2){
       $d=str_replace($GLOBALS['q'], "<strong>{$GLOBALS['q']}</strong>", $d);
      }
     ?>
      <div class="result">
       <h3 class="title">
        <a target="_blank" onmousedown="this.href='<?echo HOST;?>/url.php?u='+encodeURIComponent(this.getAttribute('data-href'));" data-href="<?echo $u;?>" href="<?echo $u;?>"><?echo strlen($t)>59 ? substr($t, 0, 59)."..":$t;?></a>
       </h3>
       <p class="url" title="<?echo $u;?>"><?echo $u;?></p>
       <p class="description"><?echo $d;?></p>
      </div>
     <?
     }
     ?>
    </div>
    <div class="pages">
     <?
     $count=(ceil($res['count']/10));
     $start=1;
     if($GLOBALS['p'] > 5 && $count > ($GLOBALS['p'] + 4)){
      $start=$GLOBALS['p']-4;
      $count=$count > ($start+8) ? ($start+8):$count;
     }elseif($GLOBALS['p'] > 5){
      if($GLOBALS['p']==$count){
       $start=$GLOBALS['p']-8;
      }elseif($GLOBALS['p']==($count-1)){
       $start=$GLOBALS['p']-7;
      }elseif($GLOBALS['p']==($count-2)){
       $start=$GLOBALS['p']-6;
      }elseif($GLOBALS['p']==($count-3)){
       $start=$GLOBALS['p']-5;
      }elseif($GLOBALS['p']==($count-4)){
       $start=$GLOBALS['p']-4;
      }
     }elseif($GLOBALS['p'] <= 5 && $count > ($GLOBALS['p'] + 5)){
      $count=$start+8;
     }
     $start=max(1, $start); // never link to page 0 or below when there are fewer than 9 pages
     for($i=$start;$i<=$count;$i++){
      $isC=$GLOBALS['p']==$i ? 'current':'';
      echo "<a href='?p=$i&q={$GLOBALS['q']}' class='button $isC'>$i</a>";
     }
     ?>
    </div>
   <? 
    }
   }
  ?>
  </div>
  <?footer();?>
 </body>
</html>

It’s a big file. As you can see in the <head> tag, we use the head() function and display the query as the title of the page. If the query is empty, the main title is shown and the string “A query please...” is displayed inside .container.
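The pagination block at the bottom of search.php shows a window of page links around the current page. The same idea can be sketched more compactly with a small function. This is a simplified variant, not the exact logic of the listing; pageWindow() is a hypothetical name and the window width of 9 is taken from the listing.

```php
<?php
// Hypothetical helper: returns the first and last page number to link to,
// keeping the current page near the middle of a window of $width links.
function pageWindow($current, $totalPages, $width = 9) {
    $half  = intdiv($width, 2);
    $start = max(1, $current - $half);               // do not go below page 1
    $end   = min($totalPages, $start + $width - 1);  // do not go past the last page
    $start = max(1, $end - $width + 1);              // widen to the left near the end
    return array($start, $end);
}

// Prints the page numbers 6 to 14, with the current page 10 in brackets.
list($start, $end) = pageWindow(10, 20);
for ($i = $start; $i <= $end; $i++) {
    echo $i == 10 ? "[$i] " : "$i ";
}
echo "\n";
```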

Database & Tables

It’s time to create the database and the table. You can name the database whatever you like; below is the SQL query to create the table, which is named search.

CREATE TABLE IF NOT EXISTS `search` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `title` varchar(60) NOT NULL,
 `url` text NOT NULL,
 `description` varchar(160) NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

We store the title, URL and description of each page in this table.
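Note that title is limited to 60 characters and description to 160. Depending on the SQL mode, MySQL either rejects longer values or truncates them with a warning, so it can help to shorten the text before saving it. A minimal sketch, assuming the crawled text is UTF-8; fitColumn() is a hypothetical helper that is not used in the listings of this tutorial.

```php
<?php
// Hypothetical helper: cuts $text down to at most $max characters
// so that it fits into a varchar($max) column.
function fitColumn($text, $max) {
    if (function_exists('mb_substr')) {
        // Character-aware cut (assumes UTF-8 input).
        return mb_substr($text, 0, $max, 'UTF-8');
    }
    // Fallback when the mbstring extension is missing: byte-based cut.
    return substr($text, 0, $max);
}

$title = fitColumn(str_repeat("a", 100), 60);
echo strlen($title) . "\n"; // 60
```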

Crawler

Now, let’s build the crawler, which runs in the background. In the previous part of the tutorial, we added the PHPCrawl and Simple HTML DOM libraries to the crawler folder. In this part, we add code to some of the files we created before.

crawlStatus.txt

The status of the crawler (0 or 1) is stored in this file. If it’s 0, runCrawl.php will call bgCrawl.php, which runs the crawler in the background. So that the crawler can be started, we’ll put 0 inside this file.
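Reading and writing this flag can be wrapped in two small functions. This is just a sketch; isCrawlRunning() and setCrawlRunning() are hypothetical names, and treating a missing file as “not running” is an assumption.

```php
<?php
// Hypothetical helper: true when the status file contains "1".
function isCrawlRunning($statusFile) {
    // Assumption: a missing file means the crawler is not running.
    $c = is_file($statusFile) ? file_get_contents($statusFile) : "0";
    return trim((string) $c) === "1";
}

// Hypothetical helper: writes "1" (running) or "0" (stopped) to the status file.
function setCrawlRunning($statusFile, $running) {
    file_put_contents($statusFile, $running ? "1" : "0", LOCK_EX);
}

// Demo on a temporary file instead of the real crawlStatus.txt.
$file = tempnam(sys_get_temp_dir(), "crawl");
setCrawlRunning($file, false);
var_dump(isCrawlRunning($file)); // bool(false)
setCrawlRunning($file, true);
var_dump(isCrawlRunning($file)); // bool(true)
unlink($file);
```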

crawl.php

<?
if(!isset($crawlToken) || $crawlToken!=798821){
 die("Error");
}
$dir=realpath(dirname(__FILE__));
function shutdown(){ 
 global $dir;
 $a=error_get_last(); 
 if($a==null){
  echo "No errors";
 }else{
  file_put_contents($dir."/crawlStatus.txt", "0");
  include($dir."/runCrawl.php");
 }
}
register_shutdown_function('shutdown'); 
set_time_limit(30);
include($dir."/../inc/config.php");
include($dir."/PHPCrawl/libs/PHPCrawler.class.php");
include($dir."/simple_html_dom.php");
function addURL($t, $u, $d){
 global $dbh;
 if($t!="" && filter_var($u, FILTER_VALIDATE_URL)){
  $check=$dbh->prepare("SELECT `id` FROM `search` WHERE `url`=?");
  $check->execute(array($u));
  $t=preg_replace("/\s+/", " ", $t);
  $t=substr($t, 0, 1)==" " ? substr_replace($t, "", 0, 1):$t;
  $t=substr($t, -1)==" " ? substr_replace($t, "", -1, 1):$t;
  $t=html_entity_decode($t, ENT_QUOTES);
  $d=html_entity_decode($d, ENT_QUOTES);
  echo $u."\n";
  if($check->rowCount()==0){
   $sql=$dbh->prepare("INSERT INTO `search` (`title`, `url`, `description`) VALUES (?, ?, ?)");
   $sql->execute(array(
    $t,
    $u,
    $d
   ));
  }else{
   $sql=$dbh->prepare("UPDATE `search` SET `description` = ?, `title` = ? WHERE `url`=?");
   $sql->execute(array(
    $d,
    $t,
    $u
   ));
  }
 }
}
class WSCrawler extends PHPCrawler { 
 function handleDocumentInfo(PHPCrawlerDocumentInfo $p){ 
  $u=$p->url;
  $c=$p->http_status_code;
  $s=$p->source;
  if($c==200 && $s!=""){
   $html = str_get_html($s);
   if(is_object($html)){
    $d="";
    $do=$html->find("meta[name=description]", 0);
    if($do){
     $d=$do->content;
    }
    $t=$html->find("title", 0);
    if($t){
     $t=$t->innertext;
     addURL($t, $u, $d);
    }
    $html->clear(); 
    unset($html);
   }
  }
 }
}
function crawl($u){
 $C = new WSCrawler();
 $C->setURL($u);
 $C->addContentTypeReceiveRule("#text/html#");
 $C->addURLFilterRule("#\.(jpg|gif|png|pdf|jpeg|svg|css|js)$# i");
 $C->obeyRobotsTxt(true);
 $C->setUserAgentString("DingoBot (http://search.subinsb.com/about/bot.php)");
 $C->setFollowMode(0);
 $C->go();
}
// Get the last indexed URLs (If there isn't, use default URL's) & start Crawling
$last=$dbh->query("SELECT `url` FROM search");
$count=$last->rowCount();
if($count < 2){
 crawl("http://subinsb.com"); // The Default URL #1
 crawl("http://demos.subinsb.com"); // The Default URL #2
}else{
 $urls=$last->fetchAll();
 for($i=0;$i<2;$i++){
  $index=rand(0, $count-1);
  crawl($urls[$index]['url']);
 }
}
?>

The crawl() function is the one that starts the crawling. When crawl.php is executed, PHP checks whether there are 2 or more rows in the search table. If there are, PHP fetches the URLs from MySQL, picks 2 of them at random and sends them to crawl() for crawling. If not, the 2 default URLs are crawled.
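The seed selection described above can be sketched as a standalone function. Unlike the rand() loop in crawl.php, this variant uses array_rand(), so the same URL is never picked twice. pickSeeds() is a hypothetical name and the example URLs are placeholders.

```php
<?php
// Hypothetical helper: returns $n random URLs from $urls,
// or the default URLs when fewer than $n are known.
function pickSeeds(array $urls, $n, array $defaults) {
    if ($n < 1 || count($urls) < $n) {
        return $defaults;
    }
    // array_rand() returns a single key for $n == 1, so cast to array.
    $keys  = (array) array_rand($urls, $n);
    $seeds = array();
    foreach ($keys as $k) {
        $seeds[] = $urls[$k];
    }
    return $seeds;
}

$defaults = array("http://example.com", "http://example.org");
$known    = array("http://a.example", "http://b.example", "http://c.example");
print_r(pickSeeds($known, 2, $defaults));   // two different URLs from $known
print_r(pickSeeds(array(), 2, $defaults));  // the two defaults
```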

To prevent others from starting a crawl by going directly to the URL, we only continue if the variable $crawlToken is set and its value is 798821.

Each time the crawler loads a document, the handleDocumentInfo() function is called. It checks that the status code of the document is 200 (OK) and that the contents of the document are not empty. If both checks pass, the title tag and the meta description tag are read from the document. If the title is empty or not found, the record is not inserted into the table. The description field is not mandatory.

The whitespace in the title is collapsed, HTML entities in the title and description are decoded, and the page URL is printed out.
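The clean-up applied to the title in addURL() can be tried on its own with a sketch like the one below. cleanTitle() is a hypothetical name, and it uses trim() in place of the two substr() checks from the listing.

```php
<?php
// Hypothetical helper mirroring the title clean-up done in addURL().
function cleanTitle($t) {
    $t = preg_replace("/\s+/", " ", $t);          // collapse runs of whitespace
    $t = trim($t);                                 // drop leading and trailing space
    return html_entity_decode($t, ENT_QUOTES);     // turn &amp; etc. back into characters
}

echo cleanTitle("  Hello \n  World &amp; Co ") . "\n"; // Hello World & Co
```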

bgCrawl.php

<?
$dir=realpath(dirname(__FILE__));
$GLOBALS['bgFull']="";
$crawlToken=798821; // must match the value checked at the top of crawl.php
include($dir."/crawl.php");
?>

This file is called by runCrawl.php to run the crawler in the background.

runCrawl.php

<?
$dir=realpath(dirname(__FILE__));
$s="$dir/crawlStatus.txt";
$c=file_get_contents($s);
if($c==0){
 function execInbg($cmd) { 
  if (substr(php_uname(), 0, 7) == "Windows"){ 
   pclose(popen("start /B ". $cmd, "r")); 
  }else{ 
   exec($cmd . " > /dev/null &"); 
  } 
 }
 execInbg("php -q $dir/bgCrawl.php");
 file_put_contents($s, 1);
 echo "Started Running";
}else{
 echo "Currently Running";
}
?>

To start the crawling process in the background, open this page in the browser. The background crawling only starts if the content of the crawlStatus.txt file is “0”. Once bgCrawl.php has been launched with the shell command, the content of crawlStatus.txt is replaced with “1”, indicating that the crawling process has started in the background.
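The shell command in runCrawl.php places the script path directly into the command string, which can break if the folder name contains spaces. A small sketch of building the command with escapeshellarg() instead; buildCrawlCommand() is a hypothetical helper, and “php” is assumed to be on the PATH as in the listing.

```php
<?php
// Hypothetical helper: builds the shell command that starts bgCrawl.php.
function buildCrawlCommand($script) {
    // escapeshellarg() quotes the path so that spaces are handled safely.
    return "php -q " . escapeshellarg($script);
}

echo buildCrawlCommand(__DIR__ . "/bgCrawl.php") . "\n";
```

The result could then be handed to execInbg() in place of the hand-built string.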

Every time you update your site, you should visit this page to start the background crawling. Otherwise, no crawling is done and the search table won’t grow.

This concludes the 2nd part. The next part will cover additional features, including adding Spell Check to the search engine, and what more you can do with it.