Basic Site-Search Optimization with Sphider, a Lightweight PHP Search Engine
Reposted from: https://blog.csdn.net/chijiaodaxie/article/details/48714373
Basic site-search optimization
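1>. Overview:
A site search engine is, as the name suggests, a search engine for the information inside one website. As the web has grown, the website has become the most important public face of a company or organization. Every day, large numbers of potential customers, partners, investors, and analysts visit a company's site, and the impression it leaves directly shapes their opinion of the company. According to an IDC survey, when users arrive at a site and cannot quickly find the information they need, 50% of them leave immediately, and 60% of those never come back, which means the company permanently loses 30% of its potential customers (50% × 60% = 30%).
Admittedly, I have not verified these figures, but they do show that the quality and accuracy of site-search results matter a great deal to the user experience.
Note: every "search engine" below refers specifically to site search.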
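2>. What a good search engine offers:
Apart from an eye-catching, attractive search box, the most important thing a good site search does is return the results the user is looking for quickly and accurately. A few additional features also improve the experience:
1. Autocomplete: it reduces input mistakes and also lets us recommend products and product categories;
2. Auto-correction: compared with "no results found", showing some results always keeps a few more visitors from bouncing. It is a double-edged sword, though: if the suggested terms are poor, the search looks unprofessional;
3. Related searches: content recommendations based on synonyms give visitors search ideas they had not thought of, widen coverage, and increase clicks;
4. Result filtering, or searching within results: gives users a more precise search experience;
5. Sort order: if results carry several attributes, such as download count, click count, or rating on the form site, sorting by them puts what the user cares about near the top;
6. Advanced search…
I won't list everything here.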
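3>. Core technologies of a search engine:
The technologies involved include word segmentation (luckily we are not building a Chinese-language site), page crawling and analysis (full-text retrieval), index building, search matching and ranking algorithms, and algorithms for statistics, association, and recommendation over search keywords.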
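To make "index building" and "search matching" concrete, here is a minimal in-memory sketch. It is not Sphider's implementation: the function names build_index() and search_all() and the sample documents are made up for illustration, and the weight is just an occurrence count.

```php
<?php
// Hypothetical in-memory inverted index: keyword => [document id => weight].
function build_index(array $documents): array
{
    $index = [];
    foreach ($documents as $id => $text) {
        // Split on anything that is not a lowercase letter or digit.
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $index[$word][$id] = ($index[$word][$id] ?? 0) + 1;
        }
    }
    return $index;
}

// AND search: keep only documents that contain every query word,
// ranked by the summed weight of the matched words.
function search_all(array $index, array $words): array
{
    $scores = null;
    foreach ($words as $word) {
        $postings = $index[$word] ?? [];
        if ($scores === null) {
            $scores = $postings;
            continue;
        }
        $merged = [];
        foreach (array_intersect_key($scores, $postings) as $id => $score) {
            $merged[$id] = $score + $postings[$id];
        }
        $scores = $merged;
    }
    $scores = $scores ?? [];
    arsort($scores);
    return array_keys($scores);
}

$index = build_index([
    1 => 'annual tax form',
    2 => 'tax return form and tax guide',
    3 => 'rental agreement',
]);
print_r(search_all($index, ['tax', 'form'])); // document 2 first, then document 1
```

A database-backed engine stores the same keyword-to-document mapping in tables instead of arrays, which is what the steps later in this post do.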
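4>. Common approaches to site search:
Using the APIs offered by large commercial search engines:
for example the APIs of google and yahoo, or baidu in China
Pros: simple and low-effort; register an account and call the API;
Cons: 1. you cannot see how results are actually ranked, cannot control what is displayed, and cannot easily tune it;
2. the free tiers show ads, which hurts the experience.
Implementing it yourself:
2.1) SQL LIKE queries:
The code is simple, but the search string has to match exactly or nothing is found, and multi-keyword searches give poor results;
2.2) Search based on word segmentation:
There are several open-source projects. In the Java world the best known is Lucene, which has a good reputation and many other projects built on top of it; it can support projects with large data volumes and is fast. Java projects can borrow from it and integrate it directly. But because it is written in Java, and our search first had to be embedded in the form site, we had to pass on it reluctantly;
The second open-source project I reviewed was Sphinx. It is written in C++, is also fairly mainstream and fast, has a larger index, and is less precise than Lucene. I tried its precompiled exe, and it really is fast; reportedly it searches hundreds of millions of records in milliseconds and builds the index in hours, so we may consider it later.
The problem both share is that, impressive as they are, they are large projects that expose an API for us to call; changing their internal algorithms would mean tracing through a lot of code. Given our time constraints, we chose a lightweight search framework, sphider. The whole project is under 300K of code, probably ten thousand lines at most, and it is built on mysql and php. Reading it was a real pleasure; it was exactly what we were looking for. Tracing through its code once gives a rough overall picture of how a search engine works. The search shown below is a port of its search functionality with minor tweaks: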
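step 1>
Obtain the source data: to search page content, we would need a crawler that fetches the pages to be searched and stores their content in the database. But our HTML converted from PDF is not very regular, and the vast majority of search keywords are cat or post names, so we skipped this step and use the database fields directly as the source data;
step 2>
Tokenize, extract keywords, and build the index. Code: SearchindexController.class.php
1) New tables: a keywords table, several keyword_post tables, and several keyword_cat tables
2) Entry points: indexAllPost(), indexAllCat()
Note: tokenizing is slow. Later I stepped outside the thinkphp framework and wrote a faster version in plain SQL; indexing 400,000 records takes about 4-5 min, which is of course still far behind Sphinx and the like. I will publish it when I get the chance.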
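Code logic:
->indexPost() & indexCat(): fetch the source content
->unique_word_array(): tokenize each record according to several rules (separators, ignored words, stemming)
->compute the weight (fields such as description are auto-generated and meaningless, so the weight is simply the number of times the keyword occurs in the source data)
->save_post_keywords(): insert into the keywords table (keyword is unique)
->save_post_keywords(): then insert into the relation tables (delete_post_keywords_relation() first deletes this post's rows from the relation tables; splitting into several tables eases the load on any single table)
That completes tokenization!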
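The tokenize-and-count step can be sketched in a few lines. This is a simplified stand-in for unique_word_array(), not the author's code: it skips stemming and accent handling, and the function name tokenize_with_counts() and its parameters are assumptions made for illustration.

```php
<?php
// Hypothetical, simplified stand-in for unique_word_array():
// no stemming and no accent handling.
function tokenize_with_counts(string $text, array $stopWords = [], int $minLength = 2): array
{
    // Lowercase, then turn every run of non-alphanumeric characters into one space.
    $text = strtolower($text);
    $text = trim(preg_replace('/[^a-z0-9]+/', ' ', $text));
    if ($text === '') {
        return [];
    }

    $counts = [];
    foreach (explode(' ', $text) as $word) {
        // Skip words that are too short and words on the ignore list.
        if (strlen($word) <= $minLength || in_array($word, $stopWords, true)) {
            continue;
        }
        // The occurrence count doubles as the keyword's weight.
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    return $counts;
}

$counts = tokenize_with_counts('Tax return form: the annual tax form', ['the']);
print_r($counts); // tax => 2, return => 1, form => 2, annual => 1
```

Each keyword and its count would then be written to the keywords table and to one of the relation tables, as the controller does.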
<?php
namespace Admin\Controller;
use Admin\Controller\CommonController;
/**
 * @author chijiaodaxie
 */
class SearchindexController extends CommonController {
    // Only indexAllPost() and indexAllCat() are public; they are the API.
    private $keywords_array = array();
    public function indexAllPost($reindex = 0){
        set_time_limit(0);
        $this->keywords_array = $this->get_all_keyword();
        // dump($this->keywords_array);
        $post_db = D('Post');
        if($reindex){
            echo "Re-indexing all posts at full throttle<br/><br/>";
            $posts = $post_db->field(array('postid', 'name', 'catid'))->where(array('status'=>4))->select();
        }else{
            echo "Incremental indexing: indexing newly added posts<br/><br/>";
            $posts = $post_db->field(array('postid', 'name', 'catid'))->where(array(/*'status'=>4, */'indexed'=>0))->select();
        }
        // Treat an empty or failed query as "nothing to index".
        if(!$posts){
            $posts = array();
        }
        $post_ids = array();
        $failed_ids = array();
        foreach($posts as $post){
            $res = $this->indexPost($post['postid'], $post['name'], $post['catid']);
            if($res){
                $post_ids[] = $post['postid'];
            }else{
                $failed_ids[] = $post['postid'];
            }
        }
        // Flag only the posts that were indexed; an empty id list would yield invalid SQL ("in ()").
        if($post_ids){
            $post_id_str = implode(",", $post_ids);
            $data['indexed'] = 1;
            $post_db->where('postid in ('.$post_id_str.')')->save($data);
        }
        if ($failed_ids){
            echo count($failed_ids)." posts were not indexed successfully: <br/><br/>Friendly reminder: watch out for garbled-encoding problems<br/><br/>";
            echo "Success ids: (".implode(", ", $post_ids).")<br/><br/>";
            echo "Failed ids: (".implode(", ", $failed_ids).")";
        }else{
            echo "Success ids: (".implode(", ", $post_ids).")<br/><br/>";
            echo "index success!!";
        }
    }
    private function indexPost($postid, $postname, $catid){
        $keywords = $this->unique_word_array($postname);
        // Drop the post's old relations first so re-indexing does not duplicate rows.
        $this->delete_post_keywords_relation($postid);
        $res = $this->save_post_keywords($keywords, $postid, $catid);
        if(!$res){
            return false;
        }
        return true;
    }
    public function indexAllCat($reindex = 0){
        set_time_limit(0);
        $this->keywords_array = $this->get_all_keyword();
        $cat_db = D('Category');
        if($reindex){
            echo "Re-indexing all cats at full throttle<br/><br/>";
            $cats = $cat_db->field(array('catid', 'catname', 'parentid'))->where(array('disabled'=>0, 'ismenu'=>1))->select();
        }else{
            echo "Incremental indexing: indexing newly added cats<br/><br/>";
            $cats = $cat_db->field(array('catid', 'catname', 'parentid'))->where(array('disabled'=>0, 'ismenu'=>1, 'indexed'=>0))->select();
        }
        // Treat an empty or failed query as "nothing to index".
        if(!$cats){
            $cats = array();
        }
        $cat_ids = array();
        $failed_ids = array();
        foreach($cats as $cat){
            $res = $this->indexCat($cat['catid'], $cat['catname'], $cat['parentid']);
            if($res){
                $cat_ids[] = $cat['catid'];
            }else{
                $failed_ids[] = $cat['catid'];
            }
        }
        // Flag only the cats that were indexed; an empty id list would yield invalid SQL ("in ()").
        if($cat_ids){
            $cat_id_str = implode(",", $cat_ids);
            $data['indexed'] = 1;
            $cat_db->where('catid in ('.$cat_id_str.')')->save($data);
        }
        if ($failed_ids){
            echo count($failed_ids)." cats were not indexed successfully: <br/><br/>Friendly reminder: watch out for garbled-encoding problems<br/><br/>";
            echo "Success ids: (".implode(", ", $cat_ids).")<br/><br/>";
            echo "Failed ids: (".implode(", ", $failed_ids).")";
        }else{
            echo "index success!!<br/><br/>";
            echo "Success ids: (".implode(", ", $cat_ids).")";
        }
    }
    private function indexCat($catid, $catname, $parentid){
        $keywords = $this->unique_word_array($catname);
        // Drop the cat's old relations first so re-indexing does not duplicate rows.
        $this->delete_cat_keywords_relation($catid);
        $res = $this->save_cat_keywords($keywords, $catid, $parentid);
        if(!$res){
            return false;
        }
        return true;
    }
    private function unique_word_array($str){
        if(is_array($str) && !empty($str)){
            $str = implode(" ", $str);
        }
        $str = strtolower($str);
        // Turn HTML non-breaking spaces into ordinary spaces before splitting.
        $str = preg_replace("/&nbsp;/", " ", $str);
        $str = preg_replace("/[\*\^\+\?\\\.\[\]\^\$\|\{\)\(\}~!\"\/@#£$%&=`´;><:,]+/", " ", $str);
        $str = preg_replace('/\s+/', ' ', $str);
        $arr = explode(" ", $str);
        $min_word_length = C('MIN_WORD_LENGTH');
        $word_upper_bound = C('WORD_UPPER_BOUND');
        $index_numbers = C('INDEX_NUMBER');
        $stem_words = C('STEM_WORDS');
        $common = $this->get_common_word();
        if ($stem_words == 1) {
            $stem_word = new \Common\Plugin\Stem();
            $newarr = array();
            foreach ($arr as $val) {
                $newarr[] = $stem_word->stem($val);
            }
            $arr = $newarr;
        }
        // Sort so that equal words sit next to each other and can be counted in one pass.
        sort($arr);
        reset($arr);
        $newarr = array();
        $i = 0;
        $counter = 1;
        $element = current($arr);
        if ($index_numbers == 1) {
            $pattern = "/[a-z0-9]+/";
        } else {
            $pattern = "/[a-z]+/";
        }
        $regs = array();
        for ($n = 0; $n < sizeof($arr); $n ++) {
            //check if word is long enough, contains alphabetic characters and is not a common word
            //to eliminate/count multiple instance of words
            $next_in_arr = next($arr);
            if ($next_in_arr != $element) {
                // $element = rtrim($element, ".,");
                // Strip a leading dash/apostrophe and a trailing dash, apostrophe, or 's.
                if (preg_match("/^(-|\\\')(.*)/", $element, $regs))
                    $element = $regs[2];
                if (preg_match("/(.*)(\\\'|-|\'s|\')$/", $element, $regs))
                    $element = $regs[1];
                if (strlen($element) > $min_word_length && preg_match($pattern, $this->remove_accents($element)) && (@ $common[$element] <> 1)) {
                    // [1] is the word, [2] its occurrence count (capped at WORD_UPPER_BOUND).
                    $newarr[$i][1] = $element;
                    $newarr[$i][2] = $counter;
                    $element = current($arr);
                    $i ++;
                    $counter = 1;
                } else {
                    $element = $next_in_arr;
                    $counter = 1;
                }
            } else {
                if ($counter < $word_upper_bound)
                    $counter ++;
            }
        }
        // var_dump($newarr);
        return $newarr;
    }
    /*
     * save the keywords to post related table
     */
    private function save_post_keywords($keywords, $post_id, $cat_id){
        $table_num = C('POST_KEYWORDS_NUM');
        $keywords_db = M('search_keywords');
        $inserts = array();
        foreach($keywords as $keyword){
            $word = $keyword[1];
            // Bucket 0..$table_num-1, taken from the first hex digit of the word's MD5.
            $wordmd5 = (int)(hexdec(substr(md5($word), 0, 1)))%$table_num;
            $weight = $keyword[2];
            if (strlen($word)<= 30) {
                if (!isset($this->keywords_array[$word])) {
                    // New keyword: insert it and cache its id for the rest of the run.
                    $data = array('keyword'=>$word, 'post_word_frequency'=>1);
                    $keyword_id = $keywords_db->add($data);
                    if(!$keyword_id){
                        return false;
                    }
                    $this->keywords_array[$word] = $keyword_id;
                }else{
                    // Known keyword: reuse its id and bump the frequency.
                    $keyword_id = $this->keywords_array[$word];
                    $keywords_db->where(array('keyword'=>$word))->setInc('post_word_frequency', 1);
                }
                $inserts[$wordmd5][] = array('post_id'=>$post_id, 'keyword_id'=>$keyword_id, 'weight'=>$weight * 10, 'cat_id'=>$cat_id);
            }
        }
        for ($i=0;$i<$table_num; $i++) {
            if (!empty($inserts[$i])) {
                // Assumes hex table suffixes, as delete_post_keywords_relation() uses;
                // identical to the original decimal suffix while POST_KEYWORDS_NUM <= 10.
                $post_keyword_db = M('search_post_keyword'.dechex($i));
                $res = $post_keyword_db->addAll($inserts[$i]);
                if(!$res){
                    return false;
                }
            }
        }
        return true;
    }
    /*
     * save the keywords to cat related table
     */
    private function save_cat_keywords($keywords, $cat_id, $parent_id){
        $table_num = C('CAT_KEYWORDS_NUM');
        $keywords_db = M('search_keywords');
        $inserts = array();
        foreach($keywords as $keyword){
            $word = $keyword[1];
            // Bucket 0..$table_num-1, taken from the first hex digit of the word's MD5.
            $wordmd5 = (int)(hexdec(substr(md5($word), 0, 1)))%$table_num;
            $weight = $keyword[2];
            if (strlen($word)<= 30) {
                if (!isset($this->keywords_array[$word])) {
                    // New keyword: insert it and cache its id for the rest of the run.
                    $data = array('keyword'=>$word, 'cat_word_frequency'=>1);
                    $keyword_id = $keywords_db->add($data);
                    if(!$keyword_id){
                        return false;
                    }
                    $this->keywords_array[$word] = $keyword_id;
                }else{
                    // Known keyword: reuse its id and bump the frequency.
                    $keyword_id = $this->keywords_array[$word];
                    $keywords_db->where(array('keyword'=>$word))->setInc('cat_word_frequency', 1);
                }
                $inserts[$wordmd5][] = array('cat_id'=>$cat_id, 'keyword_id'=>$keyword_id, 'weight'=>$weight * 10, 'parent_cat_id'=>$parent_id);
            }
        }
        for ($i=0;$i<$table_num; $i++) {
            // $inserts is keyed by the integer bucket; only the table name uses the hex suffix.
            // (Looking up $inserts[dechex($i)] silently skipped buckets 10 to 15.)
            if (!empty($inserts[$i])) {
                $cat_keyword_db = M('search_cat_keyword'.dechex($i));
                $res = $cat_keyword_db->addAll($inserts[$i]);
                if(!$res){
                    return false;
                }
            }
        }
        return true;
    }
    /*
     * before indexing, delete any relations that may already exist
     */
    private function delete_post_keywords_relation($postid){
        $table_num = C('POST_KEYWORDS_NUM');
        for($i=0; $i<$table_num; $i++){
            $char = dechex($i);
            $db_name = 'search_post_keyword'.$char;
            $relation_db = M($db_name);
            $relation_db->where(array('post_id'=>$postid))->delete();
        }
    }
    /*
     * before indexing, delete any relations that may already exist
     */
    private function delete_cat_keywords_relation($catid){
        $table_num = C('CAT_KEYWORDS_NUM');
        for($i=0; $i<$table_num; $i++){
            $char = dechex($i);
            $relation_db = M('search_cat_keyword'.$char);
            $relation_db->where(array('cat_id'=>$catid))->delete();
        }
    }
    /*
     * get all keywords that already exist, as keyword => keyword_id
     */
    private function get_all_keyword(){
        $keywords_array = array();
        $keywords_db = M('search_keywords');
        $keywords = $keywords_db->select();
        // dump($keywords);
        if($keywords){
            foreach($keywords as $keyword){
                $keywords_array[$keyword['keyword']] = $keyword['keyword_id'];
            }
        }
        return $keywords_array;
    }
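    /*
     * replace accented characters with plain ASCII letters
     */
    private function remove_accents($string) {
        // Note: the three-argument strtr() maps byte by byte, so this only works as
        // intended for single-byte input such as ISO-8859-1.
        return (strtr($string, "ÀÁÂÃÄÅÆàáâãäåæÒÓÔÕÕÖØòóôõöøÈÉÊËèéêëðÇçÐÌÍÎÏìíîïÙÚÛÜùúûüÑñÞßÿý",
            "aaaaaaaaaaaaaaoooooooooooooeeeeeeeeecceiiiiiiiiuuuuuuuunntsyy"));
    }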
    /*
     * common words that should be ignored
     */
    private function get_common_word(){
        $common = array();
        $lines = @file('common.txt');
        if (is_array($lines)) {
            // each() was removed in PHP 8; foreach behaves the same here.
            foreach ($lines as $word){
                $common[trim($word)] = 1;
            }
        }
        return $common;
    }
}
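The relation tables in the controller above are split by keyword: the first hex digit of the word's MD5, modulo the table count, selects the table. A minimal sketch of that mapping follows; keyword_bucket() and relation_table() are hypothetical helper names, and the hex suffix follows what the delete_*_keywords_relation() methods use.

```php
<?php
// Hypothetical helper: bucket number for a keyword.
function keyword_bucket(string $word, int $tableCount): int
{
    // The first hex digit of the MD5 hash is 0..15, so at most 16 buckets are ever used.
    return hexdec(substr(md5($word), 0, 1)) % $tableCount;
}

// Hypothetical helper: relation-table name for a keyword.
function relation_table(string $prefix, string $word, int $tableCount): string
{
    // dechex() keeps the suffix to one character (0-9, a-f).
    return $prefix . dechex(keyword_bucket($word, $tableCount));
}

echo relation_table('search_post_keyword', 'hello', 16), "\n"; // search_post_keyword5
```

Because only one hex digit is used, a table count above 16 leaves the extra tables empty, and the writer and the deleter have to agree on decimal or hex suffixes once the count exceeds 10.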
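step 3>
Search. Code: searchModel.class.php
New table: search_query_log
Entry point: getPostByQueryString()
Code logic:
->makeboollist(): tokenize the search terms (apart from a few special cases, the rules must match those used on the source data, since that is what keeps the search accurate; the special cases, such as excluded words and phrase searches, are extracted separately)
->search(): run a search for each class of extracted terms:
1) use SQL LIKE for phrases, and join the relation tables with the keywords table for single words and excluded words;
2) decide between AND and OR matching, merge the result sets, and work out the cost of doing so; watch the time complexity;
3) if nothing is found, enter the suggest stage, using the functions soundex() and levenshtein();
4) if there are results, check whether the search is restricted by cat and apply the corresponding ranking algorithm;
if there are none, show the default content;
5) look up the corresponding content by postid;
6) once the search completes, write a log record (its contents help with later statistics, analysis, and optimization);
7) render to the corresponding dynamic front end…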
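The suggest stage in item 3) can be sketched with PHP's built-in soundex() and levenshtein(). This is only an illustration of how the two functions combine, not the author's search() code; suggest_keyword(), its parameters, and the sample keywords are assumptions.

```php
<?php
// Hypothetical helper: suggest the closest indexed keyword for a misspelled word.
function suggest_keyword(string $word, array $knownKeywords, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($knownKeywords as $candidate) {
        // soundex() narrows the candidates to words that sound alike ...
        if (soundex($candidate) !== soundex($word)) {
            continue;
        }
        // ... and levenshtein() picks the one that needs the fewest edits.
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

var_dump(suggest_keyword('invioce', ['invoice', 'receipt', 'income'])); // string(7) "invoice"
```

In a real index the candidate list would come from the keywords table, and filtering by soundex first keeps the number of levenshtein() calls small.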
<?php
namespace Home\Model;
use Think\Model;
/**
 * @author chijiaodaxie
 */
class SearchModel extends Model{
    // HTML entities and the characters they are replaced with in makeboollist().
    private $entities = array(
        "&amp" => "&",
        "&apos" => "'",
        "&THORN;" => "Þ",
        "&szlig;" => "ß",
        "&agrave;" => "à",
        "&aacute;" => "á",
        "&acirc;" => "â",
        "&atilde;" => "ã",
        "&auml;" => "ä",
        "&aring;" => "å",
        "&aelig;" => "æ",
        "&ccedil;" => "ç",
        "&egrave;" => "è",
        "&eacute;" => "é",
        "&ecirc;" => "ê",
        "&euml;" => "ë",
        "&igrave;" => "ì",
        "&iacute;" => "í",
        "&icirc;" => "î",
        "&iuml;" => "ï",
        "&eth;" => "ð",
        "&ntilde;" => "ñ",
        "&ograve;" => "ò",
        "&oacute;" => "ó",
        "&ocirc;" => "ô",
        "&otilde;" => "õ",
        "&ouml;" => "ö",
        "&oslash;" => "ø",
        "&ugrave;" => "ù",
        "&uacute;" => "ú",
        "&ucirc;" => "û",
        "&uuml;" => "ü",
        "&yacute;" => "ý",
        "&thorn;" => "þ",
        "&yuml;" => "ÿ",
        "&Agrave;" => "à",
        "&Aacute;" => "á",
        "&Acirc;" => "â",
        "&Atilde;" => "ã",
        "&Auml;" => "ä",
        "&Aring;" => "å",
        "&Aelig;" => "æ",
        "&Ccedil;" => "ç",
        "&Egrave;" => "è",
        "&Eacute;" => "é",
        "&Ecirc;" => "ê",
        "&Euml;" => "ë",
        "&Igrave;" => "ì",
        "&Iacute;" => "í",
        "&Icirc;" => "î",
        "&Iuml;" => "ï",
        "&ETH;" => "ð",
        "&Ntilde;" => "ñ",
        "&Ograve;" => "ò",
        "&Oacute;" => "ó",
        "&Ocirc;" => "ô",
        "&Otilde;" => "õ",
        "&Ouml;" => "ö",
        "&Oslash;" => "ø",
        "&Ugrave;" => "ù",
        "&Uacute;" => "ú",
        "&Ucirc;" => "û",
        "&Uuml;" => "ü",
        "&Yacute;" => "ý",
        "&Yhorn;" => "þ",
        "&Yuml;" => "ÿ"
    );
    public function getPostByQueryString($query, $page_num, $pagesize){
        $starttime = $this->getmicrotime();
        // A single unmatched double quote cannot delimit a phrase, so drop it.
        if (substr_count($query,'"')==1){
            $query=str_replace('"','',$query);
        }
        $words = $this->makeboollist($query);
        // dump($words);
        $data = $this->search($words, $page_num, $pagesize);
        // dump($data);
        if(isset($data['did_you_mean'])){
            // Nothing matched: search again with the suggested words.
            $words['hilight'] = $words['+'] = array_values($data['did_you_mean']);
            $data_suggest = $this->search($words, $page_num, $pagesize);
            $did_you_mean_b=$query;
            $did_you_mean=$query;
            // each() was removed in PHP 8; foreach behaves the same here.
            foreach ($data['did_you_mean'] as $key => $val) {
                if ($key != $val && !stristr("<font color=#D54955><b>", $key) && !stristr("</b></font>", $key)) {
                    $did_you_mean_b = str_replace($key, "<font color=#D54955><b>$val</b></font>", $did_you_mean_b);
                    $did_you_mean = str_replace($key, "$val", $did_you_mean);
                }
            }
            // URL-encode the suggested query so it is safe inside the href attribute.
            $a_href = "<a href=\"/search?q=".urlencode($did_you_mean)."\">";
            $data = $data_suggest;
            $data['did_you_mean'] = $a_href.$did_you_mean_b."</a>";
            // dump($data['did_you_mean']);
            $data['results_suggest'] = $data_suggest['results'];
            $data['results'] = 0;
        }
        $time = $this->getmicrotime() - $starttime;
        $data['time'] = $time;
        // dump($data);
        return $data;
    }
    public function makeboollist($query){
        // convert HTML entities
        $stem_words = C('STEM_WORDS');
        // each() was removed in PHP 8; foreach behaves the same here.
        foreach ($this->entities as $entity => $replacement){
            $query = preg_replace("/".$entity."/i", $replacement, $query);
        }
        $query = preg_replace("/&quot;/i", "\"", $query);
        $query = trim($query);
        $returnWords = array();
        //get all phrases
        $regs = array();
        // dump($query);
        // "phrase" is a required phrase (+s), -"phrase" an excluded one (-s).
        while (preg_match("/([-]?)\"([^\"]+)\"/", $query, $regs)) {
            if ($regs[1] == '') {
                $returnWords['+s'][] = $regs[2];
                $returnWords['hilight'][] = $regs[2];
            } else {
                $returnWords['-s'][] = $regs[2];
            }
            $query = str_replace($regs[0], "", $query);
        }
        $query=str_replace('"','',$query);
        $query = preg_replace("/[\*\^\+\?\\\.\[\]\^\$\|\{\)\(\}~!\"\/@#£$%&=`´;><:,-]+/", " ", $query);
        $query = strtolower(preg_replace("/[ ]+/", " ", $query));
        // $query = remove_accents($query);
        $query = trim($query);
        $words = explode(' ', $query);
        if (!$query) {
            $limit = 0;
        } else {
            $limit = count($words);
        }
        $k = 0;
        //get all words (both include and exclude)
        $includeWords = array();
        while ($k < $limit) {
            if (substr($words[$k], 0, 1) == '+') {
                $includeWords[] = substr($words[$k], 1);
                $returnWords['hilight'][] = substr($words[$k], 1);
                if (!($this->ignoreWord(substr($words[$k], 1)))) {
                    if ($stem_words == 1) {
                        $stem_word = new \Common\Plugin\Stem();
                        $word = $stem_word->stem(substr($words[$k], 1));
                        if($word != substr($words[$k], 1)){
                            $returnWords['hilight'][] = $word;
                            $includeWords[] = $word;
                        }
                    }
                }
            } else if (substr($words[$k], 0, 1) == '-') {
                $returnWords['-'][] = substr($words[$k],