
Using Full-Text Indexes in Queries

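The previous post explained how full-text indexes work, covered some parameter settings, and showed how to create one.

MySQL supports queries that use a full-text index, with the following syntax: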

MATCH (col1, col2,...) AGAINST (expr [search_modifier])

search_modifier:
{
  IN NATURAL LANGUAGE MODE |
  IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION |
  IN BOOLEAN MODE |
  WITH QUERY EXPANSION
}

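MySQL supports full-text search through the MATCH() ... AGAINST() syntax: MATCH names the columns to search, and AGAINST gives the search expression and the search mode to use.

NATURAL LANGUAGE

A full-text search is performed with the MATCH() function and uses natural language mode by default, which looks for documents that contain the given word.

In the previous post a table was created, populated with data, and given a full-text index. Its contents are as follows: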

mysql> select * from tb3;
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          1 | pLease porridge in the pot          |
|          2 | please say sorry                    |
|          4 | some like it hot, some like it cold |
|          5 | i like coding                       |
|          6 | fuck the company                    |
+------------+-------------------------------------+
5 rows in set (0.00 sec)

We look for the rows whose body contains the word "please".

# Without the full-text index
mysql> select * from tb3 where body like "%please%";
+------------+----------------------------+
| FTS_DOC_ID | body                       |
+------------+----------------------------+
|          1 | pLease porridge in the pot |
|          2 | please say sorry           |
+------------+----------------------------+
2 rows in set (0.00 sec)

# Check the execution plan
mysql> explain select * from tb3 where body like "%please%";
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | tb3   | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    6 |    16.67 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.03 sec)

# With the full-text index
mysql> select * from tb3 where match(body) against("please" in natural language mode);
+------------+----------------------------+
| FTS_DOC_ID | body                       |
+------------+----------------------------+
|          1 | pLease porridge in the pot |
|          2 | please say sorry           |
+------------+----------------------------+
2 rows in set (0.03 sec)

mysql> explain select * from tb3 where match(body) against("please" in natural language mode);
+----+-------------+-------+------------+----------+---------------+----------+---------+-------+------+----------+-------------------------------+
| id | select_type | table | partitions | type     | possible_keys | key      | key_len | ref   | rows | filtered | Extra                         |
+----+-------------+-------+------------+----------+---------------+----------+---------+-------+------+----------+-------------------------------+
|  1 | SIMPLE      | tb3   | NULL       | fulltext | ft_index      | ft_index | 0       | const |    1 |   100.00 | Using where; Ft_hints: sorted |
+----+-------------+-------+------------+----------+---------------+----------+---------+-------+------+----------+-------------------------------+

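The two execution plans show the difference: with the full-text index the filtered value is 100% and only one row is examined, whereas without it the filtered value is only 16.67% and the whole table is scanned (type ALL).

When MATCH() is used in the WHERE clause, the rows are returned in descending order of relevance, so the most relevant row comes first. In natural language mode the relevance is a non-negative floating-point number, and 0 means no relevance at all. According to the official MySQL documentation, relevance is computed from the following four factors:

  • Whether the word appears in the document.
  • How many times the word appears in the document.
  • The number of words in the indexed column.
  • How many documents contain the word.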
mysql> insert into tb3 select NULL,"please please please";        # insert one more row
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0
mysql> select * from tb3 where match(body) against("please" in natural language mode);
+------------+----------------------------+
| FTS_DOC_ID | body                       |
+------------+----------------------------+
|          8 | please please please       |
|          1 | pLease porridge in the pot |
|          2 | please say sorry           |
+------------+----------------------------+
3 rows in set (0.00 sec)

# Document 8 has the highest relevance, so it is listed first.

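# Count the matching rows
mysql> select count(*) from tb3 where match(body) against("please" in natural language mode);
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT
    -> COUNT(IF(MATCH (body) AGAINST ("please" IN NATURAL LANGUAGE MODE),1,NULL))
    -> AS count
    -> from tb3;
+-------+
| count |
+-------+
|     3 |
+-------+
1 row in set (0.00 sec)

# Of these two statements, the second can be the more efficient one, because the first also has to sort the results by relevance while the second does not. The second performs a full table scan, though, so the first may still be faster when the search word matches only a few rows.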

The relevance of each row can also be inspected with a SQL statement:

mysql> select fts_doc_id, body, match(body) against("please" in natural language mode) as relevence from tb3;
+------------+-------------------------------------+---------------------+
| fts_doc_id | body                                | relevence           |
+------------+-------------------------------------+---------------------+
|          1 | pLease porridge in the pot          | 0.13540691137313843 |
|          2 | please say sorry                    | 0.13540691137313843 |
|          4 | some like it hot, some like it cold |                   0 |
|          5 | i like coding                       |                   0 |
|          6 | fuck the company                    |                   0 |
|          8 | please please please                |  0.4062207341194153 |
+------------+-------------------------------------+---------------------+
6 rows in set (0.01 sec)
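
The relevance values above can be approximated by hand. The MySQL reference manual describes InnoDB's ranking as a TF-IDF variant, roughly rank = TF * IDF * IDF with IDF = log10(total documents / documents containing the word). The Python sketch below only illustrates that formula; the function name innodb_rank is hypothetical, and total_docs=7 is an assumption chosen because it reproduces the numbers shown (the table itself lists only 6 rows, so the index presumably counts one more document than is visible).

```python
import math

def innodb_rank(tf: int, total_docs: int, matching_docs: int) -> float:
    """Approximate single-word rank: TF * IDF * IDF, IDF = log10(total / matching)."""
    if tf == 0 or matching_docs == 0:
        return 0.0
    idf = math.log10(total_docs / matching_docs)
    return tf * idf * idf

# "please" occurs in 3 documents (ids 1, 2 and 8).
# total_docs=7 is an assumption that happens to match the output above.
once = innodb_rank(tf=1, total_docs=7, matching_docs=3)    # ids 1 and 2
thrice = innodb_rank(tf=3, total_docs=7, matching_docs=3)  # id 8
print(round(once, 4), round(thrice, 4))  # prints: 0.1354 0.4062
```

Under this formula the word count in the document (TF) scales the rank linearly, which is why document 8 scores exactly three times as high as documents 1 and 2.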

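For full-text indexes on the InnoDB storage engine, the following factors also need to be considered:

  • If the searched word is in the stopword list, it is ignored in the search.
  • Whether the length of the searched word lies within the interval [innodb_ft_min_token_size, innodb_ft_max_token_size].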
mysql> show variables like "innodb_ft_min%";
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_ft_min_token_size | 3     |
+--------------------------+-------+
1 row in set (0.01 sec)

mysql> show variables like "innodb_ft_max%";
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_ft_max_token_size | 84    |
+--------------------------+-------+
1 row in set (0.00 sec)

# These two parameters bound the length of the words InnoDB will search for: a word shorter than innodb_ft_min_token_size or longer than innodb_ft_max_token_size is ignored in the search.
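
The two filters above (stopwords and token length) can be pictured with a small Python sketch. It is not InnoDB's tokenizer: the function name searchable_terms is hypothetical, and STOPWORDS is only a small subset assumed for illustration, not the full default list.

```python
MIN_TOKEN_SIZE = 3    # innodb_ft_min_token_size, as shown above
MAX_TOKEN_SIZE = 84   # innodb_ft_max_token_size, as shown above
STOPWORDS = {"the", "it", "in", "for"}  # assumed subset, for illustration only

def searchable_terms(query: str) -> list[str]:
    """Return the query words that would survive both filters."""
    terms = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue  # stopwords are ignored
        if not MIN_TOKEN_SIZE <= len(word) <= MAX_TOKEN_SIZE:
            continue  # too short or too long to be searched
        terms.append(word)
    return terms

print(searchable_terms("i like it in the pot"))  # prints: ['like', 'pot']
```

This suggests why a search for a very short word such as "i" returns nothing even though the word is present in the table.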

Boolean

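MySQL allows full-text searches with the IN BOOLEAN MODE modifier. With this modifier, certain characters at the beginning or end of a word in the search string have a special meaning.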

mysql> select * from tb3 where match(body) against ("+like -hot" in boolean mode);
+------------+---------------+
| FTS_DOC_ID | body          |
+------------+---------------+
|          5 | i like coding |
+------------+---------------+
1 row in set (0.00 sec)

# The query above asks for documents that contain like but do not contain hot.

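Boolean full-text search supports the following operators:

  • +: the word must be present.
  • -: the word must not be present.
  • (no operator): the word is optional, but documents that contain it get a higher relevance.
  • @distance: tests whether the searched words all occur within distance of each other; the MySQL reference manual measures this distance in words. This kind of full-text query is also called Proximity Search.
  • >: the word increases the relevance when it is present.
  • <: the word decreases the relevance when it is present.
  • *: wildcard appended to a word; it matches words that begin with that prefix followed by zero or more characters.
  • ": encloses a phrase.

# Documents that contain please or hot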
mysql> select * from tb3 where match (body) against ("please hot" in boolean mode);
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          4 | some like it hot, some like it cold |
|          8 | please please please                |
|          1 | pLease porridge in the pot          |
|          2 | please say sorry                    |
+------------+-------------------------------------+
4 rows in set (0.00 sec)

# Proximity search with @distance
# please and sorry within a distance of 5
mysql> select * from tb3 where match(body) against ('"please sorry" @5' in boolean mode);
+------------+------------------+
| FTS_DOC_ID | body             |
+------------+------------------+
|          2 | please say sorry |
+------------+------------------+
1 row in set (0.00 sec)

# please and sorry within a distance of 2: the result is empty
mysql> select * from tb3 where match(body) against ('"please sorry" @2' in boolean mode);
Empty set (0.00 sec)

# Documents that contain please or hot, where > raises the relevance contribution of hot
mysql> select * from tb3 where match(body) against ('please >hot' in boolean mode);
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          4 | some like it hot, some like it cold |
|          8 | please please please                |
|          1 | pLease porridge in the pot          |
|          2 | please say sorry                    |
+------------+-------------------------------------+
4 rows in set (0.00 sec)

# In boolean mode the relevance can be negative
mysql> select fts_doc_id, body, match(body) against("please >say <pot" in boolean mode) as relevence from tb3;
+------------+-------------------------------------+----------------------+
| fts_doc_id | body                                | relevence            |
+------------+-------------------------------------+----------------------+
|          1 | pLease porridge in the pot          | -0.15040236711502075 |
|          2 | please say sorry                    |    1.849597692489624 |
|          4 | some like it hot, some like it cold |                    0 |
|          5 | i like coding                       |                    0 |
|          6 | fuck the company                    |                    0 |
|          8 | please please please                |   0.4062207341194153 |
+------------+-------------------------------------+----------------------+
6 rows in set (0.00 sec)

# Matching with *
mysql> select * from tb3 where match(body) against ('so*' in boolean mode);
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          4 | some like it hot, some like it cold |
|          2 | please say sorry                    |
+------------+-------------------------------------+
2 rows in set (0.01 sec)
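
To make the matching rules of the +, -, no-operator and * forms concrete, here is a simplified Python sketch that evaluates them against a subset of the rows in tb3. It is an illustration under assumptions, not MySQL's implementation: the names tokenize, term_matches and boolean_search are hypothetical, phrases and @distance are not handled, > and < are treated as plain optional words, and the result is ordered by id rather than by relevance.

```python
import re

# A subset of the rows in tb3, keyed by FTS_DOC_ID.
DOCS = {
    1: "pLease porridge in the pot",
    2: "please say sorry",
    4: "some like it hot, some like it cold",
    5: "i like coding",
    8: "please please please",
}

def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into words."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def term_matches(term: str, words: set[str]) -> bool:
    """A trailing * turns the term into a prefix match."""
    if term.endswith("*"):
        return any(w.startswith(term[:-1]) for w in words)
    return term in words

def boolean_search(query: str, docs: dict[int, str]) -> list[int]:
    required, excluded, optional = [], [], []
    for raw in query.lower().split():
        if raw.startswith("+"):
            required.append(raw[1:])
        elif raw.startswith("-"):
            excluded.append(raw[1:])
        else:
            optional.append(raw.lstrip("<>"))  # > and < only affect ranking
    hits = []
    for doc_id, body in docs.items():
        words = tokenize(body)
        if any(term_matches(t, words) for t in excluded):
            continue  # a "-" word is present
        if not all(term_matches(t, words) for t in required):
            continue  # a "+" word is missing
        if not required and not any(term_matches(t, words) for t in optional):
            continue  # no "+" words given: at least one optional word must match
        hits.append(doc_id)
    return sorted(hits)

print(boolean_search("+like -hot", DOCS))  # prints: [5]
print(boolean_search("so*", DOCS))         # prints: [2, 4]
```

The two printed results agree with the rows MySQL returned for the same search strings above, apart from the ordering.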

query expansion

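MySQL also supports query expansion for full-text searches. It is typically useful when the search phrase is too short and the user is relying on implied knowledge. For example, a user searching for the word database may want not only the documents that contain database itself, but also those that contain MySQL, Oracle, DB2, or RDBMS. Query expansion mode enables this kind of implied-knowledge search.

Appending WITH QUERY EXPANSION or IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION to the search expression enables blind query expansion. The search then runs in two stages:

  • Stage 1: run a full-text search for the given words.
  • Stage 2: run a second full-text search based on the words produced by stage 1.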
# Create the following table
CREATE TABLE articles (
  id int UNSIGNED auto_increment not null primary key,
  title varchar(60),
  body text,
  FULLTEXT ft_index(title, body)
);

# Insert data
INSERT INTO articles(title, body) VALUES
  ("mysql tutorial", "DBMS stands for DATABASE...."),
  ("How to use mysql well","After you went through a ..."),
  ("optimizing mysql","in this tutorial we will SHOW...."),
  ("1001 mysql tricks","1. Never run mysqld as root..."),
  ("mysql VS pgsql","In the following database comparison..."),
  ("mysql security","When configured properly....,mysql..."),
  ("tunning DB2", "for IBM database"),
  ("IBM History","DB2 history for IBM...");

mysql> select * from articles; # query the data
+----+-----------------------+-----------------------------------------+
| id | title                 | body                                    |
+----+-----------------------+-----------------------------------------+
| 17 | mysql tutorial        | DBMS stands for DATABASE....            |
| 18 | How to use mysql well | After you went through a ...            |
| 19 | optimizing mysql      | in this tutorial we will SHOW....       |
| 20 | 1001 mysql tricks     | 1. Never run mysqld as root...          |
| 21 | mysql VS pgsql        | In the following database comparison... |
| 22 | mysql security        | When configured properly....,mysql...   |
| 23 | tunning DB2           | for IBM database                        |
| 24 | IBM History           | DB2 history for IBM...                  |
+----+-----------------------+-----------------------------------------+
8 rows in set (0.00 sec)

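In this example no FTS_DOC_ID column was created explicitly, so the InnoDB storage engine creates that column automatically and adds a unique index on it (both are hidden). In addition, a composite full-text index was created on the columns title and body.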

mysql> select * from articles where match(title, body) against("database" );   # in natural language mode is the default
+----+----------------+-----------------------------------------+
| id | title          | body                                    |
+----+----------------+-----------------------------------------+
| 17 | mysql tutorial | DBMS stands for DATABASE....            |
| 21 | mysql VS pgsql | In the following database comparison... |
| 23 | tunning DB2    | for IBM database                        |
+----+----------------+-----------------------------------------+
3 rows in set (0.00 sec)

mysql> select * from articles where match(title, body) against("database" with query expansion);
+----+-----------------------+-----------------------------------------+
| id | title                 | body                                    |
+----+-----------------------+-----------------------------------------+
| 21 | mysql VS pgsql        | In the following database comparison... |
| 17 | mysql tutorial        | DBMS stands for DATABASE....            |
| 23 | tunning DB2           | for IBM database                        |
| 24 | IBM History           | DB2 history for IBM...                  |
| 19 | optimizing mysql      | in this tutorial we will SHOW....       |
| 22 | mysql security        | When configured properly....,mysql...   |
| 18 | How to use mysql well | After you went through a ...            |
| 20 | 1001 mysql tricks     | 1. Never run mysqld as root...          |
+----+-----------------------+-----------------------------------------+
8 rows in set (0.01 sec)
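
The two stages of blind query expansion can be sketched in Python against the same articles. This is a deliberately simplified model, not MySQL's algorithm: the function names are hypothetical, title and body are concatenated into one string, every word of every stage-1 hit is fed into stage 2 (a real engine would pick only the most relevant documents and words), and results are listed in id order rather than by relevance.

```python
import re

# title + body of each row in articles, keyed by id.
ARTICLES = {
    17: "mysql tutorial DBMS stands for DATABASE....",
    18: "How to use mysql well After you went through a ...",
    19: "optimizing mysql in this tutorial we will SHOW....",
    20: "1001 mysql tricks 1. Never run mysqld as root...",
    21: "mysql VS pgsql In the following database comparison...",
    22: "mysql security When configured properly....,mysql...",
    23: "tunning DB2 for IBM database",
    24: "IBM History DB2 history for IBM...",
}

def tokenize(text: str, min_len: int = 3) -> set[str]:
    """Lower-case words of at least min_len characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= min_len}

def search(terms: set[str], docs: dict[int, str]) -> list[int]:
    """Ids of the documents that contain at least one of the terms."""
    return [doc_id for doc_id, text in docs.items() if terms & tokenize(text)]

def search_with_expansion(query: str, docs: dict[int, str]) -> tuple[list[int], list[int]]:
    # Stage 1: ordinary search for the query words.
    first_pass = search(tokenize(query), docs)
    # Stage 2: add the words of the stage-1 hits to the query and search again.
    expanded = set(tokenize(query))
    for doc_id in first_pass:
        expanded |= tokenize(docs[doc_id])
    return first_pass, search(expanded, docs)

first, second = search_with_expansion("database", ARTICLES)
print(first)   # prints: [17, 21, 23]
print(second)  # prints: [17, 18, 19, 20, 21, 22, 23, 24]
```

Stage 1 finds the same three rows as the plain search, and stage 2 pulls in every row because words such as mysql, ibm and db2 from those three rows occur in all the others.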

Because query expansion tends to return many irrelevant results, it should be used with particular care.