
es - elasticsearch custom analyzers - built-in token filters - 2

Tags: stack - es

There is no perfect program in this world, but that does not discourage us: writing programs is a continuous pursuit of perfection.

Custom analyzers:

  1. Character filters:
    1. Purpose: add, remove, or transform characters
    2. Count: zero or more
    3. Built-in character filters:
      1. HTML Strip Character Filter: strips HTML tags
      2. Mapping Character Filter: mapping-based replacement
      3. Pattern Replace Character Filter: regex-based replacement
  2. Tokenizer:
    1. Purpose:
      1. splits text into tokens
      2. records token order and position (for phrase queries)
      3. records each token's start and end offsets (for highlighting)
      4. records each token's type (for classification)
    2. Count: exactly one
    3. Categories:
      1. Word-oriented:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
      2. Partial-word:
        1. N-Gram
        2. Edge N-Gram
      3. Structured text:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path Hierarchy
  3. Token filters:
    1. Purpose: add, remove, or transform tokens
    2. Count: zero or more
    3. Types:
      1. apostrophe
      2. asciifolding
      3. cjk bigram
      4. cjk width
      5. classic
      6. common grams
      7. conditional
      8. decimal digit
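The three stages above always run in order: character filters first, then the single tokenizer, then the token filters. A minimal Python sketch of that pipeline (hypothetical helper functions for illustration, not Elasticsearch code):

```python
import re

# Stage 1 - character filter: strips HTML tags (like html_strip)
def html_strip(text):
    return re.sub(r"<[^>]+>", "", text)

# Stage 2 - tokenizer: exactly one per analyzer
def whitespace_tokenize(text):
    return text.split()

# Stage 3 - token filter: zero or more per analyzer
def lowercase(tokens):
    return [t.lower() for t in tokens]

# The analyzer chains the stages in fixed order.
def analyze(text):
    return lowercase(whitespace_tokenize(html_strip(text)))

print(analyze("<b>Hello</b> World"))  # ['hello', 'world']
```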

In today's demo, pay particular attention to:

  1. the common grams token filter
  2. the conditional token filter
# classic token filter
# Purpose:
#   1. removes ' and everything after it
#   2. removes the dots between acronym letters
# Applies to: output of the classic tokenizer
GET /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": ["hello this is hi's good H.J.K.M. Q.U.I.C.K. "]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hi",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<APOSTROPHE>",
      "position" : 3
    },
    {
      "token" : "good",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "HJKM",
      "start_offset" : 24,
      "end_offset" : 32,
      "type" : "<ACRONYM>",
      "position" : 5
    },
    {
      "token" : "QUICK",
      "start_offset" : 33,
      "end_offset" : 43,
      "type" : "<ACRONYM>",
      "position" : 6
    }
  ]
}
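The two rules in the comments can be sketched in Python (an approximation of the behavior seen in the result above, not the Lucene implementation; the real filter works on the classic tokenizer's token types):

```python
import re

def classic_filter(token):
    # Rule 1: drop ' and everything after it (hi's -> hi)
    token = re.sub(r"'.*$", "", token)
    # Rule 2: if the token is an acronym (letters separated by dots),
    # remove the dots (H.J.K.M. -> HJKM)
    if re.fullmatch(r"(?:[A-Za-z]\.)+", token):
        token = token.replace(".", "")
    return token

print(classic_filter("hi's"))      # hi
print(classic_filter("H.J.K.M."))  # HJKM
```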
# common grams token filter
# Purpose:
#   1. fuses the specified common words with the tokens before and after them
#   2. avoids the recall loss caused by removing stop words
# Options:
#   1. common_words      : the words to fuse
#   2. common_words_path : path to a file listing the words to fuse
#   3. ignore_case       : ignore case when matching, defaults to false
#   4. query_mode        : whether to drop the standalone common-word tokens, defaults to false (kept)
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type"         : "common_grams",
    "common_words" : ["是", "的", "Is"],
    "ignore_case"  : true,
    "query_mode"   : true
  }],
  "text": ["我是中國人", "這是我的飯", "this is my food"]
}

# Result
{
  "tokens" : [
    {
      "token" : "我_是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "gram",
      "position" : 0
    },
    {
      "token" : "是_中",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "gram",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "國",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "這_是",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 105
    },
    {
      "token" : "是_我",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 106
    },
    {
      "token" : "我_的",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "gram",
      "position" : 107
    },
    {
      "token" : "的_飯",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "gram",
      "position" : 108
    },
    {
      "token" : "this_is",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "gram",
      "position" : 209
    },
    {
      "token" : "is_my",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 210
    },
    {
      "token" : "my",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 211
    },
    {
      "token" : "food",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 212
    }
  ]
}
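The query_mode=true behavior visible in the result can be sketched as follows (hypothetical `common_grams` helper, simplified: it ignores positions and offsets, and only models query_mode=true with ignore_case=true):

```python
def common_grams(tokens, common_words):
    common = {w.lower() for w in common_words}  # ignore_case=true
    out = []
    for i, tok in enumerate(tokens):
        cur_common = tok.lower() in common
        nxt_common = i + 1 < len(tokens) and tokens[i + 1].lower() in common
        if cur_common or nxt_common:
            # Fuse with the following token; in query_mode the standalone
            # unigram for this position is dropped.
            if i + 1 < len(tokens):
                out.append(tok + "_" + tokens[i + 1])
        else:
            out.append(tok)  # ordinary token, kept as-is
    return out

print(common_grams(["this", "is", "my", "food"], ["is"]))
# ['this_is', 'is_my', 'my', 'food']
```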

# conditional token filter
# Purpose: conditional filtering - applies the wrapped filters only when a script predicate holds
# Options:
#   1. filter : the wrapped filters
#   2. script : the predicate script
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
      "type"   : "condition",
      "filter" : ["lowercase"],
      "script" : {
        "source": "token.getTerm().length() < 5"
      }
  }], 
  "text": ["THE QUICK BROWN FOX"]
}

# Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "QUICK",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "BROWN",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
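The conditional logic above boils down to: run the wrapped filter only on tokens that satisfy the predicate. A minimal sketch (hypothetical `conditional` helper, not the ES scripting engine):

```python
def conditional(tokens, predicate, wrapped):
    # Apply the wrapped filter only where the predicate holds;
    # other tokens pass through unchanged.
    return [wrapped(t) if predicate(t) else t for t in tokens]

result = conditional(
    "THE QUICK BROWN FOX".split(),
    lambda t: len(t) < 5,  # mirrors token.getTerm().length() < 5
    str.lower,             # the wrapped lowercase filter
)
print(result)  # ['the', 'QUICK', 'BROWN', 'fox']
```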
# decimal digit token filter
# Purpose: converts non-ASCII decimal digit characters (e.g. Devanagari १) to Arabic numerals 0-9
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["decimal_digit"],
  "text": ["6.7 १ १-one two-२ ३ "]
}

# Result
{
  "tokens" : [
    {
      "token" : "6.7 1 1-one two-2 3 ",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    }
  ]
}
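The same conversion can be reproduced with Python's standard library, assuming decimal_digit simply maps every Unicode decimal digit (category Nd) to its 0-9 value:

```python
import unicodedata

def decimal_digit(text):
    # Replace each Unicode decimal digit with its ASCII equivalent;
    # everything else passes through unchanged.
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

print(decimal_digit("6.7 १ १-one two-२ ३"))  # 6.7 1 1-one two-2 3
```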