Learning Elasticsearch and Using Its Java API

 

This article collects search techniques encountered while using Elasticsearch in practice, along with how to call the Java API. It will be extended over time.

  • Simple search

A search request body in JSON looks like this:

{
  "query": {
    ... 
  }
}
You can page through results by setting a starting offset (from) and the number of hits to return (size):

{
    "from": 0,
    "size": 10,
    "query": {
        "match_all": {}
    }
}

If no paging parameters are given, from defaults to 0 and size defaults to 10.
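The relationship between a page number and from/size can be sketched in plain Java. This is only an illustration: the class name PageRequest, the helper names, and the 1-based page convention are assumptions made for this example and are not part of Elasticsearch.

```java
class PageRequest {
    /** Offset of the first hit for a 1-based page number (assumed convention). */
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 0) {
            throw new IllegalArgumentException("page must be >= 1 and size >= 0");
        }
        return (page - 1) * size;
    }

    /** Builds a match_all request body with explicit from/size paging. */
    static String matchAllBody(int page, int size) {
        return "{\"from\":" + fromOffset(page, size)
             + ",\"size\":" + size
             + ",\"query\":{\"match_all\":{}}}";
    }

    public static void main(String[] args) {
        // Page 1 starts at offset 0, page 3 at offset 20 when size is 10.
        System.out.println(matchAllBody(1, 10));
        System.out.println(matchAllBody(3, 10));
    }
}
```

The resulting string can be sent as the body of a search request; building it by hand is only reasonable for a fixed query like this one.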

You can also load only a subset of fields:

{
  "fields": [
    "userId"
  ],
  "query": {
    "match_all": {}
  }
}

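This means each hit loads only the userId field. If fields is an empty array, each hit carries only metadata such as "_index", "_type", "_id", and "_score"; if fields is omitted altogether, the full _source is returned by default.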

  • Match All Query

{
    "query": {
        "match_all": {}
    }
}
matchAllQuery matches every document. The corresponding Java class is MatchAllQueryBuilder.
  • Term Query

{
  "query": {
    "term" : { "user" : "Kimchy" } 
  }
}

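termQuery performs an exact-match search and does not analyze (tokenize) the query value. The example finds documents whose user field has the value Kimchy. The corresponding Java class is TermQueryBuilder. It has several constructors; the first argument is the field to match and the second is the value to match.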

eg:

QueryBuilders.termQuery("name", "你的名字。")
  • Match Query

{
  "query": {
    "match": {
      "name": "甜心格格 第二季"
    }
  }
}

matchQuery runs a query against a single field; here it searches the name field for "甜心格格 第二季". The corresponding Java class is MatchQueryBuilder.

{
  "query": {
    "match": {
      "_all": "你神"
    }
  }
}

If the field is "_all", the search runs against all fields. matchQuery has three types: boolean, phrase, and phrase_prefix.

  • Boolean

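boolean is the default type. According to the official documentation, it means the provided text is analyzed and a boolean query is built from the resulting terms. The operator parameter controls how those terms are combined (or / and) and defaults to or. In other words, the query value is tokenized first. minimum_should_match sets the minimum number of terms that must match.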
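The effect of operator and minimum_should_match can be imitated with a small in-memory sketch. Everything here is an assumption for illustration: the class MatchOperatorSketch, its method names, and the lowercase whitespace tokenizer, which only stands in for whatever analyzer the field really uses. Scoring is not modeled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MatchOperatorSketch {
    /** Lowercase whitespace tokenizer standing in for a real analyzer (assumption). */
    static Set<String> analyze(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    /** Counts how many distinct query tokens occur in the field text. */
    static int matchedTerms(String fieldText, String queryText) {
        Set<String> fieldTokens = analyze(fieldText);
        int count = 0;
        for (String token : analyze(queryText)) {
            if (fieldTokens.contains(token)) {
                count++;
            }
        }
        return count;
    }

    /** operator = or: at least one query term must match. */
    static boolean matchOr(String fieldText, String queryText) {
        return matchedTerms(fieldText, queryText) >= 1;
    }

    /** operator = and: every query term must match. */
    static boolean matchAnd(String fieldText, String queryText) {
        return matchedTerms(fieldText, queryText) == analyze(queryText).size();
    }

    /** minimum_should_match given as an absolute number of terms. */
    static boolean matchAtLeast(String fieldText, String queryText, int minimum) {
        return matchedTerms(fieldText, queryText) >= minimum;
    }

    public static void main(String[] args) {
        System.out.println(matchOr("quick brown fox", "brown dog"));
        System.out.println(matchAnd("quick brown fox", "brown dog"));
        System.out.println(matchAtLeast("quick brown fox", "quick brown dog", 2));
    }
}
```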

  • Phrase and Phrase_prefix

Both phrase and phrase_prefix search for a phrase. The difference is that phrase_prefix treats the last word as a prefix to match.

eg:

{
  "query": {
    "match_phrase_prefix": {
        "name": "quick brown f"
    }
  }
}
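As a rough mental model of phrase-prefix matching, the sketch below checks that all query words but the last appear consecutively in the field, and that the word following them starts with the last query word. The class PhrasePrefixSketch and the whitespace tokenizer are assumptions for illustration; the real query works on analyzed terms and limits how many terms the prefix may expand to.

```java
class PhrasePrefixSketch {
    /**
     * Returns true if the query words occur consecutively in the field text,
     * with the last query word matched as a prefix.
     */
    static boolean matchPhrasePrefix(String fieldText, String queryText) {
        String[] field = fieldText.toLowerCase().trim().split("\\s+");
        String[] query = queryText.toLowerCase().trim().split("\\s+");
        for (int start = 0; start + query.length <= field.length; start++) {
            boolean ok = true;
            for (int i = 0; i < query.length && ok; i++) {
                String word = field[start + i];
                boolean last = (i == query.length - 1);
                // Exact match for every word except the last, prefix match for the last.
                ok = last ? word.startsWith(query[i]) : word.equals(query[i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchPhrasePrefix("the quick brown fox", "quick brown f"));
        System.out.println(matchPhrasePrefix("quick brown dog", "quick brown f"));
    }
}
```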
  • MultiMatch Query

{
  "query": {
    "multi_match": {
      "query": "你的名字(花絮預告)",
      "fields": [
        "name",
        "awards"
      ]
    }
  }
}

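multiMatchQuery matches a value against several fields. Entries in fields may use wildcards; for example, *_name matches fields such as first_name and last_name. A caret (^) boosts the importance of a field, for example name^3.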

Its type attribute can be set to best_fields, most_fields, cross_fields, phrase, or phrase_prefix. How each one is used is left for a later look.

The corresponding Java class is MultiMatchQueryBuilder.

PS: there is one more usage:

{
  "query": {
    "term": {
      "all_worlds": "日本"
    }
  }
}

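This finds documents that contain "日本" in any field. (all_worlds is not a built-in field; it appears to be a catch-all field in the mapping used here.)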

  • Wildcard Query

{
  "query": {
    "wildcard": {
      "name": "*的*"
    }
  }
}

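wildcardQuery matches terms against a wildcard pattern. ? matches exactly one character and * matches any sequence of characters, including an empty one. The corresponding Java class is WildcardQueryBuilder.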
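The meaning of ? and * can be made concrete by translating a wildcard pattern into a regular expression. This is a sketch in plain Java, not how Elasticsearch evaluates the query internally; the class WildcardSketch and its methods are hypothetical. Keep in mind that the real query is matched against individual indexed terms, so on an analyzed field the pattern applies to each token rather than to the whole field value.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // any sequence of characters, including none
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("qu?ck", "quick"));
        System.out.println(matches("*row*", "brown"));
        System.out.println(matches("qu?ck", "quck"));
    }
}
```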

  • Query String Query

{
    "query": {
        "query_string" : {
            "query" : "(new york city) OR (big apple)"
        }
    }
}
Parameter: Description
query: The actual query to be parsed. See Query string syntax.
default_field: The default field for query terms if no prefix field is specified. Defaults to the index.query.default_field index settings, which in turn defaults to _all.
default_operator: The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR.
analyzer: The analyzer name used to analyze the query string.
allow_leading_wildcard: When set, * or ? are allowed as the first character. Defaults to true.
lowercase_expanded_terms: Whether terms of wildcard, prefix, fuzzy, and range queries are to be automatically lower-cased or not (since they are not analyzed). Defaults to true.
enable_position_increments: Set to true to enable position increments in result queries. Defaults to true.
fuzzy_max_expansions: Controls the number of terms fuzzy queries will expand to. Defaults to 50.
fuzziness: Set the fuzziness for fuzzy queries. Defaults to AUTO. See Fuzziness for allowed settings.
fuzzy_prefix_length: Set the prefix length for fuzzy queries. Default is 0.
phrase_slop: Sets the default slop for phrases. If zero, then exact phrase matches are required. Default value is 0.
boost: Sets the boost value of the query. Defaults to 1.0.
analyze_wildcard: By default, wildcard terms in a query string are not analyzed. By setting this value to true, a best effort will be made to analyze those as well.
auto_generate_phrase_queries: Defaults to false.
max_determinized_states: Limit on how many automaton states regexp queries are allowed to create. This protects against too-difficult (e.g. exponentially hard) regexps. Defaults to 10000.
minimum_should_match: A value controlling how many "should" clauses in the resulting boolean query should match. It can be an absolute value (2), a percentage (30%) or a combination of both.
lenient: If set to true will cause format based failures (like providing text to a numeric field) to be ignored.
locale: Locale that should be used for string conversions. Defaults to ROOT.
time_zone: Time Zone to be applied to any range query related to dates. See also JODA timezone.

  • Compound queries

  • Bool Query

{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "releaseYear": "2014"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "你的名字"
          }
        }
      ]
    }
  }
}

boolQuery is a compound query that combines other queries.

Occur: Description
must: The clause (query) must appear in matching documents and will contribute to the score.
filter: The clause (query) must appear in matching documents. However, unlike must, the score of the query will be ignored.
should: The clause (query) should appear in the matching document. In a boolean query with no must or filter clauses, one or more should clauses must match a document. The minimum number of should clauses to match can be set using the minimum_should_match parameter.
must_not: The clause (query) must not appear in the matching documents.
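The matching rules for the four occur types can be written down as a small predicate combinator. This is a sketch under stated assumptions: BoolSketch, its matches method, and the example document fields are made up for illustration, clauses are modeled as java.util.function.Predicate, and scoring is ignored entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class BoolSketch {
    /** Applies the bool matching rules to one document (no scoring). */
    static <T> boolean matches(T doc,
                               List<Predicate<T>> must,
                               List<Predicate<T>> filter,
                               List<Predicate<T>> should,
                               List<Predicate<T>> mustNot) {
        for (Predicate<T> clause : must) {
            if (!clause.test(doc)) return false;
        }
        for (Predicate<T> clause : filter) {
            if (!clause.test(doc)) return false;
        }
        for (Predicate<T> clause : mustNot) {
            if (clause.test(doc)) return false;
        }
        // With no must or filter clauses, at least one should clause has to match.
        if (must.isEmpty() && filter.isEmpty() && !should.isEmpty()) {
            return should.stream().anyMatch(clause -> clause.test(doc));
        }
        return true;
    }

    /** Mirrors a should-only query: releaseYear is 2014 OR name starts with a prefix. */
    static boolean example(String releaseYear, String name) {
        Map<String, String> doc = new HashMap<>();
        doc.put("releaseYear", releaseYear);
        doc.put("name", name);
        List<Predicate<Map<String, String>>> should = new ArrayList<>();
        should.add(d -> "2014".equals(d.get("releaseYear")));
        should.add(d -> d.get("name").startsWith("Your Name"));
        List<Predicate<Map<String, String>>> none = Collections.emptyList();
        return matches(doc, none, none, should, none);
    }

    public static void main(String[] args) {
        System.out.println(example("2014", "Something Else"));
        System.out.println(example("2016", "Something Else"));
    }
}
```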

  • JAVA API

  • Connecting to an ES cluster

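TransportClient connects remotely to an Elasticsearch cluster using the transport module. It does not join the cluster; it simply takes one or more initial transport addresses and communicates with them in round-robin fashion.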

// on startup
Client client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("host1", 9300))
        .addTransportAddress(new InetSocketTransportAddress("host2", 9300));

// on shutdown
client.close();
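Round-robin here just means that successive requests cycle through the configured addresses. A minimal sketch of that selection logic, assuming a hypothetical RoundRobinSketch class that has nothing to do with the real TransportClient internals:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinSketch {
    private final List<String> addresses;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<String> addresses) {
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("at least one address is required");
        }
        this.addresses = addresses;
    }

    /** Returns the address to use for the next request, cycling through the list. */
    String nextAddress() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), addresses.size());
        return addresses.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch picker = new RoundRobinSketch(Arrays.asList("host1:9300", "host2:9300"));
        for (int i = 0; i < 4; i++) {
            System.out.println(picker.nextAddress());
        }
    }
}
```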

Note that if your cluster name is anything other than the default "elasticsearch", you have to set the cluster name.

Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "myClusterName").build();
Client client = new TransportClient(settings);
//Add transport addresses and do something with the client...

You can also set it in the elasticsearch.yml file.

The client can sniff the rest of the cluster and add those nodes to its list of machines to use. To enable this, set client.transport.sniff to true.

Settings settings = ImmutableSettings.settingsBuilder()
        .put("client.transport.sniff", true).build();
TransportClient client = new TransportClient(settings);

Other transport client settings include:

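Parameter: Description
client.transport.ignore_cluster_name: Set to true to skip cluster-name validation of connected nodes.
client.transport.ping_timeout: How long to wait for a ping response from a node. Defaults to 5s.
client.transport.nodes_sampler_interval: How often to sample/ping the nodes. Defaults to 5s.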

 

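PS: it is best to close the client when you are done with it. In testing, repeatedly obtaining connections without closing them could eventually lead to connection errors.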

  • Getting a document

The get API lets you retrieve a typed JSON document from an index by its id, as in this example:

GetResponse response = client.prepareGet("twitter", "tweet", "1")
        .execute()
        .actionGet();

 

By default, operationThreaded is set to true, meaning the operation runs on a different thread. Here is an example that sets it to false:

GetResponse response = client.prepareGet("twitter", "tweet", "1")
        .setOperationThreaded(false)
        .execute()
        .actionGet();
  • Deleting a document

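The delete API lets you delete a typed JSON document from a specific index by its id.

By default, operationThreaded is set to true, meaning the operation runs on a different thread. Here is an example that sets it to false: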

DeleteResponse response = client.prepareDelete("twitter", "tweet", "1")
        .setOperationThreaded(false)
        .execute()
        .actionGet();
  • Adding or updating a document

 

You can create an UpdateRequest and send it to the client.

UpdateRequest updateRequest = new UpdateRequest();
updateRequest.index("index");
updateRequest.type("type");
updateRequest.id("1");
updateRequest.doc(jsonBuilder()
        .startObject()
            .field("gender", "male")
        .endObject());
client.update(updateRequest).get();

Alternatively, you can use the prepareUpdate method:

 client.prepareUpdate("ttl", "doc", "1")
        .setScript("ctx._source.gender = \"male\""  , ScriptService.ScriptType.INLINE)
        .get();

 client.prepareUpdate("ttl", "doc", "1")
        .setDoc(jsonBuilder()
            .startObject()
                .field("gender", "male")
            .endObject())
        .get();

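The first call (lines 1-3) updates the document with a script; the second call (lines 5-10) updates it by merging in a partial doc.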

Of course, the Java API also supports upsert. If the document does not exist yet, the content of the upsert request is used to index a new document.

IndexRequest indexRequest = new IndexRequest("index", "type", "1")
        .source(jsonBuilder()
            .startObject()
                .field("name", "Joe Smith")
                .field("gender", "male")
            .endObject());
UpdateRequest updateRequest = new UpdateRequest("index", "type", "1")
        .doc(jsonBuilder()
            .startObject()
                .field("gender", "male")
            .endObject())
        .upsert(indexRequest);
client.update(updateRequest).get();

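If the document index/type/1 already exists (here assumed to have been stored with the name "Joe Dalton"), then after the update it looks like this: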

{
    "name"  : "Joe Dalton",
    "gender": "male"
}

Otherwise, the document is:

{
    "name" : "Joe Smith",
    "gender": "male"
}
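The two outcomes above follow from one rule: merge the partial doc if the id exists, otherwise store the upsert document. A minimal in-memory sketch of that rule, assuming a hypothetical UpsertSketch class in which a HashMap plays the role of the index and the merge is shallow (nested objects are not merged recursively here):

```java
import java.util.HashMap;
import java.util.Map;

class UpsertSketch {
    /** Merges partialDoc into an existing document, or stores upsertDoc if the id is absent. */
    static Map<String, Object> update(Map<String, Map<String, Object>> index,
                                      String id,
                                      Map<String, Object> partialDoc,
                                      Map<String, Object> upsertDoc) {
        Map<String, Object> existing = index.get(id);
        if (existing == null) {
            index.put(id, new HashMap<>(upsertDoc));
        } else {
            existing.putAll(partialDoc);
        }
        return index.get(id);
    }

    /** Replays the example with or without a previously stored "Joe Dalton" document. */
    static Map<String, Object> demo(boolean alreadyIndexed) {
        Map<String, Map<String, Object>> index = new HashMap<>();
        if (alreadyIndexed) {
            Map<String, Object> stored = new HashMap<>();
            stored.put("name", "Joe Dalton");
            index.put("1", stored);
        }
        Map<String, Object> partialDoc = new HashMap<>();
        partialDoc.put("gender", "male");
        Map<String, Object> upsertDoc = new HashMap<>();
        upsertDoc.put("name", "Joe Smith");
        upsertDoc.put("gender", "male");
        return update(index, "1", partialDoc, upsertDoc);
    }

    public static void main(String[] args) {
        System.out.println(demo(true));
        System.out.println(demo(false));
    }
}
```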
  • Bulk

The bulk API lets you index and delete several documents in a single request. Here is an example:

import static org.elasticsearch.common.xcontent.XContentFactory.*;
import java.util.Date;

BulkRequestBuilder bulkRequest = client.prepareBulk();

// either use client#prepare, or use Requests# to directly build index/delete requests
bulkRequest.add(client.prepareIndex("twitter", "tweet", "1")
        .setSource(jsonBuilder()
                    .startObject()
                        .field("user", "kimchy")
                        .field("postDate", new Date())
                        .field("message", "trying out Elasticsearch")
                    .endObject()
                  )
        );

bulkRequest.add(client.prepareIndex("twitter", "tweet", "2")
        .setSource(jsonBuilder()
                    .startObject()
                        .field("user", "kimchy")
                        .field("postDate", new Date())
                        .field("message", "another post")
                    .endObject()
                  )
        );

BulkResponse bulkResponse = bulkRequest.execute().actionGet();
if (bulkResponse.hasFailures()) {
    // process failures by iterating through each bulk response item
}
  • Search

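The search API lets you execute a search query and get back the hits that match it. It can run across multiple indices and multiple types. The query can be built with either the Java query API or the Java filter API. The body of the search request is built with SearchSourceBuilder.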

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

SearchResponse response = client.prepareSearch("index1", "index2")
        .setTypes("type1", "type2")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(QueryBuilders.termQuery("multi", "test"))             // Query
        .setPostFilter(FilterBuilders.rangeFilter("age").from(12).to(18))   // Filter
        .setFrom(0).setSize(60).setExplain(true)
        .execute()
        .actionGet();

Note that all parameters are optional. Here is the shortest possible form:

// MatchAll on the whole cluster with all default options
SearchResponse response = client.prepareSearch().execute().actionGet();

Using scrolls in Java

import static org.elasticsearch.index.query.FilterBuilders.*;
import static org.elasticsearch.index.query.QueryBuilders.*;

QueryBuilder qb = termQuery("multi", "test");

SearchResponse scrollResp = client.prepareSearch("test")
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))
        .setQuery(qb)
        .setSize(100).execute().actionGet(); //100 hits per shard will be returned for each scroll
//Scroll until no hits are returned
while (true) {
    for (SearchHit hit : scrollResp.getHits()) {
        //Handle the hit...
    }
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId()).setScroll(new TimeValue(600000)).execute().actionGet();
    //Break condition: No hits are returned
    if (scrollResp.getHits().getHits().length == 0) {
        break;
    }
}

Multi search API

SearchRequestBuilder srb1 = node.client()
    .prepareSearch().setQuery(QueryBuilders.queryString("elasticsearch")).setSize(1);
SearchRequestBuilder srb2 = node.client()
    .prepareSearch().setQuery(QueryBuilders.matchQuery("name", "kimchy")).setSize(1);

MultiSearchResponse sr = node.client().prepareMultiSearch()
        .add(srb1)
        .add(srb2)
        .execute().actionGet();

// You will get all individual responses from MultiSearchResponse#getResponses()
long nbHits = 0;
for (MultiSearchResponse.Item item : sr.getResponses()) {
    SearchResponse response = item.getResponse();
    nbHits += response.getHits().getTotalHits();
}

Using aggregations

The following example shows how to add two aggregations to a search.

SearchResponse sr = node.client().prepareSearch()
    .setQuery(QueryBuilders.matchAllQuery())
    .addAggregation(
            AggregationBuilders.terms("agg1").field("field")
    )
    .addAggregation(
            AggregationBuilders.dateHistogram("agg2")
                    .field("birth")
                    .interval(DateHistogram.Interval.YEAR)
    )
    .execute().actionGet();

// Get your aggregation results
Terms agg1 = sr.getAggregations().get("agg1");
DateHistogram agg2 = sr.getAggregations().get("agg2");

Using search templates

Define your template parameters as a Map<String,String>:

Map<String, String> template_params = new HashMap<>();
template_params.put("param_gender", "male");

You can use a template stored in the config/scripts directory. For example, suppose you have the following file, config/scripts/template_gender.mustache:

{
    "template" : {
        "query" : {
            "match" : {
                "gender" : "{{param_gender}}"
            }
        }
    }
}

It can be executed like this:

SearchResponse sr = client.prepareSearch()
        .setTemplateName("template_gender")
        .setTemplateType(ScriptService.ScriptType.FILE)
        .setTemplateParams(template_params)
        .get();
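Conceptually, the template parameters are substituted into the {{...}} placeholders before the query runs. The sketch below imitates only that substitution step with plain string replacement; the class TemplateSketch is hypothetical, and a real Mustache engine also handles escaping, sections, and missing keys, none of which is modeled here.

```java
import java.util.HashMap;
import java.util.Map;

class TemplateSketch {
    /** Replaces each {{key}} placeholder with the corresponding value from params. */
    static String render(String template, Map<String, String> params) {
        String rendered = template;
        for (Map.Entry<String, String> entry : params.entrySet()) {
            rendered = rendered.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return rendered;
    }

    /** Renders the gender template with param_gender set to "male". */
    static String demo() {
        Map<String, String> params = new HashMap<>();
        params.put("param_gender", "male");
        return render("{\"query\":{\"match\":{\"gender\":\"{{param_gender}}\"}}}", params);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```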

You can also store the template in a dedicated index named .scripts:

client.preparePutIndexedScript("mustache", "template_gender",
        "{\n" +
        "    \"template\" : {\n" +
        "        \"query\" : {\n" +
        "            \"match\" : {\n" +
        "                \"gender\" : \"{{param_gender}}\"\n" +
        "            }\n" +
        "        }\n" +
        "    }\n" +
        "}").get();

To use this indexed template, pass ScriptService.ScriptType.INDEXED:

SearchResponse sr = client.prepareSearch()
        .setTemplateName("template_gender")
        .setTemplateType(ScriptService.ScriptType.INDEXED)
        .setTemplateParams(template_params)
        .get();
  • Delete by query

The delete-by-query API lets you delete the documents that match a query from one or more indices and one or more types. Here is an example:

import static org.elasticsearch.index.query.FilterBuilders.*;
import static org.elasticsearch.index.query.QueryBuilders.*;

DeleteByQueryResponse response = client.prepareDeleteByQuery("test")
        .setQuery(termQuery("_type", "type1"))
        .execute()
        .actionGet();