Search (Pt 1): A Gentle Introduction

The ability to search the entire web in less than a second for whatever we fancy knowing is one of the greatest achievements of recent history. But how does it work? What are its building blocks? And, most importantly, can we hack together our own version of it? That last question matters because search is inevitably personal: it is all about our focus, our preferences, the resources at our disposal and even our emotions. Plus, it’s really cool!

In this three-part series, I will:

  • Pt 1. Provide a gentle introduction to search, using both Google and Elasticsearch as examples

  • Pt 2. Explain some state-of-the-art NLP techniques, compare their results to traditional approaches and discuss the pros and cons

  • Pt 3. Provide a hacker's guide to building your own search engine with Elasticsearch, containing 1 million news headlines and employing state-of-the-art NLP for enhanced semantic search

Search - in a nutshell

When we talk about search nowadays we often mean semantic search. What is semantic search, you ask? Imagine searching for the phrase “virus threat”. A simple lexical search will come back with documents that contain exactly those words, ranked in a particular order of importance. Documents about "security threat" will also be considered relevant, as they contain part of the query.

Semantic search, on the other hand, is also able to pick up on the ideas of “disease”, “infection” and “corona”. We get a far wider and potentially more accurate search that reflects the "meaning" of what we are looking for instead of sticking to its specific keywords. In this section, I have often drawn on ideas from the work of Bast, Hannah; Buchhold, Björn; Haussmann, Elmar (2016), “Semantic search on text and knowledge bases”. In that text they state:

Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query

The diagram below shows the core components of a search engine of this kind.

[Figure: core components of a semantic search engine. Image by the author.]

We focus on semantic search on text with some additional annotations (such as names, dates, links, etc.), as opposed to, say, search on structured databases. This is essentially the typical web search we use all the time.

Please note that this article deals with search that produces lists of relevant documents or individual facts. It does not cover additional steps such as ranking based on source quality (e.g. PageRank), results summarisation, etc.

Query Types

Queries can be broken down into:

  • Keyword - these are shorthand searches: not proper sentences, but sets of keywords where the choice of words, and sometimes their order, carries semantic meaning. For instance: Neil Armstrong date of birth, pasta recipe under 10mins

  • Structured/Semi-structured - special syntax used in a query. It can represent either the full query or just refinements to it. For instance, it might restrict the search to a specific source, e.g. news from AP only. In other cases it might restrict the language of the results or state mandatory elements of the query

  • Natural language & natural questions - fully or mostly grammatically formed questions: “What is Neil Armstrong’s date of birth?”. This is the most natural way to interact with search; however, it also poses many difficulties. For instance, we could be asking multiple questions at once ("Where can I park and what are the opening hours?") or posing philosophical queries instead of fact-finding ones ("What is the meaning of life?"). As the examples show, the scope of questions is quite broad. While they all make sense to us, algorithms tend to specialise in narrow tasks, hence the need for various algorithms working in concert that can determine which results are appropriate.
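To make the semi-structured case a little more concrete, here is a minimal Python sketch of separating refinements from free-text keywords. The `key:value` syntax and the `parse_query` helper are assumptions made up for this illustration; every real search engine defines its own refinement syntax.

```python
import re

# Hypothetical "key:value" refinement syntax, invented for this sketch.
FILTER_PATTERN = re.compile(r"(\w+):(\S+)")

def parse_query(raw):
    """Split a raw query into structured filters and free-text keywords."""
    # Every key:value pair becomes a filter on the results.
    filters = {key.lower(): value for key, value in FILTER_PATTERN.findall(raw)}
    # Whatever remains after removing the pairs is treated as plain keywords.
    keywords = FILTER_PATTERN.sub("", raw).split()
    return {"filters": filters, "keywords": keywords}

parsed = parse_query("virus threat source:AP lang:en")
print(parsed)
# {'filters': {'source': 'AP', 'lang': 'en'}, 'keywords': ['virus', 'threat']}
```

The keywords would then go to the search algorithm, while the filters only narrow down which documents are eligible.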

Query processing

These are the different types of transformations the system might need to perform on the original entry before passing it on to the search algorithm. They could be:

  • Extractive - specific names, entities and places are extracted to further help the search, and are compared with values in the document metadata or with knowledge bases. For instance, in the query Neil Armstrong below, the information box on the left comes from Google’s Knowledge Base, because the query was successfully matched with one of its entries

[Image: Google results for the query Neil Armstrong, including the information box]

  • Filters and constraints - where a semi-structured query specifies restrictions on the results, e.g. only news in English, the restricted scope is translated into terms the search engine understands

  • Other transformations - modifications to the search itself, e.g. for wildcard or fuzzy search, in which case the original query may be transformed into one or many variants. For instance, with fuzzy search, we might allow some number of character modifications to the keywords entered until we find the word most likely intended. See below the result in Google when I look for Neil Armslong. Even though a gentleman by the name of Armslong probably exists and is important, the system considers it far more likely that we made a typo.

[Image: Google results for the query Neil Armslong]
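The fuzzy matching idea above can be sketched with plain Levenshtein edit distance. This is only an illustration: the vocabulary, the `fuzzy_correct` helper and the two-edit limit are assumptions, and nothing here claims to be how Google handles typos.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]

def fuzzy_correct(word, vocabulary, max_edits=2):
    """Return the closest vocabulary word within max_edits, otherwise the word unchanged."""
    best = min(vocabulary, key=lambda candidate: edit_distance(word.lower(), candidate.lower()))
    if edit_distance(word.lower(), best.lower()) <= max_edits:
        return best
    return word

VOCABULARY = ["Armstrong", "Aldrin", "Collins"]  # hypothetical list of known terms
print(fuzzy_correct("Armslong", VOCABULARY))  # Armstrong
```

"Armslong" is two edits away from "Armstrong" (substitute l with t, insert r), so it falls inside the assumed limit and gets corrected.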

Search and Rank

Finally, one or more search & ranking approaches may be used. These will either find an answer or return a ranked list of results matching the query. Ranking makes sure that more pertinent results appear higher up. Those might be results that mention the search keywords more often than other results, or that contain information relevant to the query in their title or opening paragraphs. The approaches are:

  • Keyword searches - the most common type, where exact or very close to literal matches are made. The predominant part of search is still done this way. What makes these searches semantic is that they use term occurrences to rank documents that appear more relevant to a keyword higher, and they recognise when some of the keywords are rare, ranking hits on those higher than hits on more 'common' words in the query. A number of algorithms are available: BM25, tf-idf, various Learning to Rank methods, etc.

  • Contextual searches - by this I mean any search based on textual embeddings that attempts to use the query in its entirety and find contextually relevant results, as opposed to relying on any specific keywords or phrases individually. We will focus on this a bit more later, as it is central to this series. Some recent advances in NLP techniques will help us improve the quality of search significantly.

Let's quickly have a face-off: keyword vs contextual search. Searching news headlines for “virus threat”, the results on the left come from a keyword approach while the ones on the right come from contextual search. The latter gives us a number of results that do not match any search term directly, like example 5 on the right: “WHO highlights dangers of vector borne diseases”

[Image: keyword search results (left) next to contextual search results (right) for the same query]
  • Knowledge base - as seen above, a query can be matched directly to entries in a knowledge base, which are then used to generate a result. More advanced techniques can also transform a keyword or natural language query into a query against a knowledge base. For instance, ‘Astronauts on the moon’ returns another knowledge base result

[Image: Google knowledge base result for ‘Astronauts on the moon’]

  • Question-answering - traditionally, search engines have used modifications from the processing step to transform a natural question into a more keyword-like query and process it as such. More recently, advances in NLP have produced algorithms that perform strongly at pinpointing directly whether and where an answer to a natural question can be found within a specific document. Unlike the other search paradigms in this list, which return a list of documents, question answering focuses on providing an actual (single) answer. Here is what happens when we ask about the moon landing as a natural question. In addition to a list of results we get a specific answer.

[Image: Google answer to a natural question about the moon landing]

However, the technique works similarly for a not-so-natural question such as ‘year of first moon landing’

[Image: Google answer for ‘year of first moon landing’]

Finally, slight modifications to the query can break the result: we no longer get an explicit answer, and may even land somewhere else completely

[Image: Google results for a slightly modified version of the query]
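To illustrate how a keyword ranker can favour rare query terms, here is a small, self-contained sketch of BM25 scoring, one of the algorithms named in the keyword search item above. The headlines are made up, the tokenisation is a naive whitespace split, and `k1` and `b` are set to commonly used values; treat it as a teaching sketch rather than what any production engine runs.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    avg_len = sum(len(doc) for doc in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_counts:
                continue
            # Rare terms receive a larger inverse document frequency weight.
            idf = math.log(1 + (len(tokenized) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_counts[term]
            # Term frequency saturates with k1 and is normalised by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

HEADLINES = [  # made-up headlines for illustration
    "new virus threat detected in hospital network",
    "security threat closes airport terminal",
    "local team wins the cup final",
]
scores = bm25_scores("virus threat", HEADLINES)
for headline, score in sorted(zip(HEADLINES, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {headline}")
```

The first headline matches both terms and ranks highest; the second matches only the more common term "threat"; the third matches nothing and scores zero. A contextual search would instead compare embeddings of the query and the headlines, which is the subject of the later parts of this series.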

Putting it all together

To summarize, any query type can pass through a number of different modifications and be run through any of a number of search mechanisms to produce candidate results. Each of these approaches expresses a confidence in its results; however, confidence scores from different algorithms may not be comparable with one another. At this stage, a further decision algorithm determines which answers are well suited and "confident" enough to be passed on to the user as the final list of answers.

[Figure: the full pipeline from query types through query processing to search and ranking. Image by the author.]
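As a rough illustration of that final decision step, the sketch below rescales each engine's scores to a common 0 to 1 range before merging and thresholding. This is just one simple possibility that I am assuming for the example; the document names, scores and the 0.5 threshold are all invented, and real systems typically use far more elaborate logic.

```python
def normalise(results):
    """Rescale one engine's scores to the 0..1 range (min-max normalisation)."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - low) / (high - low)) for doc, score in results]

def merge(result_sets, threshold=0.5):
    """Keep each document's best normalised score and drop weak candidates."""
    best = {}
    for results in result_sets:
        for doc, score in normalise(results):
            best[doc] = max(score, best.get(doc, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(doc, score) for doc, score in ranked if score >= threshold]

keyword_hits = [("doc_a", 12.4), ("doc_b", 7.1), ("doc_c", 2.0)]       # e.g. BM25-style scores
contextual_hits = [("doc_d", 0.91), ("doc_a", 0.88), ("doc_e", 0.35)]  # e.g. cosine similarities
print(merge([keyword_hits, contextual_hits]))
# [('doc_a', 1.0), ('doc_d', 1.0)]
```

Min-max scaling only makes scores comparable in a crude sense (the top hit of every engine always gets 1.0), which is exactly why this decision step is a hard problem in practice.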

A functioning search engine can use any number of the options at each of the three steps of the process, as long as it has at least one of each. We have seen that Google uses most of them under the hood, but what about making our own...

I should reveal my secret agenda…

I actually wanted to hack my own search engine all along.

The tool of choice is Elasticsearch, primarily because it comes with a lot of search features out of the box. At the same time, it is very well supported, and its open source features get you a long way.

Here is a diagram of what we get out of the box with Elastic for the purposes of this discussion. Note that you should not rely on this summary being complete, as I have a specific objective in mind.

[Figure: search features available out of the box in Elasticsearch. Image by the author.]

You will notice that Elastic can handle any query type (even though by default they will all be handled by a keyword search mechanism) and lets you further modify your queries into fuzzy, wildcard and quite a few other variants. If the data allows it, you can also apply any number of structured conditions to the results: date of publishing, source, etc.
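As a hedged illustration, the snippet below builds the JSON body of an Elasticsearch `bool` query that combines a fuzzy `match` on the text with structured `filter` conditions. The field names `headline`, `source` and `publish_date` and the helper `build_search_body` are assumptions for this sketch (your index mapping will differ, and the `term` filter assumes `source` is mapped as a keyword field); the body would be sent to the index's `_search` endpoint.

```python
import json

def build_search_body(text, source=None, published_after=None):
    """Build an Elasticsearch bool query: fuzzy keyword match plus optional filters."""
    filters = []
    if source is not None:
        # Exact match on a keyword field (hypothetical field name).
        filters.append({"term": {"source": source}})
    if published_after is not None:
        # Only documents published on or after the given date (hypothetical field name).
        filters.append({"range": {"publish_date": {"gte": published_after}}})
    return {
        "query": {
            "bool": {
                # "fuzziness": "AUTO" tolerates small typos in the query terms.
                "must": [{"match": {"headline": {"query": text, "fuzziness": "AUTO"}}}],
                # Filters narrow the result set without affecting the relevance score.
                "filter": filters,
            }
        }
    }

body = build_search_body("virus threat", source="AP", published_after="2020-01-01")
print(json.dumps(body, indent=2))
```

Only the `must` clause contributes to ranking here; the filters simply restrict which documents are eligible.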

In terms of search & ranking, there is a lot of flexibility within keyword search but not much beyond it.

Overall, this is a very impressive list of out-of-the-box features. As it turns out, with some extra legwork we can even add contextual search, which is what we will do next...

Conclusion

We have explored the major building blocks of search, how they work together and their impact on search results. Different types of queries may trigger different search algorithms, so the result can be a mixture of approaches. Looking at examples from Google, we see that the same user experience (typing into a simple text box) is serviced by a number of techniques.

In the following articles, we will compare contextual and keyword search side by side (Pt 2) and finally combine a few different tools to extend Elasticsearch with additional contextual capabilities and build our own semantic search engine (Pt 3).

Btw, Neil Armslong

I hope you enjoyed reading this. We will be back with more in Pt 2 next week. In the meantime, if you feel like saying hi or would just like to tell me I am wrong, feel free to reach out via LinkedIn.

Special thanks to Rich Knuszka for valuable feedback.

Please note that I have no affiliation with Google or Elasticsearch, and the opinions and analysis are my own.

Source: https://towardsdatascience.com/search-pt-1-a-gentle-introduction-335656c0f814