2021/02/16

[Python] Full-text search in Japanese and English with Elasticsearch

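Overview

A requirement came up to run full-text search over article data in Japanese and English, so I wrote a test implementation to see what it takes to do this with Elasticsearch.

The version used is 7.11.0.
The implementation language is Python.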

Elasticsearch basics

Comparison with RDB

As far as I can tell, the concepts map roughly as follows.

RDB	Elasticsearch	Note
Database	Index
Table	Type	Deprecated in 7.0
Row	Document
Column	Field

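In Elasticsearch, the following design approach seems to work well.

Design one index per unit of search.
An index can be designed much like an RDB table. When the search method differs, for example when switching between languages, it seems better to split the data into separate indices.
- For example, if announcement data needs to be searched in multiple languages, split it by language: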
  - announcement_ja
  - announcement_en
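
The per-language split above could be resolved in application code roughly as follows. This is only a minimal sketch: the helper name `index_for`, the `SUPPORTED_LANGUAGES` set, and the fallback to English are assumptions made for illustration and are not part of the setup described in this post.

```python
# Hypothetical helper: map a base name and a language code to a per-language index.
SUPPORTED_LANGUAGES = {"ja", "en"}  # assumption: only these two indices exist


def index_for(base: str, lang: str, default: str = "en") -> str:
    """Return the index name for `lang`, falling back to `default` if unsupported."""
    normalized = lang.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        normalized = default
    return f"{base}_{normalized}"


print(index_for("announcement", "ja"))  # announcement_ja
print(index_for("announcement", "fr"))  # announcement_en (fallback)
```

Keeping the naming rule in one place means the search code only needs the request's language to pick the right analyzers.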

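Overview of how Elasticsearch searches

Think of the index at the back of a book. The important terms in the book are sorted and listed together with page numbers, so you can find where a term appears right away. Elasticsearch's full-text search uses a similar inverted index.

So it apparently searches in much the same way as the index at the back of a book.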
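
To make the book-index analogy concrete, here is a toy inverted index in plain Python. It is a rough sketch of the idea only: the function name `build_inverted_index` is made up for this example, terms are split on whitespace, and none of the analysis steps that Elasticsearch actually performs are modeled.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each lowercased, whitespace-separated term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "system maintenance announcement",
    2: "system upgrade schedule",
}
index = build_inverted_index(docs)
print(index["system"])       # [1, 2]
print(index["maintenance"])  # [1]
```

Looking up a term is a single dictionary access, which is why the search does not have to scan every document.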

Preparation with docker-compose

docker-compose.yaml

version: "3"
services:
  elasticsearch:
    build:
      context: elasticsearch
      dockerfile: Dockerfile
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      - "discovery.type=single-node"
    volumes:
      - ./.docker-volumes/elasticsearch:/usr/share/elasticsearch/data
  kibana:
    image: docker.elastic.co/kibana/kibana:7.11.0
    ports:
      - 5601:5601

elasticsearch/Dockerfile

FROM docker.elastic.co/elasticsearch/elasticsearch:7.11.0

# install Japanese (kuromoji) analysis plugin
# https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji.html
RUN elasticsearch-plugin install analysis-kuromoji

# install icu analysis plugin
# https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
RUN elasticsearch-plugin install analysis-icu

In production, I would like to deploy this on k8s (with a multi-cluster setup if needed).

Implementation in Python

client

As the client documentation states:

The client is thread safe and can be used in a multi threaded environment. Best practice is to create a single global instance of the client and use it throughout your application.

So it looks fine to keep the connection in a global variable and reuse it.

from elasticsearch import Elasticsearch

elasticsearch_connections = None


def init():
    global elasticsearch_connections
    elasticsearch_connections = Elasticsearch(["localhost"], maxsize=25)


def conn():
    global elasticsearch_connections
    if elasticsearch_connections is None:
        init()
    return elasticsearch_connections
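
One caveat with the lazy `conn()` above: if two threads call it at the same moment before the client exists, `init()` may run twice. That is probably harmless here, but a lock removes the possibility. The following is a minimal sketch of that idea; the `LazySingleton` class is a hypothetical helper written for this example, not something provided by the `elasticsearch` package.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Create a shared object on first use, at most once, even under concurrent calls."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._instance: Optional[T] = None
        self._lock = threading.Lock()

    def get(self) -> T:
        # Fast path: the instance already exists, no locking needed.
        if self._instance is None:
            with self._lock:
                # Re-check inside the lock so only one thread runs the factory.
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


# Assumed usage with the client from this post (requires the elasticsearch package):
# es_client = LazySingleton(lambda: Elasticsearch(["localhost"], maxsize=25))
# es = es_client.get()
```

Calling the factory explicitly once at application startup achieves the same result without a lock, so this is only worth it when initialization really has to be lazy.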

Japanese full-text search

import unittest


class TestElasticsearchDriver(unittest.TestCase):

    def test_ja_text_search(self):
        es = conn()  # the client defined earlier
        index_name = "announcement_contents_ja"
        index_body = {
            "settings": {
                "analysis": {
                    "char_filter": {
                        "normalize": {
                            "type": "icu_normalizer",
                            "name": "nfkc",
                            "mode": "compose"
                        }
                    },
                    "tokenizer": {
                        "ja_kuromoji_tokenizer": {
                            "mode": "search",
                            "type": "kuromoji_tokenizer",
                        },
                        "ja_ngram_tokenizer": {
                            "type": "ngram",
                            "min_gram": 2,
                            "max_gram": 2,
                            "token_chars": [
                                "letter",
                                "digit"
                            ],
                        },
                    },
                    "analyzer": {
                        "ja_kuromoji_index_analyzer": {
                            "type": "custom",
                            "char_filter": [
                                "normalize",
                                "html_strip"
                            ],
                            "tokenizer": "ja_kuromoji_tokenizer",
                            "filter": [
                                "kuromoji_baseform",
                                "kuromoji_part_of_speech",
                                "cjk_width",
                                "ja_stop",
                                "kuromoji_stemmer",
                                "lowercase"
                            ]
                        },
                        "ja_kuromoji_search_analyzer": {
                            "type": "custom",
                            "char_filter": [
                                "normalize",
                                "html_strip"
                            ],
                            "tokenizer": "ja_kuromoji_tokenizer",
                            "filter": [
                                "kuromoji_baseform",
                                "kuromoji_part_of_speech",
                                "cjk_width",
                                "ja_stop",
                                "kuromoji_stemmer",
                                "lowercase"
                            ]
                        },
                        "ja_ngram_index_analyzer": {
                            "type": "custom",
                            "char_filter": [
                                "normalize",
                                "html_strip"
                            ],
                            "tokenizer": "ja_ngram_tokenizer",
                            "filter": [
                                "lowercase"
                            ]
                        },
                        "ja_ngram_search_analyzer": {
                            "type": "custom",
                            "char_filter": [
                                "normalize",
                                "html_strip"
                            ],
                            "tokenizer": "ja_ngram_tokenizer",
                            "filter": [
                                "lowercase"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "id": {
                        "type": "long"
                    },
                    "title": {
                        "type": "text",
                        "search_analyzer": "ja_kuromoji_search_analyzer",
                        "analyzer": "ja_kuromoji_index_analyzer"
                    },
                    "subtitle": {
                        "type": "text",
                        "search_analyzer": "ja_kuromoji_search_analyzer",
                        "analyzer": "ja_kuromoji_index_analyzer"
                    },
                    "body": {
                        "type": "text",
                        "search_analyzer": "ja_kuromoji_search_analyzer",
                        "analyzer": "ja_kuromoji_index_analyzer",
                        "fields": {
                            "ngram": {
                                "type": "text",
                                "search_analyzer": "ja_ngram_search_analyzer",
                                "analyzer": "ja_ngram_index_analyzer"
                            }
                        }
                    }
                }
            }
        }
        if es.indices.exists(index=index_name):
            es.indices.delete(index=index_name)
        es.indices.create(index=index_name, body=index_body)
        indices = es.cat.indices(index=index_name, h="index").splitlines()
        # Verify the index is listed
        for index in indices:
            self.assertEqual(first=index_name, second=index)
        # Verify the index exists
        self.assertTrue(es.indices.exists(index=index_name))

        announcement_content = {
            "id": 1,
            "title": "システムメンテナンスのお知らせ",
            "subtitle": "2021年3月12日22:00よりシステムメンテナンスを実施します。",
            "body": "<h1>システムメンテナンスを実施します。</h1>",
        }
        # Register the document
        es.create(index=index_name, id=announcement_content["id"], body=announcement_content)

        es.indices.refresh(index=index_name)

        # Search the documents
        # see https://blog.chocolapod.net/momokan/entry/114
        search_body = {
          "query": {
            "bool": {
              "must": [
                {
                  "multi_match": {
                    "query": "システム",
                    "fields": [
                      "body.ngram^1"
                    ],
                    "type": "phrase"
                  }
                }
              ],
              "should": [
                {
                  "multi_match": {
                    "query": "システム",
                    "fields": [
                      "body^1"
                    ],
                    "type": "phrase"
                  }
                }
              ]
            }
          }
        }
        results = es.search(index=index_name, body=search_body, size=3)
        self.assertEqual(first=1, second=len(results["hits"]["hits"]))

        html_tag_should_be_ignored_query = {
          "query": {
            "bool": {
              "must": [
                {
                  "multi_match": {
                    "query": "h1",
                    "fields": [
                      "body.ngram^1"
                    ],
                    "type": "phrase"
                  }
                }
              ],
              "should": [
                {
                  "multi_match": {
                    "query": "h1",
                    "fields": [
                      "body^1"
                    ],
                    "type": "phrase"
                  }
                }
              ]
            }
          }
        }
        results = es.search(index=index_name, body=html_tag_should_be_ignored_query, size=3)
        self.assertEqual(first=0, second=len(results["hits"]["hits"]))
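
The `body.ngram` field above uses an ngram tokenizer with `min_gram` and `max_gram` both set to 2, so text is indexed as overlapping 2-character tokens. The sketch below imitates that in plain Python to show why the phrase query for "システム" matches. It is a simplified approximation: the function names are invented for this example, and the NFKC normalization, `html_strip`, and `lowercase` steps of the real analyzer are not modeled.

```python
from typing import List


def bigrams(text: str) -> List[str]:
    """Split on characters that are neither letters nor digits, then emit 2-character grams."""
    grams: List[str] = []
    run = ""
    for ch in text + " ":  # the trailing space flushes the last run
        if ch.isalpha() or ch.isdigit():
            run += ch
            continue
        grams.extend(run[i:i + 2] for i in range(len(run) - 1))
        run = ""
    return grams


def contains_phrase(doc_grams: List[str], query_grams: List[str]) -> bool:
    """Return True if query_grams appears as a consecutive run inside doc_grams."""
    n = len(query_grams)
    if n == 0:
        return False
    return any(doc_grams[i:i + n] == query_grams for i in range(len(doc_grams) - n + 1))


doc = bigrams("システムメンテナンスを実施します。")
print(bigrams("システム"))                        # ['シス', 'ステ', 'テム']
print(contains_phrase(doc, bigrams("システム")))  # True
```

Because the match only depends on character sequences, the ngram field can find terms that a morphological tokenizer such as kuromoji might split differently, which is presumably why the query pairs the two fields.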

English full-text search

import unittest


class TestElasticsearchDriver(unittest.TestCase):

    def test_en_text_search(self):
        es = conn()
        index_name = "announcement_contents_en"
        index_body = {
            "settings": {
                "analysis": {
                    "filter": {
                        "english_stop": {
                            "type": "stop",
                            "stopwords": "_english_"
                        },
                        "english_stemmer": {
                            "type": "stemmer",
                            "language": "english"
                        },
                        "english_possessive_stemmer": {
                            "type": "stemmer",
                            "language": "possessive_english"
                        }
                    },
                    "analyzer": {
                        "rebuilt_english": {
                            "tokenizer": "standard",
                            "filter": [
                                "english_possessive_stemmer",
                                "lowercase",
                                "english_stop",
                                "english_stemmer"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "id": {
                        "type": "long"
                    },
                    "title": {
                        "type": "text",
                        "search_analyzer": "rebuilt_english",
                        "analyzer": "rebuilt_english"
                    },
                    "subtitle": {
                        "type": "text",
                        "search_analyzer": "rebuilt_english",
                        "analyzer": "rebuilt_english"
                    },
                    "body": {
                        "type": "text",
                        "search_analyzer": "rebuilt_english",
                        "analyzer": "rebuilt_english"
                    }
                }
            }
        }
        if es.indices.exists(index=index_name):
            es.indices.delete(index=index_name)
        es.indices.create(index=index_name, body=index_body)
        indices = es.cat.indices(index=index_name, h="index").splitlines()
        # Verify the index is listed
        for index in indices:
            self.assertEqual(first=index_name, second=index)
        # Verify the index exists
        self.assertTrue(es.indices.exists(index=index_name))

        announcement_content = {
            "id": 1,
            "title": "System maintenance announcement",
            "subtitle": "We plan to have a short system maintenance from 12th.Mar.2021 22:00.",
            "body": "<h1>System maintenance schedule</h1>",
        }
        # Register the document
        es.create(index=index_name, id=announcement_content["id"], body=announcement_content)

        es.indices.refresh(index=index_name)

        # Search the documents
        # see https://blog.chocolapod.net/momokan/entry/114
        search_body = {
          "query": {
            "bool": {
              "must": [
                {
                  "multi_match": {
                    "query": "system",
                    "fields": [
                      "body^1"
                    ],
                    "type": "phrase"
                  }
                }
              ]
            }
          }
        }
        results = es.search(index=index_name, body=search_body, size=3)
        self.assertEqual(first=1, second=len(results["hits"]["hits"]))
        stop_ignored_query = {
            "query": {
                "bool": {
                    "must": [
                        {
                            "multi_match": {
                                "query": ".",
                                "fields": [
                                    "body^1"
                                ],
                                "type": "phrase"
                            }
                        }
                    ]
                }
            }
        }
        results = es.search(index=index_name, body=stop_ignored_query, size=3)
        self.assertEqual(first=0, second=len(results["hits"]["hits"]))
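
The four query bodies in the two tests differ only in the search text and the field lists, so they could be generated by a small helper. The following is one possible sketch; `phrase_query` is a hypothetical function written for this post, and it only builds the dictionary that is passed as `body` to `es.search`.

```python
from typing import Any, Dict, List, Optional


def phrase_query(text: str, must_fields: List[str],
                 should_fields: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build a bool query whose clauses are multi_match phrase queries on the given fields."""
    def clause(fields: List[str]) -> Dict[str, Any]:
        return {"multi_match": {"query": text, "fields": fields, "type": "phrase"}}

    bool_body: Dict[str, Any] = {"must": [clause(must_fields)]}
    if should_fields:
        # Optional clauses only influence scoring, not whether a document matches.
        bool_body["should"] = [clause(should_fields)]
    return {"query": {"bool": bool_body}}


# Same structure as the Japanese search_body used in the test above.
ja_body = phrase_query("システム", ["body.ngram^1"], ["body^1"])
# Same structure as the English search_body.
en_body = phrase_query("system", ["body^1"])
```

With this, a test could call something like `es.search(index=index_name, body=ja_body, size=3)` instead of repeating the nested dictionaries.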

That's all.