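Notes on the Ingest Pipeline needed to feed CloudFront access logs into OpenSearch (Elasticsearch).

An Ingest Pipeline is a feature that transforms a Document after it is POSTed and before it is indexed.
Here it lightly reshapes the incoming CloudFront access logs so that they are easier to work with in OpenSearch.

The command that registers the pipeline is below.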

curl -H "Content-Type: application/json" -XPUT 'https://{domain}/_ingest/pipeline/cflog' -d '{
  "processors": [
    {
      "grok": {
        "field": "message",
        "pattern_definitions": {
          "CFDATETIME" : "\\d{4}-\\d{2}-\\d{2}\\t\\d{2}:\\d{2}:\\d{2}"
        },
        "patterns":[ "%{CFDATETIME:timestamp}\t%{NOTSPACE:edge_location}\t%{NOTSPACE:response_bytes}\t%{NOTSPACE:client_ip}\t%{NOTSPACE:http_method}\t%{NOTSPACE:distribution_host}\t%{NOTSPACE:http_path}\t%{NOTSPACE:http_status}\t%{NOTSPACE:http_referer}\t%{NOTSPACE:agent}\t%{NOTSPACE:query_strings}\t%{NOTSPACE:cookies}\t%{NOTSPACE:edge_result_type}\t%{NOTSPACE:edge_request_id}\t%{NOTSPACE:host_header}\t%{NOTSPACE:protocol}\t%{NOTSPACE:request_bytes}\t%{NOTSPACE:response_time:float}\t%{NOTSPACE:x_forwarded_for}\t%{NOTSPACE:tls_protocol}\t%{NOTSPACE:tls_cipher}\t%{NOTSPACE:edge_response_result_type}\t%{NOTSPACE:http_version}\t%{NOTSPACE:fle_status}\t%{NOTSPACE:fle_encrypted_fields}\t%{NOTSPACE:request_port}\t%{NOTSPACE:time_to_first_byte:float}\t%{NOTSPACE:edge_detailed_result_type}\t%{NOTSPACE:content_type}\t%{NOTSPACE:content_length}\t%{NOTSPACE:content_range_start}\t%{NOTSPACE:content_range_end}" ],
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "user_agent": {
        "field": "agent",
        "target_field": "user_agent",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "agent",
        "ignore_failure": true
      }
    },
    {
      "gsub": {
        "field": "timestamp",
        "pattern": "(\\d{4}-\\d{2}-\\d{2})\\t(\\d{2}:\\d{2}:\\d{2})",
        "replacement": "$1T$2Z"
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [
          "date_time_no_millis"
        ]
      }
    }
  ]
}'
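
Before sending real logs, it can help to try the pipeline against a single sample document with the `_simulate` endpoint of the ingest API. The following is a minimal sketch that only builds the request body and makes no network call. The sample values and the name `build_simulate_body` are made up for illustration; a real standard-log line has 33 tab-separated fields in the same order.

```python
import json

# Hypothetical sample values, one per CloudFront standard-log field (33 in total).
SAMPLE_FIELDS = [
    "2019-12-04", "21:02:31", "LAX1", "392", "192.0.2.100", "GET",
    "d111111abcdef8.cloudfront.net", "/index.html", "200", "-",
    "Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)", "-", "-", "Hit",
    "SOX4xwn4XV6Q4rgb7XiVGOHms_BGlTAC4KyHmureZmBNrjGdRLiNIQ==",
    "d111111abcdef8.cloudfront.net", "https", "23", "0.001", "-",
    "TLSv1.2", "ECDHE-RSA-AES128-GCM-SHA256", "Hit", "HTTP/2.0", "-", "-",
    "11040", "0.001", "Hit", "text/html", "78", "-", "-",
]


def build_simulate_body(log_line: str) -> str:
    """Wrap one raw log line in the body shape that _simulate expects."""
    return json.dumps({"docs": [{"_source": {"message": log_line}}]})


body = build_simulate_body("\t".join(SAMPLE_FIELDS))
print(body)
```

Saving the printed JSON to a file (say `simulate.json`, a name chosen here) and sending it with `curl -H "Content-Type: application/json" -XPOST 'https://{domain}/_ingest/pipeline/cflog/_simulate' -d @simulate.json` should return the document as it would look after all processors have run.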

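The processors in an Ingest Pipeline run sequentially, in the order in which they are listed.
The sections below walk through this configuration from top to bottom.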

1. grok processor

www.elastic.co

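grok is a tool that splits text using predefined regular expressions and maps the pieces to the Fields you specify.
I used the following page as a reference when working out the patterns.
www.alibabacloud.com

For the CloudFront standard log format, I worked out the pattern while looking at the field definitions here.
docs.aws.amazon.com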

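The grok processor takes the incoming message Field and, following patterns, splits it and maps each piece to its own Field.

Sometimes none of the predefined regular expressions fits. In that case you can define your own custom regular expression, which is what pattern_definitions is for.

In the CloudFront standard log, the date and the time are written as two separate columns, so I define a regular expression named CFDATETIME to map both into a single Field called timestamp.
For everything else, ~~I couldn't be bothered~~ I have no particular requirements, so every column is parsed with NOTSPACE (a run of non-whitespace characters) and mapped to its Field.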
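
To make the mapping concrete, here is a rough Python approximation of what the grok pattern does to one log line. It is only a sketch: it is not how OpenSearch implements grok, `parse_cloudfront_line` is a name made up here, and the sample line is invented. The Field names and the two `float` conversions are taken from the pipeline above.

```python
import re

# Same idea as the custom CFDATETIME pattern: date, a tab, then time.
CFDATETIME = r"\d{4}-\d{2}-\d{2}\t\d{2}:\d{2}:\d{2}"
# grok's NOTSPACE is a run of non-whitespace characters.
NOTSPACE = r"\S+"

# Field names in the order used by the pipeline's grok pattern.
FIELDS = [
    "edge_location", "response_bytes", "client_ip", "http_method",
    "distribution_host", "http_path", "http_status", "http_referer",
    "agent", "query_strings", "cookies", "edge_result_type",
    "edge_request_id", "host_header", "protocol", "request_bytes",
    "response_time", "x_forwarded_for", "tls_protocol", "tls_cipher",
    "edge_response_result_type", "http_version", "fle_status",
    "fle_encrypted_fields", "request_port", "time_to_first_byte",
    "edge_detailed_result_type", "content_type", "content_length",
    "content_range_start", "content_range_end",
]
# Fields the pipeline converts with the ":float" suffix.
FLOAT_FIELDS = {"response_time", "time_to_first_byte"}

PATTERN = re.compile(
    rf"(?P<timestamp>{CFDATETIME})"
    + "".join(rf"\t(?P<{name}>{NOTSPACE})" for name in FIELDS)
)


def parse_cloudfront_line(line: str) -> dict:
    """Split one tab-separated standard-log line into named fields."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("line does not look like a CloudFront standard log entry")
    doc = match.groupdict()
    for name in FLOAT_FIELDS:
        doc[name] = float(doc[name])
    return doc


# Invented sample line with 33 tab-separated columns.
sample_line = "\t".join([
    "2019-12-04", "21:02:31", "LAX1", "392", "192.0.2.100", "GET",
    "d111111abcdef8.cloudfront.net", "/index.html", "200", "-",
    "Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)", "-", "-", "Hit",
    "SOX4xwn4XV6Q4rgb7XiVGOHms_BGlTAC4KyHmureZmBNrjGdRLiNIQ==",
    "d111111abcdef8.cloudfront.net", "https", "23", "0.001", "-",
    "TLSv1.2", "ECDHE-RSA-AES128-GCM-SHA256", "Hit", "HTTP/2.0", "-", "-",
    "11040", "0.001", "Hit", "text/html", "78", "-", "-",
])

doc = parse_cloudfront_line(sample_line)
print(doc["timestamp"], doc["http_status"], doc["response_time"])
```

Note that at this stage timestamp still contains the tab between date and time; the gsub processor later in the pipeline deals with that.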

2. remove processor

www.elastic.co

Once the grok processor has split the message Field into individual Fields, message itself is no longer needed, so it is removed.

3. user_agent processor

www.elastic.co

The user_agent processor treats the specified Field as a User-Agent string, parses it, and maps the result to more detailed Fields (here, under user_agent).

4. remove processor

After the user_agent processor has run, the raw agent Field is no longer needed, so it is removed.

5. gsub processor

www.elastic.co

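The string that grok mapped into timestamp has the date and the time separated by a tab, so as it stands OpenSearch (Elasticsearch) will not recognize it as a time field.
To bring it into ISO8601 form, timestamp is rewritten as yyyy-MM-ddTHH:mm:ssZ: the tab becomes T and a trailing Z is appended, since CloudFront records these times in UTC.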
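
The same substitution can be tried locally. This is a small Python sketch of the equivalent regex replacement, not the processor itself; the helper name `to_iso8601` and the sample value are made up here. Python writes the group references as `\g<1>` and `\g<2>` where the pipeline uses `$1` and `$2`.

```python
import re

# Same regular expression as the gsub processor's "pattern".
GSUB_PATTERN = r"(\d{4}-\d{2}-\d{2})\t(\d{2}:\d{2}:\d{2})"


def to_iso8601(timestamp: str) -> str:
    """Replace the tab with 'T' and append 'Z', like the "$1T$2Z" replacement."""
    return re.sub(GSUB_PATTERN, r"\g<1>T\g<2>Z", timestamp)


converted = to_iso8601("2019-12-04\t21:02:31")
print(converted)  # 2019-12-04T21:02:31Z
```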

6. date processor

www.elastic.co

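This parses the ISO8601 string in the timestamp Field as a date. Because no target_field is specified, the parsed value is written to the processor's default target, @timestamp.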
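
As a rough local equivalent, the sketch below parses the same kind of string with Python's standard library and stores the result under @timestamp, mirroring the date processor's default target. The `strptime` format is my approximation of date_time_no_millis, the function name is made up here, and the `%z` directive accepting a literal `Z` assumes Python 3.7 or later.

```python
from datetime import datetime, timezone


def parse_date_time_no_millis(value: str) -> datetime:
    """Parse strings such as 2019-12-04T21:02:31Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


# Hypothetical document as it would look after the gsub step.
doc = {"timestamp": "2019-12-04T21:02:31Z"}
doc["@timestamp"] = parse_date_time_no_millis(doc["timestamp"])
print(doc["@timestamp"].isoformat())
```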

ぬーログ
