source: http://ju.outofmemory.cn/entry/53571
On writing regexes for the Unicode Chinese-character ranges
Chronos 2013-10-21 7952 reads
Tags: regex, ranges, unicode, Chinese characters
I had always used \u4e00-\u9fa5, but today it failed to match Chinese punctuation. After checking the relevant material, it turns out Unicode has quite a few other ranges related to Chinese.
Character range table
1. Standard CJK ideographs
Range: \u3400-\u4DB5, \u4E00-\u9FA5, \u9FA6-\u9FBB, \uF900-\uFA2D, \uFA30-\uFA6A, \uFA70-\uFAD9. Note: this is several ranges; apart from \u4e00-\u9fa5 they are not very commonly used. Reference: http://www.unicode.org/Public/UNIDATA/Unihan.html
2. Full-width ASCII, full-width Chinese/English punctuation, half-width katakana, half-width hiragana, half-width hangul letters
Range: \uFF00-\uFFEF Reference: http://www.unicode.org/charts/PDF/UFF00.pdf
3. CJK radicals supplement
Range: \u2E80-\u2EFF Reference: http://www.unicode.org/charts/PDF/U2E80.pdf
4. CJK symbols and punctuation
Range: \u3000-\u303F Reference: http://www.unicode.org/charts/PDF/U3000.pdf
5. CJK strokes
Range: \u31C0-\u31EF Reference: http://www.unicode.org/charts/PDF/U31C0.pdf
6. Kangxi radicals
Range: \u2F00-\u2FDF Reference: http://www.unicode.org/charts/PDF/U2F00.pdf
7. Ideographic description characters
Range: \u2FF0-\u2FFF Reference: http://www.unicode.org/charts/PDF/U2FF0.pdf
8. Bopomofo
Range: \u3100-\u312F Reference: http://www.unicode.org/charts/PDF/U3100.pdf
9. Bopomofo extended (Minnan, Hakka)
Range: \u31A0-\u31BF Reference: http://www.unicode.org/charts/PDF/U31A0.pdf
10. Japanese hiragana
Range: \u3040-\u309F Reference: http://www.unicode.org/charts/PDF/U3040.pdf
11. Japanese katakana
Range: \u30A0-\u30FF Reference: http://www.unicode.org/charts/PDF/U30A0.pdf
12. Katakana phonetic extensions
Range: \u31F0-\u31FF Reference: http://www.unicode.org/charts/PDF/U31F0.pdf
13. Hangul syllables
Range: \uAC00-\uD7AF Reference: http://www.unicode.org/charts/PDF/UAC00.pdf
14. Hangul jamo
Range: \u1100-\u11FF Reference: http://www.unicode.org/charts/PDF/U1100.pdf
15. Hangul compatibility jamo
Range: \u3130-\u318F Reference: http://www.unicode.org/charts/PDF/U3130.pdf
16. Yijing hexagram symbols
Range: \u4DC0-\u4DFF Reference: http://www.unicode.org/charts/PDF/U4DC0.pdf
17. Yi syllables
Range: \uA000-\uA48F Reference: http://www.unicode.org/charts/PDF/UA000.pdf
18. Yi radicals
Range: \uA490-\uA4CF Reference: http://www.unicode.org/charts/PDF/UA490.pdf
19. Braille patterns
Range: \u2800-\u28FF Reference: http://www.unicode.org/charts/PDF/U2800.pdf
20. Enclosed CJK letters and months
Range: \u3200-\u32FF Reference: http://www.unicode.org/charts/PDF/U3200.pdf
21. CJK compatibility symbols (squared date/unit forms)
Range: \u3300-\u33FF Reference: http://www.unicode.org/charts/PDF/U3300.pdf
22. Dingbats (not CJK-specific)
Range: \u2700-\u27BF Reference: http://www.unicode.org/charts/PDF/U2700.pdf
23. Miscellaneous symbols (not CJK-specific)
Range: \u2600-\u26FF Reference: http://www.unicode.org/charts/PDF/U2600.pdf
24. Vertical forms (Chinese vertical punctuation)
Range: \uFE10-\uFE1F Reference: http://www.unicode.org/charts/PDF/UFE10.pdf
25. CJK compatibility forms (vertical variants, underscores, ideographic comma)
Range: \uFE30-\uFE4F Reference: http://www.unicode.org/charts/PDF/UFE30.pdf
Improved matching expressions
[\u3400-\u4DB5\u4E00-\u9FA5\u9FA6-\u9FBB\uF900-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9\uFF00-\uFFEF\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF] (note: this one covers most needs)
[\u3400-\u4DB5\u4E00-\u9FA5\u9FA6-\u9FBB\uF900-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9\uFF00-\uFFEF\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u2F00-\u2FDF\u2FF0-\u2FFF\u3100-\u312F\u31A0-\u31BF\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\u4DC0-\u4DFF\uA000-\uA48F\uA490-\uA4CF\u2800-\u28FF\u3200-\u32FF\u3300-\u33FF\u2700-\u27BF\u2600-\u26FF\uFE10-\uFE1F\uFE30-\uFE4F] (note: this is the full version)
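For a quick sanity check, here is a minimal Python sketch (not from the original post; the sample strings are made up) that uses the shorter character class above to pick out hanzi and full-width punctuation:

# -*- coding: utf-8 -*-
import re

# The "covers most needs" class above: CJK ideographs, full-width forms,
# CJK radicals, CJK punctuation and CJK strokes.
CJK_RE = re.compile(
    u'[\u3400-\u4DB5\u4E00-\u9FA5\u9FA6-\u9FBB\uF900-\uFA2D\uFA30-\uFA6A'
    u'\uFA70-\uFAD9\uFF00-\uFFEF\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF]'
)

print(CJK_RE.findall(u'hello 你好，world！'))  # picks up the hanzi plus the full-width ， and ！
print(CJK_RE.search(u'plain ascii only'))      # None, nothing falls in the CJK ranges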
Wednesday, October 11, 2017
Monday, August 21, 2017
recursive dns resolution
recursive dns resolution, [gist link], requirements:
- dns
- clientsubnetoption
# -*- coding: utf8 -*-
"""
@author: boyxuper
@date: 2017/8/21 16:07
"""
import random
import socket
from collections import defaultdict
from itertools import chain

import clientsubnetoption
import dns
import dns.name
import dns.message
import dns.rdatatype
import dns.resolver

_logger = lambda *args: None


def logger_full(fmt, *args):
    if not args:
        print fmt
        return
    print fmt % args


socket.inet_pton = lambda _, p: socket.inet_aton(p)
socket.inet_ntop = lambda _, n: socket.inet_ntoa(n)


def retry(n=3, exc_list=()):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            tried = 1
            while tried <= n:
                try:
                    return fn(*args, **kwargs)
                except exc_list as err:
                    print 'RETRY %s/%s, [%s]' % (tried, n, err)
                    tried += 1
            else:
                raise err
        return wrapper
    return decorator


@retry(n=5, exc_list=(dns.exception.Timeout, dns.resolver.NoAnswer, ))
def query_dns0(domain, ns_ip, client_ip=None, **kwargs):
    kwargs.setdefault('rdtype', dns.rdatatype.A)
    message = dns.message.make_query(domain, **kwargs)
    if client_ip:
        message.use_edns(options=[clientsubnetoption.ClientSubnetOption(client_ip)])
    if callable(ns_ip):
        ns_ip = ns_ip()
    return dns.query.udp(message, ns_ip, timeout=TIMEOUT), ns_ip


# only authority can be cached
KNOWN_AUTHORITIES = defaultdict(list, **{
    '.': ['192.5.5.241', '199.7.83.42', '192.58.128.30', '192.36.148.17'],
    'net.': [
        '192.48.79.30', '192.35.51.30', '192.52.178.30', '192.5.6.30',
        '192.26.92.30', '192.41.162.30', '192.12.94.30', '192.54.112.30',
        '192.31.80.30', '192.43.172.30', '192.33.14.30', '192.55.83.30', '192.42.93.30'],
    'com.': [
        '192.33.14.30', '192.12.94.30', '192.55.83.30', '192.52.178.30',
        '192.26.92.30', '192.48.79.30', '192.5.6.30', '192.43.172.30', '192.42.93.30',
        '192.31.80.30', '192.54.112.30', '192.35.51.30', '192.41.162.30'],
    'cn.': [
        '203.119.29.1', '203.119.27.1', '203.119.28.1', '203.119.26.1',
        '202.112.0.44', '203.119.25.1'],
})
NS_IPS = defaultdict(set)
TIMEOUT = 2


def locate_nearest_authority(dns_name):
    """
    :type dns_name: dns.name.Name
    """
    for depth in range(len(dns_name), 0, -1):
        _, sub = dns_name.split(depth)
        if sub.to_text() in KNOWN_AUTHORITIES:
            return sub, KNOWN_AUTHORITIES[sub.to_text()]
    assert False, 'impossible'


def is_ip(s):
    try:
        socket.inet_aton(s)
    except:
        return False
    else:
        return True


def _authority_iterator(authority_ips, client_ip, logger=_logger):
    pos = random.randint(0, len(authority_ips))  # for short ips
    while True:
        non_ips = []
        for name in chain(authority_ips[pos:], authority_ips[:pos]):
            if is_ip(name):
                yield name
            else:
                non_ips.append(name)
        for name in non_ips:
            if name in NS_IPS:
                server_ips = list(NS_IPS[name])
            else:
                _, server_ips, _ = resolve_A(name, client_ip=client_ip, logger=logger)
                NS_IPS[name] = set(server_ips)
            yield server_ips[pos % len(server_ips)]


@retry(n=5, exc_list=(dns.exception.Timeout, dns.resolver.NoAnswer, ))
def query_dns(dns_name, client_ip=None, logger=_logger, **kwargs):
    sub, authority_ips = locate_nearest_authority(dns_name)
    iterator = _authority_iterator(authority_ips, client_ip, logger=_logger)
    logger('querying %s @NS"%s": %r', dns_name, sub, authority_ips)
    response, ns_ip = query_dns0(dns_name, ns_ip=iterator.next, client_ip=client_ip, **kwargs)
    resp_code = response.rcode()
    if resp_code != dns.rcode.NOERROR:
        if resp_code == dns.rcode.NXDOMAIN:
            raise Exception('%s does not exist on %s.' % (sub, ns_ip))
        else:
            raise Exception('Error %s' % dns.rcode.to_text(resp_code))
    return response


def resolve_A(domain, client_ip, logger=_logger):
    """
    :return: answers
    response.authority.__len__() == 1
    response.authority.name == {Name}a.shifen.com.
    response.authority.items:
        0 = {NS} ns2.a.shifen.com.
        1 = {NS} ns3.a.shifen.com.
        2 = {NS} ns4.a.shifen.com.
        3 = {NS} ns1.a.shifen.com.
        4 = {NS} ns5.a.shifen.com.
    """
    if not domain.endswith('.'):
        domain += '.'
    dns_name = dns.name.from_text(domain)
    while True:
        response = query_dns(dns_name, client_ip=client_ip, logger=logger)
        # logger(response)
        instant_ips = defaultdict(list)
        final_answer = None
        # authority contains NS should be processed last, so it can leverage the instant IPs
        for answer in chain(response.additional, response.answer, response.authority):
            answer_name = answer.name.to_text()
            for item in answer.items:
                item_text = item.to_text()
                if item.rdtype == dns.rdatatype.A:
                    if answer_name == domain:
                        final_answer = answer
                    instant_ips[answer_name].append(item_text)
                elif item.rdtype == dns.rdatatype.NS:
                    if domain not in instant_ips:
                        logger('got NS: %s -> %s', answer_name, item_text)
                        if item_text in instant_ips:
                            authority_ips = instant_ips[item_text]
                            NS_IPS[item_text].update(authority_ips)
                        else:
                            authority_ips = [item_text]
                        KNOWN_AUTHORITIES[answer_name].extend(authority_ips)
                elif item.rdtype == dns.rdatatype.CNAME:
                    logger('CNAME: %s -> %s', domain, item_text)
                    domain, dns_name = item_text, dns.name.from_text(item_text)
        if domain in instant_ips:
            return domain, instant_ips[domain], final_answer.ttl


if __name__ == '__main__':
    log = _logger
    log = logger_full
    print resolve_A('www.baidu.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('www.taobao.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('www.google.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('dl.tiku.zhan.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('ns3.dnsv4.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('www.facebook.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('api.taoqian123.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('hotsoon.snssdk.com', client_ip='66.103.188.117', logger=log)
Thursday, April 6, 2017
Integrating StorageClass with Ceph RBD on Kubernetes 1.6
Install ceph-common on every k8s node and make sure the rbd command is available.
Install the Secret (cluster level)
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: kubernetes.io/rbd  # very important: the StorageClass will not recognize the secret without it; the docs example omits this, but the examples repo includes it
data:
  key: QVFDZ2ZOOVkza3VyR3hBQXNYYmx6Mi9xVlBZNzN0VWZvMUlFRlE9PQ== # ceph auth get-key client.admin | base64
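If you prefer to generate the data.key value programmatically, here is a small Python sketch equivalent to the `ceph auth get-key client.admin | base64` one-liner in the comment (assumes it runs on a host with ceph admin access):

import base64
import subprocess

# Fetch the admin keyring entry and base64-encode it for the Secret's data.key field.
raw_key = subprocess.check_output(['ceph', 'auth', 'get-key', 'client.admin'])
print(base64.b64encode(raw_key.strip()).decode('ascii'))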
Install the StorageClass (cluster level)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.1.4.208:6789
  adminId: admin
  adminSecretName: ceph-secret
  pool: rbd
  userId: admin
  userSecretName: ceph-secret
Install the PVC (namespace level)
- Requests a PV from the configured storageClass
- Once bound, the PV acts as the volume; a pod mounts it by referencing the PVC
- Currently only the size can be requested; with code changes things like fsType could also be specified (not supported today)
- The mount is done via docker HostConfig.Binds, so:
- Running mount on the node shows: /dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-kubernetes-dynamic-pvc-59474c44-1a77-11e7-8b1a-fa163e0dfa6d type ext4 (rw,relatime,stripe=4096,data=ordered)
- HostConfig.Binds:"/var/lib/kubelet/pods/b84a50db-1aa7-11e7-a0c8-fa163e0dfa6d/volumes/kubernetes.io~rbd/pvc-3c34a5f2-1a95-11e7-a0c8-fa163e0dfa6d:/mnt/rbd-rox"
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-rox
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 100Mi
  storageClassName: fast
Install the pod:
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test5
spec:
  containers:
    - name: rbd-rw
      image: docker.cloudin.com/cloudin/alpine:latest
      command: ["/bin/sleep", "10000"]
      volumeMounts:
        - mountPath: "/mnt/rbd"
          name: rbd-rox  # a subPath can also be set to mount a subdirectory of this PVC
  volumes:
    - name: rbd-rox
      persistentVolumeClaim:
        claimName: pvc-rox
Troubleshooting:
- Check image locks: rbd lock list aaa.img --pool kube
- Check the rbd map status on the current node: rbd showmapped
- OSD nodes can be backed by a directory (on the native filesystem) or by a block device; if the native fs is ext4, you may run into: File name too long
- Adjust ceph.conf accordingly:
- osd max object name len = 256
- osd max object namespace len = 64
- rbd
- Stability depends on Ceph itself
- Only ROX / RWO are supported
- If you need RWX, use CephFS (filesystem-level locking)
- ceph: an rbd image carries the features layering, exclusive-lock, object-map, fast-diff, deep-flatten; the current kernel only supports layering
- Change the default: add the line rbd_default_features = 1 to /etc/ceph/ceph.conf on every ceph node, so images created afterwards have only that feature
- To verify: ceph --show-config | grep rbd | grep features
- rbd_default_features = 1
- ceph osd crush show-tunables -f json-pretty
- ceph osd crush tunables legacy
- k8s < 1.6 does not support storageClassName in a PVC
- For k8s 1.6, docker 1.12 is the best fit; on CentOS 7 it is available in the default yum repos
- If k8s was deployed with kubeadm, the controller-manager runs inside a container
- failed to create rbd image: executable file not found in $PATH
- You then need to install rbd inside that container, or run the controller-manager on the host instead
- Follow the logs: journalctl -fu kubelet / journalctl -f
- download_image.sh (see the Python sketch after this list)
- docker pull zeewell/$1
  docker tag zeewell/$1 gcr.io/google_containers/$1
- Images that need to be downloaded
- bash download.sh etcd-amd64:3.0.17
bash download.sh kube-controller-manager-amd64:v1.6.0
bash download.sh kube-proxy-amd64:v1.6.0
bash download.sh k8s-dns-sidecar-amd64:1.14.1
bash download.sh k8s-dns-kube-dns-amd64:1.14.1
bash download.sh k8s-dns-dnsmasq-nanny-amd64:1.14.1
- vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
- Comment out: #Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
- Add --cgroup-driver=systemd to KUBELET_KUBECONFIG_ARGS
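As referenced above, a rough Python equivalent of download_image.sh (the zeewell/* mirror and the image list come from the notes; the subprocess wrapper itself is just a sketch):

import subprocess

# Images needed by kubeadm for k8s 1.6, pulled from the zeewell/* mirror and
# re-tagged as gcr.io/google_containers/* so kubelet finds them locally.
IMAGES = [
    'etcd-amd64:3.0.17',
    'kube-controller-manager-amd64:v1.6.0',
    'kube-proxy-amd64:v1.6.0',
    'k8s-dns-sidecar-amd64:1.14.1',
    'k8s-dns-kube-dns-amd64:1.14.1',
    'k8s-dns-dnsmasq-nanny-amd64:1.14.1',
]

for image in IMAGES:
    subprocess.check_call(['docker', 'pull', 'zeewell/%s' % image])
    subprocess.check_call(['docker', 'tag', 'zeewell/%s' % image,
                           'gcr.io/google_containers/%s' % image])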
refs:
- http://foxhound.blog.51cto.com/1167932/1899545
- http://docs.ceph.com/docs/master/start/quick-ceph-deploy/
- http://docs.ceph.com/docs/master/install/manual-deployment/
- https://github.com/kubernetes/kubernetes/tree/master/examples/persistent-volume-provisioning/rbd
- http://www.tuicool.com/articles/vQr6zaV
- http://www.tuicool.com/articles/feyiMr6
- http://tonybai.com/2016/11/07/integrate-kubernetes-with-ceph-rbd/
- http://webcache.googleusercontent.com/search?q=cache:XESMwMuMZTEJ:lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/002815.html+&cd=1&hl=en&ct=clnk
- https://kubernetes.io/docs/concepts/storage/persistent-volumes/
Thursday, March 30, 2017
Chrome reader mode (distill mode) (desktop version)
The reader mode I got used to on iOS can now be used in desktop Chrome too~
add -enable-dom-distiller to chrome shortcut (desktop version)
tested on 56.0.2924.87 (64-bit) for win7 x64 (official release)
ref:
https://github.com/chromium/dom-distiller
Tuesday, March 28, 2017
kubernetes log collection with custom ELK stack
logstash / k8s.conf:
input {
  file {
    path => "/var/log/containers/*.log"
    start_position => "beginning"
  }
}
filter {
  kubernetes {}
  mutate { remove_field => "path" }
  json { source => "message" }
  if "_jsonparsefailure" not in [tags] {
    # drop the raw message once the json filter has parsed it successfully
    mutate { remove_field => "message" }
  }
}
output {
  stdout { # for test
    codec => json
  }
  elasticsearch {
    hosts => ["elk-e0:9200", "elk-e1:9200", "elk-e2:9200"]
    index => "kube-%{+YYYY.MM.dd}"
  }
}
Installation:
- gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
- \curl -sSL https://get.rvm.io | bash -s stable
- rvm install jruby 1.7
- dpkg -i logstash-5.2.2.deb
- /usr/share/logstash/bin/logstash-plugin install logstash-filter-kubernetes-0.3.1.gem
- /usr/share/logstash/bin/logstash -f k8s.conf
Testing:
- mkdir -p /var/log/containers/
- echo '{"log":"\n","stream":"stdout","time":"2017-03-13T09:28:04.20730347Z"}' >> /var/log/containers/redis-master_default_master-ce24440abd65b3702d7dc0588a2a1e099bc41e6b7833456774bb4845d7958429.log
Rationale:
- kubelet maintains the symlinks in /var/log/containers/
- output.elasticsearch.hosts: "If given an array it will load balance requests across the hosts specified in the hosts parameter."
- k8s supports fluentd out of the box, so there had to be a way to collect logs and parse the metadata tags
- It does not support a custom ES endpoint, only the bundled one, which does not fit our needs
- The ES service runs as containers on the k8s cluster-service
- fluentd is installed automatically on every node, and configured/discovered automatically (k8s/service: elasticsearch-logging)
- The relevant code lives in kubernetes/cluster/addons/fluentd-elasticsearch/
References:
- https://kubernetes.io/docs/tasks/debug-application-cluster/logging-elasticsearch-kibana/
- https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/lib/fluent/plugin/filter_kubernetes_metadata.rb
- https://github.com/vaijab/logstash-filter-kubernetes
- https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-hosts
- https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/fluentd-elasticsearch/fluentd-es-image/td-agent.conf
- http://www.tuicool.com/articles/jEBBZbb
schema:
PUT _template/template_kube
{
"template": "kube-*",
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"_default_": {
"properties": {
"host": {
"type": "keyword"
},
"kubernetes.container_id": {
"type": "keyword"
},
"kubernetes.container_name": {
"type": "keyword"
},
"kubernetes.namespace": {
"type": "keyword"
},
"kubernetes.pod": {
"type": "keyword"
},
"kubernetes.replication_controller": {
"type": "keyword"
},
"log": {
"type": "text",
"analyzer": "english"
},
"message": {
"type": "text"
},
"stream": {
"type": "keyword"
},
"time": {
"type": "date"
},
"tags": {
"type": "nested"
}
}
}
}
}
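The template above can also be installed from a script instead of the Kibana console; a sketch assuming the `requests` package, ES reachable on localhost:9200, and the JSON body above saved to a (hypothetical) template_kube.json file:

import json
import requests

# Upload the kube-* index template defined above.
with open('template_kube.json') as f:
    template = json.load(f)

resp = requests.put('http://localhost:9200/_template/template_kube', json=template)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}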
Rationale:
- Indexes imported from 2.x only support string and not text or keyword.
- For the legacy mapping type string the index option only accepts legacy values analyzed (default, treat as full-text field), not_analyzed (treat as keyword field) and no.
refs:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/string.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
Future work
- Possible deeper log analysis: [ https://logz.io/learn/docker-monitoring-elk-stack/ ]
- The filter logic is still simpler than the fluentd k8s plugin; if more metadata is needed, that code could be ported
- Log collection for the k8s cluster itself is not done yet
- Letting users customize the Elasticsearch schema would be valuable, but billing and resource control become tricky
Log space reclamation
- Deleting individual documents by log timestamp would put too much load on the servers
- So indices are split by date in their name, "kube-%{+YYYY.MM.dd}"
- crontab
- 25 2 * * * curl -XDELETE "localhost:9200/kube-`date --date='30 day ago' +%F`"
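The crontab one-liner above could equally be a small Python job; a sketch assuming the `requests` package and the kube-YYYY.MM.dd naming scheme:

import datetime
import requests

# Delete the kube-* index from 30 days ago, same as the curl -XDELETE in the crontab entry.
cutoff = datetime.date.today() - datetime.timedelta(days=30)
index = 'kube-%s' % cutoff.strftime('%Y.%m.%d')
resp = requests.delete('http://localhost:9200/%s' % index)
print(index, resp.status_code)  # 200 if deleted, 404 if it was already gone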
Elasticsearch learning log
Elasticsearch
- Usages:
- autocomplete suggestions
- collecting log or transaction data that you want to analyze and mine for trends, statistics, summarizations, or anomalies
- price alerting platform which allows price-savvy customers to specify a rule
- analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions
- Make sure that you don’t reuse the same Cluster names in different environments
- a Node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default.
- An Index is a collection of documents that have somewhat similar characteristics.
- Within an index, you can define one or more Types.
- A Document is a basic unit of information that can be indexed.
- Shards & Replicas you may change the number of replicas dynamically anytime but you cannot change the number of shards
- Default: 5 shards & 1 replica (i.e. two copies of each shard in total)
- Pass
- curl -XGET 'localhost:9200/_cat/indices?v&pretty'
- curl -XGET 'localhost:9200/_cat/nodes?v&pretty'
- curl -XGET 'localhost:9200/_cat/health?v&pretty'
- curl -XDELETE 'localhost:9200/customer?pretty&pretty'
- <REST Verb> /<Index>/<Type>/<ID>
- curl -XPUT 'localhost:9200/customer/external/1?pretty&pretty' -H 'Content-Type: application/json' -d' { "name": "John Doe" } '
- Update / insert
- using the POST verb instead of PUT since we didn’t specify an ID.
- curl -XPOST 'localhost:9200/customer/external/1/_update?pretty&pretty' -H 'Content-Type: application/json' -d'{"script" : "ctx._source.age += 5"}'
- DELETE /customer/external/2?pretty
- Batch Processing
- POST /customer/external/_bulk?pretty
- {"index":{"_id":"1"}}
- {"name": "John Doe" }
- {"index":{"_id":"2"}}
- {"name": "Jane Doe" }
- {"update":{"_id":"1"}}
- {"doc": { "name": "John Doe becomes Jane Doe" } }
- {"delete":{"_id":"2"}}
- The Bulk API does not fail due to failures in one of the actions. you can check if a specific action failed or not.
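Because the Bulk API only reports failures per action, the response has to be checked item by item; a small sketch with the `requests` package against the customer/external example above:

import json
import requests

# Two index actions in NDJSON form; the body must end with a newline.
actions = [
    {"index": {"_id": "1"}}, {"name": "John Doe"},
    {"index": {"_id": "2"}}, {"name": "Jane Doe"},
]
body = '\n'.join(json.dumps(a) for a in actions) + '\n'
resp = requests.post('http://localhost:9200/customer/external/_bulk',
                     data=body, headers={'Content-Type': 'application/json'})

# Walk the per-action results and report any that carry an error.
for item in resp.json()['items']:
    action, result = list(item.items())[0]
    if 'error' in result:
        print('FAILED %s _id=%s: %s' % (action, result.get('_id'), result['error']))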
- The request body method allows you to be more expressive and also to define your searches in a more readable JSON format.
- GET /bank/_search
  {
    "query": { "match_all": {} },
    "sort": { "balance": { "order": "desc" } },
    "_source": ["account_number", "balance"],   # fields to return
    "from": 10,
    "size": 10
  }
- Query:
- "bool": {
    "should": [
      { "match": { "address": "mill" } },
      { "match": { "address": "lane" } }
    ]
  }
- should -> or
- must -> and
- must_not -> none of the clauses may match
- "filter": {
    "range": {
      "balance": {
        "gte": 20000,
        "lte": 30000
      }
    }
  }
- (no time to read further yet)
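For reference, the should / must / filter pieces combine into one request body; a sketch of the bank query issued from Python with `requests` (host and minimum_should_match are assumptions, not from the notes):

import requests

# address must contain "mill" OR "lane", and balance must fall in 20000-30000.
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"address": "mill"}},
                {"match": {"address": "lane"}},
            ],
            "minimum_should_match": 1,  # with a filter present, should clauses become optional otherwise
            "filter": {"range": {"balance": {"gte": 20000, "lte": 30000}}},
        }
    }
}
resp = requests.get('http://localhost:9200/bank/_search', json=query)
print(resp.json()['hits']['total'])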
- Full query DSL support
- Cross cluster
- Scripting
- Fetch the status of all running reindex requests
- curl -XGET 'localhost:9200/_tasks?detailed=true&actions=*reindex&pretty'
- To pre-process documents before indexing, you define a pipeline that specifies a series of processors.
- Append Processor
- Convert Processor
- Date Processor
- Date Index Name Processor
- Fail Processor
- Foreach Processor
- Grok Processor
- Gsub Processor
- Join Processor
- JSON Processor
- KV Processor
- Lowercase Processor
- Remove Processor
- Rename Processor
- Script Processor
- Set Processor
- Split Processor
- Sort Processor
- Trim Processor
- Uppercase Processor
- Dot Expander Processor
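A minimal sketch of creating and using such a pipeline (the pipeline name, index and fields are made up for illustration; assumes the `requests` package and ES 5.x):

import requests

ES = 'http://localhost:9200'

# Define a pipeline with a Set processor and a Lowercase processor.
pipeline = {
    "description": "demo pipeline",
    "processors": [
        {"set": {"field": "env", "value": "prod"}},
        {"lowercase": {"field": "stream"}},
    ],
}
requests.put('%s/_ingest/pipeline/demo-pipeline' % ES, json=pipeline).raise_for_status()

# Index a document through the pipeline; "stream" is lowercased and "env" is added.
doc = {"stream": "STDOUT", "log": "hello"}
print(requests.put('%s/logs/event/1?pipeline=demo-pipeline' % ES, json=doc).json())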
- Keeping the search context alive
- POST /_search/scroll
  {
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAA..."
  }
- POST /twitter/tweet/_search?scroll=1m
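Put together, the scroll flow above looks roughly like this in Python (index/type are from the example, the page size is made up; assumes the `requests` package):

import requests

ES = 'http://localhost:9200'

# Open the scroll: the first search returns the initial page plus a scroll_id.
resp = requests.post('%s/twitter/tweet/_search?scroll=1m' % ES,
                     json={"size": 100, "query": {"match_all": {}}}).json()
hits = resp['hits']['hits']

# Keep asking for the next page until an empty page comes back.
while hits:
    for hit in hits:
        print(hit['_id'])
    resp = requests.post('%s/_search/scroll' % ES,
                         json={"scroll": "1m", "scroll_id": resp['_scroll_id']}).json()
    hits = resp['hits']['hits']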
- Filtered / routing / mapping
- Don’t return large result sets, use scroll APIs
- Avoid large documents; http.max_content_length is set to 100MB by default, and Lucene still has a limit of about 2GB.
- Wanting to make books searchable doesn't necessarily mean that a document should consist of a whole book
- Avoid sparsity
- Avoid putting unrelated data in the same index
- Even if you really need to put different kinds of documents in the same index, Normalize document structures
- Avoid types… having multiple types that have different fields in a single index will also cause problems
- norms can be disabled if producing scores is not necessary on a field
- document is stored in an index and has a type and an id. A document is a JSON object, The original JSON document that is indexed will be stored in the _source field
- A mapping is like a schema definition in a relational database. The mapping also allows you to define (amongst other things) how the value for a field should be analyzed. Has a number of index-wide settings. Fields with the same name in different types in the same index must have the same mapping
- An index is like a table in a relational database. It has a mapping which defines the fields in the index, which are grouped by multiple type.
- Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes: increase failover, increase performance. It is never started on the same node as its primary shard.
- A shard is a single Lucene instance. you never need to refer to shards directly.
- A term is an exact value that is indexed in elasticsearch. The terms foo, Foo, FOO are NOT equivalent. can be searched for using term queries
- Analysis is the process of converting full text to terms. These terms are what is actually stored in the index. A full text query (not a term query) for FoO:bAR will also be analyzed to the terms foo,bar and will thus match the terms stored in the index.
- Text (or full text) is ordinary unstructured text, such as this paragraph.
- Each document is stored in a single primary shard. primary shard is chosen by hashing the routing value. derived from document id or, parent document id(to ensure stored on the same shard). This value can be overridden by specifying a routing value at index time, or a routing field in the mapping.
- Slow Log
- Fields with the same name in different mapping types in the same index must have the same mapping.
- dynamic mapping rules can guess fields types. Or you can define by Explicit mappings
- existing type and field mappings cannot be updated
- token_count is really an integer field, to count the number of tokens in a string
- Array support does not require a dedicated type
- object for single JSON objects
- nested for arrays of JSON objects
- geo_point for lat/lon points
- geo_shape for complex shapes like polygons
- ip for IPv4 and IPv6 addresses
- completion to provide auto-complete suggestions
- murmur3 to compute hashes of values at index-time and store them in the index
- the mapper-attachments plugin which supports indexing attachments like Microsoft Office formats, Open Document formats, ePub, HTML, etc. into an attachment datatype.
- Percolator type: Accepts queries from the query-dsl
- The _all field concatenates the values of all of the other fields into one big string, then analyzed and indexed, but not stored. ("store": true)
- All values treated as strings
- The _all field takes fields’ boosts into account
- copy_to parameter allows the creation of multiple custom _all fields
- "first_name": {
    "type": "text",
    "copy_to": "full_name"
  },
  "last_name": {
    "type": "text",
    "copy_to": "full_name"
  }
- Stores query instead of document
- "properties": {
    "age": { "type": "integer" },
    "name": {
      "properties": {
        "first": { "type": "text" },
        "last": { "type": "text" }
      }
    }
  }
- all values in the array must be of the same datatype.
- null values are either replaced by the configured null_value or skipped entirely. An empty array [] is treated as a missing field — a field with no values.
- Treated as a set of data, without order
- use nested query to query them
- Indexing a document with 100 nested fields actually indexes 101 documents
- Meta fields reference, wrong url
- If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
- _default_ mapping, Configure the base mapping to be used for new mapping types.
- PUT index-name/_settings { "index.mapper.dynamic": false }
- numeric detection (which is disabled by default)
- How to dynamically map fields to types
- The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original _source field.
- Keyword fields are only searchable by their exact value.
- Coercion attempts to clean up dirty values to fit the datatype of a field by default
- "properties": {
    "city": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword", "analyzer": "english"
        }
      }
    }
  }
- "sort": {
    "city.raw": "asc"
  },
- Suggest search on different fields
- converting text, like the body of any email, into tokens or terms which are added to the inverted index for searching
- This same analysis process is applied to the query string at search time
- Outputs the statistic information of terms in a document
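That last note presumably refers to the term vectors API; a tiny sketch (index, type and document id are placeholders; assumes the `requests` package):

import requests

# Ask for per-term statistics of the "message" field of one stored document.
resp = requests.get(
    'http://localhost:9200/twitter/tweet/1/_termvectors',
    json={"fields": ["message"], "term_statistics": True},
)
print(resp.json()['term_vectors'])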