source: http://ju.outofmemory.cn/entry/53571
On writing regexes for the Unicode Chinese-character ranges
Chronos 2013-10-21 7952 reads
Tags: regex, ranges, unicode, Chinese characters
I had always used \u4e00-\u9fa5, but today it failed to match Chinese punctuation. After checking the relevant material, it turns out Unicode has quite a few other ranges related to Chinese.
Character range table
1. Standard CJK ideographs
Range: \u3400-\u4DB5, \u4E00-\u9FA5, \u9FA6-\u9FBB, \uF900-\uFA2D, \uFA30-\uFA6A, \uFA70-\uFAD9. Note: this is several ranges; apart from \u4e00-\u9fa5 they are not very commonly used. Reference: http://www.unicode.org/Public/UNIDATA/Unihan.html
2. Full-width ASCII, full-width Chinese/English punctuation, half-width katakana, half-width hiragana, half-width hangul letters
Range: \uFF00-\uFFEF Reference: http://www.unicode.org/charts/PDF/UFF00.pdf
3. CJK radicals supplement
Range: \u2E80-\u2EFF Reference: http://www.unicode.org/charts/PDF/U2E80.pdf
4. CJK symbols and punctuation
Range: \u3000-\u303F Reference: http://www.unicode.org/charts/PDF/U3000.pdf
5. CJK strokes
Range: \u31C0-\u31EF Reference: http://www.unicode.org/charts/PDF/U31C0.pdf
6. Kangxi radicals
Range: \u2F00-\u2FDF Reference: http://www.unicode.org/charts/PDF/U2F00.pdf
7. Ideographic description characters
Range: \u2FF0-\u2FFF Reference: http://www.unicode.org/charts/PDF/U2FF0.pdf
8. Bopomofo
Range: \u3100-\u312F Reference: http://www.unicode.org/charts/PDF/U3100.pdf
9. Bopomofo extended (Minnan, Hakka)
Range: \u31A0-\u31BF Reference: http://www.unicode.org/charts/PDF/U31A0.pdf
10. Japanese hiragana
Range: \u3040-\u309F Reference: http://www.unicode.org/charts/PDF/U3040.pdf
11. Japanese katakana
Range: \u30A0-\u30FF Reference: http://www.unicode.org/charts/PDF/U30A0.pdf
12. Katakana phonetic extensions
Range: \u31F0-\u31FF Reference: http://www.unicode.org/charts/PDF/U31F0.pdf
13. Hangul syllables
Range: \uAC00-\uD7AF Reference: http://www.unicode.org/charts/PDF/UAC00.pdf
14. Hangul jamo
Range: \u1100-\u11FF Reference: http://www.unicode.org/charts/PDF/U1100.pdf
15. Hangul compatibility jamo
Range: \u3130-\u318F Reference: http://www.unicode.org/charts/PDF/U3130.pdf
16. Yijing hexagram symbols
Range: \u4DC0-\u4DFF Reference: http://www.unicode.org/charts/PDF/U4DC0.pdf
17. Yi syllables
Range: \uA000-\uA48F Reference: http://www.unicode.org/charts/PDF/UA000.pdf
18. Yi radicals
Range: \uA490-\uA4CF Reference: http://www.unicode.org/charts/PDF/UA490.pdf
19. Braille patterns
Range: \u2800-\u28FF Reference: http://www.unicode.org/charts/PDF/U2800.pdf
20. Enclosed CJK letters and months
Range: \u3200-\u32FF Reference: http://www.unicode.org/charts/PDF/U3200.pdf
21. CJK compatibility symbols (squared date/unit forms)
Range: \u3300-\u33FF Reference: http://www.unicode.org/charts/PDF/U3300.pdf
22. Dingbats (not CJK-specific)
Range: \u2700-\u27BF Reference: http://www.unicode.org/charts/PDF/U2700.pdf
23. Miscellaneous symbols (not CJK-specific)
Range: \u2600-\u26FF Reference: http://www.unicode.org/charts/PDF/U2600.pdf
24. Vertical forms (Chinese vertical punctuation)
Range: \uFE10-\uFE1F Reference: http://www.unicode.org/charts/PDF/UFE10.pdf
25. CJK compatibility forms (vertical variants, underscores, ideographic comma)
Range: \uFE30-\uFE4F Reference: http://www.unicode.org/charts/PDF/UFE30.pdf
Improved matching expressions
[\u3400-\u4DB5\u4E00-\u9FA5\u9FA6-\u9FBB\uF900-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9\uFF00-\uFFEF\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF] (note: this one covers most needs)
[\u3400-\u4DB5\u4E00-\u9FA5\u9FA6-\u9FBB\uF900-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9\uFF00-\uFFEF\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF\u2F00-\u2FDF\u2FF0-\u2FFF\u3100-\u312F\u31A0-\u31BF\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\u4DC0-\u4DFF\uA000-\uA48F\uA490-\uA4CF\u2800-\u28FF\u3200-\u32FF\u3300-\u33FF\u2700-\u27BF\u2600-\u26FF\uFE10-\uFE1F\uFE30-\uFE4F] (note: this is the full version)
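For a quick sanity check, here is a minimal Python sketch (not from the original post; the sample strings are made up) that uses the shorter character class above to pick out hanzi and full-width punctuation:

# -*- coding: utf-8 -*-
import re

# The "covers most needs" class above: CJK ideographs, full-width forms,
# CJK radicals, CJK punctuation and CJK strokes.
CJK_RE = re.compile(
    u'[\u3400-\u4DB5\u4E00-\u9FA5\u9FA6-\u9FBB\uF900-\uFA2D\uFA30-\uFA6A'
    u'\uFA70-\uFAD9\uFF00-\uFFEF\u2E80-\u2EFF\u3000-\u303F\u31C0-\u31EF]'
)

print(CJK_RE.findall(u'hello 你好，world！'))  # picks up the hanzi plus the full-width ， and ！
print(CJK_RE.search(u'plain ascii only'))      # None, nothing falls in the CJK ranges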
Wednesday, October 11, 2017
Monday, August 21, 2017
recursive dns resolution
recursive dns resolution, [gist link], requirements:
- dns
- clientsubnetoption
# -*- coding: utf8 -*-
"""
@author: boyxuper
@date: 2017/8/21 16:07
"""
import random
import socket
from collections import defaultdict
from itertools import chain

import clientsubnetoption
import dns
import dns.name
import dns.message
import dns.rdatatype
import dns.resolver

_logger = lambda *args: None


def logger_full(fmt, *args):
    if not args:
        print fmt
        return
    print fmt % args


socket.inet_pton = lambda _, p: socket.inet_aton(p)
socket.inet_ntop = lambda _, n: socket.inet_ntoa(n)


def retry(n=3, exc_list=()):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            tried = 1
            while tried <= n:
                try:
                    return fn(*args, **kwargs)
                except exc_list as err:
                    print 'RETRY %s/%s, [%s]' % (tried, n, err)
                    tried += 1
            else:
                raise err
        return wrapper
    return decorator


@retry(n=5, exc_list=(dns.exception.Timeout, dns.resolver.NoAnswer, ))
def query_dns0(domain, ns_ip, client_ip=None, **kwargs):
    kwargs.setdefault('rdtype', dns.rdatatype.A)
    message = dns.message.make_query(domain, **kwargs)
    if client_ip:
        message.use_edns(options=[clientsubnetoption.ClientSubnetOption(client_ip)])
    if callable(ns_ip):
        ns_ip = ns_ip()
    return dns.query.udp(message, ns_ip, timeout=TIMEOUT), ns_ip


# only authority can be cached
KNOWN_AUTHORITIES = defaultdict(list, **{
    '.': ['192.5.5.241', '199.7.83.42', '192.58.128.30', '192.36.148.17'],
    'net.': [
        '192.48.79.30', '192.35.51.30', '192.52.178.30', '192.5.6.30',
        '192.26.92.30', '192.41.162.30', '192.12.94.30', '192.54.112.30',
        '192.31.80.30', '192.43.172.30', '192.33.14.30', '192.55.83.30', '192.42.93.30'],
    'com.': [
        '192.33.14.30', '192.12.94.30', '192.55.83.30', '192.52.178.30',
        '192.26.92.30', '192.48.79.30', '192.5.6.30', '192.43.172.30', '192.42.93.30',
        '192.31.80.30', '192.54.112.30', '192.35.51.30', '192.41.162.30'],
    'cn.': [
        '203.119.29.1', '203.119.27.1', '203.119.28.1', '203.119.26.1',
        '202.112.0.44', '203.119.25.1'],
})
NS_IPS = defaultdict(set)
TIMEOUT = 2


def locate_nearest_authority(dns_name):
    """
    :type dns_name: dns.name.Name
    """
    for depth in range(len(dns_name), 0, -1):
        _, sub = dns_name.split(depth)
        if sub.to_text() in KNOWN_AUTHORITIES:
            return sub, KNOWN_AUTHORITIES[sub.to_text()]
    assert False, 'impossible'


def is_ip(s):
    try:
        socket.inet_aton(s)
    except:
        return False
    else:
        return True


def _authority_iterator(authority_ips, client_ip, logger=_logger):
    pos = random.randint(0, len(authority_ips))  # for short ips
    while True:
        non_ips = []
        for name in chain(authority_ips[pos:], authority_ips[:pos]):
            if is_ip(name):
                yield name
            else:
                non_ips.append(name)
        for name in non_ips:
            if name in NS_IPS:
                server_ips = list(NS_IPS[name])
            else:
                _, server_ips, _ = resolve_A(name, client_ip=client_ip, logger=logger)
                NS_IPS[name] = set(server_ips)
            yield server_ips[pos % len(server_ips)]


@retry(n=5, exc_list=(dns.exception.Timeout, dns.resolver.NoAnswer, ))
def query_dns(dns_name, client_ip=None, logger=_logger, **kwargs):
    sub, authority_ips = locate_nearest_authority(dns_name)
    iterator = _authority_iterator(authority_ips, client_ip, logger=_logger)
    logger('querying %s @NS"%s": %r', dns_name, sub, authority_ips)
    response, ns_ip = query_dns0(dns_name, ns_ip=iterator.next, client_ip=client_ip, **kwargs)
    resp_code = response.rcode()
    if resp_code != dns.rcode.NOERROR:
        if resp_code == dns.rcode.NXDOMAIN:
            raise Exception('%s does not exist on %s.' % (sub, ns_ip))
        else:
            raise Exception('Error %s' % dns.rcode.to_text(resp_code))
    return response


def resolve_A(domain, client_ip, logger=_logger):
    """
    :return: answers
    response.authority.__len__() == 1
    response.authority.name == {Name}a.shifen.com.
    response.authority.items:
        0 = {NS} ns2.a.shifen.com.
        1 = {NS} ns3.a.shifen.com.
        2 = {NS} ns4.a.shifen.com.
        3 = {NS} ns1.a.shifen.com.
        4 = {NS} ns5.a.shifen.com.
    """
    if not domain.endswith('.'):
        domain += '.'
    dns_name = dns.name.from_text(domain)
    while True:
        response = query_dns(dns_name, client_ip=client_ip, logger=logger)
        # logger(response)
        instant_ips = defaultdict(list)
        final_answer = None
        # authority contains NS should be processed last, so it can leverage the instant IPs
        for answer in chain(response.additional, response.answer, response.authority):
            answer_name = answer.name.to_text()
            for item in answer.items:
                item_text = item.to_text()
                if item.rdtype == dns.rdatatype.A:
                    if answer_name == domain:
                        final_answer = answer
                    instant_ips[answer_name].append(item_text)
                elif item.rdtype == dns.rdatatype.NS:
                    if domain not in instant_ips:
                        logger('got NS: %s -> %s', answer_name, item_text)
                        if item_text in instant_ips:
                            authority_ips = instant_ips[item_text]
                            NS_IPS[item_text].update(authority_ips)
                        else:
                            authority_ips = [item_text]
                        KNOWN_AUTHORITIES[answer_name].extend(authority_ips)
                elif item.rdtype == dns.rdatatype.CNAME:
                    logger('CNAME: %s -> %s', domain, item_text)
                    domain, dns_name = item_text, dns.name.from_text(item_text)
        if domain in instant_ips:
            return domain, instant_ips[domain], final_answer.ttl


if __name__ == '__main__':
    log = _logger
    log = logger_full
    print resolve_A('www.baidu.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('www.taobao.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('www.google.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('dl.tiku.zhan.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('ns3.dnsv4.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('www.facebook.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('api.taoqian123.com', client_ip='8.8.8.8', logger=log)
    print resolve_A('hotsoon.snssdk.com', client_ip='66.103.188.117', logger=log)
Thursday, April 6, 2017
Integrating StorageClass with Ceph RBD on Kubernetes 1.6
Install ceph-common on every k8s node and make sure the rbd command is available.
Install the Secret (cluster level)
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: kubernetes.io/rbd  # very important: the StorageClass will not recognize the secret without it; the docs example omits this, but the examples repo includes it
data:
  key: QVFDZ2ZOOVkza3VyR3hBQXNYYmx6Mi9xVlBZNzN0VWZvMUlFRlE9PQ== # ceph auth get-key client.admin | base64
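If you prefer to generate the data.key value programmatically, here is a small Python sketch equivalent to the `ceph auth get-key client.admin | base64` one-liner in the comment (assumes it runs on a host with ceph admin access):

import base64
import subprocess

# Fetch the admin keyring entry and base64-encode it for the Secret's data.key field.
raw_key = subprocess.check_output(['ceph', 'auth', 'get-key', 'client.admin'])
print(base64.b64encode(raw_key.strip()).decode('ascii'))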
Install the StorageClass (cluster level)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.1.4.208:6789
  adminId: admin
  adminSecretName: ceph-secret
  pool: rbd
  userId: admin
  userSecretName: ceph-secret
Install the PVC (namespace level)
- Requests a PV from the configured storageClass
- Once bound, the PV acts as the volume; a pod mounts it by referencing the PVC
- Currently only the size can be requested; with code changes things like fsType could also be specified (not supported today)
- The mount is done via docker HostConfig.Binds, so:
- Running mount on the node shows: /dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-kubernetes-dynamic-pvc-59474c44-1a77-11e7-8b1a-fa163e0dfa6d type ext4 (rw,relatime,stripe=4096,data=ordered)
- HostConfig.Binds:"/var/lib/kubelet/pods/b84a50db-1aa7-11e7-a0c8-fa163e0dfa6d/volumes/kubernetes.io~rbd/pvc-3c34a5f2-1a95-11e7-a0c8-fa163e0dfa6d:/mnt/rbd-rox"
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-rox
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 100Mi
  storageClassName: fast
Install the pod:
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test5
spec:
  containers:
    - name: rbd-rw
      image: docker.cloudin.com/cloudin/alpine:latest
      command: ["/bin/sleep", "10000"]
      volumeMounts:
        - mountPath: "/mnt/rbd"
          name: rbd-rox  # a subPath can also be set to mount a subdirectory of this PVC
  volumes:
    - name: rbd-rox
      persistentVolumeClaim:
        claimName: pvc-rox
Troubleshooting:
- Check image locks: rbd lock list aaa.img --pool kube
- Check the rbd map status on the current node: rbd showmapped
- OSD nodes can be backed by a directory (on the native filesystem) or by a block device; if the native fs is ext4, you may run into: File name too long
- Adjust ceph.conf accordingly:
- osd max object name len = 256
- osd max object namespace len = 64
- rbd
- Stability depends on Ceph itself
- Only ROX / RWO are supported
- If you need RWX, use CephFS (filesystem-level locking)
- ceph: an rbd image carries the features layering, exclusive-lock, object-map, fast-diff, deep-flatten; the current kernel only supports layering
- Change the default: add the line rbd_default_features = 1 to /etc/ceph/ceph.conf on every ceph node, so images created afterwards have only that feature
- To verify: ceph --show-config | grep rbd | grep features
- rbd_default_features = 1
- ceph osd crush show-tunables -f json-pretty
- ceph osd crush tunables legacy
- k8s < 1.6 does not support storageClassName in a PVC
- For k8s 1.6, docker 1.12 is the best fit; on CentOS 7 it is available in the default yum repos
- If k8s was deployed with kubeadm, the controller-manager runs inside a container
- failed to create rbd image: executable file not found in $PATH
- You then need to install rbd inside that container, or run the controller-manager on the host instead
- Follow the logs: journalctl -fu kubelet / journalctl -f
- download_image.sh (see the Python sketch after this list)
- docker pull zeewell/$1
  docker tag zeewell/$1 gcr.io/google_containers/$1
- Images that need to be downloaded
- bash download.sh etcd-amd64:3.0.17
bash download.sh kube-controller-manager-amd64:v1.6.0
bash download.sh kube-proxy-amd64:v1.6.0
bash download.sh k8s-dns-sidecar-amd64:1.14.1
bash download.sh k8s-dns-kube-dns-amd64:1.14.1
bash download.sh k8s-dns-dnsmasq-nanny-amd64:1.14.1
- vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
- Comment out: #Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
- Add --cgroup-driver=systemd to KUBELET_KUBECONFIG_ARGS
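As referenced above, a rough Python equivalent of download_image.sh (the zeewell/* mirror and the image list come from the notes; the subprocess wrapper itself is just a sketch):

import subprocess

# Images needed by kubeadm for k8s 1.6, pulled from the zeewell/* mirror and
# re-tagged as gcr.io/google_containers/* so kubelet finds them locally.
IMAGES = [
    'etcd-amd64:3.0.17',
    'kube-controller-manager-amd64:v1.6.0',
    'kube-proxy-amd64:v1.6.0',
    'k8s-dns-sidecar-amd64:1.14.1',
    'k8s-dns-kube-dns-amd64:1.14.1',
    'k8s-dns-dnsmasq-nanny-amd64:1.14.1',
]

for image in IMAGES:
    subprocess.check_call(['docker', 'pull', 'zeewell/%s' % image])
    subprocess.check_call(['docker', 'tag', 'zeewell/%s' % image,
                           'gcr.io/google_containers/%s' % image])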
refs:
- http://foxhound.blog.51cto.com/1167932/1899545
- http://docs.ceph.com/docs/master/start/quick-ceph-deploy/
- http://docs.ceph.com/docs/master/install/manual-deployment/
- https://github.com/kubernetes/kubernetes/tree/master/examples/persistent-volume-provisioning/rbd
- http://www.tuicool.com/articles/vQr6zaV
- http://www.tuicool.com/articles/feyiMr6
- http://tonybai.com/2016/11/07/integrate-kubernetes-with-ceph-rbd/
- http://webcache.googleusercontent.com/search?q=cache:XESMwMuMZTEJ:lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/002815.html+&cd=1&hl=en&ct=clnk
- https://kubernetes.io/docs/concepts/storage/persistent-volumes/
Thursday, March 30, 2017
Chrome reader mode (distill mode) (desktop version)
The reader mode I got used to on iOS can now be used in desktop Chrome too~
add -enable-dom-distiller to chrome shortcut (desktop version)
tested on 56.0.2924.87 (64-bit) for win7 x64 (official release)
ref:
https://github.com/chromium/dom-distiller
Tuesday, March 28, 2017
kubernetes log collection with custom ELK stack
logstash / k8s.conf:
input {
  file {
    path => "/var/log/containers/*.log"
    start_position => "beginning"
  }
}
filter {
  kubernetes {}
  mutate { remove_field => "path" }
  json { source => "message" }
  if "_jsonparsefailure" not in [tags] {
    # drop the raw message once the json filter has parsed it successfully
    mutate { remove_field => "message" }
  }
}
output {
  stdout { # for test
    codec => json
  }
  elasticsearch {
    hosts => ["elk-e0:9200", "elk-e1:9200", "elk-e2:9200"]
    index => "kube-%{+YYYY.MM.dd}"
  }
}
Installation:
- gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
- \curl -sSL https://get.rvm.io | bash -s stable
- rvm install jruby 1.7
- dpkg -i logstash-5.2.2.deb
- /usr/share/logstash/bin/logstash-plugin install logstash-filter-kubernetes-0.3.1.gem
- /usr/share/logstash/bin/logstash -f k8s.conf
Testing:
- mkdir -p /var/log/containers/
- echo '{"log":"\n","stream":"stdout","time":"2017-03-13T09:28:04.20730347Z"}' >> /var/log/containers/redis-master_default_master-ce24440abd65b3702d7dc0588a2a1e099bc41e6b7833456774bb4845d7958429.log
Rationale:
- kubelet maintains the symlinks in /var/log/containers/
- output.elasticsearch.hosts: "If given an array it will load balance requests across the hosts specified in the hosts parameter."
- k8s supports fluentd out of the box, so there had to be a way to collect logs and parse the metadata tags
- It does not support a custom ES endpoint, only the bundled one, which does not fit our needs
- The ES service runs as containers on the k8s cluster-service
- fluentd is installed automatically on every node, and configured/discovered automatically (k8s/service: elasticsearch-logging)
- The relevant code lives in kubernetes/cluster/addons/fluentd-elasticsearch/
References:
- https://kubernetes.io/docs/tasks/debug-application-cluster/logging-elasticsearch-kibana/
- https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/lib/fluent/plugin/filter_kubernetes_metadata.rb
- https://github.com/vaijab/logstash-filter-kubernetes
- https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-hosts
- https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/fluentd-elasticsearch/fluentd-es-image/td-agent.conf
- http://www.tuicool.com/articles/jEBBZbb
schema:
PUT _template/template_kube
{
"template": "kube-*",
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"_default_": {
"properties": {
"host": {
"type": "keyword"
},
"kubernetes.container_id": {
"type": "keyword"
},
"kubernetes.container_name": {
"type": "keyword"
},
"kubernetes.namespace": {
"type": "keyword"
},
"kubernetes.pod": {
"type": "keyword"
},
"kubernetes.replication_controller": {
"type": "keyword"
},
"log": {
"type": "text",
"analyzer": "english"
},
"message": {
"type": "text"
},
"stream": {
"type": "keyword"
},
"time": {
"type": "date"
},
"tags": {
"type": "nested"
}
}
}
}
}
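The template above can also be installed from a script instead of the Kibana console; a sketch assuming the `requests` package, ES reachable on localhost:9200, and the JSON body above saved to a (hypothetical) template_kube.json file:

import json
import requests

# Upload the kube-* index template defined above.
with open('template_kube.json') as f:
    template = json.load(f)

resp = requests.put('http://localhost:9200/_template/template_kube', json=template)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}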
Rationale:
- Indexes imported from 2.x only support string and not text or keyword.
- For the legacy mapping type string the index option only accepts legacy values analyzed (default, treat as full-text field), not_analyzed (treat as keyword field) and no.
refs:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/string.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
Future work
- Possible deeper log analysis: [ https://logz.io/learn/docker-monitoring-elk-stack/ ]
- The filter logic is still simpler than the fluentd k8s plugin; if more metadata is needed, that code could be ported
- Log collection for the k8s cluster itself is not done yet
- Letting users customize the Elasticsearch schema would be valuable, but billing and resource control become tricky
Log space reclamation
- Deleting individual documents by log timestamp would put too much load on the servers
- So indices are split by date in their name, "kube-%{+YYYY.MM.dd}"
- crontab
- 25 2 * * * curl -XDELETE "localhost:9200/kube-`date --date='30 day ago' +%F`"
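The crontab one-liner above could equally be a small Python job; a sketch assuming the `requests` package and the kube-YYYY.MM.dd naming scheme:

import datetime
import requests

# Delete the kube-* index from 30 days ago, same as the curl -XDELETE in the crontab entry.
cutoff = datetime.date.today() - datetime.timedelta(days=30)
index = 'kube-%s' % cutoff.strftime('%Y.%m.%d')
resp = requests.delete('http://localhost:9200/%s' % index)
print(index, resp.status_code)  # 200 if deleted, 404 if it was already gone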
Elasticsearch learning log
Elasticsearch
- Usages:
- autocomplete suggestions
- collecting log or transaction data that you want to analyze and mine for trends, statistics, summarizations, or anomalies
- price alerting platform which allows price-savvy customers to specify a rule
- analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions
- Make sure that you don’t reuse the same Cluster names in different environments
- a Node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default.
- An Index is a collection of documents that have somewhat similar characteristics.
- Within an index, you can define one or more Types.
- A Document is a basic unit of information that can be indexed.
- Shards & Replicas you may change the number of replicas dynamically anytime but you cannot change the number of shards
- Default: 5 shards & 1 replica (i.e. two copies of each shard in total)
- Pass
- curl -XGET 'localhost:9200/_cat/indices?v&pretty'
- curl -XGET 'localhost:9200/_cat/nodes?v&pretty'
- curl -XGET 'localhost:9200/_cat/health?v&pretty'
- curl -XDELETE 'localhost:9200/customer?pretty&pretty'
- <REST Verb> /<Index>/<Type>/<ID>
- curl -XPUT 'localhost:9200/customer/external/1?pretty&pretty' -H 'Content-Type: application/json' -d' { "name": "John Doe" } '
- Update / insert
- using the POST verb instead of PUT since we didn’t specify an ID.
- curl -XPOST 'localhost:9200/customer/external/1/_update?pretty&pretty' -H 'Content-Type: application/json' -d'{"script" : "ctx._source.age += 5"}'
- DELETE /customer/external/2?pretty
- Batch Processing
- POST /customer/external/_bulk?pretty
- {"index":{"_id":"1"}}
- {"name": "John Doe" }
- {"index":{"_id":"2"}}
- {"name": "Jane Doe" }
- {"update":{"_id":"1"}}
- {"doc": { "name": "John Doe becomes Jane Doe" } }
- {"delete":{"_id":"2"}}
- The Bulk API does not fail due to failures in one of the actions. you can check if a specific action failed or not.
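Because the Bulk API only reports failures per action, the response has to be checked item by item; a small sketch with the `requests` package against the customer/external example above:

import json
import requests

# Two index actions in NDJSON form; the body must end with a newline.
actions = [
    {"index": {"_id": "1"}}, {"name": "John Doe"},
    {"index": {"_id": "2"}}, {"name": "Jane Doe"},
]
body = '\n'.join(json.dumps(a) for a in actions) + '\n'
resp = requests.post('http://localhost:9200/customer/external/_bulk',
                     data=body, headers={'Content-Type': 'application/json'})

# Walk the per-action results and report any that carry an error.
for item in resp.json()['items']:
    action, result = list(item.items())[0]
    if 'error' in result:
        print('FAILED %s _id=%s: %s' % (action, result.get('_id'), result['error']))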
- The request body method allows you to be more expressive and also to define your searches in a more readable JSON format.
- GET /bank/_search
  {
    "query": { "match_all": {} },
    "sort": { "balance": { "order": "desc" } },
    "_source": ["account_number", "balance"],   # fields to return
    "from": 10,
    "size": 10
  }
- Query:
- "bool": {
    "should": [
      { "match": { "address": "mill" } },
      { "match": { "address": "lane" } }
    ]
  }
- should -> or
- must -> and
- must_not -> none of the clauses may match
- "filter": {
    "range": {
      "balance": {
        "gte": 20000,
        "lte": 30000
      }
    }
  }
- (no time to read further yet)
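For reference, the should / must / filter pieces combine into one request body; a sketch of the bank query issued from Python with `requests` (host and minimum_should_match are assumptions, not from the notes):

import requests

# address must contain "mill" OR "lane", and balance must fall in 20000-30000.
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"address": "mill"}},
                {"match": {"address": "lane"}},
            ],
            "minimum_should_match": 1,  # with a filter present, should clauses become optional otherwise
            "filter": {"range": {"balance": {"gte": 20000, "lte": 30000}}},
        }
    }
}
resp = requests.get('http://localhost:9200/bank/_search', json=query)
print(resp.json()['hits']['total'])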
- Full query DSL support
- Cross cluster
- Scripting
- Fetch the status of all running reindex requests
- curl -XGET 'localhost:9200/_tasks?detailed=true&actions=*reindex&pretty'
- To pre-process documents before indexing, you define a pipeline that specifies a series of processors.
- Append Processor
- Convert Processor
- Date Processor
- Date Index Name Processor
- Fail Processor
- Foreach Processor
- Grok Processor
- Gsub Processor
- Join Processor
- JSON Processor
- KV Processor
- Lowercase Processor
- Remove Processor
- Rename Processor
- Script Processor
- Set Processor
- Split Processor
- Sort Processor
- Trim Processor
- Uppercase Processor
- Dot Expander Processor
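A minimal sketch of creating and using such a pipeline (the pipeline name, index and fields are made up for illustration; assumes the `requests` package and ES 5.x):

import requests

ES = 'http://localhost:9200'

# Define a pipeline with a Set processor and a Lowercase processor.
pipeline = {
    "description": "demo pipeline",
    "processors": [
        {"set": {"field": "env", "value": "prod"}},
        {"lowercase": {"field": "stream"}},
    ],
}
requests.put('%s/_ingest/pipeline/demo-pipeline' % ES, json=pipeline).raise_for_status()

# Index a document through the pipeline; "stream" is lowercased and "env" is added.
doc = {"stream": "STDOUT", "log": "hello"}
print(requests.put('%s/logs/event/1?pipeline=demo-pipeline' % ES, json=doc).json())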
- Keeping the search context alive
- POST /_search/scroll
  {
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAA..."
  }
- POST /twitter/tweet/_search?scroll=1m
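Put together, the scroll flow above looks roughly like this in Python (index/type are from the example, the page size is made up; assumes the `requests` package):

import requests

ES = 'http://localhost:9200'

# Open the scroll: the first search returns the initial page plus a scroll_id.
resp = requests.post('%s/twitter/tweet/_search?scroll=1m' % ES,
                     json={"size": 100, "query": {"match_all": {}}}).json()
hits = resp['hits']['hits']

# Keep asking for the next page until an empty page comes back.
while hits:
    for hit in hits:
        print(hit['_id'])
    resp = requests.post('%s/_search/scroll' % ES,
                         json={"scroll": "1m", "scroll_id": resp['_scroll_id']}).json()
    hits = resp['hits']['hits']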
- Filtered / routing / mapping
- Don’t return large result sets, use scroll APIs
- Avoid large documents; http.max_content_length is set to 100MB by default, and Lucene still has a limit of about 2GB.
- Wanting to make books searchable doesn't necessarily mean that a document should consist of a whole book
- Avoid sparsity
- Avoid putting unrelated data in the same index
- Even if you really need to put different kinds of documents in the same index, Normalize document structures
- Avoid types… having multiple types that have different fields in a single index will also cause problems
- norms can be disabled if producing scores is not necessary on a field
- document is stored in an index and has a type and an id. A document is a JSON object, The original JSON document that is indexed will be stored in the _source field
- A mapping is like a schema definition in a relational database. The mapping also allows you to define (amongst other things) how the value for a field should be analyzed. Has a number of index-wide settings. Fields with the same name in different types in the same index must have the same mapping
- An index is like a table in a relational database. It has a mapping which defines the fields in the index, which are grouped by multiple type.
- Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes: increase failover, increase performance. It is never started on the same node as its primary shard.
- A shard is a single Lucene instance. you never need to refer to shards directly.
- A term is an exact value that is indexed in elasticsearch. The terms foo, Foo, FOO are NOT equivalent. can be searched for using term queries
- Analysis is the process of converting full text to terms. These terms are what is actually stored in the index. A full text query (not a term query) for FoO:bAR will also be analyzed to the terms foo,bar and will thus match the terms stored in the index.
- Text (or full text) is ordinary unstructured text, such as this paragraph.
- Each document is stored in a single primary shard. primary shard is chosen by hashing the routing value. derived from document id or, parent document id(to ensure stored on the same shard). This value can be overridden by specifying a routing value at index time, or a routing field in the mapping.
- Slow Log
- Fields with the same name in different mapping types in the same index must have the same mapping.
- dynamic mapping rules can guess fields types. Or you can define by Explicit mappings
- existing type and field mappings cannot be updated
- token_count is really an integer field, to count the number of tokens in a string
- Array support does not require a dedicated type
- object for single JSON objects
- nested for arrays of JSON objects
- geo_point for lat/lon points
- geo_shape for complex shapes like polygons
- ip for IPv4 and IPv6 addresses
- completion to provide auto-complete suggestions
- murmur3 to compute hashes of values at index-time and store them in the index
- the mapper-attachments plugin which supports indexing attachments like Microsoft Office formats, Open Document formats, ePub, HTML, etc. into an attachment datatype.
- Percolator type: Accepts queries from the query-dsl
- The _all field concatenates the values of all of the other fields into one big string, then analyzed and indexed, but not stored. ("store": true)
- All values treated as strings
- The _all field takes fields’ boosts into account
- copy_to parameter allows the creation of multiple custom _all fields
- "first_name": {
    "type": "text",
    "copy_to": "full_name"
  },
  "last_name": {
    "type": "text",
    "copy_to": "full_name"
  }
- Stores query instead of document
- "properties": {
    "age": { "type": "integer" },
    "name": {
      "properties": {
        "first": { "type": "text" },
        "last": { "type": "text" }
      }
    }
  }
- all values in the array must be of the same datatype.
- null values are either replaced by the configured null_value or skipped entirely. An empty array [] is treated as a missing field — a field with no values.
- Treated as a set of data, without order
- use nested query to query them
- Indexing a document with 100 nested fields actually indexes 101 documents
- Meta fields reference, wrong url
- If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
- _default_ mapping, Configure the base mapping to be used for new mapping types.
- PUT index-name/_settings { "index.mapper.dynamic": false }
- numeric detection (which is disabled by default)
- How to dynamically map fields to types
- The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original _source field.
- Keyword fields are only searchable by their exact value.
- Coercion attempts to clean up dirty values to fit the datatype of a field by default
- "properties": {
    "city": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword", "analyzer": "english"
        }
      }
    }
  }
- "sort": {
    "city.raw": "asc"
  },
- Suggest search on different fields
- converting text, like the body of any email, into tokens or terms which are added to the inverted index for searching
- This same analysis process is applied to the query string at search time
- Outputs the statistic information of terms in a document
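That last note presumably refers to the term vectors API; a tiny sketch (index, type and document id are placeholders; assumes the `requests` package):

import requests

# Ask for per-term statistics of the "message" field of one stored document.
resp = requests.get(
    'http://localhost:9200/twitter/tweet/1/_termvectors',
    json={"fields": ["message"], "term_statistics": True},
)
print(resp.json()['term_vectors'])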