Tuesday, March 28, 2017

Elasticsearch learning log



Elasticsearch
    1. Usages:
      1. autocomplete suggestions
      2. collecting log or transaction data that you want to analyze and mine for trends, statistics, summaries, or anomalies
      3. a price-alerting platform that lets price-savvy customers specify rules (e.g. notify me if an item’s price drops below a threshold)
      4. analytics/business-intelligence needs where you want to quickly investigate, analyze, visualize, and ask ad-hoc questions
    1. Make sure that you don’t reuse the same cluster name in different environments
    2. A Node is identified by a name, which by default is a random Universally Unique IDentifier (UUID) assigned to the node at startup. You can set any node name you want if you don’t want the default.
    3. An Index is a collection of documents that have somewhat similar characteristics.
    4. Within an index, you can define one or more Types.
    5. A Document is a basic unit of information that can be indexed.
    6. Shards & Replicas: you may change the number of replicas dynamically at any time, but you cannot change the number of shards after the index is created (see the example below)
      1. Default: 5 primary shards & 1 replica (i.e. two copies of every shard)
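      2. A minimal sketch of setting these at index creation time and changing replicas later (index name and counts are made up for illustration):
         curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
         { "settings": { "number_of_shards": 3, "number_of_replicas": 1 } }'
         curl -XPUT 'localhost:9200/my_index/_settings?pretty' -H 'Content-Type: application/json' -d'
         { "index": { "number_of_replicas": 2 } }'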
    1. Pass
    1. curl -XGET 'localhost:9200/_cat/indices?v&pretty'
    2. curl -XGET 'localhost:9200/_cat/nodes?v&pretty'
    3. curl -XGET 'localhost:9200/_cat/health?v&pretty'
    4. curl -XDELETE 'localhost:9200/customer?pretty'
    5. <REST Verb> /<Index>/<Type>/<ID>
    6. curl -XPUT 'localhost:9200/customer/external/1?pretty' -H 'Content-Type: application/json' -d' { "name": "John Doe" } '
      1. PUT with an existing ID replaces (updates) the document; with a new ID it inserts it
      2. Use the POST verb instead of PUT when you don’t specify an ID; Elasticsearch generates one for you
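      3. For example, letting Elasticsearch generate the ID (same index/type as above):
         curl -XPOST 'localhost:9200/customer/external?pretty' -H 'Content-Type: application/json' -d' { "name": "Jane Doe" } '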
    7. curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -H 'Content-Type: application/json' -d'{"script" : "ctx._source.age += 5"}'
    8. DELETE /customer/external/2?pretty
    9. Batch Processing
      1. POST /customer/external/_bulk?pretty
      2. {"index":{"_id":"1"}}
      3. {"name": "John Doe" }
      4. {"index":{"_id":"2"}}
      5. {"name": "Jane Doe" }
      6. {"update":{"_id":"1"}}
      7. {"doc": { "name": "John Doe becomes Jane Doe" } }
      8. {"delete":{"_id":"2”}}
      9. The Bulk API does not fail as a whole when one of its actions fails; the response reports the status of every action (check the top-level "errors" flag and each item’s status), so you can check whether a specific action failed or not.
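      10. For example, sending a file of newline-delimited actions like the ones above (the file name is just a placeholder):
          curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -H 'Content-Type: application/json' --data-binary "@requests.json"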
    1. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format.
    2. GET /bank/_search
       {
         "query": { "match_all": {} },
         "sort": { "balance": { "order": "desc" } },
         "_source": ["account_number", "balance"],
         "from": 10,
         "size": 10
       }
       1. _source limits which fields of each document are returned
    3. Query:
    4. "bool": {
           "should": [
             { "match": { "address": "mill" } },
             { "match": { "address": "lane" } }
           ]
          1. should -> or
          2. must -> and
          3. must_not -> all false
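          4. A fuller sketch combining clauses (fields follow the bank example above):
             GET /bank/_search
             {
               "query": {
                 "bool": {
                   "must": [ { "match": { "age": "40" } } ],
                   "must_not": [ { "match": { "state": "ID" } } ]
                 }
               }
             }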
    5.     "filter": {
             "range": {
               "balance": {
                 "gte": 20000,
                 "lte": 30000
               }
             }
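       In context, the filter sits inside a bool query, e.g.:
       GET /bank/_search
       {
         "query": {
           "bool": {
             "must": { "match_all": {} },
             "filter": {
               "range": {
                 "balance": { "gte": 20000, "lte": 30000 }
               }
             }
           }
         }
       }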
      1. Haven’t had time to read this part yet
    1. Reindex supports the full query DSL (to select which documents to copy)
    2. Cross-cluster: reindex from a remote cluster
    3. Scripting: transform documents while reindexing
    4. Fetch the status of all running reindex requests
      1. curl -XGET 'localhost:9200/_tasks?detailed=true&actions=*reindex&pretty'
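      2. A minimal reindex sketch (index names and the query are made up for illustration):
         curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
         {
           "source": { "index": "old_index", "query": { "term": { "user": "kimchy" } } },
           "dest":   { "index": "new_index" }
         }'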
    1. To pre-process documents before indexing, you define a pipeline that specifies a series of processors (see the example after this list). Available processors include:
    2. Append Processor
    3. Convert Processor
    4. Date Processor
    5. Date Index Name Processor
    6. Fail Processor
    7. Foreach Processor
    8. Grok Processor
    9. Gsub Processor
    10. Join Processor
    11. JSON Processor
    12. KV Processor
    13. Lowercase Processor
    14. Remove Processor
    15. Rename Processor
    16. Script Processor
    17. Set Processor
    18. Split Processor
    19. Sort Processor
    20. Trim Processor
    21. Uppercase Processor
    22. Dot Expander Processor
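    23. A minimal sketch of defining a pipeline and indexing through it (pipeline name and field are made up):
        curl -XPUT 'localhost:9200/_ingest/pipeline/my-pipeline?pretty' -H 'Content-Type: application/json' -d'
        {
          "description": "lowercase the name field",
          "processors": [ { "lowercase": { "field": "name" } } ]
        }'
        curl -XPUT 'localhost:9200/customer/external/3?pipeline=my-pipeline&pretty' -H 'Content-Type: application/json' -d' { "name": "JOHN DOE" } '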
    1. Keeping the search context alive
    2. POST  /_search/scroll
      {
         "scroll" : "1m",
         "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAA..."
      }
    3. POST /twitter/tweet/_search?scroll=1m
      1. Filtered / routing / mapping
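    4. Search contexts cost memory, so it is a good idea to clear the scroll when you are done (scroll_id as returned by the searches above):
       curl -XDELETE 'localhost:9200/_search/scroll' -H 'Content-Type: application/json' -d'
       { "scroll_id" : ["DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAA..."] }'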
    1. Don’t return large result sets, use scroll APIs
    2. Avoid large documents: http.max_content_length is set to 100MB by default, and Lucene still has a hard limit of about 2GB.
      1. Wanting to make books searchable doesn’t necessarily mean that a document should consist of a whole book; chapters or paragraphs may be better units.
    3. Avoid sparsity
      1. Avoid putting unrelated data in the same index
      2. Even if you really need to put different kinds of documents in the same index, normalize the document structures so they share as many fields as possible
      3. Avoid types: having multiple types with different fields in a single index will also cause sparsity problems
      4. Norms can be disabled on a field if producing relevance scores for it is not necessary (see the mapping sketch below)
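      5. A minimal mapping sketch disabling norms (index, type, and field names are made up):
         PUT my_index/_mapping/my_type
         {
           "properties": {
             "title": { "type": "text", "norms": false }
           }
         }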
    1. A document is stored in an index and has a type and an id. A document is a JSON object; the original JSON that is indexed is stored in the _source field.
    2. A mapping is like a schema definition in a relational database. The mapping also allows you to define (amongst other things) how the value of a field should be analyzed, and it carries a number of index-wide settings. Fields with the same name in different types in the same index must have the same mapping.
    3. An index is like a table in a relational database. It has a mapping which defines the fields in the index, which may be grouped into multiple types.
    4. Each primary shard can have zero or more replicas. A replica is a copy of the primary shard and has two purposes: increase failover and increase performance. A replica is never allocated on the same node as its primary shard.
    5. A shard is a single Lucene instance. You never need to refer to shards directly.
    6. A term is an exact value that is indexed in Elasticsearch. The terms foo, Foo, FOO are NOT equivalent. Terms can be searched for using term queries.
    7. Analysis is the process of converting full text to terms. These terms are what is actually stored in the index. A full text query (not a term query) for FoO:bAR will also be analyzed to the terms foo,bar and will thus match the terms stored in the index.
    8. Text (or full text) is ordinary unstructured text, such as this paragraph.
    9. Each document is stored in a single primary shard. The primary shard is chosen by hashing the routing value, which is derived from the document’s id or, if present, its parent document’s id (to ensure parent and child are stored on the same shard). The routing value can be overridden by specifying a routing value at index time, or a routing field in the mapping (see the example below).
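    10. For example, overriding routing at index time (index/type/field values are made up):
        curl -XPUT 'localhost:9200/twitter/tweet/1?routing=kimchy&pretty' -H 'Content-Type: application/json' -d' { "user": "kimchy", "message": "trying out routing" } '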
    1. Slow Log
    1. Fields with the same name in different mapping types in the same index must have the same mapping.
    2. Dynamic mapping rules can guess field types, or you can define them yourself with explicit mappings (see the example below)
    3. Existing type and field mappings cannot be updated; changing them generally means creating a new index and reindexing
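    4. A minimal sketch of creating an index with an explicit mapping (all names are made up):
       curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
       {
         "mappings": {
           "my_type": {
             "properties": {
               "title":   { "type": "text" },
               "created": { "type": "date" }
             }
           }
         }
       }'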
    1. token_count is really an integer field, to count the number of tokens in a string
    2. Array support does not require a dedicated type
    3. object for single JSON objects
    4. nested for arrays of JSON objects
    5. geo_point for lat/lon points
    6. geo_shape for complex shapes like polygons
    7. ip for IPv4 and IPv6 addresses
    8. completion to provide auto-complete suggestions
    9. murmur3 to compute hashes of values at index-time and store them in the index
    10. the mapper-attachments plugin which supports indexing attachments like Microsoft Office formats, Open Document formats, ePub, HTML, etc. into an attachment datatype.
    11. Percolator type: Accepts queries from the query-dsl
    1. The _all field concatenates the values of all of the other fields into one big string, which is then analyzed and indexed but not stored (storing can be enabled with "store": true)
    2. All values are treated as strings
    3. The _all field takes fields’ boosts into account
    4. copy_to parameter allows the creation of multiple custom _all fields
    5.        "first_name": {
               "type":    "text",
               "copy_to": "full_name"
             },
             "last_name": {
               "type":    "text",
               "copy_to": "full_name"
             },
    1. Stores queries instead of documents (the percolator)
    1. "properties": {
                 "age":  { "type": "integer" },
                 "name": {
                   "properties": {
                     "first": { "type": "text" },
                     "last":  { "type": "text" }
                   }
                  }
                }
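    2. A document matching this mapping might look like (values are made up):
       { "age": 25, "name": { "first": "John", "last": "Smith" } }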
    1. all values in the array must be of the same datatype.
    2. null values are either replaced by the configured null_value or skipped entirely. An empty array [] is treated as a missing field — a field with no values.
    3. Treated as a set of data, without order
    1. Use the nested query to query them (see the sketch below)
    2. Indexing a document with 100 nested fields actually indexes 101 documents
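    3. A minimal nested query sketch (index and field names are made up):
       GET /my_index/_search
       {
         "query": {
           "nested": {
             "path": "user",
             "query": {
               "bool": {
                 "must": [
                   { "match": { "user.first": "John" } },
                   { "match": { "user.last":  "Smith" } }
                 ]
               }
             }
           }
         }
       }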
    1. Meta fields reference, wrong url
    1. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
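    2. A mapping sketch with doc values disabled on one field (field name is made up):
       "properties": {
         "session_id": { "type": "keyword", "doc_values": false }
       }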
    1. The _default_ mapping configures the base mapping to be used for new mapping types
    2. Disable automatic creation of new types: PUT index-name/_settings { "index.mapper.dynamic": false }
    3. numeric detection (which is disabled by default)
    1. How dynamically added fields get mapped to a datatype; dynamic templates let you customise this (see the sketch below)
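    2. A minimal dynamic_templates sketch mapping new string fields to keyword (all names are made up):
       PUT my_index
       {
         "mappings": {
           "my_type": {
             "dynamic_templates": [
               {
                 "strings_as_keywords": {
                   "match_mapping_type": "string",
                   "mapping": { "type": "keyword" }
                 }
               }
             ]
           }
         }
       }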
    1. The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original _source field.
    1. Keyword fields are only searchable by their exact value.
    1. By default, coercion attempts to clean up dirty values to fit the datatype of a field (e.g. the string "5" is coerced to the integer 5); it can be disabled per field (see the sketch below)
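    2. A mapping sketch turning coercion off for one field (field name is made up):
       "properties": {
         "number_of_bytes": { "type": "integer", "coerce": false }
       }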
    1. "properties": {
             "city": {
               "type": "text",
               "fields": {
                 "raw": {
                   "type":  "keyword", "analyzer": "english"
                 }
               }
             }
    2. "sort": {
         "city.raw": "asc"
       },
    1. Multi-fields also let the same value be indexed in different ways, to support search on it in different ways
    1. Analysis is the process of converting text, like the body of an email, into tokens or terms which are added to the inverted index for searching
    2. This same analysis process is applied to the query string at search time
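    3. You can see what an analyzer produces with the _analyze API, e.g.:
       curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
       { "analyzer": "standard", "text": "The QUICK Brown Foxes." }'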
    1. Term vectors: returns information and statistics on the terms in the fields of a particular document (see the example below)
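    2. For example (following the twitter example above):
       curl -XGET 'localhost:9200/twitter/tweet/1/_termvectors?pretty'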
