Detail Focused: ElasticSearch practice

Installation:

Install Elasticsearch, then check that it responds:
    curl -XGET http://localhost:9200/
Install Kibana. In config/kibana.yml, uncomment elasticsearch.url: "http://localhost:9200". Kibana is served at:
    http://localhost:5601
Install the Sense plugin:
    kibana plugin --install elastic/sense
Run Kibana and open Sense:
    http://localhost:5601/app/sense

ElasticSearch concepts

index - a collection of documents (think of it as a database)
type - a class/category of similar documents, e.g. "user" (think of it as a table)
mapping - similar to the schema of a table in an RDBMS
    includes the data type of each field, e.g. string, integer
    also includes information on how fields are indexed and stored by Lucene
document - the basic unit of information that can be indexed (think of it as a row in a table); it consists of fields (think of them as columns), which are key/value pairs
shards
    an index can be divided into multiple shards when one node cannot hold all of its data
    a shard can be stored on any node in the cluster
replicas
    a replica is a copy of a shard; it never resides on the same node as the original shard
Elastic search
To rank documents for a query, a score is calculated for each document that matches a query. The higher the score, the more relevant the document is to the search query.
Queries in query context affect the scores of matching documents.

The queries in query context answer the question: "How well does the document match?"

Queries in filter context do not affect the scores of matching documents.

The queries in filter context answer the question: "Does the document match?"
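To make the two contexts concrete, here is a minimal sketch of a search body that uses both. It assumes the name and status fields from the ecommerce/product examples in this post; the function name is made up for illustration, and the body is only built and printed, not sent.

```python
import json


def build_bool_search(text, status):
    """Return a search body with one scoring clause and one filter clause."""
    return {
        "query": {
            "bool": {
                # query context: how well the document matches feeds into _score
                "must": [{"match": {"name": text}}],
                # filter context: a yes/no test that does not change _score
                "filter": [{"term": {"status": status}}],
            }
        }
    }


if __name__ == "__main__":
    # Paste the printed JSON under "GET /ecommerce/product/_search" in Sense.
    print(json.dumps(build_bool_search("pasta", "active"), indent=2))
```

Moving a clause from must to filter keeps the same set of matching documents but removes that clause's contribution to the score.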

Query String

Query DSL - for complex and advanced queries
https://www.youtube.com/watch?v=ybu8XwbwXCQ

Leaf queries
Look for particular values in particular fields.
Compound queries
Wrap leaf clauses or even other compound query clauses.
Full text queries
Run full text queries on full text fields.
Term level queries
Used for exact matching of values, usually for structured data like numbers or dates, e.g. finding people born between 1980 and 2000.
Joining queries
Performing joins in a distributed system is expensive, so Elasticsearch provides these limited alternatives:
nested query
has_child query returns parent documents whose child documents match the query
has_parent query returns child documents whose parent document matches the query

Geo Queries
Commands typed in Sense
GET /ecommerce/product/_search?q=name:(pasta AND spaghetti)
The parts of this request are: index (ecommerce), type (product), API (_search), query string (q=...)
GET /ecommerce/product/_search?q=(name:(pasta AND spaghetti) AND status:active)
GET /ecommerce/product/_search?q=name:(+pasta -spaghetti) // must contain pasta, must not contain spaghetti; the parentheses keep both terms on the name field
GET /ecommerce/product/_search?q=name:(pasta spaghetti) // without quotes, this equals pasta OR spaghetti
GET /ecommerce/product/_search?q=name:"pasta spaghetti" // phrase query: both terms are required and their order matters. A value like "pasta - spaghetti" is still found even though it is not a 100% match, because the field is analysed.
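Sense accepts these query strings as typed, but outside of Sense (browser, curl, a script) the spaces, quotes, plus signs and parentheses need percent-encoding. A small sketch of that, assuming the same ecommerce/product index and type and the default localhost:9200 address; the helper name is hypothetical.

```python
from urllib.parse import parse_qs, quote, urlsplit


def build_uri_search(index, doc_type, query, host="http://localhost:9200"):
    """Build a URI-search URL with the query string percent-encoded."""
    # safe="" makes quote() encode every reserved character, including
    # ':', '(', ')', '"', '+' and spaces.
    return f"{host}/{index}/{doc_type}/_search?q={quote(query, safe='')}"


if __name__ == "__main__":
    queries = [
        "name:(pasta AND spaghetti)",
        "name:(+pasta -spaghetti)",
        'name:"pasta spaghetti"',
    ]
    for q in queries:
        url = build_uri_search("ecommerce", "product", q)
        # Decoding the URL again gives back the original query string.
        assert parse_qs(urlsplit(url).query)["q"] == [q]
        print(url)
```

Encoding the plus sign matters in particular: a literal + in a URL query is decoded as a space, which would silently drop the "must contain" operator.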

Types of aggregations
Metric
Bucket
Pipeline
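
As a rough illustration of how the three kinds fit together, the sketch below builds one request body containing a bucket, a metric and a pipeline aggregation. The status field is taken from the product examples above; the price field and all aggregation names are assumptions made up for this example.

```python
import json


def build_aggregation_body():
    """Return a search body with one aggregation of each kind."""
    return {
        "size": 0,  # return only aggregation results, no hits
        "aggs": {
            # bucket: groups documents, one bucket per distinct status
            "by_status": {
                "terms": {"field": "status"},
                "aggs": {
                    # metric: computes a number over the documents in each bucket
                    "avg_price": {"avg": {"field": "price"}}
                },
            },
            # pipeline: works on the output of other aggregations,
            # here picking the bucket with the highest average price
            "best_avg_price": {
                "max_bucket": {"buckets_path": "by_status>avg_price"}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_aggregation_body(), indent=2))
```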

useful queries
list all indices
http://localhost:9200/_cat/indices?v

an example query
http://localhost:9200/logstash-2016.11.21/_search?pretty&q=response:200

query on subfields
http://localhost:9200/mytest/_search?pretty&q=metadata._offset:3068837

_source_exclude/_source_include

http://localhost:9200/simpsons/episode/1/_source?_source_exclude=video_url

http://localhost:9200/simpsons/episode/1/_source?_source_include=title

reindex data

Reindexing copies documents from one index to another. I already had an index with documents and wanted to change the analyzer on one field. Since the analyzer of an existing field cannot be modified, I created a new index and reindexed from the old index into it; the new index then has all the documents of the old index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

You can limit which documents are copied by adding a type or a query to the source.
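
A minimal sketch of the request body for POST /_reindex, with the optional type and query limits. The index name simpsons, the type episode and the raw_character_text field come from other examples in this post; simpsons_v2 and the function name are hypothetical.

```python
import json


def build_reindex_body(source_index, dest_index, doc_type=None, query=None):
    """Return the body for POST /_reindex, optionally limited by type or query."""
    source = {"index": source_index}
    if doc_type is not None:
        source["type"] = doc_type  # copy only documents of this type
    if query is not None:
        source["query"] = query  # copy only documents matching this query
    return {"source": source, "dest": {"index": dest_index}}


if __name__ == "__main__":
    body = build_reindex_body(
        "simpsons",
        "simpsons_v2",  # hypothetical new index, created beforehand with the new analyzer
        doc_type="episode",
        query={"term": {"raw_character_text": "homer"}},
    )
    print(json.dumps(body, indent=2))
```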

delete documents by query

https://www.elastic.co/guide/en/elasticsearch/reference/5.3/docs-delete-by-query.html
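
For reference, a sketch of what such a request looks like: a POST to the index's _delete_by_query endpoint with an ordinary query as the body. The simpsons index and the raw_character_text field are borrowed from the alias example in this post; the helper name is an assumption.

```python
import json


def build_delete_by_query(index, query):
    """Return (method, path, body) for a delete-by-query request."""
    return "POST", f"/{index}/_delete_by_query", {"query": query}


if __name__ == "__main__":
    method, path, body = build_delete_by_query(
        "simpsons", {"term": {"raw_character_text": "homer"}}
    )
    # Prints the request in the same layout Sense uses.
    print(method, path)
    print(json.dumps(body, indent=2))
```

Running the same query through _search first is a cheap way to check what would be deleted.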

increase index queue size

When indexing documents in bulk, I got this error:
rejected execution (queue capacity 200)
One way to increase the index queue is to set the queue size in elasticsearch.yml and restart the server.

Another is to send a PUT request that persists the setting.

http://stackoverflow.com/questions/33110310/increasing-the-size-of-the-queue-in-elasticsearch
A better fix is not to increase queue_size at all, but to send the documents with the bulk action, so that many documents share one request.
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#api-bulk
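
The link above is for the JavaScript client; the wire format is the same for every client. As a sketch, the body of POST /_bulk is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The index, type and document contents below are made up for illustration.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Return the newline-delimited body for POST /_bulk.

    docs is an iterable of (doc_id, source) pairs.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # what to do and where
        lines.append(json.dumps(source))  # the document itself
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [(1, {"title": "first"}), (2, {"title": "second"})]
    print(build_bulk_body("mytest", "doc", docs), end="")
```

Sending a few hundred or a few thousand documents per request queues one task per request instead of one per document, which is what keeps the queue from filling up.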

Increase query timeout

For a timeout error when using elasticsearch.js, requestTimeout can be configured when initializing the elasticsearch.Client.

View tokens by analyzer

The _analyze API shows how a piece of text is tokenized by an analyzer.
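
A minimal sketch of such a request, assuming the built-in standard analyzer and some sample text. Passing an index name targets that index, so analyzers defined in its settings can be used as well; the helper name is hypothetical.

```python
import json


def build_analyze_request(text, analyzer="standard", index=None):
    """Return (method, path, body) for an _analyze request."""
    # With an index in the path, analyzers from that index's settings are available.
    path = f"/{index}/_analyze" if index else "/_analyze"
    return "POST", path, {"analyzer": analyzer, "text": text}


if __name__ == "__main__":
    method, path, body = build_analyze_request("Error handling in elasticsearch")
    print(method, path)
    print(json.dumps(body, indent=2))
```

The response lists each token with its position and offsets, which is handy when checking a custom analyzer.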

view settings

view mapping

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-mapping.html

Put mapping

If an index already exists, use this to add new fields to its mapping.

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html
A mapping can only be deleted by deleting the index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-delete-mapping.html
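
A sketch of a put mapping request that adds one field to an existing type, using the typed URL form that matches the index/type paths elsewhere in this post. The season field and the helper name are assumptions for illustration.

```python
import json


def build_put_mapping(index, doc_type, new_fields):
    """Return (method, path, body) that adds fields to an existing mapping.

    new_fields maps a field name to its definition, e.g. {"type": "integer"}.
    """
    return "PUT", f"/{index}/_mapping/{doc_type}", {"properties": dict(new_fields)}


if __name__ == "__main__":
    method, path, body = build_put_mapping(
        "simpsons", "episode", {"season": {"type": "integer"}}  # hypothetical field
    )
    print(method, path)
    print(json.dumps(body, indent=2))
```

This only works for fields that do not exist yet; changing the type or analyzer of an existing field still requires a new index and a reindex.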

Alias

An alias can expose a subset of an index, like a view in a relational database.
POST /_aliases
{
    "actions" : [
        {
            "add" : {
                "index" : "simpsons",
                "alias" : "homer",
                "filter" : {
                    "term" : {"raw_character_text" : "homer"}
                }
            }
        }
    ]
}

close/open indices

A closed index uses almost no cluster resources apart from disk space. For historical data, keep the index closed and open it only when you really need it.
POST http://localhost:9200/simpsons/_close
POST http://localhost:9200/simpsons/_open

_cat api

localhost:9200/_cat lists all the available _cat commands

update

curl -XPOST -d '{"doc":{"views":1001, "tags":["elasticsearch"]}}' localhost:9200/myindex/myType/3/_update
The command above updates the document directly with a partial document.
The command below runs a script to update the document.
curl -XPOST -d '{"script":"ctx._source.views += 1"}' localhost:9200/myindex/myType/3/_update
Updating a document directly performs better than updating it by running a script.
To handle conflicts between concurrent requests, use retry_on_conflict. For an update, Elasticsearch gets the document, merges the changes and writes the result back to the index. If the _version is no longer the same at write time, another process updated the document at the same moment. The retry_on_conflict parameter tells Elasticsearch how many times to repeat these steps before giving up.
curl -XPOST -d '{"script":"ctx._source.views += 1"}' 'localhost:9200/myindex/myType/3/_update?retry_on_conflict=5'
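The get, merge, write-with-version-check loop can be illustrated without a cluster. The sketch below is a toy in-memory model of the idea, not Elasticsearch's actual implementation; TinyStore, update and the before_write hook are all invented for this example.

```python
import copy


class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class TinyStore:
    """In-memory stand-in for an index: doc_id -> (version, source)."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, copy.deepcopy(source)

    def put(self, doc_id, source, expected_version=None):
        current = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected v{expected_version}, found v{current}")
        self._docs[doc_id] = (current + 1, copy.deepcopy(source))
        return current + 1


def update(store, doc_id, change, retry_on_conflict=0, before_write=None):
    """Get the document, apply change, write it back; retry on a conflict."""
    for attempt in range(retry_on_conflict + 1):
        version, source = store.get(doc_id)
        change(source)
        if before_write is not None:
            before_write(attempt)  # hook that lets a "concurrent" writer sneak in
        try:
            return store.put(doc_id, source, expected_version=version)
        except VersionConflict:
            if attempt == retry_on_conflict:
                raise  # out of retries, surface the conflict to the caller


def add_view(source):
    source["views"] += 1


if __name__ == "__main__":
    store = TinyStore()
    store.put(3, {"views": 1000})
    # A rival update lands between our read and our first write attempt.
    rival = lambda attempt: update(store, 3, add_view) if attempt == 0 else None
    update(store, 3, add_view, retry_on_conflict=5, before_write=rival)
    print(store.get(3))
```

In this model the first write fails because the rival bumped the version, and the retry re-reads the document, so neither increment is lost. With retry_on_conflict=0 the same scenario raises VersionConflict instead.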

op_type=create

curl -i -XPUT -d '{"title":"Error handling in elasticsearch"}' 'localhost:9200/myindex/myType/3?op_type=create'
If a document with id 3 already exists, you get the error "document already exists". Otherwise it is saved successfully.

Alternative
curl -i -XPUT -d '{"title":"Error handling in elasticsearch"}' localhost:9200/myindex/myType/3/_create
If the document already exists, you get HTTP 409. If it does not exist, you get HTTP 201.

Stop words for the same root

Here is an article about stopwords; it explains that stopwords are filtered out before the stemmer filter is applied.
https://www.peterbe.com/plog/elasticsearch-snowball-analyzer-and-stopwords
I need to stop all words that share a root; for example, I want to stop expect, expected and expecting. Using the stopwords option of the stemming analyzer does not work, because the stopword filter is applied before stemming. I need the stopword filter applied after stemming. To do that, I need a fully custom analyzer in which the my_stop filter comes after porter_stem.
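
A sketch of index settings along those lines. The filter name my_stop is from the text above; the analyzer name my_analyzer, the standard tokenizer and the lowercase filter are assumptions. Because the stop filter now runs on stemmed tokens, the stopword list has to contain the stemmed form.

```python
import json


def build_index_settings(stemmed_stopwords):
    """Return index settings whose analyzer removes stopwords after stemming."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # stop filter fed with already-stemmed words
                    "my_stop": {"type": "stop", "stopwords": list(stemmed_stopwords)}
                },
                "analyzer": {
                    "my_analyzer": {  # hypothetical analyzer name
                        "type": "custom",
                        "tokenizer": "standard",
                        # order matters: stem first, then drop stopwords
                        "filter": ["lowercase", "porter_stem", "my_stop"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # "expect" is the Porter stem shared by expect, expected and expecting.
    print(json.dumps(build_index_settings(["expect"]), indent=2))
```

The printed body can be used when creating the index, and the _analyze API is a quick way to confirm that all three word forms disappear.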

stemmer_override

The stem of a word can be overridden. The rules can be embedded in the filter definition

or loaded from a file. The path is either relative to the config location, or absolute.

A sample file is like this

https://simpsora.wordpress.com/2014/05/02/customizing-elasticsearch-english-analyzer/
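
A sketch of both variants of the filter definition, using the rules and rules_path parameters of the stemmer_override token filter. The sample rule and the file path are invented for illustration, and the helper name is hypothetical.

```python
import json


def build_stemmer_override(rules=None, rules_path=None):
    """Return a stemmer_override filter definition with inline rules or a rules file."""
    if (rules is None) == (rules_path is None):
        raise ValueError("pass exactly one of rules or rules_path")
    definition = {"type": "stemmer_override"}
    if rules is not None:
        definition["rules"] = list(rules)  # rules embedded in the settings
    else:
        definition["rules_path"] = rules_path  # relative to the config dir, or absolute
    return definition


if __name__ == "__main__":
    print(json.dumps(build_stemmer_override(rules=["running => run"])))
    # Hypothetical file under the config directory, one "word => stem" rule per line.
    print(json.dumps(build_stemmer_override(rules_path="analysis/stemmer_override.txt")))
```

In the analyzer's filter chain the override filter has to come before the stemmer, so that the overridden words are protected from further stemming.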

Random documents

The query to get random documents takes a seed; which documents are returned depends on the seed.
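
One way to write such a query is a function_score query with random_score, sketched below. This is an assumption about the approach (it fits the mention of a seed); the seed value and the function name are arbitrary. Newer Elasticsearch versions expect a field next to the seed, so check the documentation for the version in use.

```python
import json


def build_random_query(seed, size=10):
    """Return a search body that scores every document randomly, driven by seed."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match_all": {}},  # candidates: every document
                "random_score": {"seed": seed},  # random score derived from the seed
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_random_query(seed=42, size=5), indent=2))
```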

relationships and join query

http://detailfocused.blogspot.ca/2017/04/elasticsearch-join-query.html

_update_by_query

http://detailfocused.blogspot.ca/2017/05/elasticsearch-updatebyquery.html

elasticsearch query

http://detailfocused.blogspot.ca/2017/05/elasticsearch-query.html

elasticsearch distinct value and search

http://detailfocused.blogspot.ca/2017/05/elasticsearch-distinct-value-and-search.html

- to be continued -


Wednesday, October 5, 2016
