Wednesday, May 24, 2017

elasticsearch query

Full-text search queries

The most important queries in this category are the following:
- match_all
- match
- match_phrase
- multi_match
- query_string

- match_all returns all documents, the same result as sending {}

GET words_v1/userFamilarity/_search
{
  "query": {
    "match_all": {
    }
  }
}

- match

{
    "query" : {
        "match": {
            "title":"abc"
        }
    }
}
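match analyzes the query text and looks for any of the resulting words; documents that match more of the words score higher. Word order itself is not taken into account.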
GET /my_test/words/_search
{
  "query": {
    "match": {
      "english" : "\"This is all of it\""
    }
  }
}
It returns "This is all of it" as the first document,
then "This is all right",
then "it is all right".

- match_phrase: the words must appear together and in the same order

{
    "query" : {
        "match_phrase" : {
            "spoken_words":"makes me laugh"
        }
    }
}

- multi_match

Match in multiple fields: a result must contain the query words in either "spoken_words" or "raw_character_text". How much a match in both fields raises the score depends on the multi_match "type" setting.
{
    "query" : {
        "multi_match" : {
            "query":"homer simpson",
            "fields": ["spoken_words", "raw_character_text"]
        }
    }
}
Boost a field: here "raw_character_text" is boosted by a factor of 8.
{
    "query" : {
        "multi_match" : {
            "query":"homer simpson",
            "fields": ["spoken_words", "raw_character_text^8"]
        }
    }
}
multi_match with "operator": "and"
GET /my_test/words/_search
{
  "query": {
    "multi_match": {
      "query" : "eye sky",
      "fields" : ["english", "sourceTitle"],
      "operator" : "and"
    }
  }
}

- query_string
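
The notes leave this entry empty. As a rough sketch, a query_string request body could be put together like this in Python. The helper name `query_string_body` is made up for illustration, the field name is borrowed from the other examples in this post, and the code only builds the JSON; it does not talk to a cluster.

```python
import json

def query_string_body(query, fields, default_operator="OR"):
    """Build a query_string search body (hypothetical helper, sketch only)."""
    return {
        "query": {
            "query_string": {
                "query": query,                  # supports AND/OR, wildcards (*), fuzzy (~)
                "fields": list(fields),
                "default_operator": default_operator,
            }
        }
    }

body = query_string_body("homer AND simpson", ["spoken_words"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after a GET .../_search line in the same way as the other request bodies here.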


- AND / OR operator

With "operator": "and", only documents that contain both "all" and "special" are returned. The default operator is "or".
GET /my_test/words/_search
{
  "query": {
    "match": {
      "english": {
        "query" : "all special",
        "operator": "and"
      }
    }
  }
}
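
To make the difference between the two operators concrete, here is a toy Python version that compares lowercase word sets. It is only an illustration under simplifying assumptions (no analyzer, no scoring), and apart from the first sentence the sample documents are invented.

```python
def matches(doc_text, query_text, operator="or"):
    """Toy version of the match query's operator, working on lowercase word sets."""
    doc_terms = set(doc_text.lower().split())
    query_terms = set(query_text.lower().split())
    if operator == "and":
        return query_terms <= doc_terms       # every query word must be present
    return bool(query_terms & doc_terms)      # one query word is enough

docs = ["This is all of it", "Nothing special here", "All of them are special"]
print([d for d in docs if matches(d, "all special", "and")])
print([d for d in docs if matches(d, "all special", "or")])
```

With "and" only the last document qualifies; with "or" all three do.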

- wildcard

Note: wildcard queries can consume a lot of memory and time.
{
    "query" : {
        "query_string" : {
            "fields":["spoken_words"],
            "query":"fri*"
        }
    }
}

- fuzzy match: finds a word even when it is misspelled

{
    "query" : {
        "query_string" : {
            "fields":["spoken_words"],
            "query":"dnout~"
        }
    }
}
The maximum edit distance can be given after the ~; a smaller distance improves performance. The default distance is 2.
{
    "query" : {
        "query_string" : {
            "fields":["spoken_words"],
            "query":"dnout~1"
        }
    }
}
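
A rough way to read the number after the ~: it is the maximum edit distance allowed between the query word and an indexed word. The sketch below computes that distance in plain Python, counting a swap of two adjacent letters as a single edit, which is how Elasticsearch's fuzzy matching behaves by default. It assumes "donut" is the word that "dnout" was aiming for, which the notes do not state, and `edit_distance` is an illustrative helper, not an Elasticsearch API.

```python
def edit_distance(a, b):
    """Edit distance with insert, delete, substitute and adjacent-swap operations."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                            # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                            # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

print(edit_distance("dnout", "donut"))   # 1
```

Since the distance is 1, the cheaper "dnout~1" is enough for this particular misspelling.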

Term-based search queries

The most important queries in this category are the following:
- Term query
- Terms query
- Range query
- Exists query / Missing query

- Term query

The term query does exact term matching in a given field, so you need to provide the exact term to get correct results. For example, if a lowercase filter was used while indexing, you need to pass the terms in lowercase when querying with the term query.
Another example: after stemming, "house" is indexed as "hous". The term query with "hous" returns the documents containing "house", because the term is looked up as-is.
The match query with "hous" can return nothing, presumably because the match query analyzes "hous" again before searching.
GET /words_v1/words/_search
{
  "query": {
    "term" : {
      "english" : "hous"
    }
  }
}
GET /words_v1/words/_search
{
  "query": {
    "match" : {
      "english" : "hous"
    }
  }
}
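
The two requests above can be imitated with a toy inverted index in Python. The analyzer here is deliberately crude (it just drops one trailing "s" or "e") and is NOT the real Elasticsearch stemmer; it is only meant to show why a term lookup and an analyzed lookup can disagree.

```python
def toy_analyze(text):
    """Toy analyzer: lowercase, split on spaces, strip one trailing 's' or 'e'."""
    terms = []
    for word in text.lower().split():
        if word.endswith("s") or word.endswith("e"):
            word = word[:-1]
        terms.append(word)
    return terms

# Index time: the document text is analyzed, so "house" is stored as "hous".
index = set(toy_analyze("The house"))

def term_query(term):
    return term in index                                 # looked up as-is

def match_query(text):
    return any(t in index for t in toy_analyze(text))    # analyzed first

print(term_query("hous"), match_query("hous"))     # True False
print(term_query("house"), match_query("house"))   # False True
```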

- Terms query
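
No example is given for this one in the notes. A minimal sketch, assuming the same "raw_character_text" field used elsewhere in this post: the terms query takes a list of exact terms and a document matches if the field holds any of them. `terms_body` is a hypothetical helper that only builds the request body.

```python
import json

def terms_body(field, values):
    """Build a terms query: the field must contain at least one of the values."""
    return {"query": {"terms": {field: list(values)}}}

print(json.dumps(terms_body("raw_character_text", ["homer", "lisa"]), indent=2))
```

Like the term query, the values are not analyzed, so they have to match the indexed terms exactly.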


- Range query
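
The notes have no standalone example here, although a range clause on "imdb_rating" shows up inside the bool queries below. As a sketch, a range request body can be built with a small Python helper; `range_body` is an invented name and the check on bound names is just a convenience.

```python
def range_body(field, **bounds):
    """Build a range query; bounds may use gt, gte, lt and lte."""
    allowed = {"gt", "gte", "lt", "lte"}
    unknown = set(bounds) - allowed
    if unknown:
        raise ValueError(f"unsupported range bounds: {sorted(unknown)}")
    return {"query": {"range": {field: bounds}}}

print(range_body("imdb_rating", gt=4, lt=8))
```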

- Exists query / Missing query

These return the documents that have, or do not have, a value in a given field.
{
    "query": {
        "exists" : { "field" : "user" }
    }
}
The missing query can be expressed by combining must_not with an exists query:
{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "user"
                }
            }
        }
    }
}

Compound queries

Compound queries connect multiple simple queries together to refine a search:
- bool query
- not query
- Function score query

- bool query

{
    "query":{
        "bool":{
            "must":[{}],
            "should":[{}],
            "must_not":[{}],
            "filter":[{}]  //A query in this clause must match the document, but it does not contribute to scoring.
        }
    }
}

{
    "query" : {
        "bool": {
            "must": {"match": {"title":"homer"}},
            "must_not": {"range": {"imdb_rating":{"gt": 8}}}
        }
    }
}

{
    "query" : {
        "bool": {
            "must": [
                {"match": {"title":"homer"}},
                {"range": {"imdb_rating":{"gt": 4, "lt":8}}}
            ]
        }
    }
}
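Change the range clause from "must" to "filter". The range then only decides whether a document is included and does not contribute to the score, so Elasticsearch can skip scoring for it and cache the filter.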
{
    "query" : {
        "bool": {
            "must": {"match": {"title":"homer"}},
            "filter": {"range": {"imdb_rating":{"gt": 4, "lt":8}}}
        }
    }
}
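
When a bool query needs several clauses of the same kind, they go into a list, because repeating a key such as "must" inside one JSON object is not reliable. A quick check with Python's json module shows its parser quietly keeping only the last value. The `bool_body` helper after it is an invented convenience that always stores clauses as lists; how Elasticsearch itself reacts to duplicate keys is not covered here.

```python
import json

# Repeating "must" in one object: Python's json parser keeps only the last value.
duplicated = json.loads(
    '{"must": {"match": {"title": "homer"}},'
    ' "must": {"range": {"imdb_rating": {"gt": 4, "lt": 8}}}}'
)
print(list(duplicated["must"]))   # ['range'] -- the match clause is gone

def bool_body(must=(), filter_=(), must_not=(), should=()):
    """Build a bool query, keeping every clause by storing them as lists."""
    clauses = {"must": must, "filter": filter_, "must_not": must_not, "should": should}
    return {"query": {"bool": {name: list(items) for name, items in clauses.items() if items}}}

body = bool_body(
    must=[{"match": {"title": "homer"}}],
    filter_=[{"range": {"imdb_rating": {"gt": 4, "lt": 8}}}],
)
print(json.dumps(body, indent=2))
```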

--------------------
The query JSON object is nested as: query => query type (such as match, term, multi_match, range...) => field name => any further settings the query needs.

Queries are used to find out how relevant a document is to a particular query by calculating a score for each document, whereas filters are used to match certain criteria. In the query context, put the queries that ask about document relevance and score calculation; in the filter context, put the queries that answer a simple yes/no question.


* Query on a date field, e.g. find documents created after 2017-Feb-01.
* constant_score: a query that wraps another query and simply returns a constant score, equal to the query boost, for every document matched by the filter.
* For performance reasons, do not sort on analyzed fields.
---------------------
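
The first two notes above can be combined in one small sketch: a constant_score query whose filter is a range on a date field. The field name "created_at" is an assumption (it is borrowed from the sort example at the end of this post), `created_since` is an invented helper, and "gte" is used so the bound reads as "on or after" the given day.

```python
def created_since(date, field="created_at", boost=1.0):
    """constant_score query: every document passing the filter gets a score equal to boost."""
    return {
        "query": {
            "constant_score": {
                "filter": {"range": {field: {"gte": date}}},
                "boost": boost,
            }
        }
    }

print(created_since("2017-02-01"))
```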

aggregation

There are 4 types: pipeline, metrics, bucket, and matrix aggregations.
- Metrics aggregations compute statistics, such as min, max, and average, on a field of the documents that meet certain criteria.
{
    "aggs": {  //declare doing aggregation
        "avg_word_count" : { //the field name in the result
            "avg" : {  // the function to do the aggregation, could be max, min...?
                "field" : "word_count"
            }
        }
    }
}
The structure is like this
{
    "aggs": {
        "aggregation_name": {
            "aggregation_type": {
                "field": "name_of_the_field"
            }
        }
    }
}
Set "size" to 0 to return only the aggregation result:
{
    "size" : 0, //without this, the result lists the matching documents first, then the aggregation result
    "aggs": {  //declares the aggregations
        "avg_word_count" : { //the name of the aggregation in the result
            "avg" : {  //the aggregation function; could also be max, min, ...
                "field" : "word_count"
            }
        }
    }
}
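
Because every single-field metric request has the same name => type => field shape, it can be generated. A small Python sketch, with `metric_agg_body` as an invented helper and "word_count" taken from the example above:

```python
import json

def metric_agg_body(name, agg_type, field, size=0):
    """Build {"size": ..., "aggs": {name: {agg_type: {"field": field}}}}."""
    return {"size": size, "aggs": {name: {agg_type: {"field": field}}}}

for agg_type in ("avg", "min", "max", "sum"):
    print(json.dumps(metric_agg_body(f"{agg_type}_word_count", agg_type, "word_count")))
```

Each printed line is one request body; only the aggregation name and type change.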

- extended_stats

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "extended_stats": {
        "field": "familarity"
      }
    }
  }
}
Here is the result
"aggregations": {
    "result": {
      "count": 47,
      "min": 0,
      "max": 3,
      "avg": 2.8085106382978724,
      "sum": 132,
      "sum_of_squares": 388,
      "variance": 0.3675871435038475,
      "std_deviation": 0.6062896531393617,
      "std_deviation_bounds": {
        "upper": 4.0210899445765955,
        "lower": 1.595931332019149
      }
    }
  }
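
As a sanity check, the numbers above can be reproduced by hand from the value counts that the terms aggregation reports further down in this post (42 documents with 3, 2 with 1, 2 with 2 and 1 with 0). The sketch assumes the population variance and the default bounds of two standard deviations around the average.

```python
import math

# Value counts taken from the terms aggregation result shown later in this post.
value_counts = {3: 42, 1: 2, 2: 2, 0: 1}
values = [v for v, n in value_counts.items() for _ in range(n)]

count = len(values)
total = sum(values)
sum_of_squares = sum(v * v for v in values)
avg = total / count
variance = sum_of_squares / count - avg ** 2    # population variance
std_deviation = math.sqrt(variance)
sigma = 2                                       # bounds are avg +/- 2 std deviations
bounds = (avg - sigma * std_deviation, avg + sigma * std_deviation)

print(count, min(values), max(values), total, sum_of_squares)   # 47 0 3 132 388
print(avg, variance, std_deviation, bounds)
```

The second line agrees with the "avg", "variance", "std_deviation" and "std_deviation_bounds" values above up to floating-point rounding.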

- cardinality

The number of distinct values of a field can be calculated using the cardinality aggregation.
{
    "size" : 0,
    "aggs": {  
        "speaking_line_count" : {
            "cardinality" : {
                "field" : "raw_character_text"
            }
        }
    }
}

- percentiles

{
    "size" : 0,
    "aggs": {
        "word_count_percentiles" : {
            "percentiles" : {
                "field" : "word_count"
            }
        }
    }
}


Fielddata is disabled on text fields by default, so aggregating on a text field requires enabling it in the mapping first:
PUT /myIndex/myType/_mapping/script    //script is type name
{
    "properties" : {
        "raw_character_text" : {
            "type" : "text",
            "fielddata" : true
        }
    }
}

bucket

Bucket aggregations categorize documents based on some criteria, like GROUP BY in SQL.

- Terms aggregation: counts documents grouped by term

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "terms": {
        "field": "familarity"
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 3,
          "doc_count": 42
        },
        {
          "key": 1,
          "doc_count": 2
        },
        {
          "key": 2,
          "doc_count": 2
        },
        {
          "key": 0,
          "doc_count": 1
        }
      ]
    }
  }

- Range aggregation

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "range": {
        "field": "familarity",
        "ranges": [
          {"to":3},  //3 is excluded
          {"from":3, "to":4}
        ]
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "buckets": [
        {
          "key": "*-3.0",
          "to": 3,
          "doc_count": 5
        },
        {
          "key": "3.0-4.0",
          "from": 3,
          "to": 4,
          "doc_count": 42
        }
      ]
    }
  }
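
The bucket rule ("from" included, "to" excluded) can be checked against the value counts from the terms aggregation result above. This is a plain Python imitation of the bucketing, not Elasticsearch code, and `range_buckets` is an invented name.

```python
def range_buckets(values, ranges):
    """Count values per range; "from" is inclusive and "to" is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        n = sum(1 for v in values
                if (lo is None or v >= lo) and (hi is None or v < hi))
        buckets.append({"from": lo, "to": hi, "doc_count": n})
    return buckets

# 42 documents with familarity 3, two with 1, two with 2, one with 0.
values = [3] * 42 + [1] * 2 + [2] * 2 + [0]
print(range_buckets(values, [{"to": 3}, {"from": 3, "to": 4}]))
```

The counts come out as 5 and 42, matching the result above: the value 3 falls into the second bucket only.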

- Date range aggregation

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "range": {
        "field": "date",
        "format": "yyyy",
        "ranges": [
          {"to":2017},
          {"from":2017, "to":2018}
        ]
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "buckets": [
        {
          "key": "*-1970",
          "to": 2017,
          "to_as_string": "1970",
          "doc_count": 0
        },
        {
          "key": "1970-1970",
          "from": 2017,
          "from_as_string": "1970",
          "to": 2018,
          "to_as_string": "1970",
          "doc_count": 0
        }
      ]
    }
  }
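
The "1970" keys in this result suggest that the bare numbers 2017 and 2018 were read as milliseconds since the epoch rather than as years. The first part of the sketch below shows where such a reading lands. The second part is an untested guess at a fix, assuming the date_range aggregation with string bounds that match the "yyyy" format; treat it as an assumption, not as the request that produced the result above.

```python
from datetime import datetime, timezone

# Interpreting 2017 and 2018 as epoch milliseconds gives a moment in 1970.
for millis in (2017, 2018):
    print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc).year)   # 1970

# Possible alternative (assumption, not run against a cluster here).
date_range_body = {
    "size": 0,
    "aggs": {
        "result": {
            "date_range": {
                "field": "date",
                "format": "yyyy",
                "ranges": [{"to": "2017"}, {"from": "2017", "to": "2018"}],
            }
        }
    },
}
```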

- Filter-based aggregation


- combine query and aggregation

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "query": {
    "match": {
      "familarity": 0
    }
  },
  "aggs": {
    "result": {
      "terms": {
        "field": "familarity"
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 0,
          "doc_count": 1
        }
      ]
    }
  }

- combine filter and aggregation; the aggregation here is a sub-aggregation

{
    "size" : 0,
    "aggs": {  
        "homer_word_count" : {
            "filter" : { "term" : {"raw_character_text":"homer"}}, //filter before aggregation
            "aggs": {
                "avg_word_count" : {"avg" : {"field": "word_count"} }
            }
        }
    }
}

{
    "size" : 0,
    "aggs": {  
        "simpsons" : {
            "filters" : {
                "other_bucket" : true,
                "other_bucket_key" : "Non-Simpsons Cast",
                "filters" : {
                    "Homer" : { "match" : {"raw_character_text" : "homer"}},
                    "Lisa" : { "match" : {"raw_character_text" : "lisa"}}
                }
            }
        }
    }
}

{
    "query" : {
        "terms" : {"raw_character_text" : ["homer"]}
    },
    "size" : 0,
    "aggs" : {
        "SignificantWords" : {
            "significant_terms" : {"field": "spoken_words"}
        }
    }
}

The bucket aggregations can be nested within each other. This means that a bucket can contain other buckets within it.
For example, a country-wise bucket can include a state-wise bucket, which can further include a city-wise bucket.
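
A nested request like the country / state / city example could be generated as follows. The field names "country", "state" and "city" and the "by_" prefix are made up for illustration, and the helper only builds the request body with one terms aggregation per level.

```python
import json

def nested_terms(fields):
    """Nest one terms aggregation per field, outermost field first."""
    aggs = {}
    for field in reversed(fields):
        level = {"terms": {"field": field}}
        if aggs:
            level["aggs"] = aggs            # put the inner buckets inside this one
        aggs = {f"by_{field}": level}
    return {"size": 0, "aggs": aggs}

nested_body = nested_terms(["country", "state", "city"])
print(json.dumps(nested_body, indent=2))
```

In the response, each country bucket would then carry its own list of state buckets, and each state bucket its city buckets.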

- sort

{
    "query":{
        "match":{"text":"data analytics"}
    },
    "sort":[
        {"created_at":{"order":"asc"}},
        {"followers_count":{"order":"asc"}}
    ]
}



