Wednesday, November 8, 2017

Heap (data structure)

A heap is a specialized tree-based data structure that satisfies the heap property: if P is a parent node of C, then the key (the value) of P is either greater than or equal to (in a max heap) or less than or equal to (in a min heap) the key of C.
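For instance, Python's heapq module maintains a plain list as a min heap:

import heapq

h = []
for v in [3, 1, 4, 1, 5]:
    heapq.heappush(h, v)    # keeps the smallest key at h[0]
print(heapq.heappop(h))     # 1: the root of a min heap is the minimum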

find median from unsorted array of integers

For example [3,4,5,2,1], median is 3.
[3,4,5,2,1,6] median is (3+4)/2=3.5

The easiest way is to sort them and then calculate the median. This is easy to read but does unnecessary work, as you don't need to sort all the values.

From the array length, we know the index of the median.
Let's say the array is 3,9,8,1,0,4,5,2,7,6
Index starts from 0, leftM is index 4,  rightM is index 5. (They will be the same if the array length is odd.)
Using the quicksort algorithm, after partition, there are two partitions. One is less than the pivot and the other one is greater than the pivot.
Use the first number as pivot. After partition, the array is
1,0,2 |3| 9,8,4,5,7,6
the pivot index is 3.
Check the partition length to determine which partition contains leftM and rightM.
The right partition 9,8,4,5,7,6, contains the leftM and rightM
Calculate the new leftM and rightM on the partition.
As the pivot index is 3 in the original array, in the new partition we are looking for leftM = 4 - 3 - 1 = 0 and rightM = 5 - 3 - 1 = 1.
Now we need the first and second numbers after sorting the right partition.
Do the same steps on the new partition, pivot is 9
8,4,5,7,6 | 9
The left partition has size 5, greater than 2, so both numbers we are looking for are still in the left partition.
pivot is 8
4, 5, 7, 6 | 8
Still in left partition
pivot is 4
4 | 5, 7, 6, 8
Now the pivot 4 has landed at index 0, which is leftM, so 4 is one of the two median numbers we are looking for.
The other number is the smallest number in the right partition after sorting it.
Repeating the same steps, we find it is 5.

This uses the quicksort partition step (quickselect) to find the partition where the median exists and only recurses into that partition. The average time complexity is O(n); the worst case (consistently bad pivots) is O(n^2). Sorting the whole array instead costs O(n log n) on average, and O(n^2) in the worst case for quicksort.
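A minimal Python sketch of this quickselect idea; for clarity it picks a random pivot and builds new lists, instead of partitioning in place with the first element as in the walkthrough above:

import random

def quickselect(a, k):
    """Return the k-th smallest element (0-based) of a."""
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    if k < len(less):
        return quickselect(less, k)           # target is in the left partition
    if k < len(less) + len(equal):
        return pivot                          # the pivot itself is the target
    return quickselect(greater, k - len(less) - len(equal))  # recalculate the index, like leftM/rightM above

def median(a):
    n = len(a)
    if n % 2 == 1:
        return quickselect(a, n // 2)
    return (quickselect(a, n // 2 - 1) + quickselect(a, n // 2)) / 2

print(median([3, 4, 5, 2, 1]))     # 3
print(median([3, 4, 5, 2, 1, 6]))  # 3.5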

========================================

Another implementation here:
We separate the numbers into two partitions, left and right.
Every number in the left partition is smaller than every number in the right.
The size difference between left and right must be <= 1.
If right size - left size > 1, move the smallest right number to the left, where it becomes the biggest of the left.
If left size - right size > 1, move the biggest left number to the right, where it becomes the smallest of the right.
Eventually, when all numbers have been added to the partitions: if the two sizes are the same, median = (biggest left + smallest right) / 2.
If right size > left size, median = smallest right.
If left size > right size, median = biggest left.
Using the example above. We push the number one by one into the two partitions.
3,9,8,1,0,4,5,2,7,6
left [    ], right [    ]
the first number is 3, as both are empty, let's put it into right.
[   ] [3  ]
Then 9: compared with the biggest left (none) and the smallest right (3), it belongs in the right.
[   ] [3 9]
now right size - left size > 1, so we move the smallest right number to the left
[   3] [9   ]
Then 8: it is greater than the biggest left (3) and smaller than the smallest right (9), so we put it in the right.
[   3] [8, 9   ]
Then 1
[1, 3] [8, 9   ]
Then 0
[1,0,3] [8, 9   ]
Then 4: it is between the biggest left (3) and the smallest right (8), so put it in the right.
[1,0,3] [4,8,9]
Then 5
[1,0,3] [4,5,8,9]
Then 2
[1,0,2,3] [4,5,8,9]
Then 7
[1,0,2,3] [4,7,5,8,9]
Then 6
[1,0,2,3] [4,6,7,5,8,9]
now right size - left size > 1, so move the smallest right (4) to the left
We also need to find the new minimum among the remaining right numbers.
[1,0,2,3,4] [5,6,7,8,9]
Now all numbers are in the two partitions. Because the sizes are equal, median = (4+5)/2 = 4.5.
If left size > right size, the biggest left is the median.
If right size > left size, the smallest right is the median.

For time complexity, the worst case is that every added number forces a rebalance: pop an item from one partition, push it onto the other, and scan the partition for its new min/max.
Iterating the array: n
+
finding the min/max by scanning: the partitions grow toward n/2 items each (their size difference is at most 1), so the scans cost 2 + 3 + 4 + ... + n/2, which is O(n^2).
So with plain lists the worst case is O(n^2), not O(n). Keeping the left partition in a max heap and the right in a min heap makes each push/pop O(log n), which brings the total down to O(n log n).
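A sketch of this approach with Python's heapq, tying back to the heap at the top of this post; heapq only provides a min heap, so the left partition stores negated values to act as a max heap:

import heapq

def median_two_heaps(nums):
    left, right = [], []  # left: max heap via negated values; right: min heap
    for n in nums:
        if right and n > right[0]:
            heapq.heappush(right, n)
        else:
            heapq.heappush(left, -n)
        # rebalance so the size difference stays <= 1
        if len(right) - len(left) > 1:
            heapq.heappush(left, -heapq.heappop(right))
        elif len(left) - len(right) > 1:
            heapq.heappush(right, -heapq.heappop(left))
    if len(left) == len(right):
        return (-left[0] + right[0]) / 2   # (biggest left + smallest right) / 2
    return right[0] if len(right) > len(left) else -left[0]

print(median_two_heaps([3, 9, 8, 1, 0, 4, 5, 2, 7, 6]))  # 4.5

Because the partitions are maintained incrementally, this version also works when the numbers arrive one at a time.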

Monday, November 6, 2017

HATEOAS

HATEOAS basically means that when you trigger a request, the restful service returns not only the data but also the further actions (URLs) you can take.

For example, you trigger a request to get the account balance. The service returns the account balance plus a set of actions you can take, such as deposit, withdraw, and transfer.
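For instance, a HAL-style response might look like this (the field names and URLs are hypothetical):

{
  "accountNumber": "12345",
  "balance": 100.00,
  "_links": {
    "deposit":  { "href": "/accounts/12345/deposit" },
    "withdraw": { "href": "/accounts/12345/withdraw" },
    "transfer": { "href": "/accounts/12345/transfer" }
  }
}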

https://jozdoo.github.io/rest/2016/09/22/REST-HATEOAS.html
https://spring.io/understanding/HATEOAS

Saturday, November 4, 2017

RxJs do operator on Observable

https://angularfirebase.com/lessons/methods-for-debugging-an-angular-application/

https://www.youtube.com/watch?time_continue=9&v=gxixM90vo9Y, around 5:32


this.db.object('items/12345/title')
           .do(val => console.log('before map', val) )
           .map(title => title.$value.toUpperCase() )
           .do(val => console.log('after map', val) )
           .subscribe();  // the do/map operators only run once the observable is subscribed


Spring Boot supports cross-origin requests (CORS)

https://spring.io/guides/gs/rest-service-cors/

@CrossOrigin(origins = "http://localhost:9000")

This @CrossOrigin annotation enables cross-origin requests only for this specific method. By default, it allows all origins, all headers, and the HTTP methods specified in the @RequestMapping annotation, and a maxAge of 30 minutes is used. You can customize this behavior by specifying the value of one of the annotation attributes: origins, methods, allowedHeaders, exposedHeaders, allowCredentials or maxAge. In this example, we only allow http://localhost:9000 to send cross-origin requests.

https://vimsky.com/article/2252.html

Friday, October 27, 2017

markdown cheat sheet

newline break
two spaces at the end of a line produce a line break
a blank line (pressing Enter twice) starts a new paragraph


Wednesday, October 18, 2017

JPA bidirectional mapping JSON infinite loop issue

Employee and EmployeeTitle are parent and child classes.

I made an extreme example here: both fetches are eager. JPA itself can deal with this and does not create an infinite loop. The trouble comes when converting the Java object to JSON to pass it through the restful service: the serializer reports an infinite loop.
This article describes different ways to deal with the JSON infinite loop. Most of them ignore one side, on either the parent or the child object, to break the loop.
 http://www.baeldung.com/jackson-bidirectional-relationships-and-infinite-recursion

In my case I don't want to ignore any part.
The solution is to use @JsonIdentityInfo to resolve cyclic dependencies in an object graph by using an ID/reference mechanism so that an object instance is only completely serialized once and referenced by its ID elsewhere.

Employee is easy. Only need to add @JsonIdentityInfo( generator = ObjectIdGenerators.PropertyGenerator.class, property = "empNo")
EmployeeTitle sits on a legacy database which uses a composite primary key, but @JsonIdentityInfo seems not to support composite keys, so I had to serialize the composite key to a String. You can make it a JSON string; that also works.
Now if we get Employee object, the JSON is like this

if we get employeeTitle object, the JSON is like this

Nothing was ignored on either side.

Friday, September 15, 2017

Proxy Api Calls in Vue2

The big advantage is we don't have to worry about CORS
https://github.com/prograhammer/vue-example-project#proxy-api-calls-in-webpack-dev-server

npm install --save vue-resource
npm install --save-dev http-proxy-middleware

in build/index.js add
'/service': {
 target: 'http://services.groupkt.com',
 changeOrigin: true, 
 ws: true,   // proxy websockets
 pathRewrite: {
   '^/service/state/': '/state/get/'   // http://localhost:8080/service/state/ => http://services.groupkt.com/state/get/
 }, 
 router: {
 }   
}

If it is configured properly, you will see this when the node server is started.
[HPM] Proxy created: /service  ->  http://services.groupkt.com
[HPM] Proxy rewrite rule created: "^/service/state/" ~> "/state/get/"

in .js file
// send request to http://localhost:8080/service/state/USA/all will be eventually sent to http://services.groupkt.com/state/get/USA/all
this.$http.get('/service/state/USA/all', {})  
        .then(response => {
            this.states = response.body.RestResponse.result
        }, response => {
            console.log("error");
            console.log(response)
      });

https://github.com/hairinwind/myVue/blob/master/src/components/HttpProxy.vue

webpack and Jquery-ui

Here is the way to integrate jquery-ui into a webpack project

npm install jquery-ui-dist --save
webpack.config.js

resolve: {
    alias: {
        'jquery-ui': 'jquery-ui-dist/jquery-ui.js'
    }
},
...
plugins: [
    new webpack.ProvidePlugin({
        'jQuery': 'jquery',
        '$': 'jquery',
        'global.jQuery': 'jquery'
    })
],
In the .js file, load the module
require('jquery-ui');  //or import 'jquery-ui'
Reference
https://stackoverflow.com/questions/33998262/jquery-ui-and-webpack-how-to-manage-it-into-module
https://github.com/hairinwind/myVue/blob/master/build/webpack.base.conf.js
https://github.com/hairinwind/myVue/blob/master/src/components/JqueryUI.vue

Thursday, September 7, 2017

vue basic

The data are only reactive if they existed when the instance was created. That means if you add a new property, like: vm.b = 'hi' Then changes to b will not trigger any view updates.
https://vuejs.org/v2/guide/instance.html

computed:

var vm = new Vue({
  el: '#example',
  data: {
    message: 'Hello'
  },
  computed: {
    // a computed getter
    reversedMessage: function () {
      // `this` points to the vm instance
      return this.message.split('').reverse().join('')
    }
  }
})

Computed properties are cached based on their dependencies. A computed property will only re-evaluate when some of its dependencies have changed. In the example above, only when message is changed, the reversedMessage will be recalculated. The following computed property will never update, because Date.now() is not a reactive dependency
computed: {
  now: function () {
    return Date.now()
  }
}
https://vuejs.org/v2/guide/computed.html

List rendering

Due to limitations in JavaScript, Vue cannot detect the following changes to an array:
When you directly set an item with the index, e.g. vm.items[indexOfItem] = newValue
When you modify the length of the array, e.g. vm.items.length = newLength
To overcome caveat 1, both of the following will accomplish the same as vm.items[indexOfItem] = newValue, but will also trigger state updates in the reactivity system:
    // Vue.set
    Vue.set(example1.items, indexOfItem, newValue)
    // Array.prototype.splice
    example1.items.splice(indexOfItem, 1, newValue)
To deal with caveat 2, you can use splice:
    example1.items.splice(newLength)
https://vuejs.org/v2/guide/list.html

To register my component globally

in main.js
import Panel from './components/Panel'
Vue.component('panel', Panel)

Programmatic Navigation

this.$router.push({ name: 'Hello' })
https://router.vuejs.org/en/essentials/navigation.html

build and deploy PR env

Open Chrome's developer tools; several errors show in the console. It turns out the bundled files are 404 NOT FOUND.
############# Solution ###############
Under the build directory, find the webpack.prod.conf.js file, locate the output node, and add publicPath: './'
npm run build
Then copy index.html and the static directory from dist to the web server.
https://zmis.me/detail_1001

Reference

https://scotch.io/tutorials/getting-started-with-vue-router

Comparison with Other Frameworks
https://vuejs.org/v2/guide/comparison.html

example project
https://github.com/prograhammer/vue-example-project#configure-eslint

http://vuetips.com/

vuex
https://alligator.io/vuejs/intro-to-vuex/



Friday, August 25, 2017

python generator and yield

Generators are best for calculating large sets of results (particularly calculations involving loops themselves) where you don’t want to allocate the memory for all results at the same time.

To understand generator and yield, we can debug the code below.
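The original block did not survive; here is a sketch consistent with the line numbers referenced below:

def fibon(n):                # line 1
    a, b = 1, 1              # line 2
    count = 0                # line 3
    while count < n:         # line 4
        yield a              # line 5: suspend and hand a back to the caller
        a, b = b, a + b      # line 6: execution resumes here on the next request
        count += 1           # line 7
                             # line 8
for x in fibon(10):          # line 9: ask the generator for the next value
    print(x)                 # line 10: consume the value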


Put breakpoints at lines 5, 6 and 10.
  1. When the code runs, line 9 calls fibon(10) to get x.
  2. The fibon function executes up to line 5, yield a; this suspends fibon and returns to the main process, where x gets the value of a.
  3. Then line 10 runs print(x); we could have code here that consumes significant system resources.
  4. Once it is done, execution goes back to line 9, for x in fibon(10), asking the fibon function for the next value.
  5. Execution resumes at line 6, the place where it was suspended, runs the calculation, continues the loop through lines 4 and 5, and yields the next value to the main process at line 10 again.
Send method
to send a value into the generator
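A minimal sketch of send (the original example is missing):

def accumulator():
    total = 0
    while True:
        value = yield total      # receives whatever the caller passes to send()
        total += value

acc = accumulator()
next(acc)            # prime the generator: run to the first yield
print(acc.send(10))  # 10
print(acc.send(5))   # 15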



http://book.pythontips.com/en/latest/generators.html
http://kissg.me/2016/04/09/python-generator-yield/

Thursday, August 24, 2017

python list comprehensions - Single Line For Loops

List comprehensions provide a concise way to create lists.
[thing for thing in list_of_things]
[thing for thing in list_of_things if thing.xyz...]

for example
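A simple sketch (the original snippet is missing):

squares = [x * x for x in range(5)]            # [0, 1, 4, 9, 16]
evens = [x for x in range(10) if x % 2 == 0]   # [0, 2, 4, 6, 8]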

It can also be used to build a dict, inside {}.

iterate dict
for k, v in d.items()
A dict comprehension can filter while iterating; in the sketch below, x ends up as {1: 'a'}.
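# a sketch, assuming d = {1: 'a', 2: 'b'} (the original snippet is missing)
d = {1: 'a', 2: 'b'}
x = {k: v for k, v in d.items() if v == 'a'}
print(x)  # {1: 'a'}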

set comprehensions
They are also similar to list comprehensions. The only difference is that they use braces {}. Here is an example:
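a_set = {x * x for x in [1, -1, 2]}
print(a_set)  # {1, 4}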

http://blog.teamtreehouse.com/python-single-line-loops

https://docs.python.org/3.6/tutorial/datastructures.html search List Comprehensions


Wednesday, August 23, 2017

python unit test

I am using pytest
https://docs.pytest.org/en/latest/

python -m pytest
py.test -v
py.test -s # which allows print


Be careful of the argument order: the mock arguments arrive in the reverse order of the stacked @mock.patch decorators.
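A minimal sketch, patching two real os functions just for illustration:

from unittest import mock
import os

@mock.patch('os.listdir')    # outer decorator -> second argument
@mock.patch('os.getcwd')     # decorator closest to the function -> first argument
def test_patch_order(mock_getcwd, mock_listdir):
    mock_getcwd.return_value = '/tmp'      # return_value sets what the mocked call returns
    mock_listdir.return_value = ['a.txt']
    assert os.getcwd() == '/tmp'
    assert os.listdir('.') == ['a.txt']

test_patch_order()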

mock is a Python out-of-the-box feature (unittest.mock).
https://docs.python.org/3/library/unittest.mock.html

mock.return_value is to set the return value when the mock method is called.

assert_called
assert_called_once
assert_called_with
assert_called_once_with
assert_any_call

side_effect to raise exception
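For example:

from unittest import mock

mocked = mock.Mock()
mocked.side_effect = ValueError("boom")
# calling mocked() now raises ValueError("boom")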

For unit test coverage, run this command
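The original command is lost; with pytest-cov it is presumably something like:
py.test --cov=mypackage --cov-report=html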
the coverage report is under the folder htmlcov.
http://www.robinandeer.com/blog/2016/06/22/how-i-test-my-code-part-3/
https://pypi.python.org/pypi/pytest-cov

To find the unused code, run "vulture targetFile or dir"

To run the pytest in eclipse





Wednesday, August 16, 2017

Common Fork Join Pool and Streams

https://dzone.com/articles/common-fork-join-pool-and-streams

how do you make the parallel streams use their own Fork Join Pools instead of sharing the common pool? Well, you need to create your own ForkJoinPool object and use this pool to contain the stream code.






Monday, July 31, 2017

python numpy

reshape

change the dimension of the array
rangeArray = np.arange(6,12) # [6,7,8,9,10,11]
rangeArray = rangeArray.reshape((2,3)) # [[6,7,8],[9,10,11]]

 

Monday, July 10, 2017

python function arguments

Python function parameters look like this; presumably the function was defined as in the sketch below.
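# hypothetical reconstruction that matches the calls described below
def newfoo(normal_arg1, normal_arg2, *nonkeywords, **keywords):
    print(normal_arg1, normal_arg2, nonkeywords, keywords)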

*nonkeywords is a tuple of the extra positional arguments
**keywords is a dict of the extra named arguments
You can call the method like newfoo(2, 4, *(6, 8), **{'foo': 10, 'bar': 12})
In this case, normal_arg1 = 2, normal_arg2 = 4, nonkeywords is (6,8), keywords is {'foo': 10, 'bar': 12}
newfoo(1)
    gets an error (normal_arg2 is missing)
newfoo(1,2 )
    normal_arg1 =1, normal_arg2 = 2
newfoo(*(1,2))
    normal_arg1 =1, normal_arg2 = 2
newfoo(1, *(2,))
    normal_arg1 =1, normal_arg2 = 2
newfoo(1, *(2,3))
    normal_arg1 =1, normal_arg2 = 2, nonkeywords: (3,)
newfoo(1, 2, x=3)
    normal_arg1 =1, normal_arg2 = 2, keywords: {'x':3}

For named arguments, the sequence can be changed.
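Assuming a definition like:

def test1(x, y):
    print(x, y)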
test1(100,200) is equivalent to test1(x=100,y=200) and is equivalent to test1(y=200, x=100)
test1(100, y=200) is also allowed
but test1(100, x=200) is not allowed

For mixed arguments, parameters with default values must come after those without defaults, so you cannot put the default parameter first.
Change it to put the default parameter last, presumably something like the sketch below.
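# invalid: a default parameter before non-default ones is a SyntaxError
# def test2(x=3, y, z):
#     ...

# valid: the default parameter comes last
def test2(y, z, x=3):
    print(y, z, x)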

You can call test2(1,2,3) which is equivalent to test2(x=3, *(1,2))

To print all input arguments
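One way (a sketch), capturing everything with * and **:

def print_all_args(*args, **kwargs):
    print(args, kwargs)

print_all_args(1, 2, x=3)  # (1, 2) {'x': 3}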

Thursday, June 15, 2017

python pandas practice

iternate dataframe

for index, row in prediction_df.iterrows():
        print(row['columnName'])

rename dataframe column name

df=df.rename(columns = {'two':'new_name'})

convert a list of string to dataframe

the first item in the list is the column names
pd.DataFrame(data[1:], columns=data[0])

dataFrame get all columns

list(df) returns all columns except the 'index'
to include the index, one way is
columns = list(df)
columns.append(df.index.name)

get values from next row

df['nextDayClose'] = df['Close'].shift(-1)

rolling window

Here is the example to calculate the 7 days average.
df['7_mean'] = df['Close'].rolling(window=7).mean()

dataframe remove rows containing NAN

dat.dropna(how='any') #to drop if any value in the row has a nan
dat.dropna(how='all') #to drop if all values in the row are nan

get last value of one column

df['columnName'].values[-1]

dataframe change type to numeric

df = df.apply(pd.to_numeric, errors='ignore')

dataframe get one column

df['colName']
* putting text in the square brackets means manipulate columns
* putting number in the square brackets means manipulate rows

dataframe get multiple columns

df[['colName1', 'colName2', 'colName3']]

dataframe get first row

df[:1]
the API is slicing: df[start:end:step]; df[:1] equals df[0:1], which returns the first row of the dataframe

dataframe get last row

df[-1:]

dataframe get all rows except first/last

df[1:]
df[:-1]

dataframe reverse the order

df[::-1]
to get rows except last and reverse the order, this one does not work, df[:-1:-1]
it has to be df[:-1][::-1]

dataframe sort by columns

df.sort_values('col2')

dataframe keep rows have null/nan values

df = df[df.isnull().any(axis=1)]

readcsv and explicitly specify column type

quotes = pd.read_csv(csv, dtype={'Symbol': 'str'})
DataFrame.from_csv does not have this feature. It will convert "INF" to the number infinity.

dataframe drop duplicates

quotes.drop_duplicates(subset=['Date'], keep='last', inplace=True)

two dataframes merge like inner join

merged = pd.merge(DataFrameA,DataFrameB, on=['Code','Date'])

dataframe only contain rows have null values

try to understand this
df[pd.isnull(df).any(axis=1)]
https://stackoverflow.com/questions/14247586/python-pandas-how-to-select-rows-with-one-or-more-nulls-from-a-dataframe-without

dataframe locate a cell

df.loc[0,'Close'] return the value of first row and column 'Close'
to set the value, df.loc[0,'Close']=123

dataframe locate by value

df.loc[df['column_name'] == some_value]

dataframe locate by lambda

quotes_df.loc[lambda df: pd.isnull(df.Open) | pd.isnull(df.Volume) | df.Open == 0,:]
https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-callable

dataframe locate with or condition

new_df = prediction_df.loc[(prediction_df['result'] != 0) | (prediction_df['prediction'] != 0)]
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~
df.loc[~df['column_name'].isin(some_values)]

replace null with Nan

df = df.replace('null', np.nan)
then we can do dropna to delete the rows
df.dropna(how='any', subset=['Open', 'High', 'Low', 'Close', 'Volume'], inplace=True)

dateframe group by

group by one column value and returns multiple dataframes
dateSymbol_df is
          Date Symbol
    0   2017-08-10   SFBC
    1   2017-08-10   SVBI
    2   2017-08-09  SENEB
    ...
for date, symbol_df in dateSymbol_df.groupby('Date'):
    dateSymbols[date] = list(symbol_df['Symbol'])
the dateSymbols is
{
    '2017-08-10': ['SFBC','SVBI'],
    '2017-08-09': ['SENEB'],
    ...
}
https://stackoverflow.com/questions/40498463/python-splitting-dataframe-into-multiple-dataframes-based-on-column-values-and

iterate between two dates

for i in pd.date_range(start, end):
    print(i)

print more columns

pd.set_option('display.expand_frame_repr', False)
https://stackoverflow.com/questions/11707586/python-pandas-how-to-widen-output-display-to-see-more-columns

https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python#gs.hEu=LKY

Friday, June 9, 2017

python practice

check if dict has key

if key in dict

python 3.6 install package

since 3.4, pip is bundled with Python and run through the interpreter; to install a package, run
python -m pip install SomePackage

functools.partial

returns a new function with some of the parameters already provided
https://docs.python.org/2/howto/functional.html 
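A quick sketch:

from functools import partial

def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)  # exponent is pre-filled
print(square(5))  # 25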

python process pool (multiprocessing.Pool uses processes, not threads)

csvFiles = list(filter(isFile, os.listdir("d:/quotes")))
with  multiprocessing.Pool(multiprocessing.cpu_count() - 1) as p:
    p.map(loadCsv, csvFiles)
Here loadCsv is the function, csvFiles is the list to iterate
https://docs.python.org/3.6/library/multiprocessing.html

pickle

with open('filename', 'ab') as f:  # pickle writes bytes, so the file must be opened in binary mode
    pickle.dump(data, f)

with open('filename', 'rb') as f:
    x = pickle.load(f)

python print current exception

import traceback
traceback.print_exc()

zip two arrays into tuples

let's say you have two arrays [-1, 0, 1, 2] and [7,8,9,10]
you want to merge them into tuples like this [(-1, 7), (0, 8), (1, 9), (2, 10)]
list(zip(array1, array2))

array filter

result = [a for a in A if a not in subset_of_A]

python function argument

http://detailfocused.blogspot.ca/2017/07/python-function-parameters.html

define function inside function

def method_a(arg):

    def method_b(inner_arg):
        # hypothetical body; the original snippet just returned some_data
        return inner_arg * 2

    return method_b(arg)

python class and inheritance

http://www.cnblogs.com/feeland/p/4419121.html
https://docs.python.org/3/tutorial/classes.html

Generator and yield

http://detailfocused.blogspot.ca/2017/08/python-generator-and-yield.html

Set

Intersection - to find the same items
Difference - to find the different items
create sets using the new notation
a_set = {'red', 'blue', 'green'}
print(type(a_set))
# Output: type 'set'
>>> a = {1,1,2,3}
>>> a
{1, 2, 3}
http://book.pythontips.com/en/latest/set_-_data_structure.html

Decorators

A decorator adds a wrapper around the target function; it is like the interceptor in the Spring framework.
When using pymongo to query MongoDB, it can return a Cursor, or a list if the result is sorted. This inconsistency is quite annoying to the client code.
I added the decorator "wrapReturnToList" to make every return value a list.
from functools import wraps

def wrapReturnToList(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        return list(f(*args, **kwargs))
    return decorated

@wrapReturnToList
def getAllActiveSymbols():
    return stockDb.symbol.find({ "Status": { "$ne": SYMBOL_INACTIVE } })
Without the decorator, it returns a Cursor. With the decorator, it returns a list.
http://book.pythontips.com/en/latest/decorators.html

zipfile

    zf = zipfile.ZipFile(dir + zipFileName, mode='a')
    try:
        zf.write(toZipFileFullPath, os.path.basename(toZipFileFullPath))
    finally:
        zf.close()

os.path.basename(toZipFileFullPath) extracts the file name from the full path, so that the directory structure is not zipped into the zip file.

defaultdict

With defaultdict you do not need to check whether a key is present or not.
from collections import defaultdict

colours = (
    ('Yasoob', 'Yellow'),
    ('Ali', 'Blue'),
    ('Arham', 'Green'),
    ('Ali', 'Black'),
    ('Yasoob', 'Red'),
    ('Ahmed', 'Silver'),
)

favourite_colours = defaultdict(list)

for name, colour in colours:
    favourite_colours[name].append(colour) # a plain dict would raise KeyError for a new name; defaultdict creates the list automatically

print(favourite_colours)

# output
# defaultdict(<class 'list'>,
#    {'Arham': ['Green'],
#     'Yasoob': ['Yellow', 'Red'],
#     'Ahmed': ['Silver'],
#     'Ali': ['Blue', 'Black']
# })
http://book.pythontips.com/en/latest/collections.html#defaultdict

deque

Double-ended queue; items can be popped or appended at either end.
http://book.pythontips.com/en/latest/collections.html#deque




Wednesday, June 7, 2017

ember.js practice

- install ember-cli
npm install -g ember-cli

- create project
ember new hello-world

- start server
ember server

- create controller
ember g controller application

- create route
ember g route about
ember g route parent
ember g route parent/child

- create component
ember g component x-counter

- script on page
{{#unless showName}}
 <h2 id="title">Hello {{name}}</h2>
{{else}}
 <h2 id="title">Hello Ember</h2>
{{/unless}}

<button {{action 'toggleName'}}>Toggle Name</button>

<br/>

{{numClicks}} Click

<button {{action 'incrementClicks'}}>Increment</button>

Monday, June 5, 2017

spring injects into static class

Usually a static class should not depend on a specific object, but in some cases we want Spring to inject an env-specific instance into a static class.
MethodInvokingFactoryBean can help with this. Here is one example

If the method has multiple arguments, here is the example

This XML setting lets Spring call a specific setter method on an object or (static) class to inject an object from the Spring container. The setting can be repeated if multiple objects need to be injected.

Friday, May 26, 2017

elasticsearch distinct value and search

In elasticsearch, to get the distinct values of one field, use a terms aggregation.

For example, here are the documents 
{
    "sourceTitle": "Arrival",
    "otherFields": ...
}
{
    "sourceTitle": "Arrival",
    "otherFields": ...
}
{
    "sourceTitle": "Eye in the Sky",
    "otherFields": ...
}
We want to get the result
    Arrival
    Eye in the sky
Here is the query to get the distinct values.
{
  "size": 0,
  "aggs": {
    "sourceTitle": {
      "terms": {
        "field": "sourceTitle",
        "size": 10
      }
    }
  }
}

But by default, elasticsearch tokenizes the field when indexing the data. To avoid that, we need to make the field type "keyword". Then another problem comes up: a search on a "keyword" field has to match fully, so searching "Eye" on sourceTitle won't return anything. How can we support getting distinct values and searching by partial text at the same time?

We can set the field type to text and give it a child field "keyword" whose type is "keyword".
"sourceTitle": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}

When getting distinct values, we run the aggs on sourceTitle.keyword.
{
  "size": 0,
  "aggs": {
    "sourceTitle": {
      "terms": {
        "field": "sourceTitle.keyword",
        "size": 10
      }
    }
  }
}

https://discuss.elastic.co/t/distinct-value-and-search/86838/2

Wednesday, May 24, 2017

elasticsearch query

Full-text search queries

The most important queries in this category are the following:
- match_all
- match
- match_phrase
- multi_match
- query_string

- match_all returns all documents, same result as {}

GET words_v1/userFamilarity/_search
{
  "query": {
    "match_all": {
    }
  }
}

- match

{
    "query" : {
        "match": {
            "title":"abc"
        }
    }
}
matches in the same order are returned first
GET /my_test/words/_search
{
  "query": {
    "match": {
      "english" : "\"This is all of it\""
    }
  }
}
it will return "This is all of it" as first document.
then "This is all right"
then "it is all right"

- match_phrase exact match

{
    "query" : {
        "match_phrase" : {
            "spoken_words":"makes me laugh"
        }
    }
}

- multi_match

match on multiple fields: the result must have the query words in either field, "spoken_words" or "raw_character_text". If both fields match, the result gets a higher score.
{
    "query" : {
        "multi_match" : {
            "query":"homer simpson",
            "fields": ["spoken_words", "raw_character_text"]
        }
    }
}
boost the result. Here "raw_character_text" is boosted by a factor of 8.
{
    "query" : {
        "multi_match" : {
            "query":"homer simpson",
            "fields": ["spoken_words", "raw_character_text^8"]
        }
    }
}
multi_match with the "and" operator
GET /my_test/words/_search
{
  "query": {
    "multi_match": {
      "query" : "eye sky",
      "fields" : ["english", "sourceTitle"],
      "operator" : "and"
    }
  }
}

- query_string


- AND - OR

operator: return the documents that contain both "all" and "special"
GET /my_test/words/_search
{
  "query": {
    "match": {
      "english": {
        "query" : "all special",
        "operator": "and"
      }
    }
  }
}

-wildcard

note: wildcard can consume a lot of memory and time...
(shown here as a query_string query; bare "fields"/"query" at the top level is not valid)
{
    "query" : {
        "query_string" : {
            "fields" : ["spoken_words"],
            "query" : "fri*"
        }
    }
}

fuzzy match, even misspelling

{
    "query" : {
        "query_string" : {
            "fields" : ["spoken_words"],
            "query" : "dnout~"
        }
    }
}
The fuzzy distance factor can be lowered to increase performance; the default distance is 2.
{
    "query" : {
        "query_string" : {
            "fields" : ["spoken_words"],
            "query" : "dnout~1"
        }
    }
}

Term-based search queries

The most important queries in this category are the following:
- Term query
- Terms query
- Range query
- Exists query / Missing query

- Term query

The term query does exact term matching in a given field, so you need to provide the exact term to get the correct results. For example, if you have used a lowercase filter while
indexing, you need to pass the terms in lowercase while querying with the term query.
Another example: house after stemmer, it is "hous". The match query with parameter "hous" cannot return anything.
The term query with parameter "hous" can return the documents containing "house"
GET /words_v1/words/_search
{
  "query": {
    "term" : {
      "english" : "hous"
    }
  }
}
GET /words_v1/words/_search
{
  "query": {
    "match" : {
      "english" : "hous"
    }
  }
}

- Terms query


- Range query

exists query and missing query: get the documents which have, or do not have, a value in one field
{
    "query": {
        "exists" : { "field" : "user" }
    }
}
The missing query can be done by combining must_not and exists query
{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "user"
                }
            }
        }
    }
}

compound query

Compound queries are offered to connect multiple simple queries together to make your search better.
- bool query
- not query
- Function score query

-bool query

{
    "query":{
        "bool":{
            "must":[{}],
            "should":[{}],
            "must_not":[{}]
            "filter":[{}]  //A query wrapped inside this clause must appear in the matching documents. However, this does not contribute to scoring.
        }
    }
}

{
    "query" : {
        "bool": {
            "must": {"match": {"title":"homer"}},
            "must_not": {"range": {"imdb_rating":{"gt": 8}}}
        }
    }
}

{
    "query" : {
        "bool": {
            "must": [
                {"match": {"title":"homer"}},
                {"range": {"imdb_rating":{"gt": 4, "lt":8}}}
            ]
        }
    }
}
Change "must" to "filter". Elastic search will do the filter first, then do the title match.
{
    "query" : {
        "bool": {
            "must": {"match": {"title":"homer"}},
            "filter": {"range": {"imdb_rating":{"gt": 4, "lt":8}}}
        }
    }
}

--------------------
the query json object is: query => query type (such as match, term, multi_match, range...) => field name => more settings the query needs

Queries were used to find out how relevant a document was to a particular query by calculating a score for each document, whereas filters were used to match certain criteria. In the query context, put the queries that ask the questions about document relevance and score calculations, while in the filter context, put the queries that need to match a simple yes/no question.


* query on date field, e.g find documents create after 2017-Feb-01
* constant_score: A query that wraps another query and simply returns a constant score equal to the query boost for every document in the filter.
* because of performance considerations; do not use sorting on analyzed fields.
---------------------

aggregation

4 types: pipeline, metrics, bucket, and matrix aggregations.
- Metrics are used to do statistics calculations, such as min, max, average, on a field of a document that falls into a certain criteria.
{
    "aggs": {  //declare we are doing an aggregation
        "avg_word_count" : { //the field name in the result
            "avg" : {  // the aggregation function; could also be max, min...
                "field" : "word_count"
            }
        }
    }
}
The structure is like this
{
    "aggs": {
        "aggaregation_name": {
            "aggrigation_type": {
                "field": "name_of_the_field"
            }
        }
    }
}
size
{
    "size" : 0, //without this, the result displays the original documents first, then the aggregation result
    "aggs": {
        "avg_word_count" : { //the field name in the result
            "avg" : {  // the aggregation function
                "field" : "word_count"
            }
        }
    }
}

- extended_stats

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "extended_stats": {
        "field": "familarity"
      }
    }
  }
}
Here is the result
"aggregations": {
    "result": {
      "count": 47,
      "min": 0,
      "max": 3,
      "avg": 2.8085106382978724,
      "sum": 132,
      "sum_of_squares": 388,
      "variance": 0.3675871435038475,
      "std_deviation": 0.6062896531393617,
      "std_deviation_bounds": {
        "upper": 4.0210899445765955,
        "lower": 1.595931332019149
      }
    }
  }

- cardinality

The count of a distinct value of a field can be calculated using the cardinality aggregation.
{
    "size" : 0,
    "aggs": {
        "speaking_line_count" : {
            "cardinality" : {
                "field" : "raw_character_text"
            }
        }
    }
}

- percentile

{
    "size" : 0,
    "aggs": {
        "word_count_percentiles" : {
            "percentiles" : {
                "field" : "word_count"
            }
        }
    }
}


fielddata is disabled on text fields by default; to aggregate on a text field, enable fielddata in the mapping:
PUT /myIndex/myType/_mapping/script    //script is type name
{
    "properties" : {
        "raw_character_text" : {
            "type" : "text",
            "fielddata" : true
        }
    }
}

bucket

document categorization based on some criteria, like group by in sql

- Terms aggregation, count group by term

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "terms": {
        "field": "familarity"
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 3,
          "doc_count": 42
        },
        {
          "key": 1,
          "doc_count": 2
        },
        {
          "key": 2,
          "doc_count": 2
        },
        {
          "key": 0,
          "doc_count": 1
        }
      ]
    }
  }

- Range aggregation

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "range": {
        "field": "familarity",
        "ranges": [
          {"to":3},  //3 is excluded
          {"from":3, "to":4}
        ]
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "buckets": [
        {
          "key": "*-3.0",
          "to": 3,
          "doc_count": 5
        },
        {
          "key": "3.0-4.0",
          "from": 3,
          "to": 4,
          "doc_count": 42
        }
      ]
    }
  }

- Date range aggregation

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "range": {
        "field": "date",
        "format": "yyyy",
        "ranges": [
          {"to":2017},
          {"from":2017, "to":2018}
        ]
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "buckets": [
        {
          "key": "*-1970",
          "to": 2017,
          "to_as_string": "1970",
          "doc_count": 0
        },
        {
          "key": "1970-1970",
          "from": 2017,
          "from_as_string": "1970",
          "to": 2018,
          "to_as_string": "1970",
          "doc_count": 0
        }
      ]
    }
  }

- Filter-based aggregation


- combine query and aggregation

GET /words_v1/userFamilarity/_search
{
  "size": 0,
  "query": {
    "match": {
      "familarity": 0
    }
  },
  "aggs": {
    "result": {
      "terms": {
        "field": "familarity"
      }
    }
  }
}
Result
"aggregations": {
    "result": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 0,
          "doc_count": 1
        }
      ]
    }
  }

- combine filter and aggregation, the aggregation here is sub-aggregation

{
    "size" : 0,
    "aggs": {  
        "homer_word_count" : {
            "filter" : { "term" : {"raw_character_text":"homer"}}, //filter before aggregation
            "aggs": {
                "avg_word_count" : {"avg" : {"field", "word_count"} }
            }
        }
    }
}

{
    "size" : 0,
    "aggs": {
        "simpsons" : {
            "filters" : {
                "other_bucket" : true,
                "other_bucket_key" : "Non-Simpsons Cast",
                "filters" : {
                    "Homer" : { "match" : {"raw_character_text" : "homer"}},
                    "Lisa" : { "match" : {"raw_character_text" : "lisa"}}
                }
            }
        }
    }
}

{
    "query" : {
        "terms" : {"raw_character_text" : ["homer"]}
    },
    "size" : 0,
    "aggregation" : {
        "SignificatnWords" : {
            "significant_terms" : {"field": "spoken_words"}
        }
    }
}

The bucket aggregations can be nested within each other. This means that a bucket can contain other buckets within it.
For example, a country-wise bucket can include a state-wise bucket, which can further include a city-wise bucket.

- sort

{
    "query":{
        "match":{"text":"data analytics"}
    },
    "sort":[
        {"created_at":{"order":"asc"}},
        {"followers_count":{"order":"asc"}}
    ]
}




Thursday, May 18, 2017

elasticsearch update_by_query

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#picking-up-a-new-property

This is helpful when you want to update a bunch of documents.

This works for me

The inline script did not work for me.
I always got this error. (Note the script in the error says ctx._soruce, a typo of ctx._source, which is what causes the null_pointer_exception.)
{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "ctx._soruce.date=\"2017-05-18T05:35:23.103Z\";",
          "           ^---- HERE"
        ],
        "script": "ctx._soruce.date=\"2017-05-18T05:35:23.103Z\";",
        "lang": "painless"
      }
    ],
    "type": "script_exception",
    "reason": "runtime error",
    "caused_by": {
      "type": "null_pointer_exception",
      "reason": null
    },
    "script_stack": [
      "ctx._soruce.date=\"2017-05-18T05:35:23.103Z\";",
      "           ^---- HERE"
    ],
    "script": "ctx._soruce.date=\"2017-05-18T05:35:23.103Z\";",
    "lang": "painless"
  },
  "status": 500
}
For a single document update, the inline script works.

Tuesday, May 16, 2017

JVM verbose log

For example, if you want to see the verbose JVM log for the SSL handshake, add a JVM parameter (pick one; specifying -Djavax.net.debug twice is redundant):

-Djavax.net.debug=all
-Djavax.net.debug=ssl:handshake:verbose


Friday, May 12, 2017

express: request entity too large

When posting large text, you may get the error "request entity too large"; here is the code to fix that.


Monday, May 1, 2017

webpack practice

webpack enable sourcemaps




-- to be continued --

Thursday, March 30, 2017

passportjs, nodejs auth

The documentation of passportjs is not good; I followed it and it did not make my code work.
After reading a couple of tutorials online, I eventually made it work.
I abstracted all passport-related settings into one file, auth.js.

Here is the change in the server.js, basically it is adding authentication to the URL you want to protect.

Here are some helpful online tutorial
https://www.danielgynn.com/node-auth-part2/
https://blog.risingstack.com/node-hero-node-js-authentication-passport-js/
http://passportjs.org/docs/facebook
https://github.com/jaredhanson/passport-facebook#re-authentication


Monday, March 27, 2017

Javascript asynchronous and promise

JavaScript now uses lots of asynchronous calls. In some scenarios you need a chain of asynchronous calls, one after another; Promises can help with such a chain.

Here is the requirement. 
The UI displays a task list. Each task has a priority, which determines the display order on the UI.
There are up and down arrows on the right of each task. Clicking the up arrow exchanges the priority with the task above.

The code here is nodejs and mongoose. 
Here is the pseudo code. 
//find current task by taskId
//find the previous task based on the current task priority
//exchange the priority
//save the tasks (update first task, once it is done, in the callback update the second task)
There are at least 4 nested callback functions, which makes the code hard to read.
With Promises, the code can be refactored like this; it is much neater and easier to read.
For the code "Task.findById(id)", Mongoose returns a promise.
Once the record is returned, the "then" method is executed. It calls the function "findThisAndPreviousTaskByPriority", which returns two records in an array.
That function returns a promise as well; once it resolves, it triggers the next "then", which is exchangePriority. It also returns an array containing the two tasks, but with their priorities already exchanged. "updateTwoTasksPriority" creates two Mongoose update promises based on the two tasks. The last call uses "Promise.all" to convert the iterable of promises into one single promise.

Here are some good articles about javascript promise
https://davidwalsh.name/promises
http://stackoverflow.com/questions/39028882/chaining-async-method-calls-javascript
https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise
https://www.youtube.com/watch?v=s6SH72uAn3Q
https://60devs.com/best-practices-for-using-promises-in-js.html
http://mongoosejs.com/docs/promises.html
https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise/all
http://stackoverflow.com/questions/22519784/how-do-i-convert-an-existing-callback-api-to-promises