It is a highly scalable search engine
It is a database companion (not a database)
It can be hosted with GCloud/AWS
It is a flexible solution for a difficult problem -
how do we quickly and efficiently search huge amounts of data?
It handles all of the indexing and searching; we just feed it the data to index
We can protect it and scale it up using GCloud/AWS
All of the heavy lifting is done elsewhere - no load on our own servers
It supports autocomplete and term highlighting out of the box
Your index and queries can be as simple or as complicated as you want them to be
Elasticsearch is built on Apache Lucene and supports the Lucene query syntax
It exposes HTTP endpoints for you to GET, POST, PUT and DELETE
As soon as you've created a new Elasticsearch instance, you can start publishing documents
Alternatively, you can define your indexes up front
Which means you can customise how it analyses and indexes your data (see the sketch after this list, which also shows autocomplete and highlighting)
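To make the last few points concrete, here is a minimal sketch using the official elasticsearch-py client against a local cluster - the index name, the edge n-gram analyzer and the field names are illustrative assumptions, not part of this setup. It defines a custom autocomplete analyzer when the index is created, then runs a search with term highlighting:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Illustrative index: a custom edge n-gram analyzer gives autocomplete-style
# matching on the 'title' field
es.indices.create(index='articles', body={
    'settings': {
        'analysis': {
            'filter': {
                'autocomplete_filter': {
                    'type': 'edge_ngram', 'min_gram': 2, 'max_gram': 10
                }
            },
            'analyzer': {
                'autocomplete': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'autocomplete_filter']
                }
            }
        }
    },
    'mappings': {
        'properties': {
            'title': {
                'type': 'text',
                'analyzer': 'autocomplete',
                'search_analyzer': 'standard'
            }
        }
    }
})

es.index(index='articles', id=1, body={'title': 'Elasticsearch in production'})
es.indices.refresh(index='articles')

# Partial input matches thanks to the n-grams; highlighted fragments come
# back per hit under 'highlight'
res = es.search(index='articles', body={
    'query': {'match': {'title': 'ela'}},
    'highlight': {'fields': {'title': {}}}
})
for hit in res['hits']['hits']:
    print(hit['highlight']['title'])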
This analogy treats an “index” as similar to a “database” in an SQL database,
and a “type” as equivalent to a “table”.
MySQL/RDBMS | Elasticsearch
----------- | -----------------
Database    | Index
Table       | Type (deprecated)
Row         | Document
This was a bad analogy that led to incorrect assumptions.
In an SQL database, tables are independent of each other.
The columns in one table have no bearing on columns with the same name in another table.
This is not the case for fields in a mapping type: fields with the same name in different types of the same index are backed by the same Lucene field, so their definitions must be compatible.
Each Elasticsearch shard is an Apache Lucene index,
with each individual Lucene index containing a subset of the documents in the Elasticsearch index
Having the right number of shards is important for performance.
It is thus wise to plan in advance.
Queries run across multiple shards in parallel execute faster than the same query on a single-shard index,
but only if each shard is located on a different node and there are sufficient nodes in the cluster.
When creating an index, you can set the number of shards and replicas as properties of the index
When an index is created, the number of shards is set,
and this cannot be changed later without reindexing the data.
PUT /some_index
{
  "settings": {
    "index": {
      "number_of_shards": 6,
      "number_of_replicas": 2
    }
  }
}
Python
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with explicit shard and replica settings
es.indices.create(index='test-index', body={
    'settings': {
        'index': {'number_of_shards': 3, 'number_of_replicas': 2}
    }
})

my_docs = [
    {'Id': 1, 'FirstName': 'Kemaru', 'LastName': 'Young', 'timestamp': datetime.now()},
    {'Id': 2, 'FirstName': 'Jane', 'LastName': 'Doe', 'timestamp': datetime.now()}
]

# Index each document under its own id (settings belong to the index
# creation call above, not to the documents themselves)
for doc in my_docs:
    res = es.index(index='test-index', id=doc['Id'], body=doc)
    print(res['result'])

res = es.get(index='test-index', id=1)
print(res['_source'])

# Refresh so the new documents are visible to search
es.indices.refresh(index='test-index')

res = es.search(index='test-index', body={'query': {'match_all': {}}})
print('Got %d Hits:' % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print('%(timestamp)s %(FirstName)s %(LastName)s' % hit['_source'])
Node.js
var request = require('request');
var aws4 = require('aws4');

var myDocs = [
  { Id: 1, FirstName: 'Kemaru', LastName: 'Young', timestamp: Date.now() },
  { Id: 2, FirstName: 'Jane', LastName: 'Doe', timestamp: Date.now() }
];

// formatForBulkUpdate (defined elsewhere) converts the docs into the
// newline-delimited JSON the _bulk endpoint expects
myDocs = formatForBulkUpdate(myDocs);

// aws4 signs based on host and path; the full url is added for request below
var options = {
  host: 'test_host.com',
  path: '/users/_bulk',
  method: 'POST',
  proxy: 'https://test_host.com',
  body: myDocs
};

// If you are using GCloud, check this document: https://console.developers.google.com/project/_/apiui/credential
// If you are using AWS, check this document: https://www.npmjs.com/package/aws4
aws4.sign(options, {
  accessKeyId: 'some_access_key_id',
  secretAccessKey: 'some_secret_access_key'
});

options.url = 'https://' + options.host + options.path;
request(options, function (err, response, body) {
  if (err) throw err;
  console.log('It works!');
});
Node.js (searching)
var mySearch = {
  query: {
    multi_match: {
      fields: ['FirstName', 'LastName'],
      query: 'Ke',
      type: 'phrase_prefix'
    }
  }
};

var options = {
  host: 'test_host.com',
  path: '/users/_search',
  method: 'POST',
  proxy: 'https://test_host.com',
  body: JSON.stringify(mySearch)
};

aws4.sign(options, {
  accessKeyId: 'some_access_key_id',
  secretAccessKey: 'some_secret_access_key'
});

options.url = 'https://' + options.host + options.path;
request(options, function (err, response, body) {
  if (err) throw err;
  // Matches come back under hits.hits, each with its fields in _source
  JSON.parse(body).hits.hits.forEach(function (hit) {
    console.log(hit._source.FirstName + ' ' + hit._source.LastName);
  });
});
C# (updating)
var requestBody = formatForBulkUpdate(myDocs);
var request = (HttpWebRequest) WebRequest.Create("https://test_host.com/users/_bulk");
request.Method = "POST";
request.ContentType = "text/plain";
var requestBodyBytes = Encoding.ASCII.GetBytes(requestBody.ToString());

var creds = new Credentials
{
    AccessKey = "some_access_key",
    SecretKey = "some_secret_key"
};

// Apply the AWS Signature Version 4 headers to the request
// (region and serviceName, e.g. "us-east-1" and "es", are defined elsewhere)
var signer = new SignV4Util();
signer.SignRequest(request, requestBodyBytes, creds, region, serviceName);

using (var stream = request.GetRequestStream())
{
    stream.Write(requestBodyBytes, 0, requestBodyBytes.Length);
}
Web application enriched with data
When the user types a query into the search input, the page queries the endpoint via AJAX (after a short delay, so a request isn't fired on every keystroke) - see the sketch after this list
Services that update that data on a regular schedule
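As a sketch of what that AJAX endpoint could look like on the server side (assuming Flask and elasticsearch-py - the route, index and parameter names are illustrative, but the query mirrors the phrase_prefix search shown earlier):

from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch()

@app.route('/search')
def search():
    # The page calls e.g. /search?q=Ke once the user pauses typing
    term = request.args.get('q', '')
    res = es.search(index='users', body={
        'query': {
            'multi_match': {
                'fields': ['FirstName', 'LastName'],
                'query': term,
                'type': 'phrase_prefix'
            }
        }
    })
    return jsonify([hit['_source'] for hit in res['hits']['hits']])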
Cron & Logstash
Check your Java version - Logstash only runs on certain supported versions
Make sure the JAVA_HOME environment variable is set
input {
  stdin { }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}
bin/logstash -f logstash-simple.conf
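To run such a pipeline on a schedule, a crontab entry is enough (the paths here are illustrative):

# Re-run the Logstash sync pipeline at the top of every hour
0 * * * * /usr/share/logstash/bin/logstash -f /etc/logstash/sync.conf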
Only allowing specific GCloud/AWS users to update the index (see the example policy below), or
Going through an API gateway or Lambda functions hosted on GCloud/AWS
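On AWS, for example, a domain access policy can restrict index writes to a named IAM user - the account ID, user and domain names below are placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/indexer" },
      "Action": "es:ESHttpPost",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/users/*"
    }
  ]
}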
Run Elasticsearch on Elastic Cloud for free (trial): https://www.elastic.co/cloud/
(Nearly) everything about Elasticsearch has been documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
The Elasticsearch source is on GitHub: https://github.com/elastic/elasticsearch
If you have any questions, please let me know
Thanks for watching :D
End of presentation