It is a highly scalable search engine
It is a database companion (not a database)
It can be hosted with GCloud/AWS
It is a flexible solution for a difficult problem -
how do we quickly and efficiently search huge amounts of data?
It handles all of the indexing and searching; we just feed it the data to index
We can protect it and scale it up using GCloud/AWS
All of the heavy lifting is done elsewhere - no load on our own servers
It supports autocomplete and term highlighting out of the box
Your index and queries can be as simple or as complicated as you want them to be
Elasticsearch is built on Apache Lucene and supports the Lucene query syntax
It exposes HTTP endpoints for you to GET, POST, PUT and DELETE
As soon as you've created a new Elasticsearch instance, you can start publishing documents
Alternatively, you can define your indexes up front
Which means you can customise how it analyses and indexes your data (see the sketch after this list, which also shows autocomplete and highlighting)
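To make the last few points concrete, here is a minimal sketch using the official elasticsearch-py client against a local cluster - the index name, the edge n-gram analyzer and the field names are illustrative assumptions, not part of this setup. It defines a custom autocomplete analyzer when the index is created, then runs a search with term highlighting:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Illustrative index: a custom edge n-gram analyzer gives autocomplete-style
# matching on the 'title' field
es.indices.create(index='articles', body={
    'settings': {
        'analysis': {
            'filter': {
                'autocomplete_filter': {
                    'type': 'edge_ngram', 'min_gram': 2, 'max_gram': 10
                }
            },
            'analyzer': {
                'autocomplete': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'autocomplete_filter']
                }
            }
        }
    },
    'mappings': {
        'properties': {
            'title': {
                'type': 'text',
                'analyzer': 'autocomplete',
                'search_analyzer': 'standard'
            }
        }
    }
})

es.index(index='articles', id=1, body={'title': 'Elasticsearch in production'})
es.indices.refresh(index='articles')

# Partial input matches thanks to the n-grams; highlighted fragments come
# back per hit under 'highlight'
res = es.search(index='articles', body={
    'query': {'match': {'title': 'ela'}},
    'highlight': {'fields': {'title': {}}}
})
for hit in res['hits']['hits']:
    print(hit['highlight']['title'])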
This analogy treats an “index” as similar to a “database” in an SQL database,
and a “type” as equivalent to a “table”.
MySQL/RDBMS | Elasticsearch
----------- | -----------------
Database    | Index
Table       | Type (deprecated)
Row         | Document
This was a bad analogy that led to incorrect assumptions.
In an SQL database, tables are independent of each other.
The columns in one table have no bearing on columns with the same name in another table.
This is not the case for fields in a mapping type: fields with the same name in different types of the same index are backed by the same Lucene field, so their definitions must be compatible.
Each Elasticsearch shard is an Apache Lucene index,
with each individual Lucene index containing a subset of the documents in the Elasticsearch index
Having the right number of shards is important for performance.
It is thus wise to plan in advance.
Queries run across multiple shards in parallel execute faster than the same query on a single-shard index,
but only if each shard is located on a different node and there are sufficient nodes in the cluster.
When creating an index, you can set the number of shards and replicas as properties of the index
When an index is created, the number of shards is set,
and this cannot be changed later without reindexing the data.
PUT /some_index
{
  "settings": {
    "index": {
      "number_of_shards": 6,
      "number_of_replicas": 2
    }
  }
}
Python
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with explicit shard and replica settings
es.indices.create(index='test-index', body={
    'settings': {
        'index': {'number_of_shards': 3, 'number_of_replicas': 2}
    }
})

my_docs = [
    {'Id': 1, 'FirstName': 'Kemaru', 'LastName': 'Young', 'timestamp': datetime.now()},
    {'Id': 2, 'FirstName': 'Jane', 'LastName': 'Doe', 'timestamp': datetime.now()}
]

# Index each document under its own id (settings belong to the index
# creation call above, not to the documents themselves)
for doc in my_docs:
    res = es.index(index='test-index', id=doc['Id'], body=doc)
    print(res['result'])

res = es.get(index='test-index', id=1)
print(res['_source'])

# Refresh so the new documents are visible to search
es.indices.refresh(index='test-index')

res = es.search(index='test-index', body={'query': {'match_all': {}}})
print('Got %d Hits:' % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print('%(timestamp)s %(FirstName)s %(LastName)s' % hit['_source'])
Node.js
var request = require('request');
var aws4 = require('aws4');

var myDocs = [
  { Id: 1, FirstName: 'Kemaru', LastName: 'Young', timestamp: Date.now() },
  { Id: 2, FirstName: 'Jane', LastName: 'Doe', timestamp: Date.now() }
];

// formatForBulkUpdate (defined elsewhere) converts the docs into the
// newline-delimited JSON the _bulk endpoint expects
myDocs = formatForBulkUpdate(myDocs);

// aws4 signs based on host and path; the full url is added for request below
var options = {
  host: 'test_host.com',
  path: '/users/_bulk',
  method: 'POST',
  proxy: 'https://test_host.com',
  body: myDocs
};

// If you are using GCloud, check this document: https://console.developers.google.com/project/_/apiui/credential
// If you are using AWS, check this document: https://www.npmjs.com/package/aws4
aws4.sign(options, {
  accessKeyId: 'some_access_key_id',
  secretAccessKey: 'some_secret_access_key'
});

options.url = 'https://' + options.host + options.path;
request(options, function (err, response, body) {
  if (err) throw err;
  console.log('It works!');
});
Node.js (searching)
var mySearch = {
  query: {
    multi_match: {
      fields: ['FirstName', 'LastName'],
      query: 'Ke',
      type: 'phrase_prefix'
    }
  }
};

var options = {
  host: 'test_host.com',
  path: '/users/_search',
  method: 'POST',
  proxy: 'https://test_host.com',
  body: JSON.stringify(mySearch)
};

aws4.sign(options, {
  accessKeyId: 'some_access_key_id',
  secretAccessKey: 'some_secret_access_key'
});

options.url = 'https://' + options.host + options.path;
request(options, function (err, response, body) {
  if (err) throw err;
  // Matches come back under hits.hits, each with its fields in _source
  JSON.parse(body).hits.hits.forEach(function (hit) {
    console.log(hit._source.FirstName + ' ' + hit._source.LastName);
  });
});
C# (updating)
var requestBody = formatForBulkUpdate(myDocs);
var request = (HttpWebRequest) WebRequest.Create("https://test_host.com/users/_bulk");
request.Method = "POST";
request.ContentType = "text/plain";
var requestBodyBytes = Encoding.ASCII.GetBytes(requestBody.ToString());

var creds = new Credentials
{
    AccessKey = "some_access_key",
    SecretKey = "some_secret_key"
};

// Apply the AWS Signature Version 4 headers to the request
// (region and serviceName, e.g. "us-east-1" and "es", are defined elsewhere)
var signer = new SignV4Util();
signer.SignRequest(request, requestBodyBytes, creds, region, serviceName);

using (var stream = request.GetRequestStream())
{
    stream.Write(requestBodyBytes, 0, requestBodyBytes.Length);
}
Web application enriched with data
When the user types a query into the search input, the page queries the endpoint via AJAX (after a short delay, so a request isn't fired on every keystroke) - see the sketch after this list
Services that update that data on a regular schedule
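As a sketch of what that AJAX endpoint could look like on the server side (assuming Flask and elasticsearch-py - the route, index and parameter names are illustrative, but the query mirrors the phrase_prefix search shown earlier):

from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch()

@app.route('/search')
def search():
    # The page calls e.g. /search?q=Ke once the user pauses typing
    term = request.args.get('q', '')
    res = es.search(index='users', body={
        'query': {
            'multi_match': {
                'fields': ['FirstName', 'LastName'],
                'query': term,
                'type': 'phrase_prefix'
            }
        }
    })
    return jsonify([hit['_source'] for hit in res['hits']['hits']])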
Cron & Logstash
Check your Java version - Logstash only runs on certain supported versions
Make sure the JAVA_HOME environment variable is set
input {
  stdin { }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}
bin/logstash -f logstash-simple.conf
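To run such a pipeline on a schedule, a crontab entry is enough (the paths here are illustrative):

# Re-run the Logstash sync pipeline at the top of every hour
0 * * * * /usr/share/logstash/bin/logstash -f /etc/logstash/sync.conf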
Only allowing specific GCloud/AWS users to update the index (see the example policy below), or
Going through an API gateway or Lambda functions hosted on GCloud/AWS
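On AWS, for example, a domain access policy can restrict index writes to a named IAM user - the account ID, user and domain names below are placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/indexer" },
      "Action": "es:ESHttpPost",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/users/*"
    }
  ]
}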
Run Elasticsearch on Elastic Cloud for free (trial): https://www.elastic.co/cloud/
(Nearly) everything about Elasticsearch has been documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
The Elasticsearch source is on GitHub: https://github.com/elastic/elasticsearch
If you have any questions, please let me know
Thanks for watching :D
End of presentation