notes on 📖 O'Reilly's book Elasticsearch: The Definitive Guide. Unfortunately the book covers an old version of Elasticsearch, so I had to translate the examples to newer versions (7.* or 8.*).

~ elasticsearch

elasticsearch?

Elasticsearch is an open source search engine written in Java. It exposes a simple, coherent REST API that lets you do full-text search. It can also be described as:

A distributed real-time document store where every field is indexed and searchable
A distributed search engine with real-time analytics.
Super scalable.

Since it is an API, you can interact with it through plain HTTP requests or with a client in your favorite programming language. If you want to run queries interactively, you can use Kibana. It is a separate piece of software that you will have to install.

If you download it and run it on your local machine you can simply do:


% curl -k -u user:password https://localhost:9200/ | jq
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100   536  100   536    0     0   1307      0 --:--:-- --:--:-- --:--:--  1307
{
    "name": "192.168.1.4",
    "cluster_name": "elasticsearch",
    "cluster_uuid": "some hash",
    "version": {
        "number": "8.17.0",
        "build_flavor": "default",
        "build_type": "tar",
        "build_hash": "2b6a7fed44faa321997703718f07ee0420804b41",
        "build_date": "2024-12-11T12:08:05.663969764Z",
        "build_snapshot": false,
        "lucene_version": "9.12.0",
        "minimum_wire_compatibility_version": "7.17.0",
        "minimum_index_compatibility_version": "7.0.0"
    },
    "tagline": "You Know, for Search"
}

You will see information for the elastic instance you have running.

Now, the best way of learning is by doing, so let's follow the book's example and create an employee directory.

employee directory

Requirements

We need to have an employee directory for Megacorp, as part of a new HR initiative.

the data can contain multi-value tags, numbers, and full text
retrieve the full details of any employee
allow structured search, for example finding employees over the age of 30
return highlighted search snippets from the text of matching documents
build analytics dashboards over the data

Indexing employee documents

The act of storing data in elasticsearch is called indexing, but first we need to define where we are going to store it.

We can draw a parallel to relational databases, which I found useful in the book.


    Relational DB -> Databases -> Tables -> Rows -> Columns
    Elasticsearch -> Indices   -> Types  -> Documents -> Fields

    * Correction: mapping types were deprecated in Elasticsearch 7 and removed in 8, so the parallel now looks like this:

    Relational DB -> Databases -> Tables -> Rows    -> Columns
    Elasticsearch -> Indices   ->    Documents   -> Fields

Each elasticsearch cluster can contain multiple indices (databases)

Index has several meanings here, so let's break them down.

index (noun): the parallel to database we just talked about
index (verb): store a document into an index (noun)
inverted index: it's what Elasticsearch uses for full-text search. Think of it like the index at the back of a book, but more sophisticated: each unique word (term) points to all the documents containing it
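
To make the inverted index idea concrete, here is a toy version in Python. It is only a sketch: the function name and the three sample texts are my own, and splitting on whitespace is a big simplification of what Elasticsearch (via Lucene) really does.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample texts, keyed by document id.
docs = {
    1: "I love to go rock climbing",
    2: "I love to go running",
    3: "I like to collect rock albums",
}

inverted = build_inverted_index(docs)
print(inverted["rock"])  # [1, 3]
print(inverted["love"])  # [1, 2]
```

Looking up a word is now a single dictionary access instead of a scan over every document, which is the whole point of the structure.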

Okay, so for our example we will have an index for employees, and each document will be one employee with that employee's info.

So what would this look like? We can go to Dev Tools in the Kibana UI and make the request there.


PUT /employees/_doc/1
{
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"]
}

Note that _doc here is not a type name we chose: since mapping types were removed, it is the fixed endpoint for working with a single document.

The response will look something like this:


{
  "_index": "employees",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

What is happening in between our request and the response?

Analyze and process the document (tokenization and normalization)
Create data structures (inverted index, bkd trees)
Store in the in-memory buffer
Persist to the durable filesystem
Send response to the client
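
The first step (tokenization and normalization) is the easiest one to picture in code. Below is a rough sketch, assuming an analyzer that splits on anything that is not a letter or a digit and then lowercases. The real default analyzer is smarter about Unicode, so treat the `analyze` function as an illustration, not as what Elasticsearch runs.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: tokenize, then normalize."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # tokenization
    return [token.lower() for token in tokens]  # normalization


print(analyze("I love to go rock climbing"))
# ['i', 'love', 'to', 'go', 'rock', 'climbing']
print(analyze("Smith"))
# ['smith']
```

The lowercasing matters later on: a last name stored as Smith ends up in the inverted index as smith.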

Of course, you can store more documents by following the same process.
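
If you would rather script this than type each request in Dev Tools, the same PUT can be built with plain Python. This is a minimal sketch using only the standard library. `BASE_URL` and `build_index_request` are names I made up, the requests are built but not sent, and authentication and TLS settings are left out on purpose.

```python
import json
import urllib.request

# Assumption: a local node on the same address as the curl example.
BASE_URL = "https://localhost:9200"


def build_index_request(index, doc_id, doc):
    """Build (but do not send) the PUT /<index>/_doc/<id> request for one document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


employees = {
    1: {"first_name": "John", "last_name": "Smith", "age": 25,
        "about": "I love to go rock climbing", "interests": ["sports", "music"]},
    2: {"first_name": "Jose", "last_name": "Perez", "age": 29,
        "about": "I love to go running", "interests": ["music", "chess"]},
}

reqs = [build_index_request("employees", doc_id, doc) for doc_id, doc in employees.items()]
for req in reqs:
    print(req.get_method(), req.full_url)
```

Sending each request would be a call to `urllib.request.urlopen` with whatever credentials and certificate handling your cluster needs.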

Once we have added some employees, we can start doing things with the data itself.

Search Lite

Queries are sent as GET requests. To get all the documents in an index, you can do:


GET /employees/_search

And we would get something like:


{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "employees",
        "_id": "1",
        "_score": 1,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "employees",
        "_id": "2",
        "_score": 1,
        "_source": {
          "first_name": "Jose",
          "last_name": "Perez",
          "age": 29,
          "about": "I love to go running",
          "interests": [
            "music",
            "chess"
          ]
        }
      }
    ]
  }
}

With search lite you can do simple things, like getting all the documents that match a condition, or even all the documents that contain a word. For example, let's say I want the documents for all the people whose last name is Smith:


GET /employees/_search?q=last_name:Smith

Or I want to get all the documents where the word music appears.


GET /employees/_search?q=music

This approach is easy to understand, but it has its limitations. Some queries need special characters like +, which means we have to percent-encode them. So +name:john becomes %2Bname%3Ajohn.
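
You do not have to do the encoding by hand. As a small sketch, Python's standard library can do it. The variable names are mine, and passing `safe=""` tells `quote` to encode every reserved character.

```python
from urllib.parse import quote

query = "+name:john"
encoded = quote(query, safe="")  # encode every reserved character, including + and :

print(encoded)  # %2Bname%3Ajohn
print(f"/employees/_search?q={encoded}")
```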

Do not get me wrong, I like percent encoding as much as the next guy, but writing queries like this gets cumbersome. That is why Elasticsearch offers another way: the Query DSL.

Search with Query DSL

You will have to leave your judgement at the door and start sending a JSON payload in a GET request. I know, I know, it is hard, but this is the way they chose to do it.

We can recreate the last name Smith query we just listed above like this:


GET /employees/_search
{
    "query": {
        "match": {
          "last_name": "Smith"
        }
    }
}

You will see the exact same result. Of course, this DSL allows us to do more complicated things. Say you want all the Smiths at the company and want to see if they are into rock albums.


GET /employees/_search
{
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "last_name": "smith"
                }
            },
            "must":{
                "match": {
                    "about": "rock albums"
                }
            }
        }
    }
}

So let us understand what is happening here.

Understanding Query DSL

There are two types of query clauses:

Leaf query clauses: look for a particular value on a particular field
Compound query clauses: wrap other leaf or compound queries and combine them in logical fashion.

The query above is a compound query: a bool clause that wraps a term leaf and a match leaf.
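
One way I like to picture leaf versus compound clauses is as small functions that return pieces of JSON. The helpers below (`match`, `term`, `bool_query`) are hypothetical names of mine, not part of any Elasticsearch client. They just build plain dictionaries.

```python
import json


def match(field, text):
    """Leaf clause: full-text match on one field."""
    return {"match": {field: text}}


def term(field, value):
    """Leaf clause: exact term lookup on one field."""
    return {"term": {field: value}}


def bool_query(must_clause=None, filter_clause=None):
    """Compound clause: wraps other leaf or compound clauses."""
    body = {}
    if filter_clause is not None:
        body["filter"] = filter_clause
    if must_clause is not None:
        body["must"] = must_clause
    return {"bool": body}


payload = {
    "query": bool_query(
        filter_clause=term("last_name", "smith"),
        must_clause=match("about", "rock albums"),
    )
}
print(json.dumps(payload, indent=4))
```

The printed payload has the same shape as the request body of the Smith and rock albums query, and since `bool_query` returns a clause itself, it can be nested inside another `bool_query`.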

When you run a query, Elasticsearch sorts the results by a relevance score, which measures how well each document matches the query. The higher the score, the more relevant the document.

Score calculation depends on several factors. One of them is whether the query clause runs in a query context or a filter context.

Query context: this context looks at how well a document matches the query clause, and calculates a score for it.
Filter context: this context is binary. Does the document match the query clause? Yes or no.

Filter context has its benefits: it uses simple binary logic, so it performs better and uses fewer resources, among other things.

Now let us come back to our query.


GET /employees/_search
{
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "last_name": "smith"
                }
            },
            "must":{
                "match": {
                    "about": "rock albums"
                }
            }
        }
    }
}

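Here we are using both the filter context and the query context. For the filter context, we are asking Elasticsearch for all the employees with Smith as the last name, a simple yes or no question. The term is written in lowercase because a term query is not analyzed, while the last names were lowercased when they were indexed.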

For the query context, we are asking for the documents whose about field says something about rock albums. Documents that do not contain exactly rock albums can still come back, just with a lower score.


{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.4167401,
    "hits": [
      {
        "_index": "employees",
        "_id": "3",
        "_score": 1.4167401,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "employees",
        "_id": "1",
        "_score": 0.4589591,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

We get two results. Both have the last name Smith, but one likes rock albums and the other likes rock climbing, and the one who likes rock climbing has a lower score.
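
To convince myself of the difference between the two contexts, I wrote a toy version in Python. Big assumption: the score here is just the number of query words found in the about field, which is nowhere near the real relevance formula (BM25 by default), so the numbers will not match the response above. Only the ordering is meant to line up. `analyze` and `search` are names I made up.

```python
def analyze(text):
    """Naive analysis: lowercase and split on whitespace."""
    return text.lower().split()


employees = {
    1: {"last_name": "Smith", "about": "I love to go rock climbing"},
    2: {"last_name": "Perez", "about": "I love to go running"},
    3: {"last_name": "Smith", "about": "I like to collect rock albums"},
}


def search(last_name_term, about_text):
    """Return (doc_id, toy_score) pairs, best match first."""
    wanted = set(analyze(about_text))
    hits = []
    for doc_id, doc in employees.items():
        # Filter context: a yes/no check on the raw term, adds nothing to the score.
        if last_name_term not in analyze(doc["last_name"]):
            continue
        # Query context: toy score = how many query words appear in the field.
        score = len(wanted & set(analyze(doc["about"])))
        if score > 0:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(search("smith", "rock albums"))  # [(3, 2), (1, 1)]
print(search("Smith", "rock albums"))  # []
```

The second call returns nothing because the filter term is compared as-is against the lowercased tokens, which mirrors why the term clause uses smith and not Smith.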
