Notes on 📖 O'Reilly's book, Elasticsearch: The Definitive Guide. Unfortunately the book covers an old version of Elasticsearch, so I had to translate the examples to newer versions (7.* or 8.*).

~ elasticsearch

elasticsearch?

Elasticsearch is an open source search engine written in Java. It exposes a simple, coherent REST API that lets you do full-text search. It can also be described as:

Since it is an API, you can talk to it with plain HTTP requests or with a client in your favorite programming language. If you want to run queries interactively you can use Kibana. This is separate software that you will have to install.

If you download it and run it on your local machine you can simply do:


% curl -k -u user:password https://localhost:9200/ | jq
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100   536  100   536    0     0   1307      0 --:--:-- --:--:-- --:--:--  1307
{
    "name": "192.168.1.4",
    "cluster_name": "elasticsearch",
    "cluster_uuid": "some hash",
    "version": {
        "number": "8.17.0",
        "build_flavor": "default",
        "build_type": "tar",
        "build_hash": "2b6a7fed44faa321997703718f07ee0420804b41",
        "build_date": "2024-12-11T12:08:05.663969764Z",
        "build_snapshot": false,
        "lucene_version": "9.12.0",
        "minimum_wire_compatibility_version": "7.17.0",
        "minimum_index_compatibility_version": "7.0.0"
    },
    "tagline": "You Know, for Search"
}

You will see information for the elastic instance you have running.
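If you would rather do the same call from code than from curl, here is a minimal sketch using only the Python standard library. The function names are my own (hypothetical), and the URL, user and password are the same placeholders as in the curl command. Turning off certificate verification mirrors curl's -k flag and is only reasonable against a local test instance.

```python
import base64
import json
import ssl
import urllib.request


def build_info_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for the root endpoint with HTTP Basic auth (like curl -u)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/",
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )


def fetch_cluster_info(base_url: str, user: str, password: str) -> dict:
    """Send the request and decode the JSON body. Needs a running instance."""
    # Equivalent of curl -k: skip certificate verification (local testing only).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_info_request(base_url, user, password)
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)
```

With an instance running, fetch_cluster_info("https://localhost:9200", "user", "password")["version"]["number"] should give back the version string shown in the curl output.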

Now, the best way of learning is by doing, so let's follow the example of the book and create an employee directory.

employee directory

Requirements

We need to have an employee directory for Megacorp, as part of a new HR initiative.

Indexing employee documents

The act of storing data in elasticsearch is called indexing, but first we need to define where we are going to store it.

We can draw a parallel to relational databases, which I found useful in the book.


    Relational DB -> Databases -> Tables -> Rows      -> Columns
    Elasticsearch -> Indices   -> Types  -> Documents -> Fields

    * Correction: apparently this changed in Elasticsearch 7 and 8, where types were removed:

    Relational DB -> Databases -> Tables -> Rows      -> Columns
    Elasticsearch -> Indices   ->           Documents -> Fields

Each Elasticsearch cluster can contain multiple indices (databases).

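Index has several meanings here, so let's break them down. As a noun, an index is the place where documents are stored (the counterpart of a database in the parallel above). As a verb, to index a document is to store it in an index so that it can be searched. And an inverted index is the data structure that Elasticsearch builds from the documents to make full-text search fast.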

Okay, so for our example we will have an index for employees, and each document will be an employee with that employee's info.

So what would it look like to do this? We can go to the Dev Tools in the Kibana UI and make the request there.


PUT /employees/_doc/1
{
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"]
}

Do note that the _doc in the path is not a custom type like the ones in the book. In 7.* and 8.* types are gone, and _doc is simply the fixed name of the document endpoint.

The response will look something like this:


{
  "_index": "employees",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

What is happening in between our request and the response?

  1. Analyze and process the document (tokenization and normalization)

  2. Create data structures (inverted index, bkd trees)

  3. Store in the in-memory buffer

  4. Persist to durable storage (by default the operation is written to the transaction log on disk before the request is acknowledged)

  5. Send response to the client
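To make steps 1 and 2 a bit more concrete, here is a toy sketch of tokenization, normalization and an inverted index in Python. This is not how Elasticsearch or Lucene implement it. The analyze and build_inverted_index functions are made up for illustration, and the real standard analyzer does considerably more than split and lowercase.

```python
import re
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Toy analyzer: split on non-alphanumerics (tokenization), then lowercase (normalization)."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


# The "about" field of two employees from these notes.
about = {
    "1": "I love to go rock climbing",
    "2": "I love to go running",
}
inverted = build_inverted_index(about)
print(sorted(inverted["rock"]))  # ['1']
print(sorted(inverted["love"]))  # ['1', '2']
```

Looking up a word is then a dictionary access instead of a scan over every document, which is the whole point of building the structure at index time.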

Of course, you can store more documents by following the same process.
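If you have many documents, sending them one PUT at a time gets tedious. Elasticsearch also has a _bulk endpoint that takes newline-delimited JSON: one action line followed by one source line per document. The book example does not use it here, so take this as an optional sketch. The build_bulk_body helper is my own name, and it only builds the request body, it does not send anything.

```python
import json


def build_bulk_body(index_name: str, docs: dict[str, dict]) -> str:
    """Build an NDJSON body for the _bulk endpoint: action line, then source line."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


more_employees = {
    "2": {
        "first_name": "Jose",
        "last_name": "Perez",
        "age": 29,
        "about": "I love to go running",
        "interests": ["music", "chess"],
    },
    "3": {
        "first_name": "Jane",
        "last_name": "Smith",
        "age": 32,
        "about": "I like to collect rock albums",
        "interests": ["sports", "music"],
    },
}
print(build_bulk_body("employees", more_employees), end="")
```

The resulting text would be sent as the body of POST /_bulk with the Content-Type header set to application/x-ndjson.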

Once we have added some employees we can start doing things with the data itself.

Search Lite

You use GET requests to run the queries. If you want to get all the documents in an index, you can do:


GET /employees/_search

And we would get something like:


{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "employees",
        "_id": "1",
        "_score": 1,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "employees",
        "_id": "2",
        "_score": 1,
        "_source": {
          "first_name": "Jose",
          "last_name": "Perez",
          "age": 29,
          "about": "I love to go running",
          "interests": [
            "music",
            "chess"
          ]
        }
      }
    ]
  }
}

With search lite you can do simple things, like getting all the documents that match a condition, or all the documents that contain a given word. For example, let's say I want to get the documents for all the people that have the last name Smith:


GET /employees/_search?q=last_name:Smith

Or I want to get all the documents where the word music appears.


GET /employees/_search?q=music

This approach is good and easy to understand, but it has its limitations. Some queries need special characters like +, which means we have to percent-encode them. So +name:john would be %2Bname%3Ajohn.
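You do not have to do the encoding by hand. As a small sketch, Python's urllib.parse.quote produces the encoded form of the example above. The employees index in the path is the one from these notes.

```python
from urllib.parse import quote

raw_query = "+name:john"
# safe="" makes quote() encode every reserved character, including ":" and "+".
encoded = quote(raw_query, safe="")
print(encoded)  # %2Bname%3Ajohn

search_path = f"/employees/_search?q={encoded}"
print(search_path)  # /employees/_search?q=%2Bname%3Ajohn
```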

Do not get me wrong, I like percent encoding as much as the next guy, but writing queries like this becomes cumbersome. That is why there is another way: the Elasticsearch Query DSL.

Search with Query DSL

You will have to leave your judgement at the door, and start sending a json payload in a GET request. I know, I know, it is hard, but this is the way they chose to do it.

We can recreate the last name Smith query we just listed above like this:


GET /employees/_search
{
    "query": {
        "match": {
          "last_name": "Smith"
        }
    }
}

You will see the exact same result. Of course this DSL allows us to do more complicated things. Say you want all the Smiths at the company, and want to see if they are into rock albums.


GET /employees/_search
{
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "last_name": "smith"
                }
            },
            "must":{
                "match": {
                    "about": "rock albums"
                }
            }
        }
    }
}
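If you are building this request from a program instead of typing it in Dev Tools, the body is just nested data. A minimal sketch in Python, where smiths_about is a hypothetical helper name and the output is the same JSON as the request above:

```python
import json


def smiths_about(last_name_token: str, about_text: str) -> dict:
    """Build a bool query: exact-token filter on last_name, scored match on about."""
    return {
        "query": {
            "bool": {
                "filter": {"term": {"last_name": last_name_token}},
                "must": {"match": {"about": about_text}},
            }
        }
    }


body = smiths_about("smith", "rock albums")
print(json.dumps(body, indent=4))
```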

So let us understand what is happening here.

Understanding Query DSL

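There are two types of query clauses: leaf clauses, which look for a value in a particular field (match and term are examples), and compound clauses, which wrap other leaf or compound clauses and combine them.

The bool query above is a compound query.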

When you make a query on Elasticsearch, by default it sorts the results by a relevance score, which measures how well each document matches the query. The higher the score, the more relevant the document.

Score calculation depends on several factors. One of them is whether the query clause runs in query context or in filter context.

Filter context has its benefits: it answers a simple yes or no question, no score is calculated for it, and Elasticsearch can cache frequently used filters, so it is faster and uses fewer resources.

Now let us come back to our query.


GET /employees/_search
{
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "last_name": "smith"
                }
            },
            "must":{
                "match": {
                    "about": "rock albums"
                }
            }
        }
    }
}

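Here we are using both the filter context and the query context. For the filter context, we are asking Elasticsearch to return all the employees with Smith as the last name: a simple yes or no question. Note the lowercase smith. A term query is not analyzed, so it has to match the token exactly as it is stored in the index, and the default analyzer lowercased Smith when the document was indexed.

For the query context, we are asking for the documents where the about field says something about rock albums. The match query analyzes rock albums into the terms rock and albums and, by default, matches documents that contain either one. So even a document that does not contain exactly rock albums can still come back, just with a lower score.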


{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.4167401,
    "hits": [
      {
        "_index": "employees",
        "_id": "3",
        "_score": 1.4167401,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "employees",
        "_id": "1",
        "_score": 0.4589591,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

We get two results. Both have the Smith last name, but one likes rock albums and the other rock climbing, and the one that likes rock climbing has the lower score.
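To see why the ordering comes out this way, here is a toy model of the two contexts in Python. It is only a sketch: the score is just the number of query terms found in the about field, which is an assumption for illustration and not the formula Elasticsearch uses, so the numbers will not match the _score values above. The analyze and search functions are made up.

```python
import re


def analyze(text: str) -> list[str]:
    """Toy analyzer: split into words and lowercase them."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


employees = [
    {"_id": "1", "last_name": "Smith", "about": "I love to go rock climbing"},
    {"_id": "2", "last_name": "Perez", "about": "I love to go running"},
    {"_id": "3", "last_name": "Smith", "about": "I like to collect rock albums"},
]


def search(docs: list[dict], last_name_term: str, about_query: str) -> list[tuple[str, int]]:
    query_terms = set(analyze(about_query))
    hits = []
    for doc in docs:
        # Filter context: a yes/no check against the indexed tokens, no score involved.
        if last_name_term not in analyze(doc["last_name"]):
            continue
        # Query context: toy score = how many distinct query terms the field contains.
        score = len(query_terms & set(analyze(doc["about"])))
        if score > 0:
            hits.append((doc["_id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(search(employees, "smith", "rock albums"))  # [('3', 2), ('1', 1)]
print(search(employees, "Smith", "rock albums"))  # []
```

The second call returns nothing because the filter term is compared as-is against lowercased tokens, which is the same reason the term query in the request uses smith and not Smith.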
