Use the index, JSON! Aggregation performance (Elastic, MySQL, MongoDB)

2016/08/12 by admin

Use the index, Luke!? No: use the index, JSON! When hunting features the X DevAPI should consider I got shocked by some JSON query execution times. Sometimes the MySQL Document Store runs head-on-head with MongoDB, sometimes not. Sometimes the difference to Elasticsearch is huge, sometimes – just – acceptable. Everytime the index rules. Use the index, MySQL Document Store users!

Comparing Elasticsearch, MySQL and MongoDB comes with a touch. Elasticsearch is the most popular enterprise search engine closely followed by Solr. MongoDB is the #1 document store. And, MySQL tries to throw it’s owners flagship database from the RBDMS throne to gain the top position. The only thing they have in common is being able to store and query JSON documents to serve (orthogonal) web developer needs. Some believe the RBMS and Document model can be blended in one store whereas others advocate using specialized systems. Given how different the focus of the three systems is any performance comparison is questionable. However, as the quoted experts noted performance matters when choosing a system: you can often work around data model properties, you cannot overcome performance deficiencies. All those warnings given why not share some observations and basic tipps…

The setup: Elasticsearch 2.3.5, MongoDB 3.2, MySQL 8.0

My testbench consists of out-of-the-box installations of Elasticsearch, MongoDB and MySQL 8.0. MySQL 8.0 is an unreleased internal development version which shows no significant performance difference over 5.7 for this test. The stores run in a virtual machine on my notebook. All systems use their default settings. No parameters set but the data path for on-disk storage. The data set consists of 1.000.000 random data documents. All systems are loaded with the exact same data set.

The documents represent fictional “products” of an online fabrics and clothing retailer, some truly random stuff:


{
  "product_id": 123  <--- 1...1000000
  "size": "...",     <--- "S", "M", "L", "XL", "XXL", "114cm", "140cm"
  "price": 13.46,    <--- 1..100
  "title": "...",    <--- 2..9 "words" each word 3..20 characters
  "usage": "dress",  <--- "inside", "outdoor", "baby", "dress", optional/not given
  "gender": "female",<--- "male", "femal", optional/not given
  "material": "silk",<--- "silk", "wool", "cotton", "blends", "knits", "designer fabrics"
  "description": "", <--- 10..100 "words" each work 3..20 charachters
  "in_stock_since": "2016-08-03 14:57:32" <-- random dates within 7 days
}

I didn’t bother nesting data: it all started as a feature comparison. Eventually, the decision turned out as an advantage as it permits storing the data set in a SQL table.

Details: data size ~230MB, JSON type hints, no sharding

MongoDB and Elasticsearch have been hinted about the datetype of the “in_stock_since” field. The PHP script copying data from MySQL to MongoDB uses MongoDB\BSON\UTCDateTime for the date. Elasicsearch is a full-text (FT) search engine. By default it would add all text fields to its FT index. I disabled this using the below mapping. Neither was testing full-text search in my interest nor would MySQL or MongoDB build an FT index by default. Hinting or not hinting Elasticsearch does not matter much for this test, but it’s fair as we will set extra indicies for MySQL.


{
  "products": {
    "mappings": {
      "product": {
        "properties": {
          "description": {
            "type": "string",
            "index": "not_analyzed"
          },
          "gender": {
            "type": "string",
            "index": "not_analyzed"
          },
          "in_stock_since": {
            "type": "date",
            "format": "Y-M-d H:m:s"
          },
          "material": {
            "type": "string",
            "index": "not_analyzed"
          },
          "price": {
            "type": "double"
          },
          "size": {
            "type": "string",
            "index": "not_analyzed"
          },
          "title": {
            "type": "string"
          },
          "usage": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

MySQL reports a data set size of ~230MB for SELECT SUM(LENGTH(product)) FROM products (UTF8MB4). The three systems report slightly different sizes, likely due to different storage and charset approaches. The MySQL figure, as you will soon notice, does not count in the size of the document ID. It really does not matter. Point is: all systems have a fair chance to load the entire data set into the main memory. I’ve experimented with different random data sets from 10.000 to 1.000.000 documents – it really does not matter given how coarse the relative figures are that I show below.

Elasticsearch and MonoDB offer sharding out-of-the-box. MySQL does not: no news to report on MySQL Fabric which offers sharding for MySQL. Sharding or distributed query execution have not been evaluated.

Beware of naive use of MySQL JSON

No, there is no error in the below chart.

The chart shows the runtime of a query in seconds executed against MySQL and Elasticsearch. For this very query Elasticsearch blows MySQL out of the water. Query runtime on MySQL is 36.97 seconds compared to 0.11 seconds on Elasicsearch. Elasticsearch is 336x times faster than MySQL. REALLY? The query is a simple aggregation of the entire data set by material.
[wp_charts title=”GROUP BY material” type=”bar” align=”alignleft” margin=”5px 20px” width=”100%” datasets=”36.97,0.11″ animation=”false” scaleFontSize=”12″ labels=”MySQL,Elasticsearch”]
Whether the entire data set is aggregated by material (string, few distinct values) or price (number, ~9.000 distinct values) does not make a huge difference. The factor remains >>100x.

A first time MySQL JSON user is likely to store its JSON documents in the JSON column of a table. Maybe, the person is a little lazy, like me, and stores the document id in an extra integer column defined as AUTO_INCREMENT:


CREATE TABLE products (
  product_id INT(11) NOT NULL AUTO_INCREMENT,
  product JSON DEFAULT NULL,
  PRIMARY KEY (product_id)
)

The table definition is comparable to the MySQL Document Store table definition for collections. The actual indexing is a little different – the primary key is defined using a stored generated column – but that detail is of no relevance here. All observations and tipps apply to the MySQL Document Store. Using SQL or CRUD/X DevAPI makes no differences either. Internally, at the time of write, everything ends up as a SQL query.

Here’s the SQL statement and the Elasticsearch query in question side-by-side. MySQL follows the ANSI/ISO SQL and uses functions to work on JSON data. The first thing that becomes obvious is that MySQL will do 1.000.000 function calls to JSON_EXTRACT(). JSON_EXTRACT() will fetch the material column from the JSON document. The document is stored in binary form, there’s some sort of directory for fast field lookup and everything should be fine. But it’s not – for a simple reason.

MySQL	Elasticsearch
`SELECT COUNT(*), JSON_EXTRACT( product, '$.material' ) AS _facet_material FROM products GROUP BY _facet_material`	`{ "size": 0, "aggs": { "materials": { "terms": { "field": "material" } } } }`

Well, the EXPLAIN output is not fine either: full table scan, using temporary table for filesort. Using a temporary table on my hardware is no good idea. Not at all.


mysql> EXPLAIN SELECT COUNT(*), JSON_EXTRACT(product, "$.material") AS _material FROM products GROUP BY _material;
+----+-------------+----------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                           |
+----+-------------+----------+------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
|  1 | SIMPLE      | products | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 989973 |   100.00 | Using temporary; Using filesort |
+----+-------------+--------- +------------+------+---------------+------+---------+------+--------+----------+---------------------------------+
1 row in set, 1 warning (0,00 sec)

Double-check: no beginner mistake, ensure: no disk bottle-neck…

Four suggestions to atack the problem:

Tune MySQL and all the other stores
Run on different hardware
Reduce the data set size
Avoid temporary tables

Because this was never planned to be any serious benchmarking, taking a serious amount the first two options are out. Reducing the data set size is no option either unless I take the benchmarking task serious. With a data set size of 50.000 documennts execution times are 0.14 seconds for MySQL, 0.067s for Elasticsearch and 0.084 seconds for MongoDB but variation across runs on a VM is so huge that those figures can only be considered as trend figured.

A data set size of 1.000.000 documents gives me runtimes that don’t suffer too much from variation. The data easily fits into the InnoDB buffer pool. It must be possible to determine the base performance for accessing JSON document fields with a query entirely running in main memory.


SELECT SUM(data_size) FROM innodb_buffer_page WHERE table_name = "`facets`.`products`" GROUP BY table_name;
+----------------+
| sum(data_size) |
+----------------+
|       22656540 |
+----------------+

Working towards the baseline

Let’s run a slightly different query that does a plain table scan search and avoids the filesort. Instead of grouping products by material, we count the number of products which are made of cotton.


EXPLAIN SELECT COUNT(*) FROM products WHERE JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) = 'cotton';
+----+-------------+----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | products | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 989973 |   100.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+--------+----------+-------------+

The same is done on MongoDB and Elasticsearch. MongoDB has a dedicated command for it, easy:


count{
  count: "products",
  query: {
    material: "cotton"
  },
  readConcern: {    
  }
}

I can think of two ways getting the result for Elasticsearch. The first is using the aggregation framework for the COUNT(*). The second, simpler solution is setting the result fetch size to zero and reading the count from the result set meta data. The latter is about twice as face and so I opted for it:


{
  "size": 0,
  "query": {
    "term": {
      "material": "cotton"
    }
  }
}

All these queries run in memory. No disk access. JSON document field access as simple as can be. JSON field type string, no nesting.

The results show MySQL coming in last. MySQLs best result out of approximately 10 runs is 2.43 seconds. MongoDB is second scoring with 1.17 seconds. Elasticsearch is by far the fastest. Beating MySQL by a factor of 142 and MongoDB by a factor of 68 it needs as little as 0.017 seconds.
[wp_charts title=”Table scan” type=”bar” align=”alignleft” margin=”5px 20px” width=”100%” datasets=”2.43,0.017,1.176″ animation=”false” scaleFontSize=”12″ labels=”MySQL doc,Elasticsearch,MongoDB”]

MySQL document vs. MySQL table vs. MongoDB

At this point I wondered if MySQL is really that slow, if there is something up with our table scan performance. Because I’m using flat documents, which, BTW, fail to demonstrate the great nesting feature of JSON (a plain lack of SQL:99+ in MySQL… grumble) I can easily flatten the data set to a table:


CREATE TABLE products_flat (
  product_id int(11) NOT NULL AUTO_INCREMENT,
  facet_title varchar(127) DEFAULT NULL,
  facet_size varchar(255) DEFAULT NULL,
  facet_material varchar(255) DEFAULT NULL,
  facet_usage varchar(255) DEFAULT NULL,
  facet_gender varchar(255) DEFAULT NULL,
  facet_price double DEFAULT NULL,
  in_stock_since datetime DEFAULT NULL,
  description varchar(1024) DEFAULT NULL,
  facet_price_range varchar(10) DEFAULT NULL,
  PRIMARY KEY (product_id)
)

Now we can compare the table access with using JSON in MySQL. How fast will SELECT COUNT(*) FROM products_flat WHERE facet_material = 'cotton' run? About as fast as MongoDB. It takes 1.21 seconds. The figures are marginally different but I cannot stop to stress out that I’m using a virtual machine, no proper benchmark script and the variation between runs gives me only trend figures.
[wp_charts title=”Table scan flattened” type=”bar” align=”alignleft” margin=”5px 20px” width=”100%” datasets=”2.43,1.21,0.017,1.176″ animation=”false” scaleFontSize=”12″ labels=”MySQL doc,MySQL table,Elasticsearch,MongoDB”]
The lesson learned seems to be that you should extract JSON fields you want to search on frequently.

Use the index, JSON!

You can extract JSON fields and store the value in a column in two ways: you do it, or the database does it for you. Trying to do it on the application side is error prone as you must ensure data consistency when duplicating data. Its better to have the database do it. Trigger are an option but there’s a better one in MySQ 5.7+: generated columns. Generated columns compute their value using an expression that is given with the table creation.


ALTER TABLE products 
ADD facet_material 
  VARCHAR(255) 
  GENERATED ALWAYS AS (
   JSON_UNQUOTE(
     JSON_EXTRACT(
       product, '$.material'
     )
   )
  )

Generated columns come in two flavours: virtual and stored. Virtual is the default. Virtual generated columns compute their values at runtime. With regards to table scan / document search performance this is no step forward. A stored generated column holding a JSON field gets you to the MySQL table performance at the cost of data duplication. The JSON field is now stored twice: in JSON document and in the column. Pretty much the worst of both worlds: disk read takes longer, less records fit into the buffers, CPU has more to do… If I was writing an application, I would try to avoid the data duplication, e.g. remove the field from the JSON document.

If you want the search operation to be fast, bite the bullet and follow a true classic: use the index, Luke! You can create an index on a virtual and a stored generated column:


ALTER TABLE products
 ADD INDEX 
   idx_facet_material(facet_material)

Needless to say an index needs disk space and will eat up space in the buffers as well:


SELECT 
  index_name, 
  SUM(data_size) / 1024 / 1024
FROM 
  INFORMATION_SCHEMA.innodb_buffer_page 
WHERE 
  table_name = "`facets`.`products_naive`"
GROUP BY
 index_name

+--------------------+------------------------------+
| index_name         | SUM(data_size) / 1024 / 1024 |
+--------------------+------------------------------+
| idx_facet_material |                  17.03267002 |
| PRIMARY            |                  76.14762783 |
+--------------------+------------------------------+

Using indicies MySQL and MongoDB run neck-on-neck at about 0.13s. Both are still 7x slower than Elasticsearch.
[wp_charts title=”Table scan index” type=”bar” align=”alignleft” margin=”5px 20px” width=”100%” datasets=”2.43,0.13,0.017,1.176,0.14″ animation=”false” scaleFontSize=”12″ labels=”MySQL doc,MySQL doc index,Elasticsearch,MongoDB,MongoDB index”]

GROUP BY material – MongoDB beats MySQL?

You may have wondered why I didn’t include the result for MongoDB in the very first chart. Best I got for MongoDB was 2.497s – 15x faster than MySQL? Well, that was due to filesort! Please don’t tap into the same pitfall, dear first time MySQL Document Store user. Here are all the figures I got when using 1.000.000 documents in one chart. MongoDB seems to fail to utilize its index on the material field when running the aggregation framework query. Ensuring all operations run in memory, given 1.000.000 random documents, using an index MySQL scores 0.72s versus 2.497s – MySQL is 3.5x faster. Hmm.

[wp_charts title=”GROUP BY material index” type=”bar” align=”alignleft” margin=”5px 20px” width=”100%” datasets=”36.97,0.72,1.81,0.79,0.11,2.497,2.497″ animation=”false” scaleFontSize=”12″ labels=”MySQL doc (disk access!),MySQL doc index (in memory), MySQL table (in memory), MySQL table index (in memory),Elasticsearch (in memory),MongoDB (in memory), MongoDB index (in memory)”]
MongoDB aggregation framework command used:


aggregate{
  aggregate: "products",
  pipeline: [
    {
      $group: {
        _id: "$material",
        val: {
          $sum: 1
        }
      }
    }
  ],
  allowDiskUse: true,
  readConcern: {    
  },
  cursor: {    
  }
}

The only take-away I dare to make is that MongoDB seems to be about twice as fast as MySQL extracting JSON field values, as the table scan figures hint. The following results for 50.000 documents seem to confirm the second observeration. To make the MySQL Document Store fast you must add indicies – and it will not do that for you automatically. For 50.000 documents I am getting: 0.03s MySQL using index, 0.157s MongoDB w/wo index, 0.28s MySQL wo index. Out-of-box experience for the MySQL Document Store user? Hmm.
[wp_charts title=”GROUP BY material 50000″ type=”bar” align=”alignleft” margin=”5px 20px” width=”100%” datasets=”0.28,0.03,0.157″ animation=”false” scaleFontSize=”12″ labels=”MySQL doc, MySQL doc index,MongoDB w/wo index”]

I am not that surprised. The MySQL Document Store is optimized for point-select using the document ID. And the document ID is indexed. Optimizing the other operations was no primary target for the teams. Everybody was/is busy providing the features as such.

Use the smart generated index…

The Elasticsearch aggregation framework (overview and comparison) has some wonderful “convenience” bucket aggregations such as range or histogram. My interest in Elasticsearch started because I wanted to know what faceted search is and who’s good at it. Easticsearch is. A faceted navigation for all products is likely to show a price range facet. Let it show the number of products in the price range from 0..10, 10..20, 20..50 and 50+. Knowing about the performance of MySQL JSON field access if no index is around, you wisely add a generated column and an index:


CREATE TABLE products (
  product_id INT(11) NOT NULL AUTO_INCREMENT,
  product JSON DEFAULT NULL,
  facet_price DOUBLE 
   GENERATED ALWAYS AS (
     JSON_UNQUOTE(
      JSON_EXTRACT(product, '$.price')
     )
   ),
  PRIMARY KEY (product_id),
  KEY idx_face_price(facet_price)
)

This is a poor index. It does not give you the range information and you have to compute it at runtime using a SQL expression:


SELECT COUNT(*), 
  (CASE
    WHEN facet_price BETWEEN 0 AND 10 THEN "0-10"
    WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20"
    WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50"
    ELSE "50+"
   END) AS _facet_price_range  
FROM 
  products 
GROUP BY 
  _facet_price_range;

No matter if I use the small 50.000 or the bigger 1.000.000 documents data set MySQL EXPLAIN barks: using index; using temporary; using filesort.

You are not using the features offered by an generated column – move the CASE into the column definition and add an index.


ALTER TABLE products ADD 
  facet_price_range VARCHAR(10)
  GENERATED ALWAYS AS (
   (CASE
     WHEN facet_price BETWEEN 0 AND 10 THEN "0-10"
     WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20"
     WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50"
     ELSE "50+"
   END)
  )

ALTER TABLE product ADD INDEX 
  idx_facet_price_range(facet_price_range)

Result on my machine: 2.85s vs. 0.73s. Nearly 4x faster.

And Elasticsearch?

I tried a lot to get MySQL close to the Elasticsearch performance when aggregating data. The bigger the data set, the less selective the query is, the more is Elasticsearch ahead. The fact that is can run “multiple queries” in one go plays for it. Elasticsearch returns matching documents together with an arbitrary amout of statistics on the matching documents, or even the entire data set. It requires only one query to do so, one scan versus many for MySQL (resp MongoDB). Most Elasticsearch aggregations turned out to be several times faster than MySQL, even when making massive use of indicies. Multiple this by the number of aggregations you compute in one go and…

MySQL 31ms vs. Elasticsearch 30ms!

Eventually, I found a query that exeuctes equally fast on MySQL and Elasticsearch. Why? Because the aggregation plays no role. The query could be executed when a customer enters a search term, has decided on a facet value (here: size=M) and the fulltext search returns few results only.


{
  "size": 0,
  "query": {
    "match": {
      "title": "Beach"
    }
  },
  "aggs": {
    "by_size_m": {
      "filter": {
        "term": {
          "size": "M"
        }
      },
      "aggs": {
        "by_day": {
          "date_histogram": {
            "field": "in_stock_since",
            "interval": "day",
            "format": "Y-MM-dd HH:mm:ss"
          }
        }
      }
    }
  }
}

  SELECT 
    COUNT(*) AS _hits, 
    NULL, 
    NULL 
  FROM 
    products 
  WHERE 
    MATCH(facet_title) AGAINST ("Beach") 
UNION ALL 
   SELECT 
     NULL, 
    COUNT(*) AS _facet_date_count,
    CAST(
     JSON_UNQUOTE(
        JSON_EXTRACT(
          product, 
          "$.in_stock_since"
        )
     ) AS date) AS _facet_date 
  FROM 
    products 
  WHERE 
    MATCH(facet_title) AGAINST ("Beach") 
    AND facet_size = "M" 
  GROUP BY _facet_date

MySQL equals Elasticsearch on Fulltext search performance! At least for this particular query. Amazingly, it’s a fulltext query. Please note, there seems to be a bug when accessing indexed virtual generated columns and fulltext columns in one query. I used stored generated columns for the above query to get the same results from both systems.

Happy hacking!

@Ulf_Wendel