To make the MySQL Document Store's primary programming interface, the X DevAPI, a success, we should provide building blocks that solve common web development tasks, for example, faceted search. But does a relational database bent towards a document store stand any chance of competing on aggregation (grouping) features? Comparing Elasticsearch, MySQL and MongoDB gives mixed results. GROUP BY on JSON gets you pretty far, but you need to watch out for pitfalls…
Scale-out, sharding, schema-less, big data: they all hide a giant developer task, namely giving users access to semi-structured data in near real-time. This can mean providing user-driven full text search, faceted search or analytics that suggest topics to users, e.g. recommender systems. Elasticsearch is a popular choice here. As much as Elasticsearch differs from both MySQL and MongoDB, and comparisons are not “fair”, it is interesting to compare the Elasticsearch aggregation framework with the other two. What is there that MySQL could learn? Which features can be done using SQL? Which comparable SQL is so tricky that it deserves to be wrapped in the DevAPI? An overview based on Elasticsearch 2.3.5, MySQL 5.7/8.0 and MongoDB 3.2.
Elasticsearch aggregations overview
The query languages of the three systems are quite different. The technology is different, the concepts differ and the terminology differs. The Elasticsearch aggregation framework speaks of three building blocks: metrics, buckets and pipelining. Because my main focus is on comparing the feature sets, I’ll use SQL to describe the terms. The SQL examples help to stay focused. Actual Elasticsearch query language examples will be given after the introduction.
The metric aggregations compute metrics over a set of documents ;-). Examples of metrics, given as MySQL SQL function names, are: MIN(), MAX(), STDDEV(), SUM() and so forth.
single value metric
|
v
SELECT MIN(price) FROM products
Elasticsearch features metric aggregations that produce one metric/value or multiple metrics/values in one go.
multi-value metric
| |
v v
SELECT MIN(price), MAX(price) FROM products
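In Python terms, a multi-value metric aggregation is simply one pass over the data producing several metrics at once. A minimal sketch on made-up data (the field name price follows the examples in this post):

```python
# Illustrative sketch: compute several metrics over one data set in one go,
# as a multi-value metric aggregation does. "products" is made-up data.
products = [{"price": 9.5}, {"price": 24.0}, {"price": 3.2}]

prices = [p["price"] for p in products]
stats = {"min": min(prices), "max": max(prices)}
print(stats)  # {'min': 3.2, 'max': 24.0}
```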
Bucket aggregations partition the data set. They describe rules for building different sets of documents into which documents then “fall”. In its simplest form, think of a GROUP BY criterion in a SQL statement. A statement like SELECT * FROM products GROUP BY size partitions all products by their size and puts rows with different size column values into different “buckets”.
partition, bucket, ...
|
v
SELECT size, COUNT(*) FROM products GROUP BY size
+------+----------+
| size | COUNT(*) |
+------+----------+
| S    |      123 |   <--- set of rows with size = S
| M    |      456 |
| ...  |      ... |
+------+----------+
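In Python terms, a bucket aggregation is a plain grouping step. Counting documents per size value mirrors the GROUP BY above (made-up data):

```python
from collections import Counter

# Illustrative sketch: a bucket aggregation partitions documents by a
# field value, like GROUP BY size does. "products" is made-up data.
products = [{"size": "S"}, {"size": "M"}, {"size": "S"}, {"size": "L"}]

buckets = Counter(p["size"] for p in products)
print(buckets)  # Counter({'S': 2, 'M': 1, 'L': 1})
```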
Pipeline aggregations operate on the result of another aggregation instead of the original data set. Imagine you run a day-trade online shop and want to know the average price of all products depending on the day they were stocked. In SQL you could write:
SELECT in_stock_since, AVG(price) FROM products GROUP BY in_stock_since
Additionally, you are interested in the average price across all the results the aggregation has produced. You could either issue a second query or “pipe” the result into a second aggregation. Here’s a SQL statement to illustrate the “pipe into”:
SELECT AVG(_avg_price) FROM (
  SELECT AVG(price) AS _avg_price
  FROM products
  GROUP BY in_stock_since
) AS _sub
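To see what the pipeline does, here is a minimal Python sketch with made-up data and field names: the first step computes the average price per stocking day, the second step averages those results.

```python
from collections import defaultdict
from statistics import mean

# Illustrative sketch: a pipeline aggregation feeds the output of one
# aggregation (average price per stocking day) into another (average of
# those averages). "products" is made-up data.
products = [
    {"in_stock_since": "2016-08-01", "price": 10.0},
    {"in_stock_since": "2016-08-01", "price": 20.0},
    {"in_stock_since": "2016-08-02", "price": 40.0},
]

by_day = defaultdict(list)
for p in products:
    by_day[p["in_stock_since"]].append(p["price"])

avg_per_day = {day: mean(prices) for day, prices in by_day.items()}
avg_of_avgs = mean(avg_per_day.values())
print(avg_per_day)  # {'2016-08-01': 15.0, '2016-08-02': 40.0}
print(avg_of_avgs)  # 27.5
```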
Beats SQL: many aggregations in one go
A neat thing about Elasticsearch is that it lets you compute the result of both of the above two aggregations using one query. Elasticsearch can return the average price per day and the average over all of them in one result set. I am unaware of a MySQL SQL statement that could do the same. Something similar can be achieved in SQL using a UNION ALL. This has two drawbacks. First, EXPLAIN hints that MySQL will scan the underlying data set twice and fail to optimize the query much. Second, the results from the two SELECT statements must be “squeezed” into one row result set. Because the two SELECT statements produce different columns, additional columns have to be added to make storing their columns in one combined result set possible. As an aside: as long as MySQL lacks JSON aggregation functions similar to GROUP_CONCAT, it is impossible to overcome the column issue by returning a nested JSON document (awareness of a problem is the first step towards a solution…).
SELECT
  in_stock_since,
  AVG(price) AS _avg_price,
  NULL AS _avg_price_total
FROM products
GROUP BY in_stock_since
UNION ALL
SELECT
  NULL AS in_stock_since,
  NULL AS _avg_price,
  AVG(_avg_price) AS _avg_price_total
FROM (SELECT
    in_stock_since,
    AVG(price) AS _avg_price
  FROM products
  GROUP BY in_stock_since) AS _sub
Executing multiple aggregations as one query is not limited to pipelined aggregations. Any kind of aggregation can be used in a query. And, any number of them may appear in a single query.
(My)SQL: multiple aggregations glued together
|
v
SELECT ... GROUP BY ... <--- bucket aggregation
UNION ALL
SELECT MAX(...) ... <--- metric aggregation
UNION ALL
SELECT ... FROM (SELECT...) <--- pipeline aggregation
UNION ALL
... <--- additional aggregations
DevAPI opportunity: go beyond SQL/CRUD
One of the things MySQL should learn from Elasticsearch is the ability to run an arbitrary number of independent aggregations on the same data set (Elasticsearch: index, MySQL: table/collection, MongoDB: collection) using only one query. Use case? Faceted search…
Possible approaches to solving the task include optimizing UNION ALL handling, giving users access to lower level aggregation features to write queries that break out of SQL’s limitations, or expanding commands into multiple SQL statements. All these things can easily be done “behind the scenes” and hidden from the users. The X Plugin and the X DevAPI are perfect vehicles to shield users from the gory details and the actual SQL. Technically, the song I’m singing is the same one MySQL has been singing for years: MySQL server plugins make it possible to innovate on the client side independently of server development constraints.
What’s new in MySQL world is the availability of a full stack (AKA MySQL Document Store) to execute on the idea.
Component     | Options to innovate
--------------+----------------------------------------------------------------
Application   |
X DevAPI      | Higher level query API
MySQL drivers | Pre-parse into command tree, translate command into a series
              | of SQL statements and aggregate results…
X Protocol    | Pipelining, send from server to client, …
X Plugin      | Rewrite SQL, use lower-level server data access APIs over SQL, …
MySQL Server  |
What’s missing in the MySQL world are ideas on what to do with the new stack – time to be the bad boy talking about features missing in MySQL.
Elasticsearch aggregation query syntax
With some key Elasticsearch terms and concepts explained and first comparisons with SQL made, it’s time to show the basic syntax of an Elasticsearch aggregation query. Elasticsearch has an HTTP query interface. All queries can be written as JSON documents and sent to the search engine via HTTP. The basic structure of an aggregation query is as follows:
"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : { [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
The aggregation part of a search query uses the key “aggregations”, or “aggs” for short, in the query document:
{ <--- query document
"aggs": { <--- aggregation part
...
}
}
A single query can use one or more aggregations. Each aggregation has a user-defined name:
{
"aggs": {
"materials": { <-- aggregation using user-defined name
}
}
}
Next comes the definition of the aggregation type and all of its relevant parameters:
{
"aggs": {
"materials": {
"terms": { <-- aggregation type
"field": "material" <-- parameters
}
}
}
}
The terms aggregation is a multi-bucket aggregation that matches GROUP BY field. Results will be grouped by the unique values found in the field. Queries are issued against an index. Think of an index as the counterpart of a table/collection in MySQL.
# curl -XGET '127.0.0.1:9200/products/_search?pretty' -d@aggs.json
Beats SQL again: aggregation and search results in one go
When the above query is executed, Elasticsearch beats SQL once more. Elasticsearch will return all documents matching the query (no filter set) plus the result of the aggregation. The SQL counterpart requires two queries. Ignoring some pagination details, the SQL equivalent of the above is:
SELECT * FROM products
SELECT material, COUNT(*) FROM products GROUP BY material
The need to use two queries has the aforementioned downsides: potentially two round trips from the client to the server, and potentially scanning the same data stream twice. The issue of two round trips could easily be solved in the MySQL Document Store by having the X Plugin execute the two queries locally and combine the results before returning them to the client, or maybe by using X Protocol pipelining. But the issue of having to scan the data stream twice is a SQL limitation and potentially a server matter.
If one is not interested in the search results but only the aggregation results, then one can set the desired search result size to zero in the Elasticsearch JSON query document.
{
"size": 0, <--- skip search results
"aggs": {
"materials": {
"terms": {
"field": "material"
}
}
}
}
# curl -XGET '127.0.0.1:9200/products/_search?pretty' -d@aggs.json
{
"took": 89, <-- 89ms / 0.089s runtime
...
"hits": {
"total": 1000000, <--- 1e6 documents scanned
...
"hits": [ <--- do not return any search results
]
},
"aggregations": {
"materials": { <--- material aggregation
...
"buckets": [
{
"key": "wool",
"doc_count": 167074
},
{
"key": "cotton",
"doc_count": 166950
},
...
]
}
}
}
Metric aggregations compared
Most, but not all, Elasticsearch metric aggregations have direct counterparts in both MySQL and MongoDB. For example, all three systems can compute the average value of an aggregated numeric field. While the syntax differs, there are no groundbreaking differences in the functionality.
Avg Aggregation

Elasticsearch:

{
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}

MySQL:

SELECT AVG(price) FROM products

MongoDB:

[
  {
    "$group": {
      "_id": null,
      "val": {
        "$avg": "$price"
      }
    }
  }
]
There are two metric aggregations that are hard to emulate in MySQL: percentiles and top hits per group. Proposals to calculate percentiles in MySQL using SQL include using GROUP_CONCAT or SQL session variables. Either approach results in SQL better not shown. Likely, the problem could easily be solved using a UDF or by adding a grouping function to stock MySQL.
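For illustration, here is a minimal Python sketch of a percentile computation using the nearest-rank method on made-up data. Elasticsearch computes percentiles with an approximate algorithm, so this is not its implementation, just the idea a SQL grouping function would have to express:

```python
import math

# Illustrative sketch only: nearest-rank percentile on made-up data.
def percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile(prices, 50))  # 5
print(percentile(prices, 95))  # 10
```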
Beats MySQL: Top hits – SQL:99 LATERAL
Calculating the top hits per group is functionality that should be highly valuable when building a faceted navigation. SQL:99 contains a feature called LATERAL which could be used for the task. Unfortunately, MySQL does not support it. The LATERAL keyword gives a subselect access to FROM clause items that appear before it. It is often described as SQL’s foreach loop. It is best explained with an example. All examples so far assumed a collection/index of product documents, or respectively a products table like:
+-------------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+----------------+
| product_id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(127) | YES | | NULL | |
| size | varchar(255) | YES | | NULL | |
| material | varchar(255) | YES | MUL | NULL | |
| usage | varchar(255) | YES | | NULL | |
| gender | varchar(255) | YES | | NULL | |
| price | double | YES | | NULL | |
| in_stock_since | datetime | YES | | NULL | |
| description | varchar(1024) | YES | | NULL | |
+-------------------+---------------+------+-----+---------+----------------+
Let’s try to fetch the five most expensive products (fabrics) for each material (cotton, silk, …) with the help of LATERAL. We will build it up from two simple SQL queries. The grouping comes first:
SELECT material
FROM products
GROUP BY material
A query to fetch the five most expensive products made of cotton is straightforward as well:
SELECT product_id
FROM products
WHERE material = 'cotton'
ORDER BY price DESC
LIMIT 5
The two queries now need to be combined in such a way that for each material found by the first query, the second query is invoked with that material to compute the top five hits. The second query needs access to values produced by the first query. And this is what LATERAL is about: it gives a subquery access to the results found by a previous query. The subquery no longer executes in isolation. The general pattern goes:
SELECT <--- for each
inner.*
FROM outer <--- result produced (e.g. distinct material)
...
JOIN <--- add results
LATERAL <--- allow access to previous results/FROM
(
SELECT ... FROM inner <--- run a subquery to add results
WHERE
inner.field = outer.result <--- based on outer query results
) AS inner
Because I have neither Postgres, Oracle nor any other database featuring SQL:99 LATERAL installed, I haven’t checked the following SQL, but it should be along the lines of the actual SQL:99 needed to solve the task. It is a little more complex to satisfy GROUP BY constraints (see also this Postgres example):
SELECT
  top_hits.*
FROM
  (SELECT material FROM products
   GROUP BY material) AS materials
JOIN LATERAL
  (SELECT product_id, price, material
   FROM products WHERE material = materials.material
   ORDER BY price DESC
   LIMIT 5) AS top_hits ON TRUE
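To make the intent of the LATERAL query concrete, here is a minimal Python sketch of the top-hits-per-group computation (made-up data; top two instead of five to keep it short):

```python
from collections import defaultdict

# Illustrative sketch of what LATERAL computes here: the top N hits per
# group, i.e. the most expensive products per material. Made-up data.
products = [
    {"product_id": 1, "material": "cotton", "price": 5.0},
    {"product_id": 2, "material": "cotton", "price": 9.0},
    {"product_id": 3, "material": "cotton", "price": 7.0},
    {"product_id": 4, "material": "silk",   "price": 30.0},
]

by_material = defaultdict(list)
for p in products:
    by_material[p["material"]].append(p)

top_hits = {
    material: sorted(rows, key=lambda r: r["price"], reverse=True)[:2]
    for material, rows in by_material.items()
}
print([p["product_id"] for p in top_hits["cotton"]])  # [2, 3]
```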
If MySQL featured LATERAL, the DevAPI could make a top hits feature available using a syntax close to SQL, or a function-based variant of the Elasticsearch query document, whichever turns out to be more powerful and convenient when writing application code. Here’s the Elasticsearch query to find the top five products grouped by material:
{
"aggs": {
"materials": {
"terms": {
"field": "material"
},
"aggs": {
"top-tag": {
"top_hits": {
"sort": {
"price": "desc"
},
"size": 5
}
}
}
}
}
}
As an aside, the Elasticsearch query is also an example of nesting aggregations. The data stream is partitioned by material and then further processed using a nested aggregation for the top hits. Judging from the few Elasticsearch queries I examined, nesting can often be mapped to SQL either using derived tables (SELECT nested FROM (SELECT outer)) or using a WHERE clause.
Bucket Aggregations
Bucket aggregations partition the data stream. Most, if not all, Elasticsearch bucket aggregations do not offer features that cannot be achieved with either MySQL or MongoDB. The Filter, Filters and Global bucket aggregations are tricky to compare with the other systems. As explained above, Elasticsearch differs from MySQL (and MongoDB) by returning search results and aggregation results together in one result set as the reply to a single query. For example, one can search for all products that have the word Beach in their title and gather distribution statistics on some product attributes like size and material. The Elasticsearch query is:
{
"query": {
"match": {
"title": "Beach"
}
},
"aggs": {
"by_size": {
"terms": {
"field": "size"
}
},
"by_material": {
"terms": {
"field": "material"
}
}
}
}
nixnutz@linux-ouxx:~/src/php-src> curl -XGET '127.0.0.1:9200/products/_search?pretty' -d@aggs.json
{
...
"hits": [ <--- all matching products
{
"_index": "products",
"_type": "product",
"_id": "86141",
"_score": 3.9986782,
"_source": {
"size": "L",
"price": 1.24,
"title": "L {ukqevpzmht jyorthop Beach",
"usage": "baby",
"gender": "female",
"material": "cotton",
"description": "",
"in_stock_since": "2016-08-01 20:08:40"
}
},
"aggregations": {
"by_size": { <--- size distribution statistics
...
{
"key": "L",
"doc_count": 194
},
...
"by_material": { <--- material distribution statistics
...
A SQL query that comes somewhat close to it is made of three individual queries glued together with UNION ALL (see above). The Filter bucket aggregation can be used to filter values out of one of the aggregations. Maybe we only need to count how many of the matching products are available in size “M”:
{
"query": {
"match": {
"title": "Beach"
}
},
"aggs": {
"by_size": {
"filter": { <-- filter bucket
"term": {
"size": "M"
}
}
},
...
In SQL the filter condition would be added to the WHERE clause of the corresponding query in the UNION ALL statement.
SELECT ... FROM products WHERE MATCH(title) AGAINST('Beach')
UNION ALL
SELECT ... FROM products WHERE size = 'M' AND ... <-- filter bucket condition
UNION ALL
SELECT ... FROM products WHERE ... GROUP BY material
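The semantics of the filter bucket are easy to sketch in Python: it is just a count over the documents that satisfy a condition, like the WHERE size = 'M' branch above (made-up data):

```python
# Illustrative sketch: a filter bucket counts the documents matching a
# condition, equivalent to a WHERE clause. "products" is made-up data.
products = [{"size": "M"}, {"size": "S"}, {"size": "M"}]

m_count = sum(1 for p in products if p["size"] == "M")
print(m_count)  # 2
```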
n/a (yes) – apples and oranges…
Although the Filter, Filters and Global aggregations can be emulated in SQL, I marked them as “n/a (yes)” in the feature comparison table. In the end, it’s comparing apples and oranges. The wonderful, unique Elasticsearch feature of returning aggregation values together with search results has no direct counterpart in the SQL world. The Elasticsearch global aggregation is specifically tailored to this different way of operation. It breaks an aggregation out of the current search context and allows it to perform an aggregation on the entire data set. The below query calculates the average price of the products matching “Beach” in their title and the average price of all products:
{
"query": {
"match": {
"title": "Beach" <-- search products matching "Beach"
}
},
"aggs": {
"avg_price_beach": {
"avg": {
"field": "price" <-- average price for "Beach" products
}
},
"avg_price_all": {
"global": {
},
"aggs": {
"avg_price": {
"avg": {
"field": "price" <-- average price for all products
}
}
}
}
}
}
The SQL equivalent is:
SELECT
COUNT(*) AS _hits, AVG(price) AS _avg_beach,
(SELECT AVG(price) FROM products) AS _avg_all <-- break out search context
FROM products
WHERE MATCH(title) AGAINST ("Beach");
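A Python sketch of the global aggregation semantics, with the full-text match emulated by a plain substring test (made-up data): one metric is computed over the matching documents, the other over the entire data set.

```python
from statistics import mean

# Illustrative sketch: a global aggregation breaks out of the search
# context. One average over the "Beach" matches, one over all products.
# "products" is made-up data; the match is emulated with a substring test.
products = [
    {"title": "Beach towel", "price": 10.0},
    {"title": "Beach ball",  "price": 20.0},
    {"title": "Wool sock",   "price": 4.0},
]

matching = [p["price"] for p in products if "Beach" in p["title"]]
result = {
    "avg_beach": mean(matching),                     # search context
    "avg_all": mean(p["price"] for p in products),   # entire data set
}
print(result)
```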
Histogram and Range aggregations
The group of histogram and range aggregations in Elasticsearch could be described as convenience functionality from a SQL database vendor’s perspective. Take the histogram aggregation as an example:
{
"aggs": {
"prices": {
"histogram": {
"field": "price",
"interval": 10
}
}
}
}
The SQL equivalent is:
SELECT
COUNT(*) AS _num,
IF(
price % 10 < 0,
price - (10 + (price % 10)),
price - price % 10
) AS _bucket
FROM
products
GROUP BY _bucket
+----------+---------+
| _num | _bucket |
+----------+---------+
| 168697 | 0 |
| 243999 | 10 |
| 288138 | 20 |
| 116725 | 30 |
| 77041 | 40 |
| 50007 | 50 |
| 34483 | 60 |
| 16487 | 70 |
| 3866 | 80 |
| 557 | 90 |
+----------+---------+
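For non-negative prices, the bucketing rule boils down to rounding each value down to a multiple of the interval. A Python sketch on made-up data:

```python
from collections import Counter

# Illustrative sketch of the histogram aggregation: assign each price to
# a fixed-width bucket (interval 10) by rounding down to a multiple of
# the interval. "prices" is made-up data; negative values are not handled.
prices = [3.5, 12.0, 17.9, 25.0, 99.9]

interval = 10
buckets = Counter(int(p // interval) * interval for p in prices)
print(sorted(buckets.items()))  # [(0, 1), (10, 2), (20, 1), (90, 1)]
```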
Unquestionably, SQL can deliver. Unquestionably, using IF is no rocket science. And nobody should pollute an API with convenience functions. But there is more than convenience here: MySQL evaluates the IF for every row, and that is slow. How slow? That’s for the next blog posting.
Happy hacking!
@Ulf_Wendel