Faceted search or faceted navigation is a highly praised and widely use search pattern. And, it is a great reply to an off the records sales engineering question. MySQL finally has some document store features built-in. A bit of a yawn in 2016. There is a new X DevAPI available with some Connectors. A bit of a yawn technically. But it is a non-technical change of mind: developer centric counts! Sales, all, technical value could show at non-trivial developer tasks, like faceted search.
Todays X DevAPI does not get you very far
There are great stories to tell about the X DevAPI, see MySQL 5.7.12 – Part 3: More Than “Just” SQL?:
- Non-blocking API
- CRUD API for schemaless documents in collections and schemaful rows in tables
- SQL support
- Prepared for “we need new APIs” for distributed databases
Most importantly the “look and feel” is similar to comparable offerings from NoSQL vendors. Competitive offerings have been described as easy, natural, developer friendly. We tried to follow these maximes. Albeit a leap step forward the feature set of the first version of the X DevAPI is limited. It works for basic apps like the demo app. But instead of adding a bulk of additional features we broaden our view to achieve the goal of improving developer ease of use on a whole:
- Out-of-the box experience
- X Shell – new commandline shell which supports scripting
- User guides, tutorials, demo app
- …
Selling point readability
The off the records sales engineers question was: why would one use the X DevAPI, given that it is only a dump SQL wrapper? Let a developer write some code to fetch all products from a collection that have a price higher than 1.2 Euros:
products.find("price > 1.2").exeucte();
Upon execution, the Connectors (drivers) send the query to the X server plugin which translates it to some SQL and executes it. The SQL statement makes use of ANSI/ISO SQL standard JSON features which MySQL 5.7.12+ supports:
SELECT product FROM products WHERE JSON_UNQUOTE(JSON_EXTRACT(product, "$.price")) > 1.2;
There is no technical value in this. Schemaless? Works with SQL. JSON processing? Works with SQL, too. The X DevAPI selling point is readability. But none of the available features today and none of feature on the short term roadmap has the potential to add more value to the X DevAPI.
Faceted search: let daddy sew a coat
Faceted search supports exploring large amounts of data by displaying summaries about various partitions of the data and later allowing to narrow the navigation to a specific partition. Let there be a father that wants to sew a rain coat exactly as the one below for his one year old daughter.
Daddy opens his web browser in the middle of the night after having completed all baby care and feeding duties and starts to search for fabric in an online retailer. But what to search for, how to quicky find the best fabric out of 100.000 offered by the online store? Many stores will display a search box and a faceted navigation at the left side. Possible facets (partitions, dimensions) are:
- Material
- Blends (16647)
- Cotton (16762)
- …
- Usage
- Baby (19913)
- Dress (20005)
- …
- Price
- Less than 10 Euros (13815)
- 10-20 Euros (16207)
- …
A faceted document search using SQL
Assume the products are stored in the database using (JSON) documents. All documents are in JSON column called “product” of a table called “products”. How to get the totals for all the facets using SQL?
{ "title": "Beach motive M182-16" "description": ... "weight": ... "material": "cotton" ... }
The SQL for one facet is pretty much straight-forward:
SELECT JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) AS facet_value, COUNT(*) AS facet_count FROM products WHERE JSON_EXTRACT(product, "$.material") IS NOT NULL GROUP BY JSON_EXTRACT(product, "$.material")
To get the values for more than one facet using no more than one query call in your application, combine the results for each facet using UNION ALL. Add a column “facet” to mark the orgininating facet in the combined row result.
SELECT "material" AS facet, JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) AS facet_value, COUNT(*) AS facet_count FROM products WHERE JSON_EXTRACT(product, "$.material") IS NOT NULL GROUP BY JSON_EXTRACT(product, "$.material") UNION ALL SELECT "usage" AS facet, JSON_UNQUOTE(JSON_EXTRACT(product, "$.usage")) AS facet_value, COUNT(*) AS facet_count FROM products WHERE JSON_EXTRACT(product, "$.usage") IS NOT NULL GROUP BY JSON_EXTRACT(product, "$.usage")
Speeding it up and shortening the SQL
There’s quite a bit to come and the length of the SQL statement will grow. Let’s shorten it a bit by adding generated columns (5.7.6) and speed things up using indicies. Long story short, here’s the CREATE TABLE statement I’m using for this blog post:
CREATE TABLE `products` ( `product_id` int(11) NOT NULL AUTO_INCREMENT, `product` json DEFAULT NULL, `facet_size` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.size'))) VIRTUAL, `facet_material` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.material'))) VIRTUAL, `facet_usage` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.usage'))) VIRTUAL, `facet_gender` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.gender'))) VIRTUAL, `facet_price` double GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.price'))) VIRTUAL, PRIMARY KEY (`product_id`), KEY `idx_facet_size` (`facet_size`), KEY `idx_facet_material` (`facet_material`), KEY `idx_facet_usage` (`facet_usage`), KEY `idx_facet_gender` (`facet_gender`), KEY `idx_face_price` (`facet_price`) )
Using the above the query to gather one facets totals is shortened to:
SELECT facet_material AS facet_value, count(*) AS facet_count FROM products WHERE facet_material IS NOT NULL GROUP BY facet_material
A range based face: price
The price facet in the fabric online shop is range based. In Germany, the home of the example daddy, fabrics are sold per meter. The typical price of a fabric ranges from a few Euros up to 100 Euros with the majority topping around 40 Euros. The show owner wants to display totals for prices from 0..10, 10..20, 20..50, 50+ Euros. Note that the ranges are of different size.
Of course, SQL can deliver! For example, use a subquery and CASE expressions. Stackoverflow will guide a developer within seconds.
SELECT "price" AS facet, fpsub.facet_price_range AS facet_value, COUNT(*) AS facet_count FROM (SELECT (CASE WHEN facet_price BETWEEN 0 AND 10 THEN "0-10" WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20" WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50" ELSE "50+" END) AS facet_price_range FROM products WHERE facet_price IS NOT NULL) AS fpsub GROUP BY fpsub.facet_price_range
Combine all the individual facet queries using UNION ALL and you have the basic search.
Refinement: user selects a facet value
Example Daddy made up his mind on the material for the inner jacket, selects “cotton” and enters “Beach” into the search box. Likely, the users expectation is that the selection will have no impact on the totals shown for any other material but cotton. In other words the material face query needs to be split in two: one to count the total for “cotton” and “Beach” and another one for the totals of all the other facet values. Note that I ignore the problem of full text search and use LIKE exactly how one should not use it.
SELECT facet_material, count(*) FROM products WHERE facet_material = "cotton" AND JSON_UNQUOTE(JSON_EXTRACT(product, '$.title')) LIKE "%Beach%" GROUP BY facet_material UNION ALL SELECT facet_material, COUNT(*) FROM products WHERE facet_material != "cotton" AND facet_material IS NOT NULL GROUP BY facet_material
The grand total
Proof: you can use SQL for the job.
SELECT "material" AS facet, facet_material AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_material = "cotton" AND JSON_UNQUOTE(JSON_EXTRACT(product, '$.title')) LIKE "%Beach%" GROUP BY facet_material UNION ALL SELECT "material" AS facet, facet_material AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_material IS NOT NULL AND facet_material != "cotton" GROUP BY facet_material UNION ALL SELECT "price" AS facet, fpsub.facet_price_range AS facet_value, COUNT(*) AS facet_count FROM (SELECT (CASE WHEN facet_price BETWEEN 0 AND 10 THEN "0-10" WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20" WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50" ELSE "50+" END) AS facet_price_range FROM products WHERE facet_price IS NOT NULL) AS fpsub GROUP BY fpsub.facet_price_range UNION ALL SELECT "size" AS facet, facet_size AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_size IS NOT NULL GROUP BY facet_size UNION ALL SELECT "usage" AS facet, facet_usage AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_usage IS NOT NULL GROUP BY facet_usage UNION ALL SELECT "gender" AS facet, facet_gender AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_gender IS NOT NULL GROUP BY facet_gender ORDER BY facet, facet_value +----------+------------------+-------------+ | facet | facet_value | facet_count | +----------+------------------+-------------+ | gender | female | 33387 | | gender | male | 33327 | | material | blends | 16647 | | material | designer fabrics | 16703 | | material | knits | 16739 | | material | silk | 16594 | | material | wool | 16555 | | price | 0-10 | 13815 | | price | 10-20 | 16207 | | price | 20-50 | 55668 | | price | 50+ | 14310 | | size | 114cm | 14464 | | size | 140cm | 14366 | | size | L | 14009 | | size | M | 14303 | | size | S | 14327 | | size | XL | 14211 | | size | XXL | 14320 | | usage | baby | 19913 | | usage | dress | 20005 | | usage | inside | 19929 | | usage | outdoor | 20166 | +----------+------------------+-------------+ 22 rows in set (0,34 sec)
The execution time of 0.34s? Well, this is a source build of MySQL running in a VM on a notebook. No server parameters set. Inacceptable result for interactive search experience.
How the DevAPI could make a difference
A DevAPI counterpart of the above could read very different.
products .find("title LIKE :search") .facets({ "usage", "material", "gender", "size", "price" { "0-10": {"min": 0, "max": 10"} ... } .bind("search", "%Beach%")
As a developer, I could immediately grasp what the higher level DevAPI does. I would have no clue what the SQL is about if the column aliases would not hint me. Note also that the DevAPI and the X plugin are not limited to SQL. They could, if they want, use the lower level MySQL internal data access APIs. The internal APIs can be faster than SQL, way faster. SQL runs on top of them.
Dear Sales, All, the DevAPI is only a tiny bit of the MySQL document store story. You may see it as a dump SQL wrapper of little value. Or, as a chance to offer an alternative query API that solves web developer problems much easier than raw SQL could do.
Happy hacking!