Ulf Wendel

Faceted search, why the DevAPI could matter one day

| 0 comments

Faceted search or faceted navigation is a highly praised and widely use search pattern. And, it is a great reply to an off the records sales engineering question. MySQL finally has some document store features built-in. A bit of a yawn in 2016. There is a new X DevAPI available with some Connectors. A bit of a yawn technically. But it is a non-technical change of mind: developer centric counts! Sales, all, technical value could show at non-trivial developer tasks, like faceted search.

Todays X DevAPI does not get you very far

There are great stories to tell about the X DevAPI, see MySQL 5.7.12 – Part 3: More Than “Just” SQL?:

  • Non-blocking API
  • CRUD API for schemaless documents in collections and schemaful rows in tables
  • SQL support
  • Prepared for “we need new APIs” for distributed databases

Most importantly the “look and feel” is similar to comparable offerings from NoSQL vendors. Competitive offerings have been described as easy, natural, developer friendly. We tried to follow these maximes. Albeit a leap step forward the feature set of the first version of the X DevAPI is limited. It works for basic apps like the demo app. But instead of adding a bulk of additional features we broaden our view to achieve the goal of improving developer ease of use on a whole:

Selling point readability

The off the records sales engineers question was: why would one use the X DevAPI, given that it is only a dump SQL wrapper? Let a developer write some code to fetch all products from a collection that have a price higher than 1.2 Euros:

products.find("price > 1.2").exeucte();


Upon execution, the Connectors (drivers) send the query to the X server plugin which translates it to some SQL and executes it. The SQL statement makes use of ANSI/ISO SQL standard JSON features which MySQL 5.7.12+ supports:

SELECT product FROM products WHERE
   JSON_UNQUOTE(JSON_EXTRACT(product, "$.price")) > 1.2;


There is no technical value in this. Schemaless? Works with SQL. JSON processing? Works with SQL, too. The X DevAPI selling point is readability. But none of the available features today and none of feature on the short term roadmap has the potential to add more value to the X DevAPI.

Faceted search: let daddy sew a coat

Faceted search supports exploring large amounts of data by displaying summaries about various partitions of the data and later allowing to narrow the navigation to a specific partition. Let there be a father that wants to sew a rain coat exactly as the one below for his one year old daughter.

Self-made baby rain coat

Daddy opens his web browser in the middle of the night after having completed all baby care and feeding duties and starts to search for fabric in an online retailer. But what to search for, how to quicky find the best fabric out of 100.000 offered by the online store? Many stores will display a search box and a faceted navigation at the left side. Possible facets (partitions, dimensions) are:

  • Material
    • Blends (16647)
    • Cotton (16762)
  • Usage
    • Baby (19913)
    • Dress (20005)
  • Price
    • Less than 10 Euros (13815)
    • 10-20 Euros (16207)

A faceted document search using SQL

Assume the products are stored in the database using (JSON) documents. All documents are in JSON column called “product” of a table called “products”. How to get the totals for all the facets using SQL?

{
  "title": "Beach motive M182-16"
  "description": ...  
  "weight": ...
  "material": "cotton"
  ...
}


The SQL for one facet is pretty much straight-forward:

SELECT 
  JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) AS facet_value, 
  COUNT(*) AS facet_count 
FROM products 
WHERE JSON_EXTRACT(product, "$.material") IS NOT NULL 
GROUP BY JSON_EXTRACT(product, "$.material")


To get the values for more than one facet using no more than one query call in your application, combine the results for each facet using UNION ALL. Add a column “facet” to mark the orgininating facet in the combined row result.

  SELECT 
    "material" AS facet, 
    JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) AS facet_value, 
    COUNT(*) AS facet_count 
  FROM products 
  WHERE JSON_EXTRACT(product, "$.material") IS NOT NULL 
  GROUP BY JSON_EXTRACT(product, "$.material") 
UNION ALL 
  SELECT 
   "usage" AS facet, 
   JSON_UNQUOTE(JSON_EXTRACT(product, "$.usage")) AS facet_value, 
   COUNT(*) AS facet_count 
  FROM products 
  WHERE JSON_EXTRACT(product, "$.usage") IS NOT NULL 
  GROUP BY JSON_EXTRACT(product, "$.usage")

Speeding it up and shortening the SQL

There’s quite a bit to come and the length of the SQL statement will grow. Let’s shorten it a bit by adding generated columns (5.7.6) and speed things up using indicies. Long story short, here’s the CREATE TABLE statement I’m using for this blog post:

CREATE TABLE `products` (
  `product_id` int(11) NOT NULL AUTO_INCREMENT,
  `product` json DEFAULT NULL,
  `facet_size` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.size'))) VIRTUAL,
  `facet_material` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.material'))) VIRTUAL,
  `facet_usage` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.usage'))) VIRTUAL,
  `facet_gender` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.gender'))) VIRTUAL,
  `facet_price` double GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.price'))) VIRTUAL,
  PRIMARY KEY (`product_id`),
  KEY `idx_facet_size` (`facet_size`),
  KEY `idx_facet_material` (`facet_material`),
  KEY `idx_facet_usage` (`facet_usage`),
  KEY `idx_facet_gender` (`facet_gender`),
  KEY `idx_face_price` (`facet_price`)
)

Using the above the query to gather one facets totals is shortened to:

SELECT 
  facet_material AS facet_value, 
  count(*) AS facet_count 
FROM products 
WHERE facet_material IS NOT NULL 
GROUP BY facet_material

A range based face: price

The price facet in the fabric online shop is range based. In Germany, the home of the example daddy, fabrics are sold per meter. The typical price of a fabric ranges from a few Euros up to 100 Euros with the majority topping around 40 Euros. The show owner wants to display totals for prices from 0..10, 10..20, 20..50, 50+ Euros. Note that the ranges are of different size.

Of course, SQL can deliver! For example, use a subquery and CASE expressions. Stackoverflow will guide a developer within seconds.

SELECT 
  "price" AS facet, 
  fpsub.facet_price_range AS facet_value, 
  COUNT(*) AS facet_count 
FROM 
  (SELECT 
    (CASE 
      WHEN facet_price BETWEEN 0 AND 10 THEN "0-10"
      WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20" 
      WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50" 
      ELSE "50+" 
    END) AS facet_price_range 
  FROM products 
  WHERE facet_price IS NOT NULL) AS fpsub
GROUP BY fpsub.facet_price_range


Combine all the individual facet queries using UNION ALL and you have the basic search.

Refinement: user selects a facet value

Example Daddy made up his mind on the material for the inner jacket, selects “cotton” and enters “Beach” into the search box. Likely, the users expectation is that the selection will have no impact on the totals shown for any other material but cotton. In other words the material face query needs to be split in two: one to count the total for “cotton” and “Beach” and another one for the totals of all the other facet values. Note that I ignore the problem of full text search and use LIKE exactly how one should not use it.

  SELECT 
    facet_material, 
    count(*) 
  FROM products 
  WHERE 
    facet_material = "cotton" AND 
    JSON_UNQUOTE(JSON_EXTRACT(product, '$.title')) LIKE "%Beach%"  
  GROUP BY facet_material 
UNION ALL 
  SELECT 
    facet_material, 
    COUNT(*) 
  FROM products 
  WHERE 
   facet_material != "cotton" AND
   facet_material IS NOT NULL   
  GROUP BY facet_material

The grand total

Proof: you can use SQL for the job.

SELECT 
    "material" AS facet,
    facet_material AS facet_value, 
    COUNT(*) AS facet_count
  FROM products 
  WHERE 
    facet_material = "cotton" AND 
    JSON_UNQUOTE(JSON_EXTRACT(product, '$.title')) LIKE "%Beach%"  
  GROUP BY facet_material 
UNION ALL 
  SELECT
    "material" AS facet,
    facet_material AS facet_value, 
    COUNT(*) AS facet_count
  FROM products 
  WHERE 
   facet_material IS NOT NULL AND
   facet_material != "cotton" 
  GROUP BY facet_material
UNION ALL
  SELECT 
    "price" AS facet, 
    fpsub.facet_price_range AS facet_value, 
    COUNT(*) AS facet_count 
  FROM 
    (SELECT 
      (CASE 
        WHEN facet_price BETWEEN 0 AND 10 THEN "0-10"
        WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20" 
        WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50" 
        ELSE "50+" 
      END) AS facet_price_range 
    FROM products
    WHERE facet_price IS NOT NULL) AS fpsub
  GROUP BY fpsub.facet_price_range
UNION ALL
  SELECT
   "size" AS facet,
   facet_size AS facet_value,
   COUNT(*) AS facet_count
  FROM products
  WHERE
    facet_size IS NOT NULL
  GROUP BY
    facet_size
UNION ALL
  SELECT 
    "usage" AS facet,
    facet_usage AS facet_value,
    COUNT(*) AS facet_count
  FROM
    products
  WHERE
    facet_usage IS NOT NULL
  GROUP BY 
    facet_usage
UNION ALL
  SELECT    
    "gender" AS facet,
    facet_gender AS facet_value,
    COUNT(*) AS facet_count
  FROM 
    products
  WHERE
    facet_gender IS NOT NULL
  GROUP BY
    facet_gender
ORDER BY 
 facet, 
 facet_value

+----------+------------------+-------------+
| facet    | facet_value      | facet_count |
+----------+------------------+-------------+
| gender   | female           |       33387 |
| gender   | male             |       33327 |
| material | blends           |       16647 |
| material | designer fabrics |       16703 |
| material | knits            |       16739 |
| material | silk             |       16594 |
| material | wool             |       16555 |
| price    | 0-10             |       13815 |
| price    | 10-20            |       16207 |
| price    | 20-50            |       55668 |
| price    | 50+              |       14310 |
| size     | 114cm            |       14464 |
| size     | 140cm            |       14366 |
| size     | L                |       14009 |
| size     | M                |       14303 |
| size     | S                |       14327 |
| size     | XL               |       14211 |
| size     | XXL              |       14320 |
| usage    | baby             |       19913 |
| usage    | dress            |       20005 |
| usage    | inside           |       19929 |
| usage    | outdoor          |       20166 |
+----------+------------------+-------------+
22 rows in set (0,34 sec)


The execution time of 0.34s? Well, this is a source build of MySQL running in a VM on a notebook. No server parameters set. Inacceptable result for interactive search experience.

How the DevAPI could make a difference

A DevAPI counterpart of the above could read very different.

products
  .find("title LIKE :search")
  .facets({
   "usage", 
   "material", 
   "gender", 
   "size",
   "price" {
     "0-10": {"min": 0, "max": 10"}
     ...
   }
  .bind("search", "%Beach%")


As a developer, I could immediately grasp what the higher level DevAPI does. I would have no clue what the SQL is about if the column aliases would not hint me. Note also that the DevAPI and the X plugin are not limited to SQL. They could, if they want, use the lower level MySQL internal data access APIs. The internal APIs can be faster than SQL, way faster. SQL runs on top of them.

Dear Sales, All, the DevAPI is only a tiny bit of the MySQL document store story. You may see it as a dump SQL wrapper of little value. Or, as a chance to offer an alternative query API that solves web developer problems much easier than raw SQL could do.

Happy hacking!

@Ulf_Wendel Follow me on Twitter

Leave a Reply

Required fields are marked *.