Ulf Wendel

MySQL Document Store: unstructured data, unstructured search


Storing documents is convenient: no need to define a schema up-front, no downtime for schema changes, no normalization, no slow joins – true or not, you name it. But what about search if you do not know how your data is structured? For example, the famous SFW construct (SELECT … FROM … WHERE) requires listing the columns to search: SELECT … FROM … WHERE some_column = 'Jippie'. Given that JSON data may have no schema, how do you write a query without knowing any field names? Where is SELECT … FROM … WHERE * = 'Jippie'? JSON_SEARCH() gets you started, but there are gaps if you think about it for a minute.

There are many reasons why you may not know the structure of the data you operate on. Maybe you have gathered documents using different "schema versions" over time, or maybe there simply is no common structure because the data comes from a variety of different sources. This poses two challenges:

  • how to search not knowing:
    • field names
    • table/collection/index names
  • how to optimize search and storage not knowing:
    • schema/structure of documents
    • cardinalities and space requirements of fields

As we will see, the MySQL Document Store does not have answers to all questions.

Search not knowing field names: JSON_SEARCH()

The SQL language is made for typed and well-structured data. The ANSI/ISO take on JSON does not try to integrate it deeply into SQL's type system but uses functions and a path language to work on JSON. A function-based approach is a low-risk approach for vendors, and it works for this particular task too. JSON_SEARCH() scans a document for matching strings.

All of the following examples use two identical tables holding 50,000 and 1,000,000 randomly generated JSON product descriptions respectively. The table definition is very basic:

CREATE TABLE products (
  product_id INT NOT NULL AUTO_INCREMENT,
  product JSON DEFAULT NULL,
  PRIMARY KEY(product_id)
)
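
For illustration, a generated document looks roughly like this (an assumed sample; the real documents came out of a random generator):

-- hypothetical sample row, field names mirror the query results below
INSERT INTO products (product) VALUES (
  '{"title": "Beach", "description": "A red bag", "size": "XL",
    "material": "cotton", "in_stock_since": "2016-04-01"}'
);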

JSON_SEARCH() takes a JSON document, a one/all flag and a search term, and returns the path to the matching value. The search term supports the same wildcards as LIKE: you can use "%" to match any number of characters and "_" to match one arbitrary character:

SELECT 
  JSON_SEARCH(product, "all", "Beach") -- search the "document"
FROM 
  products -- table/collection/index is known
WHERE 
  JSON_SEARCH(product, "all", "Beach") IS NOT NULL

+--------------------------------------+
| JSON_SEARCH(product, "all", "Beach") |
+--------------------------------------+
| "$.title"                            |
+--------------------------------------+


SELECT 
  JSON_SEARCH(
    product, 
    "all", -- return path to all matches
    "%b%"
  ) 
FROM 
  products 
WHERE 
  JSON_SEARCH(product, "one", "%b%") IS NOT NULL
LIMIT 2

+------------------------------------+
| JSON_SEARCH(product, "all", "%b%") |
+------------------------------------+
| ["$.title", "$.description"]       |
| ["$.title", "$.description"]       |
+------------------------------------+


The function can either return the path to the first match in a document or a list of path expressions to all matches. The return value is of type JSON. This means there's a bit of type juggling required if you want to use the path to actually fetch the matching field value:

SELECT
  JSON_EXTRACT(
    product, 
    JSON_UNQUOTE(
      JSON_SEARCH(
        product, 
        "one", 
        "Beach"
      )
    )
  ) AS _match
FROM 
 products
WHERE 
 JSON_SEARCH(product, "all", "Beach") IS NOT NULL
 
+---------+
| _match  |
+---------+
| "Beach" |
+---------+


I am unaware of any SQL statement which gives you the values for all matching path expressions returned by JSON_SEARCH(…, "all", …). At the time of writing MySQL lacks a group of nest/unnest operators that can extract rows from JSON arrays/objects. Once such functionality is available it should become possible to extract all matching values.
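
As a sketch of where this could go: written against the JSON_TABLE unnest operator that later appeared in MySQL 8.0 (not available in the MySQL version discussed here), extracting all matching values might look like this:

SELECT
  JSON_EXTRACT(p.product, t.path) AS _match
FROM
  products AS p,
  JSON_TABLE(
    JSON_SEARCH(p.product, "all", "Beach"), -- array of matching paths
    -- a single match returns a scalar path; path auto-wrapping
    -- treats it as a one-element array
    "$[*]" COLUMNS (path VARCHAR(255) PATH "$")
  ) AS t
WHERE
  JSON_SEARCH(p.product, "one", "Beach") IS NOT NULL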

There are two more issues with JSON_SEARCH(): it is limited to string matching and it is slow. If, for example, you know that somewhere in your JSON documents there is a field that holds a numeric value representing a year, but you don't know the field/path, then searching for it becomes tricky and complex.
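
A sketch of the string-only issue, assuming a hypothetical numeric field first_sold: JSON_SEARCH() compares the search term against string scalars only, so a number never matches:

-- assume some document contains {"first_sold": 1974} (a number)
SELECT
  JSON_SEARCH(product, "one", "1974") -- NULL: 1974 is not the string "1974"
FROM
  products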

JSON string search using a fulltext index

Given that JSON_SEARCH() is limited to string matching, one may be tempted to replace it with a fulltext index search. A search based on a fulltext index is much faster than the full table scan performed by JSON_SEARCH(). Problem is, MySQL does not support creating a fulltext index over an entire JSON document:

ALTER TABLE products ADD FULLTEXT idx_ft_product(product);
ERROR 3152 (42000): JSON column 'product' cannot be used in key specification.


A possible workaround is to extract all string fields from a JSON column and concatenate their values. Ideally, this would be possible using a virtual generated column to save disk space. Problem is, MySQL will not create a fulltext index on a virtual generated column:

ALTER TABLE 
 products  
ADD
 facet_text TEXT GENERATED ALWAYS AS(
   CONCAT_WS(
     " ",
     JSON_UNQUOTE(
       JSON_EXTRACT(
         product, "$.title"
       )
     ),
     JSON_UNQUOTE(
       JSON_EXTRACT(
         product, "$.description"
       )
     )
   )
 )
 
ALTER TABLE 
 products
ADD FULLTEXT idx_ft_facet_text(facet_text);
ERROR 3106 (HY000): 'Fulltext index on virtual generated column' is not supported for generated columns.


This problem can be solved using a stored generated column, but we are running in circles: the task is to find information in unstructured data. How do you know which values to extract for indexing if you do not know the structure? Elasticsearch solves it by flipping the logic upside down: it indexes all strings from a JSON document unless you tell it to ignore some of them.
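
For completeness, a sketch of the stored variant. Unlike the virtual column it costs disk space, but the fulltext index is accepted:

-- drop the virtual column from above and re-add it as STORED;
-- changing VIRTUAL to STORED in place is not permitted
ALTER TABLE products DROP COLUMN facet_text;

ALTER TABLE products ADD facet_text TEXT GENERATED ALWAYS AS(
  CONCAT_WS(
    " ",
    JSON_UNQUOTE(JSON_EXTRACT(product, "$.title")),
    JSON_UNQUOTE(JSON_EXTRACT(product, "$.description"))
  )
) STORED;

ALTER TABLE products ADD FULLTEXT idx_ft_facet_text(facet_text);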

I wouldn’t be surprised to learn the MySQL JSON fulltext indexing issue can be solved with something as simple as a custom fulltext parser.

Figuring out the string fields in JSON

There is no solution, but at least you can figure out which fields to extract and index by analysing a given data set. Looking at existing data does not help you with future unstructured data, but it is better than nothing. Recall that JSON_SEARCH() accepts wildcards and returns path expressions. Use "%" to produce a list of all matching path expressions. With the help of these path expressions it becomes possible to create a stored generated column and add a fulltext index on it.

SELECT 
  COUNT(*), 
  JSON_SEARCH(product, "all", "%") AS _string_paths 
FROM 
  products 
GROUP BY 
  _string_paths
  
+----------+-------------------------------------------------------------------------------------------------+
| COUNT(*) | _string_paths                                                                                   |
+----------+-------------------------------------------------------------------------------------------------+
|        1 | ["$.title", "$.description"]                                                                    |
|     3256 | ["$.size", "$.title", "$.material", "$.description", "$.in_stock_since"]                        |
|    13296 | ["$.size", "$.title", "$.usage", "$.material", "$.description", "$.in_stock_since"]             |
|     6840 | ["$.size", "$.title", "$.gender", "$.material", "$.description", "$.in_stock_since"]            |
|    26607 | ["$.size", "$.title", "$.usage", "$.gender", "$.material", "$.description", "$.in_stock_since"] |
+----------+-------------------------------------------------------------------------------------------------+


Be warned that this results in a full table scan and the aggregation may cause the use of a temporary table. If your data set is large and your disks are slow, like on my VM, mind your queries.
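
EXPLAIN makes the cost visible; the plan should report a full scan (type: ALL) plus a temporary table for the GROUP BY:

EXPLAIN SELECT
  COUNT(*),
  JSON_SEARCH(product, "all", "%") AS _string_paths
FROM
  products
GROUP BY
  _string_paths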

How to manage what you don’t know?

From this little example it becomes clear how valuable information about the structure or schema of existing JSON data is. Query optimizers like to use statistics to decide on the optimal query plan, and I can imagine that a DBA would be interested in such information as well. The little JSON_SEARCH() trick tells me which string fields there are in my JSON data. But what about searching all documents that have a number holding the value 1974? Or, how do you notice that an array inside a document has grown to a size at which refactoring the storage would be beneficial?
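
Take the array question: JSON_LENGTH() can measure an array, but it needs a path, which is exactly the missing piece of information. A sketch, assuming a hypothetical "$.variations" array field:

SELECT
  product_id,
  JSON_LENGTH(product, "$.variations") AS _array_size -- path must be known
FROM
  products
ORDER BY
  _array_size DESC
LIMIT 10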

A SQL user has the INFORMATION_SCHEMA at hand to learn about tables, columns and storage space requirements, but JSON is a black box in that world. JSON columns report the total or average size of the entire documents they hold, but there is no way to look inside them other than fetching all the documents and building up some statistics on the application side.
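
For example, the catalog views stop at the column boundary. The following tells you how big the products table and its rows are on average, but nothing about the fields inside the JSON documents:

SELECT
  TABLE_NAME,
  AVG_ROW_LENGTH, -- average size of the entire row/document
  DATA_LENGTH
FROM
  INFORMATION_SCHEMA.TABLES
WHERE
  TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = "products"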

Lastly, what if you do not even know the table to search?

If you take it to the extreme, why can I not do: SELECT … FROM * WHERE * = 'Oh, no'? Why do I have to name the table? Imagine joins between collections were supported. You could choose whether you want to nest data and stuff everything into one document, or normalize your data to avoid duplication and update anomalies. The very moment you normalize your data, you lose the functionality offered by JSON_SEARCH(): it can search all the strings in your nested data, but once you spread the nested data over two collections, you have to run two queries using JSON_SEARCH() against the two collections holding the data and merge the results, as shown below. How inconvenient… why not allow searching tables "linked" by foreign keys using one query?
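
To illustrate, assume the data was normalized into products and a hypothetical product_reviews collection. Searching both takes two JSON_SEARCH() queries, here glued together with UNION ALL:

SELECT
  product_id,
  JSON_SEARCH(product, "all", "Beach") AS _paths
FROM
  products
WHERE
  JSON_SEARCH(product, "one", "Beach") IS NOT NULL
UNION ALL
SELECT
  product_id,
  JSON_SEARCH(review, "all", "Beach")
FROM
  product_reviews -- hypothetical second collection
WHERE
  JSON_SEARCH(review, "one", "Beach") IS NOT NULL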

That use case is a little aside from the original question as it assumes some structure in your data, once again. However, the question remains valid: do we need unstructured search given our unstructured data?

Happy hacking!

