Ulf Wendel

Searching document stores in 2013: from 1983 to SQL:2003 in a blink?

I love the new NoSQL systems: more choices! After years of RDBMS dominance there are hundrets of NoSQL systems offering a wide range of data models, data distribution strategies and interfaces. Polyglot persistence describes the market change. I am most fascinated by document stores: nested data and data distribution go hand-in-hand. Nested data, finally. And, for those who like it: schemaless or even schemafree. Maybe something to learn for MySQL? But their search capabilities… A word or two on SQL (SELECT … FROM … WHERE – SFW) and nested data.

Learn from NoSQL document stores

The classical relational data model requests all data to be in first normal form (1NF):

A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.

Simply put: only scalar values are in the columns of table allowed. Take the popular example of a web application implementing a blog system. Let a blog article consist of an id, a title, the blog content and a list of comments with each of it consisting of an id, an author and a message. Using the classic models two tables are needed: blogposts(id, title, content) and comments(id, author, message, fk_blogpost_id). Because only scalar values can be stored in a table, the list of comments has to go in a second table. A foreign key links the two tables.

blogposts
id title content
1982 Searching document stores… I love the new NoSQL systems: more choices! …
comments
id author message fk_blogpost_id
1 Andrey Hi Ulf, … 1982
2 Johannes Ahoi Ulf, … 1982

A NoSQL document store allows using non-scalar values. Thus, comments belonging to a blogpost can be saved in a nested fashion: blogpost(id, title, content, comments(id, author, message)). Here’s a citation from 1986 transferring it into the world of relational databases:

…, an NF2 relation is a set of (equally structured) tuples (or records), the fields (attributes) of which contain either atomic values (e.g. numbers, strings) or relations. The latter may be flat or NF2 relations in turn.

As will be shown, the non-first normal form (NF2 or N1NF) data model is not as academic as it may seem on the first look.

blogposts
id title content comments
id author message
1982 Searching document stores… I love the new NoSQL systems: more choices! … 1 Andrey Hi Ulf, …
2 Johannes Ahoi Ulf, …

Naturally, the 1980s usage examples are not about blog software. Back then, the examples have been around CAD/CAM, forms, scientific and engineering data. The NF2 data model and a modern document store share some advantages:

  • Semantically related data is not split into different tables.
    • On disk: potentially less joins and disk seeks, thus quicker search. Related: Akiban.
    • Distribution (e.g. sharding): potentially less cross-shard queries as basic queries may not spawn multiple nodes. Related: MySQL Cluster condition push-down, …
  • Hierarchical structures often found in reports (master-detail view) are matched.
  • 1:n relationships are easily modeled

[e]NF2 and JSON stores similarities

Todays NoSQL document stores go beyond yesterdays NF2 data model. Often, JSON respectively BSON is used as a data serialization format. This is a natural choice for a web database. JSON and JavaScript go perfectly together. JavaScript gets more and more popular, for example, to create portable applications for mobile devices. But also other languages support the lightweight data encoding format.

JSON supports scalar and non-scalar data types. The scalar types are: string, number, bool and null. The non-scalar types are array and object. With the exception of an array, counterparts can be found in classic NF2 models, as sketched above. Extended NF2 proposals exist that cover all JSON types. SQL:1999 introduces ARRAY (as of SQL:2003 multi-dimensional) as well as ROW. SQL:2003 adds MULTISET. The similarities between a JSON based document store and SQL based [e]NF2 remain obvious, which is important when getting deeper into algebra (not for me 🙂 !), query languages and such.

Data types
JSON SQL
Scalar / atomic
string SQL:92 – CHARACTER, CHARACTER VARYING, SQL:1999 – CLOB
number SQL:92 – NUMERIC, DECIMAL, INTEGER, SMALLINT, FLOAT, REAL, DOUBLE PRECISION
true, false SQL:1999 – BOOLEAN (TRUE, FALSE, UNKNOWN)
null SQL:86(?) – NULL
Non-scalar / composite / collection
object SQL:1999 – ROW
array SQL:1999 – ARRAY (ordered), SQL:2003 – MULTISET (unordered), see notes below!

Below is a concrete example for the blog post application data modelling using a JSON document and a SQL:2003 style table. Once you look beyond MySQL, you can find relational databases implementing these or similar features. For example, Oracle has a variable-length VARRAY type, Postgres offers variable-length multidimensional ARRAY at least since 7.1, DB/2 uses…, Informix has…

JSON

{
    "id": 1982,
    "title": "Searching document stores...",
    "content": "I love the new NoSQL systems: more choices! ...",
    "show_comments": true,
    "comments": [
        {
            "author": "Andrey",
            "message": "Hi Ulf, ..."
        },
        {
            "author": "Johannes",
            "message": "Ahoi Ulf, ..."
        }
    ]
}

SQL:>=2003

CREATE TABLE blogposts (
  id      INTEGER,
  title   VARCHAR(100),
  content CHARACTER LARGE OBJECT,
  show_comments BOOLEAN,
  comments ROW(
    author VARCHAR(100),
    message CHARACTER LARGE OBJECT 
    ) ARRAY[1000]
)     
   

Please note, I have choosen an example which gives the impression that there is a direct mapping between a SQL:2003 table and arbitrary JSON documents. This is not the case. The SQL:2003 collection data types ARRAY and MULTISET are typed. Imagine, some comments had a created TIME field, others had not. Then, two distinct ROW declarations would be required. For the rest of the post, I’ll ignore the glory schema question.

SQL:1999 "the type monster" versus schemaless

The major difference between a JSON store and a SQL:1999+ relational database is on the role and importance of a schema. Be it a table or a composite type everything in a relational database follows a previously defined schema. The more you walk along the SQL:1999, SQL:2003, SQL:2008 road, the complexer and stricter types (and schemata) get: object orientation, XML, … Not so in many NoSQL document stores: each and every document can have a different schema, schemas are not declared upfront. Simple thing: simple to learn, simple to use yet powerful.

Martin Fowler has published a great presentation discussing schemaless data structures last week. It’s as good as his book NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (strong buy for beginners).

Like the schemaless nature of a JSON document store or not. Fact is, JSON based document stores have gained significant popularity and market share. Fact is, if your queries match the sweet spot of the data model (queries do not spawn multiple documents), you need sharding, and the system does not mess to badly with schemaless disadvantages (e.g. storage requirements) you get a winning combination.

It is so appealing! BUT the query language…

Speaking of JSON, documents and sharding what could be more logical than doing key lookups (shard key, distributed systems, ….) or using JavaScript expressions to search the data store. As a web developer you do know JavaScript inside-out anyway. Possibly, the search is enriched with MapReduce (sharding, distributed systems… ).

Key lookups are unquestionable no rich query language. Whether the JavaScript expressions available to are rich and powerful, has to be checked on a case by case basis. For example, I have my doubts (english version, german version) about the MongoDB query language. MapReduce is certainly great for distributed systems but only for those. Olaf Bachman, who is managing Google’s European ads abuse engineering efforts, describes in his talk NoNoSQL@Google how his team ended up building SQL-like query capabilities for the various NoSQL solutions used.

One does not have to be a computer specialist to understand SFW basics- Read: sales and marketing can do basic analysis themselves without bothering you, the developer. Virtually ever developers has learned SQL at some point, thus low training costs for you, the manager.

SQL is a declarative language. You say what data you want and not how to access it. The data store will find an access path, an optimizer component can speed it up. Whereas with MapReduce or, for example, a JavaScript based query language, the developer has to implement an access path. Listen to NoNoSQL@Google: are you an expert in developing distributed [MapReduce] programs?

The roots of SQL for [e]NF2

Given the link between a simple JSON centric data model – [e]NF and the unquestioned advantages of SQL, it is time to do a time ride back to the 1980s but skip 1977. We are on the hunt for SQL to query [e]NF2. Let the joy ride begin with the paper Die NF2-Relationenalgebra zur einheitlichen Manipulation externer, konzeptueller und interne Datenstrukturen by Hans-Jörg Schek and Marc H. Scholl published in 1983. The discussion of a relational algebra was quickly followed by prototype systems and query language proposals, for example:

After being a hot topic in the mid 1980s, research focus seems to have shifted towards object oriented data bases.

Most NF2 algebras propose NEST and UNNEST operations. Most SQL-style language proposals seek to improve database language orthogonality, see also Some principles of good language design: with especial reference to the design of database languages by C.J. Date (1984). Traces of both can be found in SQL:2003.

Nested relations, nested expressions…

The SQL/NF paper notes that a NF2 data structure allows a relation wherever a scalar value can occur in a 1NF data structure (orthogonality) [1]. They continue that SQL has a closure property where the result of any query on one or more relations is itself a relation [2].

  SELECT attribute-list
  FROM relation-list
  WHERE predicate

Thus, the language shall allow using a SFW-expression in the attribute-list (because of [1]) and in the relation-list (because of [2]) of a SFW-expression.

blogposts
id title content comments_allowed comments
author message
1982 Searching document stores… I love the new NoSQL systems: more choices! … true Andrey Hi Ulf, …
Johannes Ahoi Ulf, …
2013 Solution for… MySQL 5.6 introduces… true    
2012 … and a Happy New Year! Picture taken… false    

Given the above nested blogpost table, the proposal is that we can filter the nested comments using nested SFW:

  SELECT 
    title, 
    (SELECT * FROM comments WHERE author = 'Johannes') 
  FROM 
    blogposts


The above query would return all blogposts including those without any comments but show only comments from Johannes, if any. Andreys’ comment on "Searching document stores…" would be excluded. To show only Johannes comments’ and retrieve only those blogposts commented by Johannes one adds a predicate:

  SELECT 
    title, 
    (SELECT * FROM comments WHERE author = 'Johannes') 
  FROM 
    blogposts
  WHERE 
    (SELECT * FROM comments WHERE author = 'Johannes')


The result of this query would be:

title comments
author message
Searching document stores… Johannes Ahoi Ulf, …

The userfriendliness of the query can be improved removing the need to repeat the nested SFW-expression. The various papers from back then use different shortcuts. Let’s do a modern mash-up using SQL:1999 WITH style:

 WITH
   author_search 
 AS
   (SELECT * FROM comments WHERE author = 'Johannes')
 SELECT 
   title, 
   author_search 
  FROM 
    blogposts
  WHERE 
    author_search


BTW, an Oracle colleguage (not from MySQL) noted in the current issue of the german computer magazine iX that MySQL lacks WITH

I am aware of jumping around between old and new syntax elements. However, I am not after proposing a concrete language – I better leave that to the experts. I try to take a users view on what could possibly be done. The early proposals appeal to me… One more query to conclude the paragraph with SQL:2003 style syntax elements: show all blog posts with together with the first comment, if any.

  SELECT
    title,
    comments[0]
  FROM 
    blogposts


SQL:2003 allows accessing ARRAY elements using an offset, like in many other programming languages.

Nesting and unnesting

To transform nested into flat relations and vice versa NEST and UNNEST operations are required. Formulating them explicitly and introducing new operators might contribute to language clearity. (Please note, that below example uses PNF.)

blogposts
id title content
1982 Searching document stores… I love the new NoSQL systems: more choices! …
comments
id author message fk_blogpost_id
1 Andrey Hi Ulf, … 1982
2 Johannes Ahoi Ulf, … 1982
v NEST / UNNEST ^
blogposts
id title content comments
id author message fk_blogpost_id
1982 Searching document stores… I love the new NoSQL systems: more choices! … 1 Andrey Hi Ulf, … 1982
2 Johannes Ahoi Ulf, … 1982

Unnesting is also useful for use of nested relations with function that expect scalar values. What should SUM((price, description), (price, description)) return? Recall that we can have non-scalar wherever scalar values used to be.

SQL:2003 ?

To be honest, I have not written a single SQL query of higher complexity than SELECT 1 FROM DUAL for years. SELECT 'Hello world' was I needed for testing PHP stuff. Be gentle with my lame examples inspired by what I found in my bookshelf. My bookshelf has a copy of SQL 1999 und SQL 2003: Objektrelationales SQL, SQLJ und SQL/XML by Can Türker. (Aside and BTW: Can Türker is an advisor of ETZ Zurich Database Research Group together with H.J. Schek.)

For all SQL:2003 examples, I’ll use a slightly modified table definition than shown above. I do not use ARRAY to match JSON syntax but MULTISET.

CREATE TABLE blogposts (
  id      INTEGER,
  title   VARCHAR(100),
  content CHARACTER LARGE OBJECT,
  show_comments BOOLEAN,
  comments ROW(
    author VARCHAR(100),
    message CHARACTER LARGE OBJECT 
    ) MULTISET
);


To get started we are looking for all blogposts commented by Johannes. For each one found, we want all the comments, including Andreys’ comments.

  SELECT * 
  FROM blogposts
  WHERE EXISTS(
    SELECT * 
    FROM UNNEST(comments) c 
    WHERE c.author = 'Johannes');    


Find all blog posts titles together with all comments but show only comments from Johannes, if any:

  SELECT 
    title,
    MULTISET(
      SELECT c 
      FROM UNNEST(comments) c
      WHERE c.author = 'Johannes'
    )
  FROM blogposts

Compared to the early SQL/NF style examples the syntax is quite verbose because a type constructor (MULTISET) us used with the inner query. If I’m not mistaken, such explicit type constructors can also be found in HDBL from 1985/1986. But instead of having to type the eight characters of MULTISET the constructor names are shorter like <>, {}

SQL:2003 TABLE constructor

… speaking of constructors. A new TABLE constructor has been added to SQL:2003 which will become handy in a second. The constructor makes it possible to use a functions return value like a table. For example, there could be a function random_greetings returning (rnd, greeting) tuples. By help of TABLE it can appear in the relation-list of an SFW-expression:

  SELECT
    rnd
  FROM TABLE(random_greetings(100))

Back to JSON

Time to recap, time to attempt to crack the nut.

More than 30 years ago, the SQL community has looked into breaking out of the limitations of "flat" tables (first normal form). Algebras for "nested" tables (non-first normal form) have been developed, followed by SQL language extension proposals. Some 10 years ago, SQL:2003 has incorporated all or some of the ideas – please, find an expert to elaborate on this.

SQL tables can hold nested data, JSON documents can hold nested. SQL types almost match JSON types. The troublemaker here is a JSON array. A JSON array is a sorted list of values of arbitrary types (types can vary within one array). A SQL array is a sorted list of values of one type.

JSON SQL

 [
  1,
  "schemaless",
  "no",
  "yes",
  3
 ]

n/a

 [
   1,
   2,
   3
 ]

INT ARRAY[3]

Likely, one would have to introduce a new sorted collection type to overcome this gap. Or, would we rather need a loosely typed scalar? Whatever, it should be possible to use (almost) standard SQL:2003 tables to store JSON documents in a "native" way. JSON would be nothing but a fancy data serialization format. Such a nested table could equally well be used to store and output data as XML.

Result: MySQL as a schemafull document store – with a proper query language.

Finally, the schemaless issue

The remaining nut to crack is schemaless. Like it or not, people use it. Tell them it has drawbacks, they still use it and want it. A question to the experts: is it a non-issue for an SFW-expression?

SFW can deal any nested table. Any collection of (schemaless) documents can be mapped into a list of (schemafull) tables. In the worst case the collection is mapped into as many different tables as there are documents. However, SFW can deal with any table.

It is not possible to create a table that can all possible documents of a collection. But, we can create a table and dump all JSON documents in a BLOB. The SQL:2003 TABLE constructor can be used to create tables, on which we can perform queries, based on a functions return value. Is this possible:

SELECT
  id,
  title
FROM TABLE(json(<blob_column>))

Result, if possible: MySQL as a schemaless document store – with a proper query language. I would definetly prefer that over any UDF based approach such as:

  SELECT
    CAST(JSON_DECODE("id", <blob_column>), INT) AS id,
    JSON_DECODE("title", <blob_column>) AS title
  FROM 
   <blob_column>

The sketched NF2/SQL:2003 road would allow storing nested data in a schemafull and possibly also schemaless way. Query capabilities would be great. Likely, the BLOB hack is somewhat hidden from the user.

Happy hacking!

@Ulf_Wendel Follow me on Twitter