Valio.fi deep dive #7: The resource data storage model

After the previous post on how our CMS works on a concept level, it’s time to explain the technical details. Note that remembering the previous post’s concepts is pretty much mandatory for understanding this discussion. Oh, and this gets quite detailed; unless you’re designing a content management platform, you probably don’t care enough to read it all. That’s ok. :-)

The resource table

image

So, this is our Resource table in its entirety (at one stage of the development, but recent enough for this discussion). Many of the fields are not really interesting. In particular, there are a bunch of fields concerning ratings, view counts, likes and comments, which are cached (technically trigger-updated) summaries of other tables storing the actual user-level interaction. Others are simple mechanics like creation/modification data and publication control.

The interesting bits are:

  • Url is what maps a given incoming request to a resource. A typical example is “/reseptit/appelsiinipasha”, a reference to a particular recipe.
  • Type is a string identifying the resource type, e.g. “Recipe” or “Article”.
  • Layout is a string that describes a specific layout version to use. Typically null, but for campaign articles and other special content, we could switch to an alternate set of MVC views with this.
  • Data is an XML field that contains quite a lot of information. We’ll get back to this later.

imageThis is the key structure for managing resources. There’s very little else. One more table deserves to be mentioned at this stage: ResourceLink. A resource link specifies (unsurprisingly) a link between two resources. The exact semantics of that link vary, as represented by the html-ish rel column.

Typical rels are “draft” (points from a published resource to its draft), “tag” (points from a resource to its tags – yes, tags are resources too) and “seeAlso” (a manually created link between two related resources).

Modeling the resource types

The resource types – Recipe, Article, Product, Home, … – are a key part of the equation, because the resource type specifies the controller used for page execution. It also specifies the view, although we can make exceptions using the Layout property discussed above.

Storage-wise, how do these resource types differ? Actually, the differences are rather small. They all have very different controllers and views, but the actual data structures are remarkably similar. For example, a product resource has a reference to an actual Product business object (stored within a Product table and a few dozen subtables), but that’s only one guid. Some resource types have a handful of custom fields, but most only have one or two.

We originally considered a table per subclass strategy, but that would have yielded ridiculously many tables for a very simple purpose. Thus we decided to go for a single table for all resources, with the Type column as the discriminator. However, the option of dumping all the subclass-specific columns as nullables on the Resource would have yielded a very ugly table (consider what we have now + 50-something fields more).

XML to the rescue

Enter the Data xml column. “Ugh”, you say, “you guys actually discarded referential integrity and hid your foreign keys into an XML blob?”. Yeah! Mind you, we actually even considered a NoSQL implementation, so a relational database with only partial referential integrity was by no means an extreme option.

Let’s compare the arguments for and against the XML column.

Pro-XML Con-XML (more like pro-tables)
  • The relational representation would have involved really many columns, most of them null on any given row. We didn’t want that mess.
  • We would have been changing the table’s schema all the time, as finding the final set of variables for all types took quite a while.
  • XML enables resource type templates to be stored (mostly) as a single XML document; contrast with the verbiage needed to declaratively describe and instantiate a database row with close to 100 fields.
  • Since some of the resource types required multi-value properties, we couldn’t have gone cleanly with one table anyway, or alternatively we would have ended up with non-relational encodings (e.g. int arrays in nvarchars like “1,2,3”)
  • We still needed a clean way to store the widgets (see below).
  • Lack of referential integrity. What to do when a user hits a recipe page which refers to a recipe object that was deleted? True, we need to deal with that, but such scenario is still the result of a bug; we do have code to clean up refs.Also, realize that cascades aren’t a perfect solution either. If product deletion just set the reference to null, we’d have the same problem as above. If it removed the resource altogether, we’d potentially wipe lots of valuable information and content.
  • Performance. Granted, while MSSQL provides decent XQuery tools, XML column performance is suboptimal. However, it’s not as bad as some think, and it can certainly be mitigated by caching.
  • OR mapper support. Yeah, that’s a problem, but writing it ourselves was less work than what one might imagine.

 

Widgets, the last straw

Ultimately, much of the XML decision finally hinged on the question of where to store widgets. A short recap: A typical page has perhaps a dozen widgets organized in a few zones. The widgets are selected from something like 30 widget types, each of which has a specific set of parameters which must be stored per widget instance.

Now, the widget question is just like the resource one, but on a smaller scale. In order to construct a relational model for the widgets, we would be creating a dazzling amount of tables. For example, if a Product Highlight widget had a specified fixed product to highlight, a pure relational implementation would have at least the following tables:

  • Widget (with at least resource id, zone name and a position index, plus a type reference)
  • ProductHighlightWidget (with the same identity as the Widget, plus columns for single-value configuration properties)
  • ProductHighlightWidget_Product (with a ProductHighlightWidget reference, a Product reference and an index position)

Granted, a deleting cascade would work great here except for collapsing the index positions, but even that we could easily handle with a trigger.

But I’m accepting some compromises already: I don’t have a WidgetType table, and my zone name is a string. Relationally speaking, a Resource should actually have a reference to a ResourceLayout (which technically defines available zones), which should then be linked onward to ResourceType (which defines the layout set available). Oh, and we’d still need a ResourceLayout_Zone to which Widget would link, but since the zone set is actually defined in cshtml files, who would be responsible for updating the RL_Z table?

The previous mental experiment reveals some ways in which many applications could benefit from the use of NoSQL solutions. We only touched on one property, and those widget types contain quite a few of them.

As it was obvious that we needed XML to store the widget setup, it became quite lucrative to use the same XML blob for resource subclass data as well.

After the widgets discussion, there is one more thing I want to highlight as a benefit of XML. Since most of the page’s content is defined in that one blob, certain scenarios such as version control become trivial. For example, publishing and reverting drafts mostly involves throwing XML blobs around. Compare this to the effort it would take to update all the references properly.

Finally, you may want to ask “Why XML instead of, say, JSON?”. XML admittedly produces relatively bulky documents. However, it has one excellent characteristic: we can easily query it with SQL Server, and that makes a huge difference in many scenarios. Implementing the equivalent performance with JSON would have required cache fields, reliable updating of which would in turn require triggers, but since parsing JSON with T-SQL is a pain (to say the least), it would have drawn SQLCLR in as well. Thus, XML was actually simple this time.

Now show me the XML!

The actual size of the XML varies heavily by resource. On one of my dev databases, the shortest data document is 7 bytes; the longest is 28 kilobytes. But here’s a fairly short one:

image

This is for a recipe page, where the only real property is the recipe reference. The page template also specifies a zone called “Additional”, but it has no widgets specified – thus, a very short XML document.

Here’s a snippet of a significantly longer one.

image

This is from an article page. As you can see, the Article resource type has a subtype field declaring this instance as a “product article”, i.e. one that describes a single Valio product or a family thereof. Since an article does not derive its content from a linked business entity, its content is mostly located in widgets. Therefore, the <zones> element tends to be fairly long with a hefty stack of various widgets. In this example, you can see an image carousel, a text element and a brand banner (with a specific layout set – yeah, we have those for widgets too).

After the data comes the code

When I initially set out to write the description of our CMS features, I was thinking of a single long post. I have now rambled on for three medium-sized ones, and I still haven’t shown you a line of code. But, I promise that will change in the next episode: I will wrap up the CMS story by discussing our NHibernate implementation of the above data structures. And I’ll go through the controller/widget rendering framework as well.

Meanwhile, if there are some questions you’d like to have answered, please feel free to post comments.

November 19, 2011 · Jouni Heikniemi · No Comments
Tags: , ,  Â· Posted in: Web

Leave a Reply