Using AzureGraphStore to store triples in Azure Table Storage

Introduction

As part of a series of blog posts, I’ll explore a number of extensions I have written for the Windows Azure Storage SDK for .NET. These extensions build on top of the existing storage options (tables, queues and blobs) to provide an API more tailored for particular storage needs.

What is a Graph Store?

No, it’s not somewhere for keeping pie charts; it’s a triple store. Each record holds three properties that together represent a relationship or a fact about something.

These properties are usually ‘subject’, ‘predicate’, ‘object’. As this is .NET, I’ve called them ‘subject’, ‘property’ and ‘value’.

As an example, you could use triples to store these facts:

Richard, is, Male
Richard, likes, Cheese
Dave, likes, Cheese

If you then queried the store for all triples with a value of ‘Cheese’, you would find that Richard and Dave ‘like’ it.
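To make the idea concrete before touching Azure at all, here is a minimal in-memory sketch of that query. The `TripleExample` class and its members are hypothetical names used only for illustration; they are not part of AzureGraphStore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TripleExample
{
    // Hypothetical in-memory stand-in for the store:
    // each fact is a (subject, property, value) tuple.
    public static readonly List<(string Subject, string Property, string Value)> Facts =
        new List<(string Subject, string Property, string Value)>
        {
            ("Richard", "is", "Male"),
            ("Richard", "likes", "Cheese"),
            ("Dave", "likes", "Cheese"),
        };

    // Returns the subject of every triple whose value matches.
    public static List<string> SubjectsWithValue(string value) =>
        Facts.Where(t => t.Value == value)
             .Select(t => t.Subject)
             .ToList();
}
```

Calling `TripleExample.SubjectsWithValue("Cheese")` returns Richard and Dave, in that order.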

This kind of data store lends itself very well to storing relationships, and ad-hoc facts that don’t fit particularly well into a relational or document database.

Installing and source code

The source code is in GitHub: https://github.com/richorama/AzureStorageExtensions

NuGet Package: http://nuget.org/packages/AzureGraphStore/

To install AzureGraphStore, run the following command in the Package Manager Console.

PM> Install-Package AzureGraphStore

How to use

The graph store is designed to feel just like the Table, Blob and Queue API. To create a graph, start with a CloudStorageAccount:

var account = CloudStorageAccount.DevelopmentStorageAccount;
var graphClient = account.CreateCloudGraphClient();
var graph = graphClient.GetGraphReference("example");
graph.CreateIfNotExists();

You can then start writing triples to the store:

graph.Put("Richard", "Loves", "Cheese");

To query the graph, pass any number of arguments into the ‘Get’ function:

// query a single triple
var triple = graph.Get("Richard", "Loves", "Cheese").First();

// query using any combination of subject, property and value, e.g.
var bySubject = graph.Get(subject: "Richard");
var byProperty = graph.Get(property: "Loves");
var byValue = graph.Get(value: "Cheese");
var bySubjectAndProperty = graph.Get(subject: "Richard", property: "Hates");
var byPropertyAndValue = graph.Get(property: "Hates", value: "Marmite");
var everything = graph.Get(); // retrieving the entire graph is not recommended!

It’s recommended to enable key hashing, which allows you to use longer key names and escape any invalid characters:

graph.KeyEncoder = Graph.MD5Hash;
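To see why hashing helps, here is a rough sketch of what an MD5 key encoder might do. `KeyHashingSketch.Md5Hex` is a hypothetical helper, not the library's `Graph.MD5Hash`, and the library's actual encoding and output format may differ. The point is that any input, however long and whatever characters it contains, becomes a fixed 32-character string of hex digits, which is always a valid table key.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class KeyHashingSketch
{
    // Hashes a key with MD5 and returns it as lowercase hex.
    // Assumption: UTF-8 input and hex output; the library may choose differently.
    public static string Md5Hex(string key)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            var hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```

For example, `KeyHashingSketch.Md5Hex("what/about#this?")` yields a key with no slashes, hashes or question marks, all of which Table Storage rejects in key values.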

How it works

Writing Triples

Under the covers, the graph is just a table, with a name prefixed with ‘wazgraph’.

Azure Table Storage uses two keys to store the records: the PartitionKey and the RowKey. These are the only indexed columns in the table, so the use of these keys should be carefully considered when designing a storage system.

Because the triple store can be queried in several ways, using any combination of one, two or all three dimensions, each triple is stored three times. This improves retrieval times and avoids full-table scans.

To record the triple ‘richard’, ‘loves’, ‘cheese’, the following entities are written to the table:

[Image: table showing the three entities written for this triple]

The records are stored with PartitionKey combinations of:

  • Property~Subject
  • Subject~Value
  • Value~Property

The RowKey is always the other dimension, which isn’t included as part of the PartitionKey.

The PartitionKeys are prefixed with ps/sv/vp to stop key collisions.

The three entities are written in parallel.
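The key layout can be sketched in a few lines. This is an illustration rather than the library's code: `TripleKeys.For` is a hypothetical helper, the ‘sv’ and ‘vp’ layouts follow the query examples in this post, and the ‘ps’ layout (property, then subject) is my reading of the prefix.

```csharp
using System;
using System.Collections.Generic;

public static class TripleKeys
{
    // Returns the (PartitionKey, RowKey) pair of each entity written for one triple.
    // The RowKey is always the dimension left out of the PartitionKey.
    public static List<(string PartitionKey, string RowKey)> For(
        string subject, string property, string value)
    {
        return new List<(string PartitionKey, string RowKey)>
        {
            ($"ps~{property}~{subject}", value),   // assumed layout for the 'ps' prefix
            ($"sv~{subject}~{value}", property),
            ($"vp~{value}~{property}", subject),
        };
    }
}
```

`TripleKeys.For("richard", "loves", "cheese")` produces ‘ps~loves~richard’ with RowKey ‘cheese’, ‘sv~richard~cheese’ with RowKey ‘loves’, and ‘vp~cheese~loves’ with RowKey ‘richard’.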

Querying Triples

When you want to query records with a subject of ‘richard’, you compose an OData query like this:

  PartitionKey gt 'sv~richard~' and PartitionKey lt 'sv~richard~~'

…which then returns all records where the PartitionKey starts with ‘sv~richard~’.

The tilde is the last printable ASCII character (0x7E), so querying between ‘prefix~’ and ‘prefix~~’ should pick up all keys starting with ‘prefix~’, as long as the rest of the key contains only characters that sort before the tilde.
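Here is a small sketch of that range trick. `PrefixQuery` is a hypothetical helper, not part of the library. `Filter` builds the OData string, and `InRange` mimics the comparison locally, assuming keys are compared character by character in ordinal order.

```csharp
using System;

public static class PrefixQuery
{
    // Builds the OData filter for "PartitionKey starts with prefix".
    // Single quotes are doubled, which is how OData escapes them in string literals.
    public static string Filter(string prefix)
    {
        string lower = prefix.Replace("'", "''");
        string upper = (prefix + "~").Replace("'", "''");
        return $"PartitionKey gt '{lower}' and PartitionKey lt '{upper}'";
    }

    // Local equivalent of the filter, assuming ordinal string comparison.
    public static bool InRange(string key, string prefix) =>
        string.CompareOrdinal(key, prefix) > 0 &&
        string.CompareOrdinal(key, prefix + "~") < 0;
}
```

`PrefixQuery.Filter("sv~richard~")` gives the query shown above. `InRange("sv~richard~cheese", "sv~richard~")` is true, while `InRange("sv~richardson~cheese", "sv~richard~")` is false, so a subject that merely begins with ‘richard’ is not picked up. A key whose remainder begins with a character above the tilde, such as ‘é’, falls outside the range as well, which is one more reason to consider hashing keys.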

If you want to search on two dimensions, for example ‘loves’, ‘cheese’, the OData query looks like this:

  PartitionKey eq 'vp~cheese~loves'

The PartitionKey that holds both dimensions is used, which avoids a full table scan.

When querying using all three dimensions, the RowKey is included in the query.
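Putting the cases together, choosing a filter might look something like the sketch below. `GraphQuerySketch.BuildFilter` is a hypothetical function, not the library's implementation. Which partition serves which combination is my assumption (including the use of the ‘sv’ partition for the three-dimension case), and quote escaping is left out for brevity.

```csharp
using System;

public static class GraphQuerySketch
{
    // Picks an OData filter based on which dimensions were supplied.
    public static string BuildFilter(string subject = null, string property = null, string value = null)
    {
        bool s = !string.IsNullOrEmpty(subject);
        bool p = !string.IsNullOrEmpty(property);
        bool v = !string.IsNullOrEmpty(value);

        // All three dimensions: exact PartitionKey plus RowKey (partition choice is assumed).
        if (s && p && v) return $"PartitionKey eq 'sv~{subject}~{value}' and RowKey eq '{property}'";

        // Two dimensions: an exact match on the partition that holds both.
        if (s && p) return $"PartitionKey eq 'ps~{property}~{subject}'";
        if (s && v) return $"PartitionKey eq 'sv~{subject}~{value}'";
        if (p && v) return $"PartitionKey eq 'vp~{value}~{property}'";

        // One dimension: a prefix range over the partition that leads with it.
        if (s) return PrefixRange($"sv~{subject}~");
        if (p) return PrefixRange($"ps~{property}~");
        if (v) return PrefixRange($"vp~{value}~");

        // Nothing supplied: no filter, so the caller is left with a full scan.
        return "";
    }

    private static string PrefixRange(string prefix) =>
        $"PartitionKey gt '{prefix}' and PartitionKey lt '{prefix}~'";
}
```

With this sketch, `BuildFilter(property: "loves", value: "cheese")` returns the ‘vp~cheese~loves’ query above, and `BuildFilter(subject: "richard")` returns the prefix range query from earlier.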

What Next?

I intend to test this with a large quantity of data and publish some performance metrics.

I have also put some thought into writing a SPARQL parser on top of this, but life is too short. Does anybody want to take that on?
