Using AzureGraphStore to store triples in Azure Table Storage
Introduction
As part of a series of blog posts, I’ll explore a number of extensions I have written for the Windows Azure Storage SDK for .NET. These extensions build on top of the existing storage options (tables, queues and blobs) to provide an API more tailored for particular storage needs.
What is a Graph Store?
No, it’s not somewhere for keeping pie charts; it’s a triple store. It stores three properties that represent a relationship or a fact about something.
These properties are usually ‘subject’, ‘predicate’ and ‘object’. As this is .NET, I’ve called them ‘subject’, ‘property’ and ‘value’.
As an example, you could use triples to store these facts:
Richard, is, Male
Richard, likes, Cheese
Dave, likes, Cheese
If you then queried the store for all triples with a value of ‘Cheese’, you would find that Richard and Dave ‘like’ it.
This kind of data store lends itself very well to storing relationships, and ad-hoc facts that don’t fit particularly well into a relational or document database.
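As a quick sketch of the idea, here is how those facts and the ‘who likes Cheese?’ query look using the API described in the ‘How to use’ section below:

graph.Put("Richard", "is", "Male");
graph.Put("Richard", "likes", "Cheese");
graph.Put("Dave", "likes", "Cheese");

// returns the Richard and Dave 'likes Cheese' triples
var cheeseLovers = graph.Get(value: "Cheese");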
Installing and source code
The source code is on GitHub: https://github.com/richorama/AzureStorageExtensions
NuGet Package: http://nuget.org/packages/AzureGraphStore/
To install AzureGraphStore, run the following command in the Package Manager Console.
PM> Install-Package AzureGraphStore
How to use
The graph store is designed to feel just like the Table, Blob and Queue API. To create a graph, start with a CloudStorageAccount:
var account = CloudStorageAccount.DevelopmentStorageAccount;
var graphClient = account.CreateCloudGraphClient();
var graph = graphClient.GetGraphReference("example");
graph.CreateIfNotExists();
You can then start writing triples to the store:
graph.Put("Richard", "Loves", "Cheese");
To query the graph, pass any number of arguments into the ‘Get’ function:
// query a single triple
var triple = graph.Get("Richard", "Loves", "Cheese").First();

// query using any combination of subject, property and value, i.e.
var triples = graph.Get(subject: "Richard");
triples = graph.Get(property: "Loves");
triples = graph.Get(value: "Cheese");
triples = graph.Get(subject: "Richard", property: "Hates");
triples = graph.Get(property: "Hates", value: "Marmite");
triples = graph.Get(); // retrieving the entire graph is not recommended!
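Assuming each returned triple exposes Subject, Property and Value properties (named after the three dimensions above; the exact member names may differ), the results can be enumerated like any other sequence:

foreach (var t in graph.Get(value: "Cheese"))
{
    // e.g. "Richard Loves Cheese"
    Console.WriteLine("{0} {1} {2}", t.Subject, t.Property, t.Value);
}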
It’s recommended to enable key hashing, which allows you to use longer key names and escape any invalid characters:
graph.KeyEncoder = Graph.MD5Hash;
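For example, with hashing enabled you can store values containing characters that Azure Table Storage would otherwise reject as keys (the URI-style subject below is just an illustration):

graph.KeyEncoder = Graph.MD5Hash;

// '/' and '#' are not valid in table keys, but the hashed keys are safe
graph.Put("http://example.org/people#richard", "likes", "Cheese");
var triples = graph.Get(subject: "http://example.org/people#richard");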
How it works
Writing Triples
Under the covers, the graph is just a table, with a name prefixed with ‘wazgraph’.
Azure Table Storage uses two keys to store the records: the PartitionKey and the RowKey. These are the only indexed columns in the table, so the use of these keys should be carefully considered when designing a storage system.
Because the triple store can be queried in several ways, using any combination of one, two or all three dimensions, entities are stored three times to improve retrieval times, and avoid full-table scans.
To record the triple ‘richard’, ‘loves’, ‘cheese’, three entities are written to the table (sketched below).
The records are stored with PartitionKeys composed of the following combinations:
- Property~Subject
- Subject~Value
- Value~Property
The RowKey is always the remaining dimension, the one not included in the PartitionKey.
The PartitionKeys are prefixed with ps/sv/vp to prevent key collisions.
The three entities are written in parallel.
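To make this concrete, here is the rough shape of the three entities written for ‘richard’, ‘loves’, ‘cheese’, reconstructed from the prefixes and key combinations above (the library’s exact layout may differ):

PartitionKey         RowKey
ps~loves~richard     cheese
sv~richard~cheese    loves
vp~cheese~loves      richard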
Querying Triples
When you want to query records with a subject of ‘richard’, you compose an OData query like this:
PartitionKey gt 'sv~richard~' and PartitionKey lt 'sv~richard~~'
…which then returns all records where the PartitionKey starts with ‘sv~richard~’.
The tilde is the highest-valued printable ASCII character, so querying between ‘prefix~’ and ‘prefix~~’ should pick up all keys starting with ‘prefix~’.
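As a sketch, this is how such a prefix-range filter could be built with the storage SDK’s TableQuery helpers; it is illustrative rather than necessarily how AzureGraphStore constructs its queries:

using Microsoft.WindowsAzure.Storage.Table;

var filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThan, "sv~richard~"),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, "sv~richard~~"));

// matches every entity whose PartitionKey starts with 'sv~richard~'
var query = new TableQuery<DynamicTableEntity>().Where(filter);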
If you want to search on two dimensions, for example ‘loves’, ‘cheese’, the OData query looks like this:
PartitionKey eq 'vp~cheese~loves'
The appropriate PartitionKey is used, which holds the two dimensions, avoiding a full table scan.
When querying using all three dimensions, the RowKey is included in the query.
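For completeness, a three-dimension lookup can be sketched as a single-entity retrieve, since both keys are known (here ‘table’ is assumed to be a CloudTable reference to the underlying ‘wazgraph’ table; again, this is illustrative rather than the library’s exact code):

// 'sv~richard~cheese' holds the subject and value; the RowKey is the property
var retrieve = TableOperation.Retrieve<DynamicTableEntity>("sv~richard~cheese", "loves");
var result = table.Execute(retrieve);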
What Next?
I intend to test this with a large quantity of data, and publish some performance metrics.
I have also put some thought into writing a SPARQL parser on top of this, but life is too short. Does anybody want to take that on?