Caching time series


This is something I’ve been thinking about for a while: how can I cache a time series and get nearly-instant access to any data point within it without splashing out on an expensive time series database? I have a half-cooked solution and decided to write this post to see if anybody out there has worked on a similar problem and has a better solution to share.

Specifically, I’m trying to work with financial data. Take transactions, for example: a position management system needs to be able to store millions of transactions for a number of portfolios. Storing the data is not a problem for any relational database – the problem is querying it.

Financial data is, by its very nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing it.

Now this is all well and good, but when we need to access the transactions for a specific portfolio between the 1st of January 1999 and the 31st of December 2008, with all the related security information and prices, relational databases start to struggle. Of course it is doable, but if you can get the data out first, the CPU cycles your software spends on the computation are less expensive than your database’s.

What I want to be able to do is build a cache of the flattened (de-normalized) data and make it available to n clients at the same time.

The obvious solution is to build a massive time-ordered array of all the transactions and store it in memory. Considering the potential size of the array, something like memcached would be more appropriate.
Here comes the second problem: if a client calls your software with a couple of dates, you are forced to loop over the whole array to find the chunk you need. Again, doable, but inefficient.

An array is definitely the way to go for storing the data. What I need is an “index” that tells me exactly where a specific transaction is stored in the array.

Here’s a possible solution I’ve been toying with today.

The idea comes from a rendering technique used in videogames: voxel-based rendering and ray-casting. I’ll try to briefly explain what that is (at least as far as my idea is concerned):

Say you are trying to render a landscape. Voxel rendering makes this fairly simple and works column by column. For each column, the maximum Y value drawn so far is stored in a Y-buffer. The pixels of the column are filled in starting at the bottom of the screen, but only if their Y value is greater than the one stored in the Y-buffer. This makes the process of removing non-visible parts of the scene very efficient and very easy.

[Image: voxel ray-casting rendering of a landscape]
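
To make the analogy concrete, here is a minimal, hypothetical sketch of that Y-buffer test in Java (the projection maths is stripped out; only the “draw only if higher than anything drawn so far” idea matters here):

```java
// Hypothetical sketch of the Y-buffer occlusion test described above.
// The real renderer walks terrain samples front to back along a ray; here the
// projection is simplified away so only the occlusion test remains.
class VoxelColumnSketch {
    static void renderColumn(int column, int[] projectedHeights, int[][] frameBuffer) {
        int yBuffer = 0; // highest pixel already filled in this column
        for (int projectedY : projectedHeights) {
            if (projectedY > yBuffer) {
                // fill only the newly visible pixels, from the old top to the new top
                for (int y = yBuffer; y < projectedY; y++) {
                    frameBuffer[column][y] = 1; // terrain colour omitted
                }
                yBuffer = projectedY; // everything at or below this is now known to be hidden
            }
        }
    }
}
```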

While creating the array we should also create a map (our Y-buffer). The map simply contains each transaction’s unique id as key and its position in the array as value.

This way, when the client calls our API with the last unique identifier it has received, we only have to make a very quick lookup in our Y-buffer for the array position and then use the subList method of the List interface to return everything in the transaction array from that position to the end (size()).
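
As a rough sketch (in Java, since that’s the stack this would have to live in), the cache and its Y-buffer might look something like this; Transaction is a hypothetical placeholder for the flattened record:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the cache: a time-ordered list of transactions plus a "Y-buffer"
// map from transaction id to its position in the list.
class TransactionCache {
    private final List<Transaction> transactions = new ArrayList<>(); // time-ordered
    private final Map<String, Integer> index = new HashMap<>();       // id -> position

    // Transactions are appended in time order; the index is built as we go.
    void append(Transaction t) {
        index.put(t.getId(), transactions.size());
        transactions.add(t);
    }

    // Everything newer than the last transaction the client has already seen.
    List<Transaction> newerThan(String lastSeenId) {
        Integer position = index.get(lastSeenId);
        if (position == null) {
            return transactions; // unknown id: hand back the whole series
        }
        // subList returns a view, so nothing is copied or duplicated in memory
        return transactions.subList(position + 1, transactions.size());
    }
}

// Hypothetical flattened transaction record, only here to make the sketch compile.
class Transaction {
    private final String id;
    private final long timestamp;
    Transaction(String id, long timestamp) { this.id = id; this.timestamp = timestamp; }
    String getId() { return id; }
    long getTimestamp() { return timestamp; }
}
```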

This could work, but it’s only a partial solution, or rather a solution to a very specific problem. What we really want is for the client to be able to hand in two date-time values (from and to).
We could build our index from time information instead. The problem is that transactions do not arrive at constant time intervals, which means that if we are called with a time value that is not in our index we’d still end up looping over the values around the closest point to find the position of the first element in the series.

An alternative would be to build our index at the maximum precision allowed: say, an index of milliseconds where the last element’s position in the array is repeated in the index until a new one appears. Doable, but very memory-consuming.
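
For illustration only, here is a sketch of what that dense index would look like (reusing the hypothetical Transaction record from the earlier sketch). Ten years of millisecond keys is on the order of 3 × 10^11 entries, which is exactly where the memory goes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the millisecond-precision index: for every millisecond between the
// first and the last transaction, store the position of the latest transaction
// at or before that instant, repeating it until a new one appears. Lookup is a
// single map access, but the index grows with elapsed time, not with the number
// of transactions.
class DenseTimeIndex {
    static Map<Long, Integer> build(List<Transaction> transactions) {
        Map<Long, Integer> denseIndex = new HashMap<>();
        if (transactions.isEmpty()) {
            return denseIndex;
        }
        long firstMs = transactions.get(0).getTimestamp();
        long lastMs = transactions.get(transactions.size() - 1).getTimestamp();
        int position = 0;
        for (long ms = firstMs; ms <= lastMs; ms++) {
            // advance while the next transaction falls at or before this millisecond
            while (position + 1 < transactions.size()
                    && transactions.get(position + 1).getTimestamp() <= ms) {
                position++;
            }
            denseIndex.put(ms, position); // same position repeated until a new one appears
        }
        return denseIndex;
    }
}
```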

This is as far as I’ve gone. Any other ideas out there?

UPDATE: follow-up article


13 thoughts on “Caching time series”

  1. Uzair says:

    Hmm. A fairly standard solution:

    1) Sort the data – the one-time cost is absolutely worth it. Say this gives you a big array in memory of *n* items.

    2) Choose some appropriate level of granularity (maybe 100,000 entries) and build a coarse index (a simple std::map will do fine). Say this gives you an index of *m* items.

    3) When querying for a time *t*, you would call *lower_bound(t)* on your map. If *t* is found in the map, you can look it up in your in-memory array and you’re done. If not, increment the iterator lower_bound() returned and do a binary search on the items between the two indices (i.e., between the one pointed to by the iterator lower_bound() returned and the iterator after incrementing). The complexity is therefore O(log(m) + log(n)). If you could regularise the time indices (e.g., uniform sampling) you could improve this considerably.

    For a 10 million point dataset with a 100,000 point granularity, you would get a cost of something like 24 accesses into your index and 17 accesses into your data array, which isn’t bad considering we haven’t posed any stricter assumptions.
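
    In Java terms this coarse-index-plus-binary-search idea might look roughly like the sketch below. It is only an illustration: TreeMap.lowerEntry/ceilingEntry stand in for the lower_bound bookkeeping, and Transaction/getTimestamp are placeholders for the real record.

    ```java
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative sketch: a coarse ordered index over a sorted in-memory array,
    // refined with a binary search inside the relevant bucket.
    class CoarseTimeIndex {
        private final List<Transaction> sorted;                        // time-ordered data
        private final TreeMap<Long, Integer> coarse = new TreeMap<>(); // timestamp -> position

        CoarseTimeIndex(List<Transaction> sorted, int granularity) {
            this.sorted = sorted;
            for (int i = 0; i < sorted.size(); i += granularity) {     // index every n-th entry
                coarse.put(sorted.get(i).getTimestamp(), i);
            }
        }

        // Position of the first transaction with timestamp >= t (sorted.size() if none).
        int firstAtOrAfter(long t) {
            Map.Entry<Long, Integer> below = coarse.lowerEntry(t);    // greatest indexed timestamp < t
            Map.Entry<Long, Integer> above = coarse.ceilingEntry(t);  // smallest indexed timestamp >= t
            int lo = below == null ? 0 : below.getValue();
            int hi = above == null ? sorted.size() : above.getValue();
            while (lo < hi) {                                         // binary search inside the bucket
                int mid = (lo + hi) >>> 1;
                if (sorted.get(mid).getTimestamp() < t) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            return lo;
        }
    }
    ```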

    I just read your post again and it sounds sort of similar…

    • Stefano Buliani says:

      This is something I considered. I just need to work out which one is less expensive: a pathfinding-type lookup on my index (as you suggest) or keeping a full index at the maximum possible precision.

      I’ll still spend some more time on this and post some more thoughts.

      • Uzair says:

        BTW, what language are you thinking of writing this in?

      • Stefano Buliani says:

        Uzair. Just wrote a new post detailing my solution.

        I’m not sure yet what language I’m going to write it in. This is a solution to a problem I’m just hoping to have and haven’t actually faced yet. Just decided to work on it for a bit to get some perverse geeky fun. Nothing else.

        The only constraints I have on the project are that it has to be a web server, has to use memcached (no control over that choice) and has to integrate with other software around it, all written in Java.

  2. Jamie Olson says:

    What’s so bad about making one or two extra clustered indices? Sure, you duplicate your data, but disk space is cheap and the extra overhead for maintaining the indices is something like O(log n).

    • Stefano Buliani says:

      Ideally I would like to store everything in memory, hence memcached. On a very large data set this could become a bit too big. See my reply to Uzair.

  3. anon says:

    Why can’t you just do a binary search on the time column (assuming it is sorted by time) to find the nearest (or exact) point? It shouldn’t take very long: O(log2 n).

    This is exactly what a database would be doing at the lowest level for an index based on time.
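
    For instance (purely illustrative), with the timestamps held in a sorted long[] alongside the transaction array, the lookup could be as simple as this sketch (timestamps assumed unique):

    ```java
    import java.util.Arrays;

    // Plain binary search on the sorted time column, as suggested above.
    class TimeColumnSearch {
        // Position of the first timestamp >= t.
        static int firstAtOrAfter(long[] timestamps, long t) {
            int pos = Arrays.binarySearch(timestamps, t);
            // when t is absent, binarySearch returns -(insertionPoint) - 1
            return pos >= 0 ? pos : -pos - 1;
        }
    }
    ```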

    • Stefano Buliani says:

      I want to avoid searches as much as possible. I want my API to be able to handle as many simultaneous requests as possible and a lookup on a map in memory is definitely less expensive than a binary search.

      • anon says:

        I still don’t understand.

        if your ordered index is in-memory.

        even if you had 1TB of memory (quite a lot), and each memory access took you 1 microsecond (slow),

        each search would only take a worst case of 40 accesses (2^40 ≈ 1 trillion),
        which would be 40 microseconds. This is not very long at all.

        assuming some more reasonable ball-park figures of 4GB of memory, and
        200 nanosecond access times, gives you

        32 accesses * 200 nanos = 6 microseconds

        this is over 150,000 searches per second.

        exactly how demanding is your user?

      • Stefano Buliani says:

        Anon, you are right. As I said in my reply to Uzair, I have decided to spend some time thinking about this to get some geeky fun out of it.


