This is something I’ve been thinking about for a while. How can I cache and have nearly-instant access to any data point within the time series without having to splash out on expensive time series databases. I have an half cooked solution and decided to write this post to see if anybody out there has worked on a similar problem and has a better solution to share.
Specifically I’m trying to work with financial data. Take for example transactions. A position management system needs to be able to store millions of a transactions for a number of portfolios. Storing the data is not a problem for any relational database – The problem is querying it.
Financial data is, by its own nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing it.
Now this is all good and well but when we need to access the transactions for a specific portfolio between the first of January 1999 and the 31st of December 2008 with all the related security information and prices relational databases start to struggle. Of course it is doable, but if you can get the data out first CPU cycles used by your software to do the computation are less expensive than you database’s.
What I want to be able to do is build a cache of the flattened (de-normalized) data and make it available to n clients at the same time.
The obvious solution is to build a massive time-ordered array of all the transactions and store it in memory. Considering the potential size of the array something like memcached would be more appropriate.
Here comes the second problem. Assume a client calls your software with a couple of dates, you will be forced to loop over the whole array to find the chunk you need. Again, doable, but inefficient.
Array is definitely the way to go to store the data. What I need is an “index” which will tell me exactly where a specific transaction is stored in the array.
Here’s a possible solution I’ve been toying with today.
The idea comes from a technique used in videogames called voxel-based-rending and ray-casting. I’ll try and briefly explain what that is (at least as far as my idea is concerned):
Say you are trying to render a landscape. Voxel rendering makes this fairly simple and is achieved column by column. For each column, a maximum Y value was stored (in a Y-Buffer). The pixels of the column in question are filled in starting at the bottom of the screen, only if their Y value was greater than the one stored in the Y-Buffer. This rendering method made the process of removing non-visible parts of the scene very efficient and very easy.

While creating the array we should also create a map (our Y-buffer). The map will simply contain a transaction unique id as key and the position in the array as value.
This way when the client calls our APIs with the last unique identifier it has received we’ll just have to make a very quick lookup in our Y-Buffer for the array position and this will point us exactly to what we need to return.
This could work, but it’s only a partial solution, or rather a solution for a very specific problem. What we really want is for the client to be able to hand in 2 date-time values (from and to).
We could create our index using time information. The problem is that transactions are not at constant time-intervals. Which means that if we are called with a time value that is not in our index we’d still end up looping over the values around the closest point to find the position of the first element in the series.
An alternative would be to build our index using the maximum precision allowed. say an index of milliseconds where the last element position in the array is repeated in the index until a new one appears. Doable but very memory consuming.
This is as far as I’ve gone. Any other ideas out there?
UPDATE: follow-up article













Hmm. A fairly standard solution:
1) Sort the data – the one-time cost is absolutely worth it. Say this gives you a big array in memory of *n* items.
2) Choose some appropriate level of granularity (maybe 100,000 entries) and build a coarse index (a simple std:map will do fine). Say this gives you an index of *m* items.
3) When querying for a time *t*, you would call *lower_bound(t)* on your map. If *t* is found in the map, you can look it up in your in-memory array and you’re done. If not, increment the iterator lower_bound() returned and do a binary search on the items between the two indices (ie, between the one pointed to by the iterator lower_bound() returned and the iterator after incrementing). The complexity is therefore O(log(m) + log(n)). If you could regularise the time indices (eg, uniform sampling) you could improve this considerably.
For a 10 million point dataset with a 100,000 point granularity, you would get a cost of something like 24 accesses into your index and 17 accesses into your data array, which isn’t bad considering we haven’t posed any stricter assumptions.
I just read your post again and it sounds sort of similar…
This is something I considered. Just need to work out which one is less expensive. A l pathfinding-type lookup on my index (as you suggest) or keeping a full index at the maximum possible precision.
I’ll still spend some more time on this and post some more thoughts.
BTW, what language are you thinking of writing this in?
Uzair. Just wrote a new post detailing my solution.
I’m not sure yet what language I’m going to write it in. This is a solution to a problem I’m just hoping to have and haven’t actually faced yet. Just decided to work on it for a bit to get some perverse geeky fun. Nothing else.
The only constraints I’d have in the project is that it has to be a web-server, has to use memcached (no control over that choice) and had to integrate with other software around it all written in java.
What’s so bad about making one or two extra clustered indices? Sure you duplicate your data, but disk space is cheap and the extra overhead for maintaining the indices is something log(n).
Ideally I would like to store everything in memory, hence memcached. On a very large data set this could become a bit too big. See my reply to Uzair.
why can’t you just do a binary search of the time column (assuming that it is sorted by time) to find the nearest (or exact point), shouldn’t take very long. log2n
this is exactly what a database would be doing at the lowest level, for an index based on time.
I want to avoid searches as much as possible. I want my API to be able to handle as many simultaneous requests as possible and a lookup on a map in memory is definitely less expensive than a binary search.
I still don’t understand.
if your ordered index is in-memory.
even if you had 1TB of memory (quite a lot), and each memory access took you 1 microsecond (slow),
each search would only take worse case of 40 accesses (2^40 = 1 trillion)
which would be 40 microseconds. this is not very long at all.
assuming some more reasonable ball-park figures of 4GB of memory, and
200 nanosecond access times, gives you
32 accesses * 200 nanos = 6 microseconds
this is over 150,000 searches per second.
exactly how demanding is your user?
Anon, You are right. As I said in my reply to Uzair I have decided to spend some time thinking about this to get some geeky fun out of it.
[...] Caching time series Oct 27 [...]
Hello from Russia!
Can I quote a post “No teme” in your blog with the link to you?
Absolutely. Feel free to use this post wherever you want.