This is something I’ve been thinking about for a while: how can I cache and have nearly-instant access to any data point within a time series without having to splash out on an expensive time series database? I have a half-cooked solution and decided to write this post to see if anybody out there has worked on a similar problem and has a better solution to share.
Specifically, I’m trying to work with financial data. Take, for example, transactions. A position management system needs to be able to store millions of transactions across a number of portfolios. Storing the data is not a problem for any relational database; the problem is querying it.
Financial data is, by its very nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing it.
Now this is all well and good, but when we need to access the transactions for a specific portfolio between the 1st of January 1999 and the 31st of December 2008, with all the related security information and prices, relational databases start to struggle. It is doable, of course, but if you can get the data out first, the CPU cycles your software spends on the computation are cheaper than your database’s.
What I want to be able to do is build a cache of the flattened (de-normalized) data and make it available to n clients at the same time.
The obvious solution is to build a massive time-ordered array of all the transactions and store it in memory. Considering the potential size of the array, something like memcached would be more appropriate.
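As a sketch of the idea (the record layout and field names here are my own invention, not a real schema), the cache is just the denormalized rows sorted by time; in practice the same structure would be serialized into memcached rather than held in a local list:

```python
from datetime import datetime

# Illustrative flattened records: each row already carries the related
# security and price fields, denormalized out of the relational model.
transactions = [
    {"id": "tx-003", "ts": datetime(2008, 12, 30, 11, 45),
     "security": "ACME", "price": 8.75, "quantity": 200},
    {"id": "tx-001", "ts": datetime(1999, 1, 4, 9, 30),
     "security": "ACME", "price": 10.5, "quantity": 100},
    {"id": "tx-002", "ts": datetime(2001, 6, 15, 14, 0),
     "security": "GLOBEX", "price": 22.0, "quantity": -50},
]

# The cache is simply the same records sorted by timestamp, so a record's
# position in the array follows time order.
cache = sorted(transactions, key=lambda t: t["ts"])
```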
Here comes the second problem. Assume a client calls your software with a couple of dates: you will be forced to loop over the whole array to find the chunk you need. Again doable, but inefficient.
An array is definitely the way to go for storing the data. What I need is an “index” which will tell me exactly where a specific transaction is stored in the array.
Here’s a possible solution I’ve been toying with today.
The idea comes from a technique used in videogames: voxel-based rendering and ray-casting. I’ll try to briefly explain what that is (at least as far as my idea is concerned):
Say you are trying to render a landscape. Voxel rendering makes this fairly simple: the scene is drawn column by column. For each column, a maximum Y value is stored (in a Y-buffer). The pixels of the column are filled in starting at the bottom of the screen, but only if their Y value is greater than the one stored in the Y-buffer. This makes removing non-visible parts of the scene very efficient and very easy.
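The column loop can be sketched like this (my own illustrative code, not taken from any particular engine; `heights` holds the projected top Y of each sample along the ray, front to back):

```python
def render_column(heights):
    """Voxel-style column rendering with a Y-buffer.

    heights: projected top-Y of each sample along the ray, front to back.
    Returns the (sample_index, y_from, y_to) spans actually drawn.
    """
    y_buffer = 0  # highest pixel filled so far in this column
    drawn = []
    for i, y in enumerate(heights):
        if y > y_buffer:                 # only pixels above the buffer are visible
            drawn.append((i, y_buffer, y))
            y_buffer = y                 # raise the buffer; anything lower is occluded
    return drawn
```

Samples hidden behind an earlier, taller sample are skipped without any per-pixel comparison, which is what makes the occlusion culling so cheap.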
While creating the array, we should also create a map (our Y-buffer). The map will simply contain a transaction’s unique id as key and its position in the array as value.
This way, when a client calls our API with the last unique identifier it received, we just have to make a very quick lookup in our Y-buffer for the array position, and this will point us exactly to what we need to return.
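The lookup described above can be sketched like this (data and names are invented for illustration):

```python
from datetime import datetime

# Hypothetical flattened, time-ordered transaction cache.
cache = [
    {"id": "tx-001", "ts": datetime(1999, 1, 4)},
    {"id": "tx-002", "ts": datetime(2001, 6, 15)},
    {"id": "tx-003", "ts": datetime(2008, 12, 30)},
]

# The "Y-buffer": transaction id -> position in the array, built
# alongside the array itself.
index = {t["id"]: pos for pos, t in enumerate(cache)}

def transactions_after(last_seen_id):
    # One O(1) dict lookup replaces a scan over the whole array.
    return cache[index[last_seen_id] + 1:]
```

A single hash lookup gives the starting position, and the rest is just a slice of the array.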
This could work, but it’s only a partial solution, or rather a solution for a very specific problem. What we really want is for the client to be able to hand in two date-time values (from and to).
We could build our index using time information. The problem is that transactions do not arrive at constant time intervals, which means that if we are called with a time value that is not in our index, we’d still end up looping over the values around the closest point to find the position of the first element in the series.
An alternative would be to build the index at the maximum precision allowed: say, a millisecond-level index where the last element’s array position is repeated until a new one appears. Doable, but very memory-hungry.
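A toy version of that maximum-precision index might look like this (my own sketch; timestamps are epoch milliseconds):

```python
# Sorted millisecond timestamps of the cached transactions (illustrative).
txn_ms = [1000, 1003, 1003, 1010]

first_ms = txn_ms[0]
index = []
pos = 0
for ms in range(first_ms, txn_ms[-1] + 1):
    # Repeat the last known array position until a newer transaction appears.
    while pos + 1 < len(txn_ms) and txn_ms[pos + 1] <= ms:
        pos += 1
    index.append(pos)

# index[ms - first_ms] now gives the array position of the last
# transaction at or before that millisecond, for any ms in range.
```

Even this toy example needs 11 index slots for 4 transactions; a ten-year span at millisecond precision would need on the order of 3 × 10^11 slots, which is exactly the memory problem.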
This is as far as I’ve gone. Any other ideas out there?
UPDATE: follow-up article