Last week I wrote a rather long post asking for suggestions on how to cache and access a time series stored in an array in memory.
After some more research I have decided to go with the solution proposed by Uzair in his comment to the original post, with some minor tweaks.
My original idea involved creating a map in memory with one entry for each point in my series and then looking up the requested item in there.
This is very efficient when requests come in for a time which is in my index; if the point is not in there I'd have to run a binary search on the whole index, which, as Anon pointed out in his comment, is still very fast:
> if your ordered index is in memory, then even if you had 1 TB of memory (quite a lot) and each memory access took 1 microsecond (slow), each search would take a worst case of 40 accesses (2^40 ≈ 1 trillion), which is 40 microseconds. This is not very long at all. Assuming more reasonable ballpark figures of 4 GB of memory and 200-nanosecond access times gives you 32 accesses × 200 ns ≈ 6.4 microseconds, which is over 150,000 searches per second.
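The fallback Anon describes can be sketched with Python's `bisect` module; the names here are illustrative, and I'm assuming a series stored as a sorted list of `(timestamp, value)` pairs:

```python
import bisect

def lookup(series, timestamp):
    """Return the latest (timestamp, value) pair at or before `timestamp`,
    using binary search over the sorted in-memory series."""
    timestamps = [ts for ts, _ in series]
    pos = bisect.bisect_right(timestamps, timestamp)
    if pos == 0:
        return None  # requested time precedes the whole series
    return series[pos - 1]

series = [(0, 'a'), (10, 'b'), (20, 'c')]
print(lookup(series, 15))  # → (10, 'b'): an exact miss falls back to the preceding point
```

Each `bisect_right` call is O(log n), which is where the 40-accesses-for-a-trillion-entries figure comes from.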
The index will be a map of maps. Basically I’ll distribute a number of entries in my index at regular intervals. Each entry will contain a map of entries that preceded it in my time series.
Since I will know the first and last point in the series and the spacing *z* of my regular entries, when a request for a time *x* comes in I will be able to work out the next regular index entry *y*:

y = x - (x mod z) + z
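As a quick sanity check of the formula, here is a minimal sketch assuming integer timestamps (the function name is mine):

```python
def regular_entry(x, z):
    """Map a requested time x onto the next regular index entry,
    given regular entries spaced z apart: y = x - (x mod z) + z."""
    return x - (x % z) + z

print(regular_entry(37, 10))  # → 40
print(regular_entry(40, 10))  # → 50: an exact boundary hit maps to the following entry
```

Note that, as written, a request landing exactly on a regular entry maps to the *following* entry, which is consistent with each regular entry holding the points that preceded it.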
Once I have retrieved the map contained in my regular entry in the index, I can proceed and look up the requested value in there. If it's not a point in my map I will still have to do a binary search, but on a much smaller dataset, which I should be able to work on locally without having to rely on memcached over the network. The only "downside" of this solution is that I'll have to make at least 2 lookups in my index.
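Putting the pieces together, the map-of-maps scheme might look something like this sketch; `build_index` and `query` are hypothetical names, and I'm again assuming integer timestamps and a sorted input series:

```python
import bisect

def build_index(series, z):
    """Group a sorted (timestamp, value) series into buckets keyed by
    the regular entry y = ts - (ts mod z) + z that follows each point."""
    index = {}
    for ts, value in series:
        key = ts - (ts % z) + z
        index.setdefault(key, {})[ts] = value
    return index

def query(index, x, z):
    """Two-step lookup: find the bucket for x, then the point within it."""
    key = x - (x % z) + z
    bucket = index.get(key, {})
    if x in bucket:            # first lookup: exact hash hit
        return bucket[x]
    # fallback: binary search, but only within the small local bucket
    keys = sorted(bucket)
    pos = bisect.bisect_right(keys, x)
    return bucket[keys[pos - 1]] if pos else None

index = build_index([(1, 'a'), (5, 'b'), (12, 'c')], z=10)
print(query(index, 3, 10))  # → 'a': a miss falls back to the preceding point in the bucket
```

The worst-case binary search now runs over at most one bucket's worth of points rather than the whole series, at the cost of the second lookup mentioned above.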