Oct 27

Last week I wrote a rather long post asking for suggestions on how to cache and access a time series stored in an array in memory.

After some more research I have decided that the solution proposed by Uzair in his comment to the original post with some minor tweaks.

My originial idea involved creating a map in memory with one entry for each point in my series and then lookup the item requested in there.
This is very efficient when requests come in for a time which is in my index, if the point is not in there I’d have to run a binary search on the whole index. Which as Anon pointed out in his comment is still very fast.

if your ordered index is in-memory.

even if you had 1TB of memory (quite a lot), and each memory access took you 1 microsecond (slow),

each search would only take worse case of 40 accesses (2^40 = 1 trillion)
which would be 40 microseconds. this is not very long at all.

assuming some more reasonable ball-park figures of 4GB of memory, and
200 nanosecond access times, gives you

32 accesses * 200 nanos = 6 microseconds

this is over 150,000 searches per second.

The index will be a map of maps. Basically I’ll distribute a number of entries in my index at regular intervals. Each entry will contain a map of entries that preceded it in my time series.
Since I will know the first and last point in the series and the distribution of my regular entries *z* when a request *x* comes in I will be able to work out where is my next regular index entry *y*)

y = x – mod(x/z) + z

Once I have retrieved the map contained in my regular entry in the index I can proceed and lookup the value requested in there. If it’s not a point in my map I will still have to do a binary search but on a much smaller dataset – which I should be able to work on locally without having to rely on memcached over the network. The only “downside” of this solution is that I’ll have to make at least 2 lookups in my index.

Tagged with:
Oct 24
This is a problem I’ve been trying to deal with for a while now. In my case I’m talking about financial data. Most of it is – one way or another – a time series.
Take for example transactions. A position management system needs to be able to store millions of a transactions for a number of portfolios. Storing the data is not a problem for any relational database – The problem is querying it.
Financial data is, by its own nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing your data.
Now this is all good and well but when we need to access the transactions for a specific portfolio between the first of January 1999 and the 31st of December 2008 with all the related security information and prices and perform some analysis relational databases start to struggle.
Of course it is doable, but my argument against it is that CPU cycles used by your software to do the computation are less expensive than you database’s.
What I’m trying to achieve here is cache a time-series of transactions in a memory structure thus giving n clients access to it without having to constantly refer back to the database.
Client tells me what is the last transaction it knows about and all newer data is returned from the structure in memory. Thing is, I want to do it without having to loop over my Array/Map of transaction and avoid wasting memory by duplicating data.
Here’s a possible solution I’ve been toying with today.
The idea comes from a rendering technique used in videogames called voxel-based rendering, specifically an old game called Outcast.
I’ll try and briefly explain what that is (at least as far as my idea is concerned):
Say you are trying to render a landscape. Voxel rendering makes this fairly simple and is achieved column by column. For each column, a maximum Y value was stored (in a Y-Buffer). The pixels of the column in question are filled in starting at the bottom of the screen, only if their Y value was greater than the one stored in the Y-Buffer. This rendering method made the process of removing non-visible parts of the scene very efficient and very easy.
Instead of caching array of transactions based on a time value of variable granularity we could just create a massive ArrayList of all the transactions for any period of time. While creating the array we should also create a HashMap (our Y-buffer). The hasmap will simply contain a transaction unique id as key and the Position in the array as value.
This way when the client calls the APIs with the latest unique identifier it has received we’ll just have to make a very quick lookup in our Y-Buffer for the array position and then use the subList method of the List interface to retrieve everything in the transaction array from the position to the end (size()))This is a problem I’ve been trying to deal with for a while now. In my case I’m talking about financial data. Most of it is – one way or another – a time series.

This is something I’ve been thinking about for a while. How can I cache and have nearly-instant access to any data point within the time series without having to splash out on expensive time series databases. I have an half cooked solution and decided to write this post to see if anybody out there has worked on a similar problem and has a better solution to share.

Specifically I’m trying to work with financial data. Take for example transactions. A position management system needs to be able to store millions of a transactions for a number of portfolios. Storing the data is not a problem for any relational database – The problem is querying it.

Financial data is, by its own nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing it.

Now this is all good and well but when we need to access the transactions for a specific portfolio between the first of January 1999 and the 31st of December 2008 with all the related security information and prices relational databases start to struggle. Of course it is doable, but if you can get the data out first CPU cycles used by your software to do the computation are less expensive than you database’s.

What I want to be able to do is build a cache of the flattened (de-normalized) data and make it available to n clients at the same time.

The obvious solution is to build a massive time-ordered array of all the transactions and store it in memory. Considering the potential size of the array something like memcached would be more appropriate.
Here comes the second problem. Assume a client calls your software with a couple of dates, you will be forced to loop over the whole array to find the chunk you need. Again, doable, but inefficient.

Array is definitely the way to go to store the data. What I need is an “index” which will tell me exactly where a specific transaction is stored in the array.

Here’s a possible solution I’ve been toying with today.

The idea comes from a technique used in videogames called voxel-based-rending and ray-casting. I’ll try and briefly explain what that is (at least as far as my idea is concerned):

Say you are trying to render a landscape. Voxel rendering makes this fairly simple and is achieved column by column. For each column, a maximum Y value was stored (in a Y-Buffer). The pixels of the column in question are filled in starting at the bottom of the screen, only if their Y value was greater than the one stored in the Y-Buffer. This rendering method made the process of removing non-visible parts of the scene very efficient and very easy.

rendering ray casting voxels

While creating the array we should also create a map (our Y-buffer). The map will simply contain a transaction unique id as key and the position in the array as value.

This way when the client calls our APIs with the last unique identifier it has received we’ll just have to make a very quick lookup in our Y-Buffer for the array position and this will point us exactly to what we need to return.

This could work, but it’s only a partial solution, or rather a solution for a very specific problem. What we really want is for the client to be able to hand in 2 date-time values (from and to).
We could create our index using time information. The problem is that transactions are not at constant time-intervals. Which means that if we are called with a time value that is not in our index we’d still end up looping over the values around the closest point to find the position of the first element in the series.

An alternative would be to build our index using the maximum precision allowed. say an index of milliseconds where the last element position in the array is repeated in the index until a new one appears. Doable but very memory consuming.

This is as far as I’ve gone. Any other ideas out there?

UPDATE: follow-up article

Tagged with:
Nov 26

Another post about my experiences with startups.

The one important thing I have ascertained about programmers during my career is that good ones are hard to find. Furthermore skillful developers tend to be ego-maniacs (I am no exception). The people generally portrayed in films as shy “nerds” are really expansive, loud and boastful persons. After all why shouldn’t they be; a startup’s most important assets are its people, and developers are a vital part of the organism. On the other hand it is also true that everybody is important but nobody is necessary.

In all the startups I’ve worked for, particular emphasis was put in the recruitment process, especially whilst hiring members of the initial core team. Which doesn’t mean only finding smart competent people, but also making sure that they’ll fit in with the rest of the team. Therefore it is very important to have every member of the initial team meet the candidate and make sure that they will be comfortable working with them; even people from different departments, marketing directors should meet programmers before an offer is made.

The other thing you have to make absolutely sure of is whether your newly-hired programmer can accept the technical leader’s style. The manager/CTO (call it whatever you want) has to be flexible, but at the end of the day he/she is still running the show and you can’t really afford to have a small team destabilized, especially when the disturbance undermines the leader’s authority.

I have had to interact with quite a few managers, each one with a different approach to the development process. The two most common and distinguishable modus operandi are certainly the autarchic and the unconstrationist (new word I just came up with).
The autarchic wants to be in full control every step of the way and put a two cents in every decision. The “libertine”, on the other hand, will take care of the big picture and leave the individual programmers to make decisions regarding the piece of functionality they have been assigned.

Neither attitude is wrong. However, I generally adhere to the more liberal approach, hence the title of the post. Being in charge of something every step of the way certainly helps prevent bugs from being introduced, if not because of a CTO’s superior experience because two brains are generally better than one. Anyhow if you try to be involved at every level of the development, from the requirements gathering and analysis to the practical development you’ll quickly end up being overwhelmed with work and with a crew perhaps not prepared to scale and be in charge of new hires themselves.

Contrarily leaving developers some independence in their restricted realm helps boost you team’s morale and prepares a normal programmers to be in change of somebody in the future as the company grows. Admittedly this approach will leave the whole system more exposed to potential bugs, introduced either for a technical mistake or lack of wide-angle-view-to-future-developments of the programmer in charge. The second drawback of this method is that you’ll most likely end up with a code library written with very different styles, hence harder to maintain and get used to for new people. Be that as it may, I still stick with “Celebrate Diversity”.
There’s always going to be a better tool or programming style than the one used when developing something. However, asking a proud programmer to change his code and do whatever you asked for exactly your way is worse than a punch in the guts.

As I said before neither approach is wrong and it’s purely a personal decision of the person in charge. It is a very difficult balance to strike. Ego-feeding is good, keeps the spirits up and developers are more likely to work harder and better. Unfortunately, programmers are human and as you leave somebody in charge of a single piece of functionality there might be nobody to find that stupid bug which could have been caught in a second by an additional brain.

Tagged with:
Nov 22

During my short yet intense career I have worked for quite a few startups and small businesses. I want to talk about a phase most startups go through – Our application is crap, lets trash it and rebuild it from scratch.

Hindsight is something few people have and wisdom is just a myth, just a fancy name we have come up with to call our mistakes.
Nevertheless when a company outgrows the chrysalis stage and starts producing revenue and growing accordingly hindsight is one of management’s favorite words. It’s widely abused in every meeting. Who hasn’t heard at least once “With the benefit of hindsight we could have…”.
This special word makes the eager-developers brain click. There’s our chance to build the perfect application. We know what to avoid, we have wasted countless hours solving all the problems we found in our path before and we’re now ready to whip out the ultimate system.

Perhaps with a bit of hindsight, wisdom, or a wee little bit of both, our developers would realize that every time they start developing an application from scratch a whole new bunch of complications will crop up and will have to be solved. While this happens the management team who has promised to deliver in time to the big bosses is obviously breathing heavily down the programmers’ skinny necks.

This happened to me once, although we didn’t go as far as actually begin the development a countless amount of hours was wasted in meetings to trying and figure out what we were going to do and how.
The issue for many startups is that very often the software is developed hastily for lack of funding or time constraints. This generally leads to a structureless applications patched together at the last minute. That’s how it is and always going to be – and in my personal experience it’s definitely better this way than running out of funding because the entire process is managed by a megalomaniac-developer with serious ego-related issues.
Once the application is ready and the product is launched there’s never going to be time to stop. Look back. Amend emissions and non-blocking bugs until it’s too late. That’s just the natural process.

After 1 year of life the initial product will most likely be completely unrecognizable and changed/badly-patched in many of its core components. This will make keeping track of what’s exactly going on inside the software harder.
So now what.
The approach I’ve found most feasible and realistic is to gradually fix and produce documentation as the development progresses. A software, especially a website, will never stop evolving and the amount of feature-requests and bugs to fix will only increase with time, stopping the entire machinery to go back and fix is just not an option. However, whilst new functionalities are being added, they will probably have to be integrated with old ones. This is where wisdom can assist you. It is definitely worth spending a few hours more on each new piece of code to patch up the old part it has to touch. Gradually your development process will fall back on track and there won’t be any more wild panic because of obsolete or faulty routines and functions.

If, on the other hand, your software is rotten to the core and cannot be rescued, well, tough sob, you should probably have thought it through more carefully before putting your hands on the keyboard.

Having said that I’d like to add that rewriting an application is not entirely impossible, but it’s most likely going to be a colossal process which will require large sums of money.

Tagged with:
preload preload preload