Nov 03

If you Google for this you’ll find that the easiest way to provide a full audit of everything that happened in your database is to create a duplicate of each table you need history and insert a copy of the record you are changing in there for every update.

For example if you have a table called client

CREATE TABLE client (
  id INTEGER,
  firstname VARCHAR(255),
  lastname VARCHAR(255),
  datecreated TIMESTAMP
);

You will also create a table called client_history

CREATE TABLE client_history (
  id INTEGER,
  firstname VARCHAR(255),
  lastname VARCHAR(255),
  datecreated TIMESTAMP,
  history_creation TIMESTAMP,
  history_user INTEGER – A foreign key to the user who updated this record if required
);

You’ll then attach an onUpdate trigger to the client table inserting all the data in your history table. This solution gives you the ability to keep track of everything that happens in your database, when it happened and who-done-it.
What this solution doesn’t deal with is schema changes. Also querying the state of a record at a point in time to join it with other tables might be a bit tricky.

Let’s start from the first point. Schema changes.
The most obvious and simple solution is to put in place a proper process for all database schema changes. i.e. whoever touches the schema is also in charge of updating the history table structure and the trigger on the main table. Doable but a bit too prone to human error if you ask me.

What I generally do is use a small function/stored procedure to create the history on a table. This procedure is in charge of both creating the history table the first time it’s called and updating the structure if the main table has changed.
In this example I’m going to use PostgreSQL. First because it’s a database I’m familiar with and more importantly being open-srouce you can just download it and try this yourself.

My example here is a bit verbose but it serves its purpose. There are many database-specific instructions you could use to extract a table structure or perform some of the simple tasks of this function. However, what I’m trying to demonstrate is the concept and not PostgreSQL trickery.
Most relational databases store the schema information in internal tables (pg_attribute, pg_class and pg_namespace in this case) so that’s what we are going to use to read the structure of our original table.

– This function expects as a parameter the name of the table you want to
– create history for and a unique identifier for the record.
– in my case all tables have an ID column. Alternatively you could hand
– as a parameter also the name of the unique id column
CREATE OR REPLACE FUNCTION create_history(tablename VARCHAR(255), recordid INTEGER) RETURNS BOOLEAN AS $$
DECLARE
  vhisttablename NAME; – history table name
  vhistfieldcount INTEGER;  – number of fields in history table
  vtablefieldcount INTEGER; – number of fields in main table
  vtmprowcount INTEGER; – temporary variable to store query results
  vcurfield RECORD; – variable to loop over fields of main table to create history
  vhisttablesql TEXT; – sql to create history table
  vfieldlist TEXT; – list fo fields in main table
  vhisttablefields TEXT; – list of fields in history table
  vhistinsertsql TEXT; – SQL for insert statement
  vtmptablename VARCHAR(255); – temp variable to check if history exists
BEGIN
  vhisttablename := tablename||‘_hist’;

  – Check if the history table exists, if not we need to create it
  SELECT INTO vtmptablename relname
  FROM pg_class
  WHERE relname = vhisttablename;

  vfieldlist := ;
  vhisttablefields := ;

  – count the fields in the history/current table if history exists
  IF vtmptablename IS NOT NULL
  THEN
    SELECT INTO vhistfieldcount count(*)
    FROM pg_attribute AS a
    INNER JOIN pg_class AS c ON (c.oid = a.attrelid)
    INNER JOIN pg_namespace AS n ON (n.oid = c.relnamespace)
    WHERE a.attnum > 0
    AND c.relname = vhisttablename – history table name
    AND NOT a.attisdropped
    AND pg_table_is_visible(c.oid);

    SELECT INTO vtablefieldcount count(*)
    FROM pg_attribute AS a
    INNER JOIN pg_class AS c ON (c.oid = a.attrelid)
    INNER JOIN pg_namespace AS n ON (n.oid = c.relnamespace)
    WHERE a.attnum > 0
    AND c.relname = tablename – main table
    AND NOT a.attisdropped
    AND pg_table_is_visible(c.oid);
  END IF;

  – Get all the attributes and their type from the original table
  FOR vcurfield IN
  SELECT a.attname AS COLUMN, format_type(a.atttypid, a.atttypmod) AS datatype
  FROM pg_attribute AS a
  INNER JOIN pg_class AS c ON (c.oid = a.attrelid)
  INNER JOIN pg_namespace AS n ON (n.oid = c.relnamespace)
  WHERE a.attnum > 0
  AND c.relname = tablename
  AND NOT a.attisdropped
  AND pg_table_is_visible(c.oid)
  LOOP
    – populate lists of fields both for history creation and select from main table
    vhisttablefields := vhisttablefields||vcurfield.COLUMN||‘ ‘||vcurfield.datatype||‘ NULL, ‘;
    vfieldlist := vfieldlist||vcurfield.COLUMN||‘, ‘;

    – If the history table exists and the number of fields is different
    – from the main table (+3 as we add a timestamp, a user field and an history unique id)
    IF vtmptablename IS NOT NULL AND vtablefieldcount+3 <> vhistfieldcount
    THEN
      – make sure that this is the missing field in the history table
       PERFORM a.attname
       FROM pg_attribute AS a
       INNER JOIN pg_class AS c ON (c.oid = a.attrelid)
       INNER JOIN pg_namespace AS n ON (n.oid = c.relnamespace)
       WHERE a.attnum > 0
       AND c.relname = vhisttablename
       AND NOT a.attisdropped
       AND a.attname = vcurfield.COLUMN
       AND pg_table_is_visible(c.oid);

       GET DIAGNOSTICS vtmprowcount = ROW_COUNT;

      – If it is then generate the SQL and execute the alter table
       IF vtmprowcount = 0
       THEN
         vhisttablesql := ‘ALTER TABLE ‘||vhisttablename||‘ ADD COLUMN ‘||vcurfield.COLUMN||‘ ‘||vcurfield.datatype||‘ NULL;’;
         EXECUTE vhisttablesql;
       END IF;
     END IF;
  END LOOP;

  – The history table doesn’t exist. Create it
  IF vtmptablename IS NULL OR vtmptablename =
  THEN
    vhisttablesql := ‘CREATE TABLE ‘||vhisttablename||‘ ( histid SERIAL PRIMARY KEY, ‘;

    vhisttablesql := vhisttablesql||vhisttablefields||‘ history_creation TIMESTAMP NOT NULL DEFAULT now() );’;

    EXECUTE vhisttablesql;

    – create indexes
    vhisttablesql := ‘ CREATE INDEX idx_’||vhisttablename||‘_1 ON ‘||vhisttablename||‘ (id); ‘;
    EXECUTE vhisttablesql;

    vhisttablesql := ‘ CREATE INDEX idx_’||vhisttablename||‘_2 ON ‘||vhisttablename||‘ (history_creation); ‘;
    EXECUTE vhisttablesql;
  END IF;

  – Proceed with the history creation
  vhistinsertsql := ‘INSERT INTO ‘||vhisttablename||‘ (‘||vfieldlist||‘ history_creation) SELECT ‘||vfieldlist||‘ now() FROM ‘||tablename||‘ WHERE id=’||recordid||‘;’;

  RAISE NOTICE ‘Executing %’, vhistinsertsql;
  EXECUTE vhistinsertsql;

  GET DIAGNOSTICS vtmprowcount = ROW_COUNT;
  IF vtmprowcount > 0
  THEN
    RETURN TRUE;
  ELSE
    RETURN FALSE;
  END IF;
END;
$$
LANGUAGE plpgsql;

You’ll notice this function only adds fields to the history table. That’s because even if we remove a field from the original table we still want to keep track of it in our history.

Now on to querying the history to retrieve the state of a record at a specific date.

This is a bit tricky. At the moment as far as I know only Oracle has the ability return a “virtual table” from a function. PostgreSQL can return a RECORD variable – which is great to use inside functions but once outside it loses its structure and turns into a comma separated list of values – or a %ROWTYPE you can define.
This unfortunately means that you’d have to create a custom function for each history table to returns a SET OF client – in our case.

What we are trying to achieve is something like what this query does.

SELECT h.id, firstname, lastname, datecreated, hist_creation
FROM client_hist AS h
INNER JOIN (
  SELECT min(history_creation) AS mintstamp, id AS id
  FROM client_hist
  WHERE history_creation > THE-TIME-YOU-NEED
  AND id= YOUR-RECORD-ID
  GROUP BY id
) AS m ON (m.id = h.id AND m.mintstamp = h.history_creation);

Basically the next record in the history after the time specified.

At this point so far as I can see we have two options. Either create a function that returns a specific type of ROW, or write a more generic history function to return only the id of the history record we need and not the whole row (If you look at the history creation function we are putting a histid column in there).

You could use this function in your queries to get the history record id out like this.

– This function returns the histid you need to look at in your history table
CREATE OR REPLACE FUNCTION select_hist_id(tablename VARCHAR(255), recordid INTEGER, tstamp TIMESTAMP) RETURNS INTEGER AS $$
DECLARE
  curs REFCURSOR;
  vid INTEGER;
BEGIN
  OPEN curs FOR EXECUTE ‘SELECT h.histid
  FROM ‘
||tablename||‘_hist AS h
  INNER JOIN (
    SELECT min(history_creation) AS mintstamp, id
    FROM ‘
||tablename||‘_hist
    WHERE history_creation > ‘
||tstamp||
    AND id=’
||recid||
    GROUP BY id
  ) AS m ON (m.id = h.id AND m.mintstamp = h.history_creation);’
;

  FETCH curs INTO vid;

  RETURN vid;
END;
$$
LANGUAGE plpgsql;

– At this point you can just get ids like this
– records in client as of this time 2009-11-02 21:13:41.552601
SELECT select_hist_id(‘client’, id, ’2009-11-02 21:13:41.552601′::TIMESTAMP) FROM client;

I’m sure you can figure out the rest.

Be careful this is very slow on large data sets. If you are planning to work on millions of records then you should consider building a history lookup function for each table either defining data types for your RECORD or using OUT variables.

CREATE FUNCTION foo(recordid int, firstname OUT VARCHAR(10))
Tagged with:
Nov 01

Not just because it’s vital for your business.
For most production systems the speed-bottleneck lies with accessing your data. Database are excellent for storing all your data and keeping it organised. However, when it comes to getting it out quickly, especially if you have lots of it, they are not the sharpest of tools.

That’s why I always try to press home the importance of keeping on top of your data.

Even though you think all your data is neatly organised in your perfectly structured database it keeps changing shape, or rather the understanding your database has of your data keeps changing.
For example in PostgreSQL the “shape” of your data is stored in a table called pg_statistics. Data is collected and stored there by analyze.

The query planner uses the data collected in pg_statistics to pick the most efficient way to run your queries. Unfortunately no system is perfect, even the best planner makes mistakes. You have to strike a balance between letting analyze collect as much data as possible to give your database a better understanding of your data and keeping it slim enough for it to be quick.

So as much as you can trust machines I suggest you try to keep on top of your data yourself.

There’s a couple of very simple ways to do that.

First. Keep an eye on your database logs, exactly like you do with your webserver logs. There’s a few open-source applications that can help you do that like Enterprise Postgres Query Analyser.
This will give you a basic understanding of which query/ies you will have to focus on.

Once you have an idea of which ones are the slowest queries you should keep an eye on the explain plan at regular intervals. In PostgreSQL I have a scheduled job that every day runs “EXPLAIN ANALYZE” on the heaviest queries in my system and compares the output with the previous day’s.

It’s a lot of work and you are probably better off confronting these problems as they come up and not waste valuable development time creating the most optimised database ever.
Setting up auto-vacuuming properly will keep you safe for a long while.

Tagged with:
Oct 30

Maybe that’s a bit too harsh, maybe recursive query are not evil, it’s just the people who use them.

PostgreSQL LogoI spend quite a lot of time working with PostgreSQL users helping them optimise their queries. When I read that PostgreSQL 8.4 added support for recursive queries I knew that a whole new hellish chapter in my life would begin.

First off. What are recursive queries: (from PostgreSQL manual)

Recursive queries are typically used to deal with hierarchical or tree-structured data. A useful example is this query to find all the direct and indirect sub-parts of a product, given only a table that shows immediate inclusions:

WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
SELECT sub_part, part, quantity FROM parts WHERE part = ‘our_product’
UNION ALL
SELECT p.sub_part, p.part, p.quantity
FROM included_parts pr, parts p
WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) as total_quantity
FROM included_parts
GROUP BY sub_part

These structures are commonly used in relational database. Just think about a threaded comment system for a blog or an industry classification for securities on multiple levels (financial data is what I’m most familiar with).

In this latter case you can image that you’ll hardly ever extract industry classification information by itself. It’s generally used as a sub-query to provide additional information about a security, a trade or what have you.

As I said earlier I have nothing against recursive queries per se. However, I can already see people out there creating monster-queries in production systems. The sort of monster query that needs to be executed 50 times a second, the one that just doesn’t work.

Storing and retrieving tree-structured data in SQL is one of my favourite questions in interviews. I always make a point of asking it. Not because it’s particularly challenging technically but because it will tell me a lot about the way the person I’m interviewing thinks about data.

The first part of the question is obviously do design a structure to hold threaded blog comments.

Whether you use a separate table to hold the relationship between nodes or a self-referencing parent id column in the same table I don’t really care. So long as you come up with an answer we can move on with the interview, because the answer to the next part of the question is what interests me.

I will now call your blog page with the ID from an element in your structure, any element. I want you to return instantly the ID of the root element for that branch of the tree.

- Root comment 1
   - Child 1.1
   - Child 1.2
      - Child 1.2.1
   - Child 1.3
- Root comment 2
   - Child 2.1
      - Child 2.1.1
         - Child 2.1.1.1
   - Child 2.2

I will call you with 2.1.1.1 and I want you to tell me 2, instantly. Feel free to change your database structure.

Their answer to this will tell me how they feel about de-normalisation and if they can think in those terms. We are talking about the daft requirements written by a product person who’s clearly gone quite mad. All he cares about is getting the data out quickly, nothing else.

Easiest de-normalised way out is to add a root id column in each comment row. It will make inserting new comments slower but it won’t require any recursion to go back to the top when selecting data.

If all you can come up with is recursive query I’ll be sorely disappointed. It’s cool and elegant but not nearly efficient enough for a high-availability production system.

Feel free to talk about recursive queries when I ask you this question, just remember to put the magic words “materialized view” in front of it. then we can talk.

Let this bet a warning to you. If I find a non-materialized/cached recursive query in your production code I will recursively kick you in the head.

Tagged with:
Oct 27

Last week I wrote a rather long post asking for suggestions on how to cache and access a time series stored in an array in memory.

After some more research I have decided that the solution proposed by Uzair in his comment to the original post with some minor tweaks.

My originial idea involved creating a map in memory with one entry for each point in my series and then lookup the item requested in there.
This is very efficient when requests come in for a time which is in my index, if the point is not in there I’d have to run a binary search on the whole index. Which as Anon pointed out in his comment is still very fast.

if your ordered index is in-memory.

even if you had 1TB of memory (quite a lot), and each memory access took you 1 microsecond (slow),

each search would only take worse case of 40 accesses (2^40 = 1 trillion)
which would be 40 microseconds. this is not very long at all.

assuming some more reasonable ball-park figures of 4GB of memory, and
200 nanosecond access times, gives you

32 accesses * 200 nanos = 6 microseconds

this is over 150,000 searches per second.

The index will be a map of maps. Basically I’ll distribute a number of entries in my index at regular intervals. Each entry will contain a map of entries that preceded it in my time series.
Since I will know the first and last point in the series and the distribution of my regular entries *z* when a request *x* comes in I will be able to work out where is my next regular index entry *y*)

y = x – mod(x/z) + z

Once I have retrieved the map contained in my regular entry in the index I can proceed and lookup the value requested in there. If it’s not a point in my map I will still have to do a binary search but on a much smaller dataset – which I should be able to work on locally without having to rely on memcached over the network. The only “downside” of this solution is that I’ll have to make at least 2 lookups in my index.

Tagged with:
Oct 24
This is a problem I’ve been trying to deal with for a while now. In my case I’m talking about financial data. Most of it is – one way or another – a time series.
Take for example transactions. A position management system needs to be able to store millions of a transactions for a number of portfolios. Storing the data is not a problem for any relational database – The problem is querying it.
Financial data is, by its own nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing your data.
Now this is all good and well but when we need to access the transactions for a specific portfolio between the first of January 1999 and the 31st of December 2008 with all the related security information and prices and perform some analysis relational databases start to struggle.
Of course it is doable, but my argument against it is that CPU cycles used by your software to do the computation are less expensive than you database’s.
What I’m trying to achieve here is cache a time-series of transactions in a memory structure thus giving n clients access to it without having to constantly refer back to the database.
Client tells me what is the last transaction it knows about and all newer data is returned from the structure in memory. Thing is, I want to do it without having to loop over my Array/Map of transaction and avoid wasting memory by duplicating data.
Here’s a possible solution I’ve been toying with today.
The idea comes from a rendering technique used in videogames called voxel-based rendering, specifically an old game called Outcast.
I’ll try and briefly explain what that is (at least as far as my idea is concerned):
Say you are trying to render a landscape. Voxel rendering makes this fairly simple and is achieved column by column. For each column, a maximum Y value was stored (in a Y-Buffer). The pixels of the column in question are filled in starting at the bottom of the screen, only if their Y value was greater than the one stored in the Y-Buffer. This rendering method made the process of removing non-visible parts of the scene very efficient and very easy.
Instead of caching array of transactions based on a time value of variable granularity we could just create a massive ArrayList of all the transactions for any period of time. While creating the array we should also create a HashMap (our Y-buffer). The hasmap will simply contain a transaction unique id as key and the Position in the array as value.
This way when the client calls the APIs with the latest unique identifier it has received we’ll just have to make a very quick lookup in our Y-Buffer for the array position and then use the subList method of the List interface to retrieve everything in the transaction array from the position to the end (size()))This is a problem I’ve been trying to deal with for a while now. In my case I’m talking about financial data. Most of it is – one way or another – a time series.

This is something I’ve been thinking about for a while. How can I cache and have nearly-instant access to any data point within the time series without having to splash out on expensive time series databases. I have an half cooked solution and decided to write this post to see if anybody out there has worked on a similar problem and has a better solution to share.

Specifically I’m trying to work with financial data. Take for example transactions. A position management system needs to be able to store millions of a transactions for a number of portfolios. Storing the data is not a problem for any relational database – The problem is querying it.

Financial data is, by its own nature, relational. This means, as far as I’m concerned, that you’d always want to store your raw data in a relational model without de-normalizing it.

Now this is all good and well but when we need to access the transactions for a specific portfolio between the first of January 1999 and the 31st of December 2008 with all the related security information and prices relational databases start to struggle. Of course it is doable, but if you can get the data out first CPU cycles used by your software to do the computation are less expensive than you database’s.

What I want to be able to do is build a cache of the flattened (de-normalized) data and make it available to n clients at the same time.

The obvious solution is to build a massive time-ordered array of all the transactions and store it in memory. Considering the potential size of the array something like memcached would be more appropriate.
Here comes the second problem. Assume a client calls your software with a couple of dates, you will be forced to loop over the whole array to find the chunk you need. Again, doable, but inefficient.

Array is definitely the way to go to store the data. What I need is an “index” which will tell me exactly where a specific transaction is stored in the array.

Here’s a possible solution I’ve been toying with today.

The idea comes from a technique used in videogames called voxel-based-rending and ray-casting. I’ll try and briefly explain what that is (at least as far as my idea is concerned):

Say you are trying to render a landscape. Voxel rendering makes this fairly simple and is achieved column by column. For each column, a maximum Y value was stored (in a Y-Buffer). The pixels of the column in question are filled in starting at the bottom of the screen, only if their Y value was greater than the one stored in the Y-Buffer. This rendering method made the process of removing non-visible parts of the scene very efficient and very easy.

rendering ray casting voxels

While creating the array we should also create a map (our Y-buffer). The map will simply contain a transaction unique id as key and the position in the array as value.

This way when the client calls our APIs with the last unique identifier it has received we’ll just have to make a very quick lookup in our Y-Buffer for the array position and this will point us exactly to what we need to return.

This could work, but it’s only a partial solution, or rather a solution for a very specific problem. What we really want is for the client to be able to hand in 2 date-time values (from and to).
We could create our index using time information. The problem is that transactions are not at constant time-intervals. Which means that if we are called with a time value that is not in our index we’d still end up looping over the values around the closest point to find the position of the first element in the series.

An alternative would be to build our index using the maximum precision allowed. say an index of milliseconds where the last element position in the array is repeated in the index until a new one appears. Doable but very memory consuming.

This is as far as I’ve gone. Any other ideas out there?

UPDATE: follow-up article

Tagged with:
preload preload preload