Nov 01

Not just because it’s vital for your business.
For most production systems the speed-bottleneck lies with accessing your data. Database are excellent for storing all your data and keeping it organised. However, when it comes to getting it out quickly, especially if you have lots of it, they are not the sharpest of tools.

That’s why I always try to press home the importance of keeping on top of your data.

Even though you think all your data is neatly organised in your perfectly structured database it keeps changing shape, or rather the understanding your database has of your data keeps changing.
For example in PostgreSQL the “shape” of your data is stored in a table called pg_statistics. Data is collected and stored there by analyze.

The query planner uses the data collected in pg_statistics to pick the most efficient way to run your queries. Unfortunately no system is perfect, even the best planner makes mistakes. You have to strike a balance between letting analyze collect as much data as possible to give your database a better understanding of your data and keeping it slim enough for it to be quick.

So as much as you can trust machines I suggest you try to keep on top of your data yourself.

There’s a couple of very simple ways to do that.

First. Keep an eye on your database logs, exactly like you do with your webserver logs. There’s a few open-source applications that can help you do that like Enterprise Postgres Query Analyser.
This will give you a basic understanding of which query/ies you will have to focus on.

Once you have an idea of which ones are the slowest queries you should keep an eye on the explain plan at regular intervals. In PostgreSQL I have a scheduled job that every day runs “EXPLAIN ANALYZE” on the heaviest queries in my system and compares the output with the previous day’s.

It’s a lot of work and you are probably better off confronting these problems as they come up and not waste valuable development time creating the most optimised database ever.
Setting up auto-vacuuming properly will keep you safe for a long while.

Tagged with:
Oct 30

Maybe that’s a bit too harsh, maybe recursive query are not evil, it’s just the people who use them.

PostgreSQL LogoI spend quite a lot of time working with PostgreSQL users helping them optimise their queries. When I read that PostgreSQL 8.4 added support for recursive queries I knew that a whole new hellish chapter in my life would begin.

First off. What are recursive queries: (from PostgreSQL manual)

Recursive queries are typically used to deal with hierarchical or tree-structured data. A useful example is this query to find all the direct and indirect sub-parts of a product, given only a table that shows immediate inclusions:

WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
SELECT sub_part, part, quantity FROM parts WHERE part = ‘our_product’
UNION ALL
SELECT p.sub_part, p.part, p.quantity
FROM included_parts pr, parts p
WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) as total_quantity
FROM included_parts
GROUP BY sub_part

These structures are commonly used in relational database. Just think about a threaded comment system for a blog or an industry classification for securities on multiple levels (financial data is what I’m most familiar with).

In this latter case you can image that you’ll hardly ever extract industry classification information by itself. It’s generally used as a sub-query to provide additional information about a security, a trade or what have you.

As I said earlier I have nothing against recursive queries per se. However, I can already see people out there creating monster-queries in production systems. The sort of monster query that needs to be executed 50 times a second, the one that just doesn’t work.

Storing and retrieving tree-structured data in SQL is one of my favourite questions in interviews. I always make a point of asking it. Not because it’s particularly challenging technically but because it will tell me a lot about the way the person I’m interviewing thinks about data.

The first part of the question is obviously do design a structure to hold threaded blog comments.

Whether you use a separate table to hold the relationship between nodes or a self-referencing parent id column in the same table I don’t really care. So long as you come up with an answer we can move on with the interview, because the answer to the next part of the question is what interests me.

I will now call your blog page with the ID from an element in your structure, any element. I want you to return instantly the ID of the root element for that branch of the tree.

- Root comment 1
   - Child 1.1
   - Child 1.2
      - Child 1.2.1
   - Child 1.3
- Root comment 2
   - Child 2.1
      - Child 2.1.1
         - Child 2.1.1.1
   - Child 2.2

I will call you with 2.1.1.1 and I want you to tell me 2, instantly. Feel free to change your database structure.

Their answer to this will tell me how they feel about de-normalisation and if they can think in those terms. We are talking about the daft requirements written by a product person who’s clearly gone quite mad. All he cares about is getting the data out quickly, nothing else.

Easiest de-normalised way out is to add a root id column in each comment row. It will make inserting new comments slower but it won’t require any recursion to go back to the top when selecting data.

If all you can come up with is recursive query I’ll be sorely disappointed. It’s cool and elegant but not nearly efficient enough for a high-availability production system.

Feel free to talk about recursive queries when I ask you this question, just remember to put the magic words “materialized view” in front of it. then we can talk.

Let this bet a warning to you. If I find a non-materialized/cached recursive query in your production code I will recursively kick you in the head.

Tagged with:
preload preload preload