Maybe that’s a bit too harsh. Maybe recursive queries are not evil; it’s just the people who use them.
I spend quite a lot of time working with PostgreSQL users helping them optimise their queries. When I read that PostgreSQL 8.4 added support for recursive queries I knew that a whole new hellish chapter in my life would begin.
First off, what are recursive queries? From the PostgreSQL manual:
Recursive queries are typically used to deal with hierarchical or tree-structured data. A useful example is this query to find all the direct and indirect sub-parts of a product, given only a table that shows immediate inclusions:
WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
    SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part, p.quantity
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) as total_quantity
FROM included_parts
GROUP BY sub_part
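If you want to poke at the manual's query without a PostgreSQL server to hand, here is a minimal sketch that runs it through Python's built-in sqlite3 module. SQLite standing in for PostgreSQL is my assumption, as are the sample rows (wheel, frame, spoke, tire); only the query itself comes from the manual, with an ORDER BY added so the output is stable.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part TEXT, sub_part TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO parts (part, sub_part, quantity) VALUES (?, ?, ?)",
    [
        ("our_product", "wheel", 4),
        ("our_product", "frame", 1),
        ("wheel", "spoke", 32),
        ("wheel", "tire", 1),
    ],
)

QUERY = """
WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
    -- Anchor: the immediate sub-parts of the product.
    SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product'
  UNION ALL
    -- Recursive step: sub-parts of anything found so far.
    SELECT p.sub_part, p.part, p.quantity
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) AS total_quantity
FROM included_parts
GROUP BY sub_part
ORDER BY sub_part
"""

rows = conn.execute(QUERY).fetchall()
for sub_part, total_quantity in rows:
    print(sub_part, total_quantity)
```

Note that the query sums the quantities as listed; it does not multiply them down the tree.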
These structures are commonly used in relational databases. Just think about a threaded comment system for a blog or an industry classification for securities on multiple levels (financial data is what I’m most familiar with).
In this latter case you can imagine that you’ll hardly ever extract industry classification information by itself. It’s generally used as a sub-query to provide additional information about a security, a trade or what have you.
As I said earlier I have nothing against recursive queries per se. However, I can already see people out there creating monster-queries in production systems. The sort of monster query that needs to be executed 50 times a second, the one that just doesn’t work.
Storing and retrieving tree-structured data in SQL is one of my favourite questions in interviews. I always make a point of asking it. Not because it’s particularly challenging technically but because it will tell me a lot about the way the person I’m interviewing thinks about data.
The first part of the question is obviously to design a structure to hold threaded blog comments.
Whether you use a separate table to hold the relationship between nodes or a self-referencing parent id column in the same table I don’t really care. So long as you come up with an answer we can move on with the interview, because the answer to the next part of the question is what interests me.
I will now call your blog page with the ID from an element in your structure, any element. I want you to return instantly the ID of the root element for that branch of the tree.
- Root comment 1
  - Child 1.1
  - Child 1.2
    - Child 1.2.1
  - Child 1.3
- Root comment 2
  - Child 2.1
    - Child 2.1.1
      - Child 2.1.1.1
  - Child 2.2
I will call you with 2.1.1.1 and I want you to tell me 2, instantly. Feel free to change your database structure.
Their answer to this will tell me how they feel about de-normalisation and if they can think in those terms. We are talking about the daft requirements written by a product person who’s clearly gone quite mad. All he cares about is getting the data out quickly, nothing else.
The easiest de-normalised way out is to add a root id column to each comment row. It will make inserting new comments slower, because each insert has to look up its parent’s root, but it won’t require any recursion to go back to the top when selecting data.
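A minimal sketch of the root id idea, again using Python's sqlite3 rather than PostgreSQL so it runs anywhere. The table layout and the helper names add_comment and root_of are hypothetical, invented for illustration; a real system would likely do the same work in a trigger or in the application's insert path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE comments (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES comments(id),
        root_id   INTEGER,  -- de-normalised: top-level ancestor of this row
        body      TEXT NOT NULL
    )
    """
)


def add_comment(conn, body, parent_id=None):
    """Insert a comment, store its root id, and return the new id."""
    cur = conn.execute(
        "INSERT INTO comments (parent_id, body) VALUES (?, ?)",
        (parent_id, body),
    )
    new_id = cur.lastrowid
    if parent_id is None:
        root_id = new_id  # a top-level comment is its own root
    else:
        # The extra cost at insert time: one read of the parent's root id.
        row = conn.execute(
            "SELECT root_id FROM comments WHERE id = ?", (parent_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"unknown parent comment {parent_id}")
        root_id = row[0]
    conn.execute("UPDATE comments SET root_id = ? WHERE id = ?", (root_id, new_id))
    return new_id


def root_of(conn, comment_id):
    """Single primary-key lookup, no recursion."""
    row = conn.execute(
        "SELECT root_id FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()
    return row[0] if row else None


root_1 = add_comment(conn, "Root comment 1")
root_2 = add_comment(conn, "Root comment 2")
child_2_1 = add_comment(conn, "Child 2.1", root_2)
child_2_1_1 = add_comment(conn, "Child 2.1.1", child_2_1)
child_2_1_1_1 = add_comment(conn, "Child 2.1.1.1", child_2_1_1)
print(root_of(conn, child_2_1_1_1) == root_2)
```

The read stays a single primary-key lookup no matter how deep the thread goes, which is the whole point of paying for the extra read on insert.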
If all you can come up with is a recursive query I’ll be sorely disappointed. It’s cool and elegant but not nearly efficient enough for a high-availability production system.
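For reference, the recursive answer might look something like the sketch below: a self-referencing parent id column and a query that walks up from the given comment until it hits a row with no parent. It is run through Python's sqlite3 as a stand-in for PostgreSQL, and the table, the sample ids and the root_of helper are all assumptions of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES comments(id),"
    " body TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO comments (id, parent_id, body) VALUES (?, ?, ?)",
    [
        (1, None, "Root comment 1"),
        (2, 1, "Child 1.1"),
        (3, None, "Root comment 2"),
        (4, 3, "Child 2.1"),
        (5, 4, "Child 2.1.1"),
        (6, 5, "Child 2.1.1.1"),
    ],
)

ROOT_QUERY = """
WITH RECURSIVE ancestors(id, parent_id) AS (
    -- Anchor: the comment we were called with.
    SELECT id, parent_id FROM comments WHERE id = ?
  UNION ALL
    -- Recursive step: climb one level per iteration.
    SELECT c.id, c.parent_id
    FROM comments c JOIN ancestors a ON c.id = a.parent_id
)
SELECT id FROM ancestors WHERE parent_id IS NULL
"""


def root_of(conn, comment_id):
    """Walk up the tree; cost grows with the depth of the comment."""
    row = conn.execute(ROOT_QUERY, (comment_id,)).fetchone()
    return row[0] if row else None


print(root_of(conn, 6))
```

Every call repeats the climb, one join per level of nesting, which is exactly the work you don't want to be doing 50 times a second.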
Feel free to talk about recursive queries when I ask you this question, just remember to put the magic words “materialized view” in front of them. Then we can talk.
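To make the "materialized" part concrete, here is a rough sketch of the idea: run the recursive query once, store its result, and serve reads from the stored copy. Later PostgreSQL releases have CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW for this; SQLite has neither, so in this sketch (Python's sqlite3 again) a plain table refreshed by hand stands in. The comment_roots table and both helper functions are hypothetical names of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comments (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES comments(id),
        body      TEXT NOT NULL
    );
    -- Hypothetical cache table: one row per comment, holding its root.
    CREATE TABLE comment_roots (
        comment_id INTEGER PRIMARY KEY,
        root_id    INTEGER NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO comments (id, parent_id, body) VALUES (?, ?, ?)",
    [
        (1, None, "Root comment 1"),
        (2, 1, "Child 1.1"),
        (3, None, "Root comment 2"),
        (4, 3, "Child 2.1"),
        (5, 4, "Child 2.1.1"),
        (6, 5, "Child 2.1.1.1"),
    ],
)

REFRESH_SQL = """
WITH RECURSIVE tree(id, root_id) AS (
    -- Anchor: every top-level comment is its own root.
    SELECT id, id FROM comments WHERE parent_id IS NULL
  UNION ALL
    -- Recursive step: children inherit the root of their parent.
    SELECT c.id, t.root_id
    FROM comments c JOIN tree t ON c.parent_id = t.id
)
INSERT INTO comment_roots (comment_id, root_id)
SELECT id, root_id FROM tree
"""


def refresh_comment_roots(conn):
    """Rebuild the cache in one transaction; the recursion runs here only."""
    with conn:
        conn.execute("DELETE FROM comment_roots")
        conn.execute(REFRESH_SQL)


def root_of(conn, comment_id):
    """Read path: a single primary-key lookup against the cache."""
    row = conn.execute(
        "SELECT root_id FROM comment_roots WHERE comment_id = ?", (comment_id,)
    ).fetchone()
    return row[0] if row else None


refresh_comment_roots(conn)
print(root_of(conn, 6))
```

The trade-off is staleness: a comment added after the last refresh is invisible to root_of until the cache is rebuilt, so you have to decide how often to refresh.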
Let this be a warning to you. If I find a non-materialized/cached recursive query in your production code I will recursively kick you in the head.