Hierarchical Data Modeling


Choosing the Adjacency List pattern to solve hierarchical data modeling is one of many options to solve this kind of problem, some of them are Nested Set, Closure Table, Flat Table and Path Enumeration. In this post I will investigate the benefits of using this approach and also some problems.

How it works?

We simply create a table with a field that references a row in this same table. An example of a table like this would be:

CREATE TABLE categories (
  id serial PRIMARY KEY,
  parent_id integer,
  name varchar,
  FOREIGN KEY (parent_id) REFERENCES categories(id)
);

We can see that parent_id references an id in this same table. The rest in straighfoward, to create a category that does not have a parent we leave the field parent_id blank. Child categories can reference its parent by id.

Let’s see some examples:

INSERT into categories (name)
  VALUES ('sports');
INSERT INTO categories (parent_id, name)
  VALUES (1, 'football');
INSERT INTO categories (parent_id, name)
  VALUES (2, 'U-14');
INSERT INTO categories (parent_id, name)
  VALUES (2, 'U-17');
INSERT INTO categories (parent_id, name)
  VALUES (1, 'volleyball');
INSERT INTO categories (parent_id, name)
  VALUES (5, 'U-21');

SELECT * FROM categories;

id | parent_id |    name
----+-----------+------------
  1 |    (null) | sports
  2 |         1 | football
  3 |         2 | U-14
  4 |         2 | U-17
  5 |         1 | volleyball
  6 |         5 | U-21
(6 rows)

So, as we can see we have something like:

* sports
  * football
    * U-17
    * U-14
  * volleyball
    * U-21

Let’s see the advantages of this approach.

Advantages

It’s pretty easy to insert a new row in the hierarchy:

INSERT INTO categories (parent_id, name)
  VALUES (5, 'U-18');

It’s also easy to change the parent of a row (changing its subtree):

UPDATE categories
SET parent_id = 1
WHERE name = 'U-18';

Querying the direct parent or descendant of a category is straightforward:

SELECT c1.*
FROM categories c1 INNER JOIN categories c2 ON c2.parent_id = c1.id
WHERE c2.name = 'football';

 id | parent_id |  name
----+-----------+--------
  1 |    (null) | sports
(1 row)

SELECT c1.*
FROM categories c1 INNER JOIN categories c2 ON c1.parent_id = c2.id
WHERE c2.name = 'football';

 id | parent_id | name
----+-----------+------
  3 |         2 | U-14
  4 |         2 | U-17
(2 rows)

But as nothing is perfect, we also have some issues.

Issues

We can’t delete a row without updating first its descendants parent. For example, suppose we want do delete football. To do this and keep the tree consistent, we would have to update U-14 and U-17 parent_id to sports, using our queries to get the parent and direct descendants of a row.

UPDATE categories SET parent_id = (
  SELECT c1.id
  FROM categories c1
  INNER JOIN categories c2 ON c2.parent_id = c1.id
  WHERE c2.name = 'football'
)
WHERE id IN(
  SELECT c1.id
  FROM categories c1
  INNER JOIN categories c2 ON c1.parent_id = c2.id
  WHERE c2.name = 'football'
);

And after that we delete our row:

DELETE from categories where name = 'football';

It’s also not easy to delete a subtree. Suppose we want to delete the row that contains football and all its descendants. We would need to find all descendants of football and delete them in reverse order to avoid constraint errors. So suppose we have the following:

* sports
  * football
    * U-17
      * male
      * female
    * U-14
      * male
      * female
  * volleyball
    * U-21

To delete football subtree we would have to delete first the male and female from both U-14 and U-17. After that we would have to delete U-14 and U-17 and finally we would delete football. But what if we have 5 or 10 levels of nesting?

To solve this problem we would need to use recursive queries in a database like PostgreSQL to achieve something like:

WITH RECURSIVE CategoriesTree (id, parent_id, name) AS (
  SELECT *
  FROM categories
  WHERE name = 'football'
UNION ALL
  SELECT c.*
  FROM CategoriesTree ct INNER JOIN categories c
  ON (ct.id = c.parent_id)
)
DELETE FROM Categories WHERE id IN(
  SELECT id FROM CategoriesTree
);

The query above gets all descendants ids including the row we want to delete and pass its ids to DELETE, the database takes care of deleting the rows in the right order.

To query all the descendants we would also need to use recursive queries.

WITH RECURSIVE CategoriesTree (id, parent_id, name) AS (
  SELECT *, 0 as depth
  FROM categories
  WHERE name = 'sports'
UNION ALL
  SELECT c.*, ct.depth + 1 as depth
  FROM CategoriesTree ct INNER JOIN categories c
  ON (ct.id = c.parent_id)
)
SELECT * FROM CategoriesTree;

 id | parent_id |   name   | depth
----+-----------+----------+-------
  1 |    (null) | sports   |     0
  2 |         1 | football |     1
  3 |         2 | U-14     |     2
  4 |         2 | U-17     |     2
  7 |         3 | male     |     3
  8 |         3 | female   |     3
  9 |         4 | male     |     3
 10 |         4 | female   |     3

Wrap Up

As we saw, the Adjacency List pattern has good and not so good use cases and it’s important to consider what is your use case before adopting this solution. Keep in mind that there is no such thing as a perfect solution, just one that fits your problem really well.

TL;DR:

Use the Adjacency List if your use case uses mostly:

  • Inserts
  • Updates
  • Queries to get just the parent and/or direct descendants

Keep an eye on it if it uses mostly:

  • Deletes (row or subtree)
  • Queries to get all descendants

See you in the next post!