Virtual columns in Postgres


It’s probably a good idea to start saying that Postgres doesn’t have the concept of virtual columns (or generated columns) as other databases does, for example, MySQL. With this in mind, let’s see what we can do to at least simulate this functionality.

To start the discussion, let’s first setup our example and explain the situation:

CREATE TABLE users (
  user_id serial PRIMARY KEY,
  name varchar,
  registration varchar
);

INSERT INTO users(name, registration) VALUES
  ('john', '1234-5'),
  ('alice', '12-345'),
  ('bob', '36872'),
  ('jean', '551-34');

SELECT * FROM users;

 user_id | name  | registration
---------+-------+--------------
       1 | john  | 1234-5
       2 | alice | 12-345
       3 | bob   | 36872
       4 | jean  | 551-34
(4 rows)

As we can see we have users in our database and each user have a registration number. Suppose this is a legacy system and some users were created with duplicated registration numbers, the only difference is the dash between numbers, that in this case shouldn’t make a difference.

So, john and alice have the same registration number 12345 and we should inform this in our query. We can start with a JOIN to solve our problem:

SELECT u1.name, u1.registration, u2.name, u2.registration
FROM users AS u1 INNER JOIN users AS u2
ON replace(u1.registration, '-', '') = replace(u2.registration,'-', '')
WHERE u1.user_id <> u2.user_id;

name  | registration | name  | registration
-------+--------------+-------+--------------
 john  | 1234-5       | alice | 12-345
 alice | 12-345       | john  | 1234-5
(2 rows)

Or we could use a GROUP BY and the string_agg function:

SELECT
  replace(registration, '-', '') as registration,
  string_agg(name, ',') as names
FROM users
GROUP BY replace(registration, '-', '')
HAVING count(*) > 1;

registration |   names
--------------+------------
 12345        | john,alice
(1 row)

Nothing wrong with these queries, but the need to call replace repeatedly is someting that bothers me, can we avoid this? The first thing that I can think of is to create a view with our replace function as a virtual column:

CREATE OR REPLACE VIEW normalized_users AS
SELECT user_id, name, replace(registration, '-', '') as registration
FROM users;

SELECT
  registration,
  string_agg(name, ',') as names
FROM normalized_users
GROUP BY registration
HAVING count(*) > 1;

 registration |   names
--------------+------------
 12345        | john,alice
(1 row)

Great, no more functions to remember! This is a great way to simplify our data model, but, do we have alternatives to solve this?

Function as a column

This alternative require us to create a function that will receive our row as a paramater and normalize its registration number, after that we can call this function like it was a column.

It’s easier to explain this by showing the result:

CREATE OR REPLACE FUNCTION normalized_registration(users)
RETURNS text AS $$
  SELECT replace($1.registration, '-', '')
$$ STABLE LANGUAGE SQL;

SELECT
  users.normalized_registration,
  string_agg(name, ',') as names
FROM users
GROUP BY users.normalized_registration
HAVING count(*) > 1;

 registration |   names
--------------+------------
 12345        | john,alice
(1 row)

We created the normalized_registration function and called it using the syntax users.normalized_registration, how is this possible?

The trick here is that attribute notation (users.name) and function notation (name(users)) are equivalent in Postgres! Let’s see this in use:

SELECT name(users), users.name
FROM users;

 name  | name
-------+-------
 john  | john
 alice | alice
 bob   | bob
 jean  | jean
(4 rows)

With this we can call our functions like virtual columns! Another cool thing is that since our function is STABLE we can create an index using it!

CREATE INDEX normalized_registration_idx
ON users (normalized_registration(users));

And if we analyze our query we can see that it’s using the index:

EXPLAIN SELECT
  users.normalized_registration,
  string_agg(name, ',') as names
FROM users
GROUP BY users.normalized_registration
HAVING count(*) > 1;

                          QUERY PLAN
------------------------------------------------------------------
 GroupAggregate  (cost=0.13..12.30 rows=4 width=64)
   Group Key: replace((registration)::text, '-'::text, ''::text)
   Filter: (count(*) > 1)
   ->  Index Scan using normalized_registration_idx on users
       (cost=0.13..12.20 rows=4 width=64)
(4 rows)

If the index is not being used in your example it’s probably because the sample is small and the database performs better with a sequential scan. In this case if we want to be sure that the index is going to be used we can disable the sequencial scan and run our query again:

SET enable_seqscan = OFF;
-- Run our query with EXPLAIN
SET enable_seqscan = ON;

The last thing that is useful to mention is that if the virtual column requires some heavy computation it is probably better to create a real column and populate it using a trigger since it’s something that’s only going to be invoked during INSERT and UPDATE.

TL;DR:

We can achieve something similar as a virtual column using:

  • Views
  • Functions

But its better to use triggers with a real column in case of heavy computations.

If there is another way of solving this or something bothers you in this post please leave me a tweet or email, I will appreciate it!

See you in the next post!