It’s probably a good idea to start saying that Postgres doesn’t have the concept of virtual columns (or generated columns) as other databases does, for example, MySQL. With this in mind, let’s see what we can do to at least simulate this functionality.
To start the discussion, let’s first setup our example and explain the situation:
CREATE TABLE users ( user_id serial PRIMARY KEY, name varchar, registration varchar ); INSERT INTO users(name, registration) VALUES ('john', '1234-5'), ('alice', '12-345'), ('bob', '36872'), ('jean', '551-34'); SELECT * FROM users; user_id | name | registration ---------+-------+-------------- 1 | john | 1234-5 2 | alice | 12-345 3 | bob | 36872 4 | jean | 551-34 (4 rows)
As we can see we have users in our database and each user have a registration number. Suppose this is a legacy system and some users were created with duplicated registration numbers, the only difference is the dash between numbers, that in this case shouldn’t make a difference.
So, john and alice have the same registration number
12345 and we should
inform this in our query. We can start with a JOIN to solve our problem:
SELECT u1.name, u1.registration, u2.name, u2.registration FROM users AS u1 INNER JOIN users AS u2 ON replace(u1.registration, '-', '') = replace(u2.registration,'-', '') WHERE u1.user_id <> u2.user_id; name | registration | name | registration -------+--------------+-------+-------------- john | 1234-5 | alice | 12-345 alice | 12-345 | john | 1234-5 (2 rows)
Or we could use a
GROUP BY and the
SELECT replace(registration, '-', '') as registration, string_agg(name, ',') as names FROM users GROUP BY replace(registration, '-', '') HAVING count(*) > 1; registration | names --------------+------------ 12345 | john,alice (1 row)
Nothing wrong with these queries, but the need to call
replace repeatedly is
someting that bothers me, can we avoid this? The first thing that I can think of
is to create a view
replace function as a virtual column:
CREATE OR REPLACE VIEW normalized_users AS SELECT user_id, name, replace(registration, '-', '') as registration FROM users; SELECT registration, string_agg(name, ',') as names FROM normalized_users GROUP BY registration HAVING count(*) > 1; registration | names --------------+------------ 12345 | john,alice (1 row)
Great, no more functions to remember! This is a great way to simplify our data model, but, do we have alternatives to solve this?
Function as a column
This alternative require us to create a function that will receive our row as a paramater and normalize its registration number, after that we can call this function like it was a column.
It’s easier to explain this by showing the result:
CREATE OR REPLACE FUNCTION normalized_registration(users) RETURNS text AS $$ SELECT replace($1.registration, '-', '') $$ STABLE LANGUAGE SQL; SELECT users.normalized_registration, string_agg(name, ',') as names FROM users GROUP BY users.normalized_registration HAVING count(*) > 1; registration | names --------------+------------ 12345 | john,alice (1 row)
We created the
normalized_registration function and called it using the syntax
users.normalized_registration, how is this possible?
The trick here is that attribute notation (users.name) and function notation (name(users)) are equivalent in Postgres! Let’s see this in use:
SELECT name(users), users.name FROM users; name | name -------+------- john | john alice | alice bob | bob jean | jean (4 rows)
With this we can call our functions like virtual columns! Another cool thing is
that since our function is
STABLE we can create an index using it!
CREATE INDEX normalized_registration_idx ON users (normalized_registration(users));
And if we analyze our query we can see that it’s using the index:
EXPLAIN SELECT users.normalized_registration, string_agg(name, ',') as names FROM users GROUP BY users.normalized_registration HAVING count(*) > 1; QUERY PLAN ------------------------------------------------------------------ GroupAggregate (cost=0.13..12.30 rows=4 width=64) Group Key: replace((registration)::text, '-'::text, ''::text) Filter: (count(*) > 1) -> Index Scan using normalized_registration_idx on users (cost=0.13..12.20 rows=4 width=64) (4 rows)
If the index is not being used in your example it’s probably because the sample is small and the database performs better with a sequential scan. In this case if we want to be sure that the index is going to be used we can disable the sequencial scan and run our query again:
SET enable_seqscan = OFF; -- Run our query with EXPLAIN SET enable_seqscan = ON;
The last thing that is useful to mention is that if the virtual column requires
some heavy computation it is probably better to create a real column and
populate it using a
it’s something that’s only going to be invoked during
We can achieve something similar as a virtual column using:
But its better to use triggers with a real column in case of heavy computations.
If there is another way of solving this or something bothers you in this post please leave me a tweet or email, I will appreciate it!
See you in the next post!