Removing duplicates in Postgres


The best way to avoid duplicates in our database is to use a UNIQUE constraint or index in the first place, but as we know sometimes things get out of control and some housekeeping is needed. Let’s see some ways to do this:

Suppose we have a table called orders like the following:

CREATE TABLE orders(
  id serial PRIMARY KEY,
  item_id integer,
  code varchar
);

INSERT INTO orders(item_id, code)
VALUES (1, 'abc'),
       (1, 'abc'),
       (2, 'dba'),
       (2, 'hjk'),
       (2, 'dba'),
       (3, 'fkl');

SELECT * FROM orders;

 id | item_id | code
----+---------+------
  1 |       1 | abc
  2 |       1 | abc
  3 |       2 | dba
  4 |       2 | hjk
  5 |       2 | dba
  6 |       3 | fkl
(6 rows)

We want to remove rows that have the same item_id and code, what are our options? Actually, a lot!

If your table have an unique identifier like an id you can skip the next section.

Tables without an unique identifier

If your table don’t have an unique identifier you can replace the id column by the ctid column in the examples below.

ctid is a system column representing the physical location of the row in our table. This identifier is guaranteed to be unique, but changes over time if our rows are updated, so it is only useful for these kind of operations where you won’t need to use its values later.

With this covered, let’s start our study:

GROUP BY

Let’s start simple and avoid “fancy” things like joins and window functions.

DELETE
FROM orders
WHERE id NOT in (
  SELECT MIN(id)
  FROM orders
  GROUP BY item_id, code
);

Here we group our rows using the same criteria we use to find duplicates and fetch the lowest id in each group. After this we delete every row that is not in this set.

We could achieve the same result with DISTINCT ON, let’s see an example.

DISTINCT ON

DELETE
FROM orders
WHERE id NOT IN (
  SELECT DISTINCT ON (item_id, code) id
  FROM orders
);

Here we fetch the first row from each group with the same item_id and name and delete the rest. It’s important to mention that we are relying in the order by id. If this is not the case we should explicitly order our result as mentioned in the documentation.

EXISTS

DELETE
FROM orders o1
WHERE EXISTS (
  SELECT 1
  FROM orders o2
  WHERE o1.item_id = o2.item_id AND
        o1.code = o2.code AND
        o1.id > o2.id
);

This reads as: “Delete every row from o1 that has the same item_id and code and an id bigger than at least one id from o2”. This will ensure that only the duplicates with the lowest id remains.

USING

DELETE
FROM orders o1 USING orders o2
WHERE o1.item_id = o2.item_id AND
      o1.code = o2.code AND
      o1.id > o2.id;

Which executes the same code in our GROUP BY example. We also use the same technique explained above to keep the lowest id from our duplicated records. We can see another USING example here.

PARTITION (WINDOW FUNCTION)

DELETE
FROM orders
WHERE id IN (
  SELECT id
  FROM (
   SELECT row_number() OVER (PARTITION BY item_id, code ORDER BY id),
          id
   FROM orders
  ) AS sorted
  WHERE sorted.row_number > 1
);

Here the rows were partitioned by the specified criteria to find duplicates and sorted by id, after this we create a column with the row number for each row in a partition. The last step is to select only the rows with row numbers greater than 1 to delete.

Temporary tables

CREATE TABLE temporary_orders AS
  SELECT DISTINCT ON (item_id, code) * FROM orders;

DROP TABLE orders;
ALTER TABLE temporary_orders RENAME TO orders;

This solution is fast, but you have to disable indexes and foreign key constraints in your table to achieve this, which is not very practical. Still, it’s a possibility that we should take into account.

We can also use a CREATE TEMPORARY TABLE if our table should be dropped at the end of the session.

That’s it for now, if you know more solutions or have further questions feel free to leave me a tweet or email so we can discuss things further!

See you in the next post!