Removing duplicates from a PostgreSQL database – Siafoo.

The basic assumptions are that you have a table, named foo, that contains a text field, called contents and that the contents field may contain duplicate rows.

For my purposes I am using a table with 30,000 rows of which 8,000 rows must be removed because they duplicate content one or more times. The contents field contains an average of only 180 characters making the hash function very fast but also probably unnecessary.

  1. First, as always, backup your database.

  2. Add an md5 hash key and populate it with the content hashes (~2 seconds for my data).

    # ’s

    ALTER TABLE foo ADD COLUMN hash text;
    UPDATE foo SET hash = md5(content);
  3. Create a unique id (i.e. like a primary key) for each row, if one does not exist.

    # ’s

    CREATE SEQUENCE foo_sequence
           INCREMENT 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1;
    -- Set the correct owner for the sequence if needed
    -- ALTER TABLE foo_seq OWNER TO postgres;
    -- Populate the field
    ALTER TABLE foo ADD COLUMN id integer;
    UPDATE foo SET id = nextval('foo_seq');
  4. (optional) Estimate the number of duplicates

    You can use the query below to estimate the number of duplicates, so that once the deletion takes place you can figure out if something has gone horribly wrong™.

    # ’s

    SELECT sum(dupes.c) - count(dupes.c)
    FROM (SELECT count(*) AS c
          FROM foo
          GROUP BY hash
          HAVING count(*) > 1
         ) AS dupes
  5. Delete the duplicates (~3 minutes on my local machine, a dual core AMD)

    # ’s

           WHERE id NOT IN (SELECT min(id)
                            FROM foo
                            GROUP BY hash HAVING count(*) >= 1)
  6. Drop the hash column

    # ’s


Naturally you can modify the above procedure to better fit your data definition and uniqueness requirements. It is also probable that you can achieve some speed ups by using indexes… somewhere.