UPSERT Operations in PostgreSQL: From Traditional Methods to ON CONFLICT

Keywords: PostgreSQL | UPSERT | Concurrency Safety

Abstract: This article provides an in-depth exploration of UPSERT operations in PostgreSQL, focusing on the INSERT...ON CONFLICT syntax introduced in version 9.5 and its advantages. It compares traditional approaches, including retry loops and bulk locking updates, with modern methods, explaining race condition issues and solutions in concurrent environments. Practical code examples illustrate various implementations, offering technical guidance for PostgreSQL users across different versions.

Introduction

In database operations, UPSERT (update or insert) is a common requirement that combines the functionalities of INSERT and UPDATE. It updates a record if it exists or inserts a new one if it does not. MySQL supports this via INSERT ... ON DUPLICATE KEY UPDATE, while the SQL standard uses the MERGE statement. Before PostgreSQL 9.5, native support was lacking, requiring developers to adopt alternative methods. This article systematically covers UPSERT implementations in PostgreSQL, with a focus on concurrency safety.

PostgreSQL 9.5 and Newer: ON CONFLICT Syntax

PostgreSQL 9.5 introduced the INSERT ... ON CONFLICT syntax, directly supporting UPSERT operations. This syntax handles unique constraint violations through conflict detection, offering DO UPDATE and DO NOTHING behaviors. For example, for a table testtable with a primary key id and a field somedata, the UPSERT code is as follows:

INSERT INTO testtable (id, somedata) VALUES 
(2, 'Joe'), 
(3, 'Alan')
ON CONFLICT (id) 
DO UPDATE SET somedata = EXCLUDED.somedata;

This statement attempts to insert records with id 2 and 3. If id=2 already exists, it updates its somedata to 'Joe'; if id=3 does not exist, it inserts a new record. The keyword EXCLUDED references the conflicting insert values. This method supports multiple rows, requires no explicit locking or retry loops, and significantly improves efficiency. Compared to MySQL's ON DUPLICATE KEY UPDATE, PostgreSQL's syntax is more flexible, allowing specification of any unique constraint.

Traditional Methods for PostgreSQL 9.4 and Older

Before version 9.5, PostgreSQL had no built-in UPSERT support, necessitating custom implementations. The core challenge is ensuring concurrency safety and avoiding race conditions. Common approaches include individual row retry loops and bulk locking updates.

Individual Row Retry Loop

This method is suitable for high-concurrency scenarios, where each UPSERT operation retries in a loop until success. The PostgreSQL documentation provides an example function using a BEGIN...EXCEPTION block to handle duplicate key errors:

CREATE OR REPLACE FUNCTION upsert_testtable(p_id integer, p_somedata text) 
RETURNS void AS $$
BEGIN
    LOOP
        BEGIN
            UPDATE testtable SET somedata = p_somedata WHERE id = p_id;
            IF found THEN
                RETURN;
            END IF;
            INSERT INTO testtable (id, somedata) VALUES (p_id, p_somedata);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Retry
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

This function first attempts an update; if no row is found, it inserts. If a unique violation occurs during insertion, it catches the exception and retries. This approach is safe under the READ COMMITTED isolation level but inefficient, as the loop may execute multiple times. Note that if triggers or other unique constraints exist, it may not handle all conflicts correctly.

Bulk Locking Update

For bulk data processing, the locking method is recommended. It involves storing new data in a temporary table, locking the target table, and then performing updates and inserts. For example:

BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable 
SET somedata = newvals.somedata 
FROM newvals 
WHERE newvals.id = testtable.id;
INSERT INTO testtable 
SELECT newvals.id, newvals.somedata 
FROM newvals 
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id) 
WHERE testtable.id IS NULL;
COMMIT;

This process starts by creating a temporary table newvals and inserting data. It then locks testtable in EXCLUSIVE mode to prevent modifications by other transactions. Next, it uses UPDATE ... FROM to update existing records and inserts new records via a left join. This method is efficient and concurrency-safe but may impact other operations during the lock.

Concurrency Safety and Race Condition Analysis

UPSERT operations are prone to race conditions in concurrent environments. For instance, a common erroneous approach is:

-- Incorrect example: prone to race conditions
BEGIN;
UPDATE testtable SET somedata = 'blah' WHERE id = 2;
INSERT INTO testtable (id, somedata) 
SELECT 2, 'blah' 
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;

If two transactions execute this simultaneously, both might update zero rows (due to no initial matches), then both pass the EXISTS check and attempt inserts, leading to a duplicate key error in one transaction. Retry loops or locking methods avoid this issue. The ON CONFLICT syntax handles conflicts at the database kernel level, requiring no application-layer intervention.

Comparison with Other Databases

MySQL's INSERT ... ON DUPLICATE KEY UPDATE is similar to PostgreSQL's ON CONFLICT, but the latter is more versatile, supporting any unique constraint. The SQL standard's MERGE statement is not widely used for UPSERT in PostgreSQL due to ambiguous concurrency semantics, often requiring additional locking. In other databases like SQL Server and Oracle, MERGE is common but also faces concurrency challenges.

Conclusion

PostgreSQL's UPSERT implementation has evolved from traditional methods to native ON CONFLICT support. For versions 9.5 and above, INSERT ... ON CONFLICT is recommended for its simplicity, efficiency, and concurrency safety. Users of older versions can opt for retry loops or bulk locking methods but should consider performance and complexity. In practice, the choice of method should depend on data volume, concurrency needs, and the PostgreSQL version.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.