dup-select¶
The dup-select operation implements the SQL pseudo-statement INSERT/ON DUPLICATE KEY SELECT. This is a cross between MySQL's INSERT/ON DUPLICATE KEY UPDATE (which does not provide SELECT functionality) and PostgreSQL's INSERT RETURNING (which throws an error on a duplicate instead of returning the existing row).
Dup-select is needed if you use serial pkeys (which must be fetched for the existing row) or multiple unique constraints (which may produce different plain-text pkeys). If you only use plain-text pkeys and single unique constraints, you can use the faster and simpler dup-return instead.
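For illustration, a minimal sketch of the intended contract, assuming a hypothetical inserter view some_inserter_view with a serial pkey column pkey and a unique column col:

-- first statement inserts and returns the new row's serial pkey;
-- second statement hits the unique constraint, but instead of raising an error,
-- dup-select returns the existing row's pkey
INSERT INTO some_inserter_view (col) VALUES ('v') RETURNING pkey;
INSERT INTO some_inserter_view (col) VALUES ('v') RETURNING pkey;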
implementation:¶
Each of our three import algorithms has a different implementation of this core operation:
- row-based import: put()
- column-based import: put_table()
- trigger-based import: trigger dup-select
Note that data should never be sent between the client and server if it can be avoided, as this creates a huge bottleneck7. Column-based import fixed this problem, but in a way that added a huge amount of complexity to the codebase. A much better solution is to move the row-based import code to the server itself, which removes the client/server bottleneck without requiring complicated and buggy parallelization techniques.
Column-based import could be simplified significantly by moving both the dup-select code and the mappings into the schema itself as rules or triggers (trigger normalization). INSTEAD OF rules and triggers can also be used to implement SQL functions without requiring special support by put_table().
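As a rough, hypothetical illustration of trigger normalization (the tables and names here are invented, not the actual schema): the mapping from a flat input row into normalized tables can live in an INSTEAD OF trigger on an inserter view, with each normalized insert going through dup-select:

CREATE TABLE place (place_id serial PRIMARY KEY, name text UNIQUE);
CREATE TABLE occurrence (occurrence_id serial PRIMARY KEY, place_id integer REFERENCES place, taxon text);

-- flat inserter view: one row of (denormalized) input data
CREATE VIEW occurrence_inserter AS
    SELECT place.name AS place_name, occurrence.taxon
    FROM occurrence JOIN place USING (place_id);

CREATE FUNCTION occurrence_inserter_insert() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    _place_id integer;
BEGIN
    -- dup-select on place: insert the place, or on a duplicate fetch the existing serial pkey
    BEGIN
        INSERT INTO place (name) VALUES (NEW.place_name)
            RETURNING place_id INTO STRICT _place_id;
    EXCEPTION WHEN unique_violation THEN
        SELECT place_id INTO STRICT _place_id FROM place WHERE name = NEW.place_name;
    END;
    -- the mapping: the flat input row becomes one row in each normalized table
    INSERT INTO occurrence (place_id, taxon) VALUES (_place_id, NEW.taxon);
    RETURN NEW;
END;
$$;

CREATE TRIGGER occurrence_inserter_insert INSTEAD OF INSERT ON occurrence_inserter
    FOR EACH ROW EXECUTE PROCEDURE occurrence_inserter_insert();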
runtime¶
The runtime3 of trigger dup-select is much better than that of the current implementation of column-based import, and still better than column-based import even if the latter were completely optimized (due to cache locality). This holds despite switching back to a row-based way of doing things, because column-based import's speed improvement over row-based import came from moving the data operations to the server side7, rather than from any advantage of doing things "by column" instead of "by row".
Note that adding a hash index is probably not necessary, because Postgres's implementation of hash indexes is slow enough that they don't add an appreciable speed improvement over btree indexes. In fact, we possibly should not even be pre-removing duplicates using hash-based DISTINCT ON (as current column-based import does), because this may actually make column-based import slower.
method: | current column-based import4 | + hash indexes1 | + trigger dup-select | + trigger normalization | + dup-return |
improvement: | - | fast index scans (if needed) | immediate refetch | input row read only once | no refetch scan |
runtime6: | worse: ~13n8 + m*log(m) | better: 2n + m*log(m) | even better: 2n + m*log(m) - m | best: n + m*log(m) | |
complexity6: | worse: O(n*log(m)), m<n | better: avg: O(n + m*log(m)), m<n • best-case: O(n) (m<<n) • worst-case: O(n*log(m)) (m~=n) | (same) | (same) | (same) |
intra-row index cache locality: | worse: the refetch index scan happens separately at the end, when the index pages are no longer cached | (same) | better: the refetch index scan occurs right after the constraint pre-check scan and can use the same cached index pages | (same) | best: no refetch needed |
inter-row index cache locality: | sometimes better: when rows are in close to sorted order, consecutive rows can use the same cached index pages within each major step | (same) | possibly worse: when inserting into many tables, some index pages might no longer be cached when the next row is processed | (same) | (same) |
input cache locality: | worse: each input row needs to be re-read for each column group and for each of the 3 steps | (same) | (same) | better: each input row is read only once | (same) |
step 1: | pre-remove duplicates using hash-based DISTINCT ON1 - O(n): n rows * O(1) hash aggregate | pre-check existence using hash index1,2 - O(n): n rows * O(1) hash index | (same) | (same) | (same) |
step 2: | for unique rows, constraint-check index scan on insert - O(m*log(m)): m unique rows * O(log(m)) btree5 index | for new rows, constraint-check index scan on insert - O(m*log(m)): m unique rows * O(log(m)) btree5 index | (same) | (same) | (same) |
step 3: | for all rows, index scan to fetch output pkey - O(n*log(m)): n rows * O(log(m)) btree index | for all rows, index scan to fetch output pkey - O(n): n rows * O(1) hash index | for only duplicate rows, index scan to fetch output pkey - O(n - m): (n - m) rows * O(1) hash index | (same) | - (no refetch) |
((same) indicates that the cell to its left also applies to this method.)
1 this step is probably not necessary, because even for our 80-million row analytical_stem table, the query planner still prefers a btree index over a hash index when both are provided (lookups for them take about the same amount of time). thus, the additional hash check just adds an extra operation, rather than speeding up the lookup.
2 this step can also check if the pkey exists in superclasses (e.g. to detect inter-datasource collisions). however, pkeys inserted into a superclass by a concurrent transaction will not be visible to the SELECT, so this does not completely prevent inter-class duplicates.
3 times are for inserting into an empty table, which simplifies the calculation
4 column-based import only works if all rows violate the same unique constraint (each datasource that violates a different constraint has different generated code).
also, column-based import does not support autopopulated natural pkeys, because the autopopulation triggers don't run until after the column-based import operation, so there is not yet a key value to DISTINCT against.
5 you cannot avoid btree indexes completely, because every unique index must be a btree: ERROR: access method "hash" does not support unique indexes
6 m = # unique rows, e.g. 4,079 GBIF herbaria with plants
7 doing all the operations in the database avoids the need to:
- send data over a Unix socket connection (including wrapping data in packets)
- translate between database-native, Python, and plain text representations of values
- parse SQL commands (the query plan for each PostgreSQL function is cached)
8 n*log(m) + n = n*log2(m) + n, with m = e.g. 4,079 GBIF herbaria with plants, giving 12.0n + n = 13.0n, although this doesn't take into account the relative page costs of btree and hash index scans. even if hash index scans are up to 6 times slower per step than btree scans, current column-based import will still be worse in all cases.
implementation¶
The core operation happens in an insert_on_duplicate_key_select() function, which would try the insert on the underlying table within an exception block, and on a duplicate key error would instead select the existing row out of the table for use by INSERT RETURNING.
-- sketch: "table" is a placeholder for the target table's name (and row type)
CREATE FUNCTION insert_on_duplicate_key_select(new table)
RETURNS table
LANGUAGE plpgsql
AS $$
BEGIN
    BEGIN
        -- try the insert; if it succeeds, return the inserted row (including any serial pkey)
        INSERT INTO table VALUES (new.*) RETURNING * INTO STRICT new;
    EXCEPTION
        WHEN unique_violation THEN
            -- dup-select: fetch the existing row instead
            /*SELECT * INTO STRICT new FROM table WHERE <unique constraint cols match new>*/;
    END;
    RETURN new;
END;
$$;
(As an optimization, the main table (not the inserter view) can also have a BEFORE trigger that pre-checks existence of the row by doing a select for all constraints that have a hash index, and throwing a unique_violation if any of them returns a row.)
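A minimal sketch of that pre-check, assuming a hypothetical table t with a hash index on column c1 (a real version would check every hash-indexed unique constraint):

CREATE FUNCTION t_precheck() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- pre-check existence using the hash index before attempting the insert
    IF EXISTS (SELECT 1 FROM t WHERE c1 = NEW.c1) THEN
        -- caught by the dup-select exception block, which then refetches the existing row
        RAISE unique_violation USING MESSAGE = 'duplicate key value (pre-check on c1)';
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER t_precheck BEFORE INSERT ON t
    FOR EACH ROW EXECUTE PROCEDURE t_precheck();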
The error message provides identifying information for the existing row:
to prep:
CREATE TEMP TABLE t (c1 text, c2 text);
CREATE UNIQUE INDEX i ON t ((c1 || 'x'), (c2 || 'y'));
INSERT INTO t VALUES ('v1', 'v2');
then, running
INSERT INTO t VALUES ('v1', 'v2');
again produces the parseable error:
ERROR: duplicate key value violates unique constraint "i"
DETAIL: Key ((c1 || 'x'::text), (c2 || 'y'::text))=(v1x, v2y) already exists.
SQL state: 23505
the pkey of the existing row can then be fetched with:
SELECT * FROM t WHERE ((c1 || 'x'::text), (c2 || 'y'::text)) = ('v1x', 'v2y');
note that the () tuple syntax does use an index scan, even for non-text types:
CREATE TEMP TABLE t (c1 integer, c2 integer);
CREATE UNIQUE INDEX i ON t ((c1+10), (c2+20));
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (1, 2);
ERROR: duplicate key value violates unique constraint "i"
DETAIL: Key ((c1 + 10), (c2 + 20))=(11, 22) already exists.
SQL state: 23505
EXPLAIN SELECT * FROM t WHERE ((c1 + 10), (c2 + 20)) = ('11', '22');
Index Scan using i on t (cost=0.01..8.28 rows=1 width=8)
Index Cond: (((c1 + 10) = 11) AND ((c2 + 20) = 22))
with multi-column unique constraints and values that contain ", ", you have to be more careful:
CREATE TEMP TABLE t (c1 text, c2 text);
CREATE UNIQUE INDEX i ON t ((c1 || 'x'), (c2 || 'y'));
INSERT INTO t VALUES ('v1, a', 'v2, b');
INSERT INTO t VALUES ('v1, a', 'v2, b');
ERROR: duplicate key value violates unique constraint "i"
DETAIL: Key ((c1 || 'x'::text), (c2 || 'y'::text))=(v1, ax, v2, by) already exists.
SQL state: 23505
because it is not clear which ", " separates the multiple fields, you instead have to reevaluate the constraint expression on the tuple you are trying to insert (so that the values will be bound to the field names), and use the resulting value on the right side of the =. note that this workaround only works if the unique constraint fields are not autopopulated by a trigger. otherwise, you would need to test each combination of '' around the fields, or switch to a single-column unique index on a tuple combining the columns (with (((..., ...))) instead of (..., ...)).
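For the example above, a sketch of this reevaluation: instead of parsing the DETAIL string, apply the index expressions to the row we just tried to insert ('v1, a', 'v2, b') and compare the resulting tuple:

-- the tuple comparison decomposes into ANDed equalities, so the index scan on i is still used
SELECT * FROM t
WHERE ((c1 || 'x'::text), (c2 || 'y'::text))
    = (('v1, a' || 'x'::text), ('v2, b' || 'y'::text));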
in <9.1¶
You may be able to call the insert_on_duplicate_key_select() function from an INSTEAD rule, which rewrites INSERT RETURNING to a SELECT on that function (which takes new as an argument and returns the results of the dup-select operation, similar to how a trigger modifies and returns new).
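A speculative, untested sketch of such a rule, assuming an inserter view t_inserter over the example table t and a variant of insert_on_duplicate_key_select() declared to take and return t:

-- rewrite inserts on the view into a SELECT on the dup-select function
CREATE RULE t_inserter_insert AS ON INSERT TO t_inserter
    DO INSTEAD SELECT * FROM insert_on_duplicate_key_select(ROW(NEW.c1, NEW.c2)::t);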
in 9.1+¶
This can now also be done using *INSTEAD OF triggers*, which support INSERT RETURNING.
Note that the inserter view doing dup-select cannot be the same as the underlying table, because the presence of an INSTEAD OF trigger on the table would prevent doing an underlying insert on it ("INSTEAD OF triggers do not support WHEN conditions", which could otherwise be used to switch the trigger off for the nested insert).
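A hedged sketch of the 9.1+ approach, using a standalone copy of the example table from above (the view, trigger, and function names are hypothetical):

CREATE TABLE t (c1 text, c2 text);
CREATE UNIQUE INDEX i ON t ((c1 || 'x'), (c2 || 'y'));
CREATE VIEW t_inserter AS SELECT * FROM t;

CREATE FUNCTION t_inserter_insert() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    BEGIN
        -- try the insert on the underlying table
        INSERT INTO t VALUES (NEW.*) RETURNING * INTO STRICT NEW;
    EXCEPTION WHEN unique_violation THEN
        -- dup-select: refetch the existing row by reevaluating the
        -- constraint expression on the row we tried to insert
        SELECT * INTO STRICT NEW FROM t
            WHERE ((c1 || 'x'::text), (c2 || 'y'::text))
                = ((NEW.c1 || 'x'::text), (NEW.c2 || 'y'::text));
    END;
    RETURN NEW; -- becomes the row returned by INSERT ... RETURNING
END;
$$;

CREATE TRIGGER t_inserter_insert INSTEAD OF INSERT ON t_inserter
    FOR EACH ROW EXECUTE PROCEDURE t_inserter_insert();

-- the second INSERT returns the existing row instead of raising a duplicate-key error:
INSERT INTO t_inserter VALUES ('v1', 'v2') RETURNING *;
INSERT INTO t_inserter VALUES ('v1', 'v2') RETURNING *;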