Project

General

Profile

String IDs

When a string is used as a ID instead of an integer, string comparisons generally require a slower lexicographical comparison of each character instead of a single integer equality test. However, this limitation can be overcome using a string interning or hashing technique, below.

string interning

To efficiently use strings as IDs, the strings should be _interned_. Interning assigns an integer ID to each string so strings can be quickly compared by ID.

In programming languages with interning support (such as Java), the ID is usually the memory address of the first instance of the string. In SQL, on the other hand, the ID is an autogenerated numeric ID. PostgreSQL further provides an automatic interning mechanism via the OID handle types (regclass, regproc, etc.), which translate names automatically to their numeric OIDs when a text string is cast to the handle type. Essentially, the handle types are integers which display as strings.

Postgres's datatype interning mechanism can easily be expanded to support interning any text string, simply by creating a datatype with the same name as the string. The creation of this datatype can be automated in the constructor for a special-purpose wrapper type, as is demonstrated by SQLkit's fast text IDs implementation. Unfortunately, an important drawback of using Postgres-provided handle types is that they only support strings up to 63 characters long.

To avoid the Postgres handle type's character-length limitation, a custom handle type can instead be created, which performs an insert into a string->autogenerated integer2 lookup table (or the pkey's table itself). In this approach, the handle type should have a custom output function that displays itself as just the string instead of the string+autogenerated integer, to avoid leaking unstable autogenerated IDs to the user. The handle type's <> operators should compare on the string so that the pkey sorts by text string, not autogenerated integer. The handle type's = operator, however, will use just the integer, to apply the interning optimization. Users should add a hash index3 on every pkey so that the fast = operator, rather than the slower <> operators, will be used for exact pkey lookups4. note that this applies only to =, as Postgres's implementation of IS NOT DISTINCT FROM does not use =.

In inheritance hierarchies, it is often necessary for the pkey to include the table name (or a unique abbreviation of it) so that different subclasses' pkeys don't conflict with each other when used in the root table of the hierarchy.

2 it is generally not a good idea to use the ctid for this, as it differs between tables in the same inheritance hierarchy. nevertheless, when not using inheritance, this may be a viable option. in this case, it is not necessary to store the string pkey in the fkey handle type, because the string pkey can be looked up instantaneously (ie. without an index scan) using the ctid. this also means that when the string pkey changes, it does not need to be updated in all the fkey handles referencing it (however, this does not avoid the need to update the ctid itself in the fkey handles). in cases where the ctid won't work, it is instead possible to use the autogenerated oid, which has the same property of being hidden by default in SELECT * (helpful when the pkey splitting strategy4 is used). in any case, regardless of which kind of autogenerated integer you use (ctids, oids, or sequences), pg_dump will include only the string portion of the fkey in the export, requiring an index lookup at import time no matter what.

3 the function costs of the slower <> operators need to be set so that Postgres will always prefer the hash index to the btree index when doing lookups (eg. for joins). also, note that hash indexes require the handle type to have a hash operator class with a hashing function that returns the autogenerated integer. alternatively, a btree index could be created that uses a faster operator class with ~<, etc. instead of <>. ~<, etc. would use the autogenerated integer instead of the string. (for this reason, they cannot be named <> because they are not true sorting comparisons.)

4 if the query planner cannot be set to prefer the faster index for lookups, an alternative strategy can be used to make the fast index the only available index. this requires splitting the pkey into two columns, one which stores just the string and has the primary key constraint on it (for the table's default sort order), and the other which stores just the autogenerated integer. this second column receives a btree index (and optionally a hash index), and is referenced by all fkeys to the table. note that in this case, the fkeys do not reference the primary key, but instead reference an alternative, faster version of it. also, in this case, the handle type is used only for the fkey, not the pkey. (it is still needed for the fkey to store the autogenerated integer while displaying only the string.) in inheritance hierarchies (where a subclass's pkey is an fkey to another pkey), this requires the subclass pkey to be split into two columns (string and int), each with an fkey to the corresponding base table column.
.

cryptographic hashing

this approach is the same as string interning above, except that a cryptographic hash of the string pkey is used in place of the autogenerated integer. this speeds up fkey table restores, by avoiding the overhead of doing an index scan for the string->autogenerated integer lookup (and instead doing a cryptographic hash calculation), at the expense of slower joins (due to greater time in the = operator, which now compares hashes instead of integers). note that the hash must be at least an 8-byte bigint so that collisions are unlikely1.

1 0.1% probability of collision with 200 million rows according to Wikipedia. this probability does not need to be zero because a unique constraint on the string pkey will distinguish between a row that already exists and a pkey hash collision. (the pkey hash needs to be checked after the string pkey for this to work, which will happen automatically if the primary key is on the handle type and the hash value by itself gets an ordinary unique constraint.)
.

lossy hashing

the lossy hash function hashtext() can be used to provide a fast inequality determination, which is the most common type of operation performed in an index scan. the computation of the hash itself is faster than for a cryptographic hash, but at the cost of a slower full comparison of the strings for equality determinations. this method provides the fastest inserts and table restores, but the slowest joins.