/ - Diff - BIEN 3 - NCEAS Projects


Revision 5903

Added by Aaron Marcuse-Kubitza about 12 years ago

sql.py: distinct_table(): Use DISTINCT ON instead of a unique index and insert_select()'s ignore mode to remove duplicate rows. This uses whichever sorting method PostgreSQL deems to be fastest instead of requiring the use of a B-tree index. Since most of the slower operations in TNRS's import are distinct_table() calls, this should speed up the TNRS import, which is a bottleneck for the DB import as a whole because the TNRS import must complete before other datasources can be imported.
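The difference between the two deduplication strategies can be illustrated in plain Python. This is only a sketch of the semantics, not the code in sql.py: the function names and the sample rows are hypothetical, and the real work happens inside PostgreSQL. The first function mimics inserting rows one at a time into a table with a unique index and skipping duplicate-key failures; the second mimics DISTINCT ON by sorting on the key and keeping one row per group.

```python
from itertools import groupby


def dedup_with_unique_index(rows, key_cols):
    """Emulate row-by-row inserts into a uniquely indexed table,
    skipping rows whose key already exists (the "ignore" mode)."""
    seen = set()  # stands in for the unique B-tree index
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in seen:
            continue  # a duplicate-key error would be ignored here
        seen.add(key)
        kept.append(row)
    return kept


def dedup_with_distinct_on(rows, key_cols):
    """Emulate SELECT DISTINCT ON (key_cols): sort by key, keep one row per group."""
    def key(row):
        return tuple(row[c] for c in key_cols)

    # sorted() is stable, so the first row seen for each key is the one kept.
    return [next(group) for _, group in groupby(sorted(rows, key=key), key=key)]


# Illustrative sample data (not taken from the TNRS import).
rows = [
    {"name": "Poa annua", "source": "a"},
    {"name": "Acer rubrum", "source": "b"},
    {"name": "Poa annua", "source": "c"},
]

by_index = dedup_with_unique_index(rows, ["name"])
by_distinct = dedup_with_distinct_on(rows, ["name"])
```

Both approaches keep one row per distinct key. In PostgreSQL, DISTINCT ON without an ORDER BY does not guarantee which row of each group is returned, whereas this emulation always keeps the first one encountered.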

         limit = None
         if distinct_on == []: limit = 1 # one sample row
-        else:
-            add_index(db, distinct_on, new_table, unique=True)
-            add_index(db, distinct_on, table) # for join optimization
+        else: add_index(db, distinct_on, table) # for join optimization
-        insert_select(db, new_table, None, mk_select(db, table, order_by=None,
-            limit=limit), ignore=True)
+        insert_select(db, new_table, None, mk_select(db, table,
+            distinct_on=distinct_on, order_by=None, limit=limit))
         analyze(db, new_table)
         return new_table
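For readers unfamiliar with the resulting query, the following sketch shows roughly what kind of statement the new code path might produce. It is an assumption-laden illustration: `quote_ident()` and `distinct_insert_sql()` are hypothetical helpers, not functions from sql.py, and the actual SQL generated by `mk_select()` and `insert_select()` may differ. The handling of an empty column list as a plain `LIMIT 1` is likewise a guess based on the "one sample row" comment.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def distinct_insert_sql(src, dest, distinct_on):
    """Build an INSERT ... SELECT that copies one row per distinct key.

    Hypothetical helper: with no key columns it copies a single sample row.
    """
    insert = "INSERT INTO " + quote_ident(dest) + " "
    if not distinct_on:
        # Assumed behavior for the "one sample row" case.
        return insert + "SELECT * FROM " + quote_ident(src) + " LIMIT 1"
    cols = ", ".join(quote_ident(c) for c in distinct_on)
    return insert + "SELECT DISTINCT ON (" + cols + ") * FROM " + quote_ident(src)


# Illustrative table and column names only.
sql_keys = distinct_insert_sql("src", "dst", ["a", "b"])
sql_sample = distinct_insert_sql("src", "dst", [])
```

Because the deduplication happens inside a single SELECT, the destination table no longer needs a unique index, and the planner is free to choose how to sort the input.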
