


  • svn:executable: *

# Date Author Comment
11970 01/20/2014 11:33 AM Aaron Marcuse-Kubitza

moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (e.g. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (

11918 12/17/2013 05:47 AM Aaron Marcuse-Kubitza

bugfix: bin/map: in_is_db: don't ignore errors when the table does not exist, because these prevent an errexit and allow an import to continue when a staging table is missing. suppressing this error had previously been necessary because metadata-only tables (Source/) used to not have installed staging tables, and the program had to react accordingly.

11806 12/03/2013 08:58 AM Aaron Marcuse-Kubitza

bin/map: support param start="", which indicates the default value. this fixes a bug in inputs/input.Makefile $(restart_row), which outputs "" if an explicit starting row is not found.
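
A minimal sketch of treating an empty `start` as the default value. It assumes the parameter arrives as a string in an environment-style mapping and that the default starting row is 0; `parse_start` is a hypothetical helper, not the code in bin/map.

```python
def parse_start(env):
    """Return the starting row, treating a missing or empty value as the default.

    `env` is any mapping of parameter names to strings (e.g. os.environ).
    The default of 0 is an assumption made for this illustration.
    """
    raw = env.get('start', '')
    if raw == '':  # start="" means "use the default"
        return 0
    return int(raw)
```

With this shape, a Makefile variable that expands to "" and an omitted parameter behave the same way.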

11396 10/21/2013 07:14 PM Aaron Marcuse-Kubitza

fix: bin/map: put template: comment out the "Put template:" label so that the output is valid XML, and displays properly in a browser rather than showing a syntax error

11227 10/09/2013 10:12 PM Aaron Marcuse-Kubitza

bin/map: usage: documented that verbosity > 3 in commit mode turns on debug_temp mode, which creates real tables instead of temp tables

10854 09/04/2013 01:28 PM Aaron Marcuse-Kubitza

bin/map: allow user to override the source env var, which is used as the source.shortname value in the DB

10191 07/09/2013 12:56 AM Aaron Marcuse-Kubitza

bin/map: removed no longer used support for map.csv input column prefixes (expand out the prefixes instead). this used to be used by SpeciesLink to use just one mapping for a single term with multiple DwC namespaces, but was replaced with an explicit, ordered rather than implicit, unordered /_alt-ing together of the terms.

10190 07/08/2013 11:47 PM Aaron Marcuse-Kubitza

bin/map: removed no longer accurate comment that this is case- and punctuation-insensitive, since the case- and punctuation-insensitivity is now instead handled by map.csv preprocessing scripts before the mappings are even provided to bin/map

10140 07/02/2013 02:31 PM Aaron Marcuse-Kubitza

bugfix: bin/map: in_is_db: inline metadata value columns (used by new-style import) so that they can be compared by value in XML simplifying functions (lib/

10115 07/02/2013 03:50 AM Aaron Marcuse-Kubitza

bin/map: map_table(): Resolve prefixes: combined db_xml.ColRef() constructor call with creation of args (as tuple) for clarity

10114 07/02/2013 03:35 AM Aaron Marcuse-Kubitza

bin/map: update_in_label(): use in_schema instead of the map spreadsheet column name when available, to allow using one spreadsheet for all datasources (which would not have a datasource-specific spreadsheet column name)

9455 05/17/2013 01:13 PM Aaron Marcuse-Kubitza

bin/map: by_col: ensure verbosity is at least 2 in live mode (using new ints.set_min() instead of max() for clarity). documented that live column-based import MUST be run with verbosity 2+ (3 preferred) to provide debugging information for often-complex errors. without this, debugging is effectively impossible.
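
A rough sketch of the "at least 2" rule described above. The signature of `set_min()` here is an assumption (the real `ints.set_min()` may differ), and `effective_verbosity` with its `by_col` and `live` flags is a hypothetical wrapper for illustration only.

```python
def set_min(value, minimum):
    """Return value, raised to minimum if it is lower (assumed behavior)."""
    return max(value, minimum)

def effective_verbosity(verbosity, by_col, live):
    """Raise verbosity to 2 for live column-based imports; leave it alone otherwise."""
    if by_col and live:
        # Live column-based import needs verbosity 2+ to be debuggable
        verbosity = set_min(verbosity, 2)
    return verbosity
```

Naming the helper after its intent reads more clearly at the call site than a bare `max()`, which is easy to mistake for an upper bound.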

9453 05/17/2013 12:57 PM Aaron Marcuse-Kubitza

bin/map: Set default verbosity: by_col: documented that showing all queries is primarily to assist debugging, not profiling

8075 03/16/2013 02:16 PM Aaron Marcuse-Kubitza

bin/map: No mappings warning: Added explanation that this could also be due to no column name matches, and hint to check if you are importing the correct input table

7148 01/10/2013 09:03 PM Aaron Marcuse-Kubitza

bin/map: map_table(): Resolving prefixes: Fixed bug where need to use list instead of tuple for metadata value mappings

7122 01/08/2013 10:55 PM Aaron Marcuse-Kubitza

bin/map: Made $redo flag default to off, because redo mode is slow (all tables have to be truncated) and is only needed when running tests on a public schema with data in it, which would not be the case on a development machine where tests are usually run

6740 12/11/2012 01:48 AM Aaron Marcuse-Kubitza

bin/map: Removed column names simplification, which was causing columns with the same alphanumeric characters but different punctuation to be simplified to the same name. Name simplification is now performed by the mapping mechanism itself, and can be overridden in the mappings.

6454 11/25/2012 07:20 PM Aaron Marcuse-Kubitza

bin/map: in_is_db: by_col: Clearing errors table: Skip this if the table has been set to None because it didn't exist (and thus was a metadata-only map spreadsheet)

6445 11/24/2012 02:41 PM Aaron Marcuse-Kubitza

bin/map: in_is_db: If table does not exist, set table to None so that db_xml.put_table() doesn't try to access it. This fixes a bug in metadata-only map spreadsheets under column-based import.

6404 11/24/2012 07:32 AM Aaron Marcuse-Kubitza

bin/map: update_in_label(): Removed hardcoded source_id col_default, which is now set in mappings/VegCore-VegBIEN.csv's output root

6399 11/24/2012 06:44 AM Aaron Marcuse-Kubitza

bin/map: update_in_label(): Set $source env var to the in_label (datasource name), to make it available to _env()

6385 11/24/2012 05:05 AM Aaron Marcuse-Kubitza

bin/map: Support map spreadsheets containing only metadata mappings (with no corresponding staging table), by falling back to an empty table when the named table does not exist

6179 11/14/2012 06:30 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: Renamed reference -> source to make this table more broadly applicable, and because this now stores the datasource metadata

5953 11/01/2012 10:09 AM Aaron Marcuse-Kubitza

mappings/VegCore-VegBIEN.csv: Renamed creator_ids to reference_id since they are now fkeys to reference

5952 11/01/2012 10:04 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: Made creator_ids an fkey to reference instead of party, so that datasources are stored separately from people and to allow adding reference-type metadata (URL, copyright, etc.) for each datasource

5928 11/01/2012 06:41 AM Aaron Marcuse-Kubitza

bin/map: map_rows(): map_table(): Fixed bug where metadata values were being removed prematurely, by passing them through

5927 11/01/2012 06:40 AM Aaron Marcuse-Kubitza

bin/map: map_rows(): Fixed bug where metadata values were being passed to functions that expected columns, by placing them directly in the XML import tree and then removing them from the mappings

5911 11/01/2012 04:16 AM Aaron Marcuse-Kubitza

bin/map: Added support for including literal metadata values in the map spreadsheet, by prefixing them with ':'
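
One way the ':' prefix could be handled, sketched under assumptions: mappings are taken to be (input, output) pairs, and `split_mappings` is a hypothetical helper rather than the function used in bin/map.

```python
def split_mappings(mappings):
    """Separate column mappings from literal metadata values.

    `mappings` is a list of (input, output) pairs. An input that starts
    with ':' is treated as a literal value; anything else is a column name.
    """
    columns, literals = [], []
    for in_, out in mappings:
        if in_.startswith(':'):
            literals.append((in_[1:], out))  # drop the ':' prefix
        else:
            columns.append((in_, out))
    return columns, literals

# Example with made-up names: one real column and one literal value
columns, literals = split_mappings([('plot_name', '/out/a'), (':Example DB', '/out/b')])
```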

5523 10/15/2012 02:36 PM Aaron Marcuse-Kubitza

calls: Removed order_by=None everywhere that a stable row order is required (i.e. consistent between selects, or consistent between table transformations). This causes several tests to return different inserted row counts, because the input table is now being accessed in pkey order instead of in table order. This fixes a bug where tables with more than ~100 rows would return different results for repeated calls of the same non-ordered select.

5242 10/04/2012 08:26 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: Renamed datasource_id to creator_id so it can apply generally to any entity (such as a person), not just an aggregated datasource. This also enables taxonconcept.datasource_id to merge with creator_id, which now serves the same purpose.

5234 10/04/2012 06:15 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: party: Made it datasource-scoped. Since this creates a recursive fkey, a datasource (a root party) should point to itself in this field, which will happen automatically by setting it to the special value 0.

5026 09/26/2012 11:49 PM Aaron Marcuse-Kubitza

bin/map, db_xml.put_table() (row-based and column-based import): Don't sort the input table by its pkey, in order to support input tables with no pkey. Note that reading the input table in table order and having this match the input flat file's order is only possible with sql_io.import_csv()'s truncation of the table on a failed import, which ensures that the rows will be stored in inserted order.

4652 09/12/2012 02:28 PM Aaron Marcuse-Kubitza

Removed no longer used intersect

4505 09/07/2012 09:16 AM Aaron Marcuse-Kubitza

bin/map: map_table(): Refactored to map simplified to original column names first and then determine column index for each original name, in order to avoid trying to recover the original name from a simplified name where multiple original names might collide onto the same simplified name. Documented that it's case- and punctuation-insensitive.
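
The idea of going from simplified names to original names first, and only then to column indexes, can be sketched as follows. The `simplify()` rule shown (lowercase, drop non-alphanumerics) is an assumption standing in for `maps.simplify()`, and `column_indexes` is a hypothetical helper.

```python
import re
from collections import defaultdict

def simplify(name):
    """Assumed simplification: lowercase and drop non-alphanumeric characters."""
    return re.sub(r'[^a-z0-9]', '', name.lower())

def column_indexes(header, wanted):
    """Map each wanted name to the indexes of every header column matching it
    case- and punctuation-insensitively."""
    originals = defaultdict(list)  # simplified name -> original names
    for name in header:
        originals[simplify(name)].append(name)
    result = {}
    for name in wanted:
        matches = originals.get(simplify(name), [])
        # Index by original name, so colliding names are all kept
        result[name] = [i for i, col in enumerate(header) if col in matches]
    return result

header = ['Plot ID', 'plot_id', 'Species']
indexes = column_indexes(header, ['plotid', 'species'])
```

Because the lookup never tries to turn a simplified name back into a single original name, two columns that simplify to the same string both survive.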

4503 09/07/2012 08:42 AM Aaron Marcuse-Kubitza

bin/map: map_table(): Resolve all mappings and prefixes after applying maps.simplify()

4492 09/06/2012 08:42 PM Aaron Marcuse-Kubitza

Replaced repr() with strings.urepr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors

4491 09/06/2012 08:30 PM Aaron Marcuse-Kubitza

Replaced str() with strings.ustr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors

4474 09/05/2012 09:09 AM Aaron Marcuse-Kubitza

bin/map: Clearing errors table: Fixed bug where needed to check if sql_io.errors_table() returned None (indicating that the errors table didn't exist) before calling sql.drop_table()

4473 09/05/2012 09:04 AM Aaron Marcuse-Kubitza

bin/map: Clearing errors table: Fixed bug where needed to use sql.drop_table() instead of sql.truncate() now that errors tables are not created until column-based import runs

4213 08/24/2012 07:24 PM Aaron Marcuse-Kubitza

bin/map: Documented that it is duplicate-column safe (supports multiple columns of the same name)

4068 08/16/2012 12:39 PM Aaron Marcuse-Kubitza

bin/map: collision_suffix: Setting back to _alt to test if _merge caused the SpeciesLink slowdown. SpeciesLink contains a huge number of equivalent columns due to each DwC term being present with namespaces for all versions of the DwC schema, and these columns can be combined either using _alt or _merge. _merge is only useful if the values in different versions of the same DwC field are different, which is not likely the case.

4049 08/15/2012 07:02 AM Aaron Marcuse-Kubitza

bin/map: collision_suffix: Changed to use _merge instead of _alt to avoid losing source data on import when multiple fields collide

4048 08/15/2012 06:58 AM Aaron Marcuse-Kubitza

bin/map: Preventing collisions if multiple inputs mapping to same output: Made collision suffix configurable so it can easily be changed

4047 08/15/2012 06:56 AM Aaron Marcuse-Kubitza

bin/map: Preventing collisions if multiple inputs mapping to same output: Made collision suffix configurable so it can easily be changed

4042 08/15/2012 05:55 AM Aaron Marcuse-Kubitza

bin/map: Run new xml_func.simplify() on the root before printing the put template, so that _alts and _merges with only one element for the current datasource will be printed in their simplified form (with the _alt/_merge removed). This facilitates automated testing after an _alt/_merge suffix has been added, because the put template provided as part of the automated test will only change for those datasources that actually have an entry for both mappings, which greatly reduces the number of tests that need to be accepted.

4026 08/15/2012 03:44 AM Aaron Marcuse-Kubitza

Removed trailing whitespace on non-empty lines

3769 08/02/2012 08:54 PM Aaron Marcuse-Kubitza

bin/map: input is CSV: Removed unused map_ var

3768 08/02/2012 08:50 PM Aaron Marcuse-Kubitza

bin/map: Documented that it's multi-safe (supports an input appearing multiple times)

3715 08/01/2012 05:48 AM Aaron Marcuse-Kubitza

bin/map: out_is_db: row-based mode: Debug-log the processed XML tree produced by xml_func.process()

3696 07/31/2012 08:04 PM Aaron Marcuse-Kubitza

bin/map: Don't create unneeded /_ignore/inLabel element containing the datasource name because sql_io.put_table() now autopopulates the datasource_id

3689 07/30/2012 06:09 PM Aaron Marcuse-Kubitza

bin/map: Fixed bug where needed to use sql.function_exists() to determine if something is a relational (now SQL) function, including in row-based mode, since that now uses sql_io.put_table(), which requires this. The bug fix relies on the new xml_func.process() feature that preserves unknown relational functions in case they are built-in functions rather than SQL functions.

3669 07/27/2012 10:48 PM Aaron Marcuse-Kubitza

bin/map: Call sys.stdout.flush() after every call to sys.stdout.write() to avoid interleaved stdout/stderr output due to stdout buffering
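
A small sketch of the write-then-flush pattern. `write_out` is a hypothetical wrapper; the commit itself only says that `sys.stdout.flush()` is called after each `sys.stdout.write()`.

```python
import sys

def write_out(text, stream=None):
    """Write text to stdout (or the given stream) and flush immediately,
    so buffered stdout output cannot appear after later stderr output."""
    stream = sys.stdout if stream is None else stream
    stream.write(text)
    stream.flush()
```

Routing every write through one helper makes it hard to forget the flush, at the cost of more frequent system calls.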

3661 07/27/2012 09:38 PM Aaron Marcuse-Kubitza

Moved importing of col_defaults from db_xml.put_table() to bin/map, so that it also happens in row-based mode. Note that this causes a DB entry for the datasource to always be created, even if the datasource has no mappings or no rows.

3653 07/27/2012 08:21 PM Aaron Marcuse-Kubitza

bin/map: out_is_db: Use col_defaults in row-based mode as well

3652 07/27/2012 08:02 PM Aaron Marcuse-Kubitza

Renamed put_table_special_funcs to put_special_funcs because it is now used by put() as well

3641 07/27/2012 06:03 PM Aaron Marcuse-Kubitza

bin/map: out_is_db: Output the put template to stdout so it will be validated in the automated testing

3627 07/26/2012 07:56 PM Aaron Marcuse-Kubitza

bin/map: by_col: db_xml.put_table() call: Use new col_defaults param to automatically set datasource_id to the in_label (datasource name)

3617 07/26/2012 04:48 PM Aaron Marcuse-Kubitza

bin/map: by_col: Only clear errors table if doing full re-import starting from row 0, not if restarting import at a later row

3581 07/24/2012 07:18 AM Aaron Marcuse-Kubitza

bin/map: in_is_xml: doc2rows(): "Root not found in input" warning: Changed "error" to "warning" to match the type of error condition signaled

3580 07/24/2012 07:15 AM Aaron Marcuse-Kubitza

bin/map: map_rows(): out_is_db: Changed `id_node != None` assertion to a warning because this is a normal circumstance in the base case where there are no mappings

3577 07/24/2012 06:37 AM Aaron Marcuse-Kubitza

bin/map: in_is_xml: doc2rows(): "Root not found in input" error: Changed SystemExit to a warning because this is a normal circumstance in the base case where the input XML file contains no rows

3427 07/17/2012 08:28 PM Aaron Marcuse-Kubitza

bin/map: by_col: Stripping XML functions not in the DB: Remove DB functions based on whether a plain SQL function of that name exists, rather than whether a relational function (i.e. a table) of that name exists. This will allow column-based import to use plain SQL functions that don't have a corresponding relational function.

3424 07/17/2012 08:08 PM Aaron Marcuse-Kubitza

process(): Changed rel_funcs param to a callback is_rel_func, so that caller can specify any dynamic function to determine if a name is a relational function rather than having to list out all known relational functions
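
The difference between passing a fixed list and passing a callback can be sketched like this. `split_funcs` and the function names in the example are hypothetical; only the `is_rel_func` parameter name comes from the commit.

```python
def split_funcs(names, is_rel_func):
    """Partition function names with a caller-supplied predicate.

    `is_rel_func(name)` returns True if `name` is a relational function.
    The caller decides how: a fixed set, a database lookup, and so on.
    """
    rel = [n for n in names if is_rel_func(n)]
    local = [n for n in names if not is_rel_func(n)]
    return rel, local

# A fixed set still works, wrapped as a predicate
known = {'db_func_a'}
rel, local = split_funcs(['db_func_a', 'local_func_b'], lambda name: name in known)
```

A predicate lets the caller consult the database lazily, one name at a time, instead of enumerating every known function up front.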

3339 07/11/2012 10:34 PM Aaron Marcuse-Kubitza

bin/map: Fixed bug where errors table indexes could not be looked up using index_cols() because their schema was not in the search_path, by explicitly adding the in_schema at the end of the search_path. This is the only reason the in_schema needs to be in the search_path, but it's unavoidable because the "duplicate key value violates unique constraint" error does not include the constraint's schema.

3324 07/11/2012 05:15 PM Aaron Marcuse-Kubitza

bin/map: ex_tracker: Don't add row_ct to iters count in column-based import (by_col) because errors are not done by row, so a % of rows affected is not meaningful

3243 07/06/2012 10:38 AM Aaron Marcuse-Kubitza

bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead

3186 07/02/2012 10:03 AM Aaron Marcuse-Kubitza

bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection

3132 06/27/2012 08:56 PM Aaron Marcuse-Kubitza

bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress

3117 06/27/2012 06:31 PM Aaron Marcuse-Kubitza

bin/map: Use new DbConn.close()

3103 06/26/2012 11:06 PM Aaron Marcuse-Kubitza

Moved error tracking from to

2977 06/20/2012 07:46 PM Aaron Marcuse-Kubitza

main Makefile: Removed empty_db, because `make schemas/reinstall` has the same effect and is simpler

2928 06/18/2012 05:38 PM Aaron Marcuse-Kubitza

put_table(): Removed no longer needed commit param

2927 06/18/2012 05:16 PM Aaron Marcuse-Kubitza

bin/map: Removed rollback() call before closing the connection because PostgreSQL does this automatically

2925 06/18/2012 05:13 PM Aaron Marcuse-Kubitza

Removed unnecessary db.db.commit() calls because commits are now done automatically by DbConn's autocommit mode

2920 06/18/2012 04:36 PM Aaron Marcuse-Kubitza

bin/map: connect_db(): Autocommit in commit mode to avoid the need for manual commits. This should also reduce the time that table locks are held, to avoid unnecessary contention when multiple processes are trying to insert into the same output table. (The program always uses nested transactions to support rollbacks, so there is no problem autocommitting whenever a top-level nested transaction or top-level query completes.)

2916 06/18/2012 04:25 PM Aaron Marcuse-Kubitza

Use new DbConn.debug_temp config option to control whether temporary objects should instead be permanent

2898 06/15/2012 03:53 AM Aaron Marcuse-Kubitza

bin/map: connect_db(): Only use autocommit mode if verbosity > 3, to avoid accidentally activating it if you want debug output in normal import mode

2897 06/15/2012 03:45 AM Aaron Marcuse-Kubitza

bin/map: connect_db(): Only use autocommit mode if verbosity > 2, because it causes the intermediate tables to be created as permanent tables, which you don't want unless you're actually debugging (verbosity = 2 is normal for column-based import)

2883 06/15/2012 12:38 AM Aaron Marcuse-Kubitza

Wrap sys.stderr.write() calls in strings.to_raw_str() to avoid UnicodeEncodeErrors when stderr is to a file and the default encoding is ASCII

2881 06/15/2012 12:12 AM Aaron Marcuse-Kubitza

bin/map: When logging the row # being processed, add 1 because row # is internally 0-based, but 1-based to the user

2880 06/15/2012 12:05 AM Aaron Marcuse-Kubitza

bin/map: Log the row # being processed with level=1.1 so that the user can see a status report if desired

2809 06/12/2012 10:26 PM Aaron Marcuse-Kubitza

bin/map: by_col: Pass on_error to db_xml.put_table() that calls ex_tracker.track()

2734 06/11/2012 04:38 PM Aaron Marcuse-Kubitza

bin/map: If doing full import, clear errors table

2639 06/06/2012 01:38 PM Aaron Marcuse-Kubitza

schemas/functions.sql: _nullIf: Fixed bug where NOT NULL parameters were not supported, because an input NULL value would not match an existing DEFAULT value in a select query, by temporarily disabling _nullIf until this can be supported. Removed previous workarounds.

2638 06/05/2012 03:21 PM Aaron Marcuse-Kubitza

bin/map: out_is_db, row-based mode: Disabled using DB relational functions instead of XML functions because they were causing problems

2602 06/04/2012 02:17 PM Aaron Marcuse-Kubitza

process(): Refactored to emphasize special handling for row-based and column-based modes. In row-based mode, always use a DB relational function over a local XML function when possible, to facilitate testing of DB relational functions in row-based mode. (The shadowed local XML version will still be tested in non-DB modes, such as outputting to intermediate XML files.)

2601 06/04/2012 01:01 PM Aaron Marcuse-Kubitza

bin/map: Move retrieval of out_db's relational functions outside of process_input() so they can also be used by the non-by_col case

2600 06/04/2012 12:52 PM Aaron Marcuse-Kubitza

bin/map: out_is_db: Don't evaluate relational functions in xml_func.process() because these will be evaluated by db_xml.put()

2598 06/04/2012 12:40 PM Aaron Marcuse-Kubitza

bin/map: Use xml_func.process(..., strip=True) instead of xml_func.strip()

2560 06/01/2012 06:47 PM Aaron Marcuse-Kubitza

bin/map: by_col: Stripping XML functions not in the DB: Fixed bug where preserve_funcs.add() was used when `preserve_funcs |=` should have been used to add the entire iterable that sql.tables() returns
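
The `add()` versus `|=` distinction, shown with plain Python sets. The function names are example values, and `db_funcs` is a hypothetical stand-in for whatever `sql.tables()` returns, assumed here to be a set-like collection.

```python
preserve_funcs = {'_alt'}
db_funcs = frozenset(['_merge', '_nullIf'])  # stand-in for the sql.tables() result

wrong = set(preserve_funcs)
wrong.add(db_funcs)   # adds the whole collection as ONE element

right = set(preserve_funcs)
right |= db_funcs     # adds each name individually
```

Note that `add()` only accepts a hashable argument, so passing a list would raise a TypeError instead of silently adding one element. `|=` requires a set or frozenset on the right-hand side, whereas `update()` accepts any iterable.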

2550 06/01/2012 03:40 PM Aaron Marcuse-Kubitza

bin/map: by_col: Strip only XML functions that are not in the DB

2490 05/30/2012 07:08 PM Aaron Marcuse-Kubitza

bin/map: Logging: log(): Remove extra debug info from DB query messages and format level 1.5 (summary) messages as Redmine list items

2480 05/30/2012 03:58 PM Aaron Marcuse-Kubitza

as_table(): Fixed bug where table was not ended properly, by adding a space after the last \n and having rstrip() strip only newlines

2476 05/29/2012 09:09 PM Aaron Marcuse-Kubitza

bin/map: Logging: log(): Strip trailing newlines from msg

2458 05/25/2012 07:01 PM Aaron Marcuse-Kubitza

bin/map: Fixed bug where verbosity needed to be 1 outside of test mode so that profiling and errors stats would be printed at end of import. Verbosity defaults to 0.5 rather than 1 in test mode so profiling and errors stats do not clutter up the test output when running automated tests.

2457 05/25/2012 06:55 PM Aaron Marcuse-Kubitza

bin/map: Only display verbose_errors in test mode, but with any nonzero verbosity. They should not be displayed outside of test mode because verbose errors make the log files huge.

2456 05/25/2012 06:52 PM Aaron Marcuse-Kubitza

bin/map: Renamed verbose param to verbosity because it's now a number, not a boolean

2455 05/25/2012 06:51 PM Aaron Marcuse-Kubitza

bin/map: Removed no longer used debug param (verbose=2 is used instead)

2454 05/25/2012 06:48 PM Aaron Marcuse-Kubitza

bin/map: Fixed bug where verbose_errors' default value depended on debug var, which was not yet set. Removed verbose_errors param and instead turn verbose_errors on whenever verbosity >= 1. Verbosity defaults to 1 in test mode.

2453 05/25/2012 06:33 PM Aaron Marcuse-Kubitza

bin/map: Logging: Don't set sql.run_raw_query.debug, because it is not used anymore (sql.connect(log_debug=...) is used instead)