moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
bugfix: bin/map: in_is_db: don't ignore errors when the table does not exist, because these prevent an errexit and allow an import to continue when a staging table is missing. suppressing this error had previously been necessary because metadata-only tables (Source/) used to not have installed staging tables, and the program had to react accordingly.
bin/map: support param start="", which indicates the default value. this fixes a bug in inputs/input.Makefile $(restart_row), which outputs "" if an explicit starting row is not found.
fix: bin/map: put template: comment out the "Put template:" label so that the output is valid XML, and displays properly in a browser rather than showing a syntax error
bin/map: usage: documented that verbosity > 3 in commit mode turns on debug_temp mode, which creates real tables instead of temp tables
bin/map: allow user to override the source env var, which is used as the source.shortname value in the DB
bin/map: removed no longer used support for map.csv input column prefixes (expand out the prefixes instead). this used to be used by SpeciesLink to use just one mapping for a single term with multiple DwC namespaces, but was replaced with an explicit, ordered rather than implicit, unordered /_alt-ing together of the terms.
bin/map: removed no longer accurate comment that this is case- and punctuation-insensitive, since the case- and punctuation-insensitivity is now instead handled by map.csv preprocessing scripts before the mappings are even provided to bin/map
bugfix: bin/map: in_is_db: inline metadata value columns (used by new-style import) so that they can be compared by value in XML simplifying functions (lib/xml_func.py)
bin/map: map_table(): Resolve prefixes: combined db_xml.ColRef() constructor call with creation of args (as tuple) for clarity
bin/map: update_in_label(): use in_schema instead of the map spreadsheet column name when available, to allow using one spreadsheet for all datasources (which would not have a datasource-specific spreadsheet column name)
bin/map: by_col: ensure verbosity is at least 2 in live mode (using new ints.set_min() instead of max() for clarity). documented that live column-based import MUST be run with verbosity 2+ (3 preferred) to provide debugging information for often-complex errors. without this, debugging is effectively impossible.
bin/map: Set default verbosity: by_col: documented that showing all queries is primarily to assist debugging, not profiling
bin/map: No mappings warning: Added explanation that this could also be due to no column name matches, and hint to check if you are importing the correct input table
bin/map: map_table(): Resolving prefixes: Fixed bug where need to use list instead of tuple for metadata value mappings
bin/map: Made $redo flag default to off, because redo mode is slow (all tables have to be truncated) and is only needed when running tests on a public schema with data in it, which would not be the case on a development machine where tests are usually run
bin/map: Removed column names simplification, which was causing columns with the same alphanumeric characters but different punctuation to be simplified to the same name. Name simplification is now performed by the mapping mechanism itself, and can be overridden in the mappings.
bin/map: in_is_db: by_col: Clearing errors table: Skip this if the table has been set to None because it didn't exist (and thus was a metadata-only map spreadsheet)
bin/map: in_is_db: If table does not exist, set table to None so that db_xml.put_table() doesn't try to access it. This fixes a bug in metadata-only map spreadsheets under column-based import.
bin/map: update_in_label(): Removed hardcoded source_id col_default, which is now set in mappings/VegCore-VegBIEN.csv's output root
bin/map: update_in_label(): Set $source env var to the in_label (datasource name), to make it available to _env()
bin/map: Support map spreadsheets containing only metadata mappings (with no corresponding staging table), by falling back to an empty table when the named table does not exist
schemas/vegbien.sql: Renamed reference -> source to make this table more broadly applicable, and because this now stores the datasource metadata
mappings/VegCore-VegBIEN.csv: Renamed creator_ids to reference_id since they are now fkeys to reference
schemas/vegbien.sql: Made creator_ids an fkey to reference instead of party, so that datasources are stored separately from people and to allow adding reference-type metadata (URL, copyright, etc.) for each datasource
bin/map: map_rows(): map_table(): Fixed bug where metadata values were being removed prematurely, by passing them through
bin/map: map_rows(): Fixed bug where metadata values were being passed to functions that expected columns, by placing them directly in the XML import tree and then removing them from the mappings
bin/map: Added support for including literal metadata values in the map spreadsheet, by prefixing them with ':'
sql.select() calls: Removed order_by=None everywhere that a stable row order is required (i.e. consistent between selects, or consistent between table transformations). This causes several tests to return different inserted row counts, because the input table is now being accessed in pkey order instead of in table order. This fixes a bug where tables with more rows than ~100 would return different results for repeated calls of the same non-ordered select.
schemas/vegbien.sql: Renamed datasource_id to creator_id so it can apply generally to any entity (such as a person), not just an aggregated datasource. This also enables taxonconcept.datasource_id to merge with creator_id, which now serves the same purpose.
schemas/vegbien.sql: party: Made it datasource-scoped. Since this creates a recursive fkey, a datasource (a root party) should point to itself in this field, which will happen automatically by setting it to the special value 0.
bin/map, db_xml.put_table() (row-based and column-based import): Don't sort the input table by its pkey, in order to support input tables with no pkey. Note that reading the input table in table order and having this match the input flat file's order is only possible with sql_io.import_csv()'s truncation of the table on a failed import, which ensures that the rows will be stored in inserted order.
Removed no longer used intersect
bin/map: map_table(): Refactored to map simplified to original column names first and then determine column index for each original name, in order to avoid trying to recover the original name from a simplified name where multiple original names might collide onto the same simplified name. Documented that it's case- and punctuation-insensitive.
bin/map: map_table(): Resolve all mappings and prefixes after applying maps.simplify()
Replaced repr() with strings.urepr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors
Replaced str() with strings.ustr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors
bin/map: Clearing errors table: Fixed bug where needed to check if sql_io.errors_table() returned None (indicating that the errors table didn't exist) before calling sql.drop_table()
bin/map: Clearing errors table: Fixed bug where needed to use sql.drop_table() instead of sql.truncate() now that errors tables are not created until column-based import runs
bin/map: Documented that it is duplicate-column safe (supports multiple columns of the same name)
bin/map: collision_suffix: Setting back to _alt to test if _merge caused the SpeciesLink slowdown. SpeciesLink contains a huge number of equivalent columns due to each DwC term being present with namespaces for all versions of the DwC schema, and these columns can be combined either using _alt or _merge. _merge is only useful if the values in different versions of the same DwC field are different, which is not likely the case.
bin/map: collision_suffix: Changed to use _merge instead of _alt to avoid losing source data on import when multiple fields collide
bin/map: Preventing collisions if multiple inputs mapping to same output: Made collision suffix configurable so it can easily be changed
bin/map: Run new xml_func.simplify() on the root before printing the put template, so that _alts and _merges with only one element for the current datasource will be printed in their simplified form (with the _alt/_merge removed). This faciliates automated testing after an _alt/_merge suffix has been added, because the put template provided as part of the automated test will only change for those datasources that actually have an entry for both mappings, which greatly reduces the number of tests that need to be accepted.
Removed trailing whitespace on non-empty lines
bin/map: input is CSV: Removed unused map_ var
bin/map: Documented that it's multi-safe (supports an input appearing multiple times)
bin/map: out_is_db: row-based mode: Debug-log the processed XML tree produced by xml_func.process()
bin/map: Don't create unneeded /_ignore/inLabel element containing the datasource name because sql_io.put_table() now autopopulates the datasource_id
bin/map: Fixed bug where needed to use sql.function_exists() to determine if something is a relational (now SQL) function, including in row-based mode, since that now uses sql_io.put_table(), which requires this. The bug fix relies on the new xml_func.process() feature that preserves unknown relational functions in case they are built-in functions rather than SQL functions.
bin/map: Call sys.stdout.flush() after every call to sys.stdout.write() to avoid interleaved stdout/stderr output due to stdout buffering
Moved importing of col_defaults from db_xml.put_table() to bin/map, so that it also happens in row-based mode. Note that this causes a DB entry for the datasource to always be created, even if the datasource has no mappings or no rows.
bin/map: out_is_db: Use col_defaults in row-based mode as well
db_xml.py: Renamed put_table_special_funcs to put_special_funcs because it is now used by put() as well
bin/map: out_is_db: Output the put template to stdout so it will be validated in the automated testing
bin/map: by_col: db_xml.put_table() call: Use new col_defaults param to automatically set datasource_id to the in_label (datasource name)
bin/map: by_col: Only clear errors table if doing full re-import starting from row 0, not if restarting import at a later row
bin/map: in_is_xml: doc2rows(): "Root not found in input" warning: Changed "error" to "warning" to match the type of error condition signaled
bin/map: map_rows(): out_is_db: Changed `id_node != None` assertion to a warning because this is a normal circumstance in the base case where there are no mappings
bin/map: in_is_xml: doc2rows(): "Root not found in input" error: Changed SystemExit to a warning because this is a normal circumstance in the base case where the input XML file contains no rows
bin/map: by_col: Stripping XML functions not in the DB: Remove DB functions based on whether a plain SQL function of that name exists, rather than whether a relational function (i.e. a table) of that name exists. This will allow column-based import to use plain SQL functions that don't have a corresponding relational function.
xml_func.py: process(): Changed rel_funcs param to a callback is_rel_func, so that caller can specify any dynamic function to determine if a name is a relational function rather than having to list out all known relational functions
bin/map: Fixed bug where errors table indexes could not be looked up using index_cols() because their schema was not in the search_path, by explicitly adding the in_schema at the end of the search_path. This is the only reason the in_schema needs to be in the search_path, but it's unavoidable because the "duplicate key value violates unique constraint" error does not included the constraint's schema.
bin/map: ex_tracker: Don't add row_ct to iters count in column-based import (by_col) because errors are not done by row, so a % of rows affected is not meaningful
bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead
bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection
bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress
bin/map: Use new DbConn.close()
Moved error tracking from sql.py to sql_io.py
main Makefile: Removed empty_db, because `make schemas/reinstall` has the same effect and is simpler
db_xml.py: put_table(): Removed no longer needed commit param
bin/map: Removed rollback() call before closing the connection because PostgreSQL does this automatically
Removed unnecessary db.db.commit() calls because commits are now done automatically by DbConn's autocommit mode
bin/map: connect_db(): Autocommit in commit mode to avoid the need for manual commits. This should also reduce the time that table locks are held, to avoid unnecessary contention when multiple processes are trying to insert into the same output table. (The program always uses nested transactions to support rollbacks, so there is no problem autocommitting whenever a top-level nested transaction or top-level query completes.)
sql.py: Use new DbConn.debug_temp config option to control whether temporary objects should instead be permanent
bin/map: connect_db(): Only use autocommit mode if verbosity > 3, to avoid accidentally activating it if you want debug output in normal import mode
bin/map: connect_db(): Only use autocommit mode if verbosity > 2, because it causes the intermediate tables to be created as permanent tables, which you don't want unless you're actually debugging (verbosity = 2 is normal for column-based import)
Wrap sys.stderr.write() calls in strings.to_raw_str() to avoid UnicodeEncodeErrors when stderr is to a file and the default encoding is ASCII
bin/map: When logging the row # being processed, add 1 because row # is interally 0-based, but 1-based to the user
bin/map: Log the row # being processed with level=1.1 so that the user can see a status report if desired
bin/map: by_col: Pass on_error to db_xml.put_table() that calls ex_tracker.track()
bin/map: If doing full import, clear errors table
schemas/functions.sql: _nullIf: Fixed bug where NOT NULL parameters were not supported, because an input NULL value would not match an existing DEFAULT value in a select query, by temporarily disabling _nullIf until this can be supported. Removed previous workarounds.
bin/map: out_is_db, row-based mode: Disabled using DB relational functions instead of XML functions because they were causing problems
xml_func.py: process(): Refactored to emphasize special handling for row-based and column-based modes. In row-based mode, always use a DB relational function over a local XML function when possible, to faciliate testing of DB relational functions in row-based mode. (The shadowed local XML version will still be tested in non-DB modes, such as outputting to intermediate XML files.)
bin/map: Move retrieval of out_db's relational functions outside of process_input() so they can also be used by the non-by_col case
bin/map: out_is_db: Don't evaluate relational functions in xml_func.process() because these will be evaluated by db_xml.put()
bin/map: Use xml_func.process(..., strip=True) instead of xml_func.strip()
bin/map: by_col: Stripping XML functions not in the DB: Fixed bug where preserve_funcs.add() was used when `preserve_funcs |=` should have been used to add the entire iterable that sql.tables() returns
bin/map: by_col: Strip only XML functions that are not in the DB
bin/map: Logging: log(): Remove extra debug info from DB query messages and format level 1.5 (summary) messages as Redmine list items
strings.py: as_table(): Fixed bug where table was not ended properly, by adding a space after the last \n and having rstrip() string only newlines
bin/map: Logging: log(): Strip trailing newlines from msg
bin/map: Fixed bug where verbosity needed to be 1 outside of test mode so that profiling and errors stats would be printed at end of import. Verbosity defaults to 0.5 rather than 1 in test mode so profiling and errors stats do not clutter up the test output when running automated tests.
bin/map: Only display verbose_errors in test mode, but with any nonzero verbosity. They should not be displayed outside of test mode because verbose errors make the log files huge.
bin/map: Renamed verbose param to verbosity because it's now a number, not a boolean
bin/map: Removed no longer used debug param (verbose=2 is used instead)
bin/map: Fixed bug where verbose_errors' default value depended on debug var, which was not yet set. Removed verbose_errors param and instead turn verbose_errors on whenever verbosity >= 1. Verbosity defaults to 1 in test mode.
bin/map: Logging: Don't set sql.run_raw_query.debug, because it is not used anymore (sql.connect(log_debug=...) is used instead)