/trunk/bin/map - Changes - BIEN 3 - NCEAS Projects

root/trunk/bin/map @ 14348

svn:executable: *

#	Date	Author	Comment
11970	01/20/2014 11:33 AM	Aaron Marcuse-Kubitza	moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
11918	12/17/2013 05:47 AM	Aaron Marcuse-Kubitza	bugfix: bin/map: in_is_db: don't ignore errors when the table does not exist, because these prevent an errexit and allow an import to continue when a staging table is missing. suppressing this error had previously been necessary because metadata-only tables (Source/) used to not have installed staging tables, and the program had to react accordingly.
11806	12/03/2013 08:58 AM	Aaron Marcuse-Kubitza	bin/map: support param start="", which indicates the default value. this fixes a bug in inputs/input.Makefile $(restart_row), which outputs "" if an explicit starting row is not found.
11396	10/21/2013 07:14 PM	Aaron Marcuse-Kubitza	fix: bin/map: put template: comment out the "Put template:" label so that the output is valid XML, and displays properly in a browser rather than showing a syntax error
11227	10/09/2013 10:12 PM	Aaron Marcuse-Kubitza	bin/map: usage: documented that verbosity > 3 in commit mode turns on debug_temp mode, which creates real tables instead of temp tables
10854	09/04/2013 01:28 PM	Aaron Marcuse-Kubitza	bin/map: allow user to override the source env var, which is used as the source.shortname value in the DB
10191	07/09/2013 12:56 AM	Aaron Marcuse-Kubitza	bin/map: removed no longer used support for map.csv input column prefixes (expand out the prefixes instead). this used to be used by SpeciesLink to use just one mapping for a single term with multiple DwC namespaces, but was replaced with an explicit, ordered rather than implicit, unordered /_alt-ing together of the terms.
10190	07/08/2013 11:47 PM	Aaron Marcuse-Kubitza	bin/map: removed no longer accurate comment that this is case- and punctuation-insensitive, since the case- and punctuation-insensitivity is now instead handled by map.csv preprocessing scripts before the mappings are even provided to bin/map
10140	07/02/2013 02:31 PM	Aaron Marcuse-Kubitza	bugfix: bin/map: in_is_db: inline metadata value columns (used by new-style import) so that they can be compared by value in XML simplifying functions (lib/xml_func.py)
10115	07/02/2013 03:50 AM	Aaron Marcuse-Kubitza	bin/map: map_table(): Resolve prefixes: combined db_xml.ColRef() constructor call with creation of args (as tuple) for clarity
10114	07/02/2013 03:35 AM	Aaron Marcuse-Kubitza	bin/map: update_in_label(): use in_schema instead of the map spreadsheet column name when available, to allow using one spreadsheet for all datasources (which would not have a datasource-specific spreadsheet column name)
9455	05/17/2013 01:13 PM	Aaron Marcuse-Kubitza	bin/map: by_col: ensure verbosity is at least 2 in live mode (using new ints.set_min() instead of max() for clarity). documented that live column-based import MUST be run with verbosity 2+ (3 preferred) to provide debugging information for often-complex errors. without this, debugging is effectively impossible.
9453	05/17/2013 12:57 PM	Aaron Marcuse-Kubitza	bin/map: Set default verbosity: by_col: documented that showing all queries is primarily to assist debugging, not profiling
8075	03/16/2013 02:16 PM	Aaron Marcuse-Kubitza	bin/map: No mappings warning: Added explanation that this could also be due to no column name matches, and hint to check if you are importing the correct input table
7148	01/10/2013 09:03 PM	Aaron Marcuse-Kubitza	bin/map: map_table(): Resolving prefixes: Fixed bug where need to use list instead of tuple for metadata value mappings
7122	01/08/2013 10:55 PM	Aaron Marcuse-Kubitza	bin/map: Made $redo flag default to off, because redo mode is slow (all tables have to be truncated) and is only needed when running tests on a public schema with data in it, which would not be the case on a development machine where tests are usually run
6740	12/11/2012 01:48 AM	Aaron Marcuse-Kubitza	bin/map: Removed column names simplification, which was causing columns with the same alphanumeric characters but different punctuation to be simplified to the same name. Name simplification is now performed by the mapping mechanism itself, and can be overridden in the mappings.
6454	11/25/2012 07:20 PM	Aaron Marcuse-Kubitza	bin/map: in_is_db: by_col: Clearing errors table: Skip this if the table has been set to None because it didn't exist (and thus was a metadata-only map spreadsheet)
6445	11/24/2012 02:41 PM	Aaron Marcuse-Kubitza	bin/map: in_is_db: If table does not exist, set table to None so that db_xml.put_table() doesn't try to access it. This fixes a bug in metadata-only map spreadsheets under column-based import.
6404	11/24/2012 07:32 AM	Aaron Marcuse-Kubitza	bin/map: update_in_label(): Removed hardcoded source_id col_default, which is now set in mappings/VegCore-VegBIEN.csv's output root
6399	11/24/2012 06:44 AM	Aaron Marcuse-Kubitza	bin/map: update_in_label(): Set $source env var to the in_label (datasource name), to make it available to _env()
6385	11/24/2012 05:05 AM	Aaron Marcuse-Kubitza	bin/map: Support map spreadsheets containing only metadata mappings (with no corresponding staging table), by falling back to an empty table when the named table does not exist
6179	11/14/2012 06:30 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Renamed reference -> source to make this table more broadly applicable, and because this now stores the datasource metadata
5953	11/01/2012 10:09 AM	Aaron Marcuse-Kubitza	mappings/VegCore-VegBIEN.csv: Renamed creator_ids to reference_id since they are now fkeys to reference
5952	11/01/2012 10:04 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Made creator_ids an fkey to reference instead of party, so that datasources are stored separately from people and to allow adding reference-type metadata (URL, copyright, etc.) for each datasource
5928	11/01/2012 06:41 AM	Aaron Marcuse-Kubitza	bin/map: map_rows(): map_table(): Fixed bug where metadata values were being removed prematurely, by passing them through
5927	11/01/2012 06:40 AM	Aaron Marcuse-Kubitza	bin/map: map_rows(): Fixed bug where metadata values were being passed to functions that expected columns, by placing them directly in the XML import tree and then removing them from the mappings
5911	11/01/2012 04:16 AM	Aaron Marcuse-Kubitza	bin/map: Added support for including literal metadata values in the map spreadsheet, by prefixing them with ':'
5523	10/15/2012 02:36 PM	Aaron Marcuse-Kubitza	sql.select() calls: Removed order_by=None everywhere that a stable row order is required (i.e. consistent between selects, or consistent between table transformations). This causes several tests to return different inserted row counts, because the input table is now being accessed in pkey order instead of in table order. This fixes a bug where tables with more rows than ~100 would return different results for repeated calls of the same non-ordered select.
5242	10/04/2012 08:26 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Renamed datasource_id to creator_id so it can apply generally to any entity (such as a person), not just an aggregated datasource. This also enables taxonconcept.datasource_id to merge with creator_id, which now serves the same purpose.
5234	10/04/2012 06:15 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: party: Made it datasource-scoped. Since this creates a recursive fkey, a datasource (a root party) should point to itself in this field, which will happen automatically by setting it to the special value 0.
5026	09/26/2012 11:49 PM	Aaron Marcuse-Kubitza	bin/map, db_xml.put_table() (row-based and column-based import): Don't sort the input table by its pkey, in order to support input tables with no pkey. Note that reading the input table in table order and having this match the input flat file's order is only possible with sql_io.import_csv()'s truncation of the table on a failed import, which ensures that the rows will be stored in inserted order.
4652	09/12/2012 02:28 PM	Aaron Marcuse-Kubitza	Removed no longer used intersect
4505	09/07/2012 09:16 AM	Aaron Marcuse-Kubitza	bin/map: map_table(): Refactored to map simplified to original column names first and then determine column index for each original name, in order to avoid trying to recover the original name from a simplified name where multiple original names might collide onto the same simplified name. Documented that it's case- and punctuation-insensitive.
4503	09/07/2012 08:42 AM	Aaron Marcuse-Kubitza	bin/map: map_table(): Resolve all mappings and prefixes after applying maps.simplify()
4492	09/06/2012 08:42 PM	Aaron Marcuse-Kubitza	Replaced repr() with strings.urepr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors
4491	09/06/2012 08:30 PM	Aaron Marcuse-Kubitza	Replaced str() with strings.ustr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors
4474	09/05/2012 09:09 AM	Aaron Marcuse-Kubitza	bin/map: Clearing errors table: Fixed bug where needed to check if sql_io.errors_table() returned None (indicating that the errors table didn't exist) before calling sql.drop_table()
4473	09/05/2012 09:04 AM	Aaron Marcuse-Kubitza	bin/map: Clearing errors table: Fixed bug where needed to use sql.drop_table() instead of sql.truncate() now that errors tables are not created until column-based import runs
4213	08/24/2012 07:24 PM	Aaron Marcuse-Kubitza	bin/map: Documented that it is duplicate-column safe (supports multiple columns of the same name)
4068	08/16/2012 12:39 PM	Aaron Marcuse-Kubitza	bin/map: collision_suffix: Setting back to _alt to test if _merge caused the SpeciesLink slowdown. SpeciesLink contains a huge number of equivalent columns due to each DwC term being present with namespaces for all versions of the DwC schema, and these columns can be combined either using _alt or _merge. _merge is only useful if the values in different versions of the same DwC field are different, which is not likely the case.
4049	08/15/2012 07:02 AM	Aaron Marcuse-Kubitza	bin/map: collision_suffix: Changed to use _merge instead of _alt to avoid losing source data on import when multiple fields collide
4048	08/15/2012 06:58 AM	Aaron Marcuse-Kubitza	bin/map: Preventing collisions if multiple inputs mapping to same output: Made collision suffix configurable so it can easily be changed
4047	08/15/2012 06:56 AM	Aaron Marcuse-Kubitza	bin/map: Preventing collisions if multiple inputs mapping to same output: Made collision suffix configurable so it can easily be changed
4042	08/15/2012 05:55 AM	Aaron Marcuse-Kubitza	bin/map: Run new xml_func.simplify() on the root before printing the put template, so that _alts and _merges with only one element for the current datasource will be printed in their simplified form (with the _alt/_merge removed). This facilitates automated testing after an _alt/_merge suffix has been added, because the put template provided as part of the automated test will only change for those datasources that actually have an entry for both mappings, which greatly reduces the number of tests that need to be accepted.
4026	08/15/2012 03:44 AM	Aaron Marcuse-Kubitza	Removed trailing whitespace on non-empty lines
3769	08/02/2012 08:54 PM	Aaron Marcuse-Kubitza	bin/map: input is CSV: Removed unused map_ var
3768	08/02/2012 08:50 PM	Aaron Marcuse-Kubitza	bin/map: Documented that it's multi-safe (supports an input appearing multiple times)
3715	08/01/2012 05:48 AM	Aaron Marcuse-Kubitza	bin/map: out_is_db: row-based mode: Debug-log the processed XML tree produced by xml_func.process()
3696	07/31/2012 08:04 PM	Aaron Marcuse-Kubitza	bin/map: Don't create unneeded /_ignore/inLabel element containing the datasource name because sql_io.put_table() now autopopulates the datasource_id
3689	07/30/2012 06:09 PM	Aaron Marcuse-Kubitza	bin/map: Fixed bug where needed to use sql.function_exists() to determine if something is a relational (now SQL) function, including in row-based mode, since that now uses sql_io.put_table(), which requires this. The bug fix relies on the new xml_func.process() feature that preserves unknown relational functions in case they are built-in functions rather than SQL functions.
3669	07/27/2012 10:48 PM	Aaron Marcuse-Kubitza	bin/map: Call sys.stdout.flush() after every call to sys.stdout.write() to avoid interleaved stdout/stderr output due to stdout buffering
3661	07/27/2012 09:38 PM	Aaron Marcuse-Kubitza	Moved importing of col_defaults from db_xml.put_table() to bin/map, so that it also happens in row-based mode. Note that this causes a DB entry for the datasource to always be created, even if the datasource has no mappings or no rows.
3653	07/27/2012 08:21 PM	Aaron Marcuse-Kubitza	bin/map: out_is_db: Use col_defaults in row-based mode as well
3652	07/27/2012 08:02 PM	Aaron Marcuse-Kubitza	db_xml.py: Renamed put_table_special_funcs to put_special_funcs because it is now used by put() as well
3641	07/27/2012 06:03 PM	Aaron Marcuse-Kubitza	bin/map: out_is_db: Output the put template to stdout so it will be validated in the automated testing
3627	07/26/2012 07:56 PM	Aaron Marcuse-Kubitza	bin/map: by_col: db_xml.put_table() call: Use new col_defaults param to automatically set datasource_id to the in_label (datasource name)
3617	07/26/2012 04:48 PM	Aaron Marcuse-Kubitza	bin/map: by_col: Only clear errors table if doing full re-import starting from row 0, not if restarting import at a later row
3581	07/24/2012 07:18 AM	Aaron Marcuse-Kubitza	bin/map: in_is_xml: doc2rows(): "Root not found in input" warning: Changed "error" to "warning" to match the type of error condition signaled
3580	07/24/2012 07:15 AM	Aaron Marcuse-Kubitza	bin/map: map_rows(): out_is_db: Changed `id_node != None` assertion to a warning because this is a normal circumstance in the base case where there are no mappings
3577	07/24/2012 06:37 AM	Aaron Marcuse-Kubitza	bin/map: in_is_xml: doc2rows(): "Root not found in input" error: Changed SystemExit to a warning because this is a normal circumstance in the base case where the input XML file contains no rows
3427	07/17/2012 08:28 PM	Aaron Marcuse-Kubitza	bin/map: by_col: Stripping XML functions not in the DB: Remove DB functions based on whether a plain SQL function of that name exists, rather than whether a relational function (i.e. a table) of that name exists. This will allow column-based import to use plain SQL functions that don't have a corresponding relational function.
3424	07/17/2012 08:08 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Changed rel_funcs param to a callback is_rel_func, so that caller can specify any dynamic function to determine if a name is a relational function rather than having to list out all known relational functions
3339	07/11/2012 10:34 PM	Aaron Marcuse-Kubitza	bin/map: Fixed bug where errors table indexes could not be looked up using index_cols() because their schema was not in the search_path, by explicitly adding the in_schema at the end of the search_path. This is the only reason the in_schema needs to be in the search_path, but it's unavoidable because the "duplicate key value violates unique constraint" error does not include the constraint's schema.
3324	07/11/2012 05:15 PM	Aaron Marcuse-Kubitza	bin/map: ex_tracker: Don't add row_ct to iters count in column-based import (by_col) because errors are not done by row, so a % of rows affected is not meaningful
3243	07/06/2012 10:38 AM	Aaron Marcuse-Kubitza	bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead
3186	07/02/2012 10:03 AM	Aaron Marcuse-Kubitza	bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection
3132	06/27/2012 08:56 PM	Aaron Marcuse-Kubitza	bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress
3117	06/27/2012 06:31 PM	Aaron Marcuse-Kubitza	bin/map: Use new DbConn.close()
3103	06/26/2012 11:06 PM	Aaron Marcuse-Kubitza	Moved error tracking from sql.py to sql_io.py
2977	06/20/2012 07:46 PM	Aaron Marcuse-Kubitza	main Makefile: Removed empty_db, because `make schemas/reinstall` has the same effect and is simpler
2928	06/18/2012 05:38 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Removed no longer needed commit param
2927	06/18/2012 05:16 PM	Aaron Marcuse-Kubitza	bin/map: Removed rollback() call before closing the connection because PostgreSQL does this automatically
2925	06/18/2012 05:13 PM	Aaron Marcuse-Kubitza	Removed unnecessary db.db.commit() calls because commits are now done automatically by DbConn's autocommit mode
2920	06/18/2012 04:36 PM	Aaron Marcuse-Kubitza	bin/map: connect_db(): Autocommit in commit mode to avoid the need for manual commits. This should also reduce the time that table locks are held, to avoid unnecessary contention when multiple processes are trying to insert into the same output table. (The program always uses nested transactions to support rollbacks, so there is no problem autocommitting whenever a top-level nested transaction or top-level query completes.)
2916	06/18/2012 04:25 PM	Aaron Marcuse-Kubitza	sql.py: Use new DbConn.debug_temp config option to control whether temporary objects should instead be permanent
2898	06/15/2012 03:53 AM	Aaron Marcuse-Kubitza	bin/map: connect_db(): Only use autocommit mode if verbosity > 3, to avoid accidentally activating it if you want debug output in normal import mode
2897	06/15/2012 03:45 AM	Aaron Marcuse-Kubitza	bin/map: connect_db(): Only use autocommit mode if verbosity > 2, because it causes the intermediate tables to be created as permanent tables, which you don't want unless you're actually debugging (verbosity = 2 is normal for column-based import)
2883	06/15/2012 12:38 AM	Aaron Marcuse-Kubitza	Wrap sys.stderr.write() calls in strings.to_raw_str() to avoid UnicodeEncodeErrors when stderr is to a file and the default encoding is ASCII
2881	06/15/2012 12:12 AM	Aaron Marcuse-Kubitza	bin/map: When logging the row # being processed, add 1 because row # is internally 0-based, but 1-based to the user
2880	06/15/2012 12:05 AM	Aaron Marcuse-Kubitza	bin/map: Log the row # being processed with level=1.1 so that the user can see a status report if desired
2809	06/12/2012 10:26 PM	Aaron Marcuse-Kubitza	bin/map: by_col: Pass on_error to db_xml.put_table() that calls ex_tracker.track()
2734	06/11/2012 04:38 PM	Aaron Marcuse-Kubitza	bin/map: If doing full import, clear errors table
2639	06/06/2012 01:38 PM	Aaron Marcuse-Kubitza	schemas/functions.sql: _nullIf: Fixed bug where NOT NULL parameters were not supported, because an input NULL value would not match an existing DEFAULT value in a select query, by temporarily disabling _nullIf until this can be supported. Removed previous workarounds.
2638	06/05/2012 03:21 PM	Aaron Marcuse-Kubitza	bin/map: out_is_db, row-based mode: Disabled using DB relational functions instead of XML functions because they were causing problems
2602	06/04/2012 02:17 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Refactored to emphasize special handling for row-based and column-based modes. In row-based mode, always use a DB relational function over a local XML function when possible, to facilitate testing of DB relational functions in row-based mode. (The shadowed local XML version will still be tested in non-DB modes, such as outputting to intermediate XML files.)
2601	06/04/2012 01:01 PM	Aaron Marcuse-Kubitza	bin/map: Move retrieval of out_db's relational functions outside of process_input() so they can also be used by the non-by_col case
2600	06/04/2012 12:52 PM	Aaron Marcuse-Kubitza	bin/map: out_is_db: Don't evaluate relational functions in xml_func.process() because these will be evaluated by db_xml.put()
2598	06/04/2012 12:40 PM	Aaron Marcuse-Kubitza	bin/map: Use xml_func.process(..., strip=True) instead of xml_func.strip()
2560	06/01/2012 06:47 PM	Aaron Marcuse-Kubitza	bin/map: by_col: Stripping XML functions not in the DB: Fixed bug where preserve_funcs.add() was used when `preserve_funcs |=` should have been used to add the entire iterable that sql.tables() returns
2550	06/01/2012 03:40 PM	Aaron Marcuse-Kubitza	bin/map: by_col: Strip only XML functions that are not in the DB
2490	05/30/2012 07:08 PM	Aaron Marcuse-Kubitza	bin/map: Logging: log(): Remove extra debug info from DB query messages and format level 1.5 (summary) messages as Redmine list items
2480	05/30/2012 03:58 PM	Aaron Marcuse-Kubitza	strings.py: as_table(): Fixed bug where table was not ended properly, by adding a space after the last \n and having rstrip() strip only newlines
2476	05/29/2012 09:09 PM	Aaron Marcuse-Kubitza	bin/map: Logging: log(): Strip trailing newlines from msg
2458	05/25/2012 07:01 PM	Aaron Marcuse-Kubitza	bin/map: Fixed bug where verbosity needed to be 1 outside of test mode so that profiling and errors stats would be printed at end of import. Verbosity defaults to 0.5 rather than 1 in test mode so profiling and errors stats do not clutter up the test output when running automated tests.
2457	05/25/2012 06:55 PM	Aaron Marcuse-Kubitza	bin/map: Only display verbose_errors in test mode, but with any nonzero verbosity. They should not be displayed outside of test mode because verbose errors make the log files huge.
2456	05/25/2012 06:52 PM	Aaron Marcuse-Kubitza	bin/map: Renamed verbose param to verbosity because it's now a number, not a boolean
2455	05/25/2012 06:51 PM	Aaron Marcuse-Kubitza	bin/map: Removed no longer used debug param (verbose=2 is used instead)
2454	05/25/2012 06:48 PM	Aaron Marcuse-Kubitza	bin/map: Fixed bug where verbose_errors' default value depended on debug var, which was not yet set. Removed verbose_errors param and instead turn verbose_errors on whenever verbosity >= 1. Verbosity defaults to 1 in test mode.
2453	05/25/2012 06:33 PM	Aaron Marcuse-Kubitza	bin/map: Logging: Don't set sql.run_raw_query.debug, because it is not used anymore (sql.connect(log_debug=...) is used instead)
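
The set-handling bug fixed in r2560 can be illustrated with a minimal sketch. This is not the code from bin/map: tables() below is a hypothetical stand-in for sql.tables(), and the names in it are only examples taken from elsewhere in this log. It assumes the returned value is a hashable iterable such as a tuple, in which case add() succeeds silently but stores the whole iterable as a single element.

```python
def tables():
    """Hypothetical stand-in for sql.tables(); the real one queries the DB."""
    return ("_alt", "_merge", "_nullIf")

# Buggy form: add() inserts the whole tuple as ONE element of the set,
# so none of the individual names end up being preserved.
buggy = {"_ignore"}
buggy.add(tables())

# Fixed form: |= unions the names into the set one by one.
# The right-hand side of |= must itself be a set, hence the set() wrapper;
# fixed.update(tables()) would accept any iterable directly.
fixed = {"_ignore"}
fixed |= set(tables())

print(len(buggy), len(fixed))
```

With an unhashable iterable such as a list, add() would raise a TypeError instead, so the silent variant of the bug only shows up with hashable return types.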
