Activity - BIEN 3 - NCEAS Projects

From 04/05/2012 to 05/04/2012

05/04/2012

07:15 PM Revision 2069: sql.py: mk_insert_select(): Removed unused params recover and cacheable: Aaron Marcuse-Kubitza
07:10 PM Revision 2068: sql.py: Added mogrify(): Aaron Marcuse-Kubitza
07:00 PM Revision 2067: db_xml.py: put_table(): Corrected @return doc: Aaron Marcuse-Kubitza
06:32 PM Revision 2066: sql.py: Added mk_insert_select() and use it in insert_select(): Aaron Marcuse-Kubitza
06:21 PM Revision 2065: db_xml.py: put_table(): Use new insert_select(): Aaron Marcuse-Kubitza
06:15 PM Revision 2064: sql.py: insert_select(): Changed order of cols and params arguments so select_query and params would be together: Aaron Marcuse-Kubitza
06:12 PM Revision 2063: sql.py: Added insert_select() and use it in insert(): Aaron Marcuse-Kubitza
04:55 PM Revision 2062: Calls to sql.esc_name*(): Removed preserve_case=True because it is now the default: Aaron Marcuse-Kubitza
04:51 PM Revision 2061: sql.py: esc_name_by_module(): Changed preserve_case to ignore_case, which defaults to False: Aaron Marcuse-Kubitza
04:49 PM Revision 2060: Calls to sql.esc_name*(): Removed preserve_case=True because it is now the default: Aaron Marcuse-Kubitza
04:47 PM Revision 2059: sql.py: esc_name_by_module(): preserve_case defaults to True: Aaron Marcuse-Kubitza
04:44 PM Revision 2058: sql.py: mk_select(): Escape all names used (table, column, cond, etc.): Aaron Marcuse-Kubitza
04:33 PM Revision 2057: sql.py: esc_name_by_module(): If not enclosing name in quotes, call check_name() on it: Aaron Marcuse-Kubitza
04:30 PM Revision 2056: sql.py: mk_select(): Support literal values in the list of cols to select: Aaron Marcuse-Kubitza
03:22 PM Revision 2055: sql.py: mk_select(): Don't escape the table name, because it will either be check_name()d or it's already been escaped: Aaron Marcuse-Kubitza
03:11 PM Revision 2054: sql.py: Added mk_select(), and use it in select(): Aaron Marcuse-Kubitza
02:14 PM Revision 2053: bin/map: Always pass qual_name(table) to sql.select(). This is possible now that qual_name() can handle None schemas.: Aaron Marcuse-Kubitza
02:08 PM Revision 2052: db_xml.py: put_table(): Take separate in_table and in_schema names, instead of in_table and table_is_esc, because the in_schema is needed to scope the temp tables appropriately: Aaron Marcuse-Kubitza
02:04 PM Revision 2051: sql.py: qual_name(): If schema is None, don't prepend schema: Aaron Marcuse-Kubitza
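
The qual_name() change in r2051 can be illustrated with a minimal sketch. The log does not show the real signatures in sql.py, so the argument order and the esc_name() helper below are assumptions made only for illustration.

```python
def esc_name(name):
    # Hypothetical helper: double-quote an identifier so its case is
    # preserved, doubling any embedded double quotes.
    return '"' + name.replace('"', '""') + '"'


def qual_name(schema, table):
    # Assumed behaviour per r2051: prepend the schema only when one is given.
    table = esc_name(table)
    if schema is not None:
        return esc_name(schema) + '.' + table
    return table
```

With this shape, a caller such as bin/map could always pass the qualified name, whether or not a schema is set, which is what r2053 describes.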

05/03/2012

06:59 PM Revision 2050: bin/map, sql.py: Turned SQL query caching back on because benchmarks of just the caching on vs. off reveal that it does reduce processing time significantly. However, there is a slowdown that was introduced between the time caching was added and the time the same XML tree was used for each node, which was giving the false indication that the slowdown was due to the caching.: Aaron Marcuse-Kubitza
06:44 PM Revision 2049: bin/map: Turn SQL query caching off by default: Aaron Marcuse-Kubitza
06:39 PM Revision 2048: bin/map: Added cache_sql env var to enable SQL query caching: Aaron Marcuse-Kubitza
06:39 PM Revision 2047: sql.py: DbConn: Made caching optional. Turn caching off by default because recent benchmarks (n=1000) were showing that it slows things down.: Aaron Marcuse-Kubitza
04:53 PM Revision 2046: bin/map: Added new verbose_errors mode, enabled in test mode and off otherwise, which controls whether the output row and tracebacks are included in error messages. Having this off in import mode will reduce the size of error logs so they don't fill up the vegbiendev hard disk as quickly.: Aaron Marcuse-Kubitza
04:51 PM Revision 2045: exc.py: print_ex(): Added detail option to turn off traceback: Aaron Marcuse-Kubitza
04:10 PM Revision 2044: bin/map: Turn parallel processing off by default. This should fix "Cannot allocate memory" errors in large imports.: Aaron Marcuse-Kubitza

05/01/2012

07:58 AM Revision 2043: bin/map: in_is_db: Don't cache the main SELECT query: Aaron Marcuse-Kubitza
07:56 AM Revision 2042: bin/map: by_col: Use the created template, which already has the column names in it, instead of mapping a sample row: Aaron Marcuse-Kubitza
07:50 AM Revision 2041: bin/map: Fixed bug where db_xml could not be imported twice, or it was treated as an undefined variable for some reason: Aaron Marcuse-Kubitza
07:45 AM Revision 2040: bin/map: map_table(): Make each column a db_xml.ColRef instead of a bare index, so that it will appear as the column name when converted to a string. This will provide better debugging info in the template tree and also avoid needing to create a separate sample row in by_col.: Aaron Marcuse-Kubitza
07:33 AM Revision 2039: db_xml.py: Added ColRef: Aaron Marcuse-Kubitza
06:33 AM Revision 2038: bin/map: Fixed bug where row count was off by one if all rows in the input were exhausted, because the row that raises StopIteration was counting as a row: Aaron Marcuse-Kubitza
06:13 AM Revision 2037: main Makefile: VegBIEN DB: mk_db: Use template1 because it has PROCEDURAL LANGUAGE plpgsql already installed and we aren't using an encoding other than UTF8: Aaron Marcuse-Kubitza
06:11 AM Revision 2036: Moved "CREATE PROCEDURAL LANGUAGE plpgsql" to main Makefile so that it would only run when the DB is created, not when the public schema is reinstalled. This is only relevant on PostgreSQL < 9.x, where the plpgsql language is not part of template0.: Aaron Marcuse-Kubitza
05:56 AM Revision 2035: Renamed parallel.py to parallelproc.py to avoid conflict with new system parallel module on vegbiendev: Aaron Marcuse-Kubitza
05:43 AM Revision 2034: Makefile: VegBIEN DB: public schema: Added schemas/rotate: Aaron Marcuse-Kubitza
05:34 AM Revision 2033: bin/map: Fixed bug in input rows processed count where the count would be off by 1, because the for loop would leave i at the index of the last row instead of one-past-the-last: Aaron Marcuse-Kubitza
04:44 AM Revision 2032: bin/map: Use the same XML tree for each row in DB outputs, to eliminate time spent creating the tree from the XPaths for each row: Aaron Marcuse-Kubitza
04:08 AM Revision 2031: bin/map: map_table(): Resolve each prefix into a separate mapping, which is collision-eliminated, instead of resolving values from multiple prefixes when each individual row is mapped: Aaron Marcuse-Kubitza
03:50 AM Revision 2030: bin/map: Moved collision-prevention code to map_rows() so it would only run if there were mappings, and so that it would run after any mappings preprocessing by map_table() that creates more collisions: Aaron Marcuse-Kubitza
03:45 AM Revision 2029: bin/map: Prevent collisions if multiple inputs map to the same output: Aaron Marcuse-Kubitza
02:02 AM Revision 2028: mappings/DwC1-DwC2.specimens.csv: Mapped collectorNumber and recordNumber to recordNumber with _alt so they wouldn't collide when every input column, even an empty one, is created in the XML tree: Aaron Marcuse-Kubitza
12:42 AM Revision 2027: bin/map: If out_is_db, in debug mode, print each row's XML tree and each value that it's putting: Aaron Marcuse-Kubitza
12:36 AM Revision 2026: bin/map: If out_is_db, in debug mode, print the template XML tree used to insert a sample row into the DB: Aaron Marcuse-Kubitza

04/30/2012

11:57 PM Revision 2025: bin/map: map_table(): When translating mappings to column indexes, use appends to a new list instead of deletions from an existing list to simplify the algorithm: Aaron Marcuse-Kubitza
11:20 PM Revision 2024: union: Omit mappings that are mapped *to* in the input map, in addition to mappings that were overridden. This prevents multiple outputs being created for both the renamed and original mappings, causing duplicate output nodes when one XML tree is used for all rows.: Aaron Marcuse-Kubitza
11:18 PM Revision 2023: union: Omit mappings that are mapped *to* in the input map, in addition to mappings that were overridden. This prevents multiple outputs being created for both the renamed and original mappings, causing duplicate output nodes when one XML tree is used for all rows.: Aaron Marcuse-Kubitza
11:17 PM Revision 2022: input.Makefile: Maps building: Via maps cleanup: subtract: Include comment column so commented mappings are never removed: Aaron Marcuse-Kubitza
11:07 PM Revision 2021: subtract: Support "ragged rows" that have fewer columns than the specified column numbers: Aaron Marcuse-Kubitza
11:06 PM Revision 2020: util.py: list_subset(): Added default param to specify the value to use for invalid indexes (if any): Aaron Marcuse-Kubitza
09:44 AM Revision 2019: mappings/VegX-VegBIEN.stems.csv: Mappings with multiple inputs for the same output: Use _alt, etc. to map the multiple inputs to different places in the XML tree, so that when using a pregenerated tree, the empty leaves for each input will not collide with each other: Aaron Marcuse-Kubitza
09:20 AM Revision 2018: mappings/VegX-VegBIEN.stems.csv: Changed XPath references (using "$") to XML function references using _ref where needed to make them work even on a pre-made XML tree used by all rows: Aaron Marcuse-Kubitza
09:13 AM Revision 2017: xml_func.py: Added _ref to retrieve a value from another XML node: Aaron Marcuse-Kubitza
06:12 AM Revision 2016: xml_func.py: Made all functions take a 2nd node param, which contains the func node itself: Aaron Marcuse-Kubitza
04:15 AM Revision 2015: bin/map: If outputting to a DB, also create output XML elements for NULL input values. This will help with the transition to using the same XML tree for all rows.: Aaron Marcuse-Kubitza
04:09 AM Revision 2014: xml_func.py: _label: return None on empty input: Aaron Marcuse-Kubitza
03:46 AM Revision 2013: mappings/VegX-VegBIEN.stems.csv: Added _collapse around subtrees that need to be removed if they are created around a NULL value: Aaron Marcuse-Kubitza
03:40 AM Revision 2012: xml_func.py: Added _collapse to collapse a subtree if the "value" element in it is NULL: Aaron Marcuse-Kubitza
01:44 AM Revision 2011: schemas/vegbien.sql: definedvalue: Made definedvalue nullable so that each row of a datasource can have a uniform structure in VegBIEN, and to support reusing the same XML DOM tree for each row: Aaron Marcuse-Kubitza
01:11 AM Revision 2010: xpath.py: Added is_xpath(): Aaron Marcuse-Kubitza
01:10 AM Revision 2009: xml_dom.py: set_value(): If value is None and node is Element, remove value node entirely instead of setting node's value to None: Aaron Marcuse-Kubitza
01:02 AM Revision 2008: xml_dom.py: Added value_node(). Use new value_node() in value() and set_value(). set_value(): If the node already has a value node, reuse it instead of appending a new value node.: Aaron Marcuse-Kubitza
12:35 AM Revision 2007: xpath.py: put_obj(): Return the id_attr_node using get_1() because it should only be one node: Aaron Marcuse-Kubitza
12:30 AM Revision 2006: xml_func.py: _simplifyPath: Also treat the elem as empty if the required node exists but is empty: Aaron Marcuse-Kubitza
12:04 AM Revision 2005: db_xml.py: put_table(): Added part of put() code that should be common to both functions: Aaron Marcuse-Kubitza

04/27/2012

06:16 PM Revision 2004: xpath.py: put_obj(): Return a tuple of the inserted node and the id attr node: Aaron Marcuse-Kubitza
06:13 PM Revision 2003: xpath.py: set_id(): When creating the id_path, use obj() (which deepcopy()s the entire path) because it prevents pointers w/o targets: Aaron Marcuse-Kubitza
06:05 PM Revision 2002: xpath.py: set_id(): When creating the id_path, deepcopy() the id_elem because its keys will change in the main copy: Aaron Marcuse-Kubitza
05:47 PM Revision 2001: xpath.py: set_id(): Return the path to the ID attr, which can be used to change the ID: Aaron Marcuse-Kubitza
05:25 PM Revision 2000: xpath.py: put_obj(): Return the inserted node so it can be used to change the inserted value: Aaron Marcuse-Kubitza
05:08 PM Revision 1999: main Makefile: Maps validation: Fixed bug where there would be infinite recursion with the Maps validation section before the Subdir forwarding section (it's unknown why this is necessary): Aaron Marcuse-Kubitza

04/26/2012

07:12 PM Revision 1998: db_xml.py: put_table(): Added commit param to specify whether to commit after each query: Aaron Marcuse-Kubitza
06:55 PM Revision 1997: bin/map: in_is_db: by_col: Use new put_table() (defined but not implemented yet): Aaron Marcuse-Kubitza
06:54 PM Revision 1996: db_xml.py: Added put_table() (without implementation): Aaron Marcuse-Kubitza
06:52 PM Revision 1995: xml_func.py: strip(): Remove _ignore XML funcs completely instead of replacing them with their values: Aaron Marcuse-Kubitza
06:26 PM Revision 1994: bin/map: in_is_db: by_col: Prefix each input column name by "$": Aaron Marcuse-Kubitza
06:11 PM Revision 1993: bin/map: in_is_db: by_col: Strip off XML functions: Aaron Marcuse-Kubitza
06:09 PM Revision 1992: xml_func.py: Added strip(). pop_value(): Support custom name of value param.: Aaron Marcuse-Kubitza
05:44 PM Revision 1991: bin/map: in_is_db: by_col: Create XML tree of sample row, with the input column names as the values. This tree will guide the sequencing and creation of the column-based queries.: Aaron Marcuse-Kubitza
05:43 PM Revision 1990: input.Makefile: use_staged env var: defaults to on if by_col is on: Aaron Marcuse-Kubitza
05:00 PM Revision 1989: bin/map: Only turn on by_col optimization if mapping to same DB, rather than requiring each place that checks by_col to also check whether mapping to same DB: Aaron Marcuse-Kubitza

04/24/2012

06:32 PM Revision 1988: input.Makefile: Testing: Don't abort tester if only staging test fails, in case staging table missing: Aaron Marcuse-Kubitza
06:25 PM Revision 1987: input.Makefile: Testing: When cleaning up test outputs, remove everything that doesn't end in .ref: Aaron Marcuse-Kubitza
06:11 PM Revision 1986: input.Makefile: Testing: Added test/import.%.staging.out test to test the staging tables. Sources: cat: Updated Usage comment to include the "inputs/<datasrc>/" prefix the user would need to add when running make.: Aaron Marcuse-Kubitza
05:33 PM Revision 1985: bin/map: Fixed bug where mapping to same DB wouldn't work because by-column optimization wasn't implemented yet, by turning it off by default and allowing it to be enabled with an env var: Aaron Marcuse-Kubitza
05:25 PM Revision 1984: bin/map: DB inputs: Use by-column optimization if mapping to same DB (with skeleton code for optimization's implementation): Aaron Marcuse-Kubitza
05:12 PM Revision 1983: input.Makefile: Mapping: Use the staging tables instead of any flat files if use_staged is specified: Aaron Marcuse-Kubitza
05:10 PM Revision 1982: bin/map: Support custom schema name. Support input table/schema override via env vars, in case the map spreadsheet was written for a different input format.: Aaron Marcuse-Kubitza
05:01 PM Revision 1981: sql.py: qual_name(): Fixed bugs where esc_name() nested func couldn't have same name as outer func, and esc_name() needed to be invoked without the module name because it's in the same module. select(): Support already-escaped table names.: Aaron Marcuse-Kubitza
04:16 PM Revision 1980: main Makefile: $(psqlAsAdmin): Tell sudo to preserve env vars so PGOPTIONS is passed to psql: Aaron Marcuse-Kubitza
03:33 PM Revision 1979: root map: Fill in defaults for inputs from VegBIEN, as well as outputs to it: Aaron Marcuse-Kubitza
02:59 PM Revision 1978: disown_all: Updated to use main function, local vars, $self, etc. like other bash scripts run using ".": Aaron Marcuse-Kubitza
02:55 PM Revision 1977: vegbien_dest: Fixed bug where it would give a usage error if run from a makefile rule, because the BASH_LINENO would be 0, by also checking if ${BASH_ARGV[0]} is ${BASH_SOURCE[0]}: Aaron Marcuse-Kubitza
02:28 PM Revision 1976: postgres_vegbien: Fixed bug where interpreter did not match vegbien_dest's new required interpreter of /bin/bash: Aaron Marcuse-Kubitza
02:23 PM Revision 1975: vegbien_dest: Changed interpreter to /bin/bash. Removed comment that it requires var bien_password.: Aaron Marcuse-Kubitza
02:20 PM Revision 1974: postgres_vegbien: Removed no longer needed retrieval of bien_password: Aaron Marcuse-Kubitza
02:20 PM Revision 1973: vegbien_dest: Get bien_password by searching relative to $self, which we now have a way to get in a bash script (${BASH_SOURCE[0]}), rather than requiring the caller to set it. Provide usage error if run without initial ".".: Aaron Marcuse-Kubitza
02:12 PM Revision 1972: input.Makefile: Staging tables: import/install-%: Use new quiet option to determine whether to tee output to terminal. Don't use log option because that's always set to true except in test mode, which doesn't apply to installs.: Aaron Marcuse-Kubitza
02:12 PM Revision 1971: input.Makefile: Staging tables: import/install-%: Use new quiet option to determine whether to tee output to terminal. Don't use log option because that's always set to true except in test mode, which doesn't apply to installs.: Aaron Marcuse-Kubitza
01:56 PM Revision 1970: main Makefile: PostgreSQL: Edit /etc/phppgadmin/apache.conf to replace "deny from all" with "allow from all", instead of uncommenting an "allow from all" that may not be there: Aaron Marcuse-Kubitza
01:35 PM Revision 1969: input.Makefile: Sources: Fixed bug where cat was defined before $(tables), by moving Sources after Existing maps discovery and putting just $(inputFiles) and $(dbExport) from Sources at the beginning of Existing maps discovery: Aaron Marcuse-Kubitza
01:05 PM Revision 1968: sql.py: Made truncate(), tables(), empty_db() schema-aware. Added qual_name(). tables(): Added option to filter tables by a LIKE pattern.: Aaron Marcuse-Kubitza
12:34 PM Revision 1967: main Makefile: VegBIEN DB: Install public schema in a separate step, so that it can be dropped without dropping the entire DB (which also contains staging tables that shouldn't be dropped when there is a schema change). Added schemas/install, schemas/uninstall, implicit schemas/reinstall to manage the public schema separately from the rest of the DB. Moved Subdir forwarding to the bottom so overridden targets are not forwarded. README.TXT: Since `make reinstall_db` would drop the entire DB, tell user to run new `make schemas/reinstall` instead to reinstall (main) DB from schema.: Aaron Marcuse-Kubitza
12:30 PM Revision 1966: schemas/postgresql.Mac.conf: Set unix_socket_directory to the new dir it seems to be using, which is now /tmp: Aaron Marcuse-Kubitza
11:43 AM Revision 1965: csv2db: Fixed bug where extra columns were not truncated in INSERT mode. Replace empty column names with the column # to avoid errors with CSVs that have trailing ","s, etc.: Aaron Marcuse-Kubitza
11:41 AM Revision 1964: streams.py: StreamIter: Define readline() as a separate method so it can be overridden, and all calls to self.next() will use the overridden readline(). This fixes a bug in ProgressInputStream where incremental counts would not be displayed and it would end with "not all input read" if the StreamIter interface was used instead of readline().: Aaron Marcuse-Kubitza
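
The StreamIter fix in r1964 relies on the iterator protocol calling an overridable readline(). A minimal sketch of that pattern follows; the class bodies are assumptions, not the code in streams.py, and CountingStream is a hypothetical stand-in for a subclass such as ProgressInputStream.

```python
import io


class StreamIter(object):
    """Iterate over a stream by calling self.readline(), so that a
    subclass overriding readline() also affects iteration."""

    def __init__(self, stream):
        self.stream = stream

    def readline(self):
        return self.stream.readline()

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()  # dispatches to the subclass override
        if line == '':
            raise StopIteration
        return line

    next = __next__  # Python 2 spelling, as used in the log entry


class CountingStream(StreamIter):
    """Hypothetical subclass that counts the lines read."""

    def __init__(self, stream):
        StreamIter.__init__(self, stream)
        self.line_count = 0

    def readline(self):
        line = StreamIter.readline(self)
        if line != '':
            self.line_count += 1
        return line
```

Because iteration goes through readline(), the count stays correct whether the caller loops over the object or calls readline() directly.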

04/23/2012

09:57 PM Revision 1963: csv2db: Fall back to manually inserting each row (autodetecting the encoding for each field) if COPY FROM doesn't work: Aaron Marcuse-Kubitza
09:56 PM Revision 1962: streams.py: FilterStream: Inherit from StreamIter so that all descendants automatically have StreamIter functionality: Aaron Marcuse-Kubitza
09:42 PM Revision 1961: sql.py: insert(): Support using the default value for columns designated with the special value sql.default: Aaron Marcuse-Kubitza
09:21 PM Revision 1960: sql.py: insert(): Support rows that are just a list of values, with no columns. Support already-escaped table names.: Aaron Marcuse-Kubitza
08:54 PM Revision 1959: strings.py: Added contains_any(): Aaron Marcuse-Kubitza
08:54 PM Revision 1958: csvs.py: reader_and_header(): Use make_reader(): Aaron Marcuse-Kubitza
08:07 PM Revision 1957: Added reinstall_all to reinstall all inputs at once: Aaron Marcuse-Kubitza
08:06 PM Revision 1956: with_all: Documented that it must be run from the root svn directory: Aaron Marcuse-Kubitza
08:05 PM Revision 1955: input.Makefile: Staging tables: import/install-%: Only install staging table if input contains only CSV sources. Changed $(isXml) to $(isCsv) (negated) everywhere because rules almost always only run something if input contains only CSV sources, rather than if input contains XML sources.: Aaron Marcuse-Kubitza
07:21 PM Revision 1954: input.Makefile: Staging tables: import/install-%: Output load status to log file if log option is set: Aaron Marcuse-Kubitza
07:00 PM Revision 1953: Scripts that are meant to be run in the calling shell: Fixed bug where running the script inside another script would make the script think it was being run as a program, and abort with a usage error: Aaron Marcuse-Kubitza
06:56 PM Revision 1952: Scripts that are meant to be run in the calling shell: Fixed bug where running the script as a program (without initial ".") wouldn't be able to call return in something that was not a function. Converted all code to a <script_name>_main method so that return would work properly again. Converted all variables to local variables.: Aaron Marcuse-Kubitza
06:38 PM Revision 1951: env_password: return instead of exit if password not yet stored, in case user is running it from a shell without the initial "-" argument. (This would be the case if the user is just testing out the script, instead of using a command that env_password directs them to run.): Aaron Marcuse-Kubitza
05:43 PM Revision 1950: env_password: Use ${BASH_SOURCE[0]} for $self and $self for $0. return instead of exit on usage error in case user is running it from a shell.: Aaron Marcuse-Kubitza
05:36 PM Revision 1949: stop_imports: Use ${BASH_SOURCE[0]} for $self and $self for $0: Aaron Marcuse-Kubitza
05:36 PM Revision 1948: import_all: Use new with_all. Use ${BASH_SOURCE[0]} for $self and $self for $0.: Aaron Marcuse-Kubitza
05:34 PM Revision 1947: Added with_all to run a make target on all inputs at once: Aaron Marcuse-Kubitza
05:05 PM Revision 1946: Made row #s 1-based to the user to match up with the staging table row #s: Aaron Marcuse-Kubitza
04:59 PM Revision 1945: bin/map: Fixed bug where limit passed to sql.select() was end instead of the # rows, causing extra rows to be fetched when start > 0. Documented that row #s start with 0.: Aaron Marcuse-Kubitza
04:19 PM Revision 1944: Removed no longer needed csv2ddl: Aaron Marcuse-Kubitza
04:19 PM Revision 1943: input.Makefile: Staging tables: import/install-%: Use new csv2db instead of csv2ddl/$(psqlAsBien), because it handles translating encodings properly: Aaron Marcuse-Kubitza
04:14 PM Revision 1942: Added csv2db to load a command's CSV output stream into a PostgreSQL table: Aaron Marcuse-Kubitza
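
The limit bug fixed in r1945 comes down to passing a row count, not an end index, as the LIMIT. A small sketch under assumptions: limit_clause() is a hypothetical helper and not part of sql.py, and row numbers are 0-based with an exclusive end.

```python
def limit_clause(start=0, end=None):
    # Build "LIMIT n OFFSET m" from a 0-based [start, end) row range.
    # The limit is the number of rows (end - start), not the end index.
    parts = []
    if end is not None:
        parts.append('LIMIT %d' % (end - start))
    if start:
        parts.append('OFFSET %d' % start)
    return ' '.join(parts)
```

Passing end directly as the limit would fetch start extra rows whenever start > 0, which matches the symptom described in the log entry.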

04/21/2012

09:32 PM Revision 1941: schemas/postgresql.Mac.conf: Set unix_socket_directory to the appropriate Mac OS X dir, since otherwise, the socket is apparently not created and `make reinstall_db` doesn't work: Aaron Marcuse-Kubitza
09:30 PM Revision 1940: main Makefile: VegBIEN DB: db: Set LC_COLLATE and LC_CTYPE explicitly, to make it easier to change them: Aaron Marcuse-Kubitza
09:29 PM Revision 1939: Added ProgressInputStream: Aaron Marcuse-Kubitza
09:28 PM Revision 1938: exc.py: print_ex(): Added plain option to leave out traceback: Aaron Marcuse-Kubitza
06:48 PM Revision 1937: main Makefile: VegBIEN DB: db: Use template0 to allow encodings other than UTF-8. Because template0 doesn't have plpgsql on PostgreSQL before 9.x, add "CREATE PROCEDURAL LANGUAGE plpgsql;" manually in schemas/vegbien.sql.make, and filter it back out on PostgreSQL after 9.x using db_dump_localize.: Aaron Marcuse-Kubitza
06:39 PM Revision 1936: PostgreSQL-MySQL.csv: Remove "CREATE PROCEDURAL LANGUAGE" statements: Aaron Marcuse-Kubitza
06:36 PM Revision 1935: Added db_dump_localize to translate a PostgreSQL DB dump for the local server's version: Aaron Marcuse-Kubitza
06:32 PM Revision 1934: Added db_dump_localize to translate a PostgreSQL DB dump for the local server's version: Aaron Marcuse-Kubitza
03:42 PM Revision 1933: vegbien_dest: Added option to override the prefix of the created vars: Aaron Marcuse-Kubitza
03:35 PM Revision 1932: schemas/vegbien.sql.make: Fixed bug where data sources' schemas were also exported by exporting only the public schema. Note that this also removes the "CREATE OR REPLACE PROCEDURAL LANGUAGE plpgsql" statement, so that it doesn't have to be filtered out with `grep -v`.: Aaron Marcuse-Kubitza
03:19 PM Revision 1931: input.Makefile: Use `$(catSrcs)|` instead of $(withCatSrcs) where possible: Aaron Marcuse-Kubitza
03:00 PM Revision 1930: sql.py: pkey(): Fixed bug where results were not being cached because the rows hadn't been explicitly fetched, by having DbConn.DbCursor.execute() fetch all rows if the rowcount is 0 and it's not an insert statement. DbConn.DbCursor: Made _is_insert an attribute rather than a method, which is set as soon as the query is known. Added consume_rows(). Moved Result retrieval section above Database connections because it's used by DbConn.: Aaron Marcuse-Kubitza
02:28 PM Revision 1929: sql.py: pkey(): Fixed bug where queries were not being cached. Use select() instead of run_query() so that caching is automatically turned on and table names are automatically escaped.: Aaron Marcuse-Kubitza
01:37 PM Revision 1928: streams.py: Added LineCountInputStream, which is faster than LineCountStream for input streams. Added InputStreamsOnlyException and raise it in all *InputStream classes' write() methods.: Aaron Marcuse-Kubitza
01:22 PM Revision 1927: sql.py: DbConn: For non-cacheable queries, use a plain cursor() instead of a DbCursor to avoid the overhead of saving the result and wrapping the cursor: Aaron Marcuse-Kubitza

04/20/2012

05:20 PM Revision 1926: Moved db_config_names from bin/map to sql.py so it can be used by other scripts as well: Aaron Marcuse-Kubitza
04:52 PM Revision 1925: csv2ddl: Also print a COPY FROM statement: Aaron Marcuse-Kubitza
04:47 PM Revision 1924: input.Makefile: Fixed bug where input type was considered to be different things if both $(inputFiles) and $(dbExport) are non-empty. Now, $(inputFiles) takes precedence so that the presence of any input files will cause a DB dump to be ignored. This ensures that a (slower) input DB is not used over a (faster) flat file.: Aaron Marcuse-Kubitza
04:21 PM Revision 1923: csvs.py: stream_info(): Added parse_header option. reader_and_header(): Use stream_info()'s new parse_header option.: Aaron Marcuse-Kubitza
03:53 PM Revision 1922: csv2ddl: Renamed schema name env var from datasrc to schema to reflect what it is, and to make the script general beyond importing inputs: Aaron Marcuse-Kubitza
03:32 PM Revision 1921: input.Makefile: Moved Installation, Staging tables after Existing maps discovery because they depend on it. Staging tables: Create a staging table for each table a map spreadsheet is available for. Put double quotes around the schema name so its case is preserved.: Aaron Marcuse-Kubitza
03:29 PM Revision 1920: Added csv2ddl to make a PostgreSQL CREATE TABLE statement from a CSV header: Aaron Marcuse-Kubitza
03:28 PM Revision 1919: sql.py: Input validation: Moved section after Database connections because some of its functions require a connection. Added esc_name_by_module() and esc_name_by_engine(), and use esc_name_by_module() in esc_name().: Aaron Marcuse-Kubitza
02:18 PM Revision 1918: input.Makefile: Installation: Create a schema for the datasource in VegBIEN as part of the installation process. This will be used to hold staging tables.: Aaron Marcuse-Kubitza
01:57 PM Revision 1917: input.Makefile: Changed install, uninstall to depend on src/install, src/uninstall targets, which in turn depend on db, rm_db. This will allow us to add additional install actions for all input types.: Aaron Marcuse-Kubitza

04/19/2012

07:17 PM Revision 1916: sql.py: DbConn: Cache the constructed CacheCursor itself, rather than the dict that's used to create it: Aaron Marcuse-Kubitza
07:06 PM Revision 1915: sql.py: pkey(): Changed to use the connection-wide caching mechanism rather than its own custom cache. DbConn.__getstate__(): Don't pickle the debug callback.: Aaron Marcuse-Kubitza
07:00 PM Revision 1914: sql.py: DbConn: Added is_cached(). run_query(): Use new DbConn.is_cached() to avoid creating a savepoint if the query is cached.: Aaron Marcuse-Kubitza
06:52 PM Revision 1913: sql.py: DbConn: Also cache cursor.description: Aaron Marcuse-Kubitza
06:50 PM Revision 1912: sql.py: DbConn: Cache query results as a dict subset of the cursor's key attributes, so that additional attributes can easily be cached by adding them to the subset list: Aaron Marcuse-Kubitza
06:48 PM Revision 1911: dicts.py: Added AttrsDictView: Aaron Marcuse-Kubitza
06:47 PM Revision 1910: util.py: NamedTuple.__iter__(): Removed unnecessary **attrs param: Aaron Marcuse-Kubitza
06:30 PM Revision 1909: sql.py: _query_lookup(): Fixed bug where params was cast to a tuple, even though it could also be a dict. index_cols(): Changed to use the connection-wide caching mechanism rather than its own custom cache.: Aaron Marcuse-Kubitza
06:28 PM Revision 1908: util.py: NamedTuple: Made it usable as a hashable dict (with string keys) by adding __iter__() and __getitem__(): Aaron Marcuse-Kubitza
06:27 PM Revision 1907: dicts.py: Added make_hashable(): Aaron Marcuse-Kubitza

04/17/2012

09:59 PM Revision 1906: sql.py: DbConn: Only cache exceptions for inserts since they are not idempotent, but an invalid insert will always be invalid. If a cached result is an exception, re-raise it in a separate method other than the constructor to ensure that the cursor object is still created, and that its query instance var is set.: Aaron Marcuse-Kubitza
09:11 PM Revision 1905: sql.py: insert(): Cache insert queries by default. This works because any DuplicateKeyException, etc. would be cached as well. This saves many inserts for rows that we already know are in the database.: Aaron Marcuse-Kubitza
09:06 PM Revision 1904: sql.py: DbConn.run_query(): Cache exceptions raised by queries as well: Aaron Marcuse-Kubitza
08:48 PM Revision 1903: sql.py: DbConn.run_query(): When debug logging, label queries with their cache status (hit/miss/non-cacheable): Aaron Marcuse-Kubitza
08:25 PM Revision 1902: sql.py: DbConn.run_query(): Also debug-log queries that produce exceptions: Aaron Marcuse-Kubitza
08:18 PM Revision 1901: sql.py: DbConn: Allow creator to provide a log function to call on debug messages, instead of using stderr directly: Aaron Marcuse-Kubitza
08:01 PM Revision 1900: bin/map: Pass debug mode to DbConn so that SQL query debugging works again: Aaron Marcuse-Kubitza
07:49 PM Revision 1899: sql.py: DbConn: DbCursor: Fixed bug where caching was always turned on, by passing the cacheable setting to it from run_query(). Turned caching back on (uncommented it) since it's now working.: Aaron Marcuse-Kubitza
07:21 PM Revision 1898: bin/map: map_rows()/map_table(): Pass kw_args to process_rows() so rows_start can be specified when using them. DB inputs: Skip the pre-start rows in the SQL query itself, so that they don't need to be iterated over by the cursor in the main loop.: Aaron Marcuse-Kubitza
07:07 PM Revision 1897: bin/map: Fixed bug introduced in r1718 where the row # would not be incremented if i < start, causing a semi-infinite loop that only ended when the input rows were exhausted. process_rows(): Added optional rows_start parameter to use if the input rows already have the pre-start rows skipped.: Aaron Marcuse-Kubitza
05:49 PM Revision 1896: input.Makefile: Sources: cat: Changed Usage message to use "--silent" make option: Aaron Marcuse-Kubitza
05:45 PM Revision 1895: input.Makefile: Sources: cat: Added Usage message with instructions for removing echoed make commands: Aaron Marcuse-Kubitza
05:17 PM Revision 1894: run_*query(): Fixed bug where INSERTs, etc. were cached by making callers (such as select()) explicitly turn on caching. DbConn.run_query(): Fixed bug where cur.mogrify() was not supported under MySQL by making the cache key a tuple of the unmogrified query and its params instead of the mogrified string query. CacheCursor: Store attributes of the original cursor that we use, such as query and rowcount.: Aaron Marcuse-Kubitza
04:38 PM Revision 1893: sql.py: Made row() and value() cache the result by fetching all rows before returning the first row: Aaron Marcuse-Kubitza
04:37 PM Revision 1892: iters.py: Added func_iter() and consume_iter(): Aaron Marcuse-Kubitza
04:11 PM Revision 1891: sql.py: Cache the results of queries (when all rows are read): Aaron Marcuse-Kubitza
03:48 PM Revision 1890: Proxy.py: Fixed infinite recursion bug by removing __setattr__() (which prevents the class and subclasses from storing instance variables using "self." syntax): Aaron Marcuse-Kubitza
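
The query caching described in r1891, r1894 and r1899 can be sketched roughly as follows. This is a minimal illustration, not the sql.py code: the CachingConn class, its attributes, and the use of sqlite3 are assumptions made for the example. It shows the two points those revisions make: the cache key is a tuple of the unmogrified query and its params, and only callers that explicitly pass cacheable=True (e.g. a select) get cached results, so INSERTs always run.

```python
import sqlite3

class CachingConn:
    """Hypothetical wrapper for illustration; not the project's DbConn."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}  # maps (query, params) -> list of rows
        self.hits = 0    # number of times a cached result was returned

    def run_query(self, query, params=(), cacheable=False):
        # Key on the raw query text plus its params, so no driver-specific
        # mogrify() call is needed to build the key.
        key = (query, tuple(params))
        if cacheable and key in self.cache:
            self.hits += 1
            return self.cache[key]
        rows = self.conn.execute(query, params).fetchall()
        if cacheable:
            # Only cache once all rows have been read.
            self.cache[key] = rows
        return rows
```

With this shape, running the same INSERT twice executes it twice, while running the same SELECT twice with cacheable=True touches the database only once.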

04/16/2012

10:19 PM Revision 1889: sql.py: DbConn: Added run_query(). run_raw_query(): Use new DbConn.run_query().: Aaron Marcuse-Kubitza
10:18 PM Revision 1888: Added Proxy.py: Aaron Marcuse-Kubitza
09:32 PM Revision 1887: parallel.py: MultiProducerPool: Added code to create a shared Namespace object, commented out. Updated share() doc comment to reflect that it will writably share the values as well.: Aaron Marcuse-Kubitza
08:49 PM Revision 1886: bin/map: Share locals() with the pool at various times to try to get as many unpicklable values into the shared vars as possible: Aaron Marcuse-Kubitza
08:45 PM Revision 1885: dicts.py: Turned id_dict() factory function into IdDict class. parallel.py: MultiProducerPool: Added share_vars(). main_loop(): Only consider the program to be done if the queue is empty *and* there are no running tasks.: Aaron Marcuse-Kubitza
08:00 PM Revision 1884: collection.py: rmap(): Treat only built-in sequences specially instead of iterables. Pass whether the value is a leaf to the func. Added option to only recurse up to a certain # of levels.: Aaron Marcuse-Kubitza
07:10 PM Revision 1883: Added lists.py: Aaron Marcuse-Kubitza
04:40 PM Revision 1882: collection.py: rmap(): Fixed bugs: Made it recursive. Use iters.is_iterable() instead of isinstance(value, list) to work on all iterables. Use value and not nonexistent var list_.: Aaron Marcuse-Kubitza
04:38 PM Revision 1881: iters.py: Added is_iterable(): Aaron Marcuse-Kubitza
04:11 PM Revision 1880: parallel.py: prepickle(): Pickle *all* objects in vars_id_dict_ by ID, not just unpicklable ones. This ensures that a DB connection created in the main process will be shared with subprocesses by reference (id()) instead of by value, so that each process can take advantage of e.g. shared caches in the connection object. Note that this may require some synchronization.: Aaron Marcuse-Kubitza
04:06 PM Revision 1879: parallel.py: MultiProducerPool.main_loop(): Got rid of no longer correct doc comment: Aaron Marcuse-Kubitza
04:05 PM Revision 1878: bin/map: Share on_error with the pool: Aaron Marcuse-Kubitza
04:05 PM Revision 1877: parallel.py: MultiProducerPool: Pickle objects by ID if they're accessible to the main_loop process. This should allow e.g. DB connections and pools to be pickled, if they were defined in the main process.: Aaron Marcuse-Kubitza

04/14/2012

09:31 PM Revision 1876: Added dicts.py with id_dict() and MergeDict: Aaron Marcuse-Kubitza
09:30 PM Revision 1875: Added collection.py with rmap(): Aaron Marcuse-Kubitza
07:38 PM Revision 1874: db_xml.py: put(): Moved pool.apply_async() from put_child() to put_(), and don't use lambdas because they can't be pickled: Aaron Marcuse-Kubitza
07:35 PM Revision 1873: parallel.py: MultiProducerPool.apply_async(): Prepickle all function args. Try pickling the args before the queue pickles them, to get better debugging output.: Aaron Marcuse-Kubitza
07:33 PM Revision 1872: sql.py: with_savepoint(): Use new rand.rand_int(): Aaron Marcuse-Kubitza
07:33 PM Revision 1871: rand.py: rand_int(): Fixed bug where newly-created objects did not have unique IDs because they were on the stack. So, we have to use random.randint() anyway.: Aaron Marcuse-Kubitza
07:27 PM Revision 1870: Added rand.py: Aaron Marcuse-Kubitza
06:56 PM Revision 1869: sql.py: DbConn: Made it picklable by establishing a connection on demand: Aaron Marcuse-Kubitza
06:54 PM Revision 1868: bin/map: Also consume asynchronous tasks before closing the DB connection (this is where most if not all tasks will be consumed): Aaron Marcuse-Kubitza
06:44 PM Revision 1867: Runnable.py: Made it picklable: Aaron Marcuse-Kubitza
06:44 PM Revision 1866: Added eval_.py: Aaron Marcuse-Kubitza
05:35 PM Revision 1865: Added Runnable: Aaron Marcuse-Kubitza
03:05 PM Revision 1864: db_xml.py: put(): Added parallel processing support for inserting children with fkeys to parent asynchronously: Aaron Marcuse-Kubitza
03:03 PM Revision 1863: parallel.py: Fixed bugs: Added self param to instance methods and inner classes where needed: Aaron Marcuse-Kubitza
02:32 PM Revision 1862: parallel.py: Changed to use multi-producer pool, which requires calling pool.main_loop(): Aaron Marcuse-Kubitza
01:04 PM Revision 1861: parallel.py: Pool: Added doc comment: Aaron Marcuse-Kubitza
01:03 PM Revision 1860: parallel.py: Pool: apply_async(): Return a result object like multiprocessing.Pool.apply_async(): Aaron Marcuse-Kubitza
12:53 PM Revision 1859: bin/map: Use new parallel.py for parallel processing: Aaron Marcuse-Kubitza
12:51 PM Revision 1858: Added parallel.py for parallel processing: Aaron Marcuse-Kubitza
12:37 PM Revision 1857: bin/map: Use dummy synchronous Pool implementation if not using parallel processing: Aaron Marcuse-Kubitza
12:18 PM Revision 1856: bin/map: Use multiprocessing instead of pp for parallel processing because it's easier to use (it uses the Python threading API and doesn't require providing all the functions a task calls). Allow the user to set the cpus option to use all system CPUs (needed because in test mode, the default is 0 CPUs to turn off parallel processing).: Aaron Marcuse-Kubitza
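
One common way to make a connection wrapper picklable by connecting on demand, as r1869 describes for DbConn, is sketched below. The LazyConn class, its path argument, and the use of sqlite3 are assumptions for illustration only; the actual DbConn may differ. The idea is that the object stores only the connection settings, opens the real connection on first use, and leaves the live connection out of its pickled state.

```python
import pickle
import sqlite3

class LazyConn:
    """Hypothetical example class; not the project's DbConn."""

    def __init__(self, path):
        self.path = path   # connection settings are plain, picklable data
        self._conn = None  # the live connection is created lazily

    @property
    def conn(self):
        # Establish the connection the first time it is needed.
        if self._conn is None:
            self._conn = sqlite3.connect(self.path)
        return self._conn

    def __getstate__(self):
        # Drop the live connection so only the settings are pickled;
        # the unpickled copy reconnects on its first use of .conn.
        state = self.__dict__.copy()
        state['_conn'] = None
        return state

def roundtrip(obj):
    """Pickle and unpickle obj, as a task queue would when passing it."""
    return pickle.loads(pickle.dumps(obj))
```

A copy produced by roundtrip() starts without a connection and opens its own when .conn is first accessed, so each process ends up with a separate connection.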

04/13/2012

04:41 PM Revision 1855: disown_all, stop_imports: Use /bin/bash instead of /bin/sh because array subscripting is used: Aaron Marcuse-Kubitza
04:38 PM Revision 1854: input.Makefile: Editing import: Use $(datasrc) instead of $(db) since $(db) is only set for DB-source inputs: Aaron Marcuse-Kubitza
04:31 PM Revision 1853: input.Makefile: Import: If profile is on and test mode is on, output formatted profile stats to stdout: Aaron Marcuse-Kubitza
03:00 PM Revision 1852: sql.py: index_cols(): Cache return values in db.index_cols: Aaron Marcuse-Kubitza
02:56 PM Revision 1851: bin/map: Don't import pp unless cpus != 0 because it's slow and doesn't need to happen if we're not using parallelization. cpus option defaults to 0 in test mode so tests run faster.: Aaron Marcuse-Kubitza
02:52 PM Revision 1850: sql.py: pkey(): Use pkeys cache from db object instead of parameter: Aaron Marcuse-Kubitza
02:44 PM Revision 1849: sql.py: Wrapped db connection inside an object that can also store the cache of the pkeys and index_cols: Aaron Marcuse-Kubitza
02:27 PM Revision 1848: bin/map: If cpus is 0, run without Parallel Python: Aaron Marcuse-Kubitza
02:19 PM Revision 1847: bin/map: Set up Parallel Python with an env-var-customizable # CPUs: Aaron Marcuse-Kubitza
02:18 PM Revision 1846: bin/map: Set up Parallel Python with an env-var-customizable # CPUs: Aaron Marcuse-Kubitza
12:58 PM Revision 1845: root Makefile: python-Linux: Added `sudo pip install pp`: Aaron Marcuse-Kubitza
12:47 PM Revision 1844: root Makefile: python-Linux: Added python-parallel to installs: Aaron Marcuse-Kubitza
12:19 PM Revision 1843: mappings: Build VegX-VegBIEN.organisms.csv from VegX-VegBIEN.stems.csv instead of vice versa. This entails switching the roots around so stem points to organism instead of the other way around, which is a complex operation. Re-rooted VegX-VegBIEN.organisms.csv at /plantobservation instead of /taxonoccurrence to avoid traveling up the hierarchy to taxonoccurrence and back down again to plantobservation, etc. as would otherwise have been the case.: Aaron Marcuse-Kubitza
11:43 AM Revision 1842: bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it: Aaron Marcuse-Kubitza
10:45 AM Revision 1841: bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it: Aaron Marcuse-Kubitza
10:44 AM Revision 1840: xpath.py: get(): forward (parent-to-child) pointers: If last target object exists but doesn't have an ID attr (which indicates a bug), recover gracefully by just assuming the ID is 0. (Any bug will be noticeable in the output, which needs to be generated through workarounds like this in order to be able to debug.): Aaron Marcuse-Kubitza

04/10/2012

05:18 PM Revision 1839: VegX mappings: Updated stemParent mapping for VegX 1.5.3: Aaron Marcuse-Kubitza
04:54 PM Revision 1838: VegX mappings: Changed taxonDetermination of role identifier to instead have explicitly no role, because data providers' VegX files generally do not provide role information and we don't want the default taxonDetermination XPaths to require this: Aaron Marcuse-Kubitza
04:34 PM Revision 1837: inputs/CTFS/maps/VegX.organisms.csv: Connected plot to plotObservation by using new support for backward (child-to-parent) pointers whose target is a text element containing an ID: Aaron Marcuse-Kubitza
04:33 PM Revision 1836: xml_dom.py: get_id(): If the node doesn't have an ID, assume the node itself is the ID. This enables backward (child-to-parent) pointers whose target is a text element containing an ID, rather than a regular element with an ID attribute.: Aaron Marcuse-Kubitza
04:04 PM Revision 1835: VegX mappings: Map locationevent.sourceaccessioncode to plotUniqueIdentifier since this field is no longer being used by authorlocationcode: Aaron Marcuse-Kubitza
03:48 PM Revision 1834: VegX mappings: Map the authorlocationcode to plotName instead of plotUniqueIdentifier because it's a better fit: Aaron Marcuse-Kubitza
03:13 PM Revision 1833: inputs/CTFS/maps/VegX.organisms.csv: Fixed bug in Species taxonConcept mapping where the role was computer instead of identifier: Aaron Marcuse-Kubitza
03:11 PM Revision 1832: xml_dom.py: value(): Skip comment nodes. This fixes a bug where comments inside text elements would prevent the value from being retrieved.: Aaron Marcuse-Kubitza
03:02 PM Revision 1831: inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>: Aaron Marcuse-Kubitza
02:16 PM Revision 1830: inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>: Aaron Marcuse-Kubitza
01:59 PM Revision 1829: inputs/CTFS/maps/VegX.organisms.csv: Added taxonConcept mappings: Aaron Marcuse-Kubitza
01:59 PM Revision 1828: mappings/VegX-VegBIEN.organisms.csv: Added species taxonConcept mapping for identifier role: Aaron Marcuse-Kubitza
01:33 PM Revision 1827: Added expand_xpath to expand XPath abbreviations: Aaron Marcuse-Kubitza
12:43 PM Revision 1826: VegX mappings: Renamed taxonNameUsageConceptsID to taxonNameUsageConceptID (no plural) to match VegX 1.5.3: Aaron Marcuse-Kubitza
12:33 PM Revision 1825: inputs/CTFS/maps/VegX.organisms.csv: Corrected CensusNumber input mapping: Aaron Marcuse-Kubitza
12:24 PM Revision 1824: mappings/Makefile: Generate self maps for all core maps: Aaron Marcuse-Kubitza
12:19 PM Revision 1823: mappings/Makefile: VegX-VegBIEN.stems.csv: Removed $(rootAttrs) from out root because stems don't use tcs namespace elements (stems don't have taxonDeterminations separate from the main organism): Aaron Marcuse-Kubitza
12:13 PM Revision 1822: VegX mappings: taxonConcept mappings: Added "tcs:" namespace prefix to appropriate elements. This will make the taxonConcept XPaths compatible with CTFS VegX.: Aaron Marcuse-Kubitza

04/09/2012

06:52 PM Revision 1821: input.Makefile: Vars/functions: Make: $(subMake): When forwarding to another dir based off of $(root), forward to $(root) rather than directly to the dir of the target. This ensures that any special targets that are only defined in the root Makefile still get run, even when the target is in a subdir with its own Makefile.: Aaron Marcuse-Kubitza
06:41 PM Revision 1820: inputs/CTFS/test: Accepted initial test outputs. A lot of leaves are still unmapped with the default mappings.: Aaron Marcuse-Kubitza
06:40 PM Revision 1819: inputs/CTFS/maps: Added initial maps: Aaron Marcuse-Kubitza
06:39 PM Revision 1818: VegX mappings: taxonConcept mappings: Added "tcs:" namespace prefix to appropriate elements. This will make the taxonConcept XPaths compatible with CTFS VegX.: Aaron Marcuse-Kubitza
06:13 PM Revision 1817: input.Makefile: Maps building: full via maps (maps/$(via).%.full.csv): $(makeFullCsv): Sort all maps so that rows are re-ordered whether or not a core self map exists. This way, if a core self map is created, it will not cause the sort order of the generated via-format XMLs to change. This makes it easier to accept any changes to test outputs that result from adding a core self map.: Aaron Marcuse-Kubitza
05:53 PM Revision 1816: mappings/Makefile: VegX: Added VegX.self.organisms.csv. Added root attrs to chRoot maps, commented out since it's not ready to be checked in yet.: Aaron Marcuse-Kubitza
05:34 PM Revision 1815: xpath.py: get(): Run xml_dom.by_tag_name() with ignore_namespace=False (possibly later set to True): Aaron Marcuse-Kubitza
05:32 PM Revision 1814: xml_dom.py: Comments: Added clean_comment() and mk_comment(). Searching child nodes: by_tag_name(): Added ignore_namespace option to ignore namespace of node name.: Aaron Marcuse-Kubitza
05:26 PM Revision 1813: root Makefile: Added %-remake target: Aaron Marcuse-Kubitza
04:53 PM Revision 1812: mappings/Makefile: Renamed joinMaps to dwcMaps and chrootMaps to vegxMaps. Added commented-out code to create VegX.self.organisms.csv (not ready to check in yet because it affects many dependent maps).: Aaron Marcuse-Kubitza
02:52 PM Revision 1811: input.Makefile: Removed no longer needed $(noEmptyMap): Aaron Marcuse-Kubitza
12:40 PM Revision 1810: xml_func.py: process(): Use new xml_dom.mk_comment(): Aaron Marcuse-Kubitza
12:40 PM Revision 1809: xml_dom.py: Added clean_comment() and mk_comment() to properly sanitize comment contents (comments can't contain '--'): Aaron Marcuse-Kubitza
12:14 PM Revision 1808: Added inputs/TRTE: Aaron Marcuse-Kubitza
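
The constraint behind r1809 (XML comments can't contain '--') can be illustrated with the sketch below. The function names match those in the log, but the bodies are assumptions written for this example using the standard xml.dom.minidom module; they are not the xml_dom.py implementation. This version collapses runs of hyphens and also guards against a trailing hyphen, which would otherwise run into the closing '-->'.

```python
import re
from xml.dom import minidom

def clean_comment(text):
    """Return text made safe for use as XML comment contents (assumed behavior)."""
    # Collapse every run of two or more hyphens into a single hyphen.
    text = re.sub(r'-{2,}', '-', text)
    # A comment may not end with '-', since that would produce '--->'.
    if text.endswith('-'):
        text += ' '
    return text

def mk_comment(doc, text):
    """Create a comment node in doc with sanitized contents."""
    return doc.createComment(clean_comment(text))
```

Without the cleaning step, minidom raises ValueError when serializing a comment whose data contains '--'.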
