bin/map: Added prefixes support for XML inputs
digir_client: Filter by darwin:Kingdom=PLANTAE because presumably all records will have this. Don't debug-print URL.
Added initial bin/digir_client
Renamed timeout.py to timeouts.py. Renamed timeout_ vars to timeout.
opts.py: get_env_var(): default defaults to None
inputs/SpeciesLink: Accepted test outputs for new TAPIR download
bin/tapir/tapir2flat.php: Output to specieslink.specimens.csv instead of specieslink.txt so that the output file can be used right away without renaming
inputs/REMIB/src/nodes.make: Stop after a configurable # of empty responses (indicating no more nodes), instead of at a preset node ID, because there seem to be many more nodes than are listed on the web form
input.Makefile: import/rotate: Add "." before the date
input.Makefile: Added targets for editing import: import/rotate, import/rm
bin/tapir/tapir2flat.php: Fixed XML parsing to strip control chars so DOMDocument::loadXML() wouldn't complain about "PCDATA invalid Char value 8 in Entity", etc.
main Makefile: php-Darwin: Added instruction to set PHPRC if needed
Added inputs/SpeciesLink/src/tapir.make
input.Makefile: `src/%: src/%.make`: Don't tee recipe's stderr to make's stderr, because long-running make_scripts usually will be tracked using `tail -f`
input.Makefile: `src/%: src/%.make`: Name the log file using the make_script name instead of the output file name
cat_csv: If dialect == None, ignore that file because it's empty
csvs.py: stream_info(): If header_line == '', set dialect to None rather than trying (and failing) to auto-detect it
input.Makefile: Use new sort_filenames to put multiple numbered sources in the correct order, dealing correctly with embedded numbers that aren't padded with leading zeros
Added sort_filenames to sort a list of filenames, comparing embedded numbers numerically instead of lexicographically
schemas/postgresql.conf: Decreased shared_buffers again because 4000MB wasn't far enough below the 4GB SHMMAX
schemas/postgresql.conf: Expressed shared_buffers in MB, since decimal GB doesn't seem to work anymore on 9.1
schemas/postgresql.conf: Decreased shared_buffers to 3.9GB, slightly less than SHMMAX
schemas/postgresql.conf: Optimized again using same changes as were applied to 8.4 version
schemas/postgresql.conf: Replaced with original 9.1 version
schemas/postgresql.conf: Optimized using settings analogous to those in postgresql.nimoy.conf
inputs/REMIB/src/nodes.make: Don't abort entire import on empty response, because an empty response is also returned for nodes that are temporarily down, not just nodes that don't exist (assumed to be after the highest numbered node). Instead, stop import after 150 nodes if user did not specify an explicit # nodes.
inputs/REMIB/src/nodes.make: Abort prefix on empty response using break, rather than just done = True, to avoid running any more code except the finally block. Moved metadata row validation outside metadata row retrieval try-except block.
inputs/REMIB/src/nodes.make: If a read times out, abort the entire node rather than just the prefix to avoid waiting 20 sec for each of 26*26 prefixes
profiling.py ItersProfiler, exc.py ExPercentTracker: Only output fraction of rows with errors if self.iter_ct > 0, to avoid divide-by-zero error
inputs/REMIB/src/nodes.make: Fixed bug where row count was output in the middle of the row processing code, instead of after the first row is processed and the row count incremented. This removes "Processed 0 row(s)" messages at the beginning of every prefix.
inputs/REMIB/src/nodes.make: Support custom starting node ID and # nodes processed via env vars
Renamed inputs/REMIB/src/nodes.all.0.header.specimens.csv to node.0.header.specimens.csv so it would sort correctly with the new output file names
Renamed inputs/REMIB/src/nodes.all.specimens.csv.make to inputs/REMIB/src/nodes.make since it will not be used to generate nodes.all.specimens.csv. However, it can still be used with the `src/%.make` make target, but will generate a dummy empty output file "nodes".
inputs/REMIB/src/nodes.all.specimens.csv.make: Write each node to a separate output file
inputs/REMIB/src/nodes.all.specimens.csv.make: Raise InputException instead of AssertionError if invalid metadata row, so that it will be caught and printed instead of aborting the program
inputs/REMIB/src/nodes.all.specimens.csv.make: Moved header reading code inside TimeoutException try-except block since read sometimes times out before the header is even read
schemas/postgresql.nimoy.conf: Increased shared_buffers to 1.5GB since kernel.shmmax has been increased to 2GB
Renamed inputs/REMIB/src/remib_raw.0.header.specimens.txt to nodes.all.0.header.specimens.csv
inputs/REMIB/src/nodes.all.specimens.csv.make: Increased read timeout
inputs/REMIB/src/nodes.all.specimens.csv.make: Time out stuck reads because sometimes nodes are offline, etc.
exc.py: str_(): Strip trailing whitespace. print_ex(): Since str_() now strips trailing whitespace, strings.ensure_newl() is no longer necessary.
streams.py: Added TimeoutInputStream and WrapStream. Changed StreamIter to use new WrapStream.
Added timeout.py
inputs/REMIB/src/nodes.all.specimens.csv.make: Download from all prefixes of all nodes. Stop when a node produces an empty response (not even an error), which indicates no more nodes. Changed status messages.
input.Makefile: `src/%: src/%.make`: Append stderr to log file
Added inputs/REMIB/src/nodes.all.specimens.csv.make to download REMIB data for all nodes
Added streams.py for I/O, which contains StreamIter, TracedOutputStream, and LineCountOutputStream
term.py: Added clear_line. Corrected file comment.
Makefiles: Let subdir's Makefile decide whether to delete on error
input.Makefile: Save partial outputs of aborted src make scripts
input.Makefile: Fixed bug in `%: %.make` rule to use $< instead of $*
mappings/DwC2-VegBIEN.specimens.csv: minimumElevationInMeters: Remove any "ca." prefix
xml_func.py: _replace: Strip whitespace from the returned string
csvs.py: Added TsvReader to support TSV quirks. Added reader_class(). reader_and_header(): Use reader_class() to automatically use TsvReader instead of csv.reader for TSVs. Added is_tsv() and use it where `dialect.delimiter == '\t'` was used.
strings.py: Added extract_line_ending() and remove_line_ending(). ensure_newl(): Use new remove_line_ending(). Moved Parsing section to top since it is used by the other sections.
csvs.py: stream_info(): Set dialect.quoting = csv.QUOTE_NONE for TSVs because they usually don't quote fields. Factored dialect detecting code into new function sniff().
input.Makefile: verify: Added reverify option, which can be turned off to prevent regenerating the verify/%.out file from the DB (which can be time-consuming), and instead just diff verify/%.out with verify/%.ref
count_error_rows: Allow input to be specified as last arg(s) in addition to as stdin
exc.py: ExPercentTracker: When displaying fraction of iters that had errors, don't duplicate the iter_text ("row", etc.) in the numerator
bin/map: Use new ExPercentTracker iter_num tracking to track distinct row #s with errors
exc.py: ExPercentTracker: Track iter_nums of Exceptions as well, to distinguish how many distinct iters had errors
Added bin/count_error_rows to count distinct rows with errors in `map` error messages
input.Makefile: Changed "%.out: %.make" rule to "%: %.make" so that any file can be built from a corresponding .make file. This will allow flat files to be retrieved dynamically by running an associated .make file.
xml_func.py: FormatException: Inherit from ExceptionWithCause instead of SyntaxError because a FormatException signals a different kind of error condition (related to the input value rather than the function syntax)
xml_func.py: Renamed SyntaxException to SyntaxError because it's a user error signaling invalid mappings syntax
xml_func.py: SyntaxException: Use ExceptionWithCause to combine msg and cause's msg because it now combines them on one line, which is needed for bin/error_stats to work properly
exc.py: ExceptionWithCause: Prepend msg to cause's msg separated by ': ' instead of '\ncause: '
xml_func.py: Changed SyntaxException to FormatException where the error was with the input data format rather than the mapping syntax
mappings/VegX-VegBIEN.organisms.csv: slopeaspect: Apply new conversion _compass
xml_func.py: Added _compass to convert a compass direction (N, NE, NNE, etc.) into a degree heading
Added angles.py
inputs/SpeciesLink/maps: Updated to use new TAPIR download
input.Makefile: All targets can be specified with an optional trailing slash. This enables using tab completion to complete a target name which is also a subdir name, since tab completion appends a trailing slash.
bin/tapir/tapir2flat.php: Fixed bug in row assembly where XML elements that weren't found were left out of the array, causing the columns to shift to the left
xml_func.py: _map: Factored replacing code out into new function repl(), which can also be used by other XML funcs
bin/tapir/tapir2flat.php: Turned off exiting after 3 successive failures, because it causes the import to abort and it doesn't seem to restart where it left off
main Makefile: Added instructions to install PHP PEAR and HTTP_Request on Mac OS X
Makefile: Added PHP section, which installs php-http-request
Moved _archive/tapir2flatClient/trunk/client/ to bin/tapir/
_archive/tapir2flatClient/trunk/client/tapir2flat.php: Upgraded to use fputcsv(). This should fix errors caused by embedded delimiters. configurableParams.php: Set default delimiter to ','.
mappings/verify.specimens.sql: # species: Don't join at all on genus because DISTINCT is on the plantname_id rather than the plantname, which is already unique for a given genus because plantname_unique includes parent_id
mappings/verify.specimens.sql: # species: Fixed to join separately on plantname_ancestor for genus and species
input.Makefile: Moved log and trace files to new import subdir. Moved subdir-adding code from inputs/Makefile to input.Makefile.
mappings/verify.specimens.sql: Updated for schema changes
inputs/*: Added any missing standard subdirs
inputs/Makefile: Added %/-add to re-add existing dirs
inputs/Makefile: %-add: `svn mkdir` the datasource's standard subdirs
schemas/postgresql.nimoy.conf: Increased work_mem (for sorting) and maintenance_work_mem (for vacuum)
schemas/postgresql.nimoy.conf: Reset shared_buffers to initial value 24MB because although kernel.shmmax is 32MB, only values up to 26MB seem to work
schemas/postgresql.nimoy.conf: Set shared_buffers to SHMMAX
Optimized schemas/postgresql.nimoy.conf
Added schemas/postgresql.nimoy.conf
bin/map: When profiling, print the profile_to destination file
Added schemas/postgresql.conf
xml_func.py: _date: When converting month name to number, wrap any ValueError in a SyntaxException
xml_func.py: XML functions that assume their last argument is a value (_map, etc.): Use new helper function pop_value() to retrieve this value. Return None if value is None because this indicates the input is empty.
xml_func.py: _date: Use format.str2int instead of int to convert date parts to int so that strange formatting will be parsed correctly
format.py: clean_numeric(): Also fix some OCR errors
filter_errors: Default to outputting only the first match
xpath.py: Added append() to recursively append subpath to every leaf of a path tree. parse(): Use append() to fix bug in split path parsing where subpath was not added to every leaf of the tree, only the main leaf of the main branch and the main leaves of the other branches of the last element.
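
The sort_filenames entry above describes comparing embedded numbers numerically instead of lexicographically. The actual script is not shown in this log, so the following is only a minimal sketch of that technique; the helper name `natural_key` and the function signature are assumptions made for illustration, not the repository's code.

```python
import re

def natural_key(filename):
    """Build a sort key that compares digit runs as integers (hypothetical helper)."""
    # Splitting on a captured group keeps the digit runs in the result, so the
    # parts always alternate: text, digits, text, digits, ...
    parts = re.split(r'([0-9]+)', filename)
    # Odd indexes hold the captured digit runs; convert only those to int.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]

def sort_filenames(filenames):
    """Return the filenames sorted so that 'node.2' comes before 'node.10'."""
    return sorted(filenames, key=natural_key)

if __name__ == '__main__':
    for name in sort_filenames(['node.10.specimens.csv', 'node.2.specimens.csv']):
        print(name)
```

Because the text and number parts always alternate in the same positions, two keys only ever compare text with text and integers with integers, which avoids mixed-type comparison errors in Python 3.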