# Date Author Comment
1718 04/02/2012 08:05 AM Aaron Marcuse-Kubitza

bin/map: process_rows(): When iterating over each row, only retrieve the next row if the end (limit of # of rows) has not been reached. This prevents the next row from being fetched, possibly causing an entire additional consecutive XML document to be parsed, if the limit has already been reached. This is primarily useful for XML inputs with a ".0.top" segment prepended before the other documents, which contains just the first two nodes for fast parsing of this smaller XML document when only the first two nodes are needed for testing. Without this fix, the ".0.top" segment would have needed to contain the first three nodes instead.
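
The loop shape described above might look roughly like the following Python 3 sketch. It assumes rows come from a lazy iterator; `process_rows_sketch`, `handle`, and `limit` are illustrative names, not the actual bin/map code. The limit is checked before `next()` is called, so no row beyond the limit is ever fetched.

```python
def process_rows_sketch(rows, handle, limit=None):
    """Process up to `limit` rows; never fetch a row past the limit."""
    rows = iter(rows)
    count = 0
    # Test the limit *before* asking the iterator for another row, so that a
    # lazy source (e.g. a parser for the next XML document) is not triggered.
    while limit is None or count < limit:
        try:
            row = next(rows)
        except StopIteration:
            break
        handle(row)
        count += 1
    return count
```

With `limit=2`, a source that parses lazily is asked for exactly two rows, which matches the motivation for keeping only two nodes in the ".0.top" segment.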

1717 04/02/2012 07:55 AM Aaron Marcuse-Kubitza

inputs/XAL: Accepted initial test outputs

1716 04/02/2012 07:54 AM Aaron Marcuse-Kubitza

inputs/XAL: Added maps

1715 04/02/2012 07:52 AM Aaron Marcuse-Kubitza

bin/map: Extended consecutive XML document support to direct-XML inputs (without a map spreadsheet). Factored out consecutive XML document row-iteration code into helper method get_rows() which does the iters.flatten() and itertools.imap() calls.

1714 04/02/2012 07:37 AM Aaron Marcuse-Kubitza

bin/map: Fixed bug in iteration over consecutive XML documents where only the first element of the first document was processed. Use of iters.flatten() and itertools.imap() fixes this problem so that the consecutive XML documents are regarded as a continuous stream of rows.
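
A hedged sketch of the "continuous stream of rows" idea, written for Python 3. The project's iters.flatten() is not shown in this log, so `itertools.chain.from_iterable` stands in for it here, and the built-in lazy `map` replaces Python 2's `itertools.imap`. The names `stream_rows` and `rows_of` are hypothetical.

```python
from itertools import chain

def stream_rows(docs, rows_of):
    """Lazily yield every row of every document, in order.

    `docs` is an iterable of parsed documents and `rows_of` returns the
    iterable of rows for one document. Nothing is consumed until iterated.
    """
    return chain.from_iterable(map(rows_of, docs))
```

Because both `map` and `chain.from_iterable` are lazy, a later document is only touched once all rows of the earlier ones have been consumed.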

1713 04/02/2012 07:16 AM Aaron Marcuse-Kubitza

bin/map: Use new xml_parse.docs_iter() to iterate over each consecutive XML document in stdin

1712 04/02/2012 07:16 AM Aaron Marcuse-Kubitza

xml_parse.py: Added support for parsing consecutive XML documents in a stream
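
One possible way to split a stream into consecutive XML documents is sketched below in Python 3. This is not the xml_parse.py implementation, which this log does not show. It assumes each document begins with an `<?xml` declaration at the start of a line (compare r1692, which makes digir_client end its output with a newline), and `split_docs` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET

def split_docs(stream):
    """Yield the root element of each XML document in a line-based stream.

    Assumption: every document starts with an "<?xml" declaration at the
    beginning of a line.
    """
    buf = []
    for line in stream:
        if line.startswith("<?xml"):
            text = "".join(buf)
            if text.strip():  # skip blank material before a declaration
                yield ET.fromstring(text)
            buf = []
        buf.append(line)
    text = "".join(buf)
    if text.strip():
        yield ET.fromstring(text)
```

The generator parses one document at a time, so a consumer that stops early never pays for parsing the documents that follow.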

1711 04/02/2012 07:01 AM Aaron Marcuse-Kubitza

Added iters.py

1710 03/29/2012 10:33 PM Aaron Marcuse-Kubitza

streams.py: Added FilterStream. Changed TracedStream to use FilterStream.

1709 03/29/2012 10:24 PM Aaron Marcuse-Kubitza

Moved parse_str() from xml_dom.py to xml_parse.py

1708 03/29/2012 10:24 PM Aaron Marcuse-Kubitza

Added xml_parse.py

1707 03/29/2012 10:21 PM Aaron Marcuse-Kubitza

streams.py: CaptureStream: Ignore start_str when recording and end_str when not recording

1706 03/29/2012 10:13 PM Aaron Marcuse-Kubitza

streams.py: CaptureStream: Get each match as a separate array elem instead of concatenated together

1705 03/29/2012 09:59 PM Aaron Marcuse-Kubitza

ch_root, repl, map: Use new maps.col_info() instead of parsing col name manually. This allows maps with prefixes containing ":" to be supported, without the ":" being misinterpreted as the label-root separator.

1704 03/29/2012 09:57 PM Aaron Marcuse-Kubitza

maps.py: Added col_info() to get label, root, prefixes from col_name. Added col_formats() for use by combinable(). Use new col_formats() in combinable(). Removed no longer needed col_label().

1703 03/29/2012 09:55 PM Aaron Marcuse-Kubitza

input.Makefile: Use with_cat instead of with_cat_csv for XML sources

1702 03/29/2012 09:54 PM Aaron Marcuse-Kubitza

Renamed inputs/XAL/src/digir.xml.make to digir.specimens.xml.make so it would generate an output file with the proper table name

1701 03/29/2012 08:53 PM Aaron Marcuse-Kubitza

bin/map: Support concatenated XML documents for XML inputs

1700 03/29/2012 08:46 PM Aaron Marcuse-Kubitza

bin/map: Merged XML inputs with and without a map into the in_is_xml section

1699 03/29/2012 08:33 PM Aaron Marcuse-Kubitza

digir_client: Output profiling information

1698 03/29/2012 08:21 PM Aaron Marcuse-Kubitza

Added inputs/XAL/src/digir.xml.make

1697 03/29/2012 08:21 PM Aaron Marcuse-Kubitza

digir_client: Import http to take advantage of httplib modifications to deal with IncompleteRead errors

1696 03/29/2012 08:20 PM Aaron Marcuse-Kubitza

Added http.py with httplib modifications to deal with IncompleteRead errors

1695 03/29/2012 07:46 PM Aaron Marcuse-Kubitza

digir_client: Fixed bug where chunk size was being adjusted even if count == None (indicating no determinable last chunk), causing a type mismatch between None and the integer total
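
The chunk-size rule that this fix and r1691 describe could be sketched as follows, in Python 3. The function and parameter names are hypothetical, and `count` is assumed to be the total number of records wanted, or None when no total is known.

```python
def next_chunk_size(chunk_size, start, count=None):
    """Return how many records to request for the chunk beginning at `start`."""
    if count is None:
        # No known total: always request a full chunk and never do
        # arithmetic on None.
        return chunk_size
    # Known total: shrink the last chunk so it does not overshoot `count`.
    return max(0, min(chunk_size, count - start))
```

For example, with a chunk size of 100 and a total of 250, the chunk starting at 200 requests only 50 records.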

1694 03/29/2012 07:28 PM Aaron Marcuse-Kubitza

input.Makefile: Removed no longer needed "ifneq ($(wildcard test/),)" guard around Testing section because all inputs now have a test subdir

1693 03/29/2012 07:25 PM Aaron Marcuse-Kubitza

Added inputs/XAL

1692 03/29/2012 07:22 PM Aaron Marcuse-Kubitza

digir_client: Made chunk_size a configurable env var. Removed schema env var because schema is always the same for DiGIR (can be different for TAPIR). Make sure output ends in a newline so that consecutive XML documents are on different lines.

1691 03/29/2012 07:13 PM Aaron Marcuse-Kubitza

digir_client: Fixed bug where chunk_size records would always be retrieved even in the last chunk, which ignored any manual count the user might have set via the "n" option

1690 03/29/2012 07:07 PM Aaron Marcuse-Kubitza

digir_client: Repeatedly retrieve data in chunks. Provide match count. Added section comments.

1689 03/29/2012 06:52 PM Aaron Marcuse-Kubitza

xpath.py: Added get_value(), which runs get_1() and returns the value of any result node

1688 03/29/2012 06:51 PM Aaron Marcuse-Kubitza

xml_dom.py: Added parse_str()

1687 03/29/2012 06:13 PM Aaron Marcuse-Kubitza

digir_client: Use new streams.copy() to copy returned data to stdout

1686 03/29/2012 06:13 PM Aaron Marcuse-Kubitza

streams.py: Added copy(). Added section comment for traced streams.

1685 03/29/2012 06:06 PM Aaron Marcuse-Kubitza

digir_client: Label debugging output

1684 03/29/2012 05:54 PM Aaron Marcuse-Kubitza

streams.py: Renamed LineCountOutputStream to LineCountStream since TracedStream now works on both input and output streams

1683 03/29/2012 05:52 PM Aaron Marcuse-Kubitza

digir_client: Capture diagnostics for later use in determining next start/count values

1682 03/29/2012 05:51 PM Aaron Marcuse-Kubitza

streams.py: Added CaptureStream to wrap a stream, capturing matching text. Renamed TracedOutputStream to TracedStream and made it work on both input and output streams. Made TracedStream inherit from WrapStream so that close() would be forwarded properly.

1681 03/29/2012 05:23 PM Aaron Marcuse-Kubitza

bin/map: Changed XML input prefix handling to prepend prefix directly to XPath instead of separating it from the XPath with a "/". Changed get_with_prefix() to use new strings.with_prefixes().

1680 03/29/2012 05:21 PM Aaron Marcuse-Kubitza

strings.py: Added with_prefixes()

1679 03/29/2012 04:56 PM Aaron Marcuse-Kubitza

digir_client: Made schema customizable

1678 03/29/2012 04:35 PM Aaron Marcuse-Kubitza

digir_client: Set header sendTime, source dynamically. In debug mode, print the request XML.

1677 03/29/2012 04:03 PM Aaron Marcuse-Kubitza

Added local_ip to get local IP address

1676 03/29/2012 03:48 PM Aaron Marcuse-Kubitza

bin/map: Added prefixes support for XML inputs

1675 03/28/2012 11:12 PM Aaron Marcuse-Kubitza

digir_client: Filter by darwin:Kingdom=PLANTAE because presumably all records will have this. Don't debug-print URL.

1674 03/28/2012 11:07 PM Aaron Marcuse-Kubitza

Added initial bin/digir_client

1673 03/28/2012 07:58 PM Aaron Marcuse-Kubitza

Renamed timeout.py to timeouts.py. Renamed timeout_ vars to timeout.

1672 03/28/2012 07:52 PM Aaron Marcuse-Kubitza

opts.py: get_env_var(): default defaults to None

1671 03/28/2012 06:35 PM Aaron Marcuse-Kubitza

inputs/SpeciesLink: Accepted test outputs for new TAPIR download

1670 03/28/2012 06:03 PM Aaron Marcuse-Kubitza

bin/tapir/tapir2flat.php: Output to specieslink.specimens.csv instead of specieslink.txt so that the output file can be used right away without renaming

1669 03/28/2012 05:52 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.make: Stop after a configurable # of empty responses (indicating no more nodes), instead of at a preset node ID, because there seem to be many more nodes than are listed on the web form

1668 03/27/2012 11:10 PM Aaron Marcuse-Kubitza

input.Makefile: import/rotate: Add "." before the date

1667 03/27/2012 11:08 PM Aaron Marcuse-Kubitza

input.Makefile: Added targets for editing import: import/rotate, import/rm

1666 03/27/2012 09:41 PM Aaron Marcuse-Kubitza

bin/tapir/tapir2flat.php: Fixed XML parsing to strip control chars so DOMDocument::loadXML() wouldn't complain about "PCDATA invalid Char value 8 in Entity", etc.
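
The same idea can be illustrated in Python (the actual fix is in PHP and is not reproduced here). The sketch below removes the C0 control characters that XML 1.0 forbids, keeping tab, newline, and carriage return, before handing the text to a parser. `strip_control_chars` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 allows only \t, \n and \r below U+0020; everything else in that
# range (including backspace, char value 8) makes the document ill-formed.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_control_chars(text):
    """Return `text` with XML-invalid control characters removed."""
    return _INVALID_XML_CHARS.sub("", text)

def parse_lenient(text):
    """Parse XML after stripping invalid control characters."""
    return ET.fromstring(strip_control_chars(text))
```

Stripping changes the data slightly, so this suits cases where the stray characters carry no meaning.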

1665 03/27/2012 09:07 PM Aaron Marcuse-Kubitza

main Makefile: php-Darwin: Added instruction to set PHPRC if needed

1664 03/27/2012 09:03 PM Aaron Marcuse-Kubitza

Added inputs/SpeciesLink/src/tapir.make

1663 03/27/2012 09:03 PM Aaron Marcuse-Kubitza

input.Makefile: `src/%: src/%.make`: Don't tee recipe's stderr to make's stderr, because long-running make_scripts usually will be tracked using `tail -f`

1662 03/27/2012 09:00 PM Aaron Marcuse-Kubitza

input.Makefile: `src/%: src/%.make`: Name the log file using the make_script name instead of the output file name

1661 03/27/2012 08:31 PM Aaron Marcuse-Kubitza

cat_csv: If dialect == None, ignore that file because it's empty

1660 03/27/2012 08:30 PM Aaron Marcuse-Kubitza

csvs.py: stream_info(): If header_line == '', set dialect to None rather than trying (and failing) to auto-detect it
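
A rough Python 3 sketch of this guard, using the standard library's `csv.Sniffer`. It is not the csvs.py code; `sniff_dialect` is a hypothetical name, and restricting the candidate delimiters to comma and tab is an assumption made for the example.

```python
import csv

def sniff_dialect(header_line):
    """Return a csv dialect for `header_line`, or None if the file is empty."""
    if header_line == "":
        # An empty header means an empty file; the sniffer would raise
        # csv.Error here, so report "no dialect" instead.
        return None
    return csv.Sniffer().sniff(header_line, delimiters=",\t")
```

A caller such as cat_csv (r1661) can then skip any file whose dialect is None.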

1659 03/27/2012 08:19 PM Aaron Marcuse-Kubitza

input.Makefile: Use new sort_filenames to put multiple numbered sources in the correct order, dealing correctly with embedded numbers that aren't padded with leading zeros

1658 03/27/2012 08:18 PM Aaron Marcuse-Kubitza

Added sort_filenames to sort a list of filenames, comparing embedded numbers numerically instead of lexicographically
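
Comparing embedded numbers numerically is commonly done with a "natural sort" key. The Python 3 sketch below shows one way to do it; it is not the sort_filenames script itself, and `natural_key` is a hypothetical name.

```python
import re

def natural_key(name):
    """Split `name` into text and number parts so numbers compare as ints."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd-indexed part is a run of digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def sort_filenames_sketch(names):
    """Return `names` sorted with embedded numbers compared numerically."""
    return sorted(names, key=natural_key)
```

A plain lexicographic sort would place "node.10.csv" before "node.2.csv"; with this key, "node.2.csv" comes first.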

1657 03/27/2012 07:18 PM Aaron Marcuse-Kubitza

schemas/postgresql.conf: Decreased shared_buffers again because 4000MB was not far enough below the 4GB SHMMAX

1656 03/27/2012 07:16 PM Aaron Marcuse-Kubitza

schemas/postgresql.conf: Expressed shared_buffers in MB, since decimal GB doesn't seem to work anymore on 9.1

1655 03/27/2012 07:14 PM Aaron Marcuse-Kubitza

schemas/postgresql.conf: Decreased shared_buffers to 3.9GB, slightly less than SHMMAX

1654 03/27/2012 07:11 PM Aaron Marcuse-Kubitza

schemas/postgresql.conf: Optimized again using same changes as were applied to 8.4 version

1653 03/27/2012 07:10 PM Aaron Marcuse-Kubitza

schemas/postgresql.conf: Replaced with original 9.1 version

1652 03/27/2012 07:03 PM Aaron Marcuse-Kubitza

schemas/postgresql.conf: Optimized using settings analogous to those in postgresql.nimoy.conf

1651 03/27/2012 06:43 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.make: Don't abort entire import on empty response, because an empty response is also returned for nodes that are temporarily down, not just nodes that don't exist (assumed to be after the highest numbered node). Instead, stop import after 150 nodes if user did not specify an explicit # nodes.

1650 03/27/2012 05:50 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.make: Abort prefix on empty response using break, rather than just done = True, to avoid running any more code except the finally block. Moved metadata row validation outside metadata row retrieval try-except block.

1649 03/27/2012 05:41 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.make: If a read times out, abort the entire node rather than just the prefix to avoid waiting 20 sec for each of 26*26 prefixes

1648 03/27/2012 05:40 PM Aaron Marcuse-Kubitza

profiling.py ItersProfiler, exc.py ExPercentTracker: Only output fraction of rows with errors if self.iter_ct > 0, to avoid divide-by-zero error

1647 03/27/2012 04:55 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.make: Fixed bug where row count was output in the middle of the row processing code, instead of after the first row is processed and the row count incremented. This removes "Processed 0 row(s)" messages at the beginning of every prefix.

1646 03/27/2012 04:40 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.make: Support custom starting node ID and # nodes processed via env vars

1645 03/27/2012 04:29 PM Aaron Marcuse-Kubitza

Renamed inputs/REMIB/src/nodes.all.0.header.specimens.csv to node.0.header.specimens.csv so it would sort correctly with the new output file names

1644 03/27/2012 04:27 PM Aaron Marcuse-Kubitza

Renamed inputs/REMIB/src/nodes.all.specimens.csv.make to inputs/REMIB/src/nodes.make since it will not be used to generate nodes.all.specimens.csv. However, it can still be used with the `src/%.make` make target, but will generate a dummy empty output file "nodes".

1643 03/27/2012 04:21 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.all.specimens.csv.make: Write each node to a separate output file

1642 03/27/2012 04:00 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.all.specimens.csv.make: Raise InputException instead of AssertionError if invalid metadata row, so that it will be caught and printed instead of aborting the program

1641 03/27/2012 03:56 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.all.specimens.csv.make: Moved header reading code inside TimeoutException try-except block since read sometimes times out before the header is even read

1640 03/27/2012 03:55 PM Aaron Marcuse-Kubitza

schemas/postgresql.nimoy.conf: Increased shared_buffers to 1.5GB since kernel.shmmax has been increased to 2GB

1639 03/26/2012 11:07 PM Aaron Marcuse-Kubitza

Renamed inputs/REMIB/src/remib_raw.0.header.specimens.txt to nodes.all.0.header.specimens.csv

1638 03/26/2012 10:57 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.all.specimens.csv.make: Increased read timeout

1637 03/26/2012 10:55 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.all.specimens.csv.make: Time out stuck reads because sometimes nodes are offline, etc.

1636 03/26/2012 10:53 PM Aaron Marcuse-Kubitza

exc.py: str_(): Strip trailing whitespace. print_ex(): Since str_() now strips trailing whitespace, strings.ensure_newl() is no longer necessary.

1635 03/26/2012 10:43 PM Aaron Marcuse-Kubitza

streams.py: Added TimeoutInputStream and WrapStream. Changed StreamIter to use new WrapStream.

1634 03/26/2012 10:42 PM Aaron Marcuse-Kubitza

Added timeout.py

1633 03/26/2012 10:25 PM Aaron Marcuse-Kubitza

inputs/REMIB/src/nodes.all.specimens.csv.make: Download from all prefixes of all nodes. Stop when a node produces an empty response (not even an error), which indicates no more nodes. Changed status messages.

1632 03/26/2012 10:17 PM Aaron Marcuse-Kubitza

input.Makefile: `src/%: src/%.make`: Append stderr to log file

1631 03/26/2012 09:21 PM Aaron Marcuse-Kubitza

Added inputs/REMIB/src/nodes.all.specimens.csv.make to download REMIB data for all nodes

1630 03/26/2012 09:20 PM Aaron Marcuse-Kubitza

Added streams.py for I/O, which contains StreamIter, TracedOutputStream, and LineCountOutputStream

1629 03/26/2012 09:20 PM Aaron Marcuse-Kubitza

term.py: Added clear_line. Corrected file comment.

1628 03/26/2012 08:06 PM Aaron Marcuse-Kubitza

Makefiles: Let subdir's Makefile decide whether to delete on error

1627 03/26/2012 08:05 PM Aaron Marcuse-Kubitza

input.Makefile: Save partial outputs of aborted src make scripts

1626 03/26/2012 06:44 PM Aaron Marcuse-Kubitza

input.Makefile: Fixed bug in `%: %.make` rule to use $< instead of $*

1625 03/26/2012 06:20 PM Aaron Marcuse-Kubitza

mappings/DwC2-VegBIEN.specimens.csv: minimumElevationInMeters: Remove any "ca." prefix

1624 03/26/2012 06:19 PM Aaron Marcuse-Kubitza

xml_func.py: _replace: Strip whitespace from the returned string

1623 03/26/2012 06:09 PM Aaron Marcuse-Kubitza

csvs.py: Added TsvReader to support TSV quirks. Added reader_class(). reader_and_header(): Use reader_class() to automatically use TsvReader instead of csv.reader for TSVs. Added is_tsv() and use it where `dialect.delimiter == '\t'` was used.

1622 03/26/2012 06:06 PM Aaron Marcuse-Kubitza

strings.py: Added extract_line_ending() and remove_line_ending(). ensure_newl(): Use new remove_line_ending(). Moved Parsing section to top since it is used by the other sections.

1621 03/26/2012 04:40 PM Aaron Marcuse-Kubitza

csvs.py: stream_info(): Set dialect.quoting = csv.QUOTE_NONE for TSVs because they usually don't quote fields. Factored dialect detecting code into new function sniff().

1620 03/26/2012 03:45 PM Aaron Marcuse-Kubitza

input.Makefile: verify: Added reverify option, which can be turned off to prevent regenerating the verify/%.out file from the DB (which can be time-consuming), and instead just diff verify/%.out with verify/%.ref

1619 03/24/2012 10:31 PM Aaron Marcuse-Kubitza

count_error_rows: Allow input to be specified as last arg(s) in addition to via stdin