- Table of contents
- Web interface
Web interface¶
Web interface options¶
Querying¶
Browsing¶
- user requests sample extract
- web interface creates NVS-style Google spreadsheet containing each table in a tab
Clickable fkeys¶
we would modify phpPgAdmin to hyperlink URL fkeys6
- click fkey to jump to the referenced row in phpPgAdmin
- note that phpPgAdmin implements this natively for the Browse and Select modes (click the key icon in the datasource column), but not for custom SQL queries
- click pkey to jump to the current row in phpPgAdmin (i.e. it's a self-refential URL)
- click
source
to jump to thesource
table entry- click
url
to jump to the row in the postprocessed (renamed/augmented) staging table7- click
raw_data_url
to jump to the row in the raw staging table (which is actually an insertable view of the postprocessed staging table) - click
source_url
to jump to the location of the actual source data, if it's accessible online- this may be a non-clickable pseudo-URL if the source data is not accessible online
- click
- click
6 indicated by the fkey column type being url
, which would be a domain over text
.
note that for LibreOffice to auto-hyperlink a URL, it needs to start with www.
or http://
, etc.
7 this should not be an insertable view of the associated VegCore tables, because they don't store the verbatim data8
8 this is so that the VegCore tables are not tasked with storing both the scrubbed and verbatim data in the same place
.
Requirements¶
- Detection of when user pressed Stop button
(to avoid clogging the CPU with orphaned queries that do not have someone waiting for their result. these are very common in large DBs, where queries often take very long to execute and are abandoned.)- Requires outputting space characters regularly (every 10 s or so) to receive SIGPIPE if socket was closed, and proper handling of SIGPIPE to cleanly close the connection
- Time limits
(to prevent abuse by non-BIEN members, if the DB is made public)- each user should be able to run only one query at a time (and can have only one connection to the DB)
- if a query takes longer than a certain soft limit (e.g. 30 min), an admin is e-mailed to see whether the query should be stopped
- if a query takes longer than a certain hard limit (e.g. several hours), it is automatically killed
- for trusted users, the hard limit can be increased by request to an admin
- Query caching
- *SEINet* is a datasource that does many of these
Adding data¶
Workflow¶
- web interface verifies user is not bot by using CAPTCHA
- web interface creates Postgres user account and user-owned schema(s) using security-of-definer Postgres function
- user DB quotas are not natively supported by PostgreSQL, so we would need to use a workaround like putting each user's tables in a disk space-limited tablespace
- user maps data to VegBIEN/VegCore on their own computer
- Linux users would use our scripts directly, which would install the necessary dependencies and the destination schema
- the install target would do everything
make install
does except download the flat files - the svn repo they check out from would leave out the datasources we've mapped, so they have a clean working copy with just their data and mappings
- the install target would do everything
- Windows/Mac users would use a Linux VM with our scripts and schema installed
- pkeys can be serial integers (don't have to be random numbers3) because their schema name prefix (described below) makes them globally unique and mergeable with other datasources
- using random numbers does not, however, remove the need for column-based import to match up the fkeys. that would instead require natural keys.
- Linux users would use our scripts directly, which would install the necessary dependencies and the destination schema
- user accesses our DB using psql/pgAdmin, from their own computer
- phpPgAdmin will continue to be provided to view the data, but is not intended for uploading data
- user imports data into schema by running psql, etc. locally with VegBIEN instead of localhost as the destination
- user runs security-of-definer Postgres function to link data to main DB using
INHERITS
- function will check that pkeys all start with the schema name5, to avoid collisions between datasources
- distinctness of all pkeys is ensured by adding a
CHECK
constraint that asserts each pkey isLIKE 'prefix.%'
, and then ensuring that the new table's prefix is different from all the other child tables' prefixes - to prevent the user from removing the
CHECK
constraint4, the bien user must acquire ownership of the table before making the inheritance connection - because each pkey is unique within the datasource, and there are no pkey overlaps between datasources, each pkey is globally unique within the database
- distinctness of all pkeys is ensured by adding a
- function must run as superuser, not bien user, because ownership of both tables is required to make an inheritance connection1.
however, superuser privileges are not required to sever the inheritance connection if the user wants to alter their table's structure.
- function will check that pkeys all start with the schema name5, to avoid collisions between datasources
- user can access results via phpPgAdmin
- user can publish their data by linking their tables to public VegCore tables, which will cause the public access rights to propagate to their table
1 otherwise, you will get a "must be owner of relation"
error
2 also, only ownership of the containing schema is required to drop a table, including one which is on the inheriting end of an inheritance connection
3 for random number pkeys, need at least 64 bits (bigint
) for <0.1% chance of collision with <200 million rows.
note that collisions just need to be rare, not impossible, because they can be resolved on merge simply by changing the random number and cascading the update to the fkeys.
to instead make collisions statistically impossible, use 128 bits (uuid
)
4 this cannot be pinned by the parent table because it is not inherited, and cannot be inherited because the prefix value is different for each child table
5 to prepend the schema name in a trigger that autopopulates the pkey, use TG_TABLE_SCHEMA
.
Target schema¶
- can be any schema supported by the import process, including VegCore, VegBIEN, and VegBank
- i.e. we can also use this simplified type of web interface for the VegBank upgrade