Repository location (GitLab): Projekt433-metadaten
Discussion forum (Discourse): Projekt433-metadaten
Readme: Projekt433-metadaten
Project description: umwelt.info metadata index
PublicCode.YML: view
OSS Compliance: view
This project provides the metadata index used in the umwelt.info project. It aims for efficient operation by using the Rust programming language and storing the datasets and a search index directly in the file system to avoid dependencies on additional services like databases or search engines. It does not aim to be generic, configurable or programmable, especially where that would conflict with efficiency.
The system is implemented as three separate programs that access a common file system directory at `$DATA_PATH`.
The harvester periodically harvests, crawls, or scrapes the sources defined in `$DATA_PATH/harvester.toml`, writes all datasets to `$DATA_PATH/datasets` (one directory per source, one file per dataset), and stores summary metrics in `$DATA_PATH/metrics`.
The indexer usually runs after the harvester and reads all datasets to produce a search index over their properties in `$DATA_PATH/index` using the Tantivy library.
The server provides an HTTP-based API to query the search index and retrieve individual datasets. It also collects access statistics for each dataset in `$DATA_PATH/stats`. It is the only continuously running component and can be scaled out by exporting `$DATA_PATH` via a networked file system like NFS or SMB.
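Taken together, the three programs share a layout under `$DATA_PATH` along these lines (a sketch; the dataset file names are illustrative):

```console
$DATA_PATH/
├── harvester.toml   # source definitions read by the harvester
├── datasets/        # written by the harvester, one directory per source
│   └── uba-gdi/
│       ├── dataset-1
│       └── dataset-2
├── metrics/         # summary metrics stored by the harvester
├── index/           # Tantivy search index produced by the indexer
└── stats/           # access statistics collected by the server
```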
The code is organized as a single library with three entry points for the above-mentioned programs. A separate binary named `xtask` is used to automate the development workflow.
The CI pipeline checks formatting via Rustfmt, ensures a warning-free build using Clippy, runs the unit and integration tests, and builds and collects optimized binaries.
The system is deployed using a set of sandboxed systemd units, both for periodically running the harvester and indexer as well as continuously running the server.
The canonical deployment of the system is reachable at md.umwelt.info.
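Such a deployment could be sketched as a systemd service plus timer for the periodic components; all unit names, paths, and option values below are illustrative, not the actual deployment files:

```ini
# harvester.service — hypothetical sketch of a sandboxed oneshot unit
[Service]
Type=oneshot
ExecStart=/usr/local/bin/harvester
Environment=DATA_PATH=/var/lib/metadata
# Sandboxing: ephemeral user, read-only system, write access only to the data directory
DynamicUser=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/metadata

# harvester.timer — triggers the service periodically
[Timer]
OnCalendar=daily
```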
After installing a Rust toolchain and adding the optional Clippy and Rustfmt tools via
```console
rustup component add clippy rustfmt
```
the code can be formatted and linted by running
```console
> cargo xtask
```
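The `cargo xtask` subcommand is commonly wired up via a Cargo alias; a typical sketch (the actual configuration in this repository may differ) lives in `.cargo/config.toml`:

```toml
[alias]
xtask = "run --package xtask --"
```

With this alias, `cargo xtask server` compiles and runs the `xtask` binary, which in turn dispatches to the requested workflow step.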
`deployment/harvester.toml` tracks all relevant sources. Based on that, a configuration like
```toml
[[sources]]
name = "uba-gdi"
type = "csw"
url = "https://metadaten.uba.de/csw"
origins = ["/Bund/UBA/GDI"]
source_url = "https://metadaten.uba.de/smartfinder-client/?lang=de#/datasets/iso/{{id}}"
```
should be created at `data/harvester.toml`. (Please be responsible when reusing the harvester configurations provided here. Most importantly, this means not placing undue load on these servers.)
The harvester and indexer can then be invoked by
```console
> cargo xtask harvester
```
Finally, executing
```console
> cargo xtask server
```
will make the server listen on `127.0.0.1:8081`.
We use both unit tests, which can be invoked using
```console
> cargo xtask test
```
and regression tests for the harvester, which can be invoked by
```console
> cargo xtask regression-test
```
Iteratively developing harvesters can be time-consuming and place undue load on the source due to large responses being transmitted over the network. To mitigate this issue, each request must be identified using a key
```rust
let response = client.make_request(source, key, |client| ...).await?;
```
under which its response is stored on disk. Once development has reached a state where the set of requests is stable, all parsing and extraction can be developed against the replayed responses. During development, these files are only fetched again if they are deleted manually. During operations, they are expired using a randomized time-based procedure.
The datasets resulting from running the harvester against pre-recorded responses are checked automatically using the command
```console
> cargo xtask regression-test
```
which will display the difference between the JSON representations of any dataset that changed compared to what is currently checked into version control. (This implies that most changes to the metadata schema or the file format will be detected as changes to all datasets.)
When the changes are intentional, e.g. because the pre-recorded responses were extended or new functionality was added to a harvester, the regression tests should be run again with the `accept` subcommand appended
```console
> cargo xtask regression-test accept
```
to produce a commit which effectively updates the expected results.
Benchmarks are created in the `benches` directory and registered as a target in the `Cargo.toml` file.
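Registering a benchmark target that uses a custom harness instead of the built-in one is done in `Cargo.toml` along these lines (the target name is illustrative):

```toml
[[bench]]
name = "search"
harness = false
```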
All benchmarks can be executed via
```console
> cargo xtask bench
```
and a single target can be run by
```console
> cargo xtask bench --bench [name_of_target]
```
The crate `tiny-bench` is used and will always display the change in timings compared to the last run of the benchmark.
Some functions use external databases which need to be pre-processed before deployment. This is done using the `xtask` binary during development and using CI jobs during operation.
```console
> curl --output allCountries.zip --location http://download.geonames.org/export/dump/allCountries.zip
> unzip allCountries.zip
> cargo xtask geonames < allCountries.txt
```
```console
> curl --output kaikki.org-dictionary-German.json --location https://kaikki.org/dictionary/German/kaikki.org-dictionary-German.json
> cargo xtask kaikki < kaikki.org-dictionary-German.json
```
```console
> curl --output Auszug_GV.xlsx --location https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/Archiv/GVAuszugJ/31122021_Auszug_GV.xlsx?__blob=publicationFile
> cargo xtask ars_ags
```
```console
> curl --output b25_utm32s.zip --location "https://sg.geodatenzentrum.de/wfs_vertriebseinheiten?request=GetFeature&SERVICE=wfs&VERSION=2.0.0&typenames=vertriebseinheiten:b25_utm32s&srsname=urn:ogc:def:crs:EPSG:25832&outputformat=SHAPE-ZIP"
> unzip b25_utm32s.zip
> cargo xtask atkis
```
The HTTP routes `/search`, `/dataset` and `/origin` support content negotiation insofar as they yield either rendered HTML pages or the underlying JSON data, depending on the `Accept` header transmitted by the HTTP client. An OpenAPI-compatible specification is served at `/openapi.json` and can be explored using the Swagger UI served at `/swagger-ui/`.
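Against a locally running server, the negotiation can be exercised with `curl` (shown without query parameters, which depend on the route):

```console
> # Underlying JSON data, for programmatic clients
> curl --header 'Accept: application/json' 'http://127.0.0.1:8081/search'
> # Rendered HTML page, as a browser would request it
> curl --header 'Accept: text/html' 'http://127.0.0.1:8081/search'
```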