Commit 2057208e authored by Jens Bröder

Add docs description of deployment and data pipeline.

parent 740536c2
1 merge request: !5 Merge dev into main for documentation release v1.0.0
Pipeline #251726 passed
@@ -14,7 +14,7 @@ parts:
title: "Implementation overview"
- file: introduction/data_sources.md
- caption: Data in UnHIDE
numbered: False
numbered: True
chapters:
- file: data/overview.md
title: "Overview"
@@ -55,10 +55,13 @@ parts:
- caption: Technical implementation
numbered: True
chapters:
- file: tech/harvesting.md
title: "Data Harvesting"
- file: tech/uplifting.md
title: "Data uplifting"
- file: tech/datapipe.md
title: Data pipeline
sections:
- file: tech/harvesting.md
title: "Data Harvesting"
- file: tech/uplifting.md
title: "Data uplifting"
- file: tech/backend.md
title: "Architecture"
sections:
@@ -111,4 +111,26 @@ over completeness. Specify important, surprising, risky, complex or
volatile building blocks. Leave out normal, simple, boring or
standardized parts of your system.
:::
## Whitebox Overall System
A rough overview of the UnHIDE-related repositories under the [project](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide)
is shown in the figure below.
![repository_overview](../../diagrams/unhide_overview_repositories.svg)
The *administration repository* is private and used for project-related material that should not be made
public. The docs can therefore link to confidential information stored there.
The *documentation repository* holds the documentation of the overall project, which is essentially
what you are reading right now.
The *unhide-docker* repository contains various Docker files for full or partial deployment of the
whole project. Docker files for development environments also go there.
The *data-harvesting* repository is a Python library with a command line tool to run harvesters and
data processing. Functionality should be kept general wherever possible and only configured
specifically for UnHIDE.
The *unhide-ui* repository contains the [web front-end](https://search.unhide.helmholtz-metadaten.de) for UnHIDE,
which exposes the full text search over the data.
Currently, it also contains the backend pieces needed for the full text search index, i.e.
the indexer, an API, and some SOLR-related configuration and schemas.
@@ -99,6 +99,35 @@ Mapping of Building Blocks to Infrastructure
: *\<description of the mapping>*
:::
UnHIDE is deployed on HDF-Cloud
## Infrastructure Level 1 {#_infrastructure_level_1}
UnHIDE is deployed on [HDF-cloud](https://www.fz-juelich.de/en/ias/jsc/systems/scientific-clouds/hdf-cloud)
at the Jülich Supercomputing Centre. The cloud is an OpenStack instance hosted as a service by the supercomputing centre.
It was chosen in order to run on reliable infrastructure external to the institute at low cost.
So far the support has been responsive and there have been no issues.
UnHIDE runs on a single virtual machine, while data is mounted via a volume.
***\<Overview Diagram>***
On the HDF-Cloud virtual machine, the data pipeline with all the harvesters is executed periodically through a cron job.
(The data pipeline is therefore not event based, which for now keeps things simple.) An overview of this
deployment is shown below in Fig. @fig:overview_deploy.
![overview_deploy](../../diagrams/unhide_deployment_overview.svg){#fig:overview_deploy}
The figure shows all Docker containers spawned for the deployment of UnHIDE.
Each part and service runs in its own Docker container and communicates with other containers
over an internal network. All connections from and to the outside world, i.e. our domains on the internet, go through
a reverse proxy using Nginx with SSL.
### Deployment of the documentation
![repository_overview](../../diagrams/documentation_deployment.svg)
The figure above shows how the UnHIDE documentation is currently deployed.
Because GitLab Pages is not enabled for our use case and needs on the GitLab instance hosting the UnHIDE project,
we decided to mirror the documentation repository to [GitHub](https://github.com/Materials-Data-Science-and-Informatics/unhide-docs) and deploy the documentation from there
via GitHub Pages.
@@ -70,4 +70,15 @@ concepts](images/08-Crosscutting-Concepts-Structure-EN.png)
See [Concepts](https://docs.arc42.org/section-8/) in the arc42
documentation.
:::
## Domain concepts
### Provenance of linked-data
### Internal Data Model
### Updating linked data
## Operational concepts
### Interoperability with other graphs
# Data pipeline
In UnHIDE, data is harvested from connected providers and partners.
The data is then 'uplifted', i.e. semantically enriched and/or completed,
where possible from aggregated data or schema.org semantics.
## Overview
A full view of the UnHIDE data pipeline is shown below:
![overview](../diagrams/unhide_harvester_datapipeline.svg)
A single central YAML configuration [file](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/configs/config.yaml?ref_type=heads) called `config.yaml` lists all data provider sources
for the harvesters. It also defines provider-specific settings that a given Harvester
class might need in order to find the metadata files. The configuration file further specifies from which
point in time changes should be harvested, which allows all harvesters to run frequently while only
picking up recently added or changed metadata entries.
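
To make the role of this file concrete, the sketch below shows how such a central configuration could be read and iterated over in Python. The keys (`sources`, `urls`, `since`) and provider names are purely illustrative and not taken from the actual UnHIDE `config.yaml`.

```python
# Hypothetical sketch of reading a central harvester configuration.
# Keys and provider names are illustrative only.
import yaml  # provided by the PyYAML package

EXAMPLE_CONFIG = """
sources:
  provider_a:
    urls: ["https://data.provider-a.example/oai"]
    since: "2023-01-01"     # only harvest changes after this date
  provider_b:
    urls: ["https://provider-b.example/sitemap.xml"]
    since: "2023-06-01"
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for provider, settings in config["sources"].items():
    print(f"Harvest {provider} from {settings['urls']} since {settings['since']}")
```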
The harvesters then extract all metadata files specified for a given provider.
In certain cases the files are converted, and all of them are validated for correct schema.org content.
The content of these JSON-LD files is stored, together with some metadata about the harvesters, in an
internal JSON-serializable UnHIDE DataModel class. This class tracks the original data content as well
as the uplifted version, plus a reproducible, detailed level of provenance data in the form of so-called
RDF patches. More detail on this can be found [here](./harvesting.md).
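
As a rough illustration of such an internal record, a JSON-serializable container could look like the following sketch. All class and field names here are hypothetical and only indicate the kind of information the UnHIDE DataModel tracks (original content, uplifted content, RDF patches, harvester metadata).

```python
# Illustrative sketch only; not the actual UnHIDE DataModel API.
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class HarvestedRecord:
    original: dict[str, Any]                           # JSON-LD as delivered by the provider
    uplifted: dict[str, Any]                           # JSON-LD after semantic enrichment
    patches: list[str] = field(default_factory=list)   # serialized RDF patches for provenance
    harvester: str = ""                                 # harvester class that produced the record
    harvested_at: str = ""                              # ISO timestamp of the harvest run

    def to_json(self) -> str:
        """Serialize the whole record, e.g. for a file-based ground truth store."""
        return json.dumps(asdict(self))
```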
The `config.yaml` also stores the configuration for the `Aggregator` class, which specifies what
data operations, in terms of uplifting, should be performed on the incoming data.
The serialization of the UnHIDE DataModel files forms the ground truth source for UnHIDE.
More detail on this can be found [here](./uplifting.md).
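
One way to picture such an uplifting operation is a SPARQL UPDATE applied to the graph of a single record, for example with `rdflib` (version 6 or later, which bundles JSON-LD support). The update below is a made-up example, not one of the actual Aggregator operations.

```python
# Hedged sketch of an aggregator-style uplift step; the SPARQL UPDATE is illustrative.
from rdflib import Graph

UPLIFT_UPDATE = """
PREFIX schema: <https://schema.org/>
INSERT { ?s schema:isPartOf <https://example.org/unhide> }
WHERE  { ?s a schema:Dataset }
"""


def uplift(jsonld_text: str) -> str:
    graph = Graph()
    graph.parse(data=jsonld_text, format="json-ld")   # load the record's graph
    graph.update(UPLIFT_UPDATE)                        # apply the configured enrichment
    return graph.serialize(format="json-ld")           # uplifted JSON-LD back out
```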
To fulfill the different use cases of our stakeholders, this data then flows further in two directions.
In the first direction, it is imported into a single RDF graph database using Apache Jena.
This database can be accessed through a SPARQL endpoint exposed via Jena Fuseki.
The second direction provides full text search on the data to end users.
For this, an index of each uplifted data record is constructed and uploaded into a single SOLR index,
which is exposed to a certain extent via a custom FastAPI service. A web front end using the JavaScript library
React provides a user interface for the full text search and supports special use cases as a service
to certain stakeholder groups.
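
From a consumer's perspective, the two directions boil down to two HTTP endpoints: a SPARQL endpoint served by Jena Fuseki and a full text search backed by SOLR. The snippet below sketches minimal clients for both; the URLs, dataset and core names are placeholders, not the actual UnHIDE endpoints.

```python
# Placeholder endpoints; the real UnHIDE URLs and API routes differ.
import requests


def sparql_select(query: str) -> dict:
    """Run a SELECT query against a Fuseki-style SPARQL endpoint."""
    resp = requests.post(
        "https://sparql.example.org/unhide/sparql",
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()


def fulltext_search(text: str) -> dict:
    """Query a SOLR core directly; in UnHIDE this is fronted by a FastAPI service."""
    resp = requests.get(
        "https://search.example.org/solr/unhide/select",
        params={"q": text, "wt": "json"},
    )
    resp.raise_for_status()
    return resp.json()
```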