Commit 2057208e authored by Jens Bröder

Add docs description of deployment and data pipeline.

parent 740536c2
1 merge request: !5 Merge dev into main for documentation release v1.0.0
Pipeline #251726 passed
@@ -14,7 +14,7 @@ parts:
title: "Implementation overview"
- file: introduction/data_sources.md
- caption: Data in UnHIDE
numbered: False
numbered: True
chapters:
- file: data/overview.md
title: "Overview"
@@ -55,10 +55,13 @@ parts:
- caption: Technical implementation
numbered: True
chapters:
- file: tech/harvesting.md
title: "Data Harvesting"
- file: tech/uplifting.md
title: "Data uplifting"
- file: tech/datapipe.md
title: Data pipeline
sections:
- file: tech/harvesting.md
title: "Data Harvesting"
- file: tech/uplifting.md
title: "Data uplifting"
- file: tech/backend.md
title: "Architecture"
sections:
@@ -111,4 +111,26 @@ over completeness. Specify important, surprising, risky, complex or
volatile building blocks. Leave out normal, simple, boring or
standardized parts of your system.
:::
## Whitebox Overall System
A rough overview of the UnHIDE-related repositories under the [project](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide)
is shown in the figure below.
![repository_overview](../../diagrams/unhide_overview_repositories.svg)
The *administration repository* is private and used for project-related material that should not be made
public. The docs can therefore link to confidential information stored there.
The *documentation repository* holds the documentation of the overall project, which is essentially
what you are reading right now.
The *unhide-docker* repository contains various Docker files for full or partial deployment of the
whole project. Docker files for development environments also go there.
The *data-harvesting* repository is a Python library with a command line tool to run harvesters and
data processing. Functionality should be kept general wherever possible and only configured
specifically for UnHIDE.
The *unhide-ui* repository contains the [web front-end](https://search.unhide.helmholtz-metadaten.de) for UnHIDE,
which exposes the full text search over the data.
Currently, it also contains the backend pieces needed for the full text search index, i.e.
the indexer, an API, and some SOLR-related configuration and schemas.
@@ -99,6 +99,35 @@ Mapping of Building Blocks to Infrastructure
: *\<description of the mapping>*
:::
UnHIDE is deployed on HDF-Cloud
## Infrastructure Level 1 {#_infrastructure_level_1}
UnHIDE is deployed on [HDF-cloud](https://www.fz-juelich.de/en/ias/jsc/systems/scientific-clouds/hdf-cloud)
at the Jülich Supercomputing Centre. The cloud is an OpenStack instance hosted as a service by the supercomputing centre.
It was chosen in order to run on reliable infrastructure external to the institute at low cost.
So far the support has been responsive and there have been no issues.
UnHIDE runs on a single virtual machine, while data is mounted via a volume.
***\<Overview Diagram>***
On the HDF-Cloud virtual machine, the data pipeline with all the harvesters is executed periodically through a cron job.
(The data pipeline is therefore not event based, which for now keeps things simple.) An overview of this
deployment is shown below in Fig. @fig:overview_deploy.
![overview_deploy](../../diagrams/unhide_deployment_overview.svg){#fig:overview_deploy}
The figure shows all Docker containers spawned for the deployment of UnHIDE.
Each part and service runs in its own Docker container and communicates with other containers
over an internal network. All connections from and to the outside world, i.e. our domains on the internet, go through
a reverse proxy using Nginx with SSL.
### Deployment of the documentation
![repository_overview](../../diagrams/documentation_deployment.svg)
The figure above shows how the UnHIDE documentation is currently deployed.
Because GitLab Pages is not enabled for our use case and needs on the GitLab instance hosting the UnHIDE project,
we decided to mirror the documentation repository to [GitHub](https://github.com/Materials-Data-Science-and-Informatics/unhide-docs) and deploy the documentation from there
via GitHub Pages.
@@ -70,4 +70,15 @@ concepts](images/08-Crosscutting-Concepts-Structure-EN.png)
See [Concepts](https://docs.arc42.org/section-8/) in the arc42
documentation.
:::
## Domain concepts
### Provenance of linked-data
### Internal Data Model
### Updating linked data
## Operational concepts
### Interoperability with other graphs
# Data pipeline
In UnHIDE, data is harvested from connected providers and partners.
The data is then 'uplifted', i.e. semantically enriched and/or completed,
where possible from aggregated data or schema.org semantics.
## Overview
A full view of the UnHIDE data pipeline is shown below:
![overview](../diagrams/unhide_harvester_datapipeline.svg)
A single central YAML configuration [file](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/configs/config.yaml?ref_type=heads) called `config.yaml` lists all data provider sources
for the harvesters. It also defines provider-specific settings that a given Harvester
class might need in order to find the metadata files. The configuration file further specifies from which
point in time changes should be harvested, which allows all harvesters to run frequently while only
picking up recently added or changed metadata entries.
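
To make the role of this file concrete, the sketch below shows how such a central configuration could be read and iterated over in Python. The keys (`sources`, `urls`, `since`) and provider names are purely illustrative and not taken from the actual UnHIDE `config.yaml`.

```python
# Hypothetical sketch of reading a central harvester configuration.
# Keys and provider names are illustrative only.
import yaml  # provided by the PyYAML package

EXAMPLE_CONFIG = """
sources:
  provider_a:
    urls: ["https://data.provider-a.example/oai"]
    since: "2023-01-01"     # only harvest changes after this date
  provider_b:
    urls: ["https://provider-b.example/sitemap.xml"]
    since: "2023-06-01"
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for provider, settings in config["sources"].items():
    print(f"Harvest {provider} from {settings['urls']} since {settings['since']}")
```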
The harvesters then extract all metadata files specified for a given provider.
In certain cases the files are converted, and all of them are validated for correct schema.org content.
The content of these JSON-LD files is stored, together with some metadata about the harvesters, in an
internal JSON-serializable UnHIDE DataModel class. This class tracks the original data content as well
as the uplifted version, plus a reproducible, detailed level of provenance data in the form of so-called
RDF patches. More detail on this can be found [here](./harvesting.md).
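
As a rough illustration of such an internal record, a JSON-serializable container could look like the following sketch. All class and field names here are hypothetical and only indicate the kind of information the UnHIDE DataModel tracks (original content, uplifted content, RDF patches, harvester metadata).

```python
# Illustrative sketch only; not the actual UnHIDE DataModel API.
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class HarvestedRecord:
    original: dict[str, Any]                           # JSON-LD as delivered by the provider
    uplifted: dict[str, Any]                           # JSON-LD after semantic enrichment
    patches: list[str] = field(default_factory=list)   # serialized RDF patches for provenance
    harvester: str = ""                                 # harvester class that produced the record
    harvested_at: str = ""                              # ISO timestamp of the harvest run

    def to_json(self) -> str:
        """Serialize the whole record, e.g. for a file-based ground truth store."""
        return json.dumps(asdict(self))
```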
The `config.yaml` also stores the configuration for the `Aggregator` class, which specifies what
data operations, in terms of uplifting, should be performed on the incoming data.
The serialization of the UnHIDE DataModel files forms the ground truth source for UnHIDE.
More detail on this can be found [here](./uplifting.md).
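
One way to picture such an uplifting operation is a SPARQL UPDATE applied to the graph of a single record, for example with `rdflib` (version 6 or later, which bundles JSON-LD support). The update below is a made-up example, not one of the actual Aggregator operations.

```python
# Hedged sketch of an aggregator-style uplift step; the SPARQL UPDATE is illustrative.
from rdflib import Graph

UPLIFT_UPDATE = """
PREFIX schema: <https://schema.org/>
INSERT { ?s schema:isPartOf <https://example.org/unhide> }
WHERE  { ?s a schema:Dataset }
"""


def uplift(jsonld_text: str) -> str:
    graph = Graph()
    graph.parse(data=jsonld_text, format="json-ld")   # load the record's graph
    graph.update(UPLIFT_UPDATE)                        # apply the configured enrichment
    return graph.serialize(format="json-ld")           # uplifted JSON-LD back out
```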
To fulfill the different use cases of our stakeholders, this data then flows further in two directions.
In the first direction, it is imported into a single RDF graph database using Apache Jena.
This database can be accessed through a SPARQL endpoint exposed via Jena Fuseki.
The second direction provides full text search on the data to end users.
For this, an index of each uplifted data record is constructed and uploaded into a single SOLR index,
which is exposed to a certain extent via a custom FastAPI service. A web front end using the JavaScript library
React provides a user interface for the full text search and supports special use cases as a service
to certain stakeholder groups.
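
From a consumer's perspective, the two directions boil down to two HTTP endpoints: a SPARQL endpoint served by Jena Fuseki and a full text search backed by SOLR. The snippet below sketches minimal clients for both; the URLs, dataset and core names are placeholders, not the actual UnHIDE endpoints.

```python
# Placeholder endpoints; the real UnHIDE URLs and API routes differ.
import requests


def sparql_select(query: str) -> dict:
    """Run a SELECT query against a Fuseki-style SPARQL endpoint."""
    resp = requests.post(
        "https://sparql.example.org/unhide/sparql",
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()


def fulltext_search(text: str) -> dict:
    """Query a SOLR core directly; in UnHIDE this is fronted by a FastAPI service."""
    resp = requests.get(
        "https://search.example.org/solr/unhide/select",
        params={"q": text, "wt": "json"},
    )
    resp.raise_for_status()
    return resp.json()
```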