Backend Refactoring Idea - Introduce a relational Entity DB
This issue is just to record and share a few ideas for improving the architecture of the Unhide backend.
The current architecture has some limitations and challenges:
- the original data coming from repositories is "dirty" and varies a lot depending on the repository
- this makes the idea of "uplifting" data back to the repository in the same format not really feasible, and it is questionable whether it would even be useful (just because the repositories provide a JSON export does not mean that it is easy or possible to import data back in that format, and it is unclear who would be responsible and authorized to do that)
- The graph is currently built by throwing all the schema.org-shaped pieces together (after some processing of the sources) and hoping that the subgraphs "link well"
Structural schemas for each supported entity type
I think the cleanest way to get a coherent graph with useful entities would be to actually have an Unhide-internal schema for each entity, modeled after schema.org but possibly adjusted to our needs.
As Unhide already uses pydantic, it would be natural to just create some pydantic models and tweak them so that they serialize to valid JSON-LD (e.g. similar to how metador-core does it, by auto-attaching `@id` and `@type` to all instances of a model/schema).
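A minimal sketch of what such models could look like (this is only an illustration, not the metador-core implementation; the `Person` fields and the `model_dump_jsonld` helper are invented for the example):

```python
from typing import Optional
from pydantic import BaseModel, Field


class LDEntity(BaseModel):
    """Base class that injects JSON-LD @id/@type on serialization."""

    id: str = Field(..., alias="@id")  # IRI identifying the entity

    def model_dump_jsonld(self) -> dict:
        data = self.model_dump(by_alias=True, exclude_none=True)
        # assumption: the model class name matches the schema.org type name
        data["@type"] = type(self).__name__
        return data


class Person(LDEntity):
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None  # could later become a nested Organization


p = Person(**{"@id": "https://orcid.org/0000-0002-1825-0097", "name": "Max Mustermann"})
print(p.model_dump_jsonld())
# {'@id': 'https://orcid.org/0000-0002-1825-0097', 'name': 'Max Mustermann', '@type': 'Person'}
```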
In the end, instead of having just the per-dataset original/uplifted data, it might be better to create a relational database (let's call it the entity DB) with a table for each kind of entity (Person, Organization, Dataset, etc.).
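As a rough illustration, the entity DB could look something like the following (table and column names are placeholders, and sqlite is used here only as a stand-in for a real PostgreSQL setup):

```python
import sqlite3

# Hypothetical entity DB layout: one table per entity kind, plus a link table
# that records which harvested dataset records contributed to which entity.
schema = """
CREATE TABLE person (
    id     TEXT PRIMARY KEY,   -- stable Unhide-internal IRI
    orcid  TEXT UNIQUE,
    name   TEXT NOT NULL,
    email  TEXT
);

CREATE TABLE organization (
    id      TEXT PRIMARY KEY,
    ror_id  TEXT UNIQUE,
    name    TEXT NOT NULL
);

CREATE TABLE dataset (
    id          TEXT PRIMARY KEY,
    doi         TEXT UNIQUE,
    title       TEXT NOT NULL,
    repository  TEXT NOT NULL   -- which repository it was harvested from
);

-- provenance: which source records an entity was assembled from
CREATE TABLE entity_source (
    entity_id   TEXT NOT NULL,
    dataset_id  TEXT NOT NULL,
    harvested   TIMESTAMP NOT NULL,
    PRIMARY KEY (entity_id, dataset_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```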
Focus Unhide around the Entity DB
The schema-compliant entities could be the central "continuously uplifted" normalized representation for all things that Unhide cares about, serving as the target for newly ingested data and as the source for building a clean Unhide graph.
For example, each time a repository is (re)harvested, new/changed pieces of information should be updated in the corresponding reference record.
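A sketch of what that update step could look like, assuming the `person` table from above (matching by ORCID only is a deliberate simplification of a real entity-resolution step, and the generated IRI is made up):

```python
import sqlite3


def upsert_person(conn: sqlite3.Connection, harvested: dict) -> str:
    """Merge one harvested person snippet into its reference record.

    `harvested` is assumed to be an already-normalized dict with optional
    'orcid', 'name' and 'email' keys.
    """
    row = None
    if harvested.get("orcid"):
        row = conn.execute(
            "SELECT id FROM person WHERE orcid = ?", (harvested["orcid"],)
        ).fetchone()

    if row is None:
        # no existing reference record: create a new entity
        entity_id = "https://unhide.example.org/person/" + harvested["name"].replace(" ", "-")
        conn.execute(
            "INSERT INTO person (id, orcid, name, email) VALUES (?, ?, ?, ?)",
            (entity_id, harvested.get("orcid"), harvested["name"], harvested.get("email")),
        )
        return entity_id

    # existing record: only fill in fields that are new or were missing
    entity_id = row[0]
    conn.execute(
        "UPDATE person SET email = COALESCE(?, email) WHERE id = ?",
        (harvested.get("email"), entity_id),
    )
    return entity_id
```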
Based on the Entity DB, the JSON-LD serialization could be used to (re)generate a clean-ish RDF knowledge graph that is exposed via SPARQL and the triple store.
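For that step, the JSON-LD produced from the entity records could simply be parsed into an RDF graph and bulk-loaded into the triple store. A sketch using rdflib (assuming rdflib >= 6 with built-in JSON-LD support; the `@context` and the example records are illustrative only):

```python
import json
from rdflib import Graph

# JSON-LD document assembled from entity DB records (illustrative content)
doc = {
    "@context": {"@vocab": "https://schema.org/"},
    "@graph": [
        {
            "@id": "https://orcid.org/0000-0002-1825-0097",
            "@type": "Person",
            "name": "Max Mustermann",
        },
        {
            "@id": "https://unhide.example.org/dataset/example",
            "@type": "Dataset",
            "name": "Example dataset",
            "creator": {"@id": "https://orcid.org/0000-0002-1825-0097"},
        },
    ],
}

g = Graph()
g.parse(data=json.dumps(doc), format="json-ld")

# serialize e.g. as N-Triples for bulk upload into the triple store
print(g.serialize(format="nt"))
```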
Summary
Suggested architecture refactoring for the data backend:
Repositories -> Harvesters -> Relational Entity DB (e.g. PostgreSQL) -> Indexer / TripleStore -> Users
Discussion
Disadvantages
- This means giving up the idea of uplifting the individual dataset snippets that come from the repository. But I believe this is not realistic/feasible anyway.
Advantages
- This setup would make clear what a single entity even is. Currently it is unclear how to do entity resolution nicely, because there is no single record that represents "all we know about Max Mustermann". If each entity is tracked explicitly, it can be the source and target of operations; it is a concrete thing. Right now all entities are only implicit "triples", loosely held together by some schema.org terms and IRIs.
- The indexing could become much cleaner and more structured. A generic method could generate the SOLR-specific information from an entity instance based on its value and type information, simplifying the tangled code that currently plucks apart the JSON objects (see the indexing sketch after this list).
- The resulting graph that Unhide provides would be cleaner and more structured. For a SPARQL endpoint to be useful, the graph needs a coherent structure so that useful queries can be formulated. If the graph is built from the controlled entities (which are created and maintained based on harvesting), then arbitrary incoherent "triple soup" will not end up in the graph.
- It is easier to formulate "uplifting rules" -- one can actually talk about what to do with "Max Mustermann" if his address in a dataset we harvested differs from the one in the ORCID record we have, or about how to distinguish different Max Mustermanns from each other (in somesy we use a heuristic based on ORCID, email and names for exactly that purpose; see the matching sketch after this list).
- It becomes possible to actually point out where wrong information is and correct it (the respective data in the entity DB)
- It becomes possible to e.g. link entities to datasets and track information provenance (without being constrained by the limitations of RDF tech); one could actually track that "Max Mustermann is built from data contributed by 5 datasets Unhide harvested", etc.
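Regarding the indexing point above, a generic SOLR document could be derived from any entity model by walking its fields. A minimal sketch (the `to_solr_doc` helper and the dynamic-field naming convention are assumptions for illustration, not existing Unhide code):

```python
from pydantic import BaseModel


def to_solr_doc(entity: BaseModel) -> dict:
    """Derive a flat SOLR-style document from an entity model.

    Uses the pydantic field values and types to pick a suffix-based
    dynamic field name (a common SOLR convention: *_s for strings,
    *_ss for multi-valued strings, *_i for integers, *_b for booleans).
    """
    doc = {"id": getattr(entity, "id", None), "type_s": type(entity).__name__}
    for name, value in entity.model_dump(exclude_none=True).items():
        if name == "id":
            continue
        if isinstance(value, bool):      # check bool before int (bool is an int subclass)
            doc[f"{name}_b"] = value
        elif isinstance(value, int):
            doc[f"{name}_i"] = value
        elif isinstance(value, list):
            doc[f"{name}_ss"] = [str(v) for v in value]
        else:
            doc[f"{name}_s"] = str(value)
    return doc
```

And for the entity resolution / "uplifting rules" point, a heuristic in the spirit of the somesy approach (ORCID first, then email, then name) could look roughly like this; the exact order and fallbacks are invented for the example:

```python
def same_person(a: dict, b: dict) -> bool:
    """Decide whether two person records likely refer to the same person.

    Order of evidence: matching ORCID is decisive, then email,
    then an exact (case-insensitive) name match as a weak fallback.
    """
    if a.get("orcid") and b.get("orcid"):
        return a["orcid"] == b["orcid"]
    if a.get("email") and b.get("email"):
        return a["email"].lower() == b["email"].lower()
    if a.get("name") and b.get("name"):
        return a["name"].strip().lower() == b["name"].strip().lower()
    return False
```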
Other thoughts
- With this architecture, one could still take the "dataset-based entities" that come directly from the original repository and compare them to the (hopefully) more complete and richer entities from the entity DB (i.e. the "uplifted data")
- One could also provide both graphs (unprocessed triples thrown together, vs. the clean graph with improved/resolved entities)
- If certain harvested repositories had procedures for accepting corrected metadata, then repository-specific serializers based on the internal entity model could of course still be created (but I doubt that this loop will "close" in practice in this way)