Backend Refactoring Idea - Introduce a relational Entity DB
This issue is just to record and share a few ideas for improving the architecture of the Unhide backend.
The current architecture has some limitations and challenges:
- the original data coming from repositories is "dirty" and varies a lot depending on the repository
- this makes the idea of "uplifting" data back to the repository in the same format not really feasible, and it is questionable whether it would even be useful (just because the repositories provide a JSON export does not mean that it is easy or possible to import data back in that format, and it is unclear who would be responsible and authorized to do that)
- The graph is currently built by throwing all the schema.org-shaped pieces together (after some processing of the sources) and hoping that the subgraphs "link well"
Structural schemas for each supported entity type
I think the cleanest way to get a coherent graph with useful entities would be to actually have an Unhide-internal schema for each entity, modeled after schema.org but possibly adjusted to our needs.
As Unhide already uses pydantic, it would be natural to just create some pydantic models and tweak them so that they serialize to valid JSON-LD (e.g. similar to how metador-core does it, by auto-attaching `@id` and `@type` to all instances of a model/schema).
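A minimal sketch of what such models could look like (this is only an illustration, not the metador-core implementation; the `Person` fields and the `model_dump_jsonld` helper are invented for the example):

```python
from typing import Optional
from pydantic import BaseModel, Field


class LDEntity(BaseModel):
    """Base class that injects JSON-LD @id/@type on serialization."""

    id: str = Field(..., alias="@id")  # IRI identifying the entity

    def model_dump_jsonld(self) -> dict:
        data = self.model_dump(by_alias=True, exclude_none=True)
        # assumption: the model class name matches the schema.org type name
        data["@type"] = type(self).__name__
        return data


class Person(LDEntity):
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None  # could later become a nested Organization


p = Person(**{"@id": "https://orcid.org/0000-0002-1825-0097", "name": "Max Mustermann"})
print(p.model_dump_jsonld())
# {'@id': 'https://orcid.org/0000-0002-1825-0097', 'name': 'Max Mustermann', '@type': 'Person'}
```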
In the end, instead of having just the per-dataset original/uplifted data, it might be better to create a relational database (let's call it the entity DB) with a table for each kind of entity (Person, Organization, Dataset, etc.).
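As a rough illustration, the entity DB could look something like the following (table and column names are placeholders, and sqlite is used here only as a stand-in for a real PostgreSQL setup):

```python
import sqlite3

# Hypothetical entity DB layout: one table per entity kind, plus a link table
# that records which harvested dataset records contributed to which entity.
schema = """
CREATE TABLE person (
    id     TEXT PRIMARY KEY,   -- stable Unhide-internal IRI
    orcid  TEXT UNIQUE,
    name   TEXT NOT NULL,
    email  TEXT
);

CREATE TABLE organization (
    id      TEXT PRIMARY KEY,
    ror_id  TEXT UNIQUE,
    name    TEXT NOT NULL
);

CREATE TABLE dataset (
    id          TEXT PRIMARY KEY,
    doi         TEXT UNIQUE,
    title       TEXT NOT NULL,
    repository  TEXT NOT NULL   -- which repository it was harvested from
);

-- provenance: which source records an entity was assembled from
CREATE TABLE entity_source (
    entity_id   TEXT NOT NULL,
    dataset_id  TEXT NOT NULL,
    harvested   TIMESTAMP NOT NULL,
    PRIMARY KEY (entity_id, dataset_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```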
Focus Unhide around the Entity DB
The schema-compliant entities could be the central "continuously uplifted" normalized representation for all things that Unhide cares about, serving as the target for newly ingested data and as the source for building a clean Unhide graph.
For example, each time a repository is (re)harvested, new/changed pieces of information should be updated in the corresponding reference record.
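A sketch of what that update step could look like, assuming the `person` table from above (matching by ORCID only is a deliberate simplification of a real entity-resolution step, and the generated IRI is made up):

```python
import sqlite3


def upsert_person(conn: sqlite3.Connection, harvested: dict) -> str:
    """Merge one harvested person snippet into its reference record.

    `harvested` is assumed to be an already-normalized dict with optional
    'orcid', 'name' and 'email' keys.
    """
    row = None
    if harvested.get("orcid"):
        row = conn.execute(
            "SELECT id FROM person WHERE orcid = ?", (harvested["orcid"],)
        ).fetchone()

    if row is None:
        # no existing reference record: create a new entity
        entity_id = "https://unhide.example.org/person/" + harvested["name"].replace(" ", "-")
        conn.execute(
            "INSERT INTO person (id, orcid, name, email) VALUES (?, ?, ?, ?)",
            (entity_id, harvested.get("orcid"), harvested["name"], harvested.get("email")),
        )
        return entity_id

    # existing record: only fill in fields that are new or were missing
    entity_id = row[0]
    conn.execute(
        "UPDATE person SET email = COALESCE(?, email) WHERE id = ?",
        (harvested.get("email"), entity_id),
    )
    return entity_id
```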
Based on the Entity DB, the JSON-LD serialization could be used to (re)generate a clean-ish RDF knowledge graph that is exposed via SPARQL and the triple store.
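For that step, the JSON-LD produced from the entity records could simply be parsed into an RDF graph and bulk-loaded into the triple store. A sketch using rdflib (assuming rdflib >= 6 with built-in JSON-LD support; the `@context` and the example records are illustrative only):

```python
import json
from rdflib import Graph

# JSON-LD document assembled from entity DB records (illustrative content)
doc = {
    "@context": {"@vocab": "https://schema.org/"},
    "@graph": [
        {
            "@id": "https://orcid.org/0000-0002-1825-0097",
            "@type": "Person",
            "name": "Max Mustermann",
        },
        {
            "@id": "https://unhide.example.org/dataset/example",
            "@type": "Dataset",
            "name": "Example dataset",
            "creator": {"@id": "https://orcid.org/0000-0002-1825-0097"},
        },
    ],
}

g = Graph()
g.parse(data=json.dumps(doc), format="json-ld")

# serialize e.g. as N-Triples for bulk upload into the triple store
print(g.serialize(format="nt"))
```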
Summary
Suggested architecture refactoring for the data backend:
Repositories -> Harvesters -> Relational Entity DB (e.g. PostgreSQL) -> Indexer / TripleStore -> Users
Discussion
Disadvantages
- This means giving up the idea of uplifting the individual dataset snippets that come from the repository. But I believe this is not realistic/feasible anyway.
Advantages
- This setup would make clear what a single entity even is. Currently it is unclear how to do entity resolution nicely, because there is no single record that represents "all we know about Max Mustermann". If each entity is tracked explicitly, it can be the source and target of operations; it is a concrete thing. Right now all entities are only implicit "triples", loosely held together by some schema.org terms and IRIs.
- The indexing could become much cleaner and more structured. A generic method could generate the SOLR-specific information from an entity instance based on its value and type information, simplifying the tangled code that currently plucks apart the JSON objects (see the indexing sketch after this list).
- The resulting graph that Unhide provides would be cleaner and more structured. For a SPARQL endpoint to be useful, the graph needs a coherent structure so that useful queries can be formulated. If the graph is built from the controlled entities (which are created and maintained based on harvesting), then arbitrary incoherent "triple soup" will not end up in the graph.
- It is easier to formulate "uplifting rules" -- one can actually talk about what to do with "Max Mustermann" if his address in a dataset we harvested differs from the one in the ORCID record we have, or about how to distinguish different Max Mustermanns from each other (in somesy we use a heuristic based on ORCID, email and names for exactly that purpose; see the matching sketch after this list).
- It becomes possible to actually point out where wrong information is and correct it (the respective data in the entity DB)
- It becomes possible to e.g. link entities to datasets and track information provenance (without being constrained by the limitations of RDF tech); one could actually track that "Max Mustermann is built from data contributed by 5 datasets Unhide harvested", etc.
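Regarding the indexing point above, a generic SOLR document could be derived from any entity model by walking its fields. A minimal sketch (the `to_solr_doc` helper and the dynamic-field naming convention are assumptions for illustration, not existing Unhide code):

```python
from pydantic import BaseModel


def to_solr_doc(entity: BaseModel) -> dict:
    """Derive a flat SOLR-style document from an entity model.

    Uses the pydantic field values and types to pick a suffix-based
    dynamic field name (a common SOLR convention: *_s for strings,
    *_ss for multi-valued strings, *_i for integers, *_b for booleans).
    """
    doc = {"id": getattr(entity, "id", None), "type_s": type(entity).__name__}
    for name, value in entity.model_dump(exclude_none=True).items():
        if name == "id":
            continue
        if isinstance(value, bool):      # check bool before int (bool is an int subclass)
            doc[f"{name}_b"] = value
        elif isinstance(value, int):
            doc[f"{name}_i"] = value
        elif isinstance(value, list):
            doc[f"{name}_ss"] = [str(v) for v in value]
        else:
            doc[f"{name}_s"] = str(value)
    return doc
```

And for the entity resolution / "uplifting rules" point, a heuristic in the spirit of the somesy approach (ORCID first, then email, then name) could look roughly like this; the exact order and fallbacks are invented for the example:

```python
def same_person(a: dict, b: dict) -> bool:
    """Decide whether two person records likely refer to the same person.

    Order of evidence: matching ORCID is decisive, then email,
    then an exact (case-insensitive) name match as a weak fallback.
    """
    if a.get("orcid") and b.get("orcid"):
        return a["orcid"] == b["orcid"]
    if a.get("email") and b.get("email"):
        return a["email"].lower() == b["email"].lower()
    if a.get("name") and b.get("name"):
        return a["name"].strip().lower() == b["name"].strip().lower()
    return False
```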
Other thoughts
- With this architecture, one could still take the "dataset-based entities" that come directly from the original repository and compare them to the (hopefully) more complete and richer entities from the entity DB (i.e. the "uplifted data")
- One could also provide both graphs (unprocessed triples thrown together, vs. the clean graph with improved/resolved entities)
- If certain harvested repositories had procedures for accepting corrected metadata, then repository-specific serializers based on the internal entity model could of course still be created (but I doubt that this loop will "close" in practice in this way)