[feature] Expose rich schema.org JSON-LD metadata for each entry
To make the software easier to find on the web, and so that the Helmholtz Metadata Collaboration can harvest the software metadata and inject it into the Helmholtz knowledge graph (https://search.unhide.helmholtz-metadaten.de/), it is important that the metadata is exposed in a standard web format (as discussed briefly at deRSE24). It might be useful to add these changes to the base version of the Research Software Directory, so that other instances get them as well.
Details on how this can be done:
- What the data should look like:
We would prefer codemeta.json (https://codemeta.github.io/), which is the richest standard with schema.org semantics available for software. There is tooling around it for validation and for users to create such a file. You can find examples in the user guide and tools in the tooling section of the documentation. You could also allow users to upload their codemeta.json, or extract it directly from their repository if they provide one. This could also ease the creation of a new entry by prefilling most of the metadata.
Alternatively, one could use the less rich https://schema.org/SoftwareSourceCode, which is how DataCite exposes software.
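To illustrate the target format, here is a minimal sketch of building a CodeMeta-style record from an entry. The function name `build_codemeta`, the input field names (`given`, `family`), and all example values are assumptions for illustration and are not taken from the Research Software Directory data model; only the `@context` URL and the property names come from CodeMeta 2.0 and schema.org.

```python
import json

# CodeMeta 2.0 JSON-LD context; property names below are schema.org/CodeMeta terms.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def build_codemeta(name, description, code_repository, authors):
    """Return a minimal CodeMeta-style JSON-LD record as a plain dict.

    `authors` is a list of dicts with hypothetical keys "given" and "family".
    """
    return {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "codeRepository": code_repository,
        "author": [
            {
                "@type": "Person",
                "givenName": author["given"],
                "familyName": author["family"],
            }
            for author in authors
        ],
    }


# Placeholder values, not a real entry.
record = build_codemeta(
    name="example-tool",
    description="A placeholder software entry.",
    code_repository="https://example.org/git/example-tool",
    authors=[{"given": "Jane", "family": "Doe"}],
)
print(json.dumps(record, indent=2))
```

A real record would carry many more properties (license, version, keywords, and so on); the same dict could be served as a downloadable codemeta.json or embedded in the entry page.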
- How to expose the metadata:
There are several possible ways, and more than one could be supported.
2.1 For indexing and crawlers it is best if the metadata is exposed on the page of each entry. A good way to do this is in the header of the page, in a <script type="application/ld+json"> tag (see https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data). If done right, tools like https://pypi.org/project/extruct/ can pick it up, so you can use this to check an implementation.
Zenodo and Indico have such implementations.
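A rough sketch of option 2.1, assuming the page is rendered server-side and the metadata is available as a dict. It uses only the Python standard library, so the extractor here is a simplified stand-in for what a tool like extruct does, not its API; `render_jsonld_script` and `JsonLdExtractor` are hypothetical names.

```python
import json
from html.parser import HTMLParser


def render_jsonld_script(metadata):
    """Serialize metadata into a JSON-LD script element for the page header."""
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the data round-trips unchanged.
    payload = json.dumps(metadata).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Placeholder metadata and page, only to show the round trip.
metadata = {"@context": "https://schema.org", "@type": "SoftwareSourceCode", "name": "example-tool"}
page = f"<html><head>{render_jsonld_script(metadata)}</head><body></body></html>"

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.items)
```

Running the same kind of check against a deployed entry page is a quick way to confirm that crawlers will see the embedded metadata.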
2.2 Via a standard API/protocol like OAI-PMH (https://www.openarchives.org/OAI/2.0/guidelines-harvester.htm) or any of the others, up to linked-data-specific APIs. I recommend a general one. (Best practices here: https://www.w3.org/TR/ld-bp/#uri-design-principles, https://w3c.github.io/json-ld-bp/)
For harvesting it is important, or at least nice, to have a list of the URLs of all entries together with when they were last updated. Some protocols provide this; otherwise an RSS feed or a sitemap would do. For Google and others to crawl the directory, you need to allow crawling and indexing of the software pages in the robots.txt file.
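As a sketch of the sitemap variant, the snippet below builds a sitemap with a last-modified date per entry and checks a robots.txt policy against an entry URL. The helper names, the example.org URLs, and the robots.txt content are assumptions for illustration; only the sitemap namespace and the standard-library calls are real.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified) pairs.

    `last_modified` is expected as a W3C date string such as "2024-03-05".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder entries and policy.
entries = [
    ("https://example.org/software/example-tool", "2024-03-05"),
    ("https://example.org/software/other-tool", "2024-02-17"),
]
sitemap_xml = build_sitemap(entries)
robots_txt = "User-agent: *\nDisallow: /admin/\n"

print(sitemap_xml)
print(is_crawlable(robots_txt, "https://example.org/software/example-tool"))
```

A harvester can then compare each `lastmod` value with the time of its previous run and fetch only the entries that changed.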
Ideally the metadata would contain most of what is displayed on the webpage for each software entry. On the landing page of the software directory you could also expose some metadata about the software directory itself.
The metadata is particularly useful if it contains persistent identifiers for other research entities, such as the ORCIDs of the authors and the DOI of any associated publication, and of course the link to the source code or binaries.
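One possible way to check an exposed record for these identifiers is sketched below. It assumes authors carry their ORCID as `@id` and that an associated publication, if any, is given under the CodeMeta property `referencePublication` with a DOI URL as `@id`. The function name, the simplified patterns, and the example record are illustrative assumptions; the ORCID and DOI shown are well-known test identifiers, not real entry data.

```python
import re

# Simplified patterns: shape checks only, no checksum or resolver lookup.
ORCID_URL = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
DOI_URL = re.compile(r"^https://doi\.org/10\.\d{4,9}/\S+$")


def missing_identifiers(record):
    """Return a list of human-readable problems found in a metadata record."""
    problems = []
    for author in record.get("author", []):
        if not ORCID_URL.match(author.get("@id", "")):
            problems.append(f"author without ORCID: {author.get('familyName', '?')}")
    # A publication is optional, but if present it should carry a DOI.
    publication = record.get("referencePublication")
    if publication is not None and not DOI_URL.match(publication.get("@id", "")):
        problems.append("referencePublication without DOI")
    if not record.get("codeRepository"):
        problems.append("missing codeRepository")
    return problems


# Placeholder record using test identifiers.
example_record = {
    "@type": "SoftwareSourceCode",
    "name": "example-tool",
    "codeRepository": "https://example.org/git/example-tool",
    "author": [
        {"@type": "Person", "@id": "https://orcid.org/0000-0002-1825-0097", "familyName": "Carberry"},
        {"@type": "Person", "familyName": "Doe"},
    ],
    "referencePublication": {"@type": "ScholarlyArticle", "@id": "https://doi.org/10.5555/12345678"},
}
print(missing_identifiers(example_record))
```

Such a check could run when an entry is saved, so that maintainers are nudged to add the identifiers that make the record linkable in a knowledge graph.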
For any questions/help on this feel free to contact me or Volker Hofmann. Thanks a lot! Keep up the great work!