# Data harvesting: extracting metadata from the web

## How does UnHIDE harvest data?

Data harvesting and mining for the knowledge graph is done by Harvester classes. For each interface, a specific Harvester class should be implemented. All Harvester classes should inherit from an existing Harvester or from the BaseHarvester, which currently specifies that each harvester (sketched after this list):

  1. needs a run method
  2. can read from the config.yml
  3. reads the time it was last run from a <harvesterclass>.last_run file
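A minimal sketch of this contract, assuming PyYAML for the configuration and standard-library file handling; apart from BaseHarvester, the run method, config.yml, and the <harvesterclass>.last_run file, all names are illustrative assumptions rather than the project's actual API:

```python
# Sketch of the harvester base-class contract described above.
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path

import yaml


class BaseHarvester(ABC):
    """Base class every specific harvester inherits from."""

    def __init__(self, config_path: str = "config.yml") -> None:
        # Every harvester can read the central YAML configuration.
        self.config = yaml.safe_load(Path(config_path).read_text())
        # The last-run timestamp is persisted per harvester class.
        self._last_run_file = Path(f"{type(self).__name__}.last_run")

    @property
    def last_run(self) -> datetime | None:
        # Read the time this harvester was last run, if it ever ran.
        if self._last_run_file.exists():
            return datetime.fromisoformat(self._last_run_file.read_text().strip())
        return None

    @abstractmethod
    def run(self) -> None:
        """Harvest all configured sources; implemented by subclasses."""


class SitemapHarvester(BaseHarvester):
    def run(self) -> None:
        # A real implementation would select record links from the sitemap
        # here; this sketch only records the run time.
        self._last_run_file.write_text(datetime.now().isoformat())
```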

Implemented harvester classes include:

| Name (CLI) | Class name | Interface | Comment |
|---|---|---|---|
| sitemap | SitemapHarvester | Sitemaps | Selecting record links from the sitemap requires expression matching. Relies on the advertools library. |
| oai | OAIHarvester | OAI-PMH | Relies on the oai library. For the library providers, Dublin Core is converted to schema.org (sketched after this table). |
| git | GitHarvester | Git, GitLab/GitHub API | Relies on codemetapy and codemeta-harvester as well as the GitLab/GitHub APIs. |
| datacite | DataciteHarvester | REST API & GraphQL endpoint | schema.org metadata is extracted through content negotiation. |
| feed | FeedHarvester | RSS & Atom feeds | Relies on the atoma library and works only if schema.org metadata can be extracted from the landing pages. Can only retrieve recent data; useful for event metadata. |
| indico | IndicoHarvester | Indico REST API | Extracts schema.org metadata directly through the API; requires an access token. |
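The Dublin Core to schema.org conversion mentioned for the OAI harvester can be pictured as a field crosswalk. The mapping below is a simplified assumption for illustration, not the project's complete crosswalk:

```python
# Illustrative Dublin Core -> schema.org field mapping; keys and the chosen
# @type are assumptions, not the actual conversion rules.
DC_TO_SCHEMA_ORG = {
    "dc:title": "name",
    "dc:creator": "creator",
    "dc:date": "datePublished",
    "dc:identifier": "identifier",
    "dc:description": "description",
}


def uplift_dublin_core(record: dict) -> dict:
    """Convert a flat Dublin Core record into a schema.org JSON-LD dict."""
    uplifted = {"@context": "https://schema.org", "@type": "CreativeWork"}
    for dc_key, schema_key in DC_TO_SCHEMA_ORG.items():
        if dc_key in record:
            uplifted[schema_key] = record[dc_key]
    return uplifted
```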

JSON-LD metadata from record landing pages is extracted via the extruct library whenever it cannot be retrieved directly through a standardized interface.
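A short sketch of such an extraction; the URL is hypothetical and error handling is deliberately minimal:

```python
# Extract JSON-LD blocks embedded in a record landing page with extruct.
import extruct
import requests

url = "https://example.org/record/123"  # hypothetical landing page
html = requests.get(url, timeout=30).text

# Restrict extraction to the JSON-LD syntax; extruct also supports others.
data = extruct.extract(html, base_url=url, syntaxes=["json-ld"])
for item in data["json-ld"]:
    print(item.get("@type"), item.get("name"))
```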

All harvesters are exposed on the hmc-unhide command-line interface. By default, they store the extracted metadata in the internal data model LinkedDataObject, whose serialization carries some provenance information alongside the original source data and the uplifted data, and which provides methods for validation.
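A hypothetical sketch of such a data model, inferred from the prose above; the field and method names are assumptions, not the project's actual API:

```python
# Sketch of a LinkedDataObject-style container: provenance, original source
# data, and uplifted data in one serializable record.
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class LinkedDataObject:
    original: dict[str, Any]  # metadata as harvested from the source
    derived: dict[str, Any]   # uplifted (e.g. schema.org) metadata
    provenance: dict[str, Any] = field(default_factory=dict)  # e.g. harvester name, timestamp, source URL

    def serialize(self) -> dict[str, Any]:
        # One serialization bundling provenance, source data, and uplifted data.
        return asdict(self)

    def validate(self) -> bool:
        # Placeholder check; a real version might validate against SHACL
        # shapes or a JSON schema.
        return bool(self.derived.get("@type"))
```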

A single central YAML configuration file, config.yml, specifies for each harvester class the sources to harvest as well as harvester- or source-specific configuration.
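To make the shape of this concrete, here is a sketch of how a harvester might read its section of such a file; the embedded YAML structure and all key names (sitemap, sources, url, match) are invented for illustration and do not reflect the actual config.yml schema:

```python
# Read a hypothetical per-harvester section from a config.yml-style file.
import yaml

EXAMPLE_CONFIG = """
sitemap:
  sources:
    - name: example-repo
      url: https://example.org/sitemap.xml
      match: ".*/records/.*"   # expression to select record links
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for source in config["sitemap"]["sources"]:
    print(source["name"], source["url"])
```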