

data_mining

This is a collection of tools, ideas, and common utilities used for (meta)data mining and analysis of (data, software, journal) publications. This project was created in the context of efforts by the Helmholtz Metadata Collaboration (HMC).

THIS IS WORK IN PROGRESS

Contributions of any kind are always welcome!

Approach:

We establish data pipelines for certain data providers with as much high-quality linked metadata as possible and complement it. We store this as JSON-LD in an HMC-specific JSON-LD format, so that the resulting linked data is in the proper format to be integrated into the HGF knowledge graph. The metadata, in the form of JSON-LD or another standard, can be extracted directly from PID landing pages or via APIs. There are usually two steps within a mining process (a minimal sketch follows the list below):

  1. Get all PIDs (landing pages) which belong to a certain set, such as an institution or a field.
  2. Extract and complement detailed metadata for each of these PIDs.
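To make these two steps concrete, here is a minimal, hypothetical sketch in Python. It assumes a provider whose landing pages serve JSON-LD via HTTP content negotiation; the get_pids helper, its URL, and the example set name are placeholders and not part of this package.

import requests

def get_pids(provider_api_url):
    # step 1: ask the (hypothetical) provider API for all PIDs of a given set
    response = requests.get(provider_api_url, timeout=30)
    response.raise_for_status()
    return response.json().get("pids", [])

def get_jsonld(pid_url):
    # step 2: request JSON-LD from the PID landing page via content negotiation
    response = requests.get(pid_url, headers={"Accept": "application/ld+json"}, timeout=30)
    response.raise_for_status()
    return response.json()

for pid in get_pids("https://example.org/api/pids?set=my-institution"):
    record = get_jsonld(pid)
    # complement the metadata and store the JSON-LD record here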

Data pipelines contain code to execute harvesting from a local to a global level. They are exposed through a command line interface (CLI) and can thus easily be integrated into a cron job, which makes it possible to stream data into a database on a time-interval basis (see the example crontab entry after the list below). Data pipelines so far:

  • gitlab pipeline: harvests all public projects in Helmholtz gitlab instances and extracts and complements codemeta.jsonld files. (todo: also github)
  • scholix pipeline: extracts links and related resources for a list of given PIDs of any kind
  • invenio pipeline (todo): extract JSON-LD metadata from the standard
  • dataverse pipeline (todo): extract JSON-LD metadata and other standards of data publications from dataverse instances (like JülichData)
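For example, a crontab entry along the following lines (the schedule and output path are only illustrations) would run the gitlab pipeline every night at 03:00 and thereby refresh the harvested data regularly:

0 3 * * * hmc_unhide harvester run --name gitlab --out /data/gitlab_pipeline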

Documentation:

Currently only available as in-code documentation. In the future it will live under the docs folder and be hosted somewhere.

Installation

git clone git@codebase.helmholtz.cloud:hmc/hmc-public/unhide/data_mining.git
cd data_mining
pip install .

As a developer, install it with

pip install -e .[testing]

The individual pipelines have further dependencies outside of Python.

For example, the gitlab pipeline relies on codemeta-harvester (https://github.com/proycon/codemeta-harvester).

How to use this

For examples, look at the examples folder. The tests in the tests folder may also provide some insight. Once installed, there is also a command line interface (CLI), 'hmc_unhide'. For example, one can execute the gitlab pipeline via:

hmc_unhide harvester run --name gitlab --out ~/work/data/gitlab_pipeline
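The exact layout of the output depends on the pipeline. As a rough sketch, assuming the harvester writes one JSON-LD file per project into the chosen output directory (the *.jsonld pattern and the path are assumptions), the results could be loaded for further analysis like this:

import json
from pathlib import Path

out_dir = Path("~/work/data/gitlab_pipeline").expanduser()
records = []
for path in out_dir.rglob("*.jsonld"):
    # each file is assumed to contain one JSON-LD document
    with path.open(encoding="utf-8") as handle:
        records.append(json.load(handle))
print(f"loaded {len(records)} JSON-LD records")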

Collection of external sources:

Websites we would like to mine, APIs:

see the data_sources.csv file

also see https://os.helmholtz.de/open-science-in-der-helmholtz-gemeinschaft/open-research-data/forschungsdatenrepositorien-und-portale-in-der-helmholtz-gemeinschaft/

Other projects and aggregated data:

FREYA PID Graph (project ended end of 2020):
  • https://zenodo.org/record/4028383#.YRozCVuxVH5
  • https://blog.datacite.org/introducing-the-pid-graph/
  • FREYA PID Graph showcases: https://github.com/cernanalysispreservation/freya-pid-notebooks-showcase

FREYA project DataCite GraphQL API (https://graphql.org/):
  • https://datacite.org/assets/OpenHours_GraphQLAPIintro_June2020.pdf
  • https://api.datacite.org/graphql (DataCite API playground)

RORs in DataCite and CrossRef metadata: https://ror.readme.io/docs/include-ror-ids-in-doi-metadata

DataCite REST API guide https://support.datacite.org/docs/api

Scholix API official developer documentation: https://scholexplorer.openaire.eu/#/api
A python script to query the Scholix API: https://github.com/sefnyn/scholix
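As an illustration (not part of this package), a query against the ScholExplorer Links endpoint could look roughly like the sketch below; the endpoint URL and parameter name are assumptions based on the v2 API and should be checked against the official documentation above, and the DOI is just a placeholder.

import requests

# hypothetical example: fetch Scholix links for one source PID (DOI)
url = "https://api.scholexplorer.openaire.eu/v2/Links"
params = {"sourcePid": "10.5281/zenodo.4707307"}  # placeholder DOI
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
print(response.json())  # Scholix link records related to the given PID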

Some API documentation in the PID Forum: https://www.pidforum.org/t/apis-and-documentation/754

OAM project at ZB Jülich https://open-access-monitor.de https://www.fz-juelich.de/zb/DE/Leistungen/Open_Access/oam/oam_node.html

May be interesting as well (Repository metadata assessment): https://zenodo.org/record/4478638#.YRo2lluxVH6 https://metadatagamechangers.com/blog/2021/2/2/a-pid-feast-for-research-pidapalooza-2021

Since the default minimal metadata schemas are often not very rich (most often there is no affiliation and so on), we start from sources which collect metadata, like OpenAIRE or DataCite, but we might also have to mine the local metadata databases of HGF centers. From there one gets the researchers and detailed institutions. Then one can look for further data publications in a dataset like: https://zenodo.org/record/4707307
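As a rough illustration of this first step (not part of this package), records can be pulled from an aggregator such as DataCite via its REST API; the free-text query below is only an example, and the exact field names should be checked against the DataCite API guide linked above.

import requests

# hypothetical example: search DataCite for DOIs with a given affiliation
response = requests.get(
    "https://api.datacite.org/dois",
    params={"query": 'creators.affiliation.name:"Forschungszentrum Jülich"', "page[size]": 5},
    timeout=30,
)
response.raise_for_status()
for record in response.json().get("data", []):
    attributes = record.get("attributes", {})
    print(attributes.get("doi"), (attributes.get("titles") or [{}])[0].get("title"))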

License

data_mining is distributed under the terms and conditions of the MIT license which is specified in the LICENSE.txt file.

Acknowledgement

This project was supported by the Helmholtz Metadata Collaboration (HMC), an incubator-platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative.