data_mining
This is a collection of tools, ideas, and common utilities for (meta)data mining and the analysis of (data, software, journal) publications. This project was created in the context of efforts by the Helmholtz Metadata Collaboration (HMC).
THIS IS WORK IN PROGRESS
Contributions of any kind are always welcome!
Approach:
We establish data pipelines for certain data providers with as much 'high-quality' linked metadata as possible and complement it. We store this as JSON-LD in an HMC-specific JSON-LD format, so that the resulting linked data is in the proper format to be integrated into the HGF knowledge graph. The metadata, in the form of JSON-LD or another standard, can be extracted from PID landing pages directly or via APIs. There are usually two steps within a mining process:
- Get all PIDs (landing pages) which belong to a certain set, like an institution or research field.
- Extract and complement detailed metadata for each of these PIDs (a minimal sketch of this step follows below).
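As a minimal sketch of the second step, and assuming a plain `requests`-based approach (this is illustrative and not part of the package), embedded JSON-LD records can be pulled from a landing page like this:

```python
# Illustrative sketch (not part of this package): extract JSON-LD records
# embedded in <script type="application/ld+json"> tags of a PID landing page.
import json
from html.parser import HTMLParser

import requests


class JSONLDParser(HTMLParser):
    """Collect the contents of all <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            try:
                self.records.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # skip malformed blocks


def extract_jsonld(landing_page_url):
    """Return all JSON-LD records embedded in a landing page."""
    response = requests.get(landing_page_url, timeout=30)
    response.raise_for_status()
    parser = JSONLDParser()
    parser.feed(response.text)
    return parser.records
```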
Data pipelines contain code to execute harvesting from a local to a global level. They are exposed through a command-line interface (CLI) and are thus easily integrated into a cron job (see the crontab sketch after this list), which allows data to be streamed into a database on a time-interval basis. Data pipelines so far:
- gitlab pipeline: harvests all public projects in Helmholtz GitLab instances and extracts and complements codemeta.jsonld files. (todo: also GitHub)
- scholix pipeline: extracts links and related resources for a list of given PIDs of any kind.
- invenio pipeline (todo): extract JSON-LD metadata from Invenio instances.
- dataverse pipeline (todo): extract JSON-LD metadata and other standards of data publications from Dataverse instances (like JülichData).
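Since the pipelines are exposed through the CLI (see below), scheduling them is a one-line crontab entry; the schedule and output path here are just examples:

```
# Example crontab entry: run the gitlab pipeline every night at 02:00
0 2 * * * hmc_unhide harvester run --name gitlab --out ~/work/data/gitlab_pipeline
```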
Documentation:
Currently only in-code documentation is available. In the future, it will be placed under the docs folder and hosted somewhere.
Installation
```
git clone git@codebase.helmholtz.cloud:hmc/hmc-public/unhide/data_mining.git
cd data_mining
pip install .
```
As a developer, install with:

```
pip install -e .[testing]
```
The individual pipelines have further dependencies outside of Python. For example, the gitlab pipeline relies on codemeta-harvester (https://github.com/proycon/codemeta-harvester).
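To illustrate the first step of the gitlab pipeline, the following sketch lists all public projects of a GitLab instance via the standard REST API v4 (the instance URL is an example; the real pipeline delegates the metadata extraction to codemeta-harvester):

```python
# Sketch: enumerate public projects of a GitLab instance via the REST API v4.
import requests


def list_public_projects(base_url):
    """Return all public projects of a GitLab instance, following pagination."""
    projects, page = [], 1
    while True:
        response = requests.get(
            f"{base_url}/api/v4/projects",
            params={"visibility": "public", "per_page": 100, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:  # an empty page means we are past the last project
            return projects
        projects.extend(batch)
        page += 1


projects = list_public_projects("https://codebase.helmholtz.cloud")
print(len(projects), "public projects found")
```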
How to use this
For examples, look at the examples folder. The tests in the tests folder may also provide some insight.
Once installed, a command-line interface (CLI), hmc_unhide, is available. For example, one can execute the gitlab pipeline via:
```
hmc_unhide harvester run --name gitlab --out ~/work/data/gitlab_pipeline
```
Collection of external sources:
Websites and APIs we would like to mine:
see the data_sources.csv file
Other projects and aggregated data:
- FREYA PID Graph (project ended at the end of 2020): https://zenodo.org/record/4028383#.YRozCVuxVH5 and https://blog.datacite.org/introducing-the-pid-graph/
- FREYA PID Graph showcases: https://github.com/cernanalysispreservation/freya-pid-notebooks-showcase
- DataCite GraphQL API from the FREYA project (GraphQL: https://graphql.org/): https://datacite.org/assets/OpenHours_GraphQLAPIintro_June2020.pdf and https://api.datacite.org/graphql (DataCite API playground)
- RORs in DataCite and CrossRef metadata: https://ror.readme.io/docs/include-ror-ids-in-doi-metadata
- DataCite REST API guide: https://support.datacite.org/docs/api
- Scholix API official developer documentation: https://scholexplorer.openaire.eu/#/api ; a Python script to query the Scholix API: https://github.com/sefnyn/scholix
- API documentation in the PID Forum: https://www.pidforum.org/t/apis-and-documentation/754
- OAM project at ZB Jülich: https://open-access-monitor.de and https://www.fz-juelich.de/zb/DE/Leistungen/Open_Access/oam/oam_node.html
- May be interesting as well (repository metadata assessment): https://zenodo.org/record/4478638#.YRo2lluxVH6 and https://metadatagamechangers.com/blog/2021/2/2/a-pid-feast-for-research-pidapalooza-2021
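As a hedged sketch of querying the Scholix API listed above (the endpoint follows the Scholexplorer v2 documentation; the response field names are assumptions and should be checked against the actual API):

```python
# Sketch: fetch Scholix link records for a given PID from Scholexplorer.
# Field names ("result", "RelationshipType", "target", ...) are assumptions.
import requests

SCHOLIX_API = "https://api.scholexplorer.openaire.eu/v2/Links"


def get_links(pid):
    """Return Scholix link records where the given PID is the source."""
    response = requests.get(SCHOLIX_API, params={"sourcePid": pid}, timeout=30)
    response.raise_for_status()
    return response.json().get("result", [])


for link in get_links("10.5281/zenodo.4707307"):
    relation = link.get("RelationshipType", {}).get("Name")
    target = link.get("target", {})
    print(relation, "->", target.get("Title"))
```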
Since the default minimal metadata schemas are often not very rich (most often there is no affiliation and so on), we start from sources that aggregate metadata, like OpenAIRE or DataCite, but we may also have to mine local metadata databases of HGF centers. From there one gets the researchers and detailed institutions. Then one can look for further data publications in a dataset like https://zenodo.org/record/4707307.
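For instance, starting from DataCite, one could query its REST API for DOIs with a given creator affiliation; the query field name and the example affiliation below are assumptions, not tested queries:

```python
# Sketch: search the DataCite REST API for DOIs by creator affiliation.
# The query syntax is Elasticsearch-like; field names are assumptions.
import requests

response = requests.get(
    "https://api.datacite.org/dois",
    params={
        "query": 'creators.affiliation.name:"Forschungszentrum Jülich"',
        "page[size]": 5,
    },
    timeout=30,
)
response.raise_for_status()
for record in response.json()["data"]:  # JSON:API format: records under "data"
    titles = [t["title"] for t in record["attributes"].get("titles", [])]
    print(record["id"], titles)
```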
License
data_mining is distributed under the terms and conditions of the MIT license, which is specified in the LICENSE.txt file.
Acknowledgement
This project was supported by the Helmholtz Metadata Collaboration (HMC), an incubator-platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative.