HMC Toolbox for Data Mining

REUSE status

Description

The HMC Toolbox for Data Mining is designed to harvest scientific literature publications by Helmholtz centers, find associated data publications and assess their "FAIR"-ness, meaning how findable (F), accessible (A), interoperable (I) and reusable (R) they are.

See "The FAIR Guiding Principles for scientific data management and stewardship" at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/ for more information.

The metadata of the literature publications is obtained chiefly from the OAI-PMH endpoints of the different Helmholtz centers; an exception is the HZB, whose publications are harvested from CSV files.

For each literature publication found that has a DOI, a request is made to the API of ScholeXplorer – The Data Interlinking Service, and linked data publications are obtained.
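A minimal sketch of such a lookup, assuming the public ScholeXplorer v2 endpoint and its sourcePid query parameter (verify both against the current API documentation):

```python
from urllib.parse import urlencode

# Assumed base URL of the ScholeXplorer Scholix API (check the current docs).
SCHOLIX_API = "https://api.scholexplorer.openaire.eu/v2/Links"

def scholix_query_url(doi: str, page: int = 0) -> str:
    """URL requesting all Scholix links whose source is the given DOI."""
    return f"{SCHOLIX_API}?{urlencode({'sourcePid': doi, 'page': page})}"

# The JSON response contains one Scholix link object per related publication;
# linked data publications can then be filtered by the target's type.
print(scholix_query_url("10.1234/example"))
```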

In the final processing step, the software sends a request to F-UJI, an automated FAIR Data Assessment Tool, for every data publication that has a DOI. This generates FAIR scores that indicate how FAIR these data publications are.
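By way of illustration, a single assessment request to a locally running F-UJI server could look as follows. The port and JSON field names are assumptions based on a typical F-UJI setup; consult your local F-UJI configuration before relying on them.

```python
import json
from urllib import request

# Assumed address of the local F-UJI server started by the Docker Compose setup.
FUJI_EVALUATE = "http://localhost:1071/fuji/api/v1/evaluate"

def fuji_request_body(doi: str) -> bytes:
    """JSON body for one FAIR assessment (field names are assumptions)."""
    return json.dumps({"object_identifier": doi, "use_datacite": True}).encode()

def assess(doi: str) -> dict:
    """POST the DOI to F-UJI and return the parsed score report."""
    req = request.Request(FUJI_EVALUATE, data=fuji_request_body(doi),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```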

Installation

The HMC Toolbox for Data Mining comes with an easy-to-use Docker container. Install Docker as well as Docker Compose V2, start Docker, and execute the following in your terminal to start the container:

docker compose up

A prerequisite for using the HMC Toolbox for Data Mining is a local instance of the F-UJI Server; the Docker Compose setup launches one automatically.

Alternatively, you can use the HMC Toolbox for Data Mining without Docker. In that case, please make sure that you have installed all dependencies. The easiest way is to install poetry and then run:

poetry install

Usage

Usage with Docker

Simply start the Docker container as described above.

Usage with Python

Run the application directly via Python by navigating to the root directory of your local checkout, hmc-toolbox-for-data-mining, and executing:

poetry run hmc-toolbox run

Usage as CLI Application

The HMC Toolbox for Data Mining can also be used via the CLI. A prerequisite is a Poetry environment in which you can run the CLI commands; how to set this up depends on the IDE you are using. Find instructions for PyCharm here.

Afterwards, you can run the following in your terminal:

hmc-toolbox run

It is also possible to execute only parts of the functionality provided by the HMC Toolbox for Data Mining. Run

hmc-toolbox --help

for more detailed information.

Output Structure

After successfully running the HMC Toolbox for Data Mining, the output is saved as follows:

  1. For each Helmholtz center processed, an eponymous output directory is created and zipped in the root directory, e.g. hmc-toolbox-for-data-mining/KIT.zip.
  2. Within it you will find two subdirectories: one contains the XML files harvested from an OAI-PMH endpoint, the other is named output. (In the special case of the HZB, you will find a file called pastalist.csv instead of the XML directory.)
  3. The directory output in turn contains two subdirectories, literature and datasets, both of which contain JSON files.
  4. Each JSON file in literature represents the metadata record of a literature publication and has the same name as the corresponding XML file (or, in the case of the HZB, a UUID).
  5. Each JSON file in datasets represents the metadata record of a data publication and is named according to the following schema:
[literature publication it belongs to].[index].json
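This naming schema can be parsed back with a small helper, as in this sketch (the function name is illustrative, not part of the toolbox):

```python
def parse_dataset_filename(name: str) -> tuple[str, int]:
    """Split '<literature id>.<index>.json' into its two components."""
    stem, index = name.removesuffix(".json").rsplit(".", 1)
    return stem, int(index)

# e.g. the fourth dataset linked to literature record 0001-some-record:
parse_dataset_filename("0001-some-record.3.json")  # -> ("0001-some-record", 3)
```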

One way of conveniently displaying the output data of the HMC Toolbox for Data Mining is to import it into a database and display it using the HMC FAIR Data Dashboard.

Documentation

A complete API documentation of the HMC Toolbox for Data Mining can be found in the repository subdirectory documentation in the form of HTML files. Simply open the index.html therein in your browser and navigate the entire documentation from there.

Roadmap

The project is ready for use as it stands. Nevertheless, development continues, as there are many potential improvements to be made.

Some features planned are:

  • Validate the relationship between a literature publication and the data publications found for it.
  • Find and exploit other sources of linked data publications.
  • Find and integrate other forms of FAIR assessment than F-UJI.

Support and Contributing

If you are interested in contributing to the project or have any questions not answered here or in the API documentation, please contact hmc-matter@helmholtz-berlin.de.

Disclaimer

Please note that the list of data publications obtained from data harvesting with the HMC Toolbox for Data Mining, as presented in the HMC FAIR Data Dashboard, is affected by method-specific biases and is neither complete nor entirely free of falsely identified data. If you wish to reuse the data shown in this dashboard for sensitive topics such as funding mechanisms, we highly recommend a manual review of the data.

We also recommend careful interpretation of evaluation results derived from automated FAIR assessment. The FAIR principles are a set of high-level principles, and applying them depends on the specific context, such as discipline-specific aspects. There are various quantitative and qualitative methods to assess the FAIRness of data (see also FAIRassist.org), but no definitive methodology (see Wilkinson et al.). For this reason, different FAIR assessment tools can provide different scores for the same dataset. We may include alternate, complementary methodologies in future versions of this project. To illustrate the potential of identifying systematic gaps with automated evaluation approaches, this dashboard shows evaluation results obtained from F-UJI as one selected approach. Both the F-UJI framework and the underlying metrics are subject to continuous development. The evaluation results can provide useful guidance for improving the FAIRness of data and repository infrastructure, respectively, but they focus on machine-actionable aspects, are limited with respect to human-understandable and discipline-specific aspects of metadata, and cannot truly assess how FAIR research data really is.

Authors

Please find all authors of this project in the CITATION.cff in this repository.

License

The HMC Toolbox for Data Mining is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License in the LICENSE.md in this repository or at

http://www.apache.org/licenses/LICENSE-2.0

Funding

This work was supported by the Helmholtz Metadata Collaboration (HMC), an incubator-platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative. The project was initiated by HMC Hub Matter at Helmholtz-Zentrum Berlin für Materialien und Energie GmbH (HZB) and was later supported by HMC Hub Aeronautics, Space and Transport (AST) at the German Aerospace Center (DLR).

Logo_HMC

  Logo_HZB

  Logo_DLR

Acknowledgements according to CRediT:

With respect to the HMC Dashboard on Open and FAIR Data in Helmholtz, the following individuals are mapped to CRediT (alphabetical order):

Astrid Gilein (AG); Alexander Schmidt (AS); Gabriel Preuß (GP); Mojeeb Rahman Sedeqi (MRS); Markus Kubin (MK); Oonagh Brendike-Mannix (OBM); Pascal Ehlers (PE); Tempest Glodowski (TG); Vivien Serve (VS)

Contributions according to CRediT are:
Conceptualization: AG, AS, GP, MRS, MK, OBM, VS; Data curation: AG, AS, GP, MK, PE, TG; Methodology: AG, AS, GP, MRS, MK; Project administration: MK; Software: AG, GP, PE, MRS, MK; Supervision: MK, OBM; Validation: GP, MRS, PE, MK, VS; Visualization: MRS, MK, VS; Writing – original draft: MRS, MK, OBM; Writing – review & editing: MRS, MK, OBM