HMC Toolbox for Data Mining
Description
The HMC Toolbox for Data Mining is designed to harvest scientific literature publications by Helmholtz centers, find the associated data publications, and assess their FAIRness, i.e. how findable (F), accessible (A), interoperable (I) and reusable (R) they are.
See "The FAIR Guiding Principles for scientific data management and stewardship" at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/ for more information.
The metadata of the literature publications is obtained chiefly from the OAI-PMH endpoints of the different Helmholtz centers; the exception is the HZB, whose publications are harvested from CSV files.
For each harvested literature publication that has a DOI, a request is made to the API of ScholeXplorer – The Data Interlinking Service, and any linked data publications are retrieved.
In the final processing step, the software sends a request to F-UJI, an automated FAIR data assessment tool, for every data publication that has a DOI. This generates FAIR scores that indicate how FAIR these data publications are.
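To illustrate the data-linking step, the following minimal Python sketch queries the public ScholeXplorer (Scholix) API for data publications linked to a literature DOI. The endpoint URL, parameter names, response fields, and the DOI used here are assumptions for illustration only and are not necessarily what the toolbox uses internally; consult the ScholeXplorer documentation and the toolbox source for the authoritative details.

```python
# Minimal sketch (assumption): look up data publications linked to a literature DOI
# via the public ScholeXplorer/Scholix API. Endpoint and field names may differ
# from what the HMC Toolbox for Data Mining actually uses.
import requests

SCHOLIX_URL = "https://api.scholexplorer.openaire.eu/v2/Links"  # assumed endpoint


def find_linked_datasets(doi: str) -> list[dict]:
    """Return raw Scholix link records whose source is the given literature DOI."""
    response = requests.get(SCHOLIX_URL, params={"sourcePid": doi}, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # Scholix responses typically wrap the individual links in a "result" list.
    return payload.get("result", [])


if __name__ == "__main__":
    for link in find_linked_datasets("10.5555/example-doi"):  # placeholder DOI
        target = link.get("target", {})
        print(target.get("Identifier"), target.get("Title"))
```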
Installation
The HMC Toolbox for Data Mining comes with an easy-to-use Docker container. Install Docker as well as Docker Compose V2, make sure Docker is running, and execute the following in your terminal to start the container:
docker compose up
A prerequisite for using the HMC Toolbox for Data Mining is a local instance of the F-UJI Server. The Docker Compose setup launches such an instance for you.
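For orientation, here is a minimal sketch of how a FAIR assessment request to a locally running F-UJI server could look. The host, port, credentials, payload fields, and the placeholder DOI are assumptions that depend on your F-UJI configuration; the toolbox itself may construct the request differently.

```python
# Minimal sketch (assumption): request a FAIR assessment for one DOI from a
# locally running F-UJI server. Host, port, credentials and payload fields
# depend on your F-UJI configuration and may differ from the toolbox defaults.
import requests

FUJI_ENDPOINT = "http://localhost:1071/fuji/api/v1/evaluate"  # assumed local instance


def assess_fairness(doi: str) -> dict:
    """Ask F-UJI to evaluate the data publication identified by the given DOI."""
    payload = {
        "object_identifier": doi,
        "use_datacite": True,
    }
    response = requests.post(
        FUJI_ENDPOINT,
        json=payload,
        auth=("username", "password"),  # replace with your F-UJI credentials
        timeout=300,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = assess_fairness("10.5555/example-doi")  # placeholder DOI
    # F-UJI reports aggregated scores in a "summary" block (structure may vary).
    print(result.get("summary", {}))
```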
Alternatively, you can also use the HMC Toolbox for Data Mining without Docker. In this case, please make sure that you have installed all dependencies. The easiest way is to install Poetry and then run:
poetry install
Usage
Usage with Docker
Simply start the Docker container as described above.
Usage with Python
Run the application directly via Python by navigating to your locally checked-out root directory hmc-toolbox-for-data-mining and executing:
poetry run hmc-toolbox run
Usage as CLI Application
The HMC Toolbox for Data Mining can also be used via the CLI. A prerequisite is that you configure a Poetry environment in which you can run the CLI commands. How this is done depends on the IDE you are using. Find instructions for PyCharm here.
Afterwards, you can run the following in your terminal:
hmc-toolbox run
It is also possible to execute only parts of the functionality provided by the HMC Toolbox for Data Mining.
Run
hmc-toolbox --help
for more detailed information.
Output Structure
After successfully running the HMC Toolbox for Data Mining, the output is saved in the following way:
- For each Helmholtz center processed, an eponymous output directory is created and zipped in the root directory, e.g. hmc-toolbox-for-data-mining/KIT.zip.
- Within it you will find 2 subdirectories: one contains the XML files harvested from an OAI-PMH endpoint, the other is named output. (In the special case of the HZB, instead of the directory with XML files you will find a file called pastalist.csv.)
- The directory output again contains 2 subdirectories called literature and datasets, both of which contain JSON files.
- Each JSON file in literature represents a metadata record of a literature publication and has the same name as the corresponding XML file or, in the case of the HZB, a UUID.
- Each JSON file in datasets represents a metadata record of a data publication and is named according to the following schema (see the sketch below): [literature publication it belongs to].[index].json
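To illustrate the layout described above, the following sketch walks an unzipped output directory of one center and pairs every dataset record with the literature record it belongs to, based on the naming schema. The directory path used here is an assumption for a locally unzipped archive; it is not part of the toolbox itself.

```python
# Minimal sketch: pair dataset records with their literature records in an
# unzipped output directory (e.g. the contents of KIT.zip). Paths are assumptions.
import json
from pathlib import Path


def load_center_output(center_dir: Path) -> dict[str, dict]:
    """Return {literature stem: {"literature": record, "datasets": [records]}}."""
    literature_dir = center_dir / "output" / "literature"
    datasets_dir = center_dir / "output" / "datasets"

    records: dict[str, dict] = {}
    for lit_file in literature_dir.glob("*.json"):
        records[lit_file.stem] = {
            "literature": json.loads(lit_file.read_text(encoding="utf-8")),
            "datasets": [],
        }

    for ds_file in datasets_dir.glob("*.json"):
        # Dataset file names follow "<literature stem>.<index>.json".
        lit_stem, _, _ = ds_file.stem.rpartition(".")
        if lit_stem in records:
            records[lit_stem]["datasets"].append(
                json.loads(ds_file.read_text(encoding="utf-8"))
            )
    return records


if __name__ == "__main__":
    summary = load_center_output(Path("KIT"))  # unzipped KIT.zip (assumed path)
    for stem, entry in summary.items():
        print(stem, "->", len(entry["datasets"]), "linked dataset record(s)")
```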
One way of conveniently displaying the output data of the HMC Toolbox for Data Mining is to import it into a database and display it using the HMC FAIR Data Dashboard.
Documentation
The complete API documentation of the HMC Toolbox for Data Mining can be found in the repository subdirectory documentation in the form of HTML files. Simply open the index.html therein in your browser and navigate the entire documentation from there.
Roadmap
The project is ready for use as it is. Nevertheless, development continues, as there are many potential improvements to be made.
Some planned features are:
- Validate the relationship between a literature publication and the data publications found for it.
- Find and exploit other sources of linked data publications.
- Find and integrate forms of FAIR assessment other than F-UJI.
Support and Contributing
If you are interested in contributing to the project or have any questions not answered here or in the API documentation, please contact hmc-matter@helmholtz-berlin.de.
Disclaimer
Please note that the list of data publications obtained from data harvesting using the HMC Toolbox for Data Mining, as presented in the HMC FAIR Data Dashboard, is affected by method-specific biases and is neither complete nor entirely free of falsely identified data. If you wish to reuse the data shown in this dashboard for sensitive topics such as funding mechanisms, we highly recommend a manual review of the data.
We also recommend careful interpretation of evaluation results derived from automated FAIR assessment. The FAIR principles are a set of high-level principles, and applying them depends on the specific context, such as discipline-specific aspects. There are various quantitative and qualitative methods to assess the FAIRness of data (see also FAIRassist.org), but no definitive methodology (see Wilkinson et al.). For this reason, different FAIR assessment tools can provide different scores for the same dataset. We may include alternate, complementary methodologies in future versions of this project.
To illustrate the potential of identifying systematic gaps with automated evaluation approaches, the dashboard shows evaluation results obtained from F-UJI as one selected approach. Both the F-UJI framework and the underlying metrics are subject to continuous development. Evaluation results obtained from F-UJI can be useful in providing guidance for improving the FAIRness of data and repository infrastructure, respectively, but they focus on machine-actionable aspects, are limited with respect to human-understandable and discipline-specific aspects of metadata, and cannot truly assess how FAIR research data really is.
Authors
Please find all authors of this project in the CITATION.cff in this repository.
License
The HMC Toolbox for Data Mining is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License in the LICENSE.md in this repository or at
http://www.apache.org/licenses/LICENSE-2.0
Funding
This work was supported by the Helmholtz Metadata Collaboration (HMC), an incubator platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative. The project was initiated by HMC Hub Matter at Helmholtz-Zentrum Berlin für Materialien und Energie GmbH (HZB) and was later supported by HMC Hub Aeronautics, Space and Transport (AST) at the German Aerospace Center (DLR).
CRediT:
With respect to the HMC Dashboard on Open and FAIR Data in Helmholtz, the following individuals are mapped to CRediT (alphabetical order):
Astrid Gilein (AG); Alexander Schmidt (AS); Gabriel Preuß (GP); Mojeeb Rahman Sedeqi (MRS); Markus Kubin (MK); Oonagh Brendike-Mannix (OBM); Pascal Ehlers (PE); Tempest Glodowski (TG); Vivien Serve (VS)
Contributions according to CRediT are:
Conceptualization: AG, AS, GP, MRS, MK, OBM, VS;
Data curation: AG, AS, GP, MK, PE, TG;
Methodology: AG, AS, GP, MRS, MK;
Project administration: MK;
Software: AG, GP, PE, MRS, MK;
Supervision: MK, OBM;
Validation: GP, MRS, PE, MK, VS;
Visualization: MRS, MK, VS;
Writing – original draft: MRS, MK, OBM;
Writing – review & editing: MRS, MK, OBM;