
GitLab data pipeline todos

Things to advance the GitLab pipeline:

  1. Ignore all forks, ideally already in the GitLab project listing requests (see the listing sketch below this list).
     • So far I have not figured this out, since the listing only reports how many forks a project has, not whether the project itself is a fork.
  2. Better codemeta harvester: fewer failures and more data (DOI badges and other badges, repo and git status), i.e. fill up the codemeta.json more. We could also store additional metadata. For example, the person IDs are not good and AUTHORS files are parsed badly.

  3. Implement the "since" feature? Currently we just pull every repo, but one could speed up the pipeline by pulling only repos where GitLab reports activity since the last pipeline run (see the `last_activity_at` sketch below this list).

  4. Better quality of codemeta.json data. Some metadata created by the codemeta harvester is wrong.

  5. Implement a shallow git clone so that the full history and all the large files do not have to be checked out (see the clone sketch below this list).
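
A possible starting point for the fork filtering in item 1, sketched below in Python. This is only a sketch: the instance URL and token are placeholders, and it assumes that forked projects carry a `forked_from_project` entry in the project JSON returned by the GitLab Projects API, so forks can be dropped right after the listing request even though the listing itself offers no fork filter.

```python
# Sketch: drop forks right after listing projects from the GitLab API.
# Assumes forked projects include a "forked_from_project" key in their JSON;
# instance URL and token are placeholders.
import requests

GITLAB_API = "https://gitlab.example.org/api/v4"  # placeholder instance URL
TOKEN = "..."                                     # placeholder access token


def list_non_fork_projects():
    """Yield all visible projects that are not forks of another project."""
    page = 1
    while True:
        resp = requests.get(
            f"{GITLAB_API}/projects",
            params={"per_page": 100, "page": page},
            headers={"PRIVATE-TOKEN": TOKEN},
            timeout=30,
        )
        resp.raise_for_status()
        projects = resp.json()
        if not projects:
            break
        for project in projects:
            # Forked projects carry a "forked_from_project" entry; skip them.
            if "forked_from_project" not in project:
                yield project
        page += 1
```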

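For the "since" idea in item 3, one option (again only a sketch, not existing pipeline code) is to store the timestamp of the last pipeline run and compare it with each project's `last_activity_at` field from the listing; the GitLab Projects API also documents a `last_activity_after` query parameter that could push this filtering to the server side.

```python
# Sketch: only re-pull projects that report activity since the last run.
# The last-run timestamp handling and the example project record are placeholders.
from datetime import datetime, timezone


def changed_since(project: dict, last_run: datetime) -> bool:
    """Return True if the project reports activity after the last pipeline run."""
    last_activity = datetime.fromisoformat(
        project["last_activity_at"].replace("Z", "+00:00")
    )
    return last_activity > last_run


# Minimal example with a project record shaped like the GET /projects output:
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)        # placeholder last-run timestamp
project = {"last_activity_at": "2024-03-15T09:30:00.000Z"}  # placeholder project entry
print(changed_since(project, last_run))                     # True -> worth pulling again
```
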
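For the shallow clone in item 5, `git clone --depth 1` fetches only the most recent commit instead of the full history. A minimal sketch, with repository URL and target directory as placeholders:

```python
# Sketch: shallow clone to avoid pulling the full history of a repository.
import subprocess


def shallow_clone(repo_url: str, target_dir: str) -> None:
    """Clone only the most recent commit instead of the full history."""
    subprocess.run(
        ["git", "clone", "--depth", "1", repo_url, target_dir],
        check=True,
    )


# Placeholder repository URL and target directory:
shallow_clone("https://gitlab.example.org/group/project.git", "/tmp/project")
```

If only a few metadata files are needed per repository, a partial clone (`--filter=blob:none`) combined with a sparse checkout could cut the transferred data further, provided the GitLab server supports partial clone.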