Review & update averaging algorithm of scores for the sunburst plot
The following analysis was performed on data collected with version 3 of the toolbox and presented with version 3 of the dashboard.
The table below shows the scores averaged over all unique IDs found in the database WHERE publication.type = 'Dataset' AND publication.publication_year >= 2000 AND publication.publication_year <= 2025 AND reference.sub_type = 'IsSupplementedBy'.
| Test result | Locally averaged scores (v3) | Averaged scores shown in the dashboard (v3) | Rel. deviation |
|---|---|---|---|
| score_percent_A | 51.1% | 52.8% | 3.3% |
| score_percent_A1 | 39.7% | 41.6% | 4.8% |
| score_percent_A1.1 | 59.7% | 61.3% | 2.7% |
| score_percent_A1.2 | 59.7% | 61.3% | 2.7% |
| score_percent_F | 58.2% | 61.5% | 5.7% |
| score_percent_F1 | 89.0% | 91.4% | 2.7% |
| score_percent_F2 | 81.0% | 85.8% | 5.9% |
| score_percent_F3 | 19.4% | 22.6% | 16.5% |
| score_percent_F4 | 23.9% | 26.7% | 11.7% |
| score_percent_FAIR | 56.9% | 59.2% | 4.0% |
| score_percent_I | 66.7% | 67.7% | 1.5% |
| score_percent_I1 | 99.7% | 99.8% | 0.1% |
| score_percent_I2 | 16.6% | 15.0% | -9.6% |
| score_percent_I3 | 83.9% | 88.3% | 5.2% |
| score_percent_R | 52.3% | 55.4% | 5.9% |
| score_percent_R1 | 45.1% | 47.8% | 6.0% |
| score_percent_R1.1 | 11.0% | 21.6% | 96.4% |
| score_percent_R1.2 | 93.9% | 94.5% | 0.6% |
| score_percent_R1.3 | 59.1% | 60.2% | 1.9% |
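
The "Rel. Deviation" column is not defined explicitly above. The values are consistent with the dashboard score minus the locally averaged score, divided by the locally averaged score. A minimal sketch under that assumption (the function name `relative_deviation` is hypothetical and only used for illustration):

```python
def relative_deviation(local: float, dashboard: float) -> float:
    """Deviation of the dashboard value relative to the local value, in percent.

    Assumption: the locally averaged score is the reference (denominator).
    """
    return (dashboard - local) / local * 100.0


# Values taken from the table rows score_percent_R1.1 and score_percent_I2.
print(round(relative_deviation(11.0, 21.6), 1))  # 96.4
print(round(relative_deviation(16.6, 15.0), 1))  # -9.6
```
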
Comparing these locally averaged scores with the averages shown in the most recent test instance of the dashboard (subpage "Data in Helmholtz", with no center or research field selected) reveals deviations, most notably for score_percent_R1.1 (96.4%), score_percent_F3 (16.5%) and score_percent_F4 (11.7%).
My understanding is that these deviations arise because the dashboard currently calculates average scores center by center and then combines these per-center averages in a weighted manner. This method does NOT account for datasets that occur for multiple centers: such datasets enter the overall average once per center, which distorts the average scores. Dropping these multiple occurrences reduces the number of IDs considered in the average by 8.8 percent.
The averaging algorithm needs to be updated to remove this bias.
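
The effect can be illustrated with toy numbers. The sketch below is not the dashboard's code. It assumes that the per-center averages are weighted by the number of datasets per center, which the description above does not specify, and all records, names and scores are made up for illustration. It contrasts that method with an average over unique dataset IDs.

```python
from collections import defaultdict

# Hypothetical toy records: (dataset_id, center, score_percent).
records = [
    ("ds1", "center_a", 80.0),
    ("ds2", "center_a", 40.0),
    ("ds1", "center_b", 80.0),  # ds1 is attributed to both centers
    ("ds3", "center_b", 20.0),
]


def weighted_center_average(records):
    """Average per center, then weight each center mean by its dataset count.

    Assumption: the weight is the number of datasets per center.
    """
    by_center = defaultdict(list)
    for _, center, score in records:
        by_center[center].append(score)
    total = sum(len(scores) for scores in by_center.values())
    return sum(
        (sum(scores) / len(scores)) * len(scores) for scores in by_center.values()
    ) / total


def unique_id_average(records):
    """Average over unique dataset IDs, counting each dataset exactly once."""
    score_by_id = {}
    for dataset_id, _, score in records:
        score_by_id.setdefault(dataset_id, score)
    return sum(score_by_id.values()) / len(score_by_id)


biased = weighted_center_average(records)  # ds1 is counted twice
unbiased = unique_id_average(records)      # ds1 is counted once
print(biased, round(unbiased, 1))
```

In this toy case the center-by-center method yields 55.0 while the unique-ID average is about 46.7, because the high-scoring dataset shared by both centers is counted twice. Deduplicating by ID before averaging, as in `unique_id_average`, is one possible way to remove the bias.
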