Review & update averaging algorithm of scores for the sunburst plot

The following analysis uses data collected with version 3 of the toolbox and presented with version 3 of the dashboard.

The following table shows the scores averaged over ALL UNIQUE IDs found in the database, WHERE publication.type = 'Dataset' AND publication.publication_year >= 2000 AND publication.publication_year <= 2025 AND reference.sub_type = 'IsSupplementedBy'.

| Test result | Locally averaged scores v3 | Averaged scores v3 shown in the dashboard | Rel. deviation |
|---|---|---|---|
| score_percent_A | 51.1% | 52.8% | 3.3% |
| score_percent_A1 | 39.7% | 41.6% | 4.8% |
| score_percent_A1.1 | 59.7% | 61.3% | 2.7% |
| score_percent_A1.2 | 59.7% | 61.3% | 2.7% |
| score_percent_F | 58.2% | 61.5% | 5.7% |
| score_percent_F1 | 89.0% | 91.4% | 2.7% |
| score_percent_F2 | 81.0% | 85.8% | 5.9% |
| score_percent_F3 | 19.4% | 22.6% | 16.5% |
| score_percent_F4 | 23.9% | 26.7% | 11.7% |
| score_percent_FAIR | 56.9% | 59.2% | 4.0% |
| score_percent_I | 66.7% | 67.7% | 1.5% |
| score_percent_I1 | 99.7% | 99.8% | 0.1% |
| score_percent_I2 | 16.6% | 15.0% | -9.6% |
| score_percent_I3 | 83.9% | 88.3% | 5.2% |
| score_percent_R | 52.3% | 55.4% | 5.9% |
| score_percent_R1 | 45.1% | 47.8% | 6.0% |
| score_percent_R1.1 | 11.0% | 21.6% | 96.4% |
| score_percent_R1.2 | 93.9% | 94.5% | 0.6% |
| score_percent_R1.3 | 59.1% | 60.2% | 1.9% |
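
The relative deviations in the table are consistent with taking the dashboard value minus the locally averaged value, divided by the locally averaged value. A minimal sketch of that calculation follows; the function name `relative_deviation` is only illustrative, and the three rows are copied from the table above.

```python
def relative_deviation(local: float, dashboard: float) -> float:
    """Relative deviation of the dashboard value from the local value, in percent."""
    return (dashboard - local) / local * 100.0


# (locally averaged score, dashboard score), both in percent, taken from the table.
rows = {
    "score_percent_FAIR": (56.9, 59.2),
    "score_percent_I2": (16.6, 15.0),
    "score_percent_R1.1": (11.0, 21.6),
}

# Round to one decimal place, as in the table.
deviations = {
    name: round(relative_deviation(local, dashboard), 1)
    for name, (local, dashboard) in rows.items()
}

for name, value in deviations.items():
    print(f"{name}: {value}%")
```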

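Comparing these locally averaged scores with the averages shown in the most recent test instance of the dashboard (subpage "Data in Helmholtz", with no center or research field selected) reveals relative deviations ranging in magnitude from 0.1% (score_percent_I1) to 96.4% (score_percent_R1.1). The dashboard value is higher for every score except score_percent_I2.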

My understanding is that the deviations arise because the dashboard currently calculates average scores center by center and then combines these per-center averages in a weighted manner. This does NOT account for datasets that occur under multiple centers: such datasets enter the average more than once, which distorts the average scores. Dropping these multiple occurrences reduces the number of IDs considered in the average by 8.8 percent.
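
The effect can be illustrated with a small sketch on made-up data. Everything here is hypothetical: the record layout, the center names, the scores, and in particular the assumption that the dashboard weights each center's average by its number of datasets (the exact weighting used in the dashboard is not stated above).

```python
from statistics import mean

# Hypothetical toy records: (dataset_id, center, score in percent).
# Dataset "d2" is attributed to two centers, so it occurs twice.
records = [
    ("d1", "center_a", 40.0),
    ("d2", "center_a", 80.0),
    ("d2", "center_b", 80.0),
    ("d3", "center_b", 30.0),
]


def weighted_center_average(records):
    """Average per center, then weight each center mean by its dataset count (assumed weighting)."""
    by_center = {}
    for _, center, score in records:
        by_center.setdefault(center, []).append(score)
    total = sum(len(scores) for scores in by_center.values())
    return sum(mean(scores) * len(scores) for scores in by_center.values()) / total


def unique_id_average(records):
    """Average over unique dataset IDs, so each dataset is counted exactly once."""
    unique = {dataset_id: score for dataset_id, _, score in records}
    return mean(list(unique.values()))


def dropped_fraction(records):
    """Fraction of records removed when duplicate dataset IDs are dropped."""
    unique_ids = {dataset_id for dataset_id, _, _ in records}
    return 1 - len(unique_ids) / len(records)


print(weighted_center_average(records))  # "d2" is counted twice
print(unique_id_average(records))        # "d2" is counted once
print(dropped_fraction(records))
```

In this toy case the shared dataset has an above-average score, so counting it twice pulls the center-wise average upward relative to the average over unique IDs.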

The averaging algorithm needs to be updated to remove this bias.

Edited by Markus Kubin