From e68d4a9bb50e228957492c5760a863ee00fdabae Mon Sep 17 00:00:00 2001
From: Paul Millar <paul.millar@desy.de>
Date: Tue, 31 Dec 2024 14:38:10 +0100
Subject: [PATCH] Add possibility to skip harvesting

Motivation:

Some OAI-PMH endpoints are broken; moreover, they are broken in a way
that makes harvesting waste a lot of time without producing useful
information.

The specific example is the ISIS endpoint, which is very slow (~10
seconds per request) and, after ~9 hours of harvesting, returns a
resumptionToken that makes the subsequent ListIdentifiers request
fail.

The ESS endpoint is also broken.  While this is also annoying, the
impact is smaller because of the special handling that applies when a
server hasn't provided useful information.

The goal is to allow harvesting to be disabled selectively while
continuing to update high-level OAI-PMH information based on the
Identify request.

Modification:

Add a `skip-harvesting` boolean option.  If it is set to true,
harvesting is skipped for that endpoint.

Result:

It's possible to update all endpoints without spending a very long and
fruitless time harvesting from broken endpoints.
---
 _data/facilities.yml      |  2 ++
 scripts/update_oai-pmh.rb | 11 ++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/_data/facilities.yml b/_data/facilities.yml
index 35f9e7a..0de75ea 100644
--- a/_data/facilities.yml
+++ b/_data/facilities.yml
@@ -179,6 +179,7 @@
     link: https://oai.panosc.ess.eu/openaire/oai
     last-check: 2024-12-22
     status: Active
+    skip-harvesting: true
     items:
       count: 0
     adminAddress:
@@ -479,6 +480,7 @@
   oai-pmh-endpoint:
     link: https://icatisis-prod.esc.rl.ac.uk/oaipmh/request
     status: Error
+    skip-harvesting: true
     last-check: 2024-12-22
     adminAddress:
       - isisdata@stfc.ac.uk
diff --git a/scripts/update_oai-pmh.rb b/scripts/update_oai-pmh.rb
index 1e91caa..0786cab 100644
--- a/scripts/update_oai-pmh.rb
+++ b/scripts/update_oai-pmh.rb
@@ -340,7 +340,7 @@ def count_identifiers(endpoint, prefix)
 end
 
 
-def query_oai_pmh_endpoint(endpoint)
+def query_oai_pmh_endpoint(endpoint, do_harvesting)
   status, adminAddress = check_oai_pmh_endpoint(endpoint)
   if status == "Error"
     return status, [], {}, 0, {}
@@ -355,6 +355,10 @@ def query_oai_pmh_endpoint(endpoint)
 
   set_names = list_sets(endpoint)
 
+  if !do_harvesting
+    return "Active", adminAddress, {}, 0, {}
+  end
+
   begin
     total_count, set_counts = count_identifiers(endpoint, dc_prefix)
   rescue StandardError => e
@@ -385,7 +389,8 @@ facilities.each do |facility|
 
     name = facility['short-name']
     puts "Checking OAI-PMH endpoint for #{name}: #{oai_pmh_endpoint}"
-    status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint)
+    do_harvesting = !oai_pmh['skip-harvesting']
+    status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint, do_harvesting)
 
     oai_pmh['status'] = status
 
@@ -395,7 +400,7 @@ facilities.each do |facility|
     oai_pmh.delete('items')
 
-    if status == "Active"
+    if status == "Active" && do_harvesting
       items = {}
       oai_pmh['items'] = items
       items['count'] = total_count
-- 
GitLab
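
Note on how the flag is read: the following is a minimal, standalone Ruby sketch of the `!oai_pmh['skip-harvesting']` check, not part of the patch itself. The `harvesting_enabled?` helper, the `SAMPLE_YAML` constant and the "Example" facility with its URL are hypothetical and exist only for illustration; the key names `short-name`, `oai-pmh-endpoint` and `skip-harvesting` are taken from the diff above.

```ruby
require 'yaml'

# Hypothetical sample that mirrors the shape of _data/facilities.yml.
# The "Example" entry and its URL are made up for illustration.
SAMPLE_YAML = <<~YAML
  - short-name: ESS
    oai-pmh-endpoint:
      link: https://oai.panosc.ess.eu/openaire/oai
      skip-harvesting: true
  - short-name: Example
    oai-pmh-endpoint:
      link: https://example.org/oai
YAML

# Hypothetical helper wrapping the same expression the patch uses:
# harvesting runs unless 'skip-harvesting' holds a truthy value.
def harvesting_enabled?(oai_pmh)
  !oai_pmh['skip-harvesting']
end

YAML.safe_load(SAMPLE_YAML).each do |facility|
  oai_pmh = facility['oai-pmh-endpoint']
  state = harvesting_enabled?(oai_pmh) ? 'enabled' : 'skipped'
  puts "#{facility['short-name']}: harvesting #{state}"
end
```

An entry without the key yields `nil`, so harvesting stays enabled by default. Because the check relies on Ruby truthiness, a quoted string such as `"false"` would also count as truthy and would skip harvesting; writing the value as a plain YAML boolean avoids that.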