From e68d4a9bb50e228957492c5760a863ee00fdabae Mon Sep 17 00:00:00 2001
From: Paul Millar <paul.millar@desy.de>
Date: Tue, 31 Dec 2024 14:38:10 +0100
Subject: [PATCH] Add possibility to skip harvesting

Motivation:

Some OAI-PMH endpoints are broken; moreover, they're broken in such a
way that harvesting from them wastes a lot of time without producing
useful information.

The specific example is the ISIS endpoint, which is both very slow (~10
seconds per request) and, after ~9 hours of harvesting, returns a
resumptionToken that causes the subsequent ListIdentifiers request to
fail.

The ESS endpoint is also broken.  This is annoying too, but the impact
is smaller because of the existing special handling for servers that
don't provide useful information.

The goal is to allow harvesting to be disabled selectively, while
continuing to update high-level OAI-PMH information based on the
Identify request.

Modification:

Add a `skip-harvesting` boolean option to the `oai-pmh-endpoint` entry.
If it is set to true, harvesting is skipped for that endpoint.
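
As a rough sketch of the intended semantics (not the code in this
patch): the option is read from each endpoint's YAML entry, and a
missing or false value means harvesting proceeds as before.  The
example.org links and the `harvest?` helper are hypothetical and only
serve as illustration.

```ruby
require 'yaml'

# Hypothetical, trimmed-down endpoint entries; `skip-harvesting` is the
# only key taken from this patch.
config = YAML.safe_load(<<~YAML)
  - link: https://example.org/oai
    skip-harvesting: true
  - link: https://example.net/oai
YAML

# Hypothetical helper: harvest unless the option is present and truthy.
def harvest?(endpoint)
  !endpoint['skip-harvesting']
end

config.each do |endpoint|
  action = harvest?(endpoint) ? 'harvest' : 'skip harvesting'
  puts "#{endpoint['link']}: #{action}"
end
```

Treating an absent key as "do harvest" keeps existing entries working
without any changes to the data file.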

Result:

It's now possible to update all endpoints without spending a long and
fruitless time harvesting from broken ones.
---
 _data/facilities.yml      |  2 ++
 scripts/update_oai-pmh.rb | 11 ++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/_data/facilities.yml b/_data/facilities.yml
index 35f9e7a..0de75ea 100644
--- a/_data/facilities.yml
+++ b/_data/facilities.yml
@@ -179,6 +179,7 @@
       link: https://oai.panosc.ess.eu/openaire/oai
       last-check: 2024-12-22
       status: Active
+      skip-harvesting: true
       items:
         count: 0
       adminAddress:
@@ -479,6 +480,7 @@
     oai-pmh-endpoint:
       link: https://icatisis-prod.esc.rl.ac.uk/oaipmh/request
       status: Error
+      skip-harvesting: true
       last-check: 2024-12-22
       adminAddress:
       - isisdata@stfc.ac.uk
diff --git a/scripts/update_oai-pmh.rb b/scripts/update_oai-pmh.rb
index 1e91caa..0786cab 100644
--- a/scripts/update_oai-pmh.rb
+++ b/scripts/update_oai-pmh.rb
@@ -340,7 +340,7 @@ def count_identifiers(endpoint, prefix)
 end
 
 
-def query_oai_pmh_endpoint(endpoint)
+def query_oai_pmh_endpoint(endpoint, do_harvesting)
     status, adminAddress = check_oai_pmh_endpoint(endpoint)
     if status == "Error"
         return status, [], {}, 0, {}
@@ -355,6 +355,10 @@ def query_oai_pmh_endpoint(endpoint)
 
     set_names = list_sets(endpoint)
 
+    if !do_harvesting
+        return "Active", adminAddress, {}, 0, {}
+    end
+
     begin
         total_count, set_counts = count_identifiers(endpoint, dc_prefix)
     rescue StandardError => e
@@ -385,7 +389,8 @@ facilities.each do |facility|
 
     name = facility['short-name']
     puts "Checking OAI-PMH endpoint for #{name}: #{oai_pmh_endpoint}"
-    status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint)
+    do_harvesting = !oai_pmh['skip-harvesting']
+    status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint, do_harvesting)
 
     oai_pmh['status'] = status
 
@@ -395,7 +400,7 @@ facilities.each do |facility|
 
     oai_pmh.delete('items')
 
-    if status == "Active"
+    if status == "Active" && do_harvesting
         items = {}
         oai_pmh['items'] = items
         items['count'] = total_count
-- 
GitLab