Commit e68d4a9b authored by Paul Millar

Add possibility to skip harvesting

Motivation:

Some OAI-PMH endpoints are broken; moreover, they're broken in such a
way that harvesting from them wastes a lot of time without producing
useful information.

The specific example is the ISIS endpoint, which is both very slow (~10
seconds per request) and, after ~9 hours of harvesting, returns a
resumptionToken that results in failures in a subsequent ListIdentifiers
request.

The ESS endpoint is also broken.  While also annoying, the impact is
less because of special handling when a server hasn't provided useful
information.

The goal is to allow selective disabling of harvesting while continuing
to update high-level OAI-PMH information based on the Identify request.

Modification:

Add a `skip-harvesting` boolean option.  If it is set to `true` then
harvesting is skipped for that endpoint.
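
As a rough illustration of the option's semantics (not part of the commit
itself), the sketch below parses a hypothetical YAML fragment and derives
the harvesting flag by negating `skip-harvesting`, as the script does.  The
facility names, URLs and the `harvesting_enabled?` helper are assumptions
made for this example; only the `short-name`, `oai-pmh-endpoint` and
`skip-harvesting` keys come from the change.

```ruby
require 'yaml'

# Hypothetical facility entries; only the key names mirror the real file.
FACILITIES_YAML = <<~YAML
  - short-name: Example-A
    oai-pmh-endpoint:
      link: https://example.org/oai
      skip-harvesting: true
  - short-name: Example-B
    oai-pmh-endpoint:
      link: https://example.net/oai
YAML

# Hypothetical helper: harvesting stays enabled unless the option is truthy.
def harvesting_enabled?(oai_pmh)
  !oai_pmh['skip-harvesting']
end

facilities = YAML.safe_load(FACILITIES_YAML)
facilities.each do |facility|
  oai_pmh = facility['oai-pmh-endpoint']
  action = harvesting_enabled?(oai_pmh) ? 'harvest' : 'skip harvesting'
  puts "#{facility['short-name']}: #{action}"
end
```

With this negation, an endpoint that omits the key keeps harvesting
enabled, so existing entries need no change.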

Result:

It's possible to update all endpoints without a very long and fruitless
time spent harvesting from broken endpoints.
parent ca76d30d
1 merge request: !64 update_oai-pmh Add a back-off strategy, to allow a service to recover
......@@ -179,6 +179,7 @@
link: https://oai.panosc.ess.eu/openaire/oai
last-check: 2024-12-22
status: Active
+skip-harvesting: true
items:
count: 0
adminAddress:
......@@ -479,6 +480,7 @@
oai-pmh-endpoint:
link: https://icatisis-prod.esc.rl.ac.uk/oaipmh/request
status: Error
+skip-harvesting: true
last-check: 2024-12-22
adminAddress:
- isisdata@stfc.ac.uk
......
......@@ -340,7 +340,7 @@ def count_identifiers(endpoint, prefix)
end
-def query_oai_pmh_endpoint(endpoint)
+def query_oai_pmh_endpoint(endpoint, do_harvesting)
status, adminAddress = check_oai_pmh_endpoint(endpoint)
if status == "Error"
return status, [], {}, 0, {}
......@@ -355,6 +355,10 @@ def query_oai_pmh_endpoint(endpoint)
set_names = list_sets(endpoint)
+if !do_harvesting
+return "Active", adminAddress, {}, 0, {}
+end
begin
total_count, set_counts = count_identifiers(endpoint, dc_prefix)
rescue StandardError => e
......@@ -385,7 +389,8 @@ facilities.each do |facility|
name = facility['short-name']
puts "Checking OAI-PMH endpoint for #{name}: #{oai_pmh_endpoint}"
-status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint)
+do_harvesting = !oai_pmh['skip-harvesting']
+status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint, do_harvesting)
oai_pmh['status'] = status
......@@ -395,7 +400,7 @@ facilities.each do |facility|
oai_pmh.delete('items')
-if status == "Active"
+if status == "Active" && do_harvesting
items = {}
oai_pmh['items'] = items
items['count'] = total_count
......
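
The control flow introduced by this change can be sketched in isolation as
follows.  The `FakeEndpoint` struct and the `query_endpoint` function are
hypothetical stand-ins for the real network calls (`check_oai_pmh_endpoint`,
`list_sets`, `count_identifiers`) and the values they return are made up;
the sketch only aims to show that the cheap status check still runs while
the expensive counting step is skipped.

```ruby
# Hypothetical stand-in for a remote endpoint; it records how often the
# expensive harvesting step is invoked.
FakeEndpoint = Struct.new(:status, :admin, :harvest_calls) do
  def harvest
    self.harvest_calls += 1
    [42, { 'set-a' => 42 }]
  end
end

# Simplified query with the same return shape as in the diff:
# status, adminAddress, set_names, total_count, set_counts.
def query_endpoint(endpoint, do_harvesting)
  return ['Error', [], {}, 0, {}] if endpoint.status == 'Error'
  # Early exit: report the endpoint as active without counting identifiers.
  return ['Active', endpoint.admin, {}, 0, {}] unless do_harvesting

  total_count, set_counts = endpoint.harvest
  ['Active', endpoint.admin, { 'set-a' => 'Set A' }, total_count, set_counts]
end

endpoint = FakeEndpoint.new('Active', ['admin@example.org'], 0)
status, _admin, _sets, total, _counts = query_endpoint(endpoint, false)
puts "#{status}, total=#{total}, harvest calls=#{endpoint.harvest_calls}"
```

Because the skipped case returns the same five-element shape as the normal
case, the caller can destructure the result unchanged and only needs to
consult `do_harvesting` when deciding whether to record item counts.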