Commit e68d4a9b authored by Paul Millar

Add possibility to skip harvesting

Motivation:

Some OAI-PMH endpoints are broken; moreover, they're broken in such a
way that harvesting from them wastes a lot of time without producing
useful information.

The specific example is the ISIS endpoint, which is both very slow (~10
seconds per request) and, after ~9 hours of harvesting, returns a
resumptionToken that results in failures in a subsequent ListIdentifiers
request.

The ESS endpoint is also broken.  This is annoying too, but the impact
is smaller because of the special handling that applies when a server
hasn't provided useful information.

The goal is to allow selective disabling of harvesting while continuing
to update high-level OAI-PMH information based on the Identify request.

Modification:

Add a `skip-harvesting` boolean option.  If it is set to true then
harvesting is skipped for that endpoint.
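
As a rough illustration of how the option is read, the sketch below parses
a made-up two-entry catalogue and applies the same negation the diff uses
(`!oai_pmh['skip-harvesting']`).  The key names come from this commit; the
facility names, the example.org URLs and the `harvesting_enabled?` helper
are assumptions introduced only for this example.

```ruby
require 'yaml'

# Hypothetical catalogue: only the key names are taken from the commit.
catalogue = YAML.safe_load(<<~CATALOGUE)
  - short-name: skipped
    oai-pmh-endpoint:
      link: https://example.org/oai
      skip-harvesting: true
  - short-name: harvested
    oai-pmh-endpoint:
      link: https://example.com/oai
CATALOGUE

# Hypothetical helper.  A missing key reads as nil and !nil is true,
# so harvesting stays enabled unless the option is explicitly truthy.
def harvesting_enabled?(oai_pmh)
  !oai_pmh['skip-harvesting']
end

decisions = catalogue.to_h do |facility|
  [facility['short-name'], harvesting_enabled?(facility['oai-pmh-endpoint'])]
end
puts decisions
```

Because the check is a plain negation, endpoints without the option behave
exactly as they did before this change.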

Result:

It's now possible to update all endpoints without spending a long and
fruitless time harvesting from broken ones.
parent ca76d30d
Branches joss
1 merge request: !64 update_oai-pmh Add a back-off strategy, to allow a service to recover
@@ -179,6 +179,7 @@
     link: https://oai.panosc.ess.eu/openaire/oai
     last-check: 2024-12-22
     status: Active
+    skip-harvesting: true
     items:
       count: 0
     adminAddress:
@@ -479,6 +480,7 @@
   oai-pmh-endpoint:
     link: https://icatisis-prod.esc.rl.ac.uk/oaipmh/request
     status: Error
+    skip-harvesting: true
     last-check: 2024-12-22
     adminAddress:
     - isisdata@stfc.ac.uk
@@ -340,7 +340,7 @@ def count_identifiers(endpoint, prefix)
 end
-def query_oai_pmh_endpoint(endpoint)
+def query_oai_pmh_endpoint(endpoint, do_harvesting)
   status, adminAddress = check_oai_pmh_endpoint(endpoint)
   if status == "Error"
     return status, [], {}, 0, {}
@@ -355,6 +355,10 @@ def query_oai_pmh_endpoint(endpoint)
   set_names = list_sets(endpoint)
+  if !do_harvesting
+    return "Active", adminAddress, {}, 0, {}
+  end
   begin
     total_count, set_counts = count_identifiers(endpoint, dc_prefix)
   rescue StandardError => e
@@ -385,7 +389,8 @@ facilities.each do |facility|
   name = facility['short-name']
   puts "Checking OAI-PMH endpoint for #{name}: #{oai_pmh_endpoint}"
-  status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint)
+  do_harvesting = !oai_pmh['skip-harvesting']
+  status, adminAddress, set_names, total_count, set_count = query_oai_pmh_endpoint(oai_pmh_endpoint, do_harvesting)
   oai_pmh['status'] = status
@@ -395,7 +400,7 @@ facilities.each do |facility|
   oai_pmh.delete('items')
-  if status == "Active"
+  if status == "Active" && do_harvesting
     items = {}
     oai_pmh['items'] = items
     items['count'] = total_count
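
One consequence of the last two hunks taken together is that a skipped
endpoint keeps an "Active" status but loses its `items` block, because
`items` is deleted unconditionally and only rebuilt when harvesting ran.
The sketch below reproduces that flow with the network calls replaced by
a stub.  `query_stub`, `apply_result` and the sample counts are
hypothetical names and values, not code from the repository.

```ruby
# Hypothetical stand-in for query_oai_pmh_endpoint with no network access.
# The five-value return shape mirrors the diff; the values are made up.
def query_stub(do_harvesting)
  return ['Active', ['admin@example.org'], {}, 0, {}] unless do_harvesting

  ['Active', ['admin@example.org'], { 'all' => 'Everything' }, 42, { 'all' => 42 }]
end

# Hypothetical helper following the update logic visible in the diff:
# set the status, drop any old item counts, rebuild them only after a harvest.
def apply_result(oai_pmh, do_harvesting)
  status, _admin, _set_names, total_count, _set_counts = query_stub(do_harvesting)
  oai_pmh['status'] = status
  oai_pmh.delete('items')
  if status == 'Active' && do_harvesting
    oai_pmh['items'] = { 'count' => total_count }
  end
  oai_pmh
end

skipped = apply_result({ 'skip-harvesting' => true, 'items' => { 'count' => 0 } }, false)
harvested = apply_result({}, true)
p skipped
p harvested
```

Under these assumptions the skipped entry ends up without an `items` key,
while the harvested entry gets a fresh count.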