  1. Mar 12, 2025
  2. Feb 22, 2025
  3. Feb 21, 2025
    • Paul Millar's avatar
      Remove jekyll from gem install · 8445eea5
      Paul Millar authored
      Motivation:
      
      A problem suddenly appeared where CI/CD jobs started to fail.
      
      This is (seemingly) from the 'gem install' command pulling in a newer
      version of jekyll, which caused an incompatibility with one of its
      dependencies (jekyll --> sass-embedded).  Although the exact nature of
      the incompatibility is unclear, the install fails while building, with
      the error:
      
          NameError: uninitialized constant JSON::Fragment
      
      Modification:
      
      Remove jekyll from "gem install".  Use bundler to fetch jekyll instead.
      This should honour the version.
      
      Results:
      
      More reliable builds
      8445eea5
  4. Feb 05, 2025
  5. Jan 28, 2025
  6. Jan 20, 2025
  7. Jan 17, 2025
  8. Jan 15, 2025
  9. Jan 14, 2025
  10. Jan 13, 2025
  11. Jan 09, 2025
  12. Jan 08, 2025
  13. Jan 07, 2025
  14. Jan 05, 2025
  15. Jan 03, 2025
    • Paul Millar's avatar
      oai-pmh add extra type information · 50c41dbf
      Paul Millar authored
      Motivation:
      
      OAI-PMH provides information about resources.  The PaNOSC OAI-PMH
      endpoints are not limited to describing datasets; therefore, it's useful
      to show which of the items are of which type.
      
      Modification:
      
      Update the main repository cell to include type information.  If the
      repository is using sets then the sets table now includes type
      information.
      
      Result:
      
      A better understanding of which resources are available.
      50c41dbf
    • Paul Millar's avatar
      update_oai-pmh Add support for querying DataCite resourceType · 10f6f60b
      Paul Millar authored
      Motivation:
      
      OAI-PMH, by itself, doesn't identify the nature of the resource; rather,
      this is achieved by the metadata record itself.
      
      Just to mention it, OAI-PMH sets don't provide any guaranteed semantics;
      such semantics can be added through the set description, but there's no
      consensus or practice in doing this.
      
      Therefore, in order to categorise OAI-PMH items by type, we need to
      obtain records: listing identifiers isn't sufficient.  Moreover, Dublin
      Core (as used currently) doesn't support the fine-grained type semantics
      we would like to present.
      
      The DataCite metadata schema provides `resourceType` metadata, with the
      `resourceTypeGeneral` providing the coarse-grained type of the resource.
      This is what we would like to use.
      
      Modification:
      
      Add support for querying all records using the DataCite metadata format.  This
      task is very similar to the existing code that lists all identifiers of
      records with Dublin Core.
      
      The patch adds support for querying DataCite metadata as mostly a
      copy-and-paste of the existing code.  This is technical debt that future
      patches MUST address, through refactoring.
      
      The OAI-PMH client code is updated to support ListRecords requests.
      This is also a copy-n-paste, inducing further technical debt that future
      patches must address.
      
      Result:
      
      The facilities YAML file now includes a breakdown of OAI-PMH items based
      on their DataCite resourceTypeGeneral.
      10f6f60b
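
The per-type breakdown described in this commit could be computed along the following lines. This is a minimal sketch, not the code from the commit: it assumes the body of a ListRecords response (with DataCite metadata) is already available as a string, it uses Ruby's REXML library, and the method name `resource_type_counts` is hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper: count DataCite records by resourceTypeGeneral.
# Assumes `xml` is the body of an OAI-PMH ListRecords response whose
# records use the DataCite metadata schema.
def resource_type_counts(xml)
  counts = Hash.new(0)
  doc = REXML::Document.new(xml)
  # Walk every descendant element; `name` is the local name, so the
  # match works whether or not the document uses a namespace prefix.
  doc.root.each_recursive do |element|
    next unless element.name == 'resourceType'

    general = element.attributes['resourceTypeGeneral']
    counts[general || 'unknown'] += 1
  end
  counts
end
```

A real harvester would call this once per response page and sum the counts while following the resumptionToken.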
  16. Jan 02, 2025
    • Paul Millar's avatar
      update_oai-pmh fix URL handling for endpoints with no trailing slash · bd4503da
      Paul Millar authored
      Motivation:
      
      A bug was introduced with commit aee25f9f where OAI-PMH URLs without a
      trailing slash have one added; for example, the Identify request
      targeting PSI changes from `https://doi.psi.ch/oaipmh/oai?verb=Identify`
      to `https://doi.psi.ch/oaipmh/oai/?verb=Identify`.
      
      For some endpoints, this distinction is important, resulting in the
      requests failing.
      
      Modification:
      
      Fix regression so that the correct URL is used.
      
      Result:
      
      The PSI endpoint is now shown (correctly) to be working.
      bd4503da
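
One way to avoid this class of regression is to set only the query part of the endpoint URL and leave the path untouched. The sketch below uses Ruby's standard `uri` library; the helper name `oai_request_url` is hypothetical and this is not necessarily how the commit fixes the problem.

```ruby
require 'uri'

# Hypothetical helper: build an OAI-PMH request URL from an endpoint's
# base URL, leaving the path exactly as configured (a trailing slash is
# neither added nor removed).
def oai_request_url(base_url, params)
  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts oai_request_url('https://doi.psi.ch/oaipmh/oai', verb: 'Identify')
# prints https://doi.psi.ch/oaipmh/oai?verb=Identify
```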
    • Paul Millar's avatar
      update_oai-pmh Improve HTTP request headers · 08af574c
      Paul Millar authored
      Motivation:
      
      The `User-Agent` request header was not being sent due to a bug.
      
      The harvester lacks support for the `From` request header.
      
      Both HTTP request headers are recommended by the Harvester's guidelines:
      
      https://www.openarchives.org/OAI/2.0/guidelines-harvester.htm
      
      Modification:
      
      Fix bug with `User-Agent` header.  Add support for `From` header.
      
      The `From` header takes an email address as an argument.  Rather than
      requiring a configuration file, the code looks for likely places where
      the user may have already configured their email address.
      
      Result:
      
      The OAI-PMH harvester more closely follows the corresponding
      guidelines.
      08af574c
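
The commit does not say which places are searched for the email address. As an illustration only, the sketch below assumes two plausible sources (the `EMAIL` environment variable, then git's `user.email` setting); the method names and the lookup order are hypothetical.

```ruby
# Hypothetical lookup order: the EMAIL environment variable, then git's
# user.email setting.  Both sources are assumptions for illustration.
def discover_email(env: ENV, git_lookup: -> { `git config --get user.email 2>/dev/null` })
  email = env['EMAIL'].to_s.strip
  # Only consult git when the environment gave us nothing.
  email = git_lookup.call.to_s.strip if email.empty?
  email.include?('@') ? email : nil
end

# Build the request headers recommended by the harvester guidelines.
# `From` is omitted when no email address could be found.
def request_headers(user_agent:, from: nil)
  headers = { 'User-Agent' => user_agent }
  headers['From'] = from if from
  headers
end
```

Passing the sources in as arguments keeps the lookup testable without touching the real environment.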
    • Paul Millar's avatar
      api-pmh Add connection caching · 6f6b4c75
      Paul Millar authored
      Motivation:
      
      The OAI-PMH endpoints being queried behave so that the client makes a
      large number of requests, each returning a relatively small amount of
      data.
      
      When requests are processed by the OAI-PMH server quickly, the overhead
      for establishing the TCP and TLS connections can be very significant.
      
      Connection caching (sometimes called HTTP Keep Alive) involves sending
      multiple HTTP requests over a single TCP connection, allowing us to
      ameliorate the connection overhead by (effectively) spreading the cost
      over all OAI-PMH requests.
      
      Modification:
      
      Update client to use `persistent_http` connection pool, via the
      `persistent_httparty` adapter.
      
      A bug was discovered, where the host entity is cached between successive
      requests.
      
      Result:
      
      OAI-PMH requests are now faster.  Some observed speedups per request are
      (0.12 +/- 0.02) s, (0.16 +/- 0.01) s and (0.16 +/- 0.03) s for ESRF, HZB
      and HZDR respectively (measured with ListIdentifiers request on Dublin
      Core, following the resumptionToken).
      
      The overall impact of this improvement depends on how long the OAI-PMH
      endpoint takes to process a request.  For the above endpoints, the
      percentage improvements (per request) are 12%, 42% and 70% respectively.
      6f6b4c75
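
The idea of connection reuse can be illustrated with Ruby's standard library alone. The sketch below does not use `persistent_http` or `persistent_httparty` as the commit does; instead it relies on the block form of `Net::HTTP.start`, which keeps one connection open for every request made inside the block. The method name `fetch_all` and the `opener` parameter (present only so the connection can be faked in tests) are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Issue several OAI-PMH requests over a single TCP/TLS connection.
# `queries` is a list of parameter hashes, one per request; the return
# value is the list of response bodies in the same order.
def fetch_all(endpoint, queries, opener: Net::HTTP.method(:start))
  uri = URI.parse(endpoint)
  opener.call(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    # Every request in this block shares the connection opened above, so
    # the connection set-up cost is paid once rather than per request.
    queries.map do |params|
      http.get("#{uri.path}?#{URI.encode_www_form(params)}").body
    end
  end
end

# Example call (not executed here, as it needs network access):
#   fetch_all('https://doi.psi.ch/oaipmh/oai',
#             [{ verb: 'Identify' }, { verb: 'ListMetadataFormats' }])
```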
    • Paul Millar's avatar
      oai-pmh: add custom HTTParty client · aee25f9f
      Paul Millar authored
      Motivation:
      
      Currently, there is a lot of HTTP/networking code mixed in with
      application code.  It would be good to separate these concerns,
      particularly as we want to introduce connection reuse.
      
      Modification:
      
      Introduce new class that uses HTTParty as a mixin.  Update code to take
      advantage of this new class.
      
      Result:
      
      Reduced code duplication; now easier to add new features.
      aee25f9f
  17. Jan 01, 2025
    • Paul Millar's avatar
      oai-pmh update how non-optimal endpoints are shown · e3f28fde
      Paul Millar authored
      Motivation:
      
      Currently, if an endpoint has harvesting suspended then the cell is
      shown as green, with no indication of any problem.  Similarly, when
      there is a problem, the message is "Status: Error", with limited
      explanation.
      
      Modification:
      
      Remove "Status: Error" as this (in essence) provides no information that
      isn't already captured by the cell colour.
      
      Update cell colour to show when harvesting was disabled, now shown as an
      orange cell.  This is meant as a kind of half-way between green (=>
      good) and red (=> bad).
      
      For both cases (error and harvesting disabled) a simple text is
      provided.
      
      Result:
      
      The output now better reflects the status of the endpoints.
      e3f28fde
    • Paul Millar's avatar
      Update webpage language to use correct terms from OAI-PMH · 40231ccc
      Paul Millar authored
      Motivation:
      
      It is certainly not guaranteed that all OAI-PMH items are datasets;
      indeed, we have a counter-example from HZDR, which also describes
      software.
      
      Also, although 'category' (to my thinking) would be a better term,
      OAI-PMH uses `set`.  Using a different term, even a better one, is more
      confusing than keeping the term that OAI-PMH itself uses.
      
      Modification:
      
      Switch from `Category` to `Set` and `Dataset` to `Item`.
      
      Result:
      
      Webpage is easier to understand and less likely to cause confusion.
      40231ccc
  18. Dec 31, 2024
    • Paul Millar's avatar
      Add possibility to skip harvesting · e68d4a9b
      Paul Millar authored
      Motivation:
      
      Some OAI-PMH endpoints are broken; moreover, they're broken in such a
      way that harvesting information wastes a lot of time without
      producing useful information.
      
      The specific example is the ISIS endpoint, which is both very slow (~10
      seconds per request) and, after ~9 hours of harvesting, returns a
      resumptionToken that results in failures in a subsequent ListIdentifiers
      request.
      
      The ESS endpoint is also broken.  While also annoying, the impact is
      less because of special handling when a server hasn't provided useful
      information.
      
      The goal is to allow selective disabling of harvesting while continuing
      to update high-level OAI-PMH information based on the Identify call.
      
      Modification:
      
      Add a `skip-harvesting` boolean option.  If it is set to true then
      harvesting is skipped for this endpoint.
      
      Result:
      
      It's possible to update all endpoints without a very long and fruitless
      time spent harvesting from broken endpoints.
      e68d4a9b
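
How such an option might be honoured is sketched below. Only the `skip-harvesting` key comes from the commit message; the layout of the YAML data, the other key names and the method name `harvestable_endpoints` are assumptions made for illustration.

```ruby
require 'yaml'

# Return the entries that should be harvested.  Entries are skipped only
# when `skip-harvesting` is explicitly true; a missing key means "harvest".
def harvestable_endpoints(yaml_text)
  facilities = YAML.safe_load(yaml_text)
  facilities.reject { |facility| facility['skip-harvesting'] == true }
end

# Hypothetical facilities data.
example = <<~YAML
  - name: Example A
    oai-pmh: https://a.example.org/oai
  - name: Example B
    oai-pmh: https://b.example.org/oai
    skip-harvesting: true
YAML

puts harvestable_endpoints(example).map { |facility| facility['name'] }
# prints Example A
```

The Identify request would still be sent to every endpoint; only the long-running harvest is filtered.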
    • Paul Millar's avatar
      Update facility data OAI-PMH metadata to record information as items · ca76d30d
      Paul Millar authored
      Motivation:
      
      The current OAI-PMH information is recorded as `datasets`.  However,
      this assumes that the items underlying the harvested OAI-PMH records
      correspond to datasets.  This is not guaranteed, and there are
      counter-examples.
      
      OAI-PMH describes three concepts: resource, item and record.  The
      OAI-PMH responses provide records (descriptive metadata of some item) or
      identifiers thereof.  However, since OAI-PMH requires repositories to
      support Dublin Core records, it seems a reasonable assumption for there
      to be a 1:1 relationship between each Dublin Core record and some
      corresponding item.
      
      Therefore, we can use the metadata-agnostic `item` concept when
      describing the information about the endpoint.
      
      Modification:
      
      Update script to record information under `items` node in the facilities
      YAML file.
      
      Update the Jekyll site to consume this information when rendering the
      corresponding HTML.
      
      Result:
      
      No observable change, but the facilities YAML file now uses the more
      neutral 'items' instead of 'datasets'.
      ca76d30d
    • Paul Millar's avatar
      update_oai-pmh robust against metadata prefix lookup failures · 2a6ce96e
      Paul Millar authored
      Motivation:
      
      The ListMetadataFormats call can fail.  Currently, this causes the entire
      script to fail.
      
      Modification:
      
      Catch the exception and report a failure.
      
      Result:
      
      A metadata prefix lookup failure is now limited to a single
      OAI-PMH endpoint.
      2a6ce96e
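
The containment described here can be sketched as a per-endpoint rescue. The names are hypothetical: `lookup` stands in for whatever code issues the ListMetadataFormats request and may raise on failure.

```ruby
# Collect metadata prefixes for each endpoint.  A failure is recorded
# against the endpoint that caused it and the loop carries on, so one
# broken endpoint no longer aborts the whole run.
def metadata_prefixes_by_endpoint(endpoints, lookup)
  endpoints.each_with_object({}) do |endpoint, results|
    results[endpoint] = { prefixes: lookup.call(endpoint) }
  rescue StandardError => e
    results[endpoint] = { error: e.message }
  end
end
```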
    • Paul Millar's avatar
      update_oai-pmh Add HTTP request timing statistics · f2fef214
      Paul Millar authored
      Motivation:
      
      Different OAI-PMH endpoints provide different performance
      characteristics.  It would be helpful to categorise them.
      
      Modification:
      
      Introduce a Stats class to capture statistics.
      
      Monkey-patch Float to support printing numbers to a given number of
      significant figures.
      
      Produce request stats per round (40 requests) and overall as output.
      f2fef214
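
A statistics collector of this kind might look like the sketch below. It is not the class from the commit: the method names are hypothetical, and a plain helper method is used for significant-figure rounding where the commit monkey-patches Float.

```ruby
# Accumulates request durations (in seconds) and summarises them.
class Stats
  def initialize
    @samples = []
  end

  def add(seconds)
    @samples << seconds.to_f
    self
  end

  def count
    @samples.size
  end

  # Arithmetic mean; needs at least one sample.
  def mean
    @samples.sum / count
  end

  # Sample standard deviation (n - 1 denominator); needs >= 2 samples.
  def stddev
    m = mean
    Math.sqrt(@samples.sum { |s| (s - m)**2 } / (count - 1))
  end
end

# Round `value` to the given number of significant figures.
def to_significant(value, figures)
  return 0.0 if value.zero?

  # Position of the leading digit decides how many decimals to keep.
  digits = figures - 1 - Math.log10(value.abs).floor
  value.round(digits).to_f
end
```

Creating one `Stats` object per round of requests and another for the whole run would give both kinds of summary the commit mentions.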
    • Paul Millar's avatar
      update_oai-pmh: make code DRY-er · 15618eaf
      Paul Millar authored
      15618eaf
    • Paul Millar's avatar
      update_oai-pmh record adminEmail address · 6039af5d
      Paul Millar authored
      Motivation:
      
      OAI-PMH provides admin contact details as email addresses. This could
      prove useful information.  One such situation is when the OAI-PMH
      endpoint is not working.  When this happens, the admin contact details
      are no longer available from the endpoint, so caching the values would
      prove useful.
      
      Modification:
      
      Update code to capture the admin contact information and record it
      against the facility-specific information.
      
      If, when updating the OAI-PMH details, the admin contact details are
      discovered (from the OAI-PMH endpoint) then any existing contact details
      are replaced with the discovered information; otherwise, any existing
      admin contact details are left unmodified.
      
      Result:
      
      We collect and cache OAI-PMH admin contact details in the facility
      metadata.
      6039af5d
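
The replace-or-keep rule from this commit can be written as a small function. The key name `admin-emails` and the method name are hypothetical; the actual structure of the facility metadata is not shown in the commit message.

```ruby
# Return the facility data with admin contact details updated.
# Discovered addresses replace any cached ones; if nothing was discovered
# (for example, because the endpoint is down) the cached value is kept.
def update_admin_emails(facility, discovered)
  return facility if discovered.nil? || discovered.empty?

  facility.merge('admin-emails' => discovered)
end
```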
  19. Dec 29, 2024
    • Paul Millar's avatar
      update_oai-pmh Add a back-off strategy, to allow a service to recover · 21a17caf
      Paul Millar authored
      Motivation:
      
      Some server-side problems are quickly fixed by retrying the request,
      while other problems take longer to recover.  Using a fixed duration as
      a delay between attempts makes it hard to reconcile these different
      failure modes.
      
      Modification:
      
      Add a progressive back-off strategy.  The delay between attempts now
      grows linearly with each successive attempt.
      
      As a special case, the retry strategy is different if the server has not
      provided any useful response.  In this case, a fixed delay is used.
      
      Also, catch more problematic scenarios and handle them with the same
      retry strategy.
      
      Result:
      
      More robust handling of server-side errors, by retrying.
      21a17caf
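
A linear back-off of the kind described here can be sketched as follows. The method name, the default values and the `sleeper` parameter (present so the delays can be observed without actually waiting) are hypothetical, and the special fixed-delay case for servers that return no useful response is left out.

```ruby
# Run the block, retrying on failure with a delay that grows linearly:
# base_delay after the first failure, 2 * base_delay after the second,
# and so on.  The last failure is re-raised once max_attempts is reached.
def with_backoff(max_attempts: 5, base_delay: 2.0, sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts

    sleeper.call(base_delay * attempt)
    retry
  end
end
```

A caller would wrap each OAI-PMH request in `with_backoff { ... }`, so that quick-to-clear problems cost only a short wait while slower recoveries get progressively more time.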