- Mar 12, 2025
-
-
Paul Millar authored
-
- Feb 22, 2025
-
-
Paul Millar authored
-
Paul Millar authored
-
Paul Millar authored
-
- Feb 21, 2025
-
-
Paul Millar authored
-
Paul Millar authored
Motivation: CI/CD jobs suddenly started to fail. This is (seemingly) caused by the 'gem install' command pulling in a newer version of jekyll, which is incompatible with one of its dependencies (jekyll --> sass-embedded). Although the exact nature of the incompatibility is unclear, the install fails while building, with the error `NameError: uninitialized constant JSON::Fragment`.

Modification: Remove jekyll from "gem install". Use bundler to fetch jekyll instead. This should honour the version.

Result: More reliable builds.
-
- Feb 05, 2025
-
-
Paul Millar authored
-
- Jan 28, 2025
-
-
Paul Millar authored
-
Paul Millar authored
-
- Jan 20, 2025
-
-
Paul Millar authored
-
- Jan 17, 2025
-
-
Paul Millar authored
-
- Jan 15, 2025
-
-
Paul Millar authored
-
- Jan 14, 2025
-
-
Paul Millar authored
-
- Jan 13, 2025
-
-
Paul Millar authored
-
- Jan 09, 2025
-
-
Paul Millar authored
-
- Jan 08, 2025
-
-
Paul Millar authored
-
- Jan 07, 2025
-
-
Paul Millar authored
-
Paul Millar authored
-
Paul Millar authored
The repo was mentioned by Oonagh Brendike-Mannix.
-
- Jan 05, 2025
-
-
Paul Millar authored
This file is a place-holder while we figure out the best place to record this information.
-
- Jan 03, 2025
-
-
Paul Millar authored
Motivation: OAI-PMH provides information about resources. The PaNOSC OAI-PMH endpoints are not limited to describing datasets; therefore, it's useful to show which items are of which type.

Modification: Update the main repository cell to include type information. If the repository is using sets then the sets table now includes type information.

Result: A better understanding of which resources are available.
-
Paul Millar authored
Motivation: OAI-PMH, by itself, doesn't identify the nature of the resource; rather, this is conveyed by the metadata record itself. OAI-PMH sets also don't provide any guaranteed semantics; such semantics can be added through the set description, but there's no consensus or established practice for doing this. Therefore, in order to categorise OAI-PMH items by type, we need to obtain records: listing identifiers isn't sufficient. Moreover, Dublin Core (as currently used) doesn't support the fine-grained type semantics we would like to present. The DataCite metadata schema provides `resourceType` metadata, with `resourceTypeGeneral` providing the coarse-grained type of the resource. This is what we would like to use.

Modification: Add support for querying all records using the DataCite metadata format. This task is very similar to the existing code that lists all identifiers of records with Dublin Core. The patch adds support for querying DataCite metadata as mostly a copy-and-paste of the existing code. This is technical debt that future patches MUST address through refactoring. The OAI-PMH client code is updated to support ListRecords requests. This is also a copy-and-paste, incurring further technical debt that future patches must address.

Result: The facilities YAML file now includes a breakdown of OAI-PMH items based on their DataCite `resourceTypeGeneral`.
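The breakdown by `resourceTypeGeneral` can be sketched as a simple tally over a ListRecords response. This is a rough illustration, not the patch's code: the function name `tally_resource_types` is hypothetical, and the regular-expression scan is a deliberate simplification that stands in for proper XML parsing.

```ruby
# Hypothetical helper: counts DataCite resourceTypeGeneral values found in
# the XML body of a ListRecords response.
# Simplification: a regex scan replaces real XML parsing, so namespaces,
# comments and CDATA sections are not handled.
def tally_resource_types(xml)
  xml.scan(/resourceTypeGeneral=(["'])(.*?)\1/).map(&:last).tally
end

# Made-up sample fragment, for illustration only.
sample = <<~XML
  <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
  <resourceType resourceTypeGeneral="Software">Software</resourceType>
  <resourceType resourceTypeGeneral='Dataset'>Dataset</resourceType>
XML

counts = tally_resource_types(sample)
```

A full harvest would repeat this for each page of results, following the resumptionToken, and sum the per-page tallies.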
-
- Jan 02, 2025
-
-
Paul Millar authored
Motivation: A bug was introduced with commit aee25f9f where OAI-PMH URLs without a trailing slash have one added; for example, the Identify request targeting PSI changes from `https://doi.psi.ch/oaipmh/oai?verb=Identify` to `https://doi.psi.ch/oaipmh/oai/?verb=Identify`. For some endpoints, this distinction is important, resulting in the requests failing.

Modification: Fix regression so that the correct URL is used.

Result: The PSI endpoint is now shown (correctly) to be working.
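One way to avoid this class of regression is to attach the query string to the base URL without touching its path. The following is a minimal sketch using Ruby's standard `uri` library; the helper name `oai_request_uri` is an assumption and is not taken from the repository.

```ruby
require 'uri'

# Hypothetical helper: builds an OAI-PMH request URI by setting only the
# query component, so the path of the base URL is left exactly as given.
def oai_request_uri(base_url, verb, params = {})
  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form({ 'verb' => verb }.merge(params))
  uri
end

identify = oai_request_uri('https://doi.psi.ch/oaipmh/oai', 'Identify')
```

A base URL that already ends with a slash keeps that slash, so both styles of endpoint are requested exactly as configured.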
-
Paul Millar authored
-
Paul Millar authored
-
Paul Millar authored
Motivation: The `User-Agent` request header was not being sent due to a bug. The harvester lacks support for the `From` request header. Both HTTP request headers are recommended by the OAI-PMH harvester guidelines: https://www.openarchives.org/OAI/2.0/guidelines-harvester.htm

Modification: Fix bug with `User-Agent` header. Add support for `From` header. The `From` header takes an email address as an argument. Rather than requiring a configuration file, the code looks for likely places where the user may have already configured their email address.

Result: The OAI-PMH harvester more closely follows the corresponding guidelines.
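The email lookup might look something like the sketch below. The commit message does not say which places are searched, so the sources used here (the `EMAIL` and `GIT_AUTHOR_EMAIL` environment variables, then `git config user.email`) are assumptions, as are the method names and the example User-Agent value.

```ruby
# Hypothetical lookup: tries each assumed source in turn and returns the
# first value that looks like an email address, or nil if none is found.
def discover_email(env: ENV, git_lookup: -> { `git config --get user.email 2>/dev/null` })
  sources = [-> { env['EMAIL'] }, -> { env['GIT_AUTHOR_EMAIL'] }, git_lookup]
  sources.each do |source|
    value = source.call.to_s.strip
    return value if value.include?('@')
  end
  nil
end

# Builds the request headers; From is omitted when no address is known.
def request_headers(user_agent:, from: discover_email)
  headers = { 'User-Agent' => user_agent }
  headers['From'] = from if from
  headers
end

headers = request_headers(user_agent: 'example-harvester/0.1', from: 'admin@example.org')
```

The sources are wrapped in lambdas so that the `git` command only runs if the environment variables yield nothing.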
-
Paul Millar authored
Motivation: The OAI-PMH endpoints being queried require the client to make a large number of requests, each returning a relatively small amount of data. When the OAI-PMH server processes requests quickly, the overhead of establishing the TCP and TLS connections can be very significant. Connection caching (sometimes called HTTP Keep Alive) involves sending multiple HTTP requests over a single TCP connection, allowing us to ameliorate the connection overhead by (effectively) spreading the cost over all OAI-PMH requests.

Modification: Update client to use `persistent_http` connection pool, via the `persistent_httparty` adapter. A bug was discovered, where the host entity is cached between successive requests.

Result: OAI-PMH requests are now faster. Some observed speedups per request are (0.12 +/- 0.02) s, (0.16 +/- 0.01) s and (0.16 +/- 0.03) s for ESRF, HZB and HZDR respectively (measured with ListIdentifiers request on Dublin Core, following the resumptionToken). The overall impact of this improvement depends on how long the OAI-PMH endpoint takes to process a request. For the above endpoints, the percentage improvements (per request) are 12%, 42% and 70% respectively.
-
Paul Millar authored
Motivation: Currently, there is a lot of HTTP/networking code mixed in with application code. It would be good to separate these concerns, particularly as we want to introduce connection reuse.

Modification: Introduce new class that uses HTTParty as a mixin. Update code to take advantage of this new class.

Result: Reduced code duplication; now easier to add new features.
-
- Jan 01, 2025
-
-
Paul Millar authored
Motivation: Currently, if an endpoint has harvesting suspended then the cell is shown as green, with no indication of any problem. Similarly, when there is a problem, the message is "Status: Error", with limited explanation.

Modification: Remove "Status: Error" as this (in essence) provides no information that isn't already captured by the cell colour. Update cell colour to show when harvesting is disabled, now shown as an orange cell. This is meant as a kind of half-way between green (=> good) and red (=> bad). For both cases (error and harvesting disabled) a simple text is provided.

Result: The output now better reflects the status of the endpoints.
-
Paul Millar authored
-
Paul Millar authored
Motivation: It is certainly not guaranteed that all OAI-PMH items are datasets; indeed, we have a counter-example from HZDR, which also describes software. Also, although 'category' (to my thinking) would be a better term, OAI-PMH uses `set`. Using a term that differs from the protocol's causes more confusion than a better term would resolve.

Modification: Switch from `Category` to `Set` and `Dataset` to `Item`.

Result: Webpage is easier to understand and less likely to cause confusion.
-
Paul Millar authored
-
- Dec 31, 2024
-
-
Paul Millar authored
Motivation: Some OAI-PMH endpoints are broken; moreover, they're broken in such a way that harvesting wastes a lot of time without producing useful information. The specific example is the ISIS endpoint, which is both very slow (~10 seconds per request) and, after ~9 hours of harvesting, returns a resumptionToken that results in failures in a subsequent ListIdentifiers request. The ESS endpoint is also broken. While also annoying, the impact is less because of special handling when a server hasn't provided useful information. The goal is to allow selective disabling of harvesting while continuing to update high-level OAI-PMH information based on the Identify call.

Modification: Add a `skip-harvesting` boolean option. If set to true then harvesting is skipped for this endpoint.

Result: It's possible to update all endpoints without a very long and fruitless time spent harvesting from broken endpoints.
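The effect of the `skip-harvesting` option can be sketched as follows. Only the option name comes from the commit message; the method name, the hash layout and the `identify`/`harvest` callables are hypothetical stand-ins for the real script.

```ruby
# Hypothetical update step: the Identify information is always refreshed,
# while the (potentially very slow) harvest only runs when the endpoint is
# not marked with `skip-harvesting: true`.
def update_endpoint(endpoint, identify:, harvest:)
  info = { 'identify' => identify.call(endpoint['url']) }
  if endpoint['skip-harvesting'] == true
    info['harvesting'] = 'skipped'
  else
    info['items'] = harvest.call(endpoint['url'])
  end
  info
end

# Illustrative endpoint entry and stand-in callables.
broken = { 'url' => 'https://example.org/oai', 'skip-harvesting' => true }
result = update_endpoint(
  broken,
  identify: ->(_url) { { 'repositoryName' => 'Example' } },
  harvest: ->(_url) { raise 'harvest should not run for a skipped endpoint' }
)
```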
-
Paul Millar authored
Motivation: The current OAI-PMH information is recorded as `datasets`. However, this assumes that the items underlying the harvested OAI-PMH records correspond to datasets. This is not guaranteed, and there are counter-examples. OAI-PMH describes three concepts: resource, item and record. The OAI-PMH responses provide records (descriptive metadata of some item) or identifiers thereof. However, since OAI-PMH requires repositories to support Dublin Core records, it seems a reasonable assumption for there to be a 1:1 relationship between each Dublin Core record and some corresponding item. Therefore, we can use the metadata-agnostic `item` concept when describing the information about the endpoint.

Modification: Update script to record information under `items` node in the facilities YAML file. Update the Jekyll site to consume this information when rendering the corresponding HTML.

Result: No observable change, but the facilities YAML file now uses the more neutral 'items' instead of 'datasets'.
-
Paul Millar authored
Motivation: The ListMetadataFormats call can fail. Currently, this causes the entire script to fail.

Modification: Catch the exception and report a failure.

Result: A metadata prefix lookup failure is now limited to a single OAI-PMH endpoint.
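The containment described above amounts to rescuing the exception per endpoint instead of letting it propagate. A minimal sketch, assuming a hypothetical `lookup_prefixes` method that takes the actual ListMetadataFormats call as a block:

```ruby
# Hypothetical wrapper: runs the supplied lookup for each endpoint and
# records an error entry for any endpoint whose lookup raises, so that one
# failure does not abort the whole run.
def lookup_prefixes(endpoints)
  endpoints.each_with_object({}) do |endpoint, results|
    begin
      results[endpoint] = { 'prefixes' => yield(endpoint) }
    rescue StandardError => e
      results[endpoint] = { 'error' => e.message }
    end
  end
end

# Stand-in lookup: fails for one endpoint, succeeds for the other.
outcome = lookup_prefixes(%w[good bad]) do |endpoint|
  raise 'ListMetadataFormats failed' if endpoint == 'bad'

  %w[oai_dc]
end
```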
-
Paul Millar authored
Motivation: Different OAI-PMH endpoints provide different performance characteristics. It would be helpful to categorise them.

Modification: Introduce a Stats class to capture statistics. Monkey-patch Float to support printing a number to a specific number of significant figures. Produce request stats per round (40 requests) and overall as output.
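A rough sketch of what such a Float extension and statistics class could look like. The method name `to_sig`, the `Stats` interface and the `report` helper are all assumptions; only the ideas (significant figures, per-round and overall statistics, rounds of 40 requests) come from the commit message.

```ruby
# Illustrative monkey-patch; `to_sig` is an assumed name.
class Float
  # Formats the number using the given count of significant figures.
  def to_sig(digits)
    return '0' if zero?

    rounded = round(digits - 1 - Math.log10(abs).floor)
    # Recompute the magnitude after rounding, since e.g. 9.96 rounds to 10.
    decimals = digits - 1 - Math.log10(rounded.abs).floor
    decimals.positive? ? format("%.#{decimals}f", rounded) : rounded.round.to_s
  end
end

# Collects request durations (in seconds) and summarises them.
class Stats
  def initialize
    @samples = []
  end

  def add(seconds)
    @samples << seconds.to_f
    self
  end

  def count
    @samples.size
  end

  def mean
    count.zero? ? 0.0 : @samples.sum / count
  end

  # Sample standard deviation; zero when there are fewer than two samples.
  def stddev
    return 0.0 if count < 2

    m = mean
    Math.sqrt(@samples.sum { |s| (s - m)**2 } / (count - 1))
  end

  def to_s
    "#{count} requests: (#{mean.to_sig(2)} +/- #{stddev.to_sig(1)}) s"
  end
end

# Produces one summary line per round, followed by an overall summary.
def report(durations, round_size: 40)
  overall = Stats.new
  lines = durations.each_slice(round_size).map do |round|
    stats = Stats.new
    round.each { |d| stats.add(d) && overall.add(d) }
    "round: #{stats}"
  end
  lines << "overall: #{overall}"
end
```

Calling `report` with 80 durations would give two round lines and one overall line.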
-
Paul Millar authored
-
Paul Millar authored
Motivation: OAI-PMH provides admin contact details as email addresses. This could prove to be useful information. One such situation is when the OAI-PMH endpoint is not working. When this happens, the admin contact details are no longer available from the endpoint, so caching the values would prove useful.

Modification: Update code to capture the admin contact information and record it against the facility-specific information. If, when updating the OAI-PMH details, the admin contact details are discovered (from the OAI-PMH endpoint) then any existing contact details are replaced with the discovered information; otherwise, any existing admin contact details are left unmodified.

Result: We collect and cache OAI-PMH admin contact details in the facility metadata.
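The replace-or-keep rule described above can be sketched in a few lines. The method name and the `admin-contacts` key are hypothetical; the real facilities YAML file may use a different layout.

```ruby
# Hypothetical merge step: discovered contacts replace any cached ones,
# while a failed or empty discovery leaves the cached contacts untouched.
def merge_admin_contacts(facility, discovered)
  return facility if discovered.nil? || discovered.empty?

  facility.merge('admin-contacts' => discovered)
end

cached = { 'name' => 'Example Facility', 'admin-contacts' => ['old@example.org'] }
updated = merge_admin_contacts(cached, ['new@example.org'])
unchanged = merge_admin_contacts(cached, nil)
```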
-
- Dec 29, 2024
-
-
Paul Millar authored
-
Paul Millar authored
Motivation: Some server-side problems are quickly fixed by retrying the request, while other problems take longer to recover. Using a fixed duration as a delay between attempts makes it hard to reconcile these different failure modes.

Modification: Add a progressive timeout strategy. The delay between attempts now scales linearly with the number of attempts. As a special case, the retry strategy is different if the server has not provided any useful response. In this case, a fixed delay is used. Also, catch more problematic scenarios and handle them with the same retry strategy.

Result: More robust handling of server-side errors, by retrying.
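A minimal Ruby sketch of the retry strategy described above, not the commit's actual code. The names (`retry_delay`, `with_retries`, `NoUsefulResponse`) and all delay values are assumptions for illustration; the commit message only states that the delay scales linearly and that a fixed delay is used when the server gave no useful response.

```ruby
# Hypothetical error class for "the server gave no useful response".
class NoUsefulResponse < StandardError; end

# Delay before the next attempt: linear in the attempt number, except for
# NoUsefulResponse, where a fixed delay is used. Values are illustrative.
def retry_delay(attempt, error, base: 2.0, fixed: 30.0)
  error.is_a?(NoUsefulResponse) ? fixed : base * attempt
end

# Runs the block, retrying on failure up to max_attempts times in total.
# The sleeper is injectable so the strategy can be exercised without waiting.
def with_retries(max_attempts: 5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError => e
    raise if attempt >= max_attempts

    sleeper.call(retry_delay(attempt, e))
    retry
  end
end
```

For example, a request that fails twice with ordinary errors would wait 2 s and then 4 s under these illustrative values before the third attempt.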
-