
⚙️ Backend Feature Request — Integrate Elasticsearch Search (with Full ICAT Authorization Support via allowed_sample_ids)

🎯 Goal

Implement Elasticsearch search integration in SEPIA to improve scalability and search experience — while ensuring that all queries strictly respect ICAT-based authorization through allowed_sample_ids. Every search result must be limited to samples that the current user is authorized to access in ICAT.


🔐 Authorization Principle

  • SEPIA receives from ICAT the exact list of allowed_sample_ids for the current session.
  • Elasticsearch must only return results from within this ID list.
  • No samples outside allowed_sample_ids may ever appear in results — even if they match other filters.

⚙️ Required Filters

The Elasticsearch-backed search must keep supporting every filter currently defined in /sample/initiate and add the new ones (person_ids and sample_type_ids in the example request below):

{
  "filters": {
    "only_user": true,
    "proposal_id": "PROJ-001",
    "search": "catalyst",
    "keyword": "XRD",
    "date_type": "created_at",
    "date_from": "2025-01-01",
    "date_to": "2025-12-31",
    "sort_column": "created_at",
    "sort_order": "desc",
    "person_ids": [1, 2, 3],
    "sample_type_ids": [5, 6]
  },
  "session_id": "ICAT_SESSION_ABC"
}
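As a rough illustration of how these filters could map onto an Elasticsearch bool query, here is a minimal sketch. The index field names (name, description, keywords, person_ids, sample_type_id) are assumptions and not taken from SEPIA's schema, and only_user is left out because its semantics depend on the current user lookup.

```python
from typing import Any, Dict, List


def build_filter_query(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Translate request filters into a bool query (field names are assumed)."""
    must: List[Dict[str, Any]] = []
    filter_clauses: List[Dict[str, Any]] = []

    if filters.get("search"):
        # Free-text search affects relevance, so it goes in "must".
        must.append({"multi_match": {"query": filters["search"],
                                     "fields": ["name", "description"]}})
    if filters.get("keyword"):
        filter_clauses.append({"term": {"keywords": filters["keyword"]}})
    if filters.get("proposal_id"):
        filter_clauses.append({"term": {"proposal_id": filters["proposal_id"]}})
    if filters.get("person_ids"):
        filter_clauses.append({"terms": {"person_ids": filters["person_ids"]}})
    if filters.get("sample_type_ids"):
        filter_clauses.append({"terms": {"sample_type_id": filters["sample_type_ids"]}})

    # date_type names the date field that date_from / date_to apply to.
    date_range = {}
    if filters.get("date_from"):
        date_range["gte"] = filters["date_from"]
    if filters.get("date_to"):
        date_range["lte"] = filters["date_to"]
    if filters.get("date_type") and date_range:
        filter_clauses.append({"range": {filters["date_type"]: date_range}})

    return {"bool": {"must": must, "filter": filter_clauses}}


def build_sort(filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Requested sort plus an id tie-breaker so the ordering is deterministic."""
    column = filters.get("sort_column", "created_at")
    order = filters.get("sort_order", "desc")
    return [{column: {"order": order}}, {"id": {"order": "asc"}}]
```

The allowed_sample_ids restriction is deliberately not part of this helper; it would be added as a separate terms clause so that authorization can never be dropped by a filter-building bug.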

🧱 Implementation Details

1. Elasticsearch Integration

  • Deploy and configure Elasticsearch for SEPIA’s samples index.

  • Define mappings for:

    • sample
    • sample_type
    • sample_user
    • keyword
  • Index all searchable metadata fields (including related sample users, types, and keywords).
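One possible shape for the mapping is a single denormalized document per sample, with related users, type, and keywords flattened into it. This is only a sketch: every field name and type below is an assumption, and the real mapping should follow SEPIA's relational schema.

```python
# Hypothetical mapping for the "samples" index (all field names are assumed).
SAMPLES_MAPPING = {
    "properties": {
        "id": {"type": "long"},               # sample primary key, used for authorization
        "name": {"type": "text"},             # full-text searchable
        "description": {"type": "text"},
        "proposal_id": {"type": "keyword"},   # exact-match identifier
        "sample_type_id": {"type": "long"},
        "sample_type_name": {"type": "keyword"},
        "person_ids": {"type": "long"},       # any field may hold an array of values
        "keywords": {"type": "keyword"},
        "created_at": {"type": "date"},
        "updated_at": {"type": "date"},
    }
}

# With the official Python client (8.x) this could be applied roughly as:
#   es.indices.create(index="samples", mappings=SAMPLES_MAPPING)
```

Identifier-like fields use keyword or long so they work with term and terms filters, while free-text fields use text so they are analyzed for search.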


2. Authorization Handling

Because allowed_sample_ids may be a large list, the implementation must handle this efficiently while preserving correctness.

Strategy:

  • Retrieve allowed_sample_ids from ICAT for the current session.

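  • Use a chunked query strategy to stay well below Elasticsearch’s terms query limit (the index.max_terms_count setting, 65,536 terms by default) and to keep individual requests small:

    • Split the ID list into manageable chunks (e.g. 500–1000 IDs per query).
    • Query Elasticsearch per chunk, requesting enough hits (size at least the chunk length) that no match is dropped.
    • Merge the per-chunk results in the backend and re-apply the requested sort before pagination.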
  • Use caching to store allowed_sample_ids for the current session_id to avoid repeated ICAT lookups.

  • Ensure full consistency between ICAT permissions and Elasticsearch results.
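The caching step could look roughly like the following in-memory sketch. AllowedIdsCache, the fetch callable, and the 300-second TTL are all hypothetical; a shared store such as Redis may be a better fit if SEPIA runs several backend workers.

```python
import time
from typing import Callable, FrozenSet, Iterable


class AllowedIdsCache:
    """Per-session cache of allowed_sample_ids with a time-to-live."""

    def __init__(self, fetch: Callable[[str], Iterable[int]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # e.g. a wrapper around the ICAT lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # session_id -> (stored_at, frozenset of ids)

    def get(self, session_id: str) -> FrozenSet[int]:
        now = self._clock()
        entry = self._entries.get(session_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Missing or expired: ask ICAT again and remember the answer.
        ids = frozenset(self._fetch(session_id))
        self._entries[session_id] = (now, ids)
        return ids

    def invalidate(self, session_id: str) -> None:
        """Drop a session's entry, e.g. on logout or a permission change."""
        self._entries.pop(session_id, None)
```

The TTL bounds how long a revoked ICAT permission can still be honored, so a shorter value trades extra ICAT lookups for tighter consistency.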

Example internal logic:

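from elasticsearch import Elasticsearch

CHUNK_SIZE = 1000
es = Elasticsearch("http://localhost:9200")  # placeholder URL


def chunk_list(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# get_allowed_sample_ids() wraps the ICAT lookup and is defined elsewhere.
allowed_ids = list(get_allowed_sample_ids(session_id))
results = []
for chunk in chunk_list(allowed_ids, CHUNK_SIZE):
    part = es.search(
        index="samples",
        size=len(chunk),  # the default size of 10 would silently drop hits
        query={
            "bool": {
                "filter": [
                    {"terms": {"id": chunk}},
                    # other filters like date, keywords, etc.
                ]
            }
        },
    )
    results.extend(part["hits"]["hits"])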
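Because each chunk is queried separately, the combined hits are not globally ordered. The merge step before pagination could be sketched as below; merge_and_paginate is a hypothetical helper, and it assumes every hit carries the sort field and id in its _source.

```python
from typing import Any, Dict, List

Hit = Dict[str, Any]


def merge_and_paginate(chunk_hits: List[List[Hit]], sort_key: str,
                       descending: bool, offset: int, limit: int) -> List[Hit]:
    """Flatten per-chunk hits, sort them globally, and slice out one page."""
    merged = [hit for hits in chunk_hits for hit in hits]
    # Two stable sorts: id first as a tie-breaker, then the requested column.
    merged.sort(key=lambda h: h["_source"]["id"])
    merged.sort(key=lambda h: h["_source"][sort_key], reverse=descending)
    return merged[offset:offset + limit]


chunks = [
    [{"_source": {"id": 1, "created_at": "2025-01-01"}},
     {"_source": {"id": 4, "created_at": "2025-03-01"}}],
    [{"_source": {"id": 2, "created_at": "2025-02-01"}},
     {"_source": {"id": 3, "created_at": "2025-03-01"}}],
]
page = merge_and_paginate(chunks, "created_at", True, offset=0, limit=3)
print([h["_source"]["id"] for h in page])  # [3, 4, 2]
```

Sorting in memory is simple but holds every matching hit at once, which is the cost this approach accepts in exchange for correctness.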

3. /sample/initiate Endpoint

  • Accepts all filters (including person_ids and sample_type_ids).
  • Builds the Elasticsearch query using those filters.
  • Applies allowed_sample_ids filtering (in chunks if large).
  • Returns a cursor_id representing the filtered query context.
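One way to represent the cursor_id is an opaque key into a server-side store that holds the query context. The sketch below is an assumption about how that could work, not SEPIA's existing cursor mechanism; CursorStore and QueryContext are hypothetical names.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class QueryContext:
    session_id: str
    query: Dict[str, Any]                     # bool query built from the filters
    sort: List[Dict[str, Any]]
    search_after: Optional[List[Any]] = None  # position of the last page served


class CursorStore:
    """In-memory cursor store; a shared cache would be needed across workers."""

    def __init__(self) -> None:
        self._contexts: Dict[str, QueryContext] = {}

    def create(self, session_id: str, query: Dict[str, Any],
               sort: List[Dict[str, Any]]) -> str:
        cursor_id = uuid.uuid4().hex
        self._contexts[cursor_id] = QueryContext(session_id, query, sort)
        return cursor_id

    def get(self, cursor_id: str, session_id: str) -> QueryContext:
        ctx = self._contexts.get(cursor_id)
        # A cursor is only valid for the session that created it.
        if ctx is None or ctx.session_id != session_id:
            raise KeyError("unknown cursor for this session")
        return ctx
```

Binding the cursor to its session_id prevents one session from replaying another session's query context, and the stored context never includes the allowed ID list itself, so permissions are re-evaluated on each page.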

4. /sample Endpoint

  • Retrieves paginated results using Elasticsearch’s search_after parameter (which needs a deterministic sort with a unique tie-breaker field such as id) or a stored query cursor.
  • Always respects the allowed_sample_ids constraint for the session.
  • Optionally, merge in relational metadata if required (e.g., display info from SQL).
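For the search_after variant, each page request reuses the same query and sort and passes the sort values of the last hit from the previous page. A minimal sketch, with hypothetical helper names:

```python
from typing import Any, Dict, List, Optional


def build_page_request(query: Dict[str, Any], sort: List[Dict[str, Any]],
                       page_size: int,
                       search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build search parameters for one page; omit search_after on the first page."""
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": page_size}
    if search_after is not None:
        body["search_after"] = search_after
    return body


def next_search_after(hits: List[Dict[str, Any]]) -> Optional[List[Any]]:
    """Each hit of a sorted search carries a "sort" array; the last one is the cursor."""
    return hits[-1]["sort"] if hits else None


# With the official Python client (8.x) a page could then be fetched roughly as:
#   response = es.search(index="samples", **build_page_request(query, sort, 50, after))
#   after = next_search_after(response["hits"]["hits"])
```

The sort should end with a unique field such as id; otherwise hits with equal sort values can be skipped or repeated between pages.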

5. Investigation-Based Filtering

If a user filters by investigation:

  1. Get sample IDs linked to that investigation from ICAT.
  2. Intersect them with allowed_sample_ids.
  3. Use this reduced set in the Elasticsearch terms query.

This ensures that investigation filters and authorization always work together securely.
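The three steps above amount to a set intersection. A minimal sketch, where effective_sample_ids is a hypothetical helper and both ID lists are assumed to come from ICAT:

```python
from typing import Iterable, List, Optional


def effective_sample_ids(allowed_ids: Iterable[int],
                         investigation_ids: Optional[Iterable[int]] = None) -> List[int]:
    """Return the IDs to use in the terms filter.

    Without an investigation filter this is simply allowed_ids; with one,
    it is the intersection, so the result is always a subset of allowed_ids.
    """
    allowed = set(allowed_ids)
    if investigation_ids is None:
        return sorted(allowed)
    return sorted(allowed & set(investigation_ids))


print(effective_sample_ids([1, 2, 3, 4], [3, 4, 5]))  # [3, 4]
```

Sample 5 belongs to the investigation but is not in allowed_ids, so it is dropped before any query reaches Elasticsearch. An empty intersection should short-circuit to an empty result rather than fall back to an unfiltered search.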


🧠 Notes

  • The Elasticsearch index itself does not need to contain authorization data — access is enforced at query time using the allowed_sample_ids list.
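  • Chunked querying cannot leak samples, because every chunk is a subset of allowed_sample_ids. No sample is lost as long as each chunk query requests all of its hits (size at least the chunk length) and the merged results are re-sorted before pagination.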
  • This approach prioritizes security and correctness over raw query speed.

Acceptance Criteria

  • Search results only include samples within allowed_sample_ids.
  • No authorized samples are missed.
  • Performance is acceptable for 10k+ allowed IDs.
  • /sample/initiate and /sample endpoints preserve their existing response structure.
  • Supports all existing filters, including person_ids and sample_type_ids.

🏷️ Labels

backend feature elasticsearch authorization icat security search priority::critical

Edited by Mojeeb Rahman Sedeqi