⚙️ Backend Feature Request — Integrate Elasticsearch Search (with Full ICAT Authorization Support via `allowed_sample_ids`)

🎯 Goal

Integrate Elasticsearch-backed search into SEPIA to improve scalability and the search experience, while ensuring that every query respects ICAT-based authorization through allowed_sample_ids. Every search result must be limited to samples that the current user is authorized to access in ICAT.

🔐 Authorization Principle

SEPIA receives from ICAT the exact list of allowed_sample_ids for the current session.
Elasticsearch must only return results from within this ID list.
No samples outside allowed_sample_ids may ever appear in results — even if they match other filters.

⚙️ Required Filters

Elasticsearch search will continue to support all filters currently defined in /sample/initiate, extended with the new ones:

{
  "filters": {
    "only_user": true,
    "proposal_id": "PROJ-001",
    "search": "catalyst",
    "keyword": "XRD",
    "date_type": "created_at",
    "date_from": "2025-01-01",
    "date_to": "2025-12-31",
    "sort_column": "created_at",
    "sort_order": "desc",
    "person_ids": [1, 2, 3],
    "sample_type_ids": [5, 6]
  },
  "session_id": "ICAT_SESSION_ABC"
}
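As a rough illustration of how these filters could map onto Elasticsearch filter clauses, here is a minimal sketch. The index field names (proposal_id, person_ids, sample_type_id, keywords) and the date-field whitelist are assumptions, not part of the current SEPIA schema. The free-text search filter and only_user are left out because they depend on analyzer choices and on the current user.

```python
from typing import Any

# Hypothetical whitelist so that date_type cannot name an arbitrary field.
ALLOWED_DATE_FIELDS = {"created_at", "updated_at"}


def build_filter_clauses(filters: dict[str, Any]) -> list[dict[str, Any]]:
    """Translate request filters into Elasticsearch bool-filter clauses."""
    clauses: list[dict[str, Any]] = []
    if filters.get("proposal_id"):
        clauses.append({"term": {"proposal_id": filters["proposal_id"]}})
    if filters.get("keyword"):
        clauses.append({"term": {"keywords": filters["keyword"]}})
    if filters.get("person_ids"):
        clauses.append({"terms": {"person_ids": filters["person_ids"]}})
    if filters.get("sample_type_ids"):
        clauses.append({"terms": {"sample_type_id": filters["sample_type_ids"]}})

    # Build a single range clause from date_type, date_from and date_to.
    date_field = filters.get("date_type")
    date_range: dict[str, str] = {}
    if filters.get("date_from"):
        date_range["gte"] = filters["date_from"]
    if filters.get("date_to"):
        date_range["lte"] = filters["date_to"]
    if date_range:
        if date_field not in ALLOWED_DATE_FIELDS:
            raise ValueError(f"unsupported date_type: {date_field!r}")
        clauses.append({"range": {date_field: date_range}})
    return clauses
```

The returned list is meant to sit next to the allowed_sample_ids terms clause inside the same bool filter, so the authorization restriction and the user filters are always applied together.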

🧱 Implementation Details

1. Elasticsearch Integration

Deploy and configure Elasticsearch for SEPIA’s samples index.
Define mappings for:
- sample
- sample_type
- sample_user
- keyword
Index all searchable metadata fields (including related sample users, types, and keywords).
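One possible shape for the mapping is a single denormalized document per sample, with sample types, users, and keywords flattened into it. The sketch below is only an assumption about that shape: every field name is hypothetical and would need to be aligned with the actual SEPIA data model.

```python
# Hypothetical mapping for a denormalized "samples" index.
# Identifier fields use "keyword" because they are only matched exactly
# (term/terms filters), never analyzed as free text.
SAMPLES_MAPPING = {
    "properties": {
        "id": {"type": "keyword"},               # sample ID, matched against allowed_sample_ids
        "name": {"type": "text"},                # free-text searchable
        "proposal_id": {"type": "keyword"},
        "created_at": {"type": "date"},
        "sample_type_id": {"type": "keyword"},
        "sample_type_name": {"type": "keyword"},
        "person_ids": {"type": "keyword"},       # array of sample users; arrays need no special mapping
        "keywords": {"type": "keyword"},         # array of keyword labels
    }
}
```

A dictionary like this could be supplied as the mappings when the index is created.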

2. Authorization Handling

Because allowed_sample_ids may be a large list, the implementation must handle this efficiently while preserving correctness.

✅ Strategy:

Retrieve allowed_sample_ids from ICAT for the current session.
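Use a chunked query strategy to avoid exceeding Elasticsearch's limit on the number of values in a terms query (the index.max_terms_count setting, 65,536 by default):
- Split the ID list into manageable chunks (e.g. 500–1000 IDs per query).
- Query Elasticsearch per chunk, requesting all matches for that chunk rather than the default 10 hits.
- Merge the per-chunk results in the backend and re-apply the requested sort order before pagination.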
Use caching to store allowed_sample_ids for the current session_id to avoid repeated ICAT lookups.
Ensure full consistency between ICAT permissions and Elasticsearch results.
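The caching step could look roughly like the sketch below. It assumes a time-based expiry so that permission changes in ICAT are picked up after a bounded delay. The class name, the 300-second default, and the fetch callback are all hypothetical.

```python
import time
from typing import Callable, Iterable


class AllowedIdsCache:
    """Per-session cache of allowed_sample_ids with a time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], Iterable[int]],   # hypothetical ICAT lookup
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, frozenset[int]]] = {}

    def get(self, session_id: str) -> frozenset[int]:
        now = self._clock()
        entry = self._entries.get(session_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                      # still fresh: skip the ICAT call
        ids = frozenset(self._fetch(session_id))
        self._entries[session_id] = (now, ids)
        return ids

    def invalidate(self, session_id: str) -> None:
        """Drop the entry, e.g. on logout or a known permission change."""
        self._entries.pop(session_id, None)
```

A shorter TTL keeps results closer to ICAT's current permissions at the cost of more lookups. Where that balance should sit is a decision for the implementation.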

Example internal logic:

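allowed_ids = get_allowed_sample_ids(session_id)  # ICAT lookup (cached per session)
results = []
for chunk in chunk_list(allowed_ids, 1000):
    part = es.search(
        index="samples",
        query={
            "bool": {
                "filter": [
                    {"terms": {"id": chunk}},
                    # other filters like date, keywords, etc.
                ]
            }
        },
        # The default size is 10; a chunk can match at most len(chunk) documents.
        size=len(chunk),
    )
    results.extend(part["hits"]["hits"])
# results is unsorted across chunks; sort it before pagination.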
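The example relies on a chunk_list helper and on merging the per-chunk hits before pagination. A minimal sketch of both follows. merge_hits is a hypothetical name, and it assumes the sort column is present in each hit's _source and compares correctly as stored (ISO date strings do).

```python
from typing import Any, Iterator, Sequence


def chunk_list(items: Sequence[int], size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of items with at most size elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def merge_hits(
    hits: list[dict[str, Any]],
    sort_column: str = "created_at",
    descending: bool = True,
) -> list[dict[str, Any]]:
    """Re-sort hits collected from several chunk queries into one ordering.

    The document _id is used as a tiebreaker so the order is deterministic.
    """
    return sorted(
        hits,
        key=lambda hit: (hit["_source"][sort_column], hit["_id"]),
        reverse=descending,
    )
```

Because chunk_list slices its input, a set of IDs would need to be converted to a list or tuple first.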

3. `/sample/initiate` Endpoint

Accepts all filters (including person_ids and sample_type_ids).
Builds the Elasticsearch query using those filters.
Applies allowed_sample_ids filtering (in chunks if large).
Returns a cursor_id representing the filtered query context.
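One way to back the cursor_id is a small server-side store that remembers the filters and the owning session. The sketch below is an assumption about that design (an in-memory store with hypothetical names). It deliberately does not store the allowed IDs, so they are re-resolved on every page request.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class QueryContext:
    session_id: str
    filters: dict[str, Any]
    last_sort_values: Optional[list[Any]] = None  # position of the last page served


class CursorStore:
    """In-memory cursor store; a shared cache would be needed across workers."""

    def __init__(self) -> None:
        self._contexts: dict[str, QueryContext] = {}

    def create(self, session_id: str, filters: dict[str, Any]) -> str:
        cursor_id = uuid.uuid4().hex
        self._contexts[cursor_id] = QueryContext(session_id, dict(filters))
        return cursor_id

    def get(self, cursor_id: str, session_id: str) -> QueryContext:
        context = self._contexts.get(cursor_id)
        # A cursor is only valid for the session that created it.
        if context is None or context.session_id != session_id:
            raise KeyError("unknown cursor for this session")
        return context
```

Binding the cursor to its session_id keeps one user from replaying another user's cursor.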

4. `/sample` Endpoint

Retrieves paginated results using Elasticsearch's search_after parameter or a stored query cursor.
Always respects the allowed_sample_ids constraint for the session.
Where needed, merges in relational metadata (e.g., display info from SQL).
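For the simpler case where the allowed IDs fit into a single terms filter, a page request using search_after could be assembled as sketched below. The function name and the field names are assumptions. The chunked case would additionally need the per-chunk results merged before a page is cut.

```python
from typing import Any, Optional


def build_page_request(
    allowed_ids: list[int],
    filter_clauses: list[dict[str, Any]],
    sort_column: str,
    sort_order: str,
    page_size: int,
    last_sort_values: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Build a search body for one page, always restricted to allowed_ids."""
    body: dict[str, Any] = {
        "query": {
            "bool": {
                # The authorization clause is added unconditionally.
                "filter": [{"terms": {"id": allowed_ids}}, *filter_clauses]
            }
        },
        # A unique tiebreaker field keeps search_after paging stable.
        "sort": [{sort_column: {"order": sort_order}}, {"id": {"order": "asc"}}],
        "size": page_size,
    }
    if last_sort_values is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = last_sort_values
    return body
```

For the next page, last_sort_values would be taken from the sort array of the final hit on the previous page.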

5. Investigation-Based Filtering

If a user filters by investigation:

Get sample IDs linked to that investigation from ICAT.
Intersect them with allowed_sample_ids.
Use this reduced set in the Elasticsearch terms query.

This ensures that investigation filters and authorization always work together securely.
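The intersection step can be sketched as a plain set operation. The function name is hypothetical, and both inputs are assumed to be sample IDs obtained from ICAT.

```python
from typing import Iterable


def restrict_to_investigation(
    allowed_ids: Iterable[int],
    investigation_sample_ids: Iterable[int],
) -> list[int]:
    """Return the IDs that are both allowed and linked to the investigation.

    The result is sorted so that chunking and caching are deterministic.
    """
    return sorted(set(allowed_ids) & set(investigation_sample_ids))
```

If the intersection is empty, the backend should return an empty result directly rather than fall back to a query without the ID restriction.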

🧠 Notes

The Elasticsearch index itself does not need to contain authorization data — access is enforced at query time using the allowed_sample_ids list.
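Chunked querying leaks no sample, because every chunk query is restricted by a terms filter on allowed IDs. It loses no sample only if each chunk's matches are retrieved in full, which means raising the request size above Elasticsearch's default of 10 hits.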
This approach prioritizes security and correctness over raw query speed.

✅ Acceptance Criteria

Search results only include samples within allowed_sample_ids.
No authorized samples are missed.
Performance is acceptable for 10k+ allowed IDs.
/sample/initiate and /sample endpoints preserve their existing response structure.
Supports all existing filters, including person_ids and sample_type_ids.

🏷️ Labels

backend feature elasticsearch authorization icat security search priority::critical

Edited Oct 23, 2025 by Mojeeb Rahman Sedeqi
