⚙️  Backend Feature Request — Integrate Elasticsearch Search (with Full ICAT Authorization Support via allowed_sample_ids)
🎯  Goal
Implement Elasticsearch search integration in SEPIA to improve scalability and search experience — while ensuring that all queries strictly respect ICAT-based authorization through allowed_sample_ids.
Every search result must be limited to samples that the current user is authorized to access in ICAT.
🔐  Authorization Principle
- SEPIA receives from ICAT the exact list of allowed_sample_ids for the current session.
- Elasticsearch must only return results from within this ID list.
- No samples outside allowed_sample_ids may ever appear in results, even if they match other filters.
⚙️  Required Filters
Elasticsearch search will continue to support all filters currently defined in /sample/initiate, extended with the new ones:
{
  "filters": {
    "only_user": true,
    "proposal_id": "PROJ-001",
    "search": "catalyst",
    "keyword": "XRD",
    "date_type": "created_at",
    "date_from": "2025-01-01",
    "date_to": "2025-12-31",
    "sort_column": "created_at",
    "sort_order": "desc",
    "person_ids": [1, 2, 3],
    "sample_type_ids": [5, 6]
  },
  "session_id": "ICAT_SESSION_ABC"
}
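As a rough illustration, the payload above could be translated into an Elasticsearch bool query along the following lines. This is a minimal sketch, not SEPIA's actual query builder: the function name `build_query` and the index field names (`proposal_id`, `keywords`, `sample_users.person_id`, `sample_type.id`) are assumptions. `only_user` and sorting are left out because they depend on how the session user and sort options are resolved.

```python
from typing import Any


def build_query(filters: dict[str, Any], allowed_chunk: list[int]) -> dict[str, Any]:
    """Translate the request filters into an Elasticsearch bool query.

    Index field names used here are assumptions, not SEPIA's actual mapping.
    """
    # The authorization filter is always present and is never optional.
    clauses: list[dict[str, Any]] = [{"terms": {"id": allowed_chunk}}]

    if filters.get("proposal_id"):
        clauses.append({"term": {"proposal_id": filters["proposal_id"]}})
    if filters.get("keyword"):
        clauses.append({"term": {"keywords": filters["keyword"]}})
    if filters.get("person_ids"):
        clauses.append({"terms": {"sample_users.person_id": filters["person_ids"]}})
    if filters.get("sample_type_ids"):
        clauses.append({"terms": {"sample_type.id": filters["sample_type_ids"]}})

    # Date range on the column named by date_type (e.g. created_at).
    date_range: dict[str, str] = {}
    if filters.get("date_from"):
        date_range["gte"] = filters["date_from"]
    if filters.get("date_to"):
        date_range["lte"] = filters["date_to"]
    if date_range:
        clauses.append({"range": {filters.get("date_type", "created_at"): date_range}})

    query: dict[str, Any] = {"bool": {"filter": clauses}}
    # Free-text search is scored, so it goes in "must" rather than "filter".
    if filters.get("search"):
        query["bool"]["must"] = [
            {"multi_match": {"query": filters["search"], "fields": ["*"]}}
        ]
    return query
```

Placing the `terms` clause on `id` first, unconditionally, keeps the authorization filter from being skipped when the user supplies no other filters.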
🧱  Implementation Details
1. Elasticsearch Integration
- Deploy and configure Elasticsearch for SEPIA’s samples index.
- Define mappings for:
  - sample
  - sample_type
  - sample_user
  - keyword
- Index all searchable metadata fields (including related sample users, types, and keywords).
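To make the mapping step more concrete, here is one possible shape for the samples index. Every field name and type below is an assumption chosen for illustration, and the sketch assumes an elasticsearch-py 8.x client, where `indices.create` accepts a `mappings` keyword.

```python
# Hypothetical mapping for the "samples" index; all field names and types
# are illustrative assumptions, not SEPIA's actual schema.
SAMPLES_MAPPING = {
    "properties": {
        "id": {"type": "long"},
        "name": {"type": "text"},
        "created_at": {"type": "date"},
        "proposal_id": {"type": "keyword"},
        "sample_type": {
            "properties": {
                "id": {"type": "long"},
                "name": {"type": "keyword"},
            }
        },
        "sample_users": {
            "properties": {
                "person_id": {"type": "long"},
                "name": {"type": "text"},
            }
        },
        "keywords": {"type": "keyword"},
    }
}


def create_samples_index(es) -> None:
    """Create the index if it is missing; `es` is an elasticsearch-py client."""
    if not es.indices.exists(index="samples"):
        es.indices.create(index="samples", mappings=SAMPLES_MAPPING)
```

Exact-match fields such as keywords use the `keyword` type so that `term` and `terms` filters compare whole values, while free-text fields use `text` so they are analyzed for full-text search.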
 
2. Authorization Handling
Because allowed_sample_ids may be a large list, the implementation must handle this efficiently while preserving correctness.
✅  Strategy:
- Retrieve allowed_sample_ids from ICAT for the current session.
- Use a chunked query strategy to avoid exceeding Elasticsearch’s terms size limit:
  - Split the ID list into manageable chunks (e.g. 500–1000 IDs per query).
  - Query Elasticsearch per chunk.
  - Merge results in the backend before pagination.
- Use caching to store allowed_sample_ids for the current session_id to avoid repeated ICAT lookups.
- Ensure full consistency between ICAT permissions and Elasticsearch results.
 
Example internal logic:
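def chunk_list(ids, size):
    """Yield consecutive slices of ids with at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# ICAT lookup, cached per session_id
allowed_ids = get_allowed_sample_ids(session_id)
results = []
for chunk in chunk_list(allowed_ids, 1000):
    part = es.search(
        index="samples",
        # Elasticsearch returns only 10 hits by default; request one per ID
        # in the chunk (assumes sample IDs are unique in the index).
        size=len(chunk),
        query={
            "bool": {
                "filter": [
                    {"terms": {"id": chunk}},
                    # other filters like date, keywords, etc.
                ]
            }
        },
    )
    results.extend(part["hits"]["hits"])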
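Because each chunk is queried separately, the combined hits are not globally ordered. The merge step before pagination could look roughly like this. It is a sketch under stated assumptions: `merge_and_paginate` is a hypothetical helper, and it assumes the sort column is present in each hit's `_source` with values that compare correctly (for example ISO 8601 date strings).

```python
from typing import Any


def merge_and_paginate(
    chunk_hits: list[list[dict[str, Any]]],
    sort_column: str = "created_at",
    sort_order: str = "desc",
    page: int = 0,
    page_size: int = 20,
) -> list[dict[str, Any]]:
    """Merge per-chunk hits, sort them globally, then return one page."""
    merged: dict[Any, dict[str, Any]] = {}
    for hits in chunk_hits:
        for hit in hits:
            # Key by document ID so a sample can never appear twice.
            merged[hit["_id"]] = hit

    ordered = sorted(
        merged.values(),
        # Secondary key on _id keeps the order stable between requests.
        key=lambda h: (h["_source"][sort_column], h["_id"]),
        reverse=(sort_order == "desc"),
    )
    start = page * page_size
    return ordered[start:start + page_size]
```

Sorting in the backend keeps pagination consistent across chunks, at the cost of holding all matching hits in memory for the request.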
3. /sample/initiate Endpoint
- Accepts all filters (including person_ids and sample_type_ids).
- Builds the Elasticsearch query using those filters.
- Applies allowed_sample_ids filtering (in chunks if large).
- Returns a cursor_id representing the filtered query context.
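One way to represent the query context behind a cursor_id is a small server-side registry keyed by a random ID. The sketch below is an assumption-laden illustration: `QueryContext`, `CursorStore`, and the 15-minute TTL are hypothetical, and a real deployment would more likely keep this in a shared cache than in process memory.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class QueryContext:
    session_id: str
    filters: dict[str, Any]
    allowed_sample_ids: list[int]
    created_at: float = field(default_factory=time.monotonic)


class CursorStore:
    """In-memory cursor registry (illustrative only)."""

    def __init__(self, ttl_seconds: float = 900.0) -> None:
        self._ttl = ttl_seconds
        self._contexts: dict[str, QueryContext] = {}

    def create(self, session_id: str, filters: dict[str, Any],
               allowed_sample_ids: list[int]) -> str:
        """Store the filtered query context and return its cursor_id."""
        cursor_id = uuid.uuid4().hex
        self._contexts[cursor_id] = QueryContext(
            session_id, dict(filters), list(allowed_sample_ids)
        )
        return cursor_id

    def get(self, cursor_id: str, session_id: str) -> QueryContext:
        """Return the context, rejecting expired cursors and foreign sessions."""
        ctx = self._contexts.get(cursor_id)
        if ctx is None or time.monotonic() - ctx.created_at > self._ttl:
            self._contexts.pop(cursor_id, None)
            raise KeyError("unknown or expired cursor")
        if ctx.session_id != session_id:
            raise PermissionError("cursor belongs to another session")
        return ctx
```

Binding the cursor to the session that created it means a leaked cursor_id cannot be replayed under a different ICAT session.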
4. /sample Endpoint
- Retrieves paginated results using Elasticsearch’s search_after parameter or a stored query cursor.
- Always respects the allowed_sample_ids constraint for the session.
- Optionally merges in relational metadata if required (e.g., display info from SQL).
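For the case where the allowed ID list fits into a single query, search_after pagination could be sketched as follows. The helper names `page_request` and `next_after` are hypothetical, and using `id` as the tiebreaker sort field is an assumption about the index. With chunked queries, pagination would instead run over the merged results in the backend.

```python
from typing import Any, Optional


def page_request(
    query: dict[str, Any],
    sort_column: str = "created_at",
    sort_order: str = "desc",
    page_size: int = 20,
    after: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Build search arguments for one page of results."""
    body: dict[str, Any] = {
        "query": query,
        "size": page_size,
        # search_after needs a deterministic order, so add a unique tiebreaker.
        "sort": [{sort_column: sort_order}, {"id": "asc"}],
    }
    if after is not None:
        body["search_after"] = after
    return body


def next_after(hits: list[dict[str, Any]]) -> Optional[list[Any]]:
    """Sort values of the last hit, to be passed as `after` for the next page."""
    return hits[-1]["sort"] if hits else None
```

With an elasticsearch-py 8.x client this would be used as `es.search(index="samples", **page_request(query, after=cursor))`, where `query` already contains the `terms` filter on the allowed IDs.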
 
5. Investigation-Based Filtering
If a user filters by investigation:
- Get sample IDs linked to that investigation from ICAT.
- Intersect them with allowed_sample_ids.
- Use this reduced set in the Elasticsearch terms query.
This ensures that investigation filters and authorization always work together securely.
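The intersection step is small enough to show in full. The function name `restrict_to_investigation` is hypothetical, and the sketch assumes both inputs are plain lists of integer sample IDs.

```python
def restrict_to_investigation(
    allowed_sample_ids: list[int],
    investigation_sample_ids: list[int],
) -> list[int]:
    """Return the IDs that are both authorized and linked to the investigation."""
    linked = set(investigation_sample_ids)
    # Iterating over the allowed list guarantees the result is a subset of it.
    return [sid for sid in allowed_sample_ids if sid in linked]
```

If the result is empty, the backend can return an empty page immediately without querying Elasticsearch.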
🧠  Notes
- The Elasticsearch index itself does not need to contain authorization data; access is enforced at query time using the allowed_sample_ids list.
- The chunked query merging loses or leaks no sample, provided each chunk query retrieves all of its matching hits (Elasticsearch returns only 10 hits per search by default).
- This approach prioritizes security and correctness over raw query speed.
 
✅  Acceptance Criteria
- Search results only include samples within allowed_sample_ids.
- No authorized samples are missed.
- Performance is acceptable for 10k+ allowed IDs.
- /sample/initiate and /sample endpoints preserve their existing response structure.
- Supports all existing filters, including person_ids and sample_type_ids.
🏷️  Labels
backend feature elasticsearch authorization icat security search priority::critical