Hi @cwperks,
Thank you for your response.
We have tested the “Redaction on the Ingestion Side” approach to handle PII and sensitive internal ratings, primarily due to the known complexities and limitations of Field Masking (Read-Time Security).
The Ingest Pipeline approach is robust, but our testing confirmed that the primary trade-off is sacrificing either auditability or storage efficiency.
Here is our simplified understanding of how Ingest Pipeline Redaction works and the unavoidable trade-offs we’ve identified. We welcome feedback and documentation pointers!
The “Write-Time” Security Model (Using Ingest Pipelines)
This model uses processors (gsub, script, set with copy_from, remove) to alter the data before it is saved.
- Ingest Pipeline as a Security Firewall
The sensitive data (e.g., a credit card number) hits the pipeline, which acts as a checkpoint. It immediately strips out or replaces the sensitive content with a generic placeholder (like XXX or XXXX-4321). This ensures the raw PII never enters the searchable index.
- Searchability is Inherently Secure
Because the searchable field only ever contains the safe, masked tokens (XXX, 4321), the inverted index is built entirely on non-sensitive data. When a user queries for the placeholder, the search hits the index directly, just like a normal query. Crucially, a search for the original sensitive term (e.g., the full 16-digit card number) correctly yields zero results. A minimal pipeline sketch follows this list.
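For reference, here is a minimal sketch of the kind of pipeline we tested. The pipeline name, field name, and regex are illustrative assumptions, not production values; the gsub processor rewrites the value before it is ever indexed:

```json
PUT _ingest/pipeline/pii_redact
{
  "description": "Mask credit card numbers at write time (illustrative sketch)",
  "processors": [
    {
      "gsub": {
        "description": "Keep only the last four digits of the card number",
        "field": "message",
        "pattern": "\\b\\d{4}-\\d{4}-\\d{4}-(\\d{4})\\b",
        "replacement": "XXXX-XXXX-XXXX-$1"
      }
    }
  ]
}
```

Indexing through this pipeline (via `?pipeline=pii_redact` on the request, or `index.default_pipeline` on the index) means the inverted index only ever sees the XXXX-XXXX-XXXX-4321 form, which is why searches for the raw number return nothing.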
Observed Trade-Offs (The Two Main Paths)
We found two paths forward, each forcing a trade-off between compliance (auditability) and storage efficiency.
Path 1: Maintain Storage Efficiency, Sacrifice Auditability
Implementation: The pipeline copies the sensitive data, masks the copy, and then permanently deletes the original field from the document (see the sketch after this list).
Trade-off:
Pro: Highly storage efficient, as only one document is indexed.
Con (The Big Risk): The original, raw data is permanently lost from OpenSearch. If your auditors or compliance team requires the ability to query the unmasked data using their privileged OpenSearch credentials, this approach fails. You must rely on an external audit log (S3, etc.) for compliance.
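A minimal sketch of Path 1, assuming a hypothetical card_number field; the set processor’s copy_from option does the copying, and remove drops the raw value for good:

```json
PUT _ingest/pipeline/pii_redact_destructive
{
  "description": "Copy, mask, then permanently drop the original field (illustrative sketch)",
  "processors": [
    { "set":    { "description": "Copy the raw value",        "field": "card_masked", "copy_from": "card_number" } },
    { "gsub":   { "description": "Mask the copy",             "field": "card_masked", "pattern": "\\d{4}-\\d{4}-\\d{4}-(\\d{4})", "replacement": "XXXX-XXXX-XXXX-$1" } },
    { "remove": { "description": "Delete the original field", "field": "card_number" } }
  ]
}
```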
Path 2: Maintain Auditability, Sacrifice Storage Efficiency
Implementation: The ingestion layer duplicates the document and sends it to two separate indices:
logs_sensitive (Original, unmasked document).
logs_masked (Document passed through the pipeline and masked).
Trade-off:
Pro: Perfect separation. Highly privileged roles can query logs_sensitive for audits, while analysts can only query logs_masked for daily work (a role sketch follows this list). This is the most compliant method.
Con: Duplicate Storage. You are storing two full copies of every document.
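The separation in Path 2 is enforced with ordinary index permissions. A minimal sketch using the Security plugin REST API, with hypothetical role names:

```json
PUT _plugins/_security/api/roles/analyst_masked_only
{
  "index_permissions": [
    { "index_patterns": ["logs_masked*"], "allowed_actions": ["read"] }
  ]
}

PUT _plugins/_security/api/roles/auditor_sensitive
{
  "index_permissions": [
    { "index_patterns": ["logs_sensitive*"], "allowed_actions": ["read"] }
  ]
}
```

Analysts are mapped only to analyst_masked_only, so even a direct query against logs_sensitive is denied for them.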
Request: Could the documentation be expanded to clearly outline the failure/deny behavior of Field Masking? If a Field Masking rule fails to parse or execute, is the default action to apply an implicit DLS filter and deny all documents? Understanding this hidden filter is critical for troubleshooting security configurations.
Thanks for any insights!