Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.18
Describe the issue:
I am trying to pass a PDF through the attachment processor and then into the text chunking processor, but the chunking step does not return any results.
Configuration:
OpenSearch 2.18 with the ingest-attachment plugin
Relevant Logs or Screenshots:
Ingest pipeline config:
{
  "description": "chunking pipeline for attachments",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "attachment"
      },
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n",
            "max_chunk_limit": 500
          }
        },
        "field_map": {
          "attachment.content": "passage_chunk"
        }
      }
    }
  ]
}
Indexing using the bulk API:
POST _bulk?pipeline=attachment-chunking
{ "index": { "_index": "resume", "_id": 9 } }
{ "data": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K" }
The data field is the Base64 encoding of the following passage:
This is a test document.
First paragraph.
Second paragraph.
Third paragraph.
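The encoding can be checked outside OpenSearch. The snippet below is a minimal sketch that decodes the data value from the bulk request using only the Python standard library. The variable names are illustrative. It shows that the lines are separated by Windows-style "\r\n" sequences with a blank line between paragraphs.

```python
import base64

# The "data" value sent in the bulk request.
encoded = (
    "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0K"
    "U2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K"
)

# Decode to bytes, then to text (the payload is plain ASCII).
decoded = base64.b64decode(encoded).decode("ascii")

# repr() makes the \r\n line endings visible.
print(repr(decoded))
```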
Result:
{
  "_index": "resume",
  "_id": "9",
  "_score": 1,
  "_source": {
    "data": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K",
    "attachment": {
      "content_type": "text/plain; charset=windows-1252",
      "language": "en",
      "content": "This is a test document.\r\n\r\nFirst paragraph.\r\n\r\nSecond paragraph.\r\n\r\nThird paragraph.",
      "content_length": 88
    },
    "passage_chunk": []
  }
}
The attachment content is extracted correctly, but passage_chunk is empty: no chunks are produced.
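As a rough illustration of the expected outcome, the sketch below splits the extracted attachment.content string on "\n" in plain Python. This is ordinary string splitting, not the text_chunking processor's actual implementation, so the exact chunk boundaries (for example whether the delimiter is kept, or how blank lines are handled) are an assumption. It only shows that the content contains several delimiter-separated pieces, far fewer than the max_chunk_limit of 500.

```python
# attachment.content as reported in the indexed document.
content = (
    "This is a test document.\r\n\r\nFirst paragraph.\r\n\r\n"
    "Second paragraph.\r\n\r\nThird paragraph."
)

# Naive split on the configured delimiter (assumption: not the processor's logic).
pieces = content.split("\n")

# Drop pieces that are empty once the leftover "\r" is stripped.
non_blank = [p.strip() for p in pieces if p.strip()]

print(len(pieces))
for chunk in non_blank:
    print(chunk)
```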
Could you please let me know how to get the chunking to work correctly?