How can I chunk a PDF with the ingest-attachment and text chunking processors?

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.18

Describe the issue:
I am trying to pass a PDF through the text chunking processor but am not able to get any chunks in the output.

Configuration:
OS 2.18 with ingest-attachment plugin

Relevant Logs or Screenshots:
Ingest pipeline Config:

{
    "description": "chunking pipeline for attachments",
    "processors": [
        {
            "attachment": {
                "field": "data",
                "target_field": "attachment"
            },
            "text_chunking": {
                "algorithm": {
                    "delimiter": {
                        "delimiter": "\n",
                        "max_chunk_limit": 500
                    }
                },
                "field_map": {
                    "attachment.content": "passage_chunk"
                }
            }
        }
    ]
}

Indexing using the bulk API:

POST _bulk?pipeline=attachment-chunking

{ "index": { "_index": "resume", "_id": 9 } }
{ "data": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K" }

The data field is the base64-encoded form of the following passage:
This is a test document.

First paragraph.

Second paragraph.

Third paragraph.
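For anyone reproducing this, the payload can be verified with a quick Python check (a hypothetical snippet, not part of the pipeline itself):

```python
import base64

# The exact base64 string from the bulk request above.
payload = "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K"

# Decode back to plain text to confirm what the attachment processor will see.
decoded = base64.b64decode(payload).decode("utf-8")
print(decoded)
```

Note that the decoded text uses Windows-style `\r\n` line endings, which is worth keeping in mind when choosing a delimiter for the chunking step.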

Result:

  {
    "_index": "resume",
    "_id": "9",
    "_score": 1,
    "_source": {
      "data": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K",
      "attachment": {
        "content_type": "text/plain; charset=windows-1252",
        "language": "en",
        "content": "This is a test document.\r\n\r\nFirst paragraph.\r\n\r\nSecond paragraph.\r\n\r\nThird paragraph.",
        "content_length": 88
      },
      "passage_chunk": []
    }
  }

I am not able to see any chunks in passage_chunk.

Could you please let me know how to get the chunking to work correctly?

Can you try wrapping the text_chunking processor in its own curly braces? Each processor should be a separate object in the processors array.
For example,

  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk1"
        }
      }
    }
  ]

I need two processors in a single pipeline: one for the attachment and another for chunking. The processors field is an array where I have defined attachment and text_chunking as two different keys.

Yes, if you need two sequential processors in a single pipeline, make sure each processor is enclosed in its own pair of curly braces:

"processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk1"
        }
      }
    },
    {
      # another processor
    }
  ]
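One way to sanity-check the structure before sending it is to parse the pipeline body and confirm that each entry in processors is an object with exactly one processor key (a hypothetical snippet; the field names just mirror the configs above):

```python
import json

pipeline_body = """
{
  "description": "chunking pipeline for attachments",
  "processors": [
    {"attachment": {"field": "data", "target_field": "attachment"}},
    {"text_chunking": {
        "algorithm": {"delimiter": {"delimiter": "\\n\\n"}},
        "field_map": {"attachment.content": "passage_chunk"}
    }}
  ]
}
"""

pipeline = json.loads(pipeline_body)
for processor in pipeline["processors"]:
    # Each processor object must define exactly one processor type;
    # an object holding both "attachment" and "text_chunking" would fail here.
    assert len(processor) == 1, f"malformed processor entry: {list(processor)}"
print(len(pipeline["processors"]))
```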

I added the braces for the two sequential processors.

{
  "description": "chunking pipeline for attachments",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "attachment"
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "attachment.content": "passage_chunk"
        }
      }
    }
  ]
}

I am still not getting the chunking output:

{
  "took": 412,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "resume",
        "_id": "13",
        "_score": 1,
        "_source": {
          "data": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K",
          "attachment": {
            "content_type": "text/plain; charset=windows-1252",
            "language": "en",
            "content": "This is a test document.\r\n\r\nFirst paragraph.\r\n\r\nSecond paragraph.\r\n\r\nThird paragraph.",
            "content_length": 88
          },
          "passage_chunk": []
        }
      }
    ]
  }
}

@yeonghyeonKo any suggestions?

"processors" : [

{
  "attachment" : {
    "field" : "document_content"
  }
},

{
  "set" : {
    "field": "document_content",
    "value": "{{ _source.attachment.content }}"
  }
},

{
  "remove": {
    "field": "attachment"
  }
},

{
  "text_chunking": {
    "algorithm": {
      "delimiter": {
        "delimiter": "\n\n",
        "max_chunk_limit" : -1
      }
    },
    "field_map": {
      "document_content": "document_words"
    }
  }
},

{
  "remove": {
    "field": "document_content"
  }
},

{
  "text_embedding": {
    "model_id": "RALxB5QBEa9zn_xSoFvO",
    "field_map": {
      "document_words": "document_words_embedding"
    }
  }
},

{
  "text_chunking": {
    "algorithm": {
      "fixed_token_length": {
        "token_limit": 200,
        "overlap_rate": 0.2,
        "tokenizer": "standard"
      }
    },
    "field_map": {
      "document_words": "document_chunked_words"
    }
  }
},

{
  "text_embedding": {
    "model_id": "RALxB5QBEa9zn_xSoFvO",
    "field_map": {
      "document_chunked_words": "document_chunked_words_embeding"
    }
  }
}

]

Try this.
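As a side note, the Simulate Pipeline API is handy for debugging this kind of pipeline without indexing anything, since it shows each processor's output (request shown against the attachment-chunking pipeline from the bulk call above; the source field name is just assumed to match the pipeline):

```
POST _ingest/pipeline/attachment-chunking/_simulate
{
  "docs": [
    {
      "_source": {
        "document_content": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQuDQoNCkZpcnN0IHBhcmFncmFwaC4NCg0KU2Vjb25kIHBhcmFncmFwaC4NCg0KVGhpcmQgcGFyYWdyYXBoLg0K"
      }
    }
  ]
}
```

Adding `?verbose=true` shows the document after every processor, which makes it easy to see exactly where a field goes missing.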

Have you tried converting the PDF into text first? As far as I know, the text chunking processor only supports text input.