ETL

This Compressa platform module allows you to extract data from unstructured documents and split it into chunks that follow the document structure. This is important for efficient search / RAG and for passing content to an LLM.

In addition to the Layout API, a UI is available at https://compressa-api.mil-team.ru/ui-layout/

Let's build a request using an example document:

# pip install requests, if you don't have this library
import requests

# Download a PDF file as an example:
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
pdf_response = requests.get(pdf_url)

# Save the file
with open("dummy.pdf", "wb") as file:
    file.write(pdf_response.content)

# Send the file for chunking
compressa_url = "https://compressa-api.mil-team.ru/v1/layout"
headers = {
    "Authorization": "Your_Compressa_API_key",
    "accept": "application/json",
}

# Specify the path to our file
files = {"files": open("dummy.pdf", "rb")}

# Set chunking parameters
data = {
    "output_format": "application/json",
    "coordinates": "false",
    "strategy": "fast",
    "languages": ["rus", "eng"],
}

response = requests.post(
    compressa_url,
    headers=headers,
    files=files,
    data=data,
)

# Output document chunks in JSON format
print(response.json())
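
Once the request succeeds, you can iterate over the returned chunks. The sketch below continues the example above; the element field used ("text") is an assumption based on typical layout APIs, so inspect response.json() to confirm the exact schema of your deployment:

# A minimal post-processing sketch, continuing the request above.
# Assumption: the response is a list of elements, each with a "text" field;
# check the actual JSON for the exact keys.
chunks = response.json()

for i, chunk in enumerate(chunks):
    text = chunk.get("text", "")
    print(f"--- chunk {i}: {len(text)} characters ---")
    print(text[:200])  # preview the first 200 characters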

Below we'll discuss some of the settings in more detail; the full API schema is available here.

Document Processing Strategy (strategy)

fast: The "rule-based" strategy uses traditional NLP-based extraction techniques to quickly extract all text elements. The "fast" strategy is not recommended for files containing complex layouts, images, and other visual style elements.

hi_res: The "model-based" strategy determines the document layout. The advantage of the "hi_res" strategy is that it uses the document layout to obtain additional information about its elements. We recommend using this strategy if your use case requires high accuracy in classifying document elements.

hi_res_model: When using the "hi_res" strategy, you can choose either the yolox model or the yolox-tiny model for faster inference.

auto: The "auto" strategy automatically selects the document splitting method depending on its characteristics and other passed parameters.
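
For example, building on the request above, the model-based pipeline with the lighter detection model could be requested as shown below. The field names mirror the parameter names listed above and should be verified against the API schema:

# Sketch: switch the request above to the model-based strategy.
# "hi_res_model" mirrors the parameter name described above; confirm
# the exact field name in the API schema.
data = {
    "output_format": "application/json",
    "strategy": "hi_res",
    "hi_res_model": "yolox-tiny",  # or "yolox" for the full model
    "languages": ["rus", "eng"],
}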

Chunking Parameters (text splitting)

max_characters (default = 500) — Hard limit for the size of a single block. No block will exceed the specified number of characters. If an element itself exceeds this size, it will be split into two or more blocks using text splitting.

new_after_n_chars (default = max_characters) — "Soft" limit for block size. A block that has already reached this number of characters will not be expanded, even if the next element would fit without exceeding the hard limit. This parameter can be used together with max_characters to set a "preferred" size, for example: "I prefer blocks around 1000 characters, but it's better to take a block of 1500 (max_characters) than to resort to text splitting". This can be set as (..., max_characters=1500, new_after_n_chars=1000).

overlap (default = 0) — When using text splitting to divide a block that is too large, the specified number of characters from the end of the previous block is included as a prefix for the next one. This helps mitigate the effect of breaking a semantic unit represented by a large element.

overlap_all (default = False) — Also applies overlap between "regular" blocks, not just when splitting large elements. Since regular blocks are formed from whole elements, each of which has a clear semantic boundary, this option can add noise to normal blocks. You need to consider the specifics of your use case to decide if this parameter is right for you.
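
Put together, the chunking parameters are added to the same data dictionary as in the request example. The sketch below assumes the form-field names match the parameter names above; check the API schema for the exact spelling:

# Sketch: chunking settings for the request above.
# Field names are assumed to match the parameter names described above.
data = {
    "output_format": "application/json",
    "strategy": "fast",
    "languages": ["rus", "eng"],
    "max_characters": 1500,     # hard limit on chunk size
    "new_after_n_chars": 1000,  # preferred ("soft") chunk size
    "overlap": 100,             # characters carried over when splitting oversized elements
    "overlap_all": "false",     # do not overlap regular chunks
}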

To learn more about CompressaChunking with a practical example, check out our guide.