Platform Installation
Available Services
- compressa/compressa-pod:0.3.10 - base Compressa image for model inference
- compressa/compressa-entrypoint:0.3.10 - central part of Compressa, the dispatcher
- compressa/compressa-autotest:0.3.10 - model autotesting service
- compressa/compressa-layout-gpu:0.3.10 - ETL service
- compressa/compressa-ui:0.3.10 - chat and ETL UI
- nginx:latest - for connecting to an external address
- opensearchproject/opensearch:2.13.0 - ELK
- opensearchproject/opensearch-dashboards:2.13.0 - ELK dashboard
- opensearchproject/data-prepper:2.6.0 - ELK data storage
- compressa/compressa-auth:0.3.10 - authorization
The images can be downloaded via docker pull ...
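For example, after authenticating (see below), the core images can be pulled like this:
docker pull compressa/compressa-pod:0.3.10
docker pull compressa/compressa-entrypoint:0.3.10
docker pull compressa/compressa-ui:0.3.10
docker pull nginx:latest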
Docker Authentication
Authenticate in Docker with your token:
export PAT=<TOKEN>
echo $PAT | docker login -u compressa --password-stdin
Basic Setup
Terms
- central pod (dispatcher) - service managing platform components and proxying requests to models
- pod - platform component, universal container based on compressa-pod
- Each pod has 2 ports - port 5100 (API) and port 5000 (model port). API and model can work independently of each other
- state (state.yaml) - updatable file of pod states
- config - configuration file, editable by user with specification of task types, models and devices for pods
First, clone the repository with configuration:
git clone git@github.com:compressa-ai/compressa-deploy.git
cd platform
Storage
Set access permissions on the working directories: chmod -R 777 %folder_name%
By default, containers use the following paths:
- RESOURCES_PATH - ./data/compressa
- HF_HOME - ./data/compressa
- platform/build - folder for configuration and state files
- platform/test_results - autotest results
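A minimal preparation sketch, assuming the default paths above and that the commands are run from the platform folder:
mkdir -p ./data/compressa ./build ./test_results
chmod -R 777 ./data/compressa ./build ./test_results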
Environment Variable Configuration:
- AUTODEPLOY - automatic platform build. True is recommended for docker compose
- UI_LOGIN - whether authorization is required for UI access (finetune, chat, layout)
- RUN_TESTS - automatic test launch. True is recommended for docker compose
- RESOURCES_PATH - path to model files, similar to single Compressa instance
- HF_HOME - path to cache, similar to single Compressa instance
- COMPRESSA_API_KEY - your key
- POD_NAME - project alias, will be used in Compressa container names
- DISPATCHER_HOST - name of central Compressa container
- DISPATCHER_PORT - port of the central container
- NGINX_LISTEN_PORT - nginx port inside docker network, used for running autotests
- NETWORK - docker network name
- PROJECT - project name, will be used in service container names
- LOG_LEVEL - logging level
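A minimal .env sketch; the values below are illustrative assumptions and should be adjusted to your setup:
# example values only - adjust to your environment
AUTODEPLOY=True
UI_LOGIN=False
RUN_TESTS=True
RESOURCES_PATH=./data/compressa
HF_HOME=./data/compressa
COMPRESSA_API_KEY=<your_key>
POD_NAME=compressa
DISPATCHER_HOST=compressa-dispatcher
DISPATCHER_PORT=5100
NGINX_LISTEN_PORT=80
NETWORK=test_network
PROJECT=compressa-platform
LOG_LEVEL=INFO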
Nginx Port Configuration
By default, services use port 8080. It can be configured in the docker-compose.yaml file and in nginx.conf:
nginx:
image: nginx:stable-alpine3.19-slim
ports:
- "8080:80" <----
...
Starting Dispatcher and Service Containers
- create docker network
docker network create test_network
- edit the config to determine which models will be used
deploy/platform/build/config.yaml
- go to the folder
cd deploy/platform
- initialize environment variables
set -a
source .env
set +a
- start the dispatcher and service containers
docker compose up
- the dispatcher will automatically generate a compose file for starting Compressa
deploy/platform/build/auto.yaml
- start the generated file
docker compose -f ./build/auto.yaml up
- the dispatcher and all models will be available through nginx at http://localhost:8080/
- the ELK dashboard will be available at http://localhost:5601/
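As a quick check that nginx and the dispatcher are reachable, you can query the published port (the URL matches the defaults above; adjust it if you changed the port mapping):
curl -I http://localhost:8080/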
You can also run individual Compressa elements independently, as described in compressa-local-llm.
To do this, specify the dispatcher address in the environment variables of each pod, make sure all containers are on the same docker network, and set the dispatcher environment variables AUTODEPLOY and RUN_TESTS to False.
All containers will then automatically connect to the dispatcher.
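A hedged sketch of starting such a standalone pod, assuming the pod reads the dispatcher address from DISPATCHER_HOST and DISPATCHER_PORT and that the remaining flags match your compressa-local-llm setup (adjust the image, ports and GPU options as needed):
# illustrative only: the environment variable names and flags are assumptions
docker run --rm --gpus all \
  --network test_network \
  -e DISPATCHER_HOST=compressa-dispatcher \
  -e DISPATCHER_PORT=5100 \
  -p 5100:5100 -p 5000:5000 \
  compressa/compressa-pod:0.3.10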
Example Configuration File (6 services)
settings:
  base_image: "compressa/compressa-pod:0.3.10"
pods:
  - engine: vllm
    task: llm
    model_name: "Qwen/Qwen3-14B"
    gpu_id: "0"
    gpu_memory_utilization: 0.9
  - engine: vllm
    task: embeddings
    model_name: "Salesforce/SFR-Embedding-Mistral"
    gpu_id: "1"
    gpu_memory_utilization: 0.7
  - engine: infinity
    task: rerank
    model_name: "mixedbread-ai/mxbai-rerank-large-v1"
    gpu_id: "1"
  - engine: asr_backend
    task: asr
    model_name: "t-tech/T-one"
    gpu_id: "1"
  - engine: coqui
    task: tts
    model_name: "compressa-ai/XTTS-v2"
    gpu_id: "1"
  - engine: llamacpp
    task: llm
    model_name: "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
    quantization: "Q8_0"
How the Dispatcher Works
Manual Mode (Autodeploy=False)
| User Action | What Happens |
|---|---|
| Start central pod | The dispatcher and other auxiliary containers (nginx, auth, ...) start. If there are already running pods connected to the dispatcher, the following steps are not executed. |
| POST create-pods | Read the state from state.yaml if it exists, otherwise from the config; if the config differs from the state, use the config; create a docker compose file for the pods. |
| Start docker-compose | The required number of pods start without deployment. |
| POST deploy-pods | Read the state from state.yaml if it exists, otherwise from the config; if the config differs from the state, use the config; send the corresponding POST requests to deploy the pods. |
| Running | Wait until all models are deployed (have running status); create or update the state. |
Automatic Mode (Autodeploy=True)
| User Action | What Happens |
|---|---|
| Start central pod | The dispatcher and other auxiliary containers (nginx, auth, etc.) start; the dispatcher generates a compose file for the pods and waits for the pods to become ready. |
| Start compose | |
| Running | Wait until all models are deployed (have running status); create or update the state; start autotests. |
When the service is running, the user can redeploy models by changing the config and sending a POST request to deploy-pods. Only models that have changed will be redeployed; the previous deployment is completed before the new one starts. Limitation: the number of pods cannot be changed online; the service must be restarted to do that.
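A hedged sketch of triggering the redeploy through nginx; the endpoint path is an assumption derived from the action name above and may differ in your installation:
# path is an assumption based on the deploy-pods action described above
curl -X POST "http://localhost:8080/deploy-pods"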
Error Handling
| Event | System Behavior |
|---|---|
| Dispatcher crash | Pods keep sending reports to the dispatcher (unsuccessfully); request proxying is unavailable, but each pod remains available locally on ports 5100 and 5000. When the dispatcher restarts, it first checks for working pods; if they are available and in running status, no redeployment happens and the connection is restored |
| One or all pods crash (container stop) | If no report is received from a pod within the timeout, the dispatcher marks the pod as inactive and periodically tries to connect and redeploy the model. When the pod restarts, deployment happens and functionality is restored |
| Model crash on one or several pods (port 5000 unavailable, but port 5100 available), e.g. out of memory followed by model termination on the engine side, or a termination request sent directly to the pod bypassing the dispatcher | The pod is marked as "model unavailable". After the model-unavailability timeout (10 minutes) the model is redeployed |
| Unsuccessful deployment (failed status). Usually a crash at the very beginning, e.g. due to an incorrect config or out of memory. If the failure occurs later, it is handled inside the pod | The dispatcher makes 3 attempts to redeploy, then sends a request to restart the container. After the restart it makes 3 more attempts; if success is still not achieved, it restarts the container again and marks the pod as problematic. After that, if the config has not changed, nothing is done and the user is informed that a fix is needed; if the config has changed, the pod is redeployed |
| All pod ports are open, but the model does not work (500 errors) | When the counter of responses with a 500 error exceeds the configured threshold, the pod is redeployed (the counter is off by default and can be enabled via the MAX_SERVER_ERRORS environment variable). The counter is reset on a successful response or after redeployment |
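For example, to enable the 500-error counter, set the threshold in the dispatcher environment; the value below is an illustrative assumption:
# illustrative threshold for the 500-error counter
MAX_SERVER_ERRORS=5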
Usage
After all modules are successfully installed and the models are started, they are ready for use. How to send requests via the REST API or the Python client is described here.
Testing
Successful deployment is checked by the built-in tests (reports are printed to the console and saved to the test_reports folder). Model operation can also be checked with the following requests:
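The examples below use $SERVER_NAME and $PORT. With the default nginx setup from this page they can be set as follows (adjust if you changed the port):
export SERVER_NAME=http://localhost
export PORT=8080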
LLM Model
curl -X 'POST' "$SERVER_NAME:$PORT/v1/chat/completions" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' -d '{
"model": "Compressa-LLM",
"messages": [{"role": "user", "content": "Write a bedtime story about a kind artificial intelligence!"}],
"stream": false
}'
Embeddings Model
curl -X 'POST' "$SERVER_NAME:$PORT/v1/embeddings" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' -d '{
"model": "Compressa-Embedding",
"input": "Document 1",
"encoding_format": "float"
}'
Rerank Model
curl -X POST "$SERVER_NAME:$PORT/v1/rerank" -H "accept: application/json" -H "Content-Type: application/json" -d '{
"model": "Compressa-ReRank",
"query": "Query?",
"documents": [
"document 1",
"document 2",
"document 3"
],
"return_documents": false
}'
To work with the InsightStream RAG module, you will need to perform one more step.