Platform Installation

Available Services

  • compressa/compressa-pod:0.3.10 - base Compressa image for model inference
  • compressa/compressa-entrypoint:0.3.10 - central part of Compressa, dispatcher
  • compressa/compressa-autotest:0.3.10 - model autotesting service
  • compressa/compressa-layout-gpu:0.3.10 - ETL service
  • compressa/compressa-ui:0.3.10 - chat and ETL UI
  • nginx:latest - for access via an external address
  • opensearchproject/opensearch:2.13.0 - ELK
  • opensearchproject/opensearch-dashboards:2.13.0 - ELK dashboard
  • opensearchproject/data-prepper:2.6.0 - ELK data ingestion pipeline
  • compressa/compressa-auth:0.3.10 - authorization

These images can be downloaded via docker pull.
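For example, to pull the core images listed above:

docker pull compressa/compressa-pod:0.3.10
docker pull compressa/compressa-entrypoint:0.3.10
docker pull compressa/compressa-ui:0.3.10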

Docker Authentication

Authenticate in Docker with your token:

export PAT=<TOKEN>
echo $PAT | docker login -u compressa --password-stdin

Basic Setup

Terms

  • central pod (dispatcher) - service that manages the platform components and proxies requests to the models
  • pod - a platform component; a universal container based on compressa-pod. Each pod has 2 ports: port 5100 (API) and port 5000 (model). The API and the model can work independently of each other
  • state (state.yaml) - continuously updated file with the pod states
  • config - configuration file, edited by the user, that specifies task types, models, and devices for the pods

First, clone the repository with configuration:

git clone git@github.com:compressa-ai/compressa-deploy.git
cd platform

Storage

Set access permissions on the working directories:

chmod -R 777 %folder_name%

By default, the containers use the following paths (an example of preparing them is shown after this list):

  • RESOURCES_PATH - ./data/compressa
  • HF_HOME - ./data/compressa
  • platform/build - folder for configuration and state files
  • platform/test_results - autotest results
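A minimal sketch of preparing these directories before the first start, assuming you run it from the platform folder of the cloned repository; adjust the paths if your layout differs:

mkdir -p ./data/compressa ./build ./test_results
chmod -R 777 ./data/compressa ./build ./test_results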

Environment Variable Configuration (an example .env is shown after the list below):

  • AUTODEPLOY - automatic platform build. True is recommended for docker compose
  • UI_LOGIN - whether authorization is required for UI access (finetune, chat, layout)
  • RUN_TESTS - automatic test launch. True is recommended for docker compose
  • RESOURCES_PATH - path to model files, similar to single Compressa instance
  • HF_HOME - path to cache, similar to single Compressa instance
  • COMPRESSA_API_KEY - your key
  • POD_NAME - project alias, will be used in Compressa container names
  • DISPATCHER_HOST - name of central Compressa container
  • DISPATCHER_PORT - port of the central container
  • NGINX_LISTEN_PORT - nginx port inside docker network, used for running autotests
  • NETWORK - docker network name
  • PROJECT - project name, will be used in service container names
  • LOG_LEVEL - logging level
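A sketch of a possible .env for a local single-host install; the variable names come from the list above, while the specific values (key, container names, ports, network and project names) are placeholders to replace with your own:

AUTODEPLOY=True
RUN_TESTS=True
UI_LOGIN=False
RESOURCES_PATH=./data/compressa
HF_HOME=./data/compressa
COMPRESSA_API_KEY=<your key>
POD_NAME=compressa
DISPATCHER_HOST=compressa-dispatcher
DISPATCHER_PORT=5100
NGINX_LISTEN_PORT=8080
NETWORK=test_network
PROJECT=compressa
LOG_LEVEL=INFO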

Nginx Port Configuration

By default, services use port 8080. It can be configured in the docker-compose.yaml file and in nginx.conf:

  nginx:
    image: nginx:stable-alpine3.19-slim
    ports:
      - "8080:80"   # <---- change the external port here
    ...
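For example, to expose the platform on port 9090 instead (an illustrative value), change the mapping above to "9090:80" and recreate only the nginx service:

docker compose up -d nginx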

Starting Dispatcher and Service Containers

  • create a docker network: docker network create test_network
  • edit the config to determine which models will be used: deploy/platform/build/config.yaml
  • go to the folder: cd deploy/platform
  • initialize the environment variables:
    set -a
    source .env
    set +a
  • start the dispatcher and service containers: docker compose up
  • the dispatcher will automatically generate a compose file for starting Compressa: deploy/platform/build/auto.yaml
  • start the generated file: docker compose -f ./build/auto.yaml up
  • the dispatcher and all models will be available through nginx at http://localhost:8080/
  • the ELK dashboard will be available at http://localhost:5601/

The full sequence is also shown as a single script below.
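A minimal start-up script, assuming the default paths and the .env described above (containers are started in detached mode here):

# create the docker network used by all containers
docker network create test_network

# go to the platform folder and load the environment variables
cd deploy/platform
set -a
source .env
set +a

# start the dispatcher and service containers
docker compose up -d

# once the dispatcher has generated build/auto.yaml, start the pods
docker compose -f ./build/auto.yaml up -d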

You can also run individual Compressa elements independently, as described in compressa-local-llm. To do this, specify the dispatcher address in the environment variables of each pod, make sure all containers are in the same docker network, and set the dispatcher environment variables AUTODEPLOY and RUN_TESTS to False. All containers will then automatically connect to the dispatcher.
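As an illustration only, a standalone pod started by hand could be attached to a running dispatcher roughly like this; the environment variable names are taken from the list above, but whether compressa-pod accepts exactly these options when run directly is an assumption to verify against compressa-local-llm:

docker run --rm \
  --network test_network \
  -e DISPATCHER_HOST=compressa-dispatcher \
  -e DISPATCHER_PORT=5100 \
  compressa/compressa-pod:0.3.10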

Example Configuration File (6 services)

settings:
  base_image: "compressa/compressa-pod:0.3.10"

pods:
  - engine: vllm
    task: llm
    model_name: "Qwen/Qwen3-14B"
    gpu_id: "0"
    gpu_memory_utilization: 0.9
  - engine: vllm
    task: embeddings
    model_name: "Salesforce/SFR-Embedding-Mistral"
    gpu_id: "1"
    gpu_memory_utilization: 0.7
  - engine: infinity
    task: rerank
    model_name: "mixedbread-ai/mxbai-rerank-large-v1"
    gpu_id: "1"
  - engine: asr_backend
    task: asr
    model_name: "t-tech/T-one"
    gpu_id: "1"
  - engine: coqui
    task: tts
    model_name: "compressa-ai/XTTS-v2"
    gpu_id: "1"
  - engine: llamacpp
    task: llm
    model_name: "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
    quantization: "Q8_0"

How the Dispatcher Works

Manual Mode (AUTODEPLOY=False)

User Action | What Happens
Start central pod | The dispatcher and the other auxiliary containers (nginx, auth, ...) start. If there are already running pods connected to the dispatcher, the following steps are not executed.
POST create-pods | Read the state from state.yaml if it exists, otherwise from the config; if the config differs from the state, use the config. Create a docker compose file for the pods.
Start docker compose | The required number of pods start, without deployment.
POST deploy-pods | Read the state from state.yaml if it exists, otherwise from the config; if the config differs from the state, use the config. Send the corresponding POST requests to deploy the pods.
Running | Wait until all models are deployed (reach running status); create or update the state.

Automatic Mode (AUTODEPLOY=True)

User Action | What Happens
Start central pod | The dispatcher and the other auxiliary containers (nginx, auth, etc.) start; it generates a compose file for the pods and waits for the pods to become ready.
Start compose | The pods start and are deployed automatically.
Running | Wait until all models are deployed (reach running status); create or update the state; start the autotests.

While the service is running, the user can redeploy models by changing the config and sending a POST request to deploy-pods. Only the models that have changed will be redeployed; the previous deployment is completed before the new one starts. Limitation: the number of pods cannot be changed online; to change it, the service must be restarted.
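A hedged sketch of such a request; the deploy-pods endpoint name comes from the tables above, but the exact URL path, and whether the call goes through nginx or directly to the dispatcher port, are assumptions to verify against your deployment:

curl -X POST "http://localhost:8080/deploy-pods" \
  -H 'accept: application/json'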

Error Handling

Event | System Behavior
Dispatcher crash | The pods keep sending reports to the dispatcher (unsuccessfully); request proxying is unavailable, but each pod remains available locally on ports 5100 and 5000. When the dispatcher restarts, it first checks for working pods; if they are available and in running status, no redeployment happens and the connection is restored.
One or all pods crash (container stop) | When no report is received from a pod within the timeout, the dispatcher marks the pod as inactive and periodically tries to connect and redeploy the model. When the pod restarts, deployment happens and functionality is restored.
Model crash on one or several pods (port 5000 unavailable, but port 5100 available), e.g. out of memory with subsequent model termination by the engine, or a termination request sent directly to the pod bypassing the dispatcher | The pod is marked as "model unavailable". After the model-unavailability timeout (10 minutes), redeployment happens.
Unsuccessful deployment (failed status), usually a crash at the very start, e.g. due to an incorrect config or out of memory; a failure that occurs later is handled inside the pod | Make 3 attempts to redeploy, then send a request to restart the container. After the restart, make 3 more attempts; if success is still not achieved, restart the container again and mark the pod as problematic. Then, if the config has not changed, do nothing and inform the user that a fix is needed; if it has changed, redeploy.
All pod ports are open, but the model does not work (500 errors) | When the counter of 500-error responses exceeds the configured threshold, the pod is redeployed (the counter is off by default and can be enabled via the MAX_SERVER_ERRORS environment variable). On a successful response or after redeployment, the counter is reset.
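For example, to enable the 500-error counter, MAX_SERVER_ERRORS can be added to the environment; the value 5 below is only an illustration, and which service reads it (assumed here to be the dispatcher) should be checked:

# in .env, alongside the other variables
MAX_SERVER_ERRORS=5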

Usage

After all modules are successfully installed and the models have started, they are ready for use. How to send requests via the REST API or the Python client is described here.

Testing

Successful deployment is verified by the built-in tests (reports are printed to the console and saved to the test_reports folder); model operation can also be checked with the following requests:
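The requests below use the SERVER_NAME and PORT variables; for a default local install behind nginx they can be set as follows (localhost and 8080 are assumptions based on the defaults described above):

export SERVER_NAME=http://localhost
export PORT=8080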

LLM Model

curl -X 'POST' "$SERVER_NAME:$PORT/v1/chat/completions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Compressa-LLM",
    "messages": [{"role": "user", "content": "Write a bedtime story about a kind artificial intelligence!"}],
    "stream": false
  }'

Embeddings Model

curl -X 'POST' "$SERVER_NAME:$PORT/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Compressa-Embedding",
    "input": "Document 1",
    "encoding_format": "float"
  }'

Rerank Model

curl -X 'POST' "$SERVER_NAME:$PORT/v1/rerank" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Compressa-ReRank",
    "query": "Query?",
    "documents": [
      "document 1",
      "document 2",
      "document 3"
    ],
    "return_documents": false
  }'

To work with the InsightStream RAG module, you'll need to perform one more step.