Quickstart: On-Premises InsightStream

This guide shows how to deploy the InsightStream RAG chatbot together with Compressa for model inference.


Deploying InsightStream with Compressa requires a server with 2 GPUs.
The requirements for GPU versions and server setup can be found on this page.


First, clone the repository with the configuration:

git clone -b insight-stream
cd compressa-deploy

The repository contains two main files that we’ll configure:

  • .env
  • docker-compose.yml

Set the GPU IDs in the .env file:


With the default configuration, the services use the following ports:

  • qdrant - 6333
  • compressa - 5500
  • insight-stream-bot - 80

If you need to change these, update the port mappings for the qdrant, openai-api, and nginx containers in docker-compose.yml.

The SERVER_NAME variable should be set to the URL at which the InsightStream bot will be accessed. For example, use localhost:80 if you are running the solution locally or forwarding port 80 of the server to port 80 of localhost.

Set Up Storage

By default, the containers use the following storage paths:

  • qdrant - ./data/qdrant
  • compressa - ./data/compressa
    This directory should have 777 permissions, which can be set via:
    chmod 777 -R ./data/compressa
  • document storage - ./data/documents
    This directory should be owned by the user systemd-network and the group systemd-journal, with 755 permissions, which can be set via:
    sudo chown systemd-network:systemd-journal ./data/documents && sudo chmod 755 ./data/documents

You can change the storage paths in docker-compose.yml.
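As a rough sketch, the default layout above could also be prepared from Python. The function name prepare_storage and its base argument are illustrative and not part of the repository. The sketch sets modes on the top-level directories only, which matches chmod -R for freshly created empty directories, and it leaves the chown step to the sudo command above because changing ownership requires root.

```python
import os
from pathlib import Path


def prepare_storage(base: Path) -> dict:
    """Create the default storage directories under `base` and set their modes."""
    paths = {
        "qdrant": base / "data" / "qdrant",
        "compressa": base / "data" / "compressa",
        "documents": base / "data" / "documents",
    }
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    # os.chmod is explicit, so the result does not depend on the process umask.
    os.chmod(paths["compressa"], 0o777)
    os.chmod(paths["documents"], 0o755)
    return paths
```

Calling prepare_storage(Path(".")) from the repository root would create ./data/qdrant, ./data/compressa, and ./data/documents if they do not exist yet.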

Then, you can run the solution with:

docker compose up --build
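
Once the containers are up, a quick TCP check can confirm that the default ports are accepting connections. This is a minimal sketch using only the standard library. The helper names are illustrative, and an open port only shows that something is listening, not that the service behind it is healthy.

```python
import socket
from typing import Optional

# Default host ports from this guide; adjust if docker-compose.yml was changed.
DEFAULT_PORTS = {"qdrant": 6333, "compressa": 5500, "insight-stream-bot": 80}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost", ports: Optional[dict] = None) -> dict:
    """Map each service name to whether its port accepts connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Running check_services() on the server returns a dictionary such as {"qdrant": True, ...}, with False for any port that refused the connection or timed out.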

Deploy Inference and Embedding Models

When the services are running, we need to deploy the models to Compressa. The solution uses the Llama-3-8B model for chat and the SFR-Embedding-Mistral model for embeddings.
Models can be deployed using the REST API or the Swagger UI.

The REST APIs are available at:

  • SERVER_NAME:5500/api/chat/
  • SERVER_NAME:5500/api/embeddings/

The Swagger UI is available at:

  • SERVER_NAME:5500/api/chat/docs
  • SERVER_NAME:5500/api/embeddings/docs
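
Note that SERVER_NAME may itself carry a port (for example localhost:80), in which case appending :5500 directly would give an invalid address. The sketch below derives the four addresses from the host part only. The function name compressa_urls is hypothetical, and the http scheme is an assumption taken from the curl commands in this guide.

```python
from urllib.parse import urlsplit

COMPRESSA_PORT = 5500  # default Compressa port from this guide


def compressa_urls(server_name: str, port: int = COMPRESSA_PORT) -> dict:
    """Build the REST and Swagger UI addresses for a given SERVER_NAME."""
    # Accept "host", "host:80", or "http://host:80"; keep only the host part.
    has_scheme = "://" in server_name
    parts = urlsplit(server_name if has_scheme else "//" + server_name)
    scheme = parts.scheme or "http"  # assumption: plain HTTP by default
    base = f"{scheme}://{parts.hostname}:{port}/api"
    return {
        "chat": f"{base}/chat/",
        "embeddings": f"{base}/embeddings/",
        "chat_docs": f"{base}/chat/docs",
        "embeddings_docs": f"{base}/embeddings/docs",
    }
```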

Here are the commands to deploy models using curl:

Add the Llama-3-8B model to Compressa:

curl -X 'POST' \
'http://localhost:5500/api/chat/v1/models/add/?model_id=compressa-ai%2FLlama-3-8B-Instruct' \
-H 'accept: application/json' \
-d ''

Add the embedding model to Compressa:

curl -X 'POST' \
'http://localhost:5500/api/embeddings/v1/models/add/?model_id=Salesforce%2FSFR-Embedding-Mistral' \
-H 'accept: application/json' \
-d ''
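
The same two calls can be scripted from Python. The following is a sketch with the standard library only. It mirrors the URLs of the curl commands above, and the helper name build_add_model_request is illustrative. The function only builds the request, so nothing is sent until it is passed to urllib.request.urlopen.

```python
import urllib.request
from urllib.parse import quote


def build_add_model_request(base_url: str, kind: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that adds `model_id`.

    `kind` is "chat" or "embeddings", matching the two API prefixes above.
    """
    # safe="" makes quote() encode "/" as %2F, as in the curl commands.
    encoded_id = quote(model_id, safe="")
    url = f"{base_url}/api/{kind}/v1/models/add/?model_id={encoded_id}"
    return urllib.request.Request(
        url,
        data=b"",  # empty body, like curl's -d ''
        method="POST",
        headers={"accept": "application/json"},
    )
```

For example, urllib.request.urlopen(build_add_model_request("http://localhost:5500", "chat", "compressa-ai/Llama-3-8B-Instruct")) would issue the first request.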

When downloading is finished, we can deploy the models:

Deploy Llama-3-8B:

curl -X 'POST' \
'http://localhost:5500/api/chat/v1/deploy/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model_id": "compressa-ai/Llama-3-8B-Instruct",
"dtype": "float16"
}'

Deploy the embedding model:

curl -X 'POST' \
'http://localhost:5500/api/embeddings/v1/deploy/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model_id": "Salesforce/SFR-Embedding-Mistral",
"dtype": "float16"
}'

When the models are deployed, the server is ready to use!
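
For scripted setups, the deploy calls can be expressed in Python as well. This is a sketch under the assumption that the endpoints accept exactly the JSON fields used in the curl commands above (model_id and dtype). The helper name build_deploy_request is hypothetical. Serializing with json.dumps avoids hand-written JSON that is missing a closing brace or quote.

```python
import json
import urllib.request


def build_deploy_request(
    base_url: str, kind: str, model_id: str, dtype: str = "float16"
) -> urllib.request.Request:
    """Build (but do not send) the POST request that deploys a model.

    `kind` is "chat" or "embeddings", matching the two API prefixes above.
    """
    body = json.dumps({"model_id": model_id, "dtype": dtype}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/{kind}/v1/deploy/",
        data=body,
        method="POST",
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )
```

Passing the returned object to urllib.request.urlopen sends the request. The response format is not described in this guide, so the sketch does not parse it.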



The CLI tool is used to add new documents to the RAG index. The CLI can be installed from the same repository (Python 3.10+ required):

cd compressa-deploy/cli
pip install -r requirements.txt


The CLI tool needs access to the deployed chatbot, the models, and qdrant.
Set their URLs in the .env file:

SERVER_NAME=<SERVER_NAME> # in case of port 80
OPENAI_BASE=<SERVER_NAME>:5500/v1 # Compressa
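
Before running the CLI, it can help to confirm that the .env file defines the variables shown above. The sketch below is a simplified parser written for illustration. How the CLI itself reads .env is not covered by this guide, and this parser treats everything after a # as a comment, so it would truncate values that contain that character.

```python
REQUIRED_KEYS = ("SERVER_NAME", "OPENAI_BASE")  # the two variables shown above


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, missing_keys(parse_env(open(".env").read())) returns an empty list when both variables are set.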


Add Documents to the Index

When all environment variables are set, documents can be added to the system using one of the following commands:

python3 <BOT_ID> /path/to/document.pdf  
python3 <BOT_ID> /path/to/folder

The InsightStream bot supports .docx and .pdf documents.
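
To see which files under a path match the supported formats before uploading, a small helper like the following could be used. It is a sketch and not part of the CLI. The recursive search and the case-insensitive suffix match are assumptions, and the CLI may treat folders differently.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".docx"}  # formats named in this guide


def find_supported_documents(path: Path) -> list:
    """Return supported files at `path` (a single file or a folder, searched recursively)."""
    if path.is_file():
        return [path] if path.suffix.lower() in SUPPORTED_SUFFIXES else []
    return sorted(
        p
        for p in path.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```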

When documents are uploaded, the bot is available at <SERVER_NAME>/agent/<bot_id>.

Ask the InsightStream Bot

You can open the InsightStream bot at <SERVER_NAME>/agent/<bot_id> and ask a question in the Chat UI.

The bot can also be used via REST API.


Ask the bot a question:

curl -X POST \
-H "Content-Type: application/json" \
-d '{
"question": "<your_question_here>"
}' \

Download a file from the server:

curl <SERVER_NAME>/documents/<filename> > <filename> 

Upload a new document to the server:

curl -X PUT -T /path/to/file <SERVER_NAME>/documents/<filename>
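
The upload and download calls above can also be written in Python. This is a sketch with the standard library. The helper names are illustrative, and server_url is assumed to include the scheme (for example http://localhost), because urllib, unlike curl, does not add one. The file name is percent-encoded so that names with spaces produce a valid URL.

```python
import urllib.request
from pathlib import Path
from urllib.parse import quote


def document_url(server_url: str, filename: str) -> str:
    """Return the address of a document, as used by the curl commands above."""
    return f"{server_url}/documents/{quote(filename)}"


def build_upload_request(server_url: str, file_path: Path) -> urllib.request.Request:
    """Build (but do not send) a PUT request that uploads `file_path`."""
    return urllib.request.Request(
        document_url(server_url, file_path.name),
        data=file_path.read_bytes(),  # whole file in memory; fine for a sketch
        method="PUT",
    )
```

Sending the request with urllib.request.urlopen corresponds to curl -X PUT -T, and urllib.request.urlretrieve(document_url(server_url, filename), filename) corresponds to the download command.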