Quickstart: On-Premises InsightStream

This guide shows how to deploy the InsightStream RAG chatbot together with Compressa for model inference.

Requirements

Deploying InsightStream with Compressa requires a server with two GPUs: one for the chat model and one for the embedding model.
The supported GPU models and server setup requirements can be found on this page.

Setup

First, clone the repository with the configuration:

git clone -b insight-stream git@github.com:compressa-ai/compressa-deploy.git
cd compressa-deploy

The repository contains two main files that we’ll configure:

  • .env
  • docker-compose.yml

Set the GPU device IDs in the .env file (one GPU for the chat model, one for the embedding model):

DOCKER_GPU_IDS_CHAT=<ID1>
DOCKER_GPU_IDS_EMB=<ID2>
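
The IDs typically correspond to the GPU indices reported by nvidia-smi. For example, on a two-GPU server you might assign:

DOCKER_GPU_IDS_CHAT=0
DOCKER_GPU_IDS_EMB=1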

With the default configuration, the services use the following ports:

  • qdrant - 6333
  • compressa - 5500
  • insight-stream-bot - 80

If you need to change these, update the port mappings in docker-compose.yml for the qdrant, openai-api, and nginx containers accordingly, as in the sketch below.
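
For example, to expose qdrant on host port 7333 instead of 6333, the mapping in docker-compose.yml would look roughly like this (a sketch; the exact service layout in your checkout may differ):

services:
  qdrant:
    ports:
      - "7333:6333" # host:container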

Set the SERVER_NAME variable to the URL at which the InsightStream bot will be served, for example localhost:80 if you are running the solution locally or forwarding port 80 of the server to port 80 of localhost.
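
In the .env file this looks like:

SERVER_NAME=localhost:80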

Storage Setup

By default, the containers use the following storage paths:

  • qdrant - ./data/qdrant
  • compressa - ./data/compressa
    This directory must have 777 permissions, which can be set via:
    chmod -R 777 ./data/compressa
  • document storage - ./data/documents
    This directory must be owned by the user systemd-network and the group systemd-journal, with 755 permissions, which can be set via:
    sudo chown systemd-network:systemd-journal ./data/documents && sudo chmod 755 ./data/documents

You can change the storage paths in docker-compose.yml.
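
For a fresh checkout, the default directories can be created and permissioned in one pass (a sketch combining the commands above):

mkdir -p ./data/qdrant ./data/compressa ./data/documents
chmod -R 777 ./data/compressa
sudo chown systemd-network:systemd-journal ./data/documents
sudo chmod 755 ./data/documents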

Then, you can run the solution with:

docker compose up --build
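
To confirm that the services came up, you can check the container status and probe the qdrant port (the /collections endpoint is part of qdrant's standard REST API):

docker compose ps
curl http://localhost:6333/collections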

Deploy Inference and Embedding Models

When the services are running, we need to deploy the chat and embedding models to Compressa. The solution uses the Llama-3-8B model for chat and the SFR-Embedding-Mistral model for embeddings.
Models can be deployed via the REST API or the Swagger UI.

The REST APIs are available at:

  • SERVER_NAME:5500/api/chat/
  • SERVER_NAME:5500/api/embeddings/

The Swagger UI is available at:

  • SERVER_NAME:5500/api/chat/docs
  • SERVER_NAME:5500/api/embeddings/docs

Here are the commands to deploy models using curl:

Add the Llama-3-8B model to Compressa:

curl -X 'POST' \
'http://localhost:5500/api/chat/v1/models/add/?model_id=compressa-ai%2FLlama-3-8B-Instruct' \
-H 'accept: application/json' \
-d ''

Add the embedding model to Compressa:

curl -X 'POST' \
'http://localhost:5500/api/embeddings/v1/models/add/?model_id=Salesforce%2FSFR-Embedding-Mistral' \
-H 'accept: application/json' \
-d ''

Once the downloads have finished, deploy the models:

Deploy Llama-3-8B

curl -X 'POST' \
'http://localhost:5500/api/chat/v1/deploy/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model_id": "compressa-ai/Llama-3-8B-Instruct",
"dtype": "float16"
}'

Deploy the embedding model

curl -X 'POST' \
'http://localhost:5500/api/embeddings/v1/deploy/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model_id": "Salesforce/SFR-Embedding-Mistral",
"dtype": "float16"
}'

When the models are deployed, the server is ready to use!
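
As a quick sanity check, you can send a test request through Compressa's chat endpoint. This sketch assumes the API is OpenAI-compatible, as the OPENAI_BASE variable in the CLI section below suggests; the exact route may differ in your deployment:

curl -X 'POST' \
'http://localhost:5500/v1/chat/completions' \
-H 'Content-Type: application/json' \
-d '{
"model": "compressa-ai/Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'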

CLI

Install

The CLI tool is used to add new documents to the RAG index. It can be installed from the same repository (Python 3.10+ required):

cd compressa-deploy/cli
pip install -r requirements.txt
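
To keep the CLI's dependencies isolated, you can optionally install them into a virtual environment first:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt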

Setup

The CLI tool needs access to the deployed chatbot, the models, and qdrant.
Set their URLs in the .env file:

SERVER_NAME=<SERVER_NAME> # in case of port 80
QDRANT_URL=<SERVER_NAME>:6333
OPENAI_BASE=<SERVER_NAME>:5500/v1 # Compressa
QDRANT_KEY=your_secret_api_key_here

Usage

Add Documents to the Index

When all environment variables are set, documents can be added to the system using one of the following commands:

python3 create_bot.py <BOT_ID> /path/to/document.pdf  
python3 create_bot.py <BOT_ID> /path/to/folder

The InsightStream bot supports .docx and .pdf documents.

When documents are uploaded, the bot is available at <SERVER_NAME>/agent/<bot_id>.
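
For example, indexing a folder of documents for a hypothetical bot ID support-bot:

python3 create_bot.py support-bot ./my-documents/

would make the bot available at <SERVER_NAME>/agent/support-bot.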

Ask the InsightStream Bot

You can open the InsightStream bot at <SERVER_NAME>/agent/<bot_id> and ask a question in the Chat UI.

The bot can also be used via REST API.

REST API

Ask a question to the bot:

curl -X POST \
-H "Content-Type: application/json" \
-d '{
"question": "<your_question_here>"
}' \
"<SERVER_NAME>/v.1.0/<bot_id>"

Download a file from the server:

curl <SERVER_NAME>/documents/<filename> > <filename> 

Upload a new document to the server:

curl -X PUT -T /path/to/file <SERVER_NAME>/documents/<filename>
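
For example, to upload a local file and then fetch it back to confirm it was stored (hypothetical filename):

curl -X PUT -T ./report.pdf <SERVER_NAME>/documents/report.pdf
curl <SERVER_NAME>/documents/report.pdf > report-copy.pdf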