
LLM

The Compressa platform includes a ready-made module for fast and cost-effective inference of open-source LLMs on your server. We have already applied the best optimization techniques at the infrastructure level, so you save on costs and improve user experience.

As part of the cloud version, we provide a test API for one of the current open-source models (e.g., Qwen or Llama), subject to load limitations.

You can interact with the LLM through our Python integration for LangChain or through a direct cURL request.

In addition, our API is OpenAI-compatible; see the separate page for details.
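Because the endpoint follows the standard OpenAI chat-completions schema, a request body is just a plain JSON object. The sketch below assembles one; the helper function and its defaults are illustrative (not part of any Compressa SDK), and the model name and key placeholder are the ones used elsewhere on this page:

```python
def build_chat_payload(model, system_prompt, user_prompt,
                       max_tokens=150, temperature=0.7):
    """Assemble a standard OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_payload(
    "Compressa-LLM",
    "You are a helpful assistant.",
    "I love programming.",
)
print(payload["model"])          # Compressa-LLM
print(len(payload["messages"]))  # 2
```

The same dict can be passed as the `json=` argument of `requests.post` or sent with cURL via `-d`.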

Calling the model without streaming

# pip install langchain-openai - if you don't have this package yet
# pip install langchain-core - if you don't have this package yet
# pip install langchain - if you don't have this package yet

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="Compressa-LLM",
    base_url="https://compressa-api.mil-team.ru/v1",
    api_key="Your_Compressa_API_key",
    temperature=0.7,
    max_tokens=150,
    streaming=False,
)

messages = [
    ("system", "You are a helpful assistant who translates from Russian to English. Translate the user's sentence."),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
print(f"Model response: {ai_msg.content}")

# Model response: I love programming.

Calling the model with response streaming

import requests

response = requests.post(
    url="https://compressa-api.mil-team.ru/v1/chat/completions",
    headers={
        "Authorization": "Bearer Your_Compressa_API_key",
        "accept": "application/json",
        "Content-Type": "application/json",
    },
    json={
        "model": "Compressa-LLM",
        "messages": [
            {
                "role": "system",
                "content": "You are an expert in soccer"
            },
            {
                "role": "user",
                "content": "Write a bedtime story about a kind artificial intelligence!"
            }
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "stream": True,
    },
    stream=True,  # keep the HTTP connection open so chunks arrive as they are generated
)

for chunk in response.iter_content(chunk_size=None):
    if chunk:
        print(chunk.decode("utf-8"))

# Example data:
# data: {"id":"126","object":"chat.completion.chunk","created":1728680725,"model":"Compressa-LLM","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"logprobs":null,"finish_reason":null}],"usage":null}
# data: {"id":"126","object":"chat.completion.chunk","created":1728680725,"model":"Compressa-LLM","choices":[{"index":0,"delta":{"role":"assistant","content":" upon"},"logprobs":null,"finish_reason":null}],"usage":null}
# ...
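The stream arrives as server-sent events: each line beginning with `data:` carries one JSON chunk, and the generated text sits under `choices[0].delta.content`. A minimal parser sketch, assuming the chunk format shown above and the `data: [DONE]` end-of-stream sentinel that OpenAI-style streams conventionally send:

```python
import json

def extract_stream_text(lines):
    """Reassemble the assistant's text from OpenAI-style SSE chunk lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # conventional OpenAI end-of-stream marker
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

# Reassembling the two example chunks shown above:
sample = [
    'data: {"id":"126","object":"chat.completion.chunk","created":1728680725,"model":"Compressa-LLM","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"logprobs":null,"finish_reason":null}],"usage":null}',
    'data: {"id":"126","object":"chat.completion.chunk","created":1728680725,"model":"Compressa-LLM","choices":[{"index":0,"delta":{"role":"assistant","content":" upon"},"logprobs":null,"finish_reason":null}],"usage":null}',
    'data: [DONE]',
]
print(extract_stream_text(sample))  # Once upon
```

In the streaming loop above, each decoded chunk can be fed through the same logic (split on newlines first) to print only the generated text rather than the raw event envelope.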