
Management API

URL: http://localhost:5100/

The API allows you to:

  • add models
  • deploy models
  • interrupt model deployment
  • run performance tests using the compressa-perf library
  • run observability tests on standard or custom datasets using the DeepEval library or Self-Scoring (for Qwen2.5)
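
A typical workflow chains these endpoints together. Below is a minimal Python sketch (the model identifier is a placeholder; all endpoints used are documented in the sections that follow):

import time
import requests

BASE = "http://localhost:5100"

# Download a model from Hugging Face (placeholder identifier)
job = requests.post(f"{BASE}/v1/models/add/", params={"model_id": "mymodel_id"}).json()

# Poll the download job until it leaves the RUNNING state
while requests.get(f"{BASE}/v1/jobs/{job['id']}/status/").json()["status"] == "RUNNING":
    time.sleep(5)

# Deploy the downloaded model (no adapters in this sketch)
requests.post(f"{BASE}/v1/deploy/", json={"model_id": "mymodel_id", "adapter_ids": []})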

Model Library

GET /v1/models/

List of models available for launch and fine-tuning.

Example:

curl -X 'GET' \
'http://localhost:5100/v1/models/' \
-H 'accept: application/json'

Response Schema:

[
  {
    "model_id": "string",
    "adapter": true,
    "base_model_id": "string"
  }
]

POST /v1/models/add/

Downloading a model from Hugging Face.

Parameters:

  • query: model_id - model identifier on Hugging Face, e.g. Qwen/Qwen3-14B.

Example:

curl -X 'POST' \
'http://localhost:5100/v1/models/add/?model_id=mymodel_id' \
-H 'accept: application/json' \
-d ''

Response Schema:

{
  "id": "4d78d943-1896-4d7b-9f11-b10cc2389ba3",
  "name": "DOWNLOAD_mymodel_id",
  "status": "RUNNING",
  "started_at": "2024-03-21T09:58:29.846708"
}

Model Deployment

GET /v1/deploy/

Get information about the currently deployed model.

Example:

curl -X 'GET' \
'http://localhost:5100/v1/deploy/' \
-H 'accept: application/json'

Response Schema:

{
  "model_id": "string",
  "adapter_ids": [
    "string"
  ]
}

POST /v1/deploy/

Launch models and fine-tuned adapters for inference. The list of IDs can be obtained via GET /v1/models/.

Request Body:

{
  "model_id": "string",
  "adapter_ids": [
    "string"
  ]
}
  • model_id - model identifier
  • adapter_ids - list of adapter identifiers

Example:

curl -X 'POST' \
'http://localhost:5100/v1/deploy/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "model_id": "model_id1",
  "adapter_ids": [
    "adapter_id1",
    "adapter_id2"
  ]
}'

Response Schema:

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "name": "string",
  "status": "CREATED",
  "started_at": "2024-03-21T10:10:07.521Z"
}

GET /v1/deploy/status/

Get the status of the deployed model.

Example:

curl -X 'GET' \
'http://localhost:5100/v1/deploy/status/' \
-H 'accept: application/json'

Response Schema:

{
  "model_id": "model_id1",
  "adapter_ids": [
    "adapter_id1",
    "adapter_id2"
  ],
  "job": {
    "id": "8a63349c-078f-4e98-8968-4f011593329c",
    "name": "DEPLOY_model_id1_adapters_id1_id2",
    "status": "RUNNING",
    "started_at": "2024-03-21T07:35:16.861681"
  }
}

POST /v1/deploy/interrupt/

Stop (undeploy) the currently deployed model.

Example:

curl -X 'POST' \
'http://localhost:5100/v1/deploy/interrupt/' \
-H 'accept: application/json' \
-d ''

Response Schema:

{
  "id": "8a63349c-078f-4e98-8968-4f011593329c",
  "name": "DEPLOY_model_id1_adapters_id1_id2",
  "status": "RUNNING",
  "started_at": "2024-03-21T07:35:16.861681"
}

Jobs

Operations such as model loading or deployment are associated with jobs. The following APIs allow managing job execution.
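
For example, a client can poll a job until it reaches a terminal state. A minimal Python sketch (the terminal statuses shown in this section are FINISHED, KILLED, and INTERRUPTED; any non-RUNNING status is treated as terminal here):

import time
import requests

def wait_for_job(job_id, base="http://localhost:5100", poll_seconds=5):
    # Poll GET /v1/jobs/{job_id}/status/ until the job leaves the RUNNING state
    while True:
        job = requests.get(f"{base}/v1/jobs/{job_id}/status/").json()
        if job["status"] != "RUNNING":
            return job
        time.sleep(poll_seconds)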

GET /v1/jobs/

Get all jobs with statuses.

Example:

curl -X 'GET' \
'http://localhost:5100/v1/jobs/' \
-H 'accept: application/json'

Response Schema:

[
  {
    "id": "8a63349c-078f-4e98-8968-4f011593329c",
    "name": "DEPLOY_model_id1_adapters_id1_id2",
    "status": "RUNNING",
    "started_at": "2024-03-21T07:35:16.861681"
  },
  {
    "id": "4d78d943-1896-4d7b-9f11-b10cc2389ba3",
    "name": "DOWNLOAD_test",
    "status": "FINISHED",
    "started_at": "2024-03-21T09:58:29.846708"
  }
]

GET /v1/jobs/{job_id}/status/

Get the latest status of the job with the given job_id.

Parameters:

  • path: job_id

Example:

curl -X 'GET' \
'http://localhost:5100/v1/jobs/4d78d943-1896-4d7b-9f11-b10cc2389ba3/status/' \
-H 'accept: application/json'

Response Schema:

{
  "id": "4d78d943-1896-4d7b-9f11-b10cc2389ba3",
  "name": "DOWNLOAD_test",
  "status": "FINISHED",
  "started_at": "2024-03-21T09:58:29.846708"
}

POST /v1/jobs/{job_id}/interrupt/

Interrupt execution of the job with the given job_id.

Parameters:

  • path: job_id

Example:

curl -X 'POST' \
'http://localhost:5100/v1/jobs/8a63349c-078f-4e98-8968-4f011593329c/interrupt/' \
-H 'accept: application/json' \
-d ''

Response Schema:

{
  "id": "8a63349c-078f-4e98-8968-4f011593329c",
  "name": "DEPLOY_model_id1_adapters_id1_id2",
  "status": "KILLED",
  "started_at": "2024-03-21T07:35:16.861681"
}

Running Performance Tests

GET /v1/performance/

GET /v1/performance/status

Get information about a running test (if any), or an indication that the test has completed and the results have been saved.

Example:

curl -X 'GET' 'http://localhost:5100/v1/performance/'

Response Schema if test is in progress:

{"model":"Compressa-LLM","job":{"id":"b83cb091-5acd-4254-84b5-e1141ec01711","name":"performance TEST","status":"RUNNING","started_at":"2025-08-03T10:07:30.818815","exception_details":null,"retry":0},"message":null}

Response Schema if test is completed:

{
  "model": "Compressa-LLM",
  "job": {
    "id": "b83cb091-5acd-4254-84b5-e1141ec01711",
    "name": "performance TEST",
    "status": "FINISHED",
    "started_at": "2025-08-03T10:07:30.818815",
    "exception_details": null,
    "retry": 0
  },
  "message": "Results are available in /app/resources/performance_results"
}

POST /v1/performance/

Start a test. Test parameters, such as the number of examples, number of threads, and report format, are passed in the request.

Example:

curl -X 'POST' 'http://localhost:5100/v1/performance/?num_tasks=100&num_runners=10&report_mode=md'

Response Schema:

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "name": "string",
  "status": "RUNNING",
  "started_at": "2024-03-21T10:10:07.521Z"
}

After testing is complete, reports are saved in the selected format (pdf, md, or csv) in the previously mounted RESOURCES_PATH folder.
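
Putting these endpoints together, a client can start a test and wait for the results path. A minimal Python sketch (parameter values are illustrative):

import time
import requests

BASE = "http://localhost:5100"

# Start a performance test (illustrative parameters)
requests.post(f"{BASE}/v1/performance/",
              params={"num_tasks": 100, "num_runners": 10, "report_mode": "md"})

# Poll until the test finishes; the "message" field then contains the results path
while True:
    status = requests.get(f"{BASE}/v1/performance/").json()
    if status["job"]["status"] != "RUNNING":
        print(status["message"])
        break
    time.sleep(10)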

POST /v1/performance/interrupt/

Interrupt a running test.

Example:

curl -X 'POST' \
'http://localhost:5100/v1/performance/interrupt/' \
-H 'accept: application/json' \
-d ''

Response Schema:

{"id":"6be69577-ebe8-4062-a5cb-ad8085c11393","name":"performance TEST","status":"INTERRUPTED","started_at":"2025-08-03T10:18:00.368896","exception_details":null,"retry":0}

Running Quality Tests

GET /v1/metrics/

GET /v1/metrics/status

Get information about a running test, or the test result if the test has completed.

Example:

curl -X 'GET' \
'http://localhost:5100/v1/metrics/'

Response Schema if test is in progress:

{"model":"Compressa-LLM","message":null,"job":{"id":"26908ef2-60b4-4211-9e7f-3a649c200419","name":"QUALITY TEST","status":"RUNNING","started_at":"2025-08-03T10:30:18.573467","exception_details":null,"retry":0}}

Response Schema if test is completed:

{
  "model": "Compressa-LLM",
  "job": {
    "id": "b83cb091-5acd-4254-84b5-e1141ec01711",
    "name": "observability TEST",
    "status": "FINISHED",
    "started_at": "2025-08-03T10:07:30.818815",
    "exception_details": null,
    "retry": 0
  },
  "message": "Results are available in /app/resources/quality_tests"
}

POST /v1/metrics/

Start a test. Test parameters are passed in the request, such as the dataset (standard or custom) and the number of randomly selected examples (default: 10; if -1 is specified, all examples are used).

Request for a standard dataset:

curl -X 'POST' 'http://localhost:5100/v1/metrics/?dataset=medical&num_examples=5'

List of available standard datasets

Request for a custom dataset:

import requests

url = 'http://localhost:5100/v1/metrics/?dataset=custom'

# Custom dataset: parallel lists of questions and reference answers
payload = {
    "question": [
        "What is important when welding titanium and its alloys?",
        "What is boric acid needed for in reactor coolant?",
        "In what year did the Komsomolets submarine sink?",
    ],
    "answers": [
        "Ensure an inert atmosphere in the welding zone, prevent titanium oxidation",
        "Boric acid is used as a liquid absorber to control the chain fission reaction",
        "In 1989",
    ]
}

response = requests.post(
    url,
    headers={
        "accept": "application/json",
        "Content-Type": "application/json"
    },
    json=payload
)

Response Schema:

{
  "id": "ce79eaee-4541-4341-ac21-b79d9bb7f9bb",
  "name": "QUALITY TEST",
  "status": "RUNNING",
  "started_at": "2025-08-03T10:34:09.113636",
  "exception_details": null,
  "retry": 0
}

After testing is complete, reports are saved in pdf, csv, and json formats in the previously mounted RESOURCES_PATH folder.

POST /v1/metrics/interrupt/

Interrupt a running test.

Example:

curl -X 'POST' \
'http://localhost:5100/v1/metrics/interrupt/' \
-H 'accept: application/json' \
-d ''

Response Schema:

{"id":"ce79eaee-4541-4341-ac21-b79d9bb7f9bb","name":"QUALITY TEST","status":"INTERRUPTED","started_at":"2025-08-03T10:34:09.113636","exception_details":null,"retry":0}

Self-Check Mode

Model self-assessment of the quality of its answers on a custom dataset without labels.

Experimental mode.

About the evaluation method

The evaluation is based on the EM (Expectation-Maximization) algorithm for assessing the quality of language models without ground-truth answers. The algorithm analyzes six key metrics, each scored from 0 to 10 (the higher, the better):

  • Self-Consistency — consistency of answers at different temperatures
  • Self-Rating — model's self-assessment of its answers
  • STD-Rating — spread of model's self-assessment scores at different temperatures
  • Abstention — model's ability to express uncertainty
  • Chain-of-Thought Critique — quality of step-by-step reasoning
  • Paraphrase Self-Consistency — consistency when paraphrasing questions

The EM algorithm ranks each question in the dataset according to the model's behavior on it. Each question is assigned its own latent score, which reflects the model's quality on that question only relative to the other questions. The higher the latent score, the better the model's quality on that question.

Since different datasets may have different distributions of answer quality, EM model weights are introduced; they show how strongly each metric affects the final latent score. A metric's weight is higher the better it describes the entire dataset and the more it agrees with the other metrics. The final score is aggregated as a weighted average of metric scores over latent-score quantiles, with the averaging taking the EM model weights into account. This yields a final, objective assessment of model quality that can be compared across datasets.
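
As a simplified illustration of just the final averaging step (the actual method also aggregates over latent-score quantiles; all values below are hypothetical):

# Hypothetical per-metric scores (0-10 scale) and hypothetical EM weights
metric_scores = {"self_consistency": 7.2, "self_rating": 6.5, "abstention": 8.1}
em_weights = {"self_consistency": 0.5, "self_rating": 0.3, "abstention": 0.2}

# Final score: weighted average of metric scores using the EM weights
final_score = sum(metric_scores[m] * em_weights[m] for m in metric_scores) / sum(em_weights.values())
print(round(final_score, 2))  # 7.17 with the hypothetical values above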

Example:

import requests

# Self-check mode: no ground-truth answers, so "answers" is left empty
url = 'http://localhost:5100/v1/metrics/?dataset=custom&mode=no_gt'
payload = {
    "question": [
        "What factors increase the risk of developing malignant tumors?",
        "How can you quickly navigate to a directory with a long name, and how to find out the file type using the ls command?",
    ],
    "answers": []
}

response = requests.post(
    url,
    headers={
        "accept": "application/json",
        "Content-Type": "application/json"
    },
    json=payload
)

Response Schema:

{
  "id": "c8b0e922-f89b-4d32-af34-bcee22f5a82f",
  "name": "QUALITY TEST [NO_GT (PARALLEL)]",
  "status": "RUNNING",
  "started_at": "2025-10-02T14:19:59.471777",
  "exception_details": null,
  "retry": 0
}