LLM Chat Response

Blocking Call – Wait for Result

Depending on whether “wait_for_result” was set to True or False, we get two different responses. If wait_for_result is set to True, the request waits until the LLM has finished generating the answer and then returns the completed result in the following format:

{
    "text": "Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta.",
    "num_generated_tokens": 18,
    "model_name": "Llama-3.1-8B-Instruct",
    "max_seq_len": 65536,
    "current_context_length": 88,
    "prompt_length": 70,
    "success": true,
    "job_id": "41520777-93c8-4119-a34d-4f5b21b8b856",
    "total_duration": 0.295,
    "compute_duration": 0.278
}

text:[string|json]

The complete generated chat response in the desired chat_output_format. For an explanation of the different chat output formats, please see here.

num_generated_tokens:integer

Total number of generated tokens in this request.

prompt_length:integer

The length of the given chat_context (i.e. the prompt) in tokens.

current_context_length:integer

The total length of the current chat context in tokens.

max_seq_len:integer

The maximum context length in tokens this endpoint is capable of processing.

success:boolean

true in case the request could be processed successfully.

error:string

The error message in case success is false. Possible errors include incorrect JSON syntax, invalid parameters, range errors, an invalid chat context, an unavailable endpoint, and access or authorization issues.
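The blocking response above can be handled with a small helper that checks the success flag before reading the result. This is a minimal sketch in Python; the function name is hypothetical, and only the response fields documented above are assumed:

```python
def extract_chat_text(response: dict) -> str:
    """Return the generated text from a blocking chat response,
    raising if the request was not successful."""
    if not response.get("success"):
        # On failure the response carries an 'error' message instead of a result.
        raise RuntimeError(response.get("error", "unknown error"))
    return response["text"]

# Example with the documented response layout:
blocking_response = {
    "text": "Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta.",
    "num_generated_tokens": 18,
    "model_name": "Llama-3.1-8B-Instruct",
    "max_seq_len": 65536,
    "current_context_length": 88,
    "prompt_length": 70,
    "success": True,
    "job_id": "41520777-93c8-4119-a34d-4f5b21b8b856",
    "total_duration": 0.295,
    "compute_duration": 0.278,
}
print(extract_chat_text(blocking_response))
```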

Non-Blocking - Streaming Response

If “wait_for_result” was set to False in the LLM request, the request returns immediately with the following information:

{
    "success": true,
    "job_id": "7412cf4b-c230-454a-a1db-7e5df401f973"
}

success:boolean

true in case the stream could be started successfully.

error:string

The error message in case success is false. Possible errors include incorrect JSON syntax, invalid parameters, range errors, an invalid chat context, an unavailable endpoint, and access or authorization issues.

job_id:string

The id of the started streaming job. This id is required to reference the job in subsequent requests.
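When starting a non-blocking job, the immediate response only confirms that the job was accepted, so the first step is to capture its job_id. A minimal sketch (the helper name is hypothetical; only the documented response fields are assumed):

```python
def accept_stream_job(response: dict) -> str:
    """Return the job_id from the immediate non-blocking response,
    raising if the stream could not be started."""
    if not response.get("success"):
        raise RuntimeError(response.get("error", "unknown error"))
    return response["job_id"]

# The job id identifies the stream in all follow-up progress requests.
job_id = accept_stream_job({
    "success": True,
    "job_id": "7412cf4b-c230-454a-a1db-7e5df401f973",
})
```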

Reading the Stream Progress

The progress of an LLM job can be queried with an HTTPS POST request to:

https://aki.io/api/progress/{endpoint name}

The JSON payload of the stream progress request should contain the following parameters:

{
    "job_id": "7412cf4b-c230-454a-a1db-7e5df401f973",
    "key": "b0050e7b-286d-46a9-9354-b57ecf809952",
    "cancel": false
}

key:string

Your API key.

job_id:string

The job id returned from the initial chat request, identifying which stream you are referring to.

cancel:boolean [optional]

Optionally, the stream can be canceled by setting this parameter to true. No further computation will be performed and only the tokens generated so far will be billed.
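Assembling the progress request payload from the parameters above can be sketched as follows. The helper name is hypothetical; the payload fields are exactly those documented here, with the optional cancel flag omitted unless set:

```python
def build_progress_payload(job_id: str, api_key: str, cancel: bool = False) -> dict:
    """Assemble the JSON payload for a stream progress request.
    'cancel' is optional; sending it as true stops further generation,
    and only the tokens already produced will be billed."""
    payload = {"job_id": job_id, "key": api_key}
    if cancel:
        payload["cancel"] = True
    return payload
```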

Depending on the state of the job, the stream progress call returns the following response:

{
    "success": true,
    "job_id": "7412cf4b-c230-454a-a1db-7e5df401f973",
    "job_state": "processing",
    "progress": {
        "queue_position": 0,
        "estimate": 2.9,
        "progress_data": {
            "text": "Hello",
            "num_generated_tokens": 1,
            "current_context_length": 145
        }
    }
}

job_state:string

Returns the state the streaming request is currently in. Possible values are:

‘queued’

This status is returned when the request is waiting in the queue to be processed. This state should rarely be seen, as AKI.IO is dedicated to fast streaming of LLM requests. If this state is returned, “queue_position” is the waiting position in the queue and “estimate” the estimated time until processing will begin.

‘processing’

This is the state while the request is actively processing and generating output. The “progress_data” branch holds a reduced set of the final result: the output generated so far is in the “text” attribute, and “num_generated_tokens” returns the number of tokens generated up to that point.

‘canceled’

This state is returned if the request was canceled at the user's request, or if the request timed out because too much time passed between starting the request and fetching the stream results.

‘done’

The final streaming response has the job_state ‘done’. It has a different JSON layout and contains, in the “job_result” branch, the same data attributes one receives with a blocking call.

{
    "success": true,
    "job_id": "7412cf4b-c230-454a-a1db-7e5df401f973",
    "job_state": "done",
    "job_result": {
        "text": "Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta.\n\nI hope that made you smile! Do you want to hear another one?",
        "model_name": "Llama-3.1-8B-Instruct",
        "max_seq_len": 65536,
        "prompt_length": 164,
        "num_generated_tokens": 33,
        "current_context_length": 197,
        "total_duration": 0.505,
        "compute_duration": 0.488
    }
}

It is advised to call /stream/progress/{endpoint name} at a rate suitable to your application: a very high rate for real-time use, or a lower rate for bulk processing.

The stream progress should be polled until job_state is 'done' or 'canceled', or until success is false, indicating that an error has occurred.
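The polling loop described above can be sketched as follows. The post_progress callable stands in for the HTTPS POST to the progress URL and is an assumption of this sketch; only the job_state values and response branches documented above are taken from the API:

```python
import time

def poll_until_done(post_progress, job_id, api_key, interval=0.2):
    """Poll the stream progress endpoint until the job finishes.

    post_progress(payload) must perform the HTTPS POST and return the
    decoded JSON response; it is passed in so the loop stays
    transport-agnostic. Returns the final 'job_result' branch, or
    None if the job was canceled.
    """
    while True:
        response = post_progress({"job_id": job_id, "key": api_key})
        if not response.get("success"):
            raise RuntimeError(response.get("error", "unknown error"))
        state = response["job_state"]
        if state == "done":
            return response["job_result"]
        if state == "canceled":
            return None
        # 'queued' or 'processing': while processing, the partial output is
        # available under response["progress"]["progress_data"]["text"].
        time.sleep(interval)
```

The interval parameter corresponds to the polling-rate advice above: keep it small for real-time applications, larger for bulk processing.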

The streaming polling mechanism is very robust against network issues such as connection losses caused by cell changes, IP changes, etc. Interrupted or disconnected streams need no complicated reconnection procedure: the next progress call picks up wherever the last missed or stalled call was discontinued. JSON polling instead of a streaming HTTP response also has the advantage that out-of-band data can be transmitted; in this example, num_generated_tokens always corresponds to the received text, so no side channels are required. The stream can also handle bidirectional communication; in this case, canceling the stream can be sent directly in a progress request.

This was the low-level description of how to communicate with the AKI.IO LLM API directly via HTTPS JSON requests. Client interfaces for JavaScript and Python are currently available that wrap this functionality for a fast and future-proof integration into your application.