LLM Request Parameters

A call to an AKI.IO LLM endpoint is made with an HTTPS POST request to:

https://aki.io/api/call/{endpoint name}

For a list of available LLM endpoints please read here.

The data of the POST request is sent to the endpoint in JSON format.

The basic parameters of an LLM chat request are standardized across all models. An example of a JSON request payload to an LLM endpoint:

{
    "key": "fc3a8c50-b12b-4d6a-ba07-c9f6a6c32c37",
    "chat_context": "[{\"role\":\"system\",\"content\":\"You are a helpful assistant named AKI\"},{\"role\":\"user\",\"content\":\"Tell a joke\"}]",
    "text_context": "",
    "chat_output_format": "chatml",
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.6,
    "max_gen_tokens": 1000,
    "wait_for_result": false
}
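Such a request can be assembled and sent with a short Python sketch using only the standard library. This is a minimal illustration, not an official client: the endpoint name is a placeholder you substitute yourself, and the response is assumed to be JSON.

```python
import json
import urllib.request

AKI_URL = "https://aki.io/api/call/{endpoint}"  # substitute a real endpoint name

def build_payload(api_key, chat_context, temperature=0.6, top_k=40, top_p=0.9,
                  max_gen_tokens=1000, wait_for_result=False):
    """Assemble the JSON request body.

    Note that chat_context is transmitted as a JSON-encoded string
    inside the payload, as shown in the example above.
    """
    return {
        "key": api_key,
        "chat_context": json.dumps(chat_context),
        "text_context": "",
        "chat_output_format": "chatml",
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_gen_tokens": max_gen_tokens,
        "wait_for_result": wait_for_result,
    }

def call_endpoint(endpoint, payload):
    """POST the payload to the endpoint; requires a valid API key to run."""
    req = urllib.request.Request(
        AKI_URL.format(endpoint=endpoint),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # response format assumed to be JSON
```

A blocking call would then look like `call_endpoint("my-llm", build_payload(key, messages, wait_for_result=True))`.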

The description of the JSON attributes:

key:string

Your AKI.IO API key for authentication and authorization. Get your AKI.IO API key at https://aki.io/signup

chat_context:json

The current chat context; the last entry should have role "user", stating the question, command or request for which a response should be generated. The AKI.IO chat_context format is standardized across all offered LLM endpoints and is converted to the model-specific instruct syntax. The chat_context can also define embedded multimodal inputs. For a detailed description of the AKI.IO chat format please read here.

text_context:string

As an alternative to the chat context, a string-based context prompt can be set. Please note that most instruct-based LLMs require a model-specific syntax as input in order to respond. Please consult the documentation of the model in case you would like to use it at a low level or without the instruct format.

Please note that some models give no response at all if the instruct format is not followed precisely.

We advise using the AKI.IO chat format if you are unsure, as it gives you the option to exchange the model with little to no adjustments.

chat_output_format:string

The desired format in which the chat response should be output; available options are "chatml", "raw" and "json". If not specified, the default chat output format is "chatml". The most capable and future-proof output format is currently "json", as it can also embed reasoning channels, multimodal output and support for tool calls.

For a detailed explanation of the LLM chat output formats please read here.

temperature:float

The temperature sampling parameter in the range 0.0 – 1.0

An LLM is a deterministic numeric model that calculates the most probable next token based on the previous tokens. To introduce variation in the responses, randomness is added when picking the next token: instead of always taking the best-scoring token, a token is picked at random from a bucket of the N best tokens. This process is called "sampling".

The temperature controls the distribution of the sampling process. A temperature close to zero stays close to the best-scoring token and seldom strays to the Nth-best token, while a temperature of 1.0 picks a token with even distribution across the N best tokens. With a temperature of 0.0 the best token is always picked; in this case the LLM will always answer a given context with the exact same response.

Rule of thumb: for deterministic tasks the temperature should be close to 0. For creative writing tasks a higher temperature can give more variation and sometimes unexpected but interesting responses.
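A common way temperature sampling is implemented is to divide the token scores by the temperature before converting them into probabilities, so that a higher temperature flattens the distribution. The sketch below illustrates that generic technique; it is not necessarily how AKI.IO implements it internally.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Pick a token index from raw scores ("logits") using temperature.

    Generic logit-scaling sketch of temperature sampling, for
    illustration only.
    """
    rng = rng or random.Random(0)
    if temperature == 0.0:
        # greedy: always return the best-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    # divide scores by the temperature, then exponentiate (softmax weights);
    # a higher temperature flattens the distribution toward uniform
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # draw a token proportionally to its weight
    draw = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if draw < acc:
            return i
    return len(weights) - 1
```

With temperature 0.0 the function always returns the index of the best-scoring token, matching the deterministic behavior described above.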

top_k:int

The Top-K sampling parameter in the range 0..1000

The size of the sampling bucket. A Top-K of 40 means that the answer is sampled from among the 40 best-scoring tokens. Setting Top-K to 1 disables the sampling process, as the best-scoring token will always be taken.

top_p:float

The Top-P sampling parameter in the range 0.0 – 1.0

The Top-P sampling parameter controls the "quality" of the best tokens that are considered for the sampling bucket. Tokens whose quality is below the Top-P value are not considered; the bucket is cut off and may contain fewer tokens than specified by top_k when not enough tokens score above the Top-P threshold.

It is advised to keep this parameter around 0.9 so that only synonyms are allowed in the sampling bucket. E.g. "yes", "true" and "correct" could all be good answers with a quality above 0.9, but not "maybe", which was the 4th-best token but scored only 0.8.
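The cut-off described above can be sketched as a simple threshold filter applied to the sampling bucket. Note that this follows the per-token quality cut-off as documented here; some other frameworks define top_p as a cumulative-probability ("nucleus") cut-off instead.

```python
def apply_top_p(bucket, top_p):
    """Drop bucket entries whose score falls below the Top-P threshold."""
    return [(token, score) for token, score in bucket if score >= top_p]
```

Running the documentation's example through the filter keeps the three synonyms and removes "maybe", which scored below the threshold.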

max_gen_tokens:int

Limits the response to the given maximum number of generated tokens. For most endpoints the output length is limited to 16,000 tokens.

wait_for_result:boolean

For a blocking API call set to "true". For a streaming response set to "false".

Read more about the powerful AKI.IO streaming responses