Early LLMs were plain text-continuation models: they continued a given text in the style of that input. A chat was simulated by supplying a chat-style conversation as the input; the model picked up the format and continued with a chat-style response. Current so-called instruction LLMs use a defined input syntax to control various aspects of the conversation, for example through a “system prompt”. They also accept encoded media (images, documents, audio) within the input text, which enables multimodal, controllable chat responses.

Because every model vendor (Meta, OpenAI, Qwen, etc.) defines its own “chat context” format with subtle differences in syntax, AKI defines a meta chat context format, the “AKI Chat Context”, which is automatically converted to the syntax the specific model understands.
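To illustrate what such a conversion might look like, here is a minimal sketch that renders a list of role/content messages (the structure described in the next section) as ChatML-style text, a markup used by some model families. The function name `to_chatml` and the `add_generation_prompt` parameter are assumptions made for this illustration; AKI's actual converter is not shown here and may work differently. Only plain-text content is handled.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages as ChatML-style text."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers and tagged with its role.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn asks the model to write the next reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant named AKI"},
    {"role": "user", "content": "Tell a joke"},
]
prompt = to_chatml(messages)
print(prompt)
```

A different model family would need a different rendering function over the same message list, which is the point of keeping the chat context vendor-neutral.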

The AKI Chat Context Format

The AKI Chat Context is a JSON structure with the following basic form:

[
    {
        "role": "system",
        "content": "You are a helpful assistant named AKI"
    },
    {
        "role": "assistant",
        "content": "How can I help you?"
    },
    {
        "role": "user",
        "content": "Tell a joke"
    }
]
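As a small sketch of how this structure could be produced from code, the following builds the same context as a Python list of dictionaries and serializes it with the standard `json` module. The variable names are illustrative only. Note that JSON requires straight double quotes around keys and string values.

```python
import json

# The same three-entry context as above, as native Python data.
chat_context = [
    {"role": "system", "content": "You are a helpful assistant named AKI"},
    {"role": "assistant", "content": "How can I help you?"},
    {"role": "user", "content": "Tell a joke"},
]

# Serialize to a JSON string, then parse it back to confirm it is valid JSON.
payload = json.dumps(chat_context, indent=4)
restored = json.loads(payload)
print(payload)
```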

Three roles are defined: the assistant, the user, and the system.

The user role carries the input coming from the user, or a request to the assistant.

The assistant is the LLM that should answer the request or fulfil the task.

The system is the boss of the assistant: it describes the setting, briefs the assistant on how to do its job and what its role is, and provides background information.

A chat is an alternating conversation between user and assistant. To send a request to the LLM, the last entry should be the user's request.
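The rules above can be checked before a request is sent. The following is a minimal sketch, not part of the AKI specification: the helper name `check_chat_context` and the wording of its messages are assumptions made for illustration. It verifies only that each entry uses one of the three roles, has a content field, and that the last entry is a user request.

```python
VALID_ROLES = {"system", "assistant", "user"}


def check_chat_context(messages):
    """Return a list of problems; an empty list means no problem was found."""
    if not messages:
        return ["context is empty"]
    problems = []
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in VALID_ROLES:
            problems.append(f"entry {index}: unknown role {role!r}")
        if "content" not in message:
            problems.append(f"entry {index}: missing content")
    # A request to the LLM should end with the user's turn.
    if messages[-1].get("role") != "user":
        problems.append("last entry should be a user request")
    return problems


good = [
    {"role": "system", "content": "You are a helpful assistant named AKI"},
    {"role": "user", "content": "Tell a joke"},
]
bad = [
    {"role": "user", "content": "Tell a joke"},
    {"role": "assistant", "content": "Why did the chicken cross the road?"},
]
print(check_chat_context(good))
print(check_chat_context(bad))
```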

Multimodal Content

The “content” part can take an extended form to embed media such as images, audio and video, either as input from the user (for describing or asking questions about the media) or as an answer from the assistant containing generated images, audio or video.

Only specialized LLMs support media inputs and outputs.

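User request with embedded media

The media is embedded directly in the JSON in base64 format. Encoding and tokenizing of the media are taken care of when the request is processed. In the example, the {base64_image}, {base64_audio} and {base64_video} placeholders, written in Python f-string notation, stand for the base64-encoded file contents.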

{
    "role": "user",
    "content": [
        {
            "image": f"data:image/jpeg;base64,{base64_image}"
        },
        {
            "audio": f"data:audio/mp3;base64,{base64_audio}"
        },
        {
            "video": f"data:video/mp4;base64,{base64_video}"
        },
        {
            "text": "What's in this image?"
        }
    ]
}
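A minimal sketch of how such a user entry could be assembled in Python, using the standard `base64` module. The helper names `to_data_uri` and `user_message_with_image` are hypothetical and chosen for this illustration; the four placeholder bytes stand in for the contents of a real JPEG file.

```python
import base64


def to_data_uri(raw_bytes, mime_type):
    """Encode raw bytes as a base64 data URI, e.g. data:image/jpeg;base64,..."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def user_message_with_image(image_bytes, question, mime_type="image/jpeg"):
    """Build a user entry with one embedded image followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"image": to_data_uri(image_bytes, mime_type)},
            {"text": question},
        ],
    }


# Placeholder bytes; in practice read them with open(path, "rb").read().
image_bytes = b"\xff\xd8\xff\xe0"
message = user_message_with_image(image_bytes, "What's in this image?")
print(message["content"][0]["image"])
```

Audio and video parts can be built the same way by passing a different MIME type to `to_data_uri` and using the "audio" or "video" key.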

Assistant response with generated media

{
    "role": "assistant",
    "content": [
        {
            "text": "Here is a bar chart of the given values"
        },
        {
            "image": f"data:image/jpeg;base64,{base64_image}"
        },
        {
            "audio": f"data:audio/mp3;base64,{base64_audio}"
        },
        {
            "video": f"data:video/mp4;base64,{base64_video}"
        },
        {
            "text": "Are the bars in the correct order?"
        }
    ]
}
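On the receiving side, a client needs to separate the text parts from the embedded media and decode the latter. The following is a sketch under the assumption that every non-text part holds a base64 data URI as in the example above; the helper names `decode_data_uri` and `split_response` are hypothetical.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, encoded = data_uri.partition(",")
    mime_type = header[len("data:"):].split(";")[0]
    return mime_type, base64.b64decode(encoded)


def split_response(message):
    """Collect text parts and decoded media parts from an assistant entry."""
    texts, media = [], []
    content = message["content"]
    if isinstance(content, str):
        # Basic form: content is a single plain string.
        return [content], media
    for part in content:
        for kind, value in part.items():
            if kind == "text":
                texts.append(value)
            else:
                mime_type, raw_bytes = decode_data_uri(value)
                media.append((kind, mime_type, raw_bytes))
    return texts, media


fake_image = base64.b64encode(b"fake-jpeg").decode("ascii")
response = {
    "role": "assistant",
    "content": [
        {"text": "Here is a bar chart of the given values"},
        {"image": "data:image/jpeg;base64," + fake_image},
        {"text": "Are the bars in the correct order?"},
    ],
}
texts, media = split_response(response)
print(texts)
print(media)
```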

We are currently extending the chat context format with tool-calling mechanisms and will publish an updated specification soon.
