Multimodal

Baleybots provides builder functions for multimodal inputs. Import them from @baleybots/core and pass the result as the input to bot.process().

Text

import { Baleybot, text } from '@baleybots/core';

const bot = Baleybot.create({ name: 'bot', goal: 'Answer questions' });

const result = await bot.process(text('What is the capital of France?'));

You can also pass a plain string directly; text() is mainly useful when combining text with other modalities.

Image

import { Baleybot, image, combine, text } from '@baleybots/core';

const bot = Baleybot.create({ name: 'vision', goal: 'Describe images' });

// From a URL
const result = await bot.process(
  combine(text('Describe this image'), image('https://example.com/photo.jpg'))
);

// From an ArrayBuffer (the fetched image bytes)
const buffer = await fetch('https://example.com/photo.jpg').then(r => r.arrayBuffer());
const result2 = await bot.process(image(buffer));

Audio

import { Baleybot, audio, combine, text } from '@baleybots/core';

const bot = Baleybot.create({ name: 'audio-bot', goal: 'Analyze audio content' });

const audioBuffer = await fetch('https://example.com/speech.mp3').then(r => r.arrayBuffer());
const result = await bot.process(
  combine(text('Transcribe this audio'), audio(audioBuffer))
);

Video / Frames

The frames() function accepts an array of extracted frames or a MediaStream for real-time video. The video() function is an alias for frames().

import { Baleybot, video, combine, text } from '@baleybots/core';

const bot = Baleybot.create({ name: 'video-bot', goal: 'Analyze video content' });

// From a live MediaStream
const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
const result = await bot.process(
  combine(text('What is happening in this video?'), video(stream))
);
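
When passing an array of extracted frames rather than a live stream, you first have to decide which moments of the clip to sample. The sketch below covers only that step: frameTimestamps is a hypothetical helper, not part of @baleybots/core, and it assumes you want evenly spaced samples. How each frame is actually extracted, and the exact shape of MediaFrame, are not covered here.

```typescript
// Hypothetical helper (not part of @baleybots/core): returns evenly spaced
// sample times, in seconds, across a clip of the given duration.
function frameTimestamps(durationSec: number, count: number): number[] {
  // Reject non-positive or non-numeric durations and non-integer counts.
  if (!(durationSec > 0) || !Number.isInteger(count) || count <= 0) {
    return [];
  }
  // Sample at the midpoint of each of `count` equal segments, so no sample
  // lands on the very first or very last instant of the clip.
  const step = durationSec / count;
  const times: number[] = [];
  for (let i = 0; i < count; i++) {
    times.push((i + 0.5) * step);
  }
  return times;
}

// A 10-second clip sampled 5 times.
const sampleTimes = frameTimestamps(10, 5); // [1, 3, 5, 7, 9]
```

Midpoint sampling is one reasonable choice; sampling at scene changes or at a fixed frame rate may suit other use cases better.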

File

For documents, PDFs, CSVs, and other file types:

import { Baleybot, file, combine, text } from '@baleybots/core';

const bot = Baleybot.create({ name: 'doc-bot', goal: 'Analyze documents' });

const pdfBlob = await fetch('report.pdf').then(r => r.blob());
const result = await bot.process(
  combine(text('Summarize this document'), file(pdfBlob, 'application/pdf'))
);
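
Because file() takes the MIME type as its second argument, it can help to derive that value from the file name instead of hard-coding it. The following is a minimal sketch: mimeTypeFor and its lookup table are hypothetical and not part of @baleybots/core, and the table covers only a handful of common extensions.

```typescript
// Hypothetical lookup table (not part of @baleybots/core); extend as needed.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  csv: 'text/csv',
  json: 'application/json',
  txt: 'text/plain',
  md: 'text/markdown',
};

// Hypothetical helper: picks a MIME type from a file name's extension.
function mimeTypeFor(
  filename: string,
  fallback: string = 'application/octet-stream'
): string {
  const dot = filename.lastIndexOf('.');
  // No extension, a dotfile such as ".env", or a trailing dot:
  // fall back to a generic binary type.
  if (dot <= 0 || dot === filename.length - 1) {
    return fallback;
  }
  const ext = filename.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION[ext] ?? fallback;
}

const pdfType = mimeTypeFor('report.pdf'); // 'application/pdf'
const unknownType = mimeTypeFor('README'); // 'application/octet-stream'
```

The result could then be passed as the second argument, for example file(pdfBlob, mimeTypeFor('report.pdf')).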

Combining inputs

Use combine() to pass multiple input types together:

import { Baleybot, combine, text, image, audio } from '@baleybots/core';

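const bot = Baleybot.create({
  name: 'multimodal-bot',
  goal: 'Analyze multimedia content',
});

// Fetch the audio bytes to pass alongside the image
const audioBuffer = await fetch('https://example.com/speech.mp3').then(r => r.arrayBuffer());

const result = await bot.process(
  combine(
    text('Compare the audio description with what you see in the image'),
    image('https://example.com/photo.jpg'),
    audio(audioBuffer)
  )
);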
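
As a mental model, it can help to think of each builder as producing a tagged part and of combine() as merging parts into one ordered list. The sketch below illustrates that idea only. The Part type, textPart, imagePart, and combineParts are all assumptions made for illustration; they are not the library's actual types or internals.

```typescript
// Illustrative model only: these types are assumptions, not @baleybots/core internals.
type Part =
  | { kind: 'text'; value: string }
  | { kind: 'image'; source: string | ArrayBuffer }
  | { kind: 'audio'; source: string | ArrayBuffer };

type Input = Part | Part[];

// Hypothetical builders that tag raw values with their modality.
const textPart = (value: string): Part => ({ kind: 'text', value });
const imagePart = (source: string | ArrayBuffer): Part => ({ kind: 'image', source });

// Flatten any mix of single parts and already-combined lists into one
// list, preserving the order in which the inputs were given.
function combineParts(...inputs: Input[]): Part[] {
  const out: Part[] = [];
  for (const input of inputs) {
    if (Array.isArray(input)) {
      out.push(...input);
    } else {
      out.push(input);
    }
  }
  return out;
}

// Nested combinations flatten into a single ordered list.
const merged = combineParts(
  textPart('Describe this image'),
  combineParts(imagePart('https://example.com/photo.jpg'))
);
const kinds = merged.map((part) => part.kind); // ['text', 'image']
```

Preserving order matters in a model like this because a text prompt is usually read in relation to the media placed around it.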

Summary

Function                 Input types                            Description
-----------------------  -------------------------------------  -------------------------------
text(string)             Plain text                             Text input
image(url | buffer)      URL string, Buffer, ArrayBuffer        Single image or array of images
audio(url | buffer)      URL string, Buffer, ArrayBuffer        Audio content
video(frames | stream)   MediaFrame[], MediaStream              Video frames or live stream
frames(frames | stream)  MediaFrame[], MediaStream              Alias for video()
file(data, mimeType)     Blob, Uint8Array, ArrayBuffer, string  Generic file with MIME type
combine(...inputs)       Any of the above                       Merge multiple inputs together