BentoML Documentation¶

BentoML is a Unified Inference Platform for deploying and scaling AI systems with any model, on any cloud.

Featured examples¶

Deploy an open-source LLM endpoint

Serve large language models with OpenAI-compatible APIs and vLLM inference backend.

LLM inference: vLLM

Document Q&A with RAG

Deploy private RAG systems with open-source embedding and large language models.

RAG: Document ingestion and search

Serve diffusion models

Deploy image generation APIs with flexible customization and optimized batch processing.

Stable Diffusion XL Turbo

Deploy ComfyUI pipelines

Automate reproducible workflows with queued execution using ComfyUI pipelines.

ComfyUI: Deploy workflows as APIs

Build a phone calling agent

Build a phone calling agent with end-to-end streaming capabilities using open-source models and Twilio.

https://github.com/bentoml/BentoTwilioConversationRelay

LLM safety: ShieldGemma

Protect your LLM API endpoint from harmful input using Google’s safety content moderation model.

LLM safety: ShieldGemma

More examples 👉

Explore what developers are building with BentoML.

Overview

What is BentoML¶

BentoML is a Unified Inference Platform for deploying and scaling AI models with production-grade reliability, all without the complexity of managing infrastructure. It enables your developers to build AI systems 10x faster with custom models, scale efficiently in your cloud, and maintain complete control over security and compliance.

The architecture diagram of the BentoML unified inference platform

To get started with BentoML:

Use pip to install the BentoML open-source model serving framework, which is distributed as a Python package on PyPI.
```
# Recommend Python 3.9+
pip install bentoml
```
Sign up for BentoCloud to get a free trial.

How-tos¶

Create online API Services

Build your custom AI APIs with BentoML.

Create online API Services

Create Deployments

Deploy your AI application to production with one command.

Create Deployments

Concurrency and autoscaling

Configure fast autoscaling to achieve optimal performance.

Concurrency and autoscaling

Work with GPUs

Run model inference on GPUs with BentoML.

Work with GPUs

Develop with Codespaces

Develop with powerful cloud GPUs using your favorite IDE.

Develop with Codespaces

Load and manage models

Load and serve your custom models with BentoML.

Load and manage models

Stay informed¶

The BentoML team uses the following channels to announce important updates like major product releases and share tutorials, case studies, as well as community news.

To receive release notifications, star and watch the BentoML project on GitHub. For release notes and detailed changelogs, see the Releases page.