Deploy a large language model with OpenLLM and BentoML#
As an important component of the BentoML ecosystem, OpenLLM is an open platform designed for operating and deploying large language models (LLMs) in production. It lets you fine-tune, serve, deploy, and monitor LLMs with ease, and supports a wide range of state-of-the-art LLMs and model runtimes, including StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.
With OpenLLM, you can deploy your models to the cloud or on-premises and build powerful AI applications. It integrates with other tools and services such as LangChain, BentoML, and Hugging Face, allowing you to compose more complex AI applications.
This quickstart demonstrates how to integrate OpenLLM with BentoML to deploy a large language model.
Prerequisites#
- Make sure you have Python 3.8+ and pip installed. See the Python downloads page to learn more.
- You have BentoML installed.
- You have a basic understanding of key concepts in BentoML, such as Services and Bentos. We recommend you read Deploy a Transformer model with BentoML first.
- (Optional) Install Docker if you want to containerize the Bento.
- (Optional) We recommend you create a virtual environment for dependency isolation for this quickstart. For more information about virtual environments in Python, see Creation of virtual environments.
Install OpenLLM#
Run the following command to install OpenLLM.
pip install openllm
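If you want to confirm the installation from Python, a quick check of the installed package versions (a small sketch that uses only the standard library and the package names installed above) could look like this:

import importlib.metadata

# Print the installed versions of openllm and bentoml to confirm the setup.
for pkg in ("openllm", "bentoml"):
    print(pkg, importlib.metadata.version(pkg))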
Create a BentoML Service#
Create a service.py file to define a BentoML Service and a model Runner. As the Service starts, the model defined in it will be downloaded automatically if it does not exist locally.
from __future__ import annotations

import bentoml
import openllm

# Name of the OpenLLM model to serve. Run `openllm models` to list supported models.
model = "dolly-v2"

# Create a Runner for the model and wrap it in a BentoML Service.
llm_runner = openllm.Runner(model)

svc = bentoml.Service(name="llm-dolly-service", runners=[llm_runner])

@svc.on_startup
def download(_: bentoml.Context):
    # Download the model weights if they are not in the local Model Store yet.
    llm_runner.download_model()

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    # Run the prompt through the model and return the generated text.
    answer = await llm_runner.generate.async_run(input_text)
    return answer[0]["generated_text"]
Here is a breakdown of this service.py file.

- model: The model variable holds the name of the model to be used (dolly-v2 in this example). Run openllm models to view all supported models and their corresponding model IDs. Note that certain models may only support running on GPUs.
- openllm.Runner(): Creates a bentoml.Runner instance for the specified model (see the sketch after this list for a way to try the Runner on its own).
- bentoml.Service(): Creates a BentoML Service named llm-dolly-service and wraps the previously created Runner into the Service.
- @svc.on_startup: Different from the Transformer model quickstart, this tutorial uses the on_startup hook to run an action when the Service starts. It calls download_model() to ensure the necessary model and weights are downloaded if they do not exist locally, so the Service is ready to serve requests as soon as it starts.
- @svc.api(): Defines an API endpoint for the BentoML Service that takes a text input and returns a text output. The endpoint's behavior is implemented in the prompt() function: it takes a string of text, runs it through the model to generate an answer, and returns the generated text.
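If you want to sanity-check the Runner by itself before serving, you can initialize it in the current process. This is a minimal debugging-only sketch, assuming a BentoML 1.x Runner whose init_local() method loads the model in-process; note that the first call downloads several GiB of weights.

import openllm

llm_runner = openllm.Runner("dolly-v2")
llm_runner.download_model()  # fetch the weights if they are not stored locally yet
llm_runner.init_local()      # debugging helper: load the model in the current process
result = llm_runner.generate.run("What are Large Language Models?")
print(result[0]["generated_text"])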
Use bentoml serve to start the Service.
$ bentoml serve service:svc
2023-07-11T16:17:38+0800 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service:svc" can be accessed at http://localhost:3000/metrics.
2023-07-11T16:17:39+0800 [INFO] [cli] Starting production HTTP BentoServer from "service:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
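Before sending a prompt, you can optionally verify that the server has finished starting up. This small sketch uses the requests library and assumes the /readyz health endpoint that BentoML HTTP servers expose:

import requests

# Returns HTTP 200 once the Service and its Runners are ready to accept traffic.
response = requests.get("http://localhost:3000/readyz")
print(response.status_code)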
The server is now active at http://0.0.0.0:3000. You can interact with it in different ways, for example with curl, with a Python HTTP client, or from the web UI in your browser.
curl -X 'POST' \
'http://0.0.0.0:3000/prompt' \
-H 'accept: text/plain' \
-H 'Content-Type: text/plain' \
-d '$PROMPT' # Replace $PROMPT here with your prompt.
import requests

response = requests.post(
    "http://0.0.0.0:3000/prompt",
    headers={
        "accept": "text/plain",
        "Content-Type": "text/plain",
    },
    data="$PROMPT",  # Replace $PROMPT here with your prompt.
)

print(response.text)
Visit http://0.0.0.0:3000, scroll down to Service APIs, and click Try it out. In the Request body box, enter your prompt and click Execute.
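You can also call the endpoint with BentoML's Python client instead of raw HTTP. This is a sketch assuming a recent BentoML 1.x release, where bentoml.client.Client.from_url exposes each Service API as a client method named after the endpoint:

from bentoml.client import Client

client = Client.from_url("http://localhost:3000")
# The Service defines a "prompt" endpoint, so the client exposes a prompt() method.
print(client.prompt("What are Large Language Models?"))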

The following example shows the model's answer to a question about the concept of large language models.
Input:
What are Large Language Models?
Output:
Large Language Models (LLMs) are statistical models that are trained using a large body of text to recognize words, phrases, sentences, and paragraphs. A neural network is used to train the LLM and a likelihood score is used to quantify the quality of the model's predictions. LLMs are also called named entity recognition models and can be used in various applications, including question answering, sentiment analysis, and information retrieval.
The model should be downloaded automatically to the Model Store.
$ bentoml models list
Tag Module Size Creation Time
pt-databricks-dolly-v2-3b:f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df openllm.serialisation.transformers 5.30 GiB 2023-07-11 16:17:26
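You can also inspect the downloaded model from Python through BentoML's model store API. A short sketch (the hash in your tag will differ from the listing above):

import bentoml

# Look up the Dolly model that OpenLLM saved to the local Model Store.
dolly = bentoml.models.get("pt-databricks-dolly-v2-3b:latest")
print(dolly.tag)          # full tag, including the version hash
print(dolly.path)         # where the model files live on disk
print(dolly.info.module)  # e.g. openllm.serialisation.transformers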
Build a Bento#
After the Service is ready, you can package it into a Bento by specifying a configuration YAML file (bentofile.yaml) that defines the build options. See Bento build options to learn more.
service: "service:svc"
include:
  - "*.py"
python:
  packages:
    - openllm
models:
  - pt-databricks-dolly-v2-3b:latest
Run bentoml build in your project directory to build the Bento.
$ bentoml build
Building BentoML service "llm-dolly-service:oatecjraxktp6nry" from build context "/Users/demo/Documents/openllm-test".
Packing model "pt-databricks-dolly-v2-3b:f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df"
Locking PyPI package versions.
Successfully built Bento(tag="llm-dolly-service:oatecjraxktp6nry").
Possible next steps:
* Containerize your Bento with `bentoml containerize`:
$ bentoml containerize llm-dolly-service:oatecjraxktp6nry
* Push to BentoCloud with `bentoml push`:
$ bentoml push llm-dolly-service:oatecjraxktp6nry
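If you prefer to drive the build from Python rather than the CLI, BentoML also exposes a programmatic build API. This is a sketch that mirrors only the service, include, and python sections of the bentofile.yaml above, assuming the current directory is the project's build context:

import bentoml

# Programmatic counterpart of `bentoml build`; adapt the options to your needs.
bento = bentoml.bentos.build(
    "service:svc",
    include=["*.py"],
    python={"packages": ["openllm"]},
    build_ctx=".",
)
print(bento.tag)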
Deploy a Bento#
To containerize the Bento with Docker, run:
bentoml containerize llm-dolly-service:oatecjraxktp6nry
You can then deploy the Docker image in different environments like Kubernetes. Alternatively, push the Bento to BentoCloud for distributed deployments of your model. For more information, see Deploy Bentos.