Cloud deployment¶
BentoCloud is an Inference Management Platform and Compute Orchestration Engine built on top of BentoML’s open-source serving framework. It provides a complete stack for building fast and scalable AI systems with any model, on any cloud.
Why developers love BentoCloud:
Flexible Pythonic APIs for building inference APIs, batch jobs, and compound AI systems
Blazing fast cold start with a container infrastructure stack rebuilt for ML/AI workloads
Support for any ML framework and inference runtime (vLLM, TensorRT, Triton, etc.)
Streamlined workflows across development, testing, deployment, monitoring, and CI/CD
Easy access to various GPUs like L4 and A100, in our cloud or yours
Log in to BentoCloud¶
Visit the BentoML website to sign up.
Install BentoML.
pip install bentoml
Log in to BentoCloud with the bentoml cloud login command. Follow the on-screen instructions to create a new API token.
$ bentoml cloud login
? How would you like to authenticate BentoML CLI? [Use arrows to move]
> Create a new API token with a web browser
  Paste an existing API token
Deploy your first model¶
Clone the Hello world example.
git clone https://github.com/bentoml/quickstart.git
cd quickstart
Deploy it to BentoCloud from the project directory. Optionally, use the -n flag to set a name.
bentoml deploy . -n my-first-bento
Sample output:
🍱 Built bento summarization:ngfnciv5g6nxonry
Successfully pushed Bento "summarization:ngfnciv5g6nxonry"
✅ Created deployment "my-first-bento" in cluster "google-cloud-us-central-1"
💻 View Dashboard: https://demo.cloud.bentoml.com/deployments/my-first-bento
The first Deployment might take a minute or two. Wait until it’s fully ready:
✅ Deployment "my-first-bento" is ready: https://demo.cloud.bentoml.com/deployments/my-first-bento
On the BentoCloud console, navigate to the Deployments page, and click your Deployment. Once it’s up and running, you can interact with it using the Form section on the Playground tab.
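If you prefer to script Deployments rather than use the CLI, BentoML also provides a Python deployment API. The snippet below is a minimal sketch, assuming bentoml.deployment.create accepts the project path through the bento argument and the Deployment name through name, mirroring bentoml deploy . -n my-first-bento:

import bentoml

# Build the Bento from the current project directory, push it, and deploy it,
# equivalent to: bentoml deploy . -n my-first-bento
bentoml.deployment.create(
    bento="./",
    name="my-first-bento",
)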
Call the Deployment endpoint¶
Retrieve the Deployment URL via the CLI. Replace my-first-bento if you used another name.
bentoml deployment get my-first-bento -o json | jq '."endpoint_urls"'
Note
Ensure jq is installed for processing the JSON output.
Create a BentoML client to call the exposed endpoint. Replace the example URL with your Deployment's URL:
import bentoml

client = bentoml.SyncHTTPClient("https://my-first-bento-e3c1c7db.mt-guc1.bentoml.ai")
result: str = client.summarize(
    text="Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking 20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to celebrate what is being hailed as 'The Leap of the Century.'",
)
print(result)
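If you would rather not hard-code the endpoint URL, you can let BentoML look it up for you. The following is a minimal sketch, assuming the Deployment object returned by bentoml.deployment.get exposes a get_client() helper, as in recent BentoML releases:

import bentoml

# Resolve the Deployment by name and build a client bound to its endpoint
deployment = bentoml.deployment.get(name="my-first-bento")
client = deployment.get_client()

result: str = client.summarize(text="A quick sanity check for the deployed summarization service.")
print(result)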
Configure scaling¶
The replica count defaults to 1. You can update the minimum and maximum replicas allowed for scaling:
bentoml deployment update my-first-bento --scaling-min 0 --scaling-max 3
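The same change can be made from Python. Here is a minimal sketch, assuming bentoml.deployment.update accepts scaling_min and scaling_max keyword arguments that mirror the CLI flags:

import bentoml

# Allow the Deployment to scale to zero when idle and up to 3 replicas under load
bentoml.deployment.update(
    name="my-first-bento",
    scaling_min=0,
    scaling_max=3,
)

Setting the minimum to 0 lets the Deployment scale down to zero replicas when it receives no traffic, at the cost of a cold start on the next request.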
Cleanup¶
To terminate this Deployment, click Stop in the top right corner of its details page or simply run:
bentoml deployment terminate my-first-bento
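The same can be done from Python. A minimal sketch, assuming bentoml.deployment.terminate takes the Deployment name:

import bentoml

# Terminate the Deployment created earlier in this guide
bentoml.deployment.terminate(name="my-first-bento")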
More resources¶
If you are a first-time user of BentoCloud, we recommend you read the following documents to get started:
Deploy example projects to BentoCloud