Transformers#
🤗 Transformers is a popular open-source library for natural language processing, providing pre-trained models and tools for building, training, and deploying custom language models. It offers support for a wide range of transformer-based architectures, access to pre-trained models for various NLP tasks, and the ability to fine-tune pre-trained models on specific tasks. BentoML provides native support for serving and deploying models trained with Transformers.
Compatibility#
BentoML requires Transformers version 4 or above. For other versions of Transformers, consider using a Custom Runner.
When constructing a bentofile.yaml, include transformers and the machine learning framework of the model, e.g. pytorch, tensorflow, or jax.
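For example, a minimal bentofile.yaml for a PyTorch-backed Transformers model might look like the following sketch; the service entry point and include patterns are placeholders for your own project:

service: "service:svc"
include:
  - "*.py"
python:
  packages:
    - transformers
    - torch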
Pre-Trained Models#
Transformers provides pre-trained models for a wide range of tasks, including text classification, question answering, language translation, and text generation. The pre-trained models have been trained on large amounts of data and are designed to be fine-tuned on specific downstream tasks. Fine-tuning pretrained models is a highly effective practice that enables users to reduce computation costs while adapting state-of-the-art models to their specific domain dataset. To facilitate this process, Transformers provides a diverse range of libraries specifically designed for fine-tuning pretrained models. To learn more, refer to the Transformers guide on fine-tuning pretrained models.
Tip
Saving and loading pre-trained instances with the bentoml.transformers APIs is supported starting from release v1.0.17.
Saving Pre-Trained Models and Instances#
Pre-trained models can be saved either as a pipeline or as a standalone model. Other pre-trained instances from Transformers, such as tokenizers, preprocessors, and feature extractors, can also be saved as standalone models using the bentoml.transformers.save_model API.
import bentoml

from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Load the pre-trained processor, model, and vocoder from the Hugging Face Hub
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Save each instance to the BentoML model store
bentoml.transformers.save_model("speecht5_tts_processor", processor)
bentoml.transformers.save_model("speecht5_tts_model", model, signatures={"generate_speech": {"batchable": False}})
bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)
To load the pre-trained instances for testing and debugging, use bentoml.transformers.load_model with the same tags.
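For example, a quick sanity check might look like this minimal sketch; the tags match the save_model calls above:

import bentoml

# Load the saved instances back from the model store for local testing
processor = bentoml.transformers.load_model("speecht5_tts_processor:latest")
model = bentoml.transformers.load_model("speecht5_tts_model:latest")
vocoder = bentoml.transformers.load_model("speecht5_tts_vocoder:latest")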
Starting from BentoML version 1.1.9, importing pre-trained Transformers models from Hugging Face has been further streamlined. You can use the new bentoml.transformers.import_model function to import models directly into the BentoML Model Store without the overhead of loading them into memory. By contrast, bentoml.transformers.save_model requires loading the model first and can be resource-intensive for models with large weights. Here is an example of using the new function:
import bentoml
from transformers import AutoTokenizer
# Save the tokenizer with minimal memory overhead
tokenizer = AutoTokenizer.from_pretrained("t5-small")
bentoml.transformers.save_model('t5-small-tokenizer', tokenizer)
# Import the model without loading into memory, conserving memory
bentoml.transformers.import_model("t5-small-model", "t5-small")
The bentoml.transformers.import_model function has two required parameters:

- name: The name of the model in the BentoML Model Store.
- model_name_or_path: This can be a string, either a Hugging Face repository identifier (repo_id) or a directory path containing weights saved using transformers.AutoModel.save_pretrained (for example, ./my_pretrained_directory/).
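As a sketch of the directory-path variant (the directory name here is just an illustration), a model exported with save_pretrained can be imported by path in the same way:

import bentoml

from transformers import AutoModel

# Export weights to a local directory, then import that directory into the model store
model = AutoModel.from_pretrained("t5-small")
model.save_pretrained("./my_pretrained_directory/")

bentoml.transformers.import_model("t5-small-local", "./my_pretrained_directory/")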
When importing models from repositories that require the keyword argument trust_remote_code=True for custom-defined model classes, BentoML loads the model into memory by default. In such cases, to avoid loading the model into memory, add the keyword argument clone_repository=True. Note that because this downloads all files in the repository instead of selectively picking certain model files, it results in greater storage requirements. Here is how you can invoke this:
# Import a trust_remote_code=True model without loading it into memory by cloning the entire repository
import bentoml

model = "your_model_name_or_path"
task = "your_task_name"

bentoml.transformers.import_model(
    name=task,
    model_name_or_path=model,
    trust_remote_code=True,
    clone_repository=True,  # Avoid loading the model into memory
    metadata=dict(model_name=model),
)
Serving Pretrained Models and Instances#
Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. If you wish to run them in isolated processes, use pre-trained models and instances as individual framework runners. On the other hand, if you wish to run them in the same process, use pre-trained models and instances in a custom runner. Using a custom runner is typically more efficient as it can avoid unnecessary overhead incurred during interprocess communication.
To use pre-trained models and instances as individual framework runners, get the model references and convert them to runners using the to_runner method.
import bentoml
import torch

from bentoml.io import Text, NumpyNdarray
from datasets import load_dataset

processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()

# Speaker embeddings used to condition the text-to-speech model
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

svc = bentoml.Service("text2speech", runners=[processor_runner, model_runner, vocoder_runner])

@svc.api(input=Text(), output=NumpyNdarray())
def generate_speech(inp: str):
    inputs = processor_runner.run(text=inp, return_tensors="pt")
    speech = model_runner.generate_speech.run(input_ids=inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder_runner.run)
    return speech.numpy()
Alternatively, to use the pre-trained models and instances together in a custom runner, use the bentoml.transformers.get API to get the model references and load them in a custom runner. The pre-trained instances can then be used for inference in the custom runner.
import bentoml
import torch

from datasets import load_dataset

processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
model_ref = bentoml.models.get("speecht5_tts_model:latest")
vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")


class SpeechT5Runnable(bentoml.Runnable):
    def __init__(self):
        # Load all pre-trained instances into the same process
        self.processor = bentoml.transformers.load_model(processor_ref)
        self.model = bentoml.transformers.load_model(model_ref)
        self.vocoder = bentoml.transformers.load_model(vocoder_ref)
        self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
        self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    @bentoml.Runnable.method(batchable=False)
    def generate_speech(self, inp: str):
        inputs = self.processor(text=inp, return_tensors="pt")
        speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
        return speech.numpy()


text2speech_runner = bentoml.Runner(SpeechT5Runnable, name="speecht5_runner", models=[processor_ref, model_ref, vocoder_ref])
svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
async def generate_speech(inp: str):
    return await text2speech_runner.generate_speech.async_run(inp)
Built-in Pipelines#
Transformers pipelines are a high-level API for performing common natural language processing tasks using pre-trained transformer models. See Transformers Pipelines tutorial to learn more.
Saving a Pipeline#
To save a Transformers pipeline, first create a Pipeline object using the desired model and other pre-trained instances, and then save it to the model store using the bentoml.transformers.save_model API. Transformers pipelines are callable objects, so the signatures of the model are automatically saved as __call__ by default.
import bentoml

from transformers import pipeline

# Create a fill-mask pipeline from a pre-trained masked-language-model checkpoint
unmasker = pipeline(task="fill-mask", model="distilbert-base-uncased")
bentoml.transformers.save_model(name="unmasker", pipeline=unmasker)
To load the pipeline for testing and debugging, use bentoml.transformers.load_model with the unmasker:latest tag.
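For example, a quick local check might look like this sketch; the [MASK] token follows the fill-mask convention of BERT-style models:

import bentoml

# Load the saved pipeline back from the model store and run a test prediction
unmasker = bentoml.transformers.load_model("unmasker:latest")
print(unmasker("The goal of life is [MASK]."))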
Serving a Pipeline#
See also
See Building a Service to learn more about creating a prediction service with BentoML.
To serve a Transformers pipeline, first get the pipeline reference using the bentoml.transformers.get API and convert it to a runner using the to_runner method.
import bentoml

from bentoml.io import Text, JSON

runner = bentoml.transformers.get("unmasker:latest").to_runner()

svc = bentoml.Service("unmasker_service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def unmask(input_series: str) -> list:
    return await runner.async_run(input_series)
Custom Pipelines#
Transformers custom pipelines allow users to define their own pre- and post-processing logic and to customize how input data is forwarded to the model for inference.
See also
How to add a pipeline from Hugging Face to learn more.
from transformers import Pipeline


class MyClassificationPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Split incoming kwargs into preprocess, forward, and postprocess kwargs
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, text, maybe_arg=2):
        input_ids = self.tokenizer(text, return_tensors="pt")
        return input_ids

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return outputs

    def postprocess(self, model_outputs):
        return model_outputs["logits"].softmax(-1).numpy()
Saving a Custom Pipeline#
A custom pipeline first needs to be added to the Transformers supported tasks, SUPPORTED_TASKS
before it can be created with
the Transformers pipeline
API.
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers.pipelines import SUPPORTED_TASKS

TASK_NAME = "my-classification-task"
TASK_DEFINITION = {
    "impl": MyClassificationPipeline,
    "tf": (),
    "pt": (AutoModelForSequenceClassification,),
    "default": {},
    "type": "text",
}

# Register the custom task so that pipeline() can resolve it by name
SUPPORTED_TASKS[TASK_NAME] = TASK_DEFINITION

classifier = pipeline(
    task=TASK_NAME,
    model=AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
    tokenizer=AutoTokenizer.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
)
Once a new pipeline is added to the Transformers supported tasks, it can be saved to the BentoML model store with the additional arguments task_name and task_definition, the same arguments that were added to the Transformers SUPPORTED_TASKS when creating the pipeline. task_name and task_definition will be saved as model options alongside the model.
import bentoml

bentoml.transformers.save_model(
    "my_classification_model",
    pipeline=classifier,
    task_name=TASK_NAME,
    task_definition=TASK_DEFINITION,
)
Serving a Custom Pipeline#
To serve a custom pipeline, simply create a runner and service with the previously saved pipeline. task_name and task_definition will be automatically applied when initializing the runner.
import bentoml

from bentoml.io import Text, JSON

runner = bentoml.transformers.get("my_classification_model:latest").to_runner()

svc = bentoml.Service("my_classification_service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def classify(input_series: str) -> list:
    return await runner.async_run(input_series)
Adaptive Batching#
If the model supports batched inference, it is recommended to enable batching to take advantage of the adaptive batching capability in BentoML by overriding the signatures argument with the method name (__call__), batchable, and batch_dim configurations when saving the model to the model store.
See also
See Adaptive Batching to learn more.
import bentoml

bentoml.transformers.save_model(
    name="unmasker",
    pipeline=unmasker,
    signatures={
        "__call__": {
            "batchable": True,
            "batch_dim": 0,
        },
    },
)
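With the __call__ signature marked as batchable, a runner created from this model can group concurrent requests. A minimal serving sketch (the service name is illustrative) mirrors the earlier pipeline example; calling async_run lets BentoML apply adaptive batching across concurrent requests:

import bentoml

from bentoml.io import Text, JSON

runner = bentoml.transformers.get("unmasker:latest").to_runner()

svc = bentoml.Service("unmasker_batched", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def unmask(text: str) -> list:
    # Concurrent calls to async_run are grouped into batches by the runner
    return await runner.async_run(text)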