Transformers#

🤗 Transformers is a popular open-source library for natural language processing, providing pre-trained models and tools for building, training, and deploying custom language models. It offers support for a wide range of transformer-based architectures, access to pre-trained models for various NLP tasks, and the ability to fine-tune pre-trained models on specific tasks. BentoML provides native support for serving and deploying models trained with Transformers.

Compatibility#

BentoML requires Transformers version 4 or above. For other versions of Transformers, consider using a Custom Runner.

When constructing a bentofile.yaml, include transformers and the machine learning framework the model uses, e.g. pytorch, tensorflow, or jax. The two examples below show PyTorch and TensorFlow configurations.

bentofile.yaml (PyTorch)#
service: "service.py:svc"
labels:
  owner: bentoml-team
  project: gallery
include:
- "*.py"
python:
  packages:
  - transformers
  - torch
bentofile.yaml (TensorFlow)#
service: "service.py:svc"
labels:
  owner: bentoml-team
  project: gallery
include:
- "*.py"
python:
  packages:
  - transformers
  - tensorflow

Pre-Trained Models#

Transformers provides pre-trained models for a wide range of tasks, including text classification, question answering, language translation, and text generation. These models have been trained on large amounts of data and are designed to be fine-tuned on specific downstream tasks. Fine-tuning pre-trained models is a highly effective practice that reduces computation costs while adapting state-of-the-art models to a specific domain dataset. To facilitate this process, Transformers provides a range of libraries designed for fine-tuning. To learn more, refer to the Transformers guide on fine-tuning pretrained models.

Tip

Saving and loading pre-trained instances with the bentoml.transformers APIs is supported starting from release v1.0.17.

Saving Pre-Trained Models and Instances#

Pre-trained models can be saved either as a pipeline or as a standalone model. Other pre-trained instances from Transformers, such as tokenizers, preprocessors, and feature extractors, can also be saved as standalone models using the bentoml.transformers.save_model API.

train.py#
import bentoml
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

bentoml.transformers.save_model("speecht5_tts_processor", processor)
bentoml.transformers.save_model("speecht5_tts_model", model, signatures={"generate_speech": {"batchable": False}})
bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)

To load the pre-trained instances for testing and debugging, use bentoml.transformers.load_model with the same tags.
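
For example, a minimal sketch that loads the instances saved above back into memory by tag:

import bentoml

# Load the saved pre-trained instances back by tag for a quick local check
processor = bentoml.transformers.load_model("speecht5_tts_processor:latest")
model = bentoml.transformers.load_model("speecht5_tts_model:latest")
vocoder = bentoml.transformers.load_model("speecht5_tts_vocoder:latest")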

Starting from BentoML version 1.1.9, importing pre-trained Transformers models from Hugging Face is further streamlined. You can use the new bentoml.transformers.import_model function to import models directly into the BentoML Model Store without the overhead of loading them into memory. By contrast, bentoml.transformers.save_model requires the model to be loaded first, which can be resource-intensive for models with large weights. Here is an example of using the new function:

download_model.py#
import bentoml
from transformers import AutoTokenizer

# Save the tokenizer with minimal memory overhead
tokenizer = AutoTokenizer.from_pretrained("t5-small")
bentoml.transformers.save_model("t5-small-tokenizer", tokenizer)

# Import the model without loading into memory, conserving memory
bentoml.transformers.import_model("t5-small-model", "t5-small")

The bentoml.transformers.import_model function has two required parameters:

  • name: The name of the model in the BentoML Model Store.

  • model_name_or_path: Either the ID of a model hosted on the Hugging Face Hub (repo_id) or a path to a directory containing weights saved with transformers.AutoModel.save_pretrained (for example, ./my_pretrained_directory/); the directory case is sketched below.
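
The directory form is handy when the weights have already been exported locally. A minimal, hypothetical sketch (the checkpoint, directory path, and store name below are placeholders):

import bentoml
from transformers import AutoModelForSeq2SeqLM

# Export a model with save_pretrained, then import the directory into the Model Store
AutoModelForSeq2SeqLM.from_pretrained("t5-small").save_pretrained("./my_pretrained_directory/")
bentoml.transformers.import_model("t5-small-local", "./my_pretrained_directory/")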

When importing models from repositories that require the keyword argument trust_remote_code=True for custom-defined model classes, BentoML loads the model into memory by default. In such cases, to avoid loading the model into memory, add the keyword argument clone_repository=True. Note that because this downloads every file in the repository rather than only selected model files, it requires more storage. Here is how you can invoke this:

# Import a trust_remote_code=True model without loading it into memory by cloning the entire repository
import bentoml

model = "your_model_name_or_path"
task = "your_task_name"

bentoml.transformers.import_model(
    name=task,
    model_name_or_path=model,
    trust_remote_code=True,
    clone_repository=True,  # This will avoid loading the model into memory
    metadata=dict(model_name=model)
)

Serving Pre-Trained Models and Instances#

Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. To run them in isolated processes, use them as individual framework runners; to run them in the same process, combine them in a custom runner. A custom runner is typically more efficient because it avoids the overhead of interprocess communication.

To use pre-trained models and instances as individual framework runners, retrieve the model references and convert them to runners using the to_runner method.

service.py#
import bentoml
import torch

from bentoml.io import Text, NumpyNdarray
from datasets import load_dataset

processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

svc = bentoml.Service("text2speech", runners=[proccessor_runner, model_runner, vocoder_runner])

@svc.api(input=Text(), output=NumpyNdarray())
def generate_speech(inp: str):
    inputs = processor_runner.run(text=inp, return_tensors="pt")
    speech = model_runner.generate_speech.run(input_ids=inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder_runner.run)
    return speech.numpy()

Alternatively, to use the pre-trained models and instances together in a custom runner, use the bentoml.transformers.get API to get the model references and load them in a custom runner. The pre-trained instances can then be used for inference in the custom runner.

service.py#
import bentoml
import torch

from datasets import load_dataset


processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
model_ref = bentoml.models.get("speecht5_tts_model:latest")
vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")


class SpeechT5Runnable(bentoml.Runnable):

    def __init__(self):
        self.processor = bentoml.transformers.load_model(processor_ref)
        self.model = bentoml.transformers.load_model(model_ref)
        self.vocoder = bentoml.transformers.load_model(vocoder_ref)
        self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
        self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    @bentoml.Runnable.method(batchable=False)
    def generate_speech(self, inp: str):
        inputs = self.processor(text=inp, return_tensors="pt")
        speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
        return speech.numpy()


text2speech_runner = bentoml.Runner(SpeechT5Runnable, name="speecht5_runner", models=[processor_ref, model_ref, vocoder_ref])
svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
async def generate_speech(inp: str):
    return await text2speech_runner.generate_speech.async_run(inp)

Built-in Pipelines#

Transformers pipelines are a high-level API for performing common natural language processing tasks using pre-trained transformer models. See the Transformers Pipelines tutorial to learn more.
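
For reference, a built-in pipeline can be created and called in a few lines. A minimal sketch using the sentiment-analysis task with its default checkpoint (the input sentence and output scores are only illustrative):

from transformers import pipeline

# "sentiment-analysis" is a built-in task; omitting the model uses its default checkpoint
classifier = pipeline("sentiment-analysis")
print(classifier("BentoML makes model serving easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]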

Saving a Pipeline#

To save a Transformers pipeline, first create a Pipeline object using the desired model and other pre-trained instances, and then save it to the model store using the bentoml.transformers.save_model API. Transformers pipelines are callable objects, so the model signature is automatically saved as __call__ by default.

train.py#
import bentoml
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Any fill-mask checkpoint works here; distilbert-base-uncased is used as an example
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

bentoml.transformers.save_model(name="unmasker", pipeline=unmasker)

To load the pipeline for testing and debugging, use bentoml.transformers.load_model with the unmasker:latest tag.
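
A minimal sketch of such a check (the input sentence is just an example):

import bentoml

# Load the saved pipeline by tag and run it locally
unmasker = bentoml.transformers.load_model("unmasker:latest")
print(unmasker("The goal of life is [MASK]."))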

Serving a Pipeline#

See also

See Building a Service to learn more on creating a prediction service with BentoML.

To serve a Transformers pipeline, first get the pipeline reference using the bentoml.transformers.get API and convert it to a runner using the to_runner method.

service.py#
import bentoml

from bentoml.io import Text, JSON

runner = bentoml.transformers.get("unmasker:latest").to_runner()

svc = bentoml.Service("unmasker_service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def unmask(input_series: str) -> list:
    return await runner.async_run(input_series)

Custom Pipelines#

Transformers custom pipelines allow users to define their own pre and post-process logic and customize how input data is forwarded to the model for inference.

See also

How to add a pipeline from Hugging Face to learn more.

train.py#
from transformers import Pipeline

class MyClassificationPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, text, maybe_arg=2):
        input_ids = self.tokenizer(text, return_tensors="pt")
        return input_ids

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return outputs

    def postprocess(self, model_outputs):
        return model_outputs["logits"].softmax(-1).numpy()

Saving a Custom Pipeline#

A custom pipeline first needs to be added to the Transformers supported tasks, SUPPORTED_TASKS, before it can be created with the Transformers pipeline API.

train.py#
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers.pipelines import SUPPORTED_TASKS

TASK_NAME = "my-classification-task"
TASK_DEFINITION = {
    "impl": MyClassificationPipeline,
    "tf": (),
    "pt": (AutoModelForSequenceClassification,),
    "default": {},
    "type": "text",
}
SUPPORTED_TASKS[TASK_NAME] = TASK_DEFINITION

classifier = pipeline(
    task=TASK_NAME,
    model=AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
    tokenizer=AutoTokenizer.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ),
)

Once a new pipeline is added to the Transformers supported tasks, it can be saved to the BentoML model store with the additional arguments task_name and task_definition, the same values that were registered in SUPPORTED_TASKS when creating the pipeline. Both are saved as model options alongside the model.

train.py#
import bentoml

bentoml.transformers.save_model(
    "my_classification_model",
    pipeline=classifier,
    task_name=TASK_NAME,
    task_definition=TASK_DEFINITION,
)

Serving a Custom Pipeline#

To serve a custom pipeline, simply create a runner and service with the previously saved pipeline. task_name and task_definition will be automatically applied when initializing the runner.

service.py#
import bentoml

from bentoml.io import Text, JSON

runner = bentoml.transformers.get("my_classification_model:latest").to_runner()

svc = bentoml.Service("my_classification_service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def classify(input_series: str) -> list:
    return await runner.async_run(input_series)

Adaptive Batching#

If the model supports batched inference, it is recommended to enable batching to take advantage of BentoML's adaptive batching capability. To do so, override the signatures argument when saving the model to the model store, setting batchable and batch_dim for the method name (__call__).

See also

See Adaptive Batching to learn more.

train.py#
import bentoml

bentoml.transformers.save_model(
    name="unmasker",
    pipeline=unmasker,
    signatures={
        "__call__": {
            "batchable": True,
            "batch_dim": 0,
        },
    },
)
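
Once saved this way, concurrent requests to the runner can be grouped along batch_dim 0 before being passed to the pipeline, which already accepts a list of inputs. A minimal sanity-check sketch outside of serving (the inputs are just examples):

import bentoml

# The batchable pipeline accepts a list of inputs, mirroring what the runner passes
unmasker = bentoml.transformers.load_model("unmasker:latest")
results = unmasker([
    "The goal of life is [MASK].",
    "Paris is the capital of [MASK].",
])
print(len(results))  # one result set per input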