Stable Diffusion in multiple languages using Hugging Face pipelines
Stable Diffusion is a latent text-to-image diffusion model: an AI image generator that creates novel, photo-realistic images from text descriptions (called “prompts”).
Since its initial release in August 2022, several model checkpoints have been published. Newer versions are trained for longer, which can improve the quality of the generated images. Hugging Face's model card is a good starting point for accessing these weights.
How to use Hugging Face pre-trained models
Hugging Face first launched as a chat platform in 2017 and has since focused on Natural Language Processing (NLP) tasks. The platform provides several resources that standardize NLP tasks and make them accessible to everyone.
Since then, Hugging Face has grown beyond the NLP space: Data Scientists can now choose from Computer Vision and Audio Classification models as well.
Hugging Face exposes its pre-trained models through pipelines, the most basic objects in the transformers library. Each pipeline wraps all the steps needed for a particular task, from pre-processing to post-processing. Here is an example for text classification:
from transformers import pipeline

# Load a default pre-trained model for the sentiment-analysis task
classifier = pipeline("sentiment-analysis")
classifier("Hugging Face is an amazing tool for Data Scientists.")
In the example above, sentiment-analysis is the particular task you want to accomplish. You can find all the different tasks available in the model hub.
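If you need more control, the same call also accepts an explicit model identifier. A minimal sketch, pinning the pipeline to distilbert-base-uncased-finetuned-sst-2-english (the default checkpoint for this task) just to make the choice visible:

from transformers import pipeline

# Pin the pipeline to a specific checkpoint instead of the task default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier("Pinning a model makes the pipeline reproducible.")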
How to build an image generation pipeline
First, you need to create an account on Hugging Face and generate an access token on your settings page. This access token is necessary to download the model checkpoints.
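The snippets below assume the diffusers and transformers libraries are installed and that your token is available to the script. A minimal sketch, assuming the token was exported in an environment variable; the name HF_TOKEN is chosen here purely for illustration:

import os

# Assumption for this example: the token was exported as HF_TOKEN
access_token = os.environ["HF_TOKEN"]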
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    use_auth_token=access_token,
)
Your access token is passed to the method above through the access_token variable. The next step is to feed a prompt to the pipeline to generate an image.
pipe = pipe.to("cuda")  # move the pipeline to the GPU for inference

# The pipeline output's .images attribute is a list of PIL images
image = pipe("An astronaut riding a horse on Mars").images[0]
That's it: you can now save the image to disk, or display it directly if you run this workflow in a Jupyter notebook.
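For example, saving the result is a single call (the filename is arbitrary):

# Write the generated PIL image to disk
image.save("astronaut_rides_horse.png")

In a notebook, simply evaluating image in a cell renders it inline.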
Adding a translation step
Hugging Face’s pipelines make our job really easy. For example, you can stack different tasks and pre-process the prompt with a translation step. This is easily done by loading a tokenizer and a translation model. Here is an example that translates Portuguese to English, inspired by @joao_gante in this Twitter post.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Portuguese-to-English translation model fine-tuned on the OPUS dataset
translation_model_id = "Narrativa/mbart-large-50-finetuned-opus-pt-en-translation"

tokenizer = AutoTokenizer.from_pretrained(
    translation_model_id,
    use_auth_token=access_token,
)
translator = AutoModelForSeq2SeqLM.from_pretrained(
    translation_model_id,
    use_auth_token=access_token,
)
Translating a prompt is then as easy as:
# Tokenize the Portuguese prompt ("An astronaut riding a horse on Mars")
pt_tokens = tokenizer("Um astronauta montando um cavalo em Marte", return_tensors="pt")

# Generate the English translation with beam search
en_tokens = translator.generate(
    **pt_tokens, max_new_tokens=100,
    num_beams=8, early_stopping=True
)

# batch_decode returns a list with one translated string per input
en_prompt = tokenizer.batch_decode(en_tokens, skip_special_tokens=True)
Feeding this prompt into the image generation pipeline yields an astronaut image equivalent to the one generated from the English prompt.
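Putting it together, the final step reuses the pipe object created earlier; note that en_prompt is a list, so we pass its first element (a minimal sketch, with an arbitrary output filename):

# en_prompt is a list containing a single translated string
image = pipe(en_prompt[0]).images[0]
image.save("astronaut_rides_horse_pt.png")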
If you want to look into a repo with everything put together, you can access it here.