Tasks: Add image-text-to-text pipeline and inference API to task page #1039

Merged · 13 commits · Dec 12, 2024
66 changes: 42 additions & 24 deletions packages/tasks/src/tasks/image-text-to-text/about.md
@@ -32,39 +32,57 @@ Vision language models can recognize images through descriptions. When given det…

## Inference

You can use the Transformers library to interact with vision-language models. Specifically, the `pipeline` API makes it easy to run inference with them.

Initialize the pipeline first.

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
```
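
If a GPU is available, you can also place the pipeline on it and use half precision. This is a minimal sketch, assuming a CUDA device and that float16 is acceptable for your checkpoint; the task and model are the same as above:

```python
import torch
from transformers import pipeline

# Sketch: same pipeline as above, but on GPU and in half precision.
# Assumes a CUDA device is available; adjust `device` and `torch_dtype` as needed.
pipe = pipeline(
    "image-text-to-text",
    model="llava-hf/llava-interleave-qwen-0.5b-hf",
    device="cuda",
    torch_dtype=torch.float16,
)
```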

The model's built-in chat template will be used to format the conversational input. We can pass the image as a URL in the `content` part of the user message:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "There's a pink flower"},
        ],
    },
]
```

> **Review discussion**
>
> **Member:** It's a bit strange to me that the input ends with an assistant turn. I see in the example later that the model completes the sentence with more details, but I'm not sure this is compatible with all chat VLMs. Can we maybe skip the assistant role from the input and see if the model provides a good description of the image?
>
> **Member:** This has not been addressed; I think it's unusual that users supply an assistant turn with the input.
>
> **Contributor (author):** Sorry, I thought I had answered this. Basically, it's to give more control to further align the output during inference. I used the same example here, where you can see the output: https://huggingface.co/docs/transformers/en/tasks/image_text_to_text
>
> **Member:** But that example ends with a user role, while this one ends with an assistant role. I don't think models are expected to be queried with an assistant role in the last turn: they receive a conversation that always ends with a user role, and then they respond with an assistant message.
>
> **Contributor (author):** Sorry, I should have sent the particular section; I meant this one: https://huggingface.co/docs/transformers/en/tasks/image_text_to_text#pipeline
>
> **Member:** Still looks weird / confusing to me, but OK if you feel strongly about it.

We can now pass the messages directly to the pipeline to run inference. The `return_full_text` flag controls whether the full prompt, including the user input, is returned in the response. Here we pass `False` so that only the newly generated text is returned.

```python
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)

outputs[0]["generated_text"]
# with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
```
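
Ending the input with a partial assistant turn, as above, prefills the start of the model's answer so you can steer the generation (this is what the review thread above discusses). As a minimal sketch reusing the `pipe` and `messages` objects already defined, you can instead drop the assistant turn and let the model produce the whole description itself:

```python
# Sketch: keep only the user turn from `messages` so nothing is prefilled
# and the model generates the full description on its own.
user_only_messages = messages[:1]

outputs = pipe(text=user_only_messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
```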

You can also use the Inference API to test image-text-to-text models. You need a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
```bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
-X POST \
-d '{"inputs": "Can you please let us know more details about your "}' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer hf_***"
```
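
The same endpoint can also be queried from Python. This is a minimal sketch, assuming a recent `huggingface_hub` release with `InferenceClient.chat_completion` and that the model is served with chat-completion support; the message format mirrors the `pipeline` example above, with the image passed as a URL:

```python
from huggingface_hub import InferenceClient

# Sketch: call the Inference API from Python instead of curl.
# Assumes the token has inference permissions; replace "hf_***" with your own token.
client = InferenceClient("meta-llama/Llama-3.2-11B-Vision-Instruct", token="hf_***")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

response = client.chat_completion(messages, max_tokens=100)
print(response.choices[0].message.content)
```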

## Useful Resources