Tasks: Add image-text-to-text pipeline and inference API to task page #1039

Merged 13 commits on Dec 12, 2024
66 changes: 42 additions & 24 deletions packages/tasks/src/tasks/image-text-to-text/about.md

## Inference

You can use the Transformers library to interact with vision-language models. In particular, `pipeline` makes it easy to run inference with these models.

Initialize the pipeline first.

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
```
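
If a GPU is available, you can optionally place the model on it and use half precision. This is a small optional sketch; `device` and `torch_dtype` are standard `pipeline` arguments rather than anything specific to this task.

```python
import torch
from transformers import pipeline

# Optional: use a CUDA device and fp16 when available; fall back to CPU/fp32 otherwise.
use_cuda = torch.cuda.is_available()
pipe = pipeline(
    "image-text-to-text",
    model="llava-hf/llava-interleave-qwen-0.5b-hf",
    device=0 if use_cuda else -1,
    torch_dtype=torch.float16 if use_cuda else torch.float32,
)
```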

We will use chat templates to format the text input. We can also pass the image as a URL in the content part of the user turn in our chat template.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "There's a pink flower"},
        ],
    },
]
```

Review discussion attached to the assistant turn above:

Member: It's a bit strange to me that the input ends with an assistant turn. I see in the example later that the model completes the sentence with more details, but I'm not sure this is compatible with all chat VLMs. Can we maybe skip the assistant role from the input and see if the model provides a good description of the image?

Member: This has not been addressed; I think it's unusual that users supply an assistant turn with the input.

Contributor (author): Sorry, I thought I had answered this. Basically, it's there to give more control and further align the output during inference. I used the same example here, where you can see the output: https://huggingface.co/docs/transformers/en/tasks/image_text_to_text

Member: But that example ends with a user role, while this one ends with an assistant role. I don't think models are expected to be queried with an assistant role in the last turn: they receive a conversation that always ends with a user role, and then they respond with an assistant message.

Contributor (author): Sorry, I should have linked the specific section; I meant this one: https://huggingface.co/docs/transformers/en/tasks/image_text_to_text#pipeline

Member: Still looks weird / confusing to me, but OK if you feel strongly about it.
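
If you prefer the reviewer's suggestion of ending the conversation with the user turn, a minimal variant looks like the sketch below. The name `messages_user_only` is only illustrative and is not part of the PR; the model then generates the whole description on its own.

```python
# Hypothetical variant (not part of the PR): end with the user turn and let
# the model produce the entire assistant reply itself.
messages_user_only = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

# outputs = pipe(text=messages_user_only, max_new_tokens=50)
```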

We can now pass the messages directly to the pipeline to run inference. `return_full_text` controls whether the full prompt, including the user input, is returned; here we pass `False` to get only the generated part.

```python
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)

outputs[0]["generated_text"]
# with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
```

You can also use the Inference API to experiment with image-text-to-text models.
```bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
-X POST \
-d '{"inputs": "Can you please let us know more details about your "}' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer hf_***"
```
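
The same endpoint can also be called from Python. The sketch below uses `huggingface_hub`'s `InferenceClient` with the chat completion API; the token placeholder and the exact message content format are assumptions mirroring the curl example above.

```python
from huggingface_hub import InferenceClient

# Assumes a valid Hugging Face token; the model name mirrors the curl example above.
client = InferenceClient("meta-llama/Llama-3.2-11B-Vision-Instruct", token="hf_***")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            },
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

response = client.chat_completion(messages, max_tokens=100)
print(response.choices[0].message.content)
```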

## Useful Resources