From d3e58a14b49480d1f4fa3a8ed352cf5c58d73213 Mon Sep 17 00:00:00 2001 From: Merve Noyan Date: Mon, 18 Nov 2024 16:22:34 +0100 Subject: [PATCH 01/11] Add it2t pipeline to task page --- .../src/tasks/image-text-to-text/about.md | 56 +++++++++++-------- 1 file changed, 32 insertions(+), 24 deletions(-) diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md index f220604fc..490a321c6 100644 --- a/packages/tasks/src/tasks/image-text-to-text/about.md +++ b/packages/tasks/src/tasks/image-text-to-text/about.md @@ -32,39 +32,47 @@ Vision language models can recognize images through descriptions. When given det ## Inference -You can use the Transformers library to interact with vision-language models. You can load the model like below. +You can use the Transformers library to interact with vision-language models. Specifically `pipeline` makes it easy to infer models. + +Initialize the pipeline first. ```python -from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration -import torch - -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') -processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf") -model = LlavaNextForConditionalGeneration.from_pretrained( - "llava-hf/llava-v1.6-mistral-7b-hf", - torch_dtype=torch.float16 -) -model.to(device) +from transformers import pipeline + +pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf") ``` -We can infer by passing image and text dialogues. +We will use chat templates to format the text input. We can also pass the image as URL in context part of the user role in our chat template. ```python -from PIL import Image -import requests +messages = [ + { + "role": "user", + "content": [ + { + "type": "image", + "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg", + }, + {"type": "text", "text": "Describe this image."}, + ], + }, + { + "role": "assistant", + "content": [ + {"type": "text", "text": "There's a pink flower"}, + ], + }, + ] -# image of a radar chart -url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true" -image = Image.open(requests.get(url, stream=True).raw) -prompt = "[INST] \nWhat is shown in this image? [/INST]" +``` -inputs = processor(prompt, image, return_tensors="pt").to(device) -output = model.generate(**inputs, max_new_tokens=100) +We can now directly pass in the messages to pipeline to infer. `return_full_text` is a flag to include the full prompt including the user input. Here we pass as `False` to only return the generated part. + +```python +outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False) -print(processor.decode(output[0], skip_special_tokens=True)) -# The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes -# starting from the same point. This particular radar chart is showing the performance of different models or systems across various metrics. -# The axes represent different metrics or benchmarks, such as MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-V +outputs[0]["generated_text"] +# with a yellow center in the foreground. 
The flower is surrounded by red and white flowers with green stems
 ```
 
 ## Useful Resources

From 403e846a5615e625d0fa31c5e9692385395cd1f7 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Mon, 18 Nov 2024 16:25:12 +0100
Subject: [PATCH 02/11] Add inference API

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 490a321c6..a7100561d 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -75,6 +75,16 @@ outputs[0]["generated_text"]
 # with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
 ```
 
+You can also use Inference API to play with image-text-to-text models.
+
+```bash
+curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
+	-X POST \
+	-d '{"inputs": "Can you please let us know more details about your "}' \
+	-H 'Content-Type: application/json' \
+	-H "Authorization: Bearer hf_***"
+```
+
 ## Useful Resources
 
 - [Vision Language Models Explained](https://huggingface.co/blog/vlms)

From eee0f1717d27fcfc7a9965b3a10d61f169e28037 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:52:44 +0100
Subject: [PATCH 03/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index a7100561d..a8dd0497f 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -32,7 +32,7 @@
 
 ## Inference
 
-You can use the Transformers library to interact with vision-language models. Specifically `pipeline` makes it easy to infer models.
+You can use the Transformers library to interact with vision-language models. Specifically, `pipeline` makes it easy to infer models.
 
 Initialize the pipeline first.
 

From 21b3aaa95fd2b43483007b31ee3094590e58aedc Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:52:50 +0100
Subject: [PATCH 04/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index a8dd0497f..92405f072 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -66,7 +66,7 @@ messages = [
 
 ```
 
-We can now directly pass in the messages to pipeline to infer. `return_full_text` is a flag to include the full prompt including the user input. Here we pass as `False` to only return the generated part.
+We can now directly pass in the messages to the pipeline to infer. The `return_full_text` flag is used to return the full prompt in the response, including the user input. Here we pass `False` to only return the generated text.
 ```python
 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
 
 outputs[0]["generated_text"]
 # with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
 ```

From c32d9df3dbc04f393fbec86fa3ea034e5f9b2d70 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:52:56 +0100
Subject: [PATCH 05/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 92405f072..4ce652d6a 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -42,7 +42,7 @@ from transformers import pipeline
 pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
 ```
 
-We will use chat templates to format the text input. We can also pass the image as URL in context part of the user role in our chat template.
+The model's built-in chat template will be used to format the conversational input. We can pass the image as a URL in the `content` part of the user message:
 
 ```python
 messages = [

From 4577d744de5f0f481af12522d8b06a5c34dd724c Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:55:32 +0100
Subject: [PATCH 06/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 4ce652d6a..abd0673cf 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -75,7 +75,7 @@ outputs[0]["generated_text"]
 # with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
 ```
 
-You can also use Inference API to play with image-text-to-text models.
+You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
 
 ```bash
 curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \

From 82d9af6cbc188766e019ba9238e62e4d5eef361e Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 10 Dec 2024 16:35:21 +0100
Subject: [PATCH 07/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: vb
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index abd0673cf..8e58bbb16 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -32,7 +32,7 @@
 
 ## Inference
 
-You can use the Transformers library to interact with vision-language models. Specifically, `pipeline` makes it easy to infer models.
+You can use the Transformers library to interact with [vision-language models](https://huggingface.co/models?pipeline_tag=image-text-to-text&transformers). Specifically, `pipeline` makes it easy to infer models.
 
 Initialize the pipeline first.
 
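The snippets in the patches above always fetch the image from a URL. A minimal sketch of the same call with a locally loaded image, assuming the pipeline's chat format also accepts an in-memory `PIL.Image` in the `image` field (the path `bee.jpg` is a hypothetical local copy of the example photo):

```python
from PIL import Image
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

# Hypothetical local copy of the example image used throughout the patches.
image = Image.open("bee.jpg")

messages = [
    {
        "role": "user",
        "content": [
            # Assumption: a PIL image is accepted here in place of the URL string.
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)
print(outputs[0]["generated_text"])
```

Loading the image once and reusing the in-memory object avoids re-downloading it on every call.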
From 5e5131ff7868f03abcb62f3c268417ca1f0ae396 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 10 Dec 2024 16:40:25 +0100
Subject: [PATCH 08/11] Add roles to snippet

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 8e58bbb16..1fbc7ab95 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -77,11 +77,12 @@ outputs[0]["generated_text"]
 
 You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
 
+
 ```bash
 curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
 	-X POST \
-	-d '{"inputs": "Can you please let us know more details about your "}' \
-	-H 'Content-Type: application/json' \
+	-d '{"messages": [{"role": "user","content": [{"type": "image"}, {"type": "text", "text": "Can you describe the image?"}]}]}' \
+	-H "Content-Type: application/json" \
 	-H "Authorization: Bearer hf_***"
 ```

From 620db9cc1ba743d0ab33157adc2805cc3c74b9d5 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 10 Dec 2024 17:39:31 +0100
Subject: [PATCH 09/11] lint

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 1fbc7ab95..ce4d6718a 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -77,7 +77,6 @@ outputs[0]["generated_text"]
 
 You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
 
-
 ```bash
 curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
 	-X POST \

From 03149491f819162c7c32e8e68585af416689d61b Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 12 Dec 2024 16:53:41 +0100
Subject: [PATCH 10/11] Update about.md

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index ce4d6718a..fac4a861b 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -55,13 +55,7 @@ messages = [
             },
             {"type": "text", "text": "Describe this image."},
         ],
-    },
-    {
-        "role": "assistant",
-        "content": [
-            {"type": "text", "text": "There's a pink flower"},
-        ],
-    },
+    }
 ]
 
 ```

From 6a40460449ac77b32768942835b2739ad0187d6d Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 12 Dec 2024 17:06:00 +0100
Subject: [PATCH 11/11] Update about.md

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index fac4a861b..8da1621bb 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -63,10 +63,10 @@ messages = [
 We can now directly pass in the messages to the pipeline to infer. The `return_full_text` flag is used to return the full prompt in the response, including the user input. Here we pass `False` to only return the generated text.
 
 ```python
-outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
+outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)
 
 outputs[0]["generated_text"]
-# with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
+# The image captures a moment of tranquility in nature. At the center of the frame, a pink flower with a yellow center is in full bloom. The flower is surrounded by a cluster of red flowers, their vibrant color contrasting with the pink of the flower. \n\nA black and yellow bee is per
 ```
 
 You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
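The final curl request can also be reproduced from Python. Below is a minimal sketch using `huggingface_hub`'s `InferenceClient` chat-completion API, assuming the model is served with chat-completion support on the Inference API; the `hf_***` token is the same placeholder used above, and the image URL is the example photo from the pipeline snippets:

```python
from huggingface_hub import InferenceClient

# "hf_***" is the same placeholder as in the curl example: use a real Hugging Face token.
client = InferenceClient("meta-llama/Llama-3.2-11B-Vision-Instruct", token="hf_***")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            },
            {"type": "text", "text": "Can you describe the image?"},
        ],
    }
]

# Mirrors the OpenAI-style "messages" payload sent by the curl example.
response = client.chat_completion(messages, max_tokens=100)
print(response.choices[0].message.content)
```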