From d3e58a14b49480d1f4fa3a8ed352cf5c58d73213 Mon Sep 17 00:00:00 2001 From: Merve Noyan Date: Mon, 18 Nov 2024 16:22:34 +0100 Subject: [PATCH 01/11] Add it2t pipeline to task page --- .../src/tasks/image-text-to-text/about.md | 56 +++++++++++-------- 1 file changed, 32 insertions(+), 24 deletions(-) diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md index f220604fc..490a321c6 100644 --- a/packages/tasks/src/tasks/image-text-to-text/about.md +++ b/packages/tasks/src/tasks/image-text-to-text/about.md @@ -32,39 +32,47 @@ Vision language models can recognize images through descriptions. When given det ## Inference -You can use the Transformers library to interact with vision-language models. You can load the model like below. +You can use the Transformers library to interact with vision-language models. Specifically `pipeline` makes it easy to infer models. + +Initialize the pipeline first. ```python -from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration -import torch - -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') -processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf") -model = LlavaNextForConditionalGeneration.from_pretrained( - "llava-hf/llava-v1.6-mistral-7b-hf", - torch_dtype=torch.float16 -) -model.to(device) +from transformers import pipeline + +pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf") ``` -We can infer by passing image and text dialogues. +We will use chat templates to format the text input. We can also pass the image as URL in context part of the user role in our chat template. ```python -from PIL import Image -import requests +messages = [ + { + "role": "user", + "content": [ + { + "type": "image", + "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg", + }, + {"type": "text", "text": "Describe this image."}, + ], + }, + { + "role": "assistant", + "content": [ + {"type": "text", "text": "There's a pink flower"}, + ], + }, + ] -# image of a radar chart -url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true" -image = Image.open(requests.get(url, stream=True).raw) -prompt = "[INST] \nWhat is shown in this image? [/INST]" +``` -inputs = processor(prompt, image, return_tensors="pt").to(device) -output = model.generate(**inputs, max_new_tokens=100) +We can now directly pass in the messages to pipeline to infer. `return_full_text` is a flag to include the full prompt including the user input. Here we pass as `False` to only return the generated part. + +```python +outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False) -print(processor.decode(output[0], skip_special_tokens=True)) -# The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes -# starting from the same point. This particular radar chart is showing the performance of different models or systems across various metrics. -# The axes represent different metrics or benchmarks, such as MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-V +outputs[0]["generated_text"] +# with a yellow center in the foreground. 
The flower is surrounded by red and white flowers with green stems
 ```
 
 ## Useful Resources

From 403e846a5615e625d0fa31c5e9692385395cd1f7 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Mon, 18 Nov 2024 16:25:12 +0100
Subject: [PATCH 02/11] Add inference API

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 490a321c6..a7100561d 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -75,6 +75,16 @@ outputs[0]["generated_text"]
 # with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
 ```
 
+You can also use Inference API to play with image-text-to-text models.
+
+```bash
+curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
+	-X POST \
+	-d '{"inputs": "Can you please let us know more details about your "}' \
+	-H 'Content-Type: application/json' \
+	-H "Authorization: Bearer hf_***"
+```
+
 ## Useful Resources
 
 - [Vision Language Models Explained](https://huggingface.co/blog/vlms)

From eee0f1717d27fcfc7a9965b3a10d61f169e28037 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:52:44 +0100
Subject: [PATCH 03/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index a7100561d..a8dd0497f 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -32,7 +32,7 @@
 
 ## Inference
 
-You can use the Transformers library to interact with vision-language models. Specifically `pipeline` makes it easy to infer models.
+You can use the Transformers library to interact with vision-language models. Specifically, `pipeline` makes it easy to infer models.
 
 Initialize the pipeline first.
 

From 21b3aaa95fd2b43483007b31ee3094590e58aedc Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:52:50 +0100
Subject: [PATCH 04/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index a8dd0497f..92405f072 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -66,7 +66,7 @@ messages = [
 
 ```
 
-We can now directly pass in the messages to pipeline to infer. `return_full_text` is a flag to include the full prompt including the user input. Here we pass as `False` to only return the generated part.
+We can now directly pass in the messages to the pipeline to infer. The `return_full_text` flag is used to return the full prompt in the response, including the user input. Here we pass `False` to only return the generated text.
 ```python
 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
 
 outputs[0]["generated_text"]
 # with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
 ```

From c32d9df3dbc04f393fbec86fa3ea034e5f9b2d70 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:52:56 +0100
Subject: [PATCH 05/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 92405f072..4ce652d6a 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -42,7 +42,7 @@ from transformers import pipeline
 pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
 ```
 
-We will use chat templates to format the text input. We can also pass the image as URL in context part of the user role in our chat template.
+The model's built-in chat template will be used to format the conversational input. We can pass the image as a URL in the `content` part of the user message:
 
 ```python
 messages = [

From 4577d744de5f0f481af12522d8b06a5c34dd724c Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 19 Nov 2024 08:55:32 +0100
Subject: [PATCH 06/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: Pedro Cuenca
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 4ce652d6a..abd0673cf 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -75,7 +75,7 @@ outputs[0]["generated_text"]
 # with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
 ```
 
-You can also use Inference API to play with image-text-to-text models.
+You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
 
 ```bash
 curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \

From 82d9af6cbc188766e019ba9238e62e4d5eef361e Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 10 Dec 2024 16:35:21 +0100
Subject: [PATCH 07/11] Update packages/tasks/src/tasks/image-text-to-text/about.md

Co-authored-by: vb
---
 packages/tasks/src/tasks/image-text-to-text/about.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index abd0673cf..8e58bbb16 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -32,7 +32,7 @@
 
 ## Inference
 
-You can use the Transformers library to interact with vision-language models. Specifically, `pipeline` makes it easy to infer models.
+You can use the Transformers library to interact with [vision-language models](https://huggingface.co/models?pipeline_tag=image-text-to-text&transformers). Specifically, `pipeline` makes it easy to infer models.
 
 Initialize the pipeline first.
 
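The snippets in the patches above always fetch the image from a URL. A minimal sketch of the same call with a locally loaded image, assuming the pipeline's chat format also accepts an in-memory `PIL.Image` in the `image` field (the path `bee.jpg` is a hypothetical local copy of the example photo):

```python
from PIL import Image
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

# Hypothetical local copy of the example image used throughout the patches.
image = Image.open("bee.jpg")

messages = [
    {
        "role": "user",
        "content": [
            # Assumption: a PIL image is accepted here in place of the URL string.
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)
print(outputs[0]["generated_text"])
```

Loading the image once and reusing the in-memory object avoids re-downloading it on every call.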
From 5e5131ff7868f03abcb62f3c268417ca1f0ae396 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 10 Dec 2024 16:40:25 +0100
Subject: [PATCH 08/11] Add roles to snippet

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 8e58bbb16..1fbc7ab95 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -77,11 +77,12 @@ outputs[0]["generated_text"]
 
 You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
 
+
 ```bash
 curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
 	-X POST \
-	-d '{"inputs": "Can you please let us know more details about your "}' \
-	-H 'Content-Type: application/json' \
+	-d '{"messages": [{"role": "user","content": [{"type": "image"}, {"type": "text", "text": "Can you describe the image?"}]}]}' \
+	-H "Content-Type: application/json" \
 	-H "Authorization: Bearer hf_***"
 ```

From 620db9cc1ba743d0ab33157adc2805cc3c74b9d5 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 10 Dec 2024 17:39:31 +0100
Subject: [PATCH 09/11] lint

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index 1fbc7ab95..ce4d6718a 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -77,7 +77,6 @@ outputs[0]["generated_text"]
 
 You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
 
-
 ```bash
 curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct \
 	-X POST \

From 03149491f819162c7c32e8e68585af416689d61b Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 12 Dec 2024 16:53:41 +0100
Subject: [PATCH 10/11] Update about.md

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index ce4d6718a..fac4a861b 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -55,13 +55,7 @@ messages = [
             },
             {"type": "text", "text": "Describe this image."},
         ],
-    },
-    {
-        "role": "assistant",
-        "content": [
-            {"type": "text", "text": "There's a pink flower"},
-        ],
-    },
+    }
 ]
 
 ```

From 6a40460449ac77b32768942835b2739ad0187d6d Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 12 Dec 2024 17:06:00 +0100
Subject: [PATCH 11/11] Update about.md

---
 packages/tasks/src/tasks/image-text-to-text/about.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/packages/tasks/src/tasks/image-text-to-text/about.md b/packages/tasks/src/tasks/image-text-to-text/about.md
index fac4a861b..8da1621bb 100644
--- a/packages/tasks/src/tasks/image-text-to-text/about.md
+++ b/packages/tasks/src/tasks/image-text-to-text/about.md
@@ -63,10 +63,10 @@ messages = [
 We can now directly pass in the messages to the pipeline to infer. The `return_full_text` flag is used to return the full prompt in the response, including the user input. Here we pass `False` to only return the generated text.
 
 ```python
-outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
+outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)
 
 outputs[0]["generated_text"]
-# with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
+# The image captures a moment of tranquility in nature. At the center of the frame, a pink flower with a yellow center is in full bloom. The flower is surrounded by a cluster of red flowers, their vibrant color contrasting with the pink of the flower. \n\nA black and yellow bee is per
 ```
 
 You can also use the Inference API to test image-text-to-text models. You need to use a [Hugging Face token](https://huggingface.co/settings/tokens) for authentication.
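The final curl request can also be reproduced from Python. Below is a minimal sketch using `huggingface_hub`'s `InferenceClient` chat-completion API, assuming the model is served with chat-completion support on the Inference API; the `hf_***` token is the same placeholder used above, and the image URL is the example photo from the pipeline snippets:

```python
from huggingface_hub import InferenceClient

# "hf_***" is the same placeholder as in the curl example: use a real Hugging Face token.
client = InferenceClient("meta-llama/Llama-3.2-11B-Vision-Instruct", token="hf_***")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            },
            {"type": "text", "text": "Can you describe the image?"},
        ],
    }
]

# Mirrors the OpenAI-style "messages" payload sent by the curl example.
response = client.chat_completion(messages, max_tokens=100)
print(response.choices[0].message.content)
```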