
Enable qwen2vl video #2756

Open: drbh wants to merge 54 commits into main from enable-qwen2vl-video

Conversation

drbh (Collaborator) commented Nov 18, 2024

This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.

TODOS

  • support video_urls
  • fetch video contents in router
  • update protobufs to support video chunks
  • handle padding video token inputs
  • tokenize video bytes
  • integrate video logic with vision model (update position ids)
  • ensure tokenization process is correct
  • add tests
  • refactor/improve

Update:

Start the server:

text-generation-launcher \
--model-id Qwen/Qwen2-VL-7B-Instruct \
--max-batch-prefill-tokens 10000 \
--max-input-tokens 10000 \
--max-total-tokens 10001

Send a request:

import requests
import json

def chat_completion(url="http://127.0.0.1:3000", video_url=None, prompt=None):
    messages = [{
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": { 
                    "url": video_url
                }
            },
            {
                "type": "text",
                "text": prompt
            }
        ]
    }]

    payload = {
        "messages": messages,
        "seed": 42,
        "max_tokens": 30
    }

    response = requests.post(
        f"{url}/v1/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"}
    )

    return response.json()

video_url = "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
result = chat_completion(
    video_url=video_url,
    prompt="Describe this video."
)
print(json.dumps(result, indent=2))
# {
#     "object": "chat.completion",
#     "id": "",
#     "created": 1731964042,
#     "model": "Qwen/Qwen2-VL-7B-Instruct",
#     "system_fingerprint": "2.4.1-dev0-native",
#     "choices": [
#         {
#             "index": 0,
#             "message": {
#                 "role": "assistant",
#                 "content": "The video showcases lush green trees with vibrant shades of green and various shades of yellow and brown, as well as moss-covered stumps and piles of moss",
#             },
#             "logprobs": null,
#             "finish_reason": "length",
#         }
#     ],
#     "usage": {"prompt_tokens": 9593, "completion_tokens": 30, "total_tokens": 9623},
# }

drbh force-pushed the enable-qwen2vl-video branch from b780f00 to 6b4697e on November 18, 2024 18:03
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

drbh force-pushed the enable-qwen2vl-video branch from b9707b9 to 32438fc on November 25, 2024 21:43
drbh force-pushed the enable-qwen2vl-video branch from 2ef3038 to 17b27d4 on December 3, 2024 00:54
drbh force-pushed the enable-qwen2vl-video branch 4 times, most recently from 4e921bf to 93a2413 on December 18, 2024 01:41
drbh force-pushed the enable-qwen2vl-video branch from 93a2413 to dcc1194 on December 23, 2024 18:47
drbh marked this pull request as ready for review on January 3, 2025 15:49
Narsil (Collaborator) commented Jan 14, 2025

It still doesn't work @drbh

Comment on lines 244 to 256
frames = []
for i in range(chunk.video.frames):
    frame = video_frame_buf[
        i * bytes_per_frame : (i + 1) * bytes_per_frame
    ]
    frame = frame.reshape(
        chunk.video.height, chunk.video.width, 3
    )
    frames.append(frame)

video_frame_buf = np.stack(frames)
frame_nchw_tensor = torch.from_numpy(video_frame_buf).permute(
    0, 3, 1, 2
Collaborator:

This can and should be done in one go, without per-frame reallocation.

Collaborator (Author):

Good point: the latest commit avoids the loop and uses a vectorized numpy/torch approach instead.
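
For context, a minimal sketch of what the vectorized path could look like (variable and field names follow the hunk above; the actual commit may differ):

import numpy as np
import torch

def frames_to_nchw(video_frame_buf: np.ndarray, frames: int, height: int, width: int) -> torch.Tensor:
    # Reinterpret the flat RGB byte buffer as (frames, H, W, 3) in a single
    # reshape instead of slicing per frame, then permute to NCHW once.
    video = video_frame_buf.reshape(frames, height, width, 3)
    return torch.from_numpy(video).permute(0, 3, 1, 2)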

@@ -212,9 +291,10 @@ def batch_tokenized_inputs(
processor, image_inputs, config, image_id
)
image_id += 1
elif chunk_type == "video":
full_text += video_text_replacement(processor, video_inputs, config)
Collaborator:

Add an else branch that raises an error here (we're starting to have many types of chunks).

Collaborator (Author):

good catch, updated
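
For reference, a small sketch of the dispatch with an explicit failure branch (the helper name and placeholder strings are illustrative, not the server's actual API):

def chunk_replacement(chunk_type: str, text: str) -> str:
    # Illustrative only: return the prompt text contributed by one chunk and
    # fail loudly on chunk types we do not know how to tokenize.
    if chunk_type == "text":
        return text
    elif chunk_type == "image":
        return "<image placeholder>"  # image_text_replacement(...) in the server
    elif chunk_type == "video":
        return "<video placeholder>"  # video_text_replacement(...) in the server
    else:
        raise RuntimeError(f"Invalid chunk type {chunk_type}")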

@@ -20,7 +20,7 @@ jobs:
 - name: Install Protocol Buffers compiler
   run: |
     sudo apt-get update
-    sudo apt-get install -y protobuf-compiler libprotobuf-dev
+    sudo apt-get install -y protobuf-compiler libprotobuf-dev clang libavcodec-dev libavfilter-dev libavdevice-dev libavformat-dev libavutil-dev pkg-config
Collaborator:

No dev packages please (we shouldn't need them most likely).

Why is clang in there?

@@ -43,7 +43,9 @@ jobs:
 - name: Install
   run: |
     sudo apt update
-    sudo apt install python3.11-dev -y
+    sudo apt install python3.11-dev python3.11-venv python3-pip clang libavcodec-dev libavfilter-dev libavdevice-dev libavformat-dev libavutil-dev pkg-config -y
+    export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/lib/x86_64-linux-gnu/pkgconfig
Collaborator:

No env shenanigans.

openssl.dev
pkg-config
cargo
router
rustPlatform.bindgenHook
Collaborator:

No bindgenhook.

ffmpeg is ok, but please let's keep the diff readable (no sorting).

Comment on lines 10 to 13
max_input_length=10_000,
max_batch_prefill_tokens=10_000,
max_total_tokens=10_001,
cuda_graphs=[0],
Collaborator:

Remove all of these. We shouldn't need them.
We NEED to make sure users don't have to set anything.

Collaborator (Author):

Agreed, removed in the latest commit.

{
"type": "video_url",
"video_url": {
"url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
Collaborator:

Let's put a URL under our control, please.
Check what we have for images.

full_text += response["choices"][0]["delta"]["content"]
except json.JSONDecodeError:
pass

Narsil (Collaborator) commented Jan 15, 2025:

Where's the actual assert against the full text?
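
A sketch of what such an assertion could look like, assuming OpenAI-style SSE framing (streamed_text and EXPECTED_SNAPSHOT are placeholders, not real test code or model output):

import json
import requests

def streamed_text(url: str, payload: dict) -> str:
    # Collect the streamed "content" deltas into one string so the test can
    # compare the full generation against a pinned snapshot.
    full_text = ""
    with requests.post(f"{url}/v1/chat/completions", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            data = line[len(b"data:"):].strip()
            if data == b"[DONE]":
                break
            try:
                chunk = json.loads(data)
                full_text += chunk["choices"][0]["delta"].get("content") or ""
            except json.JSONDecodeError:
                pass
    return full_text

# assert streamed_text(base_url, payload) == EXPECTED_SNAPSHOT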

@@ -1229,6 +1230,9 @@ impl From<Message> for TextMessage {
.map(|chunk| match chunk {
MessageChunk::Text { text } => text,
MessageChunk::ImageUrl { image_url } => format!("![]({})", image_url.url),
MessageChunk::VideoUrl { video_url } => {
format!("<video>({})", video_url.url)
Collaborator:

Let's remove this altogether. We don't want the serialization shenanigans anymore.

@@ -74,3 +77,4 @@ default = ["ngrok"]
 ngrok = ["dep:ngrok"]
 google = []
 kserve = []
+video = ["ffmpeg-next", "mp4parse", "tempfile"]
Collaborator:

Why so many deps?

tokenizer_query.push_str(&inputs[start..chunk_start]);
}
let processed_video = match config {
Idefics | Mllama | Idefics2(_) | Paligemma(_) | LlavaNext(_) => {
Collaborator:

Why? They do not support video for now, so let's not do anything here.

Collaborator (Author):

Agreed, this should have been removed originally; removed in the latest commit.

@@ -645,13 +808,64 @@ fn prepare_input<T: TokenizerTrait>(
) -> Result<(tokenizers::Encoding, Vec<Chunk>), ValidationError> {
use Config::*;
static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"!\[\]\([^\)]*\)").unwrap());
// Add video regex
static VIDEO_RE: Lazy<Regex> =
Collaborator:

No regex.

Collaborator:

No regex + No string shenanigans.

}

#[cfg(feature = "video")]
pub fn fetch_video(
Collaborator:

This overall seems super convoluted.

  • Why do we need a file for something that should stay in RAM?
  • We are re-encoding the video here. That is fine if we are aggressively trimming the output video (like removing extra frames); we have to assume people are going to send 4K, 10-hour-long videos.
  • We are binding to ffmpeg, which seems to add a lot of dependency complexity. Why not just depend on the ffmpeg binary and be done with it? That seems much simpler, and we don't appear to be inspecting the contents at all.

For the tempfile, it might be the simplest option, but I feel we need to discuss options before doing it that way.
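
One possible direction, sketched in Python for readability (the router is Rust, and fetch_video_frames is a hypothetical name): shell out to the ffmpeg binary, let it fetch the URL itself, downsample aggressively, and keep the decoded frames in memory instead of re-encoding to a temp file.

import subprocess
import numpy as np

def fetch_video_frames(url: str, width: int, height: int, fps: float = 1.0, max_frames: int = 32) -> np.ndarray:
    # Ask the ffmpeg binary to fetch, decode, downsample (fps + scale) and cap
    # the frame count, emitting raw RGB24 frames on stdout; nothing is
    # re-encoded and nothing is written to disk.
    cmd = [
        "ffmpeg", "-i", url,
        "-vf", f"fps={fps},scale={width}:{height}",
        "-frames:v", str(max_frames),
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "pipe:1",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    nframes = len(out) // (width * height * 3)
    return np.frombuffer(out, dtype=np.uint8).reshape(nframes, height, width, 3)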

Comment on lines +778 to +780
let nframes = (sampled_frames).max(min_frames).min(max_frames);
let nframes = (nframes / 2.0).round() as usize * 2;
let num_tokens = nframes * height as usize * width as usize / 1541;
Collaborator:

Keep everything in usize and use regular division or div_ceil. Should be much simpler.

Where is that 1541 coming from?
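
For reference, an integer-only restatement of the calculation (shown in Python; the real code is Rust, where clamp and div_ceil would be the natural tools). The 1541 divisor is copied verbatim from the diff; its origin is the open question above.

def estimate_video_tokens(sampled_frames: int, min_frames: int, max_frames: int,
                          height: int, width: int) -> int:
    # Clamp the frame count, round it up to an even number, and estimate the
    # token budget with plain integer arithmetic.
    nframes = max(min_frames, min(sampled_frames, max_frames))
    nframes = (nframes + 1) // 2 * 2
    return nframes * height * width // 1541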

num_frames: _,
}) => {
// TODO: revisit if we should limit video support to v3 - to avoid sending very large base64 strings
let encoded = STANDARD.encode(data);
Collaborator:

Let's revisit now.

# copy the value position of the next image token from GPU<->CPU
next_image_pos = (
(input_ids[current_pos:] == self.image_token_id)
for _ in range(image_count + video_count):
Collaborator:

This looks pretty bloated.

It was like that before, but it still seems quite bloated.
There should be one replace for all images and one replace for all videos.

Not super urgent if this works (as it's limited to Qwen).

Also, it seems everything named "image" was renamed to "video"; better names can probably be found.
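
A sketch of what "one replace per modality" could look like, assuming the vision embeddings are already ordered to match the placeholder positions (the function and argument names are illustrative, not the model's actual API):

import torch

def splice_multimodal_embeds(inputs_embeds: torch.Tensor, input_ids: torch.Tensor,
                             image_token_id: int, image_embeds: torch.Tensor,
                             video_token_id: int, video_embeds: torch.Tensor) -> torch.Tensor:
    # Overwrite every image/video placeholder token with the corresponding
    # vision embeddings in two masked assignments, instead of walking the
    # token positions one by one on the CPU.
    inputs_embeds[input_ids == image_token_id] = image_embeds.to(inputs_embeds.dtype)
    inputs_embeds[input_ids == video_token_id] = video_embeds.to(inputs_embeds.dtype)
    return inputs_embeds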
