
Enable qwen2vl video #2756

Open: drbh wants to merge 54 commits into main from enable-qwen2vl-video

Conversation

drbh (Collaborator) commented Nov 18, 2024

This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.

TODOS

  • support video_urls
  • fetch video contents in router
  • update protobufs to support video chunks
  • handle padding video token inputs
  • tokenize video bytes
  • integrate video logic with vision model (update position ids)
  • ensure tokenization process is correct
  • add tests
  • refactor/improve

Update:

Start the server:

text-generation-launcher \
--model-id Qwen/Qwen2-VL-7B-Instruct \
--max-batch-prefill-tokens 10000 \
--max-input-tokens 10000 \
--max-total-tokens 10001

Send a request:

import requests
import json

def chat_completion(url="http://127.0.0.1:3000", video_url=None, prompt=None):
    messages = [{
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": { 
                    "url": video_url
                }
            },
            {
                "type": "text",
                "text": prompt
            }
        ]
    }]

    payload = {
        "messages": messages,
        "seed": 42,
        "max_tokens": 30
    }

    response = requests.post(
        f"{url}/v1/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"}
    )

    return response.json()

video_url = "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
result = chat_completion(
    video_url=video_url,
    prompt="Describe this video."
)
print(json.dumps(result, indent=2))
# {
#     "object": "chat.completion",
#     "id": "",
#     "created": 1731964042,
#     "model": "Qwen/Qwen2-VL-7B-Instruct",
#     "system_fingerprint": "2.4.1-dev0-native",
#     "choices": [
#         {
#             "index": 0,
#             "message": {
#                 "role": "assistant",
#                 "content": "The video showcases lush green trees with vibrant shades of green and various shades of yellow and brown, as well as moss-covered stumps and piles of moss",
#             },
#             "logprobs": null,
#             "finish_reason": "length",
#         }
#     ],
#     "usage": {"prompt_tokens": 9593, "completion_tokens": 30, "total_tokens": 9623},
# }

drbh force-pushed the enable-qwen2vl-video branch from b780f00 to 6b4697e on November 18, 2024 18:03
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

drbh force-pushed the enable-qwen2vl-video branch from b9707b9 to 32438fc on November 25, 2024 21:43
drbh force-pushed the enable-qwen2vl-video branch from 2ef3038 to 17b27d4 on December 3, 2024 00:54
drbh force-pushed the enable-qwen2vl-video branch 4 times, most recently from 4e921bf to 93a2413 on December 18, 2024 01:41
drbh force-pushed the enable-qwen2vl-video branch from 93a2413 to dcc1194 on December 23, 2024 18:47
drbh marked this pull request as ready for review on January 3, 2025 15:49
Narsil (Collaborator) commented Jan 14, 2025

It still doesn't work @drbh

Comment on lines 244 to 256
frames = []
for i in range(chunk.video.frames):
    frame = video_frame_buf[
        i * bytes_per_frame : (i + 1) * bytes_per_frame
    ]
    frame = frame.reshape(
        chunk.video.height, chunk.video.width, 3
    )
    frames.append(frame)

video_frame_buf = np.stack(frames)
frame_nchw_tensor = torch.from_numpy(video_frame_buf).permute(
    0, 3, 1, 2
Collaborator:

This can and should be done in one go, without per-frame reallocation.

Collaborator (Author):

Good point: the latest commit avoids the loop and uses a vectorized numpy/torch approach instead.
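
For context, a minimal sketch of what the vectorized path could look like (variable and field names follow the hunk above; the actual commit may differ):

import numpy as np
import torch

def frames_to_nchw(video_frame_buf: np.ndarray, frames: int, height: int, width: int) -> torch.Tensor:
    # Reinterpret the flat RGB byte buffer as (frames, H, W, 3) in a single
    # reshape instead of slicing per frame, then permute to NCHW once.
    video = video_frame_buf.reshape(frames, height, width, 3)
    return torch.from_numpy(video).permute(0, 3, 1, 2)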

@@ -212,9 +291,10 @@ def batch_tokenized_inputs(
processor, image_inputs, config, image_id
)
image_id += 1
elif chunk_type == "video":
full_text += video_text_replacement(processor, video_inputs, config)
Collaborator:

Add an else branch that raises an error here (we're starting to have many types of chunks).

Collaborator (Author):

good catch, updated
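
For reference, a small sketch of the dispatch with an explicit failure branch (the helper name and placeholder strings are illustrative, not the server's actual API):

def chunk_replacement(chunk_type: str, text: str) -> str:
    # Illustrative only: return the prompt text contributed by one chunk and
    # fail loudly on chunk types we do not know how to tokenize.
    if chunk_type == "text":
        return text
    elif chunk_type == "image":
        return "<image placeholder>"  # image_text_replacement(...) in the server
    elif chunk_type == "video":
        return "<video placeholder>"  # video_text_replacement(...) in the server
    else:
        raise RuntimeError(f"Invalid chunk type {chunk_type}")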

@@ -20,7 +20,7 @@ jobs:
 - name: Install Protocol Buffers compiler
   run: |
     sudo apt-get update
-    sudo apt-get install -y protobuf-compiler libprotobuf-dev
+    sudo apt-get install -y protobuf-compiler libprotobuf-dev clang libavcodec-dev libavfilter-dev libavdevice-dev libavformat-dev libavutil-dev pkg-config
Collaborator:

No dev packages please (we shouldn't need them most likely).

Why is clang in there?

@@ -43,7 +43,9 @@ jobs:
 - name: Install
   run: |
     sudo apt update
-    sudo apt install python3.11-dev -y
+    sudo apt install python3.11-dev python3.11-venv python3-pip clang libavcodec-dev libavfilter-dev libavdevice-dev libavformat-dev libavutil-dev pkg-config -y
+    export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/lib/x86_64-linux-gnu/pkgconfig
Collaborator:

No env shenanigans.

openssl.dev
pkg-config
cargo
router
rustPlatform.bindgenHook
Collaborator:

No bindgenhook.

ffmpeg is ok, but please let's keep the diff readable (no sorting).

Comment on lines 10 to 13
max_input_length=10_000,
max_batch_prefill_tokens=10_000,
max_total_tokens=10_001,
cuda_graphs=[0],
Collaborator:

Remove all of these. We shouldn't need them.
We NEED to make sure users don't have to set anything.

Collaborator (Author):

Agreed, removed in the latest commit.

{
"type": "video_url",
"video_url": {
"url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
Collaborator:

Let's put a URL under our control, please.
Check what we have for images.

full_text += response["choices"][0]["delta"]["content"]
except json.JSONDecodeError:
pass

Narsil (Collaborator) commented Jan 15, 2025:

Where's the actual assert against the full text?
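
A sketch of what such an assertion could look like, assuming OpenAI-style SSE framing (streamed_text and EXPECTED_SNAPSHOT are placeholders, not real test code or model output):

import json
import requests

def streamed_text(url: str, payload: dict) -> str:
    # Collect the streamed "content" deltas into one string so the test can
    # compare the full generation against a pinned snapshot.
    full_text = ""
    with requests.post(f"{url}/v1/chat/completions", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            data = line[len(b"data:"):].strip()
            if data == b"[DONE]":
                break
            try:
                chunk = json.loads(data)
                full_text += chunk["choices"][0]["delta"].get("content") or ""
            except json.JSONDecodeError:
                pass
    return full_text

# assert streamed_text(base_url, payload) == EXPECTED_SNAPSHOT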

@@ -1229,6 +1230,9 @@ impl From<Message> for TextMessage {
.map(|chunk| match chunk {
MessageChunk::Text { text } => text,
MessageChunk::ImageUrl { image_url } => format!("![]({})", image_url.url),
MessageChunk::VideoUrl { video_url } => {
format!("<video>({})", video_url.url)
Collaborator:

Let's remove this altogether. We don't want the serialization shenanigans anymore.

@@ -74,3 +77,4 @@ default = ["ngrok"]
 ngrok = ["dep:ngrok"]
 google = []
 kserve = []
+video = ["ffmpeg-next", "mp4parse", "tempfile"]
Collaborator:

Why so many deps?

tokenizer_query.push_str(&inputs[start..chunk_start]);
}
let processed_video = match config {
Idefics | Mllama | Idefics2(_) | Paligemma(_) | LlavaNext(_) => {
Collaborator:

Why? They do not support video for now, so let's not do anything here.

Collaborator (Author):

Agreed, this should have been removed originally; removed in the latest commit.

@@ -645,13 +808,64 @@ fn prepare_input<T: TokenizerTrait>(
) -> Result<(tokenizers::Encoding, Vec<Chunk>), ValidationError> {
use Config::*;
static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"!\[\]\([^\)]*\)").unwrap());
// Add video regex
static VIDEO_RE: Lazy<Regex> =
Collaborator:

No regex.

Collaborator:

No regex + No string shenanigans.

}

#[cfg(feature = "video")]
pub fn fetch_video(
Collaborator:

This overall seems super convoluted.

  • Why do we need a file for something that should stay in RAM?
  • We are re-encoding the video here. That is fine if we are aggressively trimming the output video (like removing extra frames); we have to assume people are going to send 4K, 10-hour-long videos.
  • We are binding to ffmpeg, which seems to add a lot of dependency complexity. Why not just depend on the ffmpeg binary and be done with it? That seems much simpler, and we don't appear to be inspecting the contents at all.

For the tempfile, it might be the simplest option, but I feel we need to discuss options before doing it that way.
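
One possible direction, sketched in Python for readability (the router is Rust, and fetch_video_frames is a hypothetical name): shell out to the ffmpeg binary, let it fetch the URL itself, downsample aggressively, and keep the decoded frames in memory instead of re-encoding to a temp file.

import subprocess
import numpy as np

def fetch_video_frames(url: str, width: int, height: int, fps: float = 1.0, max_frames: int = 32) -> np.ndarray:
    # Ask the ffmpeg binary to fetch, decode, downsample (fps + scale) and cap
    # the frame count, emitting raw RGB24 frames on stdout; nothing is
    # re-encoded and nothing is written to disk.
    cmd = [
        "ffmpeg", "-i", url,
        "-vf", f"fps={fps},scale={width}:{height}",
        "-frames:v", str(max_frames),
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "pipe:1",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    nframes = len(out) // (width * height * 3)
    return np.frombuffer(out, dtype=np.uint8).reshape(nframes, height, width, 3)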

Comment on lines +778 to +780
let nframes = (sampled_frames).max(min_frames).min(max_frames);
let nframes = (nframes / 2.0).round() as usize * 2;
let num_tokens = nframes * height as usize * width as usize / 1541;
Collaborator:

Keep everything in usize and use regular division or div_ceil. Should be much simpler.

Where is that 1541 coming from?
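
For reference, an integer-only restatement of the calculation (shown in Python; the real code is Rust, where clamp and div_ceil would be the natural tools). The 1541 divisor is copied verbatim from the diff; its origin is the open question above.

def estimate_video_tokens(sampled_frames: int, min_frames: int, max_frames: int,
                          height: int, width: int) -> int:
    # Clamp the frame count, round it up to an even number, and estimate the
    # token budget with plain integer arithmetic.
    nframes = max(min_frames, min(sampled_frames, max_frames))
    nframes = (nframes + 1) // 2 * 2
    return nframes * height * width // 1541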

num_frames: _,
}) => {
// TODO: revisit if we should limit video support to v3 - to avoid sending very large base64 strings
let encoded = STANDARD.encode(data);
Collaborator:

Let's revisit now.

# copy the value position of the next image token from GPU<->CPU
next_image_pos = (
(input_ids[current_pos:] == self.image_token_id)
for _ in range(image_count + video_count):
Collaborator:

This looks pretty bloated.

It was like that before, but it still seems quite bloated.
There should be one replace for all images and one replace for all videos.

Not super urgent if this works (as it's limited to Qwen).

Also, it seems everything named "image" was renamed to "video"; better names can probably be found.
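
A sketch of what "one replace per modality" could look like, assuming the vision embeddings are already ordered to match the placeholder positions (the function and argument names are illustrative, not the model's actual API):

import torch

def splice_multimodal_embeds(inputs_embeds: torch.Tensor, input_ids: torch.Tensor,
                             image_token_id: int, image_embeds: torch.Tensor,
                             video_token_id: int, video_embeds: torch.Tensor) -> torch.Tensor:
    # Overwrite every image/video placeholder token with the corresponding
    # vision embeddings in two masked assignments, instead of walking the
    # token positions one by one on the CPU.
    inputs_embeds[input_ids == image_token_id] = image_embeds.to(inputs_embeds.dtype)
    inputs_embeds[input_ids == video_token_id] = video_embeds.to(inputs_embeds.dtype)
    return inputs_embeds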
