-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable qwen2vl video #2756
base: main
Are you sure you want to change the base?
Enable qwen2vl video #2756
Conversation
b780f00
to
6b4697e
Compare
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
b9707b9
to
32438fc
Compare
2ef3038
to
17b27d4
Compare
4e921bf
to
93a2413
Compare
93a2413
to
dcc1194
Compare
It still doesn't work @drbh |
frames = [] | ||
for i in range(chunk.video.frames): | ||
frame = video_frame_buf[ | ||
i * bytes_per_frame : (i + 1) * bytes_per_frame | ||
] | ||
frame = frame.reshape( | ||
chunk.video.height, chunk.video.width, 3 | ||
) | ||
frames.append(frame) | ||
|
||
video_frame_buf = np.stack(frames) | ||
frame_nchw_tensor = torch.from_numpy(video_frame_buf).permute( | ||
0, 3, 1, 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can and should be done without reallocating once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, the latest commit updates this to avoid the loop and prefer a vectorized method with numpy and torch.
@@ -212,9 +291,10 @@ def batch_tokenized_inputs( | |||
processor, image_inputs, config, image_id | |||
) | |||
image_id += 1 | |||
elif chunk_type == "video": | |||
full_text += video_text_replacement(processor, video_inputs, config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else: raise Error ( we're starting to have many types of chunks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, updated
@@ -20,7 +20,7 @@ jobs: | |||
- name: Install Protocol Buffers compiler | |||
run: | | |||
sudo apt-get update | |||
sudo apt-get install -y protobuf-compiler libprotobuf-dev | |||
sudo apt-get install -y protobuf-compiler libprotobuf-dev clang libavcodec-dev libavfilter-dev libavdevice-dev libavformat-dev libavutil-dev pkg-config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No dev packages please (we shouldn't need them most likely).
Why is clang in there?
@@ -43,7 +43,9 @@ jobs: | |||
- name: Install | |||
run: | | |||
sudo apt update | |||
sudo apt install python3.11-dev -y | |||
sudo apt install python3.11-dev python3.11-venv python3-pip clang libavcodec-dev libavfilter-dev libavdevice-dev libavformat-dev libavutil-dev pkg-config -y | |||
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/lib/x86_64-linux-gnu/pkgconfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not env shenanigans
openssl.dev | ||
pkg-config | ||
cargo | ||
router | ||
rustPlatform.bindgenHook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No bindgenhook.
ffmpeg is ok, but please let's keep the diff readable (no sorting).
max_input_length=10_000, | ||
max_batch_prefill_tokens=10_000, | ||
max_total_tokens=10_001, | ||
cuda_graphs=[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove all of these. We shouldn't need them.
We NEED to make sure users don't have to set anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed removed in latest commit
{ | ||
"type": "video_url", | ||
"video_url": { | ||
"url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put a URL under our control please.
Cehck what we have for images.
full_text += response["choices"][0]["delta"]["content"] | ||
except json.JSONDecodeError: | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's the actual assert again the full text ?
@@ -1229,6 +1230,9 @@ impl From<Message> for TextMessage { | |||
.map(|chunk| match chunk { | |||
MessageChunk::Text { text } => text, | |||
MessageChunk::ImageUrl { image_url } => format!("![]({})", image_url.url), | |||
MessageChunk::VideoUrl { video_url } => { | |||
format!("<video>({})", video_url.url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this altogther. We don't want the serialized shenanigan anymore.
@@ -74,3 +77,4 @@ default = ["ngrok"] | |||
ngrok = ["dep:ngrok"] | |||
google = [] | |||
kserve = [] | |||
video = ["ffmpeg-next", "mp4parse", "tempfile"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why so many deps?
router/src/validation.rs
Outdated
tokenizer_query.push_str(&inputs[start..chunk_start]); | ||
} | ||
let processed_video = match config { | ||
Idefics | Mllama | Idefics2(_) | Paligemma(_) | LlavaNext(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ? They do not support video for now. So let's not do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, this should have been removed originally. removed in latest commit
@@ -645,13 +808,64 @@ fn prepare_input<T: TokenizerTrait>( | |||
) -> Result<(tokenizers::Encoding, Vec<Chunk>), ValidationError> { | |||
use Config::*; | |||
static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"!\[\]\([^\)]*\)").unwrap()); | |||
// Add video regex | |||
static VIDEO_RE: Lazy<Regex> = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No regex + No string shenanigans.
} | ||
|
||
#[cfg(feature = "video")] | ||
pub fn fetch_video( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overall seems super convoluted.
- Why do we need a file for something that should stay in RAM ?
- We are re-encoding the video here, this is fine if we are agressively trimming the output video (like removing extra frames). We have to assume ppl are going to send 4k video of 10hour long content.
- We are binding to
ffmpeg
which seems to add a lot of dependency complexity. Why not just depend onffmpeg
binary, and be done with it ? Seems much simpler, and we're not looking at all at the contents it seems.
For the tempfile, it might be the simplest. But I feel we need to discuss options before doing it that way.
let nframes = (sampled_frames).max(min_frames).min(max_frames); | ||
let nframes = (nframes / 2.0).round() as usize * 2; | ||
let num_tokens = nframes * height as usize * width as usize / 1541; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep everything in usize and use regular division or div_ceil
. Should be much simpler.
Where is that 1514
coming from ?
router/src/validation.rs
Outdated
num_frames: _, | ||
}) => { | ||
// TODO: revisit if we should limit video support to v3 - to avoid sending very large base64 strings | ||
let encoded = STANDARD.encode(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revisit now.
# copy the value position of the next image token from GPU<->CPU | ||
next_image_pos = ( | ||
(input_ids[current_pos:] == self.image_token_id) | ||
for _ in range(image_count + video_count): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty bloated.
It was like that before but it's still seems quite bloated.
There should be 1 replace for all images, 1 replace for all videos.
Not super urgent if this works (as it's limited to qwen).
Also seems everything named image was renamed to video.. Sounds like better names can be found.
This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.
TODOS
video_url
supdate*
start server
send request