
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument #3483

Open
Looong01 opened this issue Jan 24, 2025 · 5 comments

Comments

@Looong01

Hi,

I am running vLLM on my 7900 XTX (gfx1100). I launch it with vllm serve ./qwen2-vl-instruct-pytorch-7b --dtype auto --port 8000 --limit_mm_per_prompt image=4 --max_model_len 8784 --gpu_memory_utilization 0.9

But it then fails with the following errors:

$ vllm serve ./qwen2-vl-instruct-pytorch-7b --dtype auto --port 8000 --limit_mm_per_prompt image=4 --max_model_len 8784 --gpu_memory_utilization 0.9
WARNING 01-24 14:06:31 rocm.py:31] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 01-24 14:06:32 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-24 14:06:32 api_server.py:713] args: Namespace(subparser='serve', model_tag='./qwen2-vl-instruct-pytorch-7b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='./qwen2-vl-instruct-pytorch-7b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8784, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 4}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f04d0f26f80>)
INFO 01-24 14:06:32 api_server.py:199] Started engine process with PID 91849
INFO 01-24 14:06:40 config.py:510] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 01-24 14:06:40 config.py:1338] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 01-24 14:06:44 config.py:510] This model supports multiple tasks: {'score', 'classify', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 01-24 14:06:44 config.py:1338] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 01-24 14:06:44 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='./qwen2-vl-instruct-pytorch-7b', speculative_config=None, tokenizer='./qwen2-vl-instruct-pytorch-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8784, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./qwen2-vl-instruct-pytorch-7b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 01-24 14:06:48 selector.py:134] Using ROCmFlashAttention backend.
INFO 01-24 14:06:48 model_runner.py:1094] Starting to load model ./qwen2-vl-instruct-pytorch-7b...
WARNING 01-24 14:06:48 registry.py:307] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:07,  1.87s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:06<00:10,  3.39s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:07<00:04,  2.16s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:11<00:03,  3.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:17<00:00,  4.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:17<00:00,  3.53s/it]

INFO 01-24 14:07:06 model_runner.py:1099] Loading model weights took 15.5083 GB
WARNING 01-24 14:07:06 model_runner.py:1279] Computed max_num_seqs (min(256, 8784 // 81920)) to be less than 1. Setting it to the minimum value of 1.
Token indices sequence length is longer than the specified maximum sequence length for this model (65536 > 32768). Running this sequence through the model will result in indexing errors
WARNING 01-24 14:07:10 processing.py:878] The context length (8784) of the model is too short to hold the multi-modal embeddings in the worst case (65536 tokens in total, out of which {'image': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument

After I set export MIOPEN_ENABLE_LOGGING=1 and export MIOPEN_ENABLE_LOGGING_CMD=1, it shows:

MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x7f563b65ee90
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 5
MIOpen(HIP):    nbDims = 5
MIOpen(HIP):    dim.values = { 262144 3 2 14 14 }
MIOpen(HIP):    stride.values = { 1176 392 196 14 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0xe00000002
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 5
MIOpen(HIP):    nbDims = 5
MIOpen(HIP):    dim.values = { 1280 3 2 14 14 }
MIOpen(HIP):    stride.values = { 1176 392 196 14 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x7f563b679c6c
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 5
MIOpen(HIP):    nbDims = 5
MIOpen(HIP):    dim.values = { 262144 1280 1 1 1 }
MIOpen(HIP):    stride.values = { 1280 1 1 1 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor_t *){
MIOpen(HIP):    convDesc = 0x7fffa5ad3c80
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenInitConvolutionNdDescriptor(miopenConvolutionDescriptor_t, int, const int *, const int *, const int *, miopenConvolutionMode_t){
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    spatialDim = 3
MIOpen(HIP):    pads = { 0 0 0 }
MIOpen(HIP):    strides = { 2 14 14 }
MIOpen(HIP):    dilations = { 1 1 1 }
MIOpen(HIP):    c_mode = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionGroupCount(miopenConvolutionDescriptor_t, int){
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    groupCount = 1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionAttribute(miopenConvolutionDescriptor_t, const miopenConvolutionAttrib_t, const int){
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    attr = 1
MIOpen(HIP):    value = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){
MIOpen(HIP):    handle = stream: 0, device_id: 0
MIOpen(HIP):    wDesc = {1280, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    xDesc = {262144, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    yDesc = {262144, 1280, 1, 1, 1}, {1280, 1, 1, 1, 1}, packed,
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, void *, const int, int *, miopenConvAlgoPerf_t *, void *, size_t, bool){
MIOpen(HIP):    handle = stream: 0, device_id: 0
MIOpen(HIP):    xDesc = {262144, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    x = 0x7f4e03600000
MIOpen(HIP):    wDesc = {1280, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    w = 0x7f52cb400000
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    yDesc = {262144, 1280, 1, 1, 1}, {1280, 1, 1, 1, 1}, packed,
MIOpen(HIP):    y = 0x7f4ddb400000
MIOpen(HIP):    requestAlgoCount = 1
MIOpen(HIP):    returnedAlgoCount = 32767
MIOpen(HIP):    perfResults =
MIOpen(HIP):    workSpace = 0x7f52c51ad600
MIOpen(HIP):    workSpaceSize = 2352
MIOpen(HIP):    exhaustiveSearch = 0
MIOpen(HIP): }
MIOpen(HIP): Command [LogCmdFindConvolution] ./bin/MIOpenDriver convbfp16 -n 262144 -c 3 --in_d 2 -H 14 -W 14 -k 1280 --fil_d 2 -y 14 -x 14 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 14 -v 14 --dilation_d 1 -l 1 -j 1 --spatial_dim 3 -m conv -g 1 -F 1 -t 1
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument
MIOpen(HIP): auto miopen::solver::conv::GemmFwdRest::GetSolution(const ExecutionContext &, const ProblemDescription &)::(anonymous class)::operator()(const std::vector<Kernel> &)::(anonymous class)::operator()(const Handle &, const AnyInvokeParams &) const{
MIOpen(HIP):    name + ", non 1x1" = convolution, non 1x1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }

It then keeps printing MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){ MIOpen(HIP): "rocBLAS" = rocBLAS MIOpen(HIP): } without stopping, and GPU utilization stays steadily at 95%.
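For what it's worth, the failing call (judging by the MIOpenDriver command and tensor descriptors logged above) is a single bf16 3-D convolution over 262144 patches with kernel and stride (2, 14, 14), which I believe corresponds to the vision patch-embedding Conv3d in Qwen2-VL. A minimal standalone sketch of that call, reconstructed purely from the shapes in the log (so the exact module and bias setting are my assumption, not confirmed), would be:

import torch
import torch.nn as nn

# Shapes taken from the MIOpen log above:
#   input  {262144, 3, 2, 14, 14}, weight {1280, 3, 2, 14, 14},
#   stride {2, 14, 14}, zero padding, dilation 1, bfloat16.
conv = nn.Conv3d(
    in_channels=3,
    out_channels=1280,
    kernel_size=(2, 14, 14),
    stride=(2, 14, 14),
    bias=False,  # assumption; no bias appears in the logged conv call
).to("cuda", dtype=torch.bfloat16)

# 262144 flattened image patches in one batch, as in the failing launch
x = torch.randn(262144, 3, 2, 14, 14, device="cuda", dtype=torch.bfloat16)

y = conv(x)      # on ROCm this should route through MIOpen's conv3d path
print(y.shape)   # expected: torch.Size([262144, 1280, 1, 1, 1])

If this snippet reproduces the "invalid configuration argument" error outside vLLM, that would suggest MIOpen is rejecting the kernel launch for this very large batch dimension rather than anything vLLM-specific, though that is only my guess.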

@Looong01
Author

========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  Node  IDs              Temp    Power   Partitions          SCLK     MCLK   Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)   (Mem, Compute, ID)
====================================================================================================================
0       1     0x744c,   33510  40.0°C  182.0W  N/A, N/A, 0         3119Mhz  96Mhz  0%   auto  327.0W  81%    96%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================

@Looong01
Author

$ rocminfo
ROCk module version 6.8.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  Uuid:                    CPU-XX
  Marketing Name:          Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4700
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            8
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    32799352(0x1f47a78) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32799352(0x1f47a78) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32799352(0x1f47a78) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-85631fd855c9cea1
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2482
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 342
  SDMA engine uCode::      21
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

@ppanchad-amd

Hi @Looong01. Internal ticket has been created to investigate your issue. Thanks!

@huanrwan-amd
Contributor

huanrwan-amd commented Jan 24, 2025

Hi @Looong01, thank you for posting the issue. Could you please provide the following additional info?

  • The ROCm version (from amd-smi)
  • The vLLM version info: which branch of https://github.com/ROCm/vllm?
  • The download link for the Qwen model you used (./qwen2-vl-instruct-pytorch-7b)

Thanks.

@Looong01
Author

Looong01 commented Jan 24, 2025

Hi @Looong01, thank you for posting the issue. Could you please provide the following additional info?

  • The ROCm version (from amd-smi)
  • The vLLM version info: which branch of https://github.com/ROCm/vllm?
  • The download link for the Qwen model you used (./qwen2-vl-instruct-pytorch-7b)

Thanks.

$ sudo amd-smi
usage: amd-smi [-h]  ...

AMD System Management Interface | Version: 24.6.3+9578815 | ROCm version: 6.2.4 |
Platform: Linux Baremetal
  1. vLLM: the latest version, from the main branch.

  2. Model: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
