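Multimodal Model Launch Configuration

LightLLM supports inference for a range of multimodal models. This page uses InternVL as the example to explain the launch command for a multimodal service.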

Basic Launch Command

INTERNVL_IMAGE_LENGTH=256 \
LOADWORKER=12 \
python -m lightllm.server.api_server \
--port 8080 \
--tp 2 \
--model_dir ${MODEL_PATH} \
--mem_fraction 0.8 \
--trust_remote_code \
--enable_multimodal

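Core Parameters

Environment Variables

INTERNVL_IMAGE_LENGTH: sets the image token length for the InternVL model; the default is 256
LOADWORKER: sets the number of worker processes used to load the model

Basic Service Parameters

--port 8080: port the API server listens on
--tp 2: tensor parallelism degree
--model_dir: path to the InternVL model files
--mem_fraction 0.8: fraction of GPU memory to use
--trust_remote_code: allows loading custom model code
--enable_multimodal: enables multimodal support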
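
The launch command can also be assembled from a script, for example when the server is started from a test harness. The following is a minimal sketch under stated assumptions: `build_launch_command` is a hypothetical helper that is not part of LightLLM, it uses only the flags and environment variables listed above, and the model path passed to it is a placeholder.

```python
import os


def build_launch_command(model_dir, port=8080, tp=2, mem_fraction=0.8):
    """Return (argv, env) mirroring the basic multimodal launch command.

    Hypothetical helper for illustration; it only reuses the documented flags.
    """
    argv = [
        "python", "-m", "lightllm.server.api_server",
        "--port", str(port),
        "--tp", str(tp),
        "--model_dir", model_dir,
        "--mem_fraction", str(mem_fraction),
        "--trust_remote_code",
        "--enable_multimodal",
    ]
    # Copy the current environment and add the two documented variables.
    env = dict(os.environ)
    env["INTERNVL_IMAGE_LENGTH"] = "256"
    env["LOADWORKER"] = "12"
    return argv, env
```

The returned pair can be handed to `subprocess.Popen(argv, env=env)` to start the server.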

Advanced Configuration Parameters

--visual_infer_batch_size 2 \
--cache_capacity 500 \
--visual_dp dp_size \
--visual_tp tp_size

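--visual_infer_batch_size 2: batch size for visual inference
--cache_capacity 500: capacity of the image embedding cache
--visual_dp 2: data parallelism degree of the visual model
--visual_tp 2: tensor parallelism degree of the visual model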

Note

To keep the memory load equal on every GPU, visual_dp * visual_tp must equal tp. For example, with tp=2, use visual_dp=1 and visual_tp=2.
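
The constraint in the note can be checked before launching. The sketch below enumerates every `(visual_dp, visual_tp)` pair that satisfies it for a given `tp`; both function names are hypothetical and are not LightLLM APIs.

```python
def valid_visual_parallel_configs(tp):
    """List every (visual_dp, visual_tp) pair with visual_dp * visual_tp == tp."""
    if tp < 1:
        raise ValueError("tp must be a positive integer")
    return [(dp, tp // dp) for dp in range(1, tp + 1) if tp % dp == 0]


def check_visual_parallel(tp, visual_dp, visual_tp):
    """Raise ValueError if the combination violates visual_dp * visual_tp == tp."""
    if visual_dp * visual_tp != tp:
        raise ValueError(
            f"visual_dp * visual_tp = {visual_dp * visual_tp}, expected tp = {tp}"
        )


print(valid_visual_parallel_configs(2))  # [(1, 2), (2, 1)]
```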

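ViT Deployment Modes

ViT TP (Tensor Parallelism)

Used by default
--visual_tp tp_size enables tensor parallelism

ViT DP (Data Parallelism)

Distributes different image batches across multiple GPUs
Each GPU runs a full replica of the ViT model
--visual_dp dp_size enables data parallelism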
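
To illustrate the data-parallel idea, the sketch below splits a list of images across `visual_dp` replicas in round-robin order. This is an assumption made for illustration only: the actual scheduling policy used by LightLLM is not described here and may differ, and `split_images_across_replicas` is a hypothetical name.

```python
def split_images_across_replicas(images, visual_dp):
    """Assign images to visual_dp replicas in round-robin order.

    Illustrative only; not LightLLM's actual scheduling code.
    """
    if visual_dp < 1:
        raise ValueError("visual_dp must be a positive integer")
    shards = [[] for _ in range(visual_dp)]
    for index, image in enumerate(images):
        # Image i goes to replica i mod visual_dp, so shard sizes differ by at most one.
        shards[index % visual_dp].append(image)
    return shards
```

Each shard would then be encoded by the full ViT replica on its own GPU.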

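Image Cache Mechanism

LightLLM caches the embeddings of input images. In a multi-turn conversation, if the same image appears again, the cached embeddings are reused directly and the image is not encoded a second time.

--cache_capacity: controls how many image embeddings are cached
Images are matched by their MD5 hash
Entries are evicted with an LRU (least recently used) policy
A cache hit skips ViT inference for that image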
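
The behavior described above (MD5 keys, LRU eviction, skipping the encoder on a hit) can be sketched with the standard library. This is not LightLLM's implementation: `ImageEmbedCache` and `encode_fn` are hypothetical names, and `encode_fn` stands in for the ViT forward pass.

```python
import hashlib
from collections import OrderedDict


class ImageEmbedCache:
    """Tiny LRU cache keyed by the MD5 hash of the raw image bytes."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be a positive integer")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    @staticmethod
    def key_for(image_bytes):
        return hashlib.md5(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes, encode_fn):
        key = self.key_for(image_bytes)
        if key in self._entries:
            # Cache hit: mark as most recently used and skip the encoder.
            self._entries.move_to_end(key)
            return self._entries[key]
        embed = encode_fn(image_bytes)
        self._entries[key] = embed
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return embed
```

With `capacity` playing the role of `--cache_capacity`, a repeated image reaches `encode_fn` only once until its entry is evicted.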

Testing

import json
import requests
import base64

def run(query, uris):
    images = []
    for uri in uris:
        if uri.startswith("http"):
            images.append({"type": "url", "data": uri})
        else:
            with open(uri, 'rb') as fin:
                b64 = base64.b64encode(fin.read()).decode("utf-8")
            images.append({'type': "base64", "data": b64})

    data = {
        "inputs": query,
        "parameters": {
            "max_new_tokens": 200,
            # The leading space before <|endoftext|> matters: the server
            # removes the first bos_token_id, but the Qwen tokenizer does
            # not have a bos_token_id.
            "stop_sequences": [" <|endoftext|>", " <|im_start|>", " <|im_end|>"],
        },
        "multimodal_params": {
            "images": images,
        }
    }

    url = "http://127.0.0.1:8080/generate"  # port must match --port from the launch command
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response

query = """
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<img></img>
What is this?<|im_end|>
<|im_start|>assistant
"""

response = run(
    uris = [
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
    ],
    query = query
)

if response.status_code == 200:
    print(f"Result: {response.json()}")
else:
    print(f"Error: {response.status_code}, {response.text}")
