llm-api-benchmark-tool/production.md at 7350331b3a7fb5ac69b5931896bb86df886f72bf

mingor/llm-api-benchmark-tool

Fork 0

FeynmanLoo 7350331b3a Add vendor dependencies and update project structure

2025-04-23 08:05:46 +08:00

9.4 KiB

Raw Blame History

技术选型

语言：Go。
HTTP客户端：
- 抽象接口：HTTPClient。
- 实现：
  - fasthttp（github.com/valyala/fasthttp）：高性能，优先选择。
  - net/http（标准库）：兼容性强，备选。
Tokenizer 库：github.com/tiktoken-go/tokenizer。
并发管理：goroutine池、sync.Pool、context.Context。
数据处理：sync.Map存储响应数据，gonum/stat计算百分位数。
图表生成：go-echarts。

产品需求设计

功能需求

API兼容性：
- 支持OpenAI Compatible风格的LLM API（包括stream=true的SSE流式响应）。
- 兼容推理模型（如DeepSeek-R1）的响应结构，包含reasoning_content字段（与content同级）。
HTTP客户端抽象层：
- 定义HTTPClient接口，包含以下方法：
  - Do(req *Request) (*Response, error)：发送非流式请求。
  - Stream(req *Request, callback func(chunk SSEChunk) error) error：处理流式响应，逐块回调SSE数据。
- 实现：
  - fasthttp：基于fasthttp.RequestCtx和Response.BodyStream。
  - net/http：基于http.Client和Response.Body。
- 配置项client: "fasthttp"或client: "net-http"切换客户端。
提示词生成：
- 使用tiktoken-go/tokenizer动态生成短咨询提示词（50 tokens ±5%）和长文档提示词（1000 tokens ±5%）。
并发请求：
- 支持最大500并发，持续5分钟（300秒）或更长时间。
- 使用goroutine池（最大600 goroutine），sync.Pool重用请求对象。
- 通过context.Context支持任务终止。
场景建模：
- 混合负载：短咨询（50 tokens，70%比例），长文档生成（1000 tokens，30%比例）。
用户行为模拟：请求间隔遵循泊松分布（math/rand实现）。
梯度增压：支持阶梯式加压（50→200→500并发），每阶段持续300秒。
性能指标收集：
- 请求统计：总请求数、成功请求数、失败请求数、超时比率（目标：<1%，默认超时30-60秒）。
- 响应时间：平均、最小/最大、P90/P95/P99（目标：<2秒至<10秒，<3秒至<15秒，<5秒至<20秒）。
- TTFT：
  - 最小/最大TTFT，P90/P95/P99 TTFT。
  - 通过SSE流式响应测量，精度达毫秒，兼容content和reasoning_content。
- QPS：平均QPS（目标：50-200+），最大QPS。
- Token生成速率：平均速率（目标：>100 tokens/秒），最大速率，兼容content和reasoning_content。
- 最大有效并发用户数：记录仍有请求成功的最大并发数（目标：50-500+）。
报告生成：
- 生成质量报告，包含所有指标统计。
- 提供图表（如响应时间分布、TTFT分布、QPS曲线）。

非功能需求

高性能：
- 支持500并发，持续5分钟以上，fasthttp优化吞吐量，goroutine池降低GC压力。
统计准确性：
- TTFT和响应时间测量精度达毫秒，使用time.Now().UnixNano()。
- 使用sync.Mutex或sync/atomic确保并发统计线程安全。
可扩展性：支持新服务提供商、提示词模板、响应字段。
易用性：配置文件指定API、提示词、超时、客户端类型（fasthttp或net-http）。
稳定性：长时间运行无崩溃、死锁或内存泄漏。
可观测性：
- 使用log包记录关键事件和错误，通过--debug参数，开启调试日志，记录详细信息。
- 使用runtime包监控CPU、内存、goroutine数。
- 使用sync.Map存储响应数据，gonum/stat计算百分位数。

技术实现建议

HTTP客户端抽象层：

定义HTTPClient接口：

type HTTPClient interface {
    Do(req *Request) (*Response, error)
    Stream(req *Request, callback func(chunk SSEChunk) error) error
}
type Request struct {
    Method  string
    URL     string
    Body    []byte
    Headers map[string]string
}
type Response struct {
    StatusCode int
    Body       []byte
    Headers    map[string]string
}
type SSEChunk struct {
    Data       []byte // JSON数据
    Timestamp  int64  // 接收时间戳（UnixNano）
    IsDone     bool   // 是否为[data: [DONE]]
}

fasthttp实现：
- Do：使用fasthttp.Do，构造OpenAI Compatible JSON请求，解析响应。
- Stream：使用fasthttp.Response.BodyStream，逐行解析SSE（data: {...}），调用回调函数传递SSEChunk。
- 使用sync.Pool重用fasthttp.Request和fasthttp.Response。
net/http实现：
- Do：使用http.Client.Do，构造请求，解析响应。
- Stream：使用bufio.Scanner读取http.Response.Body，解析SSE，调用回调函数。

切换逻辑：

解析配置项client，实例化对应客户端：

func NewHTTPClient(clientType string) (HTTPClient, error) {
    switch clientType {
    case "fasthttp":
        return &FastHTTPClient{}, nil
    case "net-http":
        return &NetHTTPClient{}, nil
    default:
        return nil, fmt.Errorf("unsupported client: %s", clientType)
    }
}

SSE与TTFT处理：
- Stream方法逐块解析SSE，提取choices[0].delta.content或choices[0].delta.reasoning_content。
- 记录请求发送时间（t0 = time.Now().UnixNano()）。
- 检测首个非空Token块，记录时间（t1），TTFT = (t1 - t0) / 1e6（毫秒）。
- 兼容推理模型：优先检查reasoning_content，若为空则使用content。
Go性能优化：
- Goroutine池：使用ants库或自定义池，限制最大600 goroutine。
- 对象池：sync.Pool重用请求/响应对象。
- 上下文控制：context.WithTimeout设置压测时长（默认300秒）。
- 资源监控：使用runtime包记录CPU、内存、goroutine数。
统计准确性：
- 使用sync/atomic累加请求计数。
- 使用sync.Mutex保护响应时间和TTFT切片。
- 使用gonum/stat计算P90/P95/P99。
提示词生成：tiktoken-go/tokenizer动态调整模板至50或1000 tokens。
报告生成：go-echarts生成图表，模板引擎生成HTML/PDF。

压测流程

配置加载：读取API端点、客户端类型（fasthttp或net-http）、并发数、提示词模板等。
初始化HTTP客户端：根据配置项client实例化fasthttp或net/http客户端。
提示词生成：使用tiktoken-go/tokenizer生成短咨询（50 tokens）和长文档（1000 tokens）提示词。
并发执行：
- 初始化goroutine池（最大600）。
- 使用HTTPClient.Stream发送请求（stream=true），间隔符合泊松分布。
- 处理SSE流，记录TTFT。
响应处理：
- 解析content和reasoning_content，计算Token数。
- 记录响应时间、TTFT、成功/失败状态。
指标计算：
- 计算请求统计、响应时间、TTFT、QPS、Token生成速率、最大并发用户数。
报告生成：输出统计数据和图表。

配置示例与报告内容

以下是更新后的配置和报告示例，包含HTTP客户端抽象层和切换逻辑：

# 配置示例
api:
  endpoint: "https://api.example.com/v1/completions"
  api_key: "your_api_key"
  model: "deepseek-r1"  # 支持推理模型
  streaming: true  # 启用流式响应
  client: "fasthttp"  # HTTP客户端：fasthttp 或 net-http

prompt_templates:
  short:
    target_tokens: 50
    templates:
      - "What is the capital of {country}?"
      - "Please briefly explain the concept of {concept}."
      - "Summarize the main idea of {topic} in one sentence."
  long:
    target_tokens: 1000
    templates:
      - "Write a detailed history of {country} covering major events."
      - "Compose an in-depth analysis of {topic} with examples."
      - "Generate a comprehensive report on the impact of {event}."

requests:
  - type: "short"
    weight: 0.7
  - type: "long"
    weight: 0.3

concurrency:
  steps: [50, 200, 500]
  duration_per_step: 300  # 秒，5分钟
  max_goroutines: 600  # 最大goroutine数

timeout: 60  # 秒
poisson_lambda: 1.0  # 请求间隔泊松分布参数

tokenizer:
  model: "gpt-3.5-turbo"  # 用于tiktoken-go的分词器模型

# 报告内容示例
report:
  overview:
    api_endpoint: "https://api.example.com/v1/completions"
    model: "deepseek-r1"
    concurrency_steps: [50, 200, 500]
    duration_per_step: "300s"
    request_mix: "70% short (50 tokens), 30% long (1000 tokens)"
    tokenizer: "tiktoken-go (gpt-3.5-turbo)"
    streaming_enabled: true
    http_client: "fasthttp"
  metrics:
    total_requests: 15000
    successful_requests: 14850
    failed_requests: 150
    timeout_ratio: "1.0%"
    response_time:
      avg: "1.8s"
      min: "0.5s"
      max: "10.2s"
      p90: "2.5s"
      p95: "3.0s"
      p99: "4.5s"
    ttft:
      min: "0.1s"
      max: "2.0s"
      p90: "0.3s"
      p95: "0.4s"
      p99: "0.6s"
    qps:
      avg: 180
      max: 220
    token_rate:
      avg: "120 tokens/s"
      max: "150 tokens/s"
    max_concurrent_users: 500
    resource_usage:
      avg_cpu: "60%"
      max_memory: "1.1GB"
      max_goroutines: 600
  charts:
    - "response_time_distribution.png"
    - "ttft_distribution.png"
    - "qps_over_time.png"
    - "token_rate_over_time.png"
    - "concurrency_vs_response_time.png"

9.4 KiB Raw Blame History Unescape Escape

技术选型

产品需求设计

功能需求

非功能需求

技术实现建议

压测流程

配置示例与报告内容

9.4 KiB

Raw Blame History