llm-api-benchmark-tool/docs/production.md

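### Technology Choices
- **Language**: Go.
- **HTTP client**:
  - Abstract interface: `HTTPClient`.
  - Implementations:
    - `fasthttp` (`github.com/valyala/fasthttp`): high performance, the preferred choice.
    - `net/http` (standard library): broadly compatible, the fallback.
- **Tokenizer library**: `github.com/tiktoken-go/tokenizer`.
- **Concurrency management**: goroutine pool, `sync.Pool`, `context.Context`.
- **Data processing**: `sync.Map` to store response data, `gonum/stat` to compute percentiles.
- **Chart generation**: `go-echarts`.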

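### Product Requirements

#### Functional Requirements
- **API compatibility**:
  - Support OpenAI-compatible LLM APIs, including SSE streaming responses with `stream=true`.
  - Handle the response structure of reasoning models (such as DeepSeek-R1), which carries a `reasoning_content` field at the same level as `content`.
- **HTTP client abstraction layer**:
  - Define an `HTTPClient` interface with the following methods:
    - `Do(req *Request) (*Response, error)`: sends a non-streaming request.
    - `Stream(req *Request, callback func(chunk SSEChunk) error) error`: handles a streaming response and invokes the callback once per SSE chunk.
  - Implementations:
    - `fasthttp`: based on `fasthttp.Client` with `StreamResponseBody` enabled, reading from `Response.BodyStream()`.
    - `net/http`: based on `http.Client` and `Response.Body`.
  - The configuration key `client: "fasthttp"` or `client: "net-http"` selects the client.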
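- **Prompt generation**:
  - Use `tiktoken-go/tokenizer` to generate short-query prompts (50 tokens ±5%) and long-document prompts (1000 tokens ±5%) dynamically.
- **Concurrent requests**:
  - Support up to 500 concurrent requests, sustained for 5 minutes (300 seconds) or longer.
  - Use a goroutine pool (at most 600 goroutines) and `sync.Pool` to reuse request objects.
  - Support task cancellation through `context.Context`.
- **Scenario modeling**:
  - Mixed load: short queries (50 tokens, 70% of requests) and long-document generation (1000 tokens, 30% of requests).
- **User behavior simulation**: request arrivals follow a Poisson process, so the intervals between requests are exponentially distributed (implemented with `math/rand`).
- **Stepped ramp-up**: support stepwise load increases (50→200→500 concurrent requests), with each stage lasting 300 seconds.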
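- **Performance metrics collection**:
  - **Request statistics**: total requests, successful requests, failed requests, and timeout ratio (target: <1%; default timeout 30-60 seconds).
  - **Response time**: average, min/max, and P90/P95/P99 (targets: P90 <2 s to <10 s, P95 <3 s to <15 s, P99 <5 s to <20 s).
  - **TTFT**:
    - Min/max TTFT and P90/P95/P99 TTFT.
    - Measured on the SSE streaming response with millisecond precision; works with both `content` and `reasoning_content`.
  - **QPS**: average QPS (target: 50-200+) and peak QPS.
  - **Token generation rate**: average rate (target: >100 tokens/second) and peak rate; works with both `content` and `reasoning_content`.
  - **Maximum effective concurrent users**: the highest concurrency level at which requests still succeed (target: 50-500+).
- **Report generation**:
  - Generate a quality report containing the statistics for all metrics.
  - Provide charts (such as response time distribution, TTFT distribution, and a QPS curve).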

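#### Non-functional Requirements
- **High performance**:
  - Sustain 500 concurrent requests for 5 minutes or longer; `fasthttp` improves throughput and the goroutine pool reduces GC pressure.
- **Statistical accuracy**:
  - Measure TTFT and response time with millisecond precision, using `time.Now().UnixNano()`.
  - Use `sync.Mutex` or `sync/atomic` to keep concurrent statistics thread-safe.
- **Extensibility**: support new service providers, prompt templates, and response fields.
- **Usability**: a configuration file specifies the API, prompts, timeout, and client type (`fasthttp` or `net-http`).
- **Stability**: run for long periods without crashes, deadlocks, or memory leaks.
- **Observability**:
  - Use the `log` package to record key events and errors; the `--debug` flag enables verbose debug logging.
  - Use the `runtime` package to monitor CPU, memory, and goroutine count.
  - Use `sync.Map` to store response data and `gonum/stat` to compute percentiles.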

#### Implementation Recommendations
- **HTTP client abstraction layer**:
  - Define the `HTTPClient` interface:
    ```go
    type HTTPClient interface {
        Do(req *Request) (*Response, error)
        Stream(req *Request, callback func(chunk SSEChunk) error) error
    }
    type Request struct {
        Method  string
        URL     string
        Body    []byte
        Headers map[string]string
    }
    type Response struct {
        StatusCode int
        Body       []byte
        Headers    map[string]string
    }
    type SSEChunk struct {
        Data      []byte // JSON payload
        Timestamp int64  // receive timestamp (UnixNano)
        IsDone    bool   // true for the terminating `data: [DONE]` event
    }
    ```
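  - **fasthttp implementation**:
    - `Do`: use `fasthttp.Do`, build the OpenAI-compatible JSON request, and parse the response.
    - `Stream`: use `fasthttp.Response.BodyStream`, parse the SSE stream line by line (`data: {...}`), and pass each `SSEChunk` to the callback.
    - Use `sync.Pool` to reuse `fasthttp.Request` and `fasthttp.Response` objects.
  - **net/http implementation**:
    - `Do`: use `http.Client.Do`, build the request, and parse the response.
    - `Stream`: read `http.Response.Body` with `bufio.Scanner`, parse the SSE stream, and invoke the callback.
  - **Switching logic**:
    - Read the `client` configuration key and instantiate the matching client: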
      ```go
      func NewHTTPClient(clientType string) (HTTPClient, error) {
          switch clientType {
          case "fasthttp":
              return &FastHTTPClient{}, nil
          case "net-http":
              return &NetHTTPClient{}, nil
          default:
              return nil, fmt.Errorf("unsupported client: %s", clientType)
          }
      }
      ```
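- **SSE and TTFT handling**:
  - The `Stream` method parses the SSE stream chunk by chunk and extracts `choices[0].delta.content` or `choices[0].delta.reasoning_content`.
  - Record the time the request is sent (`t0 = time.Now().UnixNano()`).
  - On the first chunk with a non-empty token, record the time (`t1`); TTFT = (`t1 - t0`) / 1e6 (milliseconds).
  - Reasoning-model compatibility: check `reasoning_content` first and fall back to `content` if it is empty.
- **Go performance optimization**:
  - **Goroutine pool**: use the `ants` library or a custom pool, capped at 600 goroutines.
  - **Object pool**: reuse request and response objects with `sync.Pool`.
  - **Context control**: use `context.WithTimeout` to bound the test duration (300 seconds by default).
  - **Resource monitoring**: use the `runtime` package to record CPU, memory, and goroutine count.
- **Statistical accuracy**:
  - Use `sync/atomic` to accumulate request counters.
  - Use `sync.Mutex` to protect the response time and TTFT slices.
  - Use `gonum/stat` to compute P90/P95/P99.
- **Prompt generation**: `tiktoken-go/tokenizer` adjusts templates dynamically to 50 or 1000 tokens.
- **Report generation**: `go-echarts` renders the charts and a template engine produces HTML/PDF.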

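#### Load Test Workflow
1. **Load configuration**: read the API endpoint, client type (`fasthttp` or `net-http`), concurrency, prompt templates, and so on.
2. **Initialize the HTTP client**: instantiate the `fasthttp` or `net/http` client according to the `client` configuration key.
3. **Generate prompts**: use `tiktoken-go/tokenizer` to generate short-query (50 tokens) and long-document (1000 tokens) prompts.
4. **Run concurrently**:
   - Initialize the goroutine pool (at most 600 goroutines).
   - Send requests with `HTTPClient.Stream` (`stream=true`), spaced according to the Poisson arrival model.
   - Process the SSE stream and record TTFT.
5. **Process responses**:
   - Parse `content` and `reasoning_content` and count tokens.
   - Record response time, TTFT, and success or failure.
6. **Compute metrics**:
   - Compute request statistics, response time, TTFT, QPS, token generation rate, and maximum concurrent users.
7. **Generate the report**: output the statistics and charts.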

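#### Configuration Example and Report Contents
The updated configuration and report examples below include the HTTP client abstraction layer and its switching logic: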

```yaml
# Configuration example
api:
  endpoint: "https://api.example.com/v1/chat/completions"
  api_key: "your_api_key"
  model: "deepseek-r1"  # reasoning models are supported
  streaming: true  # enable streaming responses
  client: "fasthttp"  # HTTP client: fasthttp or net-http

prompt_templates:
  short:
    target_tokens: 50
    templates:
      - "What is the capital of {country}?"
      - "Please briefly explain the concept of {concept}."
      - "Summarize the main idea of {topic} in one sentence."
  long:
    target_tokens: 1000
    templates:
      - "Write a detailed history of {country} covering major events."
      - "Compose an in-depth analysis of {topic} with examples."
      - "Generate a comprehensive report on the impact of {event}."

requests:
  - type: "short"
    weight: 0.7
  - type: "long"
    weight: 0.3

concurrency:
  steps: [50, 200, 500]
  duration_per_step: 300  # seconds (5 minutes)
  max_goroutines: 600  # maximum number of goroutines

timeout: 60  # seconds
poisson_lambda: 1.0  # rate parameter of the Poisson request arrival process

tokenizer:
  model: "gpt-3.5-turbo"  # tokenizer model used by tiktoken-go

# Report content example
report:
  overview:
    api_endpoint: "https://api.example.com/v1/chat/completions"
    model: "deepseek-r1"
    concurrency_steps: [50, 200, 500]
    duration_per_step: "300s"
    request_mix: "70% short (50 tokens), 30% long (1000 tokens)"
    tokenizer: "tiktoken-go (gpt-3.5-turbo)"
    streaming_enabled: true
    http_client: "fasthttp"
  metrics:
    total_requests: 15000
    successful_requests: 14850
    failed_requests: 150
    timeout_ratio: "1.0%"
    response_time:
      avg: "1.8s"
      min: "0.5s"
      max: "10.2s"
      p90: "2.5s"
      p95: "3.0s"
      p99: "4.5s"
    ttft:
      min: "0.1s"
      max: "2.0s"
      p90: "0.3s"
      p95: "0.4s"
      p99: "0.6s"
    qps:
      avg: 180
      max: 220
    token_rate:
      avg: "120 tokens/s"
      max: "150 tokens/s"
    max_concurrent_users: 500
    resource_usage:
      avg_cpu: "60%"
      max_memory: "1.1GB"
      max_goroutines: 600
  charts:
    - "response_time_distribution.png"
    - "ttft_distribution.png"
    - "qps_over_time.png"
    - "token_rate_over_time.png"
    - "concurrency_vs_response_time.png"
```
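
A configuration like the one above is worth validating before a long run starts. The sketch below assumes the YAML has already been parsed into a struct; `Config` and `Validate` are hypothetical names, and the specific rules (weights summing to 1, every step within `max_goroutines`) are assumptions drawn from the example values rather than documented constraints.

```go
package main

import (
	"fmt"
	"math"
)

// Config is a hypothetical in-memory form of the YAML example after parsing.
// Field names are illustrative and not taken from the tool's source.
type Config struct {
	Client        string             // api.client
	Steps         []int              // concurrency.steps
	MaxGoroutines int                // concurrency.max_goroutines
	Weights       map[string]float64 // requests[].type -> requests[].weight
}

// Validate reports the first problem found, or nil if the config is usable.
func (c Config) Validate() error {
	if c.Client != "fasthttp" && c.Client != "net-http" {
		return fmt.Errorf("unsupported client: %q", c.Client)
	}
	sum := 0.0
	for typ, w := range c.Weights {
		if w < 0 {
			return fmt.Errorf("negative weight for %q", typ)
		}
		sum += w
	}
	if math.Abs(sum-1) > 1e-9 { // assumed rule: weights form a distribution
		return fmt.Errorf("request weights sum to %g, want 1", sum)
	}
	if len(c.Steps) == 0 {
		return fmt.Errorf("no concurrency steps")
	}
	for _, s := range c.Steps {
		if s <= 0 || s > c.MaxGoroutines { // assumed rule: pool must fit each step
			return fmt.Errorf("step %d outside 1..%d", s, c.MaxGoroutines)
		}
	}
	return nil
}

func main() {
	cfg := Config{
		Client:        "fasthttp",
		Steps:         []int{50, 200, 500},
		MaxGoroutines: 600,
		Weights:       map[string]float64{"short": 0.7, "long": 0.3},
	}
	fmt.Println("valid:", cfg.Validate() == nil)
}
```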