
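Streaming Responses

Streaming returns tokens as they are generated, which significantly reduces the time to first token that users perceive. ttttt.ai passes SSE through end to end without buffering or aggregating chunks, so time to first token stays close to the upstream provider's.

Enabling streaming

Add "stream": true to the request body: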

{ "model": "gpt-5.5", "messages": [{"role": "user", "content": "..."}], "stream": true }

Response headers:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Request-ID: req-...

OpenAI event format

data: {"id":"chatcmpl-...","choices":[{"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"你"}}]}

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"好"}}]}

data: {"id":"chatcmpl-...","choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":24,"completion_tokens":2,"total_tokens":26}}

data: [DONE]

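Notes:

  • Each chunk is a single line of the form data: <json>, terminated by a blank line (\n\n)
  • The stream always ends with data: [DONE]
  • The last chunk may carry usage (depending on the upstream version); if it is absent, the platform bills by estimate
  • When writing a custom parser, ignore any line that does not start with data: (comments and keep-alives)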
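
The rules above can be sketched as a small line-based parser. This is a minimal illustration, not the platform's reference implementation: the function name parse_openai_sse and the sample payloads are made up for the example, and it assumes the stream has already been split into text lines.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_openai_sse(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect streamed text and the last usage object (if any) from SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank separators, comments and keep-alives do not start with "data:".
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # usage is optional; remember it when a chunk carries it.
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), usage


# Hypothetical sample stream, shaped like the events shown above.
sample = [
    'data: {"id":"chatcmpl-1","choices":[{"delta":{"role":"assistant"}}]}',
    "",
    ": keep-alive",
    'data: {"id":"chatcmpl-1","choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"id":"chatcmpl-1","choices":[{"delta":{"content":"lo"}}]}',
    "",
    'data: {"id":"chatcmpl-1","choices":[{"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":24,"completion_tokens":2,"total_tokens":26}}',
    "",
    "data: [DONE]",
]
text, usage = parse_openai_sse(sample)
print(text, usage)
```

Returning usage as None when no chunk carried it lets the caller tell an exact count from a case where the platform will bill by estimate.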

Anthropic event format

Anthropic uses named events:

event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: ping
data: {"type":"ping"}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"你"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"好"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}

event: message_stop
data: {"type":"message_stop"}

Notes:

  • Each chunk has the form event: <name>\ndata: <json>\n\n
  • Event types are fine-grained (message_* / content_block_* / ping)
  • usage.output_tokens in message_delta is the cumulative count of tokens generated so far
  • ping events are keep-alives and can be ignored
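
A parser for named events has to track the event: line as well as the data: line. The following is a minimal sketch under the assumption that the stream arrives as text lines; iter_sse_events and collect_anthropic_text are illustrative names, not part of any SDK.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs; a blank line terminates each event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        # Any other line (for example a ":" comment) is ignored.
    if data:
        yield event, "\n".join(data)


def collect_anthropic_text(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[int]]:
    """Return (text, stop_reason, cumulative output_tokens) for one message."""
    parts, stop_reason, output_tokens = [], None, None
    for event, data in iter_sse_events(lines):
        if event == "ping":
            continue  # keep-alive only
        payload = json.loads(data)
        if event == "content_block_delta":
            delta = payload.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta["text"])
        elif event == "message_delta":
            stop_reason = payload.get("delta", {}).get("stop_reason")
            output_tokens = payload.get("usage", {}).get("output_tokens", output_tokens)
        elif event == "message_stop":
            break
    return "".join(parts), stop_reason, output_tokens


# Hypothetical sample stream, shaped like the events shown above.
anthropic_sample = [
    "event: message_start", 'data: {"type":"message_start","message":{}}', "",
    "event: ping", 'data: {"type":"ping"}', "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}', "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"lo"}}', "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}', "",
    "event: message_stop", 'data: {"type":"message_stop"}', "",
]
result = collect_anthropic_text(anthropic_sample)
print(result)
```

Because output_tokens is cumulative, the sketch overwrites the stored value instead of summing it.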

OpenAI Responses event format

event: response.created
data: {...}

event: response.output_text.delta
data: {"delta": "你"}

event: response.output_text.delta
data: {"delta": "好"}

event: response.completed
data: {"response": {...}}

For the full event list, see OpenAI's official Responses streaming documentation.
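
For the Responses format, the text arrives as a plain string in the delta field of response.output_text.delta events. A minimal sketch that handles only the two event types shown here (collect_responses_text is an illustrative name, and the sample payloads are made up):

```python
import json
from typing import Iterable


def collect_responses_text(lines: Iterable[str]) -> str:
    """Concatenate the "delta" strings of response.output_text.delta events."""
    parts, event = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            if event == "response.output_text.delta":
                parts.append(json.loads(line[len("data:"):])["delta"])
            elif event == "response.completed":
                break  # the response is finished
        elif line == "":
            event = None  # a blank line ends the current event
    return "".join(parts)


# Hypothetical sample stream, shaped like the events shown above.
responses_sample = [
    "event: response.created", "data: {}", "",
    "event: response.output_text.delta", 'data: {"delta": "Hel"}', "",
    "event: response.output_text.delta", 'data: {"delta": "lo"}', "",
    "event: response.completed", 'data: {"response": {}}', "",
]
responses_text = collect_responses_text(responses_sample)
print(responses_text)
```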

Parsing example (Node.js / fetch)

async function main(): Promise<void> {
  const res = await fetch("https://api.ttttt.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.TTTTT_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-5.5",
      messages: [{ role: "user", content: "Tell me a corny joke" }],
      stream: true,
    }),
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: HTTP ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // keep the trailing incomplete line for the next read
    for (const line of lines) {
      if (!line.startsWith("data:")) continue; // skip blank lines and comments
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") return; // valid here because we are inside a function
      const json = JSON.parse(payload);
      // choices can be empty on a usage-only chunk, so guard the access
      process.stdout.write(json.choices?.[0]?.delta?.content ?? "");
    }
  }
}

main().catch(console.error);

Parsing example (Python)

Both the OpenAI and Anthropic SDKs already wrap streaming:

# OpenAI
from openai import OpenAI

client = OpenAI(api_key="owo-...", base_url="https://api.ttttt.ai/v1")
stream = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
for chunk in stream:
    # A usage-only chunk can arrive with an empty choices list.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# Anthropic
import anthropic

client = anthropic.Anthropic(api_key="owo-...", base_url="https://api.ttttt.ai")
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

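Interruptions and retries

| Scenario | Platform behavior | Client recommendation |
| --- | --- | --- |
| Upstream 5xx before the first chunk | Automatically retries on a fallback channel | You see either a successful response or the final 5xx |
| Upstream disconnects mid-stream | Passes through the content already transmitted and marks it partial | The application decides whether to resend or to show "truncated" |
| Client disconnects on purpose | Gateway stops reading from upstream and bills by estimate | No side effects; resend on the next request |
| Network jitter | SSE itself allows reconnection, but the platform does no server-side session recovery | Client resends on timeout |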

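The gateway does not resume an interrupted stream automatically. To pick up where the stream stopped, send a new request whose messages include the content already received (or use previous_response_id with the Responses API).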
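
One way to put the received content back into messages is sketched below. build_resume_messages and its continue_prompt parameter are hypothetical helpers for illustration. Whether a model continues cleanly from a trailing assistant message, or needs an explicit follow-up user turn, depends on the upstream model, so treat both variants as assumptions to verify.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def build_resume_messages(
    original: List[Message],
    partial_text: str,
    continue_prompt: Optional[str] = None,
) -> List[Message]:
    """Return a new message list that carries the text received before the cut."""
    messages = [dict(m) for m in original]  # copy so the caller's list is untouched
    if not partial_text:
        return messages  # nothing was received: a plain resend is enough
    # Put the partial answer back as an assistant turn.
    messages.append({"role": "assistant", "content": partial_text})
    # Optionally ask for a continuation explicitly (assumption: useful for
    # upstreams that do not continue from a trailing assistant message).
    if continue_prompt:
        messages.append({"role": "user", "content": continue_prompt})
    return messages


history = [{"role": "user", "content": "Explain SSE in two sentences."}]
resumed = build_resume_messages(
    history,
    "SSE is a one-way stream over HTTP.",
    continue_prompt="Continue exactly where you stopped.",
)
print(resumed)
```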

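Time to first token

Measured time to first token is made up of three segments:

Client → ttttt.ai (gateway ingress): 5-30 ms (typically around 10 ms from mainland China IPs)
Gateway → upstream provider + model inference: 250 ms to several seconds (depends on the model and prompt length)
Upstream → ttttt.ai → client: passthrough with no aggregation, <10 ms added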

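The platform adds no extra buffering: as soon as the first upstream chunk arrives, it is forwarded downstream. Time to first token is therefore roughly the provider's own latency plus the round-trip time between the client and the gateway.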
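
To check this from the client side, time to first token can be measured around any iterator of text pieces. The helper below is a minimal sketch: measure_first_token is an illustrative name, and the clock parameter exists only so the example can be run deterministically. It assumes the iterator is lazy, so the request is sent when iteration starts.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_first_token(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], str]:
    """Return (seconds until the first non-empty piece, full text)."""
    start = clock()
    first: Optional[float] = None
    parts = []
    for piece in chunks:
        # Empty pieces (for example a role-only delta) do not count as a token.
        if piece and first is None:
            first = clock() - start
        parts.append(piece)
    return first, "".join(parts)


# Deterministic demo with a fake clock: start at 10.0 s, first token at 10.25 s.
ticks = iter([10.0, 10.25])
ttft, full_text = measure_first_token(iter(["", "Hel", "lo"]), clock=lambda: next(ticks))
print(ttft, full_text)
```

With a real stream, pass a generator that yields each delta's text and leave clock at its default.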

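Billing timing

  • After a streaming request finishes successfully, it is billed exactly from the upstream usage field.
  • When the upstream returns no usage, billing falls back to an estimate (output tokens inferred from the length of the content transmitted) and the record is marked estimated; operations reconciles and corrects these records monthly.
  • If the stream is cut off midway, the portion estimated so far is billed.

See Billing model → Streaming for details.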
