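Streaming responses
Streaming returns tokens as they are generated, which significantly reduces the first-token latency that users perceive. ttttt.ai passes SSE through end to end without buffering or aggregating chunks, so first-token latency stays close to that of the upstream provider.
Enabling streaming
Add "stream": true to the request body: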
{
  "model": "gpt-5.5",
  "messages": [{"role": "user", "content": "..."}],
  "stream": true
}
Response headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-Request-ID: req-...
OpenAI event format
data: {"id":"chatcmpl-...","choices":[{"delta":{"role":"assistant"}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"你"}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"好"}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":24,"completion_tokens":2,"total_tokens":26}}
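data: [DONE]
Characteristics:
- Each chunk is a single line of the form data: <json>\n\n
- The stream always ends with data: [DONE]
- The last chunk may carry usage (depending on the upstream version); if it does not, the platform bills by estimate
- In a custom parser, ignore any line that does not start with data: (comments / keep-alive)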
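The parsing rules above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's reference parser: the function name collect_openai_stream is hypothetical, and the input is assumed to be already-decoded text lines (for example from an HTTP client's line iterator).

```python
import json
from typing import Iterable, Optional, Tuple


def collect_openai_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Accumulate delta.content from decoded SSE lines; return (text, usage)."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank separators, comments, and keep-alive lines do not start with "data:".
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # usage, when the upstream sends it, arrives on the last chunk.
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), usage


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    ": keep-alive",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":2}}',
    "data: [DONE]",
]
text, usage = collect_openai_stream(sample)
print(text, usage)
```

If usage comes back as None, the caller knows the upstream sent no usage field and that billing will fall back to an estimate.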
Anthropic event format
Anthropic uses named events:
event: message_start
data: {"type":"message_start","message":{...}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: ping
data: {"type":"ping"}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"你"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"好"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}
event: message_stop
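data: {"type":"message_stop"}
Characteristics:
- Each chunk has the form event: <name>\ndata: <json>\n\n
- Event types are fine-grained (message_* / content_block_* / ping)
- usage.output_tokens in message_delta is the cumulative count of tokens generated so far
- ping events are for keep-alive and can be ignored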
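As a rough sketch of how a hand-written client might consume these named events, the Python below pairs each event: line with the data: line that follows it, concatenates text_delta fragments, and overwrites (rather than sums) the cumulative output_tokens. The helper names iter_sse_events and collect_anthropic_stream are hypothetical, and the input is assumed to be decoded text lines.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) for each event/data pair."""
    event = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event is not None:
            yield event, json.loads(line[len("data:"):].strip())
            event = None


def collect_anthropic_stream(lines: Iterable[str]) -> Tuple[str, int]:
    """Return (full_text, output_tokens) from an Anthropic-style event stream."""
    parts = []
    output_tokens = 0
    for event, data in iter_sse_events(lines):
        if event == "ping":
            continue  # keep-alive only
        if event == "content_block_delta" and data["delta"].get("type") == "text_delta":
            parts.append(data["delta"]["text"])
        elif event == "message_delta":
            # The value is cumulative, so replace the previous one.
            output_tokens = data.get("usage", {}).get("output_tokens", output_tokens)
        elif event == "message_stop":
            break
    return "".join(parts), output_tokens


sample = [
    "event: ping",
    'data: {"type":"ping"}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"lo"}}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
text, output_tokens = collect_anthropic_stream(sample)
print(text, output_tokens)
```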
OpenAI Responses event format
event: response.created
data: {...}
event: response.output_text.delta
data: {"delta": "你"}
event: response.output_text.delta
data: {"delta": "好"}
event: response.completed
data: {"response": {...}}
For the full event list, see the official OpenAI Responses streaming documentation.
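For the Responses format, a client that only wants the text can watch for response.output_text.delta and stop at response.completed. The sketch below assumes decoded text lines and the payload shape shown above (a top-level "delta" string); collect_responses_text is a hypothetical helper name, and every other event type is simply skipped.

```python
import json
from typing import Iterable


def collect_responses_text(lines: Iterable[str]) -> str:
    """Concatenate the delta strings of response.output_text.delta events."""
    parts = []
    event = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            if event == "response.output_text.delta":
                parts.append(json.loads(line[len("data:"):])["delta"])
            elif event == "response.completed":
                break  # final event; the full response object is not needed here
    return "".join(parts)


sample = [
    "event: response.created",
    "data: {}",
    "",
    "event: response.output_text.delta",
    'data: {"delta": "Hel"}',
    "",
    "event: response.output_text.delta",
    'data: {"delta": "lo"}',
    "",
    "event: response.completed",
    'data: {"response": {}}',
]
result = collect_responses_text(sample)
print(result)
```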
Parsing example (Node.js / fetch)
const res = await fetch("https://api.ttttt.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.TTTTT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-5.5",
    messages: [{ role: "user", content: "Tell me a corny joke" }],
    stream: true,
  }),
});
if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;
while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep the incomplete trailing line for the next read
  for (const line of lines) {
    if (!line.startsWith("data:")) continue; // skip comments / keep-alive lines
    const payload = line.slice(5).trim();
    if (payload === "[DONE]") { finished = true; break; } // end of stream
    const json = JSON.parse(payload);
    // Optional chaining guards against chunks that carry no choices or no content.
    process.stdout.write(json.choices?.[0]?.delta?.content ?? "");
  }
}
Parsing example (Python)
Both the OpenAI and Anthropic SDKs already wrap streaming:
# OpenAI
from openai import OpenAI
client = OpenAI(api_key="owo-...", base_url="https://api.ttttt.ai/v1")
stream = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no choices (for example a usage-only chunk).
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# Anthropic
import anthropic
client = anthropic.Anthropic(api_key="owo-...", base_url="https://api.ttttt.ai")
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Disconnects and retries
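| Scenario | Platform behavior | Client recommendation |
|---|---|---|
| Upstream 5xx before the first chunk | Automatically retries on a backup channel | You see either a successful response or the final 5xx |
| Upstream disconnects mid-stream | Passes through the content already transmitted and marks it as partial | The application decides whether to resend or to show "truncated" |
| Client disconnects | Gateway stops reading from upstream and bills by estimate | No side effects; resend on the next request |
| Network jitter | SSE itself allows reconnection, but the platform does not resume sessions server-side | Client resends after a timeout |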
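The gateway does not resume streams automatically. To "continue" after an SSE interruption, send a new request whose messages include the content already received (or use previous_response_id with the Responses API).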
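One way to splice the received content back into a new request is sketched below. It is an assumption about client-side handling, not a platform feature: build_continuation_messages and the continue_prompt wording are hypothetical, and whether a model continues cleanly from a partial assistant turn depends on the upstream provider.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_continuation_messages(
    original: List[Message],
    partial_text: str,
    continue_prompt: str = "Continue exactly where you left off.",
) -> List[Message]:
    """Return a new messages list that carries the partial reply forward."""
    messages = list(original)  # shallow copy; the caller's list is not mutated
    if not partial_text:
        return messages  # nothing was received, so a plain resend is enough
    messages.append({"role": "assistant", "content": partial_text})
    messages.append({"role": "user", "content": continue_prompt})
    return messages


original = [{"role": "user", "content": "Write a haiku about rain."}]
resumed = build_continuation_messages(original, "Soft rain on the roof,")
for m in resumed:
    print(m["role"], "->", m["content"])
```

The client then concatenates partial_text with the text of the new response when displaying the result.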
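First-token latency
Measured first-token latency consists of three segments:
Client → ttttt.ai (gateway ingress): 5-30 ms (typically around 10 ms from domestic IPs)
Gateway → upstream provider + model inference: 250 ms to several seconds (depends on the model and prompt length)
Upstream → ttttt.ai → client: passthrough with no aggregation, <10 ms added
The platform adds no extra buffering: the first upstream chunk is forwarded as soon as it arrives. First-token latency is therefore roughly "upstream provider latency + client-to-gateway RTT".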
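To check these numbers from the client side, first-token latency can be timed around any iterator of text fragments. The sketch below is illustrative only: measure_first_token is a hypothetical helper, and fake_stream stands in for a real SDK stream with a simulated 50 ms delay.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_first_token(chunks: Iterable[str]) -> Tuple[Optional[float], str]:
    """Return (seconds until the first non-empty fragment, full text)."""
    start = time.perf_counter()
    first_at = None
    parts = []
    for piece in chunks:
        if piece and first_at is None:
            first_at = time.perf_counter() - start
        parts.append(piece)
    return first_at, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Stand-in for a real stream: an empty role chunk, then two fragments."""
    time.sleep(0.05)  # simulated network plus inference delay
    yield ""
    yield "Hel"
    yield "lo"


ttft, text = measure_first_token(fake_stream())
print(f"first token after {ttft * 1000:.0f} ms, text={text!r}")
```

With a real stream, start the timer before sending the request so that the client-to-gateway RTT is included in the measurement.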
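Billing timing
- Once a streaming request completes successfully, it is billed precisely from the upstream usage field.
- If the upstream returns no usage, billing is estimated (output tokens are inferred from the length of the transmitted content) and marked estimated; operations reconciles and corrects these entries monthly.
- A mid-stream disconnect is billed for the portion estimated so far.
For details, see Billing model → Streaming.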
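The decision between exact and estimated billing can be mirrored on the client for rough cost tracking. The platform's actual estimator is not documented here, so the sketch below is purely illustrative: billable_output_tokens is a hypothetical helper and CHARS_PER_TOKEN is an assumed ratio, not the value the platform uses.

```python
import math
from typing import Optional, Tuple

# Hypothetical characters-per-token ratio, chosen only for illustration.
CHARS_PER_TOKEN = 4.0


def billable_output_tokens(usage: Optional[dict], streamed_text: str) -> Tuple[int, bool]:
    """Return (output_tokens, estimated); estimated is True when usage is missing."""
    if usage and "completion_tokens" in usage:
        return usage["completion_tokens"], False  # exact figure from upstream
    # No usage field: infer from the length of the content actually received.
    return math.ceil(len(streamed_text) / CHARS_PER_TOKEN), True


print(billable_output_tokens({"completion_tokens": 2}, "hi"))
print(billable_output_tokens(None, "abcdefghij"))
```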