Technical specifications of Seed 1.8 API
| Item | Specification / note |
|---|---|
| Model name / family | Doubao-Seed-1.8 (Seed1.8) — ByteDance Seed / Volcano Engine |
| Modalities supported | Text, images, and video (multimodal VLM capabilities); audio and video generation are handled by separate models in the ecosystem. |
| Context window (text) | 256K tokens |
| Video / visual capacity | Designed for long-video reasoning, supports efficient visual encoding and large video-token budgets (model card reports video token experiments and long-video benchmarks). |
| Input formats | Free-text prompts; image uploads (screenshots, charts, photos); video as tokenized frames / video tools for segment inspection; file uploads (documents). |
| Output formats | Natural-language text, structured outputs (structured-output beta), function calls / tool calls, code, and multimodal outputs via orchestration. |
| Thinking / inference modes | no_think, think-low, think-medium, think-high — trade accuracy vs latency/cost. |
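The thinking modes listed in the last row are selected per request. A minimal sketch of building such a payload, assuming a hypothetical `thinking` field (the real parameter name and accepted values should be taken from the API docs):

```python
import json

def build_request(prompt: str, thinking_mode: str = "no_think") -> dict:
    """Build a chat request that selects a thinking mode.

    NOTE: the field name "thinking" is an assumption for illustration;
    check the provider's API reference for the actual parameter.
    """
    allowed = {"no_think", "think-low", "think-medium", "think-high"}
    if thinking_mode not in allowed:
        raise ValueError(f"unknown thinking mode: {thinking_mode}")
    return {
        "model": "doubao-seed-1-8-251228",
        "messages": [{"role": "user", "content": prompt}],
        "thinking": thinking_mode,  # assumed field name
    }

payload = build_request("Summarize this report.", "think-low")
print(json.dumps(payload, indent=2))
```

In a latency-sensitive interactive product you would default to `no_think` or `think-low` and reserve `think-high` for batch or high-stakes queries.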
What is Doubao Seed 1.8?
Doubao Seed 1.8 is the Seed team’s 1.8 release: a unified LLM+VLM that explicitly targets generalized real-world agency — i.e., perception (images/video), reasoning, tool orchestration (search, function calls, code execution, GUI grounding) and multi-step decision making inside a single model. The design emphasizes configurable “thinking modes” (tradeoffs between latency and depth), efficient visual encoding and native support for long context and multimodal inputs so the model can operate as an autonomous assistant/agent in production workflows.
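As a sketch of how a multimodal input might be packaged for such a model, assuming the OpenAI-style content-parts schema that the hosted chat endpoint is described as compatible with (the exact part format is an assumption; consult the API docs):

```python
def multimodal_message(question: str, image_url: str) -> dict:
    """Combine text and an image into one user message.

    Uses the OpenAI-style content-parts layout; the precise schema
    accepted by the endpoint is an assumption here.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = multimodal_message("What does this chart show?",
                         "https://example.com/chart.png")
```

Video inputs follow the same pattern conceptually, with frames or segment references supplied as additional content parts.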
Main features of Seed 1.8 API
- Unified multimodal agentic model. Integrates perception (image/video), reasoning (LLM), and action (tool/GUI calls, code execution) in a single model rather than a split pipeline. This enables compact agent workflows and lower orchestration complexity.
- Ultra-long context & long-video handling. Long context (product support up to 256K tokens) and strong results on long-video benchmarks (Seed1.8 shows strong long-video token efficiency). The model supports selective video tools (VideoCut) to focus reasoning on specific timestamps.
- Agentic GUI automation & tool use. Benchmarks and internal tests (OSWorld, AndroidWorld, LiveCodeBench, GUI grounding benchmarks) show improvements in GUI agent tasks and multi-step automation. The model can output GUI grounding commands and operate within simulated OS/web/mobile contexts.
- Configurable thinking modes for latency/cost control. Four inference modes let developers tune compute at test-time for interactive vs. high-quality batch tasks. This is useful for production systems with strict latency budgets.
- Improved token efficiency (multimodal). Seed 1.8 demonstrates stronger token efficiency on multimodal benchmarks versus its predecessors (Seed-1.5/1.6 series), achieving high accuracy with smaller token budgets in several long-video tasks.
Technical capabilities
- Token efficiency: Seed1.8 shows marked token efficiency vs predecessors (Seed-1.5/1.6), delivering stronger accuracy at lower token budgets on long video tasks (e.g., achieving competitive accuracy even at 32K video tokens). This enables lower inference cost for long inputs.
- Multimodal reasoning & perception: The model reaches SOTA on several multi-image VQA and motion/perception tasks and obtains second-place or near-SOTA on many multimodal reasoning benchmarks; specifically it outperforms its predecessor on nearly every visual/video dimension measured.
- Agentic tool use & GUI grounding: Documented support for GUI grounding and screen-based operation benchmarks (ScreenSpot-Pro, GUI agent tasks) with strong grounding scores (e.g., improvements over Seed-1.5-VL on ScreenSpot-Pro).
- Parallel / stepped reasoning: Increasing test-time compute (parallel thinking) yields measurable gains on math, coding, and multimodal reasoning benchmarks.
Selected public benchmark highlights of Seed1.8
- VCRBench (visual commonsense reasoning): Seed1.8 scored 59.8 (Pass@1, as reported in the model card table), an improvement over Seed-1.5-VL and competitive with top models.
- VideoHolmes (video reasoning): Seed1.8 scored 65.5, outperforming Seed-1.5-VL and approaching pro-grade competitor models.
- MMLB-NIAH (multimodal long-context, 128k): Seed1.8 achieved 72.2 Pass@1 at 128k context in MMLB-NIAH, surpassing some contemporary pro models.
- Motion & Perception suite: SOTA in 5 of 6 evaluated tasks; examples include TVBench, TempCompass and TOMATO where Seed1.8 shows substantial gains in temporal perception.
- Agentic workflows: On BrowseComp and other agentic search/code benchmarks, Seed1.8 often ranks near or above competing pro models.
Seed 1.8 vs Gemini 3 Pro / GPT-5.x
- Seed1.8 vs Seed-1.5-VL / Seed-1.6: Clear improvements in multimodal perception, token efficiency for long videos, and agentic execution.
- Seed1.8 vs Gemini 3 Pro / GPT-5.x: On many multimodal benchmarks Seed1.8 matches or exceeds Gemini 3 Pro (SOTA on several VQA / motion tasks; better on MMLB-NIAH 128k run). However, the card also shows areas where Gemini family models retain advantages on certain disciplinary knowledge tasks — so the relative ordering is benchmark-dependent.
- Seed-Code variant (Doubao-Seed-Code): specialized for programming/agentic code tasks (large context for codebases; specialized SWE benchmarks). Seed1.8 is the generalist agentic multimodal model, while Seed-Code is the programming-focused variant.
Practical use-cases of the Seed 1.8 API on CometAPI
- Multimodal research assistants & document analysis: extract, summarize, and reason across long documents, slide decks, and multi-page reports.
- Long-video comprehension & monitoring: security/sports broadcasting analytics, long meeting summarization, and streaming analysis where the model’s long-video token efficiency matters.
- Agentic workflows / automation: multi-step web search + code execution + data extraction scenarios (e.g., automated competitive analysis, travel planning, research pipelines demonstrated in internal benchmarks).
- Developer tooling (if using Seed-Code): large codebase analysis, IDE assistants, and agentic code execution for testing & repair (Seed-Code is the recommended specialized variant).
- GUI automation & RPA: screen grounding and GUI agent benchmarks indicate the model can perform structured GUI tasks better than prior Seed releases.
How to Use the Doubao Seed 1.8 API via CometAPI
Doubao Seed 1.8 is now commercially available through CometAPI as a hosted inference API. The API supports multimodal payloads (text + images + video fragments / timestamps) and configurable inference modes to trade latency and compute against answer quality.
Call patterns: The API supports standard chat/completion style requests, streaming responses, and agentic flows where the model issues tool calls (search, code execution, GUI actions) and ingests tool outputs as subsequent context.
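The agentic call pattern described above can be sketched as a loop: the model emits a tool call, the client executes it, and the tool output is fed back as context for the next turn. The model below is a stub and the message shapes are simplified assumptions (not the exact wire format); in practice each `call_model()` invocation would be an HTTP request to the chat endpoint.

```python
def run_agent_loop(call_model, tools, user_prompt, max_steps=5):
    """Drive a simple tool-use loop until the model returns a final answer."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append(reply)
        if "tool_call" not in reply:          # final natural-language answer
            return reply["content"]
        name = reply["tool_call"]["name"]
        args = reply["tool_call"]["args"]
        result = tools[name](**args)          # execute the requested tool
        messages.append({"role": "tool", "name": name, "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

# Stub model: first asks for a search, then answers using the tool result.
def stub_model(messages):
    if messages[-1]["role"] == "user":
        return {"role": "assistant", "content": None,
                "tool_call": {"name": "search", "args": {"q": "seed 1.8"}}}
    return {"role": "assistant", "content": "Found: " + messages[-1]["content"]}

answer = run_agent_loop(stub_model,
                        {"search": lambda q: f"results for {q}"},
                        "Look it up")
```

The same loop shape covers search, code execution, and GUI actions: only the `tools` table changes.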
Streaming & long-context handling: The API supports streaming and provides built-in context-management primitives for long sessions (enabling 100K+ token contexts and multi-step agent traces).
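A minimal sketch of consuming a streamed response, assuming the common server-sent-events framing (`data: {...}` lines terminated by `data: [DONE]`); the lines are simulated here so the sketch runs offline:

```python
import json

def collect_stream(lines):
    """Concatenate content deltas from SSE-framed chat chunks."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip keep-alives / blank lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        text.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(text)

simulated = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(simulated))  # Hello
```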
Step 1: Sign Up for API Key
Log in to cometapi.com; if you do not have an account yet, register first. In your CometAPI console, open the API token page in the personal center, click “Add Token”, and copy the generated key (it has the form sk-xxxxx). This key is the access credential for all API requests.
Step 2: Send Requests to doubao Seed 1.8 API
Select the “doubao-seed-1-8-251228” endpoint and set the request body. The request method and body format are documented in the CometAPI API docs, and the site also provides an Apifox playground for convenient testing. Replace <YOUR_API_KEY> with the actual CometAPI key from your account. The endpoint is compatible with OpenAI-style Chat APIs.
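A minimal request sketch, assuming an OpenAI-compatible chat-completions endpoint at `api.cometapi.com` (verify the exact base URL in your CometAPI dashboard before use):

```python
import json
import urllib.request

API_KEY = "<YOUR_API_KEY>"                                 # from your console
URL = "https://api.cometapi.com/v1/chat/completions"       # assumed base URL

body = json.dumps({
    "model": "doubao-seed-1-8-251228",
    "messages": [{"role": "user", "content": "Explain the 256K context window."}],
}).encode("utf-8")

req = urllib.request.Request(
    URL,
    data=body,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# resp = urllib.request.urlopen(req)        # uncomment to actually send
# result = json.loads(resp.read())
```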
Insert your question or request into the content field; this is what the model will respond to.
Step 3: Retrieve and Verify Results
Parse the API response to extract the generated answer; the response includes the task status along with the output data.
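A minimal sketch of extracting the answer, assuming the standard OpenAI-style chat-completions response shape (real responses carry additional fields such as `id` and `usage`):

```python
import json

# Sample response in the OpenAI-style chat-completions shape.
sample = json.loads("""
{
  "choices": [
    {"message": {"role": "assistant",
                 "content": "Seed 1.8 supports a 256K context."},
     "finish_reason": "stop"}
  ]
}
""")

answer = sample["choices"][0]["message"]["content"]
status = sample["choices"][0]["finish_reason"]   # "stop" means a complete answer
print(status, "->", answer)
```

Checking `finish_reason` before using the output is a cheap way to catch truncated (`length`) or tool-call turns in production code.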