Qubrid AI Models

Z.ai (Zhipu AI) · Chat / LLM · 355B Parameters (32B Active) · 128K Context
Streaming · Reasoning · Agentic Coding · Long Context · Code · Tool Orchestration
Overview
GLM-4.7-FP8 is the flagship model from Z.ai (formerly Zhipu AI), a Chinese AI research company focused on building large-scale open-source foundation models for reasoning, coding, and agentic workflows. It is a sparse Mixture-of-Experts model with 355B total parameters, of which 32B are active per forward pass. It introduces three thinking modes: Interleaved Thinking (reasoning before every response and tool call), Preserved Thinking (retaining reasoning state across long coding sessions), and Turn-level Thinking (controlling thinking per request). It scores 95.7% on AIME 2025, 73.8% on SWE-bench, and 87.4% on τ²-Bench, and is served through the Qubrid AI Serverless API.
🏆 95.7% AIME 2025. 73.8% SWE-bench. 355B MoE. Interleaved Thinking.
Deploy on Qubrid AI — no infrastructure required.
Model Specifications
| Field | Details |
|---|---|
| Model ID | zai-org/GLM-4.7-FP8 |
| Provider | Z.ai (formerly Zhipu AI) |
| Kind | Chat / LLM |
| Architecture | Sparse MoE Transformer — 355B total / 32B active per token, FP8 native quantization |
| Parameters | 355B total (32B active per forward pass) |
| Context Length | 128,000 Tokens |
| MoE | Yes |
| Release Date | December 2025 |
| License | MIT (commercial use allowed) |
| Training Data | Large-scale multilingual dataset with code, math, reasoning, and agentic workflows |
| Function Calling | Not Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |
Pricing
💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.60 |
| Input Tokens (Cached) | $0.30 |
| Output Tokens | $2.20 |
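As a rough illustration of how the per-token prices above translate into a per-request cost, the sketch below does the arithmetic. The helper name `estimate_cost_usd` is hypothetical, and it assumes cached input tokens are billed at the cached rate instead of (not in addition to) the standard input rate; check your Qubrid usage dashboard for the actual billing rules.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICE_PER_MILLION = {
    "input": 0.60,
    "cached_input": 0.30,
    "output": 2.20,
}


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Estimate the cost of one request in USD.

    Assumption (not confirmed by the source): cached input tokens are
    billed only at the cached rate and are not also counted as input.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * PRICE_PER_MILLION["input"]
        + cached_input_tokens * PRICE_PER_MILLION["cached_input"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 10,000 fresh input tokens, 50,000 cached input tokens, 4,096 output tokens
    cost = estimate_cost_usd(10_000, 4_096, cached_input_tokens=50_000)
    print(f"Estimated cost: ${cost:.4f}")
```

Because output tokens cost several times more than input tokens, capping Max Tokens is usually the most direct lever on cost for long reasoning responses.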
Quickstart
Prerequisites
- Create a free account at platform.qubrid.com
- Generate your API key from the API Keys section
- Replace QUBRID_API_KEY in the code below with your actual key
💡 Thinking mode: enable_thinking=true by default. The model reasons before every response and tool call. Toggle per request for precise control over reasoning depth and latency.
Python
from openai import OpenAI

# Initialize the OpenAI client with the Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {
            "role": "user",
            "content": "Explain quantum computing in simple terms",
        }
    ],
    max_tokens=4096,
    temperature=0.6,
    top_p=1,
    stream=True,
)

# Streaming (stream=True): print each content delta as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

# Non-streaming (stream=False): remove the loop above and print the full message instead
# print(stream.choices[0].message.content)

JavaScript
import OpenAI from "openai";

// Initialize the OpenAI client with the Qubrid base URL
const client = new OpenAI({
  baseURL: "https://platform.qubrid.com/v1",
  apiKey: "QUBRID_API_KEY",
});

// Create a streaming chat completion
const stream = await client.chat.completions.create({
  model: "zai-org/GLM-4.7-FP8",
  messages: [
    {
      role: "user",
      content: "Explain quantum computing in simple terms",
    },
  ],
  max_tokens: 4096,
  temperature: 0.6,
  top_p: 1,
  stream: true,
});

// Streaming (stream: true): write each content delta as it arrives
for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
console.log("\n");

// Non-streaming (stream: false): remove the loop above and log the full message instead
// console.log(stream.choices[0].message.content);

Go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	url := "https://platform.qubrid.com/v1/chat/completions"

	// Request body for a streaming chat completion
	data := map[string]interface{}{
		"model": "zai-org/GLM-4.7-FP8",
		"messages": []map[string]string{
			{
				"role":    "user",
				"content": "Explain quantum computing in simple terms",
			},
		},
		"temperature": 0.6,
		"max_tokens":  4096,
		"stream":      true,
		"top_p":       1,
	}

	jsonData, err := json.Marshal(data)
	if err != nil {
		log.Fatal(err)
	}

	req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer QUBRID_API_KEY")
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	res, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	// Print each non-empty line of the streamed response as it arrives
	scanner := bufio.NewScanner(res.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if line != "" {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Printf("error reading response: %v", err)
	}
}

cURL
curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "temperature": 0.6,
    "max_tokens": 4096,
    "stream": true,
    "top_p": 1
  }'

Live Example
Prompt:
Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes
Response:
<think>
Classic algorithm request. I'll implement the Sieve of Eratosthenes cleanly
with proper type hints, a docstring, and an example usage block.
I'll also note the time complexity for completeness.
</think>

def sieve_of_eratosthenes(n: int) -> list[int]:
    """
    Return all prime numbers up to and including n
    using the Sieve of Eratosthenes.

    Time complexity: O(n log log n)
    Space complexity: O(n)

    Args:
        n: Upper bound (inclusive) for prime search.

    Returns:
        Sorted list of all primes <= n.
    """
    if n < 2:
        return []

    # Initialize boolean array: True = potentially prime
    is_prime = [True] * (n + 1)
    is_prime[0] = False
    is_prime[1] = False

    # Sieve: mark multiples of each prime as composite
    p = 2
    while p * p <= n:
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
        p += 1

    return [i for i, prime in enumerate(is_prime) if prime]


# Example usage
if __name__ == "__main__":
    primes = sieve_of_eratosthenes(50)
    print(primes)
    # Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

Try it yourself in the Qubrid AI Playground →
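The response above carries the model's reasoning inside `<think>` tags ahead of the answer. If your client receives the reasoning inline in this form (an assumption based on the example above; the API may instead return it in a separate field), a small helper can separate the two. `split_thinking` is a hypothetical name, not part of any SDK.

```python
import re

# Matches one <think>...</think> block, non-greedy, across newlines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response with inline <think> blocks.

    Assumption: reasoning arrives inline in the content, delimited by
    <think> tags as in the live example. All think blocks are joined
    into one reasoning string; everything outside them is the answer.
    """
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = (
        "<think>\nClassic algorithm request.\n</think>\n"
        "def sieve_of_eratosthenes(n): ..."
    )
    reasoning, answer = split_thinking(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

Keeping the reasoning separate makes it easy to log it for debugging while showing only the answer to end users.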
Playground Features
The Qubrid AI Playground lets you interact with GLM-4.7-FP8 directly in your browser — no setup, no code, no cost to explore.
🧠 System Prompt
Define the model's role, coding style, and reasoning constraints before the conversation begins. Particularly powerful for agentic coding sessions and long-horizon tool orchestration.
Example: "You are a senior software engineer. For every coding task:
1. Think through the approach before writing any code.
2. Write clean, production-ready code with type hints and docstrings.
3. Include edge case handling and a brief complexity analysis."

Set your system prompt once in the Qubrid Playground and it applies across every turn, including preserved reasoning state across multi-turn coding sessions.
🎯 Few-Shot Examples
Establish your preferred code style and output format with concrete examples — no fine-tuning required. Especially powerful for consistent structured outputs in agentic pipelines.
| User Input | Assistant Response |
|---|---|
| Write a function to check if a binary tree is balanced | def is_balanced(root) -> bool: def height(node): if not node: return 0; l, r = height(node.left), height(node.right); if l == -1 or r == -1 or abs(l-r) > 1: return -1; return max(l,r)+1; return height(root) != -1 |
| Refactor this: for i in range(len(arr)): result.append(arr[i]*2) | result = [x * 2 for x in arr] # List comprehension: cleaner and faster |
💡 Stack multiple few-shot examples in the Qubrid Playground to establish coding style, language preference, and output structure — no fine-tuning required.
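Outside the Playground, a system prompt and few-shot pairs can be expressed as a `messages` list for the OpenAI-compatible chat endpoint. The sketch below assumes the standard `system` / `user` / `assistant` role convention; `build_messages` is a hypothetical helper, not a Qubrid function.

```python
def build_messages(
    system_prompt: str,
    few_shots: list[tuple[str, str]],
    user_input: str,
) -> list[dict[str, str]]:
    """Build a chat `messages` list: system prompt, few-shot pairs, new input."""
    messages: list[dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # Each few-shot example becomes a user turn followed by an assistant turn.
    for example_input, example_response in few_shots:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_response})
    # The real request always goes last.
    messages.append({"role": "user", "content": user_input})
    return messages


if __name__ == "__main__":
    msgs = build_messages(
        "You are a senior software engineer.",
        [(
            "Refactor this: for i in range(len(arr)): result.append(arr[i]*2)",
            "result = [x * 2 for x in arr]",
        )],
        "Refactor this: for k in d.keys(): print(k, d[k])",
    )
    for m in msgs:
        print(m["role"], "->", m["content"])
```

The resulting list can be passed as the `messages` argument in the Quickstart examples.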
Inference Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 0.6 | Controls randomness. Lower values recommended for reasoning and coding |
| Max Tokens | number | 4096 | Maximum number of tokens to generate |
| Top P | number | 1 | Controls nucleus sampling |
| Enable Thinking | boolean | true | Enable Interleaved Thinking mode — the model reasons before every response and tool call for improved accuracy |
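The defaults in the table can be collected into a request body for the chat completions endpoint. The sketch below only builds the JSON payload and sends nothing. The fields `model`, `messages`, `stream`, `temperature`, `max_tokens`, and `top_p` match the Quickstart; sending `enable_thinking` as a top-level field is an assumption, so confirm the exact placement in the Qubrid docs. `build_payload` is a hypothetical helper.

```python
import json

# Defaults taken from the Inference Parameters table above.
DEFAULTS = {
    "stream": True,
    "temperature": 0.6,
    "max_tokens": 4096,
    "top_p": 1,
    # Assumption: the thinking toggle is a top-level request field.
    "enable_thinking": True,
}


def build_payload(prompt: str, **overrides) -> dict:
    """Return a chat completions request body with per-request overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(DEFAULTS)
    payload.update(overrides)
    return payload


if __name__ == "__main__":
    # Disable thinking and shorten the output for a latency-sensitive call.
    payload = build_payload(
        "Summarize this diff in one line.",
        enable_thinking=False,
        max_tokens=256,
    )
    print(json.dumps(payload, indent=2))
```

Rejecting unknown keys catches typos such as `max_token` before a request is sent.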
Use Cases
- Agentic multilingual coding
- Terminal-based task automation
- Vibe coding and UI generation
- Complex mathematical reasoning
- Tool orchestration (Claude Code, Cline, Roo Code)
- Long-horizon multi-turn tasks
Strengths & Limitations
| Strengths | Limitations |
|---|---|
| Interleaved Thinking — reasons before every response and tool call | Very large model requires significant infrastructure for self-hosting |
| Preserved Thinking — retains reasoning state across long coding sessions | FP8 inference requires natively supporting hardware |
| Turn-level control over thinking depth per request | Thinking mode increases latency |
| 355B MoE with 32B active — frontier reasoning at low cost | Function calling not supported |
| 95.7% AIME 2025 — state-of-the-art mathematical reasoning | |
| Open-source under the MIT license: full commercial use permitted | |
Why Qubrid AI?
- 🚀 No infrastructure setup — 355B MoE served serverlessly, pay only for what you use
- 🔁 OpenAI-compatible — drop-in replacement using the same SDK, just swap the base URL
- 💰 Cached input pricing — $0.30/1M for cached tokens, ideal for long agentic coding sessions
- 🧠 Interleaved Thinking on demand — toggle reasoning depth per request via the API without managing model configuration
- 🧪 Built-in Playground — prototype with system prompts and few-shot examples instantly at platform.qubrid.com
- 📊 Full observability — API logs and usage tracking built into the Qubrid dashboard
Resources
| Resource | Link |
|---|---|
| 📖 Qubrid Docs | docs.platform.qubrid.com |
| 🎮 Playground | Try GLM-4.7-FP8 live |
| 🔑 API Keys | Get your API Key |
| 🤗 Hugging Face | zai-org/GLM-4.7-FP8 |
| 💬 Discord | Join the Qubrid Community |
Built with ❤️ by Qubrid AI
Frontier models. Serverless infrastructure. Zero friction.