

NVIDIA · Chat / LLM · 31.6B Parameters (3.2B Active) · 262K Context (up to 1M)

Function Calling · Tool Calling · Streaming · Reasoning · Long Context · Code

Overview
NVIDIA Nemotron-3 Nano 30B A3B BF16 is NVIDIA’s flagship open reasoning model, built on a hybrid Mamba-Transformer Mixture-of-Experts architecture. With 31.6B total parameters but only 3.2B active per forward pass (a 10% activation ratio), it delivers up to 3.3× higher throughput than Qwen3-30B-A3B while achieving state-of-the-art accuracy on reasoning, coding, and agentic benchmarks. The model supports up to a 1M-token context length and offers configurable reasoning depth with thinking-budget control, making it the most compute-efficient reasoning model in its class. Served instantly via the Qubrid AI Serverless API.

⚡ 3.3× faster than Qwen3-30B-A3B. Only 3.2B active parameters. 1M token context. Deploy on Qubrid AI — no VRAM, no cluster, no ops.
Model Specifications
| Field | Details |
|---|---|
| Model ID | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Provider | NVIDIA |
| Kind | Chat / LLM |
| Architecture | Hybrid Mamba-Transformer MoE — 23 Mamba-2 layers, 23 MoE layers (128 experts, 6 active), 6 GQA attention layers |
| Parameters | 31.6B total (3.2B active per forward pass) |
| Context Length | 262K Tokens (up to 1M) |
| MoE | Yes |
| Release Date | December 15, 2025 |
| License | NVIDIA Open Model License |
| Training Data | 25T tokens (including 3T new unique tokens); 10.6T tokens with 33% synthetic data for math, code, and tool calling |
| Function Calling | Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |
Pricing
💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.04 |
| Output Tokens | $0.22 |
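At the listed rates, per-request cost is easy to estimate. The helper below is an illustrative sketch (the function name and the example token counts are not part of the API); actual billed tokens include any reasoning-trace tokens counted as output.

```python
# Illustrative cost estimate at the listed pay-per-token rates.
INPUT_PRICE_PER_M = 0.04   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.22  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a 200K-token document summarized into a 2K-token answer:
print(f"${estimate_cost(200_000, 2_000):.4f}")  # → $0.0084
```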
Quickstart
Prerequisites
- Create a free account at platform.qubrid.com
- Generate your API key from the API Keys section
- Replace `QUBRID_API_KEY` in the code below with your actual key
💡 Reasoning mode: By default, chain-of-thought reasoning is enabled (`enable_reasoning=true`). Use `thinking_budget` to cap the reasoning token budget and manage inference cost.
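Since the API is OpenAI-compatible, a minimal Python sketch looks like the following. The base URL, the `/chat/completions` route, and the exact names of the reasoning fields (`enable_reasoning`, `thinking_budget`) are assumptions here; confirm them against the Qubrid docs.

```python
import json
import os
import urllib.request

# Assumed endpoint for an OpenAI-compatible API; verify in the Qubrid docs.
BASE_URL = "https://platform.qubrid.com/v1"
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

def build_payload(prompt: str, thinking_budget: int = 16384) -> dict:
    """Assemble an OpenAI-style chat-completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 8192,
        "enable_reasoning": True,        # assumed field name
        "thinking_budget": thinking_budget,  # assumed field name
    }

payload = build_payload("Write a short story about a robot learning to paint")

api_key = os.environ.get("QUBRID_API_KEY")
if api_key:  # only hit the network when a key is configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the payload shape matches the OpenAI chat format, existing OpenAI SDK code should also work by swapping only the base URL and model ID.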
Playground Features
The Qubrid AI Playground lets you interact with Nemotron-3 Nano 30B directly in your browser — no setup, no code, no cost to explore.

🧠 System Prompt
Define the model’s reasoning mode, role, and output constraints before the conversation begins — essential for agentic pipelines, tool-use orchestration, and long-context analysis tasks. Set your system prompt once in the Qubrid Playground and it applies across every turn of the conversation.
🎯 Few-Shot Examples
Guide the model’s reasoning depth and output format with concrete examples — especially effective for structured outputs, tool calls, and STEM reasoning tasks.

| User Input | Assistant Response |
|---|---|
| What is the integral of x² from 0 to 3? | ∫₀³ x² dx = [x³/3]₀³ = (27/3) - (0/3) = 9 |
| Debug: my Python list comprehension returns empty — `[x for x in data if x > 10]` | Check if `data` is empty or all values are ≤ 10. Also verify data types — if elements are strings, the comparison `x > 10` won't filter numerically. Try `print(type(data[0]))` to confirm. |
💡 Stack multiple few-shot examples in the Qubrid Playground to shape reasoning style, output format, and domain focus — no fine-tuning required.
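The same few-shot pattern works programmatically. A sketch in the OpenAI chat format (which the API is described as compatible with) — the system prompt, examples, and final query here are illustrative:

```python
# Few-shot prompting: worked examples precede the real query so the model
# imitates their reasoning style and output format.
messages = [
    {"role": "system",
     "content": "You are a concise STEM tutor. Show each step briefly."},
    # Example pair 1: terse worked calculus answer
    {"role": "user", "content": "What is the integral of x² from 0 to 3?"},
    {"role": "assistant", "content": "∫₀³ x² dx = [x³/3]₀³ = 27/3 - 0 = 9"},
    # Example pair 2: debugging style
    {"role": "user",
     "content": "Debug: [x for x in data if x > 10] returns empty"},
    {"role": "assistant",
     "content": "Check that data is non-empty and numeric; string elements "
                "won't compare to 10 numerically."},
    # The real query comes last and inherits the demonstrated style.
    {"role": "user", "content": "What is the derivative of x³ at x = 2?"},
]
```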
Inference Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 0.3 | Controls randomness. Higher values mean more creative but less predictable output |
| Max Tokens | number | 8192 | Maximum number of tokens to generate in the response |
| Top P | number | 1 | Nucleus sampling: considers tokens with top_p probability mass |
| Enable Reasoning | boolean | true | Enable chain-of-thought reasoning traces before final response |
| Thinking Budget | number | 16384 | Maximum tokens for reasoning traces. Controls inference cost and reasoning depth |
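One practical way to use these parameters is to switch between a cheap low-latency profile and a deep-reasoning profile per task. The preset names and the routing rule below are purely illustrative, not part of the API; the parameter keys follow the table above.

```python
# Hypothetical request-parameter presets built from the table above.
PRESETS = {
    "fast": {
        "stream": True,
        "temperature": 0.3,
        "max_tokens": 2048,
        "enable_reasoning": False,   # skip reasoning traces for simple chat
    },
    "deep": {
        "stream": True,
        "temperature": 0.3,
        "max_tokens": 8192,
        "enable_reasoning": True,
        "thinking_budget": 16384,    # cap reasoning tokens to bound cost
    },
}

def params_for(task: str) -> dict:
    """Pick a preset: deep reasoning for math/code/agentic work, fast otherwise."""
    return PRESETS["deep" if task in {"math", "code", "agentic"} else "fast"]
```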
Use Cases
- Agentic AI systems and multi-agent orchestration
- Complex reasoning and problem-solving tasks
- Code generation, debugging, and optimization
- Function calling and tool integration
- Long-document analysis and RAG applications
- Mathematical reasoning and STEM tasks
- Instruction following and task automation
- Enterprise chatbots with reasoning capabilities
- Financial analysis and decision support
- Software development assistants
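For the function-calling use cases above, tools are typically declared in the OpenAI function-calling schema, which OpenAI-compatible endpoints generally accept. The tool name and its fields below are hypothetical, chosen only to show the shape of the declaration:

```python
# Hypothetical tool declaration in the OpenAI function-calling schema.
# The model returns a structured call (name + JSON arguments) instead of
# prose when it decides the tool is needed.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",          # illustrative tool name
        "description": "Look up the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "e.g. NVDA"},
            },
            "required": ["ticker"],
        },
    },
}]
```

The `tools` list is passed alongside `messages` in the chat-completion request; your code then executes the returned call and feeds the result back as a `tool` role message.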
Strengths & Limitations
| Strengths | Limitations |
|---|---|
| Hybrid Mamba-2 + Transformer MoE for optimal efficiency | Requires 32GB+ VRAM for FP8, 60GB+ for BF16 self-hosting |
| 3.3× faster inference than Qwen3-30B-A3B with better accuracy | Hybrid architecture less tested in production than pure transformers |
| Only 3.2B active parameters from 31.6B total (10% activation) | May underperform on vanilla MMLU vs harder benchmark variants |
| 1M token context window for long-horizon tasks | FlashInfer backend requires CUDA toolkit for JIT compilation |
| Configurable reasoning ON/OFF modes | New architecture may have limited community tooling support |
| Thinking budget control for predictable inference costs | |
| Native tool calling and function execution | |
| FP8 quantization for reduced memory and faster inference | |
| State-of-the-art on SWE-Bench, GPQA Diamond, AIME 2025 | |
| Fully open: weights, datasets, and training recipes available | |
Why Qubrid AI?
- 🚀 No infrastructure setup — 31.6B MoE served serverlessly at just $0.04/1M input tokens
- 🔁 OpenAI-compatible — drop-in replacement using the same SDK, just swap the base URL
- 🧠 Reasoning budget control — tune `thinking_budget` to balance depth vs. latency directly in the API
- 🧪 Built-in Playground — prototype with system prompts and few-shot examples instantly at platform.qubrid.com
- 📊 Full observability — API logs and usage tracking built into the Qubrid dashboard
- 🌐 Multi-language support — Python, JavaScript, Go, cURL out of the box
Resources
| Resource | Link |
|---|---|
| 📖 Qubrid Docs | docs.platform.qubrid.com |
| 🎮 Playground | Try Nemotron-3 Nano 30B live |
| 🔑 API Keys | Get your API Key |
| 🤗 Hugging Face | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| 💬 Discord | Join the Qubrid Community |
Built with ❤️ by Qubrid AI
Frontier models. Serverless infrastructure. Zero friction.