About the Provider
NVIDIA is a global leader in AI computing and accelerated hardware, known for its GPUs and enterprise AI platforms. Through its NeMo and research initiatives, NVIDIA develops advanced models that enable reasoning, tool orchestration, and scalable AI workflows for developers and enterprises.
Model Quickstart
This section helps you quickly get started with the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model on the Qubrid AI inferencing platform.
To use this model, you need:
- A valid Qubrid API key
- Access to the Qubrid inference API
- Basic knowledge of making API requests in your preferred language
Once set up, you can send requests to the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model and receive responses based on your input prompts.
Below are example placeholders showing how the model can be accessed from different programming environments. You can choose the one that best fits your workflow.
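As a starting point, here is a minimal Python sketch of such a request. It assumes an OpenAI-style chat-completions endpoint; the base URL (`QUBRID_API_URL`) and the exact request shape are assumptions, so check your Qubrid dashboard and API reference for the real values.

```python
import json
import urllib.request

# Hypothetical endpoint -- replace with the base URL from your Qubrid dashboard.
QUBRID_API_URL = "https://api.qubrid.ai/v1/chat/completions"
API_KEY = "YOUR_QUBRID_API_KEY"


def build_request(prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload for the model."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 8192,
    }


def send_request(payload: dict) -> dict:
    """POST the payload with a bearer token and return the parsed JSON reply."""
    req = urllib.request.Request(
        QUBRID_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_request("Summarize the benefits of Mixture-of-Experts models.")
# response = send_request(payload)  # uncomment once you have a valid API key
print(payload["model"])
```

The same pattern carries over to any language with an HTTP client: build a JSON body naming the model, attach your API key as a bearer token, and POST it to the inference endpoint.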
Model Overview
Nemotron 3 Nano 30B-A3B is NVIDIA’s flagship open reasoning model, featuring a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture.
- With 31.6B total parameters but only 3.2B active per forward pass, it delivers up to 3.3× higher throughput than comparable models.
- It achieves state-of-the-art accuracy on reasoning, coding, and agentic benchmarks including SWE-Bench, GPQA Diamond, and AIME 2025.
- The model supports up to 262K token context length and features configurable reasoning depth with thinking budget control.
Model at a Glance
| Feature | Details |
|---|---|
| Model ID | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Provider | NVIDIA |
| Architecture | Hybrid Mamba-Transformer MoE with 23 Mamba-2 layers, 23 MoE layers (128 experts, 6 active), and 6 GQA attention layers |
| Model Size | 31.6B Total / 3.2B Active |
| Context Length | 262K Tokens |
| Release Date | December 15, 2025 |
| License | NVIDIA Open Model License |
| Training Data | 25T tokens including 3T new unique tokens, 10.6T total with 33% synthetic data for math, code, and tool-calling |
When to Use?
You should consider using Nemotron 3 Nano 30B-A3B if:
- You need agentic AI systems and multi-agent orchestration
- Your application requires complex reasoning and problem-solving tasks
- You are working on code generation, debugging, and optimization
- You need function calling and tool integration
- Your use case involves long-document analysis and RAG applications
- You are solving mathematical reasoning and STEM tasks
- You need enterprise chatbots with deep reasoning capabilities
- Your application requires financial analysis and decision support
Inference Parameters
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output. |
| Temperature | number | 0.3 | Controls randomness. Higher values mean more creative but less predictable output. |
| Max Tokens | number | 8192 | Maximum number of tokens to generate in the response. |
| Top P | number | 1 | Nucleus sampling: considers tokens with top_p probability mass. |
| Enable Reasoning | boolean | true | Enable chain-of-thought reasoning traces before final response. |
| Thinking Budget | number | 16384 | Maximum tokens for reasoning traces. Controls inference cost and reasoning depth. |
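To illustrate how these parameters fit together, the sketch below collects the table's defaults into a single request-options dict. The field names (for example `enable_reasoning` and `thinking_budget`) are assumptions based on the table above; confirm the exact wire names in the Qubrid API reference.

```python
# Hypothetical request options mirroring the parameter table; exact field
# names may differ on the wire -- check the Qubrid API reference.
def build_generation_config(
    stream: bool = True,          # Streaming
    temperature: float = 0.3,     # Temperature
    max_tokens: int = 8192,       # Max Tokens
    top_p: float = 1.0,           # Top P
    enable_reasoning: bool = True,  # Enable Reasoning
    thinking_budget: int = 16384,   # Thinking Budget
) -> dict:
    """Return the table's defaults as a single options dict."""
    return {
        "stream": stream,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "enable_reasoning": enable_reasoning,
        "thinking_budget": thinking_budget,
    }


# Override only what you need; e.g. a shorter reasoning trace for cheap calls.
config = build_generation_config(temperature=0.7, thinking_budget=4096)
print(config["thinking_budget"])
```

Lowering `thinking_budget` caps the tokens spent on reasoning traces, trading some reasoning depth for a lower, more predictable inference cost.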
Key Features
- Hybrid Mamba-Transformer MoE Architecture: Combines Mamba-2 layers, MoE layers, and GQA attention for optimal efficiency and reasoning performance.
- 3.3× Faster Inference: Delivers higher throughput than Qwen3-30B-A3B with better accuracy, using only 10% of total parameters per forward pass.
- 262K Token Context Window: Supports long-horizon tasks including long-document analysis, RAG pipelines, and multi-turn conversations.
- Configurable Reasoning Depth: Supports reasoning ON/OFF modes with thinking budget control for predictable inference costs.
- Native Tool Calling: Built-in function calling with schema validation and tool-integrated reasoning.
- FP8 Quantization: Reduces memory requirements and enables faster inference on supported hardware.
- Fully Open: Weights, datasets, and training recipes are publicly available.
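The native tool calling mentioned above is typically driven by a JSON-schema tool definition attached to the request. The sketch below uses the common OpenAI-style `tools` format as an assumption; `get_weather` is a made-up example function, and the exact format accepted by the platform may differ.

```python
# Hypothetical tool definition in the common JSON-schema "tools" style.
# The function name and schema are illustrative, not part of any real API.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Attach the tool to a chat request; the model can then decide to emit a
# structured get_weather call instead of a plain-text answer.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}
print(payload["tools"][0]["function"]["name"])
```

When the model chooses to call a tool, your application executes the named function with the returned arguments and sends the result back in a follow-up message, letting the model reason over the tool output.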
Benchmark Performance
| Benchmark | Performance |
|---|---|
| SWE-Bench | State-of-the-art |
| GPQA Diamond | State-of-the-art |
| AIME 2025 | State-of-the-art |
Limitations
- VRAM Requirements: Requires 32GB+ VRAM for FP8 and 60GB+ for BF16 deployments.
- Architecture Maturity: Hybrid architecture is less tested in production than pure transformer models.
- MMLU Performance: May underperform on vanilla MMLU compared to harder benchmark variants.
- FlashInfer Dependency: FlashInfer backend requires CUDA toolkit for JIT compilation.
- Community Tooling: New architecture may have limited community tooling support compared to established models.
Summary
Nemotron 3 Nano 30B-A3B is NVIDIA’s flagship open reasoning model built for high-throughput, accurate inference across reasoning, coding, and agentic tasks.
- Its hybrid Mamba-Transformer MoE architecture delivers up to 3.3× faster inference than comparable models with only 3.2B active parameters.
- The model supports up to 262K token context, configurable reasoning depth, native tool calling, and FP8 quantization.
- It achieves state-of-the-art results on SWE-Bench, GPQA Diamond, and AIME 2025 benchmarks.
- Weights, datasets, and training recipes are fully open under the NVIDIA Open Model License.