
About the Provider

NVIDIA is a global leader in AI computing and accelerated hardware, known for its GPUs and enterprise AI platforms. Through its NeMo and research initiatives, NVIDIA develops advanced models to enable reasoning, tool orchestration, and scalable AI workflows for developers and enterprises.

Model Quickstart

This section helps you quickly get started with the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model on the Qubrid AI inferencing platform. To use this model, you need:
  • A valid Qubrid API key
  • Access to the Qubrid inference API
  • Basic knowledge of making API requests in your preferred language
Once authenticated with your API key, you can send inference requests to the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model and receive responses based on your input prompts. The example below shows how to access the model from Python using the OpenAI-compatible client; adapt it to whichever language or HTTP client best fits your workflow.
from openai import OpenAI

# Initialize the OpenAI-compatible client with the Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",  # replace with your Qubrid API key
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    max_tokens=8192,
    temperature=0.3,
    top_p=1,
    stream=True
)

# Print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# If you set stream=False above, the call returns a single completion
# object rather than an iterator; print it directly instead of looping:
# print(stream.choices[0].message.content)
This will produce a response similar to the one below:
<think>
The user wants a simple explanation of quantum computing. Let me break this down 
into digestible concepts — classical bits vs qubits, superposition, entanglement, 
and practical applications...
</think>

Quantum computing is a fundamentally different way of processing information.

**Classical vs Quantum:**
- Classical computers use bits that are either 0 or 1
- Quantum computers use qubits that can be 0, 1, or both simultaneously (superposition)

**Key principles:**
- Superposition: A qubit exists in multiple states at once until measured
- Entanglement: Two qubits can be linked so the state of one instantly affects the other
- Interference: Used to amplify correct answers and cancel incorrect ones

**Why it matters:**
Quantum computers can solve certain problems exponentially faster than classical 
computers — particularly in cryptography, drug discovery, and optimization tasks 
that would take classical computers thousands of years.
...
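When reasoning is enabled, the model emits its chain of thought inside `<think>` tags before the final answer, as in the sample above. If you only want the final answer, you can strip that block client-side; a minimal sketch, assuming the response uses the `<think>...</think>` format shown:

```python
import re

def strip_reasoning(text: str) -> str:
    """Remove a leading <think>...</think> block from a model response."""
    # DOTALL lets the pattern span multi-line reasoning traces
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

raw = "<think>\nLet me break this down...\n</think>\n\nQuantum computing is a fundamentally different way of processing information."
print(strip_reasoning(raw))
# → "Quantum computing is a fundamentally different way of processing information."
```

This keeps the reasoning trace available for logging or debugging while showing end users only the final answer.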

Model Overview

Nemotron 3 Nano 30B-A3B is NVIDIA’s flagship open reasoning model, built on a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture.
  • With 31.6B total parameters but only 3.2B active per forward pass, it delivers up to 3.3× higher throughput than comparable models.
  • It achieves state-of-the-art accuracy on reasoning, coding, and agentic benchmarks including SWE-Bench, GPQA Diamond, and AIME 2025.
  • The model supports up to 262K token context length and features configurable reasoning depth with thinking budget control.

Model at a Glance

| Feature | Details |
| --- | --- |
| Model ID | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Provider | NVIDIA |
| Architecture | Hybrid Mamba-Transformer MoE with 23 Mamba-2 layers, 23 MoE layers (128 experts, 6 active), and 6 GQA attention layers |
| Model Size | 31.6B total / 3.2B active |
| Context Length | 262K tokens |
| Release Date | December 15, 2025 |
| License | NVIDIA Open Model License |
| Training Data | 25T tokens including 3T new unique tokens; 10.6T total with 33% synthetic data for math, code, and tool-calling |
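The efficiency story follows directly from the parameter counts above: because the MoE router activates only 6 of 128 experts, roughly one tenth of the weights participate in each forward pass. A quick check of the arithmetic:

```python
# Parameter counts from the table above, in billions
total_params = 31.6
active_params = 3.2

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per forward pass")
# → "10.1% of parameters active per forward pass"
```

This is the basis for the "only 10% of total parameters per forward pass" throughput claim made later in this page.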

When to use?

You should consider using Nemotron 3 Nano 30B-A3B if:
  • You need agentic AI systems and multi-agent orchestration
  • Your application requires complex reasoning and problem-solving tasks
  • You are working on code generation, debugging, and optimization
  • You need function calling and tool integration
  • Your use case involves long-document analysis and RAG applications
  • You are solving mathematical reasoning and STEM tasks
  • You need enterprise chatbots with deep reasoning capabilities
  • Your application requires financial analysis and decision support

Inference Parameters

| Parameter Name | Type | Default | Description |
| --- | --- | --- | --- |
| Streaming | boolean | true | Enable streaming responses for real-time output. |
| Temperature | number | 0.3 | Controls randomness; higher values produce more creative but less predictable output. |
| Max Tokens | number | 8192 | Maximum number of tokens to generate in the response. |
| Top P | number | 1 | Nucleus sampling: considers only tokens within the top_p probability mass. |
| Enable Reasoning | boolean | true | Emit chain-of-thought reasoning traces before the final response. |
| Thinking Budget | number | 16384 | Maximum tokens for reasoning traces; controls inference cost and reasoning depth. |
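Taken together, the defaults above correspond to a request body like the sketch below. Note that the field names for the reasoning controls (`enable_reasoning`, `thinking_budget`) are assumptions for illustration, not confirmed API keys; check the Qubrid API reference for the exact names your endpoint expects.

```python
import json

# Request payload built from the default inference parameters above.
# NOTE: "enable_reasoning" and "thinking_budget" are assumed field names
# passed as extra (non-standard) body fields; verify against the Qubrid docs.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "stream": True,
    "temperature": 0.3,
    "max_tokens": 8192,
    "top_p": 1,
    "enable_reasoning": True,
    "thinking_budget": 16384,
}

print(json.dumps(payload, indent=2))
```

Lowering `thinking_budget` is the main lever for trading reasoning depth against latency and token cost.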

Key Features

  • Hybrid Mamba-Transformer MoE Architecture: Combines Mamba-2 layers, MoE layers, and GQA attention for optimal efficiency and reasoning performance.
  • 3.3× Faster Inference: Delivers higher throughput than Qwen3-30B-A3B with better accuracy, using only 10% of total parameters per forward pass.
  • 262K Token Context Window: Supports long-horizon tasks including long-document analysis, RAG pipelines, and multi-turn conversations.
  • Configurable Reasoning Depth: Supports reasoning ON/OFF modes with thinking budget control for predictable inference costs.
  • Native Tool Calling: Built-in function calling with schema validation and tool-integrated reasoning.
  • FP8 Quantization: Reduces memory requirements and enables faster inference on supported hardware.
  • Fully Open: Weights, datasets, and training recipes are publicly available.
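To use the native tool calling mentioned above, you pass tool definitions in the OpenAI function-calling schema alongside your messages. A minimal sketch, where `get_weather` is a hypothetical example tool (not part of any real API):

```python
import json

# A hypothetical tool definition in the OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Pass tools=tools to client.chat.completions.create(...). When the model
# decides to call a tool, the response includes a tool_calls entry whose
# arguments arrive as a JSON string you parse before dispatching:
args = json.loads('{"city": "Paris"}')
print(args["city"])
```

Your application executes the tool itself and returns the result to the model in a follow-up `tool` role message.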

Benchmark Performance

| Benchmark | Performance |
| --- | --- |
| SWE-Bench | State-of-the-art |
| GPQA Diamond | State-of-the-art |
| AIME 2025 | State-of-the-art |

Delivers 3.3× faster inference than Qwen3-30B-A3B with better accuracy across reasoning, coding, and agentic benchmarks.

Limitations

  • VRAM Requirements: Requires 32GB+ VRAM for FP8 and 60GB+ for BF16 deployments.
  • Architecture Maturity: Hybrid architecture is less tested in production than pure transformer models.
  • MMLU Performance: May underperform on vanilla MMLU compared to harder benchmark variants.
  • FlashInfer Dependency: FlashInfer backend requires CUDA toolkit for JIT compilation.
  • Community Tooling: New architecture may have limited community tooling support compared to established models.

Summary

Nemotron 3 Nano 30B-A3B is NVIDIA’s flagship open reasoning model built for high-throughput, accurate inference across reasoning, coding, and agentic tasks.
  • Its hybrid Mamba-Transformer MoE architecture delivers 3.3× faster inference than comparable models with only 3.2B active parameters.
  • The model supports up to 262K token context, configurable reasoning depth, native tool calling, and FP8 quantization.
  • It achieves state-of-the-art results on SWE-Bench, GPQA Diamond, and AIME 2025 benchmarks.
  • Weights, datasets, and training recipes are fully open under the NVIDIA Open Model License.