Supernova
A 650M parameter large language model built from scratch - architecture, tokenizer, and training - proving that efficiency can rival scale.

**Role:** Architecture & Training
**Technology:** PyTorch, Amazon SageMaker AI
**Timeline:** 6 months
**Status:** Published on arXiv
The Challenge
The AI industry has a scaling obsession. Bigger models, more data, exponentially more compute. But we asked a different question: what if you could achieve comparable results with a fraction of the resources?
Supernova began as an academic research project at Ovidius University. The goal wasn't to compete with trillion-parameter giants - it was to prove that thoughtful engineering can close the gap, and that architectural innovation can substitute for brute-force scaling.
We designed every component from scratch: a custom tokenizer optimized for compression efficiency, an architecture combining the latest advances (Rotary Position Embeddings, Grouped Query Attention, SwiGLU), and a training methodology focused on data quality over quantity.
The result? A model that achieves 90.29% of the average benchmark performance of Llama 3.2 1B - with 35% fewer parameters and trained on 100× less data.
Key Innovations
Custom Architecture
Decoder-only transformer with RoPE, Grouped Query Attention, RMSNorm, and SwiGLU - modern components working in synergy.
State-of-the-Art Tokenizer
128K vocabulary byte-level BPE tokenizer achieving 4.78 characters per token - outperforming GPT-4 and LLaMA tokenizers.
Exceptional Efficiency
Trained on 100B tokens - up to 360× less data than comparable models - proving that quality beats quantity.
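The 4.78 characters-per-token figure is a corpus statistic: total characters divided by total tokens emitted. A minimal sketch of that measurement - using a whitespace splitter as a hypothetical stand-in for the real byte-level BPE encoder:

```python
def chars_per_token(texts, encode):
    """Average compression: total characters / total tokens produced."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Illustrative stand-in tokenizer; a real BPE encoder would be plugged in here.
demo_encode = lambda text: text.split()
print(chars_per_token(["the quick brown fox"], demo_encode))  # 19 chars / 4 tokens = 4.75
```

Higher values mean each token carries more text, which directly stretches a fixed context window and training budget further.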
Benchmark Comparison
Results across ten standard zero-shot benchmarks, comparing Supernova with similarly sized open models.
| Benchmark | Supernova | Qwen3-0.6B | Llama 3.2 1B | Gemma 3 1B | OpenELM 1.1B |
|---|---|---|---|---|---|
| HellaSwag | 48.18 | 47.31 | 63.56 | 62.06 | **64.81** |
| WinoGrande | 54.06 | 55.41 | 59.83 | 59.04 | **61.72** |
| ARC-E | 60.98 | 60.40 | 65.36 | **71.89** | 62.37 |
| ARC-C | 32.42 | 34.04 | 36.26 | **38.14** | 32.34 |
| PIQA | 71.38 | 67.63 | 74.59 | 74.65 | **75.57** |
| SuperGLUE | 56.13 | 52.14 | 55.50 | **57.60** | 57.30 |
| MMLU | 26.73 | **40.24** | 36.93 | 25.11 | 25.52 |
| MMLU-PRO | 10.31 | **26.49** | 10.90 | 8.99 | 9.48 |
| SIQA | **43.44** | 39.25 | 42.78 | 42.94 | 42.84 |
| BBH | 27.33 | **40.49** | 31.59 | 27.31 | 16.85 |
| Average | 43.09 | 46.34 | **47.73** | 46.77 | 44.88 |
Supernova reaches 90.29% of Llama 3.2 1B average performance with 35% fewer parameters. Best score per benchmark is bolded.
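The headline ratio follows directly from the table's columns. A quick check, with the Supernova and Llama 3.2 1B scores copied from the rows above:

```python
# Per-benchmark scores in table order: HellaSwag ... BBH.
scores = {
    "Supernova":    [48.18, 54.06, 60.98, 32.42, 71.38, 56.13, 26.73, 10.31, 43.44, 27.33],
    "Llama 3.2 1B": [63.56, 59.83, 65.36, 36.26, 74.59, 55.50, 36.93, 10.90, 42.78, 31.59],
}
avg = {model: sum(vals) / len(vals) for model, vals in scores.items()}
ratio = 100 * avg["Supernova"] / avg["Llama 3.2 1B"]
print(f"{ratio:.2f}%")  # 90.29%
```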
The Architecture
Supernova combines modern transformer innovations that work synergistically: Rotary Position Embeddings for efficient position encoding, Grouped Query Attention with a 3:1 query-to-key/value head ratio to reduce memory bandwidth, RMSNorm for faster normalization, and SwiGLU activations for improved gradient flow.
Each component was chosen not just for individual merit, but for how they amplify each other's benefits. The result is a model that punches well above its weight class.
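The combination described above can be sketched as a single decoder block. The dimensions, head counts, and 4× MLP width below are illustrative placeholders, not the paper's actual configuration; the 12:4 head split is one way to realize the 3:1 GQA ratio:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration only; the 12:4 split gives the 3:1 GQA ratio.
DIM, N_HEADS, N_KV_HEADS, HEAD_DIM = 768, 12, 4, 64

class RMSNorm(nn.Module):
    """Root-mean-square normalization: cheaper than LayerNorm (no mean subtraction)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def rope(x, base=10000.0):
    """Rotary Position Embeddings: rotate channel pairs by position-dependent angles."""
    b, h, t, d = x.shape
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    angles = torch.outer(torch.arange(t, dtype=x.dtype), freqs)  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn_norm = RMSNorm(DIM)
        self.q = nn.Linear(DIM, N_HEADS * HEAD_DIM, bias=False)
        self.kv = nn.Linear(DIM, 2 * N_KV_HEADS * HEAD_DIM, bias=False)
        self.o = nn.Linear(N_HEADS * HEAD_DIM, DIM, bias=False)
        self.mlp_norm = RMSNorm(DIM)
        hidden = 4 * DIM  # placeholder expansion factor
        self.gate = nn.Linear(DIM, hidden, bias=False)  # SwiGLU gate branch
        self.up = nn.Linear(DIM, hidden, bias=False)
        self.down = nn.Linear(hidden, DIM, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.q(h).view(b, t, N_HEADS, HEAD_DIM).transpose(1, 2)
        k, v = self.kv(h).view(b, t, 2, N_KV_HEADS, HEAD_DIM).permute(2, 0, 3, 1, 4)
        q, k = rope(q), rope(k)
        # GQA: each KV head serves 3 query heads, shrinking the KV cache 3x.
        k = k.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        v = v.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o(a.transpose(1, 2).reshape(b, t, -1))
        h = self.mlp_norm(x)
        x = x + self.down(F.silu(self.gate(h)) * self.up(h))  # SwiGLU MLP
        return x

out = Block()(torch.randn(2, 16, DIM))
print(out.shape)  # torch.Size([2, 16, 768])
```

Note how the pieces interlock: RoPE is applied per-head after the query/key projections, and the GQA head sharing happens before attention so the KV cache stored at inference time stays at the compressed size.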
Model Specifications
The Training
Training was conducted on a cluster of 8 NVIDIA A100 GPUs, sponsored by AWS for academic research. We achieved 54% Model FLOPs Utilization - indicating efficient use of available compute.
The training corpus was carefully curated from Nemotron-CC: 100B tokens combining high-quality web data, synthetic question-answer pairs, and distilled content from larger models. Every partition passed through rigorous filtering for quality, safety, and linguistic coherence.
Total training time: 350 hours. Total cost: under $10,000 - compared to the millions typically required for models in this performance range. This efficiency wasn't accidental; it was the entire point.
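Model FLOPs Utilization is the ratio of FLOPs the model actually needed to the theoretical peak the hardware could deliver over the run. A sketch using the common ~6·N·D estimate for decoder-only training FLOPs (the numbers below are illustrative, not the run's actual configuration):

```python
def model_flops_utilization(params, tokens, num_gpus, peak_flops_per_gpu, wall_seconds):
    """MFU = achieved model FLOPs / theoretical peak FLOPs over the run.

    Uses the standard ~6 * params * tokens approximation for forward + backward
    of a dense decoder-only transformer (attention FLOPs ignored).
    """
    achieved = 6 * params * tokens
    available = num_gpus * peak_flops_per_gpu * wall_seconds
    return achieved / available

# Illustrative only: a 1B-param model, 1B tokens, one 1 PFLOP/s GPU, 100 minutes.
print(model_flops_utilization(1e9, 1e9, 1, 1e15, 6000))  # 1.0
```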
The Results
Why It Matters
Supernova isn't about building the biggest model. It's about proving that the race to scale isn't the only path forward.
When training costs drop from millions to thousands, AI becomes accessible. When inference is 40% cheaper, deployment becomes sustainable. When you need 100× less data, you can iterate faster.
The future of AI isn't just about what's possible - it's about what's practical. Supernova is our contribution to that future.