Published Research · AI/ML

Supernova

A 650M parameter large language model built from scratch - architecture, tokenizer, and training - proving that efficiency can rival scale.

Supernova Logo

Role

Architecture & Training

Technology

PyTorch, Amazon SageMaker AI

Timeline

6 months

Status

Published on arXiv

The Challenge

The AI industry has a scaling obsession. Bigger models, more data, exponentially more compute. But we asked a different question: what if you could achieve comparable results with a fraction of the resources?

Supernova began as an academic research project at Ovidius University. The goal wasn't to compete with trillion-parameter giants - it was to prove that thoughtful engineering could close the gap, and that architectural innovation could substitute for brute-force scaling.

We designed every component from scratch: a custom tokenizer optimized for compression efficiency, an architecture combining the latest advances (Rotary Position Embeddings, Grouped Query Attention, SwiGLU), and a training methodology focused on data quality over quantity.

The result? A model that achieves 90.29% of the performance of leading 1B parameter models - with 35% fewer parameters and trained on 100× less data.

Key Innovations

Custom Architecture

Decoder-only transformer with RoPE, Grouped Query Attention, RMSNorm, and SwiGLU - modern components working in synergy.

State-of-the-Art Tokenizer

128K vocabulary byte-level BPE tokenizer achieving 4.78 characters per token - outperforming GPT-4 and LLaMA tokenizers.

Exceptional Efficiency

Trained on 100B tokens - up to 360× less data than comparable models - proving that quality beats quantity.
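The 4.78 characters-per-token figure above is a standard compression metric: total input characters divided by the number of tokens produced. Supernova's tokenizer itself isn't reproduced here, so this minimal sketch computes the metric with a toy whitespace splitter standing in for the real byte-level BPE model, just to make the snippet runnable.

```python
def chars_per_token(texts, tokenize):
    """Compression efficiency: total characters / total tokens produced."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

# Toy whitespace "tokenizer" stands in for the real 128K-vocab BPE model.
sample = ["Supernova is a 650M parameter language model."]
ratio = chars_per_token(sample, str.split)
```

A higher ratio means fewer tokens per document, which directly reduces both training and inference cost for a fixed amount of text.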

Zero-Shot Performance

Benchmark Comparison

Results across ten standard zero-shot benchmarks, comparing Supernova with similarly sized open models.

| Benchmark | Supernova | Qwen3-0.6B | Llama 3.2 1B | Gemma 3 1B | OpenELM 1.1B |
| --- | --- | --- | --- | --- | --- |
| HellaSwag | 48.18 | 47.31 | 63.56 | 62.06 | **64.81** |
| WinoGrande | 54.06 | 55.41 | 59.83 | 59.04 | **61.72** |
| ARC-E | 60.98 | 60.40 | 65.36 | **71.89** | 62.37 |
| ARC-C | 32.42 | 34.04 | 36.26 | **38.14** | 32.34 |
| PIQA | 71.38 | 67.63 | 74.59 | 74.65 | **75.57** |
| SuperGLUE | 56.13 | 52.14 | 55.50 | **57.60** | 57.30 |
| MMLU | 26.73 | **40.24** | 36.93 | 25.11 | 25.52 |
| MMLU-PRO | 10.31 | **26.49** | 10.90 | 8.99 | 9.48 |
| SIQA | **43.44** | 39.25 | 42.78 | 42.94 | 42.84 |
| BBH | 27.33 | **40.49** | 31.59 | 27.31 | 16.85 |
| Average | 43.09 | 46.34 | **47.73** | 46.77 | 44.88 |

Supernova reaches 90.29% of Llama 3.2 1B average performance with 35% fewer parameters. Best score per benchmark is bolded.
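The headline 90.29% figure is simply the ratio of the two models' benchmark averages. A quick check from the per-benchmark scores in the table:

```python
# Per-benchmark scores from the comparison table, in row order.
supernova = [48.18, 54.06, 60.98, 32.42, 71.38, 56.13, 26.73, 10.31, 43.44, 27.33]
llama_1b  = [63.56, 59.83, 65.36, 36.26, 74.59, 55.50, 36.93, 10.90, 42.78, 31.59]

def avg(xs):
    return sum(xs) / len(xs)

# Relative performance: Supernova's average as a percentage of Llama 3.2 1B's.
rel = avg(supernova) / avg(llama_1b) * 100  # ≈ 90.29
```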

The Architecture

Supernova combines modern transformer innovations that work synergistically. Rotary Position Embeddings for efficient position encoding. Grouped Query Attention with 3:1 compression to reduce memory bandwidth. RMSNorm for faster normalization. SwiGLU activations for improved gradient flow.

Each component was chosen not just for individual merit, but for how they amplify each other's benefits. The result is a model that punches well above its weight class.
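Supernova's code isn't reproduced here, but the key attention idea can be illustrated. The following is a minimal PyTorch sketch of Grouped Query Attention, assuming the 3:1 compression means the 12 query heads share 4 key/value heads (following the specification table below); RoPE and causal masking are omitted for brevity.

```python
import torch

def grouped_query_attention(x, wq, wk, wv, wo, n_q_heads=12, n_kv_heads=4):
    """Illustrative GQA: 12 query heads share 4 KV heads (3:1 compression).

    A sketch for exposition, not Supernova's released implementation.
    """
    B, T, D = x.shape            # batch, sequence, embedding dim (1536)
    hd = D // n_q_heads          # head dim = 128
    q = (x @ wq).view(B, T, n_q_heads, hd).transpose(1, 2)   # (B, 12, T, 128)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)  # (B, 4, T, 128)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Each KV head serves 3 query heads: repeat along the head axis.
    k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)  # (B, 12, T, 128)
    v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    att = (q @ k.transpose(-2, -1)) / hd ** 0.5
    out = (att.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, T, D)
    return out @ wo

D, hd = 1536, 128
x = torch.randn(2, 8, D)
wq = torch.randn(D, 12 * hd) * 0.02
wk = torch.randn(D, 4 * hd) * 0.02
wv = torch.randn(D, 4 * hd) * 0.02
wo = torch.randn(12 * hd, D) * 0.02
y = grouped_query_attention(x, wq, wk, wv, wo)
```

The memory-bandwidth win comes from the K and V projections being a third the size of the query projection, which shrinks both the weight matrices and the inference-time KV cache.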

Model Specifications

Transformer Blocks: 16 layers
Attention Heads: 12 heads
Embedding Dimension: 1,536
Context Length: 2,048 tokens
Vocabulary Size: 128,000 tokens
GQA Compression: 3:1 ratio
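The specifications above can be collected into a small config object. As a sketch, here is one way to express them and to quantify what the 3:1 GQA ratio buys at inference time (the class name and helper are illustrative, not from the released code):

```python
from dataclasses import dataclass

@dataclass
class SupernovaConfig:
    # Values taken from the specification table above.
    n_layers: int = 16
    n_heads: int = 12
    d_model: int = 1536
    context_length: int = 2048
    vocab_size: int = 128_000
    gqa_ratio: int = 3  # query heads per shared KV head

    @property
    def n_kv_heads(self) -> int:
        return self.n_heads // self.gqa_ratio  # 4 KV heads

    def kv_cache_floats_per_token(self) -> int:
        """K and V entries cached per generated token, across all layers."""
        head_dim = self.d_model // self.n_heads  # 128
        return 2 * self.n_layers * self.n_kv_heads * head_dim

cfg = SupernovaConfig()
# With 4 KV heads instead of 12, the KV cache is 3x smaller per token.
```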

The Training

Training was conducted on a cluster of 8 NVIDIA A100 GPUs, sponsored by AWS for academic research. We achieved 54% Model FLOPs Utilization - indicating efficient use of available compute.

The training corpus was carefully curated from Nemotron-CC: 100B tokens combining high-quality web data, synthetic question-answer pairs, and distilled content from larger models. Every partition passed through rigorous filtering for quality, safety, and linguistic coherence.

Total training time: 350 hours. Total cost: under $10,000 - compared to the millions typically required for models in this performance range. This efficiency wasn't accidental; it was the entire point.
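Model FLOPs Utilization compares the FLOPs the model actually consumed against the hardware's theoretical peak. A common back-of-envelope sketch uses the ~6·N·D approximation for dense-transformer training FLOPs; the paper's 54% figure will depend on its exact FLOPs accounting and the precision-specific peak it assumes, so the numbers below are purely hypothetical.

```python
def model_flops_utilization(params, tokens, train_seconds, n_gpus,
                            peak_flops_per_gpu):
    """MFU = achieved model FLOPs/s divided by aggregate hardware peak.

    Uses the common ~6 * N * D approximation for forward + backward
    training FLOPs of a dense transformer with N params on D tokens.
    """
    achieved_flops_per_sec = 6 * params * tokens / train_seconds
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical run: 1B params, 75B tokens, 100 hours on 8 GPUs
# with a 312 TFLOP/s peak per GPU.
mfu = model_flops_utilization(1e9, 75e9, 100 * 3600, 8, 312e12)  # ≈ 0.50
```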

By the numbers:

8× A100 40GB GPUs
100B tokens (Nemotron-CC)
54% MFU efficiency

The Results

650M Parameters
100B Training Tokens
90% of 1B Model Performance
350h Training Time

Why It Matters

Supernova isn't about building the biggest model. It's about proving that the race to scale isn't the only path forward.

When training costs drop from millions to thousands, AI becomes accessible. When inference is 40% cheaper, deployment becomes sustainable. When you need 100× less data, you can iterate faster.

The future of AI isn't just about what's possible - it's about what's practical. Supernova is our contribution to that future.

Building something with AI?

We bring research-grade expertise to production AI systems. Let's talk about your project.

Start a Conversation