The Story
DocSense started from a practical gap in Romanian NLP: model quality is constrained by dataset quality and compute availability. The framing is simple: before model architecture matters, data discipline matters.
Valentina designed the project around a full lifecycle workflow. Data was collected for diversity, cleaned for consistency, and annotated for multi-task objectives. A critical engineering step was repairing encoding and normalization issues before training; otherwise, linguistic signal quality degrades immediately.
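Romanian text suffers from a well-known encoding problem of exactly this kind: legacy documents use cedilla codepoints (ş, ţ) where the correct letters carry a comma below (ș, ț), and combining marks may arrive decomposed. A minimal sketch of such a repair step, assuming Python's standard `unicodedata` module (the actual DocSense pipeline is not shown in the source):

```python
import unicodedata

# Legacy cedilla codepoints (common in older Romanian text) mapped to the
# correct comma-below letters.
CEDILLA_FIX = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})

def normalize_ro(text: str) -> str:
    """Compose combining marks (NFC), then repair cedilla/comma-below confusion."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_FIX)

print(normalize_ro("ace\u015Fti pa\u015Fi, \u0163ar\u0103"))  # -> "acești pași, țară"
```

Running normalization before tokenization keeps "ș" from splitting into two vocabulary entries for the same letter, which is one way raw encoding noise silently degrades linguistic signal.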
Instead of betting on one model for every task, DocSense uses a layered strategy: mBART-large-50 for generation tasks, Romanian BERT for sentiment interpretation, and XLM-RoBERTa for Romanian-specialized contextual question answering.
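The layered strategy amounts to a task-to-checkpoint routing table. A minimal sketch, where the checkpoint identifiers and task names are illustrative assumptions, not confirmed DocSense configuration:

```python
# Illustrative routing table: each task maps to the model family named in
# the text. Checkpoint names are plausible Hugging Face IDs, assumed here.
TASK_ROUTES = {
    "summarization": "facebook/mbart-large-50",                      # generation
    "simplification": "facebook/mbart-large-50",                     # generation
    "sentiment": "dumitrescustefan/bert-base-romanian-cased-v1",     # Romanian BERT
    "question_answering": "deepset/xlm-roberta-large-squad2",        # XLM-RoBERTa QA
}

def route(task: str) -> str:
    """Return the checkpoint to load for a given task; reject unknown tasks."""
    try:
        return TASK_ROUTES[task]
    except KeyError:
        raise ValueError(f"unsupported task: {task}") from None

print(route("sentiment"))  # -> "dumitrescustefan/bert-base-romanian-cased-v1"
```

Keeping the routing explicit means each task gets a model matched to its strengths, and a new task can be added without retraining or disturbing the others.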
The product goal is direct: make complicated documents easier to understand without dumbing down the information itself.

