
DocSense

A document intelligence platform that turns complex Romanian text into clear, usable insight through summarization, sentiment analysis, and contextual Q&A.


Category

Romanian NLP Platform

Product

DocSense (macOS AI Assistant)

Pipeline

Custom Dataset + Multi-Model Stack

Status

Research Prototype + Live Demo

The Story

DocSense started from a practical gap in Romanian NLP: model quality is constrained by dataset quality and compute availability. The guiding principle is simple: before model architecture matters, data discipline matters.

Valentina designed the project around a full lifecycle workflow. Data was collected for diversity, cleaned for consistency, and annotated for multi-task objectives. A critical engineering step was repairing encoding and normalization issues before training; otherwise, linguistic signal quality degrades from the start.
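The exact repair rules in DocSense's pipeline are not published, but two fixes are standard for Romanian corpora and illustrate the kind of cleanup this step involves: undoing UTF-8-read-as-cp1252 mojibake, and replacing the legacy cedilla diacritics with the correct comma-below forms.

```python
# Illustrative sketch of Romanian encoding repair (not DocSense's actual code).
# Legacy 8-bit encodings used s/t WITH CEDILLA; correct Romanian orthography
# uses s/t WITH COMMA BELOW. Map the four legacy code points to the right ones.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş -> ș
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0163": "\u021B",  # ţ -> ț
    "\u0162": "\u021A",  # Ţ -> Ț
})

def repair_text(text: str) -> str:
    """Undo UTF-8-decoded-as-cp1252 mojibake, then normalize diacritics."""
    try:
        # If the string survives a cp1252 round-trip as valid UTF-8, it was
        # almost certainly double-decoded; recover the original characters.
        text = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # already clean, or not cp1252 mojibake: leave it alone
    return text.translate(CEDILLA_TO_COMMA)
```

For example, `repair_text("È™coalÄƒ")` recovers `"școală"`, and `repair_text("ştiinţă")` normalizes the cedilla forms to `"știință"`, while already-clean text passes through unchanged.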

Instead of betting on one model for every task, DocSense uses a layered strategy: mBART-large-50 for generation tasks, Romanian BERT for sentiment interpretation, and XLM-RoBERTa for contextual question answering with Romanian specialization.
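The glue code that routes tasks to models is not published; the sketch below is a hypothetical dispatcher showing how the task-to-model mapping described above could be wired, with each model path injected as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DocSenseStack:
    """Hypothetical routing layer; the mapping mirrors the text above."""
    generator: Callable[[str, str], str]  # mBART-large-50: summary/title/keywords
    sentiment: Callable[[str], str]       # Romanian BERT: sentiment labels
    qa: Callable[[str, str], str]         # XLM-RoBERTa: contextual QA

    def run(self, task: str, document: str, question: Optional[str] = None) -> str:
        if task in {"summary", "title", "keywords"}:
            return self.generator(task, document)
        if task == "sentiment":
            return self.sentiment(document)
        if task == "qa":
            if question is None:
                raise ValueError("QA requires a question")
            return self.qa(question, document)
        raise ValueError(f"unknown task: {task}")
```

Keeping the model paths behind plain callables means each one can be a Hugging Face pipeline in production and a stub in tests, without the router knowing the difference.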

The product goal is direct: make complicated documents easier to understand without dumbing down the information itself.

Model Architecture

Custom Romanian Dataset Engineering

The dataset pipeline was built from first principles: collection, cleaning, normalization, encoding repair, and annotation for downstream NLP tasks.

mBART Multi-Task Generation

One model handles three high-value actions on long documents: summarization, title generation, and keyword extraction.
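The source does not document how DocSense conditions one mBART model on three tasks. A common approach for multi-task seq2seq fine-tuning is a task control prefix prepended to the input at both training and inference time; the Romanian prefix strings below are purely illustrative.

```python
# Assumed task prefixes (T5-style conditioning); not confirmed by the source.
TASK_PREFIXES = {
    "summary": "rezuma: ",          # summarize
    "title": "titlu: ",             # generate a title
    "keywords": "cuvinte-cheie: ",  # extract keywords
}

def build_input(task: str, document: str) -> str:
    """Prepend the control prefix the fine-tuned model was trained to expect."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unsupported task: {task}")
    return TASK_PREFIXES[task] + document.strip()
```

At inference, the prefixed text would be tokenized and passed to the seq2seq model (for mBART-50, with the Romanian language code set as the target), and the decoded output is the summary, title, or keyword list depending on the prefix.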

BERT Sentiment Layer

A dedicated Romanian BERT path classifies sentiment and emotional signals, adding context beyond purely semantic extraction.
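Post-processing for a classification head like this is straightforward: the model emits one logit per class, softmax converts them to probabilities, and the top class becomes the label. The label set below is an assumption, since the source does not list DocSense's sentiment classes.

```python
import math

# Assumed three-way label set; DocSense's actual classes are not published.
LABELS = ("negativ", "neutru", "pozitiv")

def classify(logits):
    """Softmax over raw class logits, returning (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```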

XLM-RoBERTa Question Answering

Two-stage training strategy: multilingual question-answering foundation first, then Romanian specialization for local linguistic nuance.
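An extractive QA head like XLM-RoBERTa's scores every token as a potential answer start and end; turning those scores into an answer is standard SQuAD-style decoding (not DocSense-specific code): pick the (start, end) pair with the highest combined score, subject to end >= start and a maximum span length.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) token indices of the highest-scoring answer span."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

The selected token indices are then mapped back to character offsets in the original document to produce the answer text shown to the user.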

Data Pipeline

1. Collect representative Romanian text sources with broad topical coverage.

2. Repair encoding problems, normalize structure, and remove noisy artifacts.

3. Annotate target outputs for summarization, titles, keywords, sentiment, and QA.

4. Train and evaluate each model path with task-specific metrics and quality checks.

5. Integrate outputs into a product workflow designed for non-technical end users.
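The stages above are sequential, which makes the whole workflow easy to model as a chain of corpus-to-corpus functions. The stage bodies below are placeholder examples; the real pipeline's internals are not published.

```python
def run_pipeline(raw_docs, stages):
    """Thread the corpus through each processing stage in order."""
    data = raw_docs
    for stage in stages:
        data = stage(data)
    return data

# Placeholder stages standing in for cleaning and filtering steps.
clean = lambda docs: [d.strip() for d in docs]
drop_empty = lambda docs: [d for d in docs if d]
```

Structuring the pipeline this way lets each stage be developed, tested, and rerun independently when upstream data changes.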

Visual Showcase

The main menu of the DocSense macOS app.

By the Numbers

Model Scale

610M parameters (mBART-large-50)

550M parameters (XLM-RoBERTa-large)

110M parameters (Romanian BERT)

~150h approximate training cycle on AWS G5 instances