
DocSense

A document intelligence platform that turns complex Romanian text into clear, usable insight through summarization, sentiment analysis, and contextual Q&A.


Category

Romanian NLP Platform

Product

DocSense (macOS AI Assistant)

Pipeline

Custom Dataset + Multi-Model Stack

Status

Research Prototype + Live Demo

The Story

DocSense started from a practical gap in Romanian NLP: model quality is constrained by dataset quality and compute availability. The guiding principle is simple: before model architecture matters, data discipline matters.

Valentina designed the project around a full lifecycle workflow. Data was collected for diversity, cleaned for consistency, and annotated for multi-task objectives. A critical engineering step was repairing encoding and normalization issues before training; otherwise, linguistic signal quality degrades from the start.
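The exact repair rules in DocSense's pipeline are not published, but two fixes are standard for Romanian corpora and illustrate the kind of cleanup this step involves: undoing UTF-8-read-as-cp1252 mojibake, and replacing the legacy cedilla diacritics with the correct comma-below forms.

```python
# Illustrative sketch of Romanian encoding repair (not DocSense's actual code).
# Legacy 8-bit encodings used s/t WITH CEDILLA; correct Romanian orthography
# uses s/t WITH COMMA BELOW. Map the four legacy code points to the right ones.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş -> ș
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0163": "\u021B",  # ţ -> ț
    "\u0162": "\u021A",  # Ţ -> Ț
})

def repair_text(text: str) -> str:
    """Undo UTF-8-decoded-as-cp1252 mojibake, then normalize diacritics."""
    try:
        # If the string survives a cp1252 round-trip as valid UTF-8, it was
        # almost certainly double-decoded; recover the original characters.
        text = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # already clean, or not cp1252 mojibake: leave it alone
    return text.translate(CEDILLA_TO_COMMA)
```

For example, `repair_text("È™coalÄƒ")` recovers `"școală"`, and `repair_text("ştiinţă")` normalizes the cedilla forms to `"știință"`, while already-clean text passes through unchanged.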

Instead of betting on one model for every task, DocSense uses a layered strategy: mBART-large-50 for generation tasks, Romanian BERT for sentiment interpretation, and XLM-RoBERTa for contextual question answering with Romanian specialization.
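The glue code that routes tasks to models is not published; the sketch below is a hypothetical dispatcher showing how the task-to-model mapping described above could be wired, with each model path injected as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DocSenseStack:
    """Hypothetical routing layer; the mapping mirrors the text above."""
    generator: Callable[[str, str], str]  # mBART-large-50: summary/title/keywords
    sentiment: Callable[[str], str]       # Romanian BERT: sentiment labels
    qa: Callable[[str, str], str]         # XLM-RoBERTa: contextual QA

    def run(self, task: str, document: str, question: Optional[str] = None) -> str:
        if task in {"summary", "title", "keywords"}:
            return self.generator(task, document)
        if task == "sentiment":
            return self.sentiment(document)
        if task == "qa":
            if question is None:
                raise ValueError("QA requires a question")
            return self.qa(question, document)
        raise ValueError(f"unknown task: {task}")
```

Keeping the model paths behind plain callables means each one can be a Hugging Face pipeline in production and a stub in tests, without the router knowing the difference.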

The product goal is direct: make complicated documents easier to understand without dumbing down the information itself.

Model Architecture

Custom Romanian Dataset Engineering

The dataset pipeline was built from first principles: collection, cleaning, normalization, encoding repair, and annotation for downstream NLP tasks.

mBART Multi-Task Generation

One model handles three high-value actions on long documents: summarization, title generation, and keyword extraction.
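The source does not document how DocSense conditions one mBART model on three tasks. A common approach for multi-task seq2seq fine-tuning is a task control prefix prepended to the input at both training and inference time; the Romanian prefix strings below are purely illustrative.

```python
# Assumed task prefixes (T5-style conditioning); not confirmed by the source.
TASK_PREFIXES = {
    "summary": "rezuma: ",          # summarize
    "title": "titlu: ",             # generate a title
    "keywords": "cuvinte-cheie: ",  # extract keywords
}

def build_input(task: str, document: str) -> str:
    """Prepend the control prefix the fine-tuned model was trained to expect."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unsupported task: {task}")
    return TASK_PREFIXES[task] + document.strip()
```

At inference, the prefixed text would be tokenized and passed to the seq2seq model (for mBART-50, with the Romanian language code set as the target), and the decoded output is the summary, title, or keyword list depending on the prefix.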

BERT Sentiment Layer

A dedicated Romanian BERT path classifies sentiment and emotional signals, adding context beyond purely semantic extraction.
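Post-processing for a classification head like this is straightforward: the model emits one logit per class, softmax converts them to probabilities, and the top class becomes the label. The label set below is an assumption, since the source does not list DocSense's sentiment classes.

```python
import math

# Assumed three-way label set; DocSense's actual classes are not published.
LABELS = ("negativ", "neutru", "pozitiv")

def classify(logits):
    """Softmax over raw class logits, returning (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```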

XLM-RoBERTa Question Answering

Two-stage training strategy: multilingual question-answering foundation first, then Romanian specialization for local linguistic nuance.
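An extractive QA head like XLM-RoBERTa's scores every token as a potential answer start and end; turning those scores into an answer is standard SQuAD-style decoding (not DocSense-specific code): pick the (start, end) pair with the highest combined score, subject to end >= start and a maximum span length.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) token indices of the highest-scoring answer span."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

The selected token indices are then mapped back to character offsets in the original document to produce the answer text shown to the user.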

Data Pipeline

1. Collect representative Romanian text sources with broad topical coverage.

2. Repair encoding problems, normalize structure, and remove noisy artifacts.

3. Annotate target outputs for summarization, titles, keywords, sentiment, and QA.

4. Train and evaluate each model path with task-specific metrics and quality checks.

5. Integrate outputs into a product workflow designed for non-technical end users.
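The stages above are sequential, which makes the whole workflow easy to model as a chain of corpus-to-corpus functions. The stage bodies below are placeholder examples; the real pipeline's internals are not published.

```python
def run_pipeline(raw_docs, stages):
    """Thread the corpus through each processing stage in order."""
    data = raw_docs
    for stage in stages:
        data = stage(data)
    return data

# Placeholder stages standing in for cleaning and filtering steps.
clean = lambda docs: [d.strip() for d in docs]
drop_empty = lambda docs: [d for d in docs if d]
```

Structuring the pipeline this way lets each stage be developed, tested, and rerun independently when upstream data changes.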

Visual Showcase

The main menu of the DocSense macOS app.

By the Numbers

Model Scale

610M parameters (mBART-large-50)

550M parameters (XLM-RoBERTa-large)

110M parameters (Romanian BERT)

~150h approximate training cycle on AWS G5 instances