VDocRAG

Retrieval-Augmented Generation over Visually-Rich Documents

CVPR 2025
1. NTT Human Informatics Laboratories, NTT Corporation 2. Tohoku University


🔥 New RAG Framework: VDocRAG directly understands varied documents and modalities in a unified image format, avoiding the information loss that occurs when conventional text-based RAG parses documents into text.

🔥 New Pretraining Tasks: RCR and RCG compress the entire image representation into a dense token representation by aligning it with the text in documents through retrieval and generation tasks.

🔥 New Dataset: OpenDocVQA is the first unified collection of open-domain DocumentVQA datasets, encompassing a wide range of document types and formats.

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

VDocRAG consists of two main components, both of which effectively leverage the visual features of documents; a minimal pipeline sketch follows the list.

  • VDocRetriever retrieves document images related to the question from a corpus of document images.
  • VDocGenerator uses these retrieved images to generate the answer.
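
Below is a minimal, runnable sketch of this retrieve-then-generate flow. The embed_question, embed_document_image, and generate_answer functions are stand-ins for the VDocRetriever and VDocGenerator models, not the released interfaces; the embedding dimension and top-k value are illustrative assumptions.

  import torch
  import torch.nn.functional as F

  def embed_question(question: str) -> torch.Tensor:
      # Stand-in: VDocRetriever encodes the question and uses the EOS-token
      # hidden state as a dense embedding (a random vector is used here).
      return F.normalize(torch.randn(768), dim=-1)

  def embed_document_image(image) -> torch.Tensor:
      # Stand-in: the same retriever encodes a page image into a single vector.
      return F.normalize(torch.randn(768), dim=-1)

  def generate_answer(question: str, images) -> str:
      # Stand-in: VDocGenerator conditions the LVLM on the retrieved page images.
      return "<answer>"

  def vdoc_rag(question, corpus_images, k=3):
      q = embed_question(question)
      docs = torch.stack([embed_document_image(img) for img in corpus_images])
      scores = docs @ q                        # cosine similarity (unit vectors)
      top_k = scores.topk(k).indices.tolist()  # indices of the k best pages
      retrieved = [corpus_images[i] for i in top_k]
      return generate_answer(question, retrieved)

  print(vdoc_rag("What was the total revenue in 2023?", [f"page{i}.png" for i in range(10)]))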


Self-Supervised Pretraining

The goal of pre-training is to transfer the powerful understanding and generation abilities of LVLMs to visual document retrieval. To this end, we propose two new self-supervised pretraining tasks that compress the entire image representation into the EOS token appended to the input image. During pre-training, the model receives only the document image, and the OCR text extracted from it serves as a pseudo target. The full pre-training objective is the sum of the losses of the two tasks below; a toy computation is sketched after the list.

  • Representation Compression via Retrieval (RCR) is a contrastive learning task that retrieves document images from their corresponding OCR text, leveraging the LVLM's image understanding capabilities.
  • Representation Compression via Generation (RCG) is a representation training strategy that leverages the generative capabilities of LVLMs through a customized attention mask matrix.
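
As a toy illustration of the combined objective, the sketch below computes an in-batch contrastive RCR loss between image EOS embeddings and OCR-text embeddings, plus a generative RCG loss (next-token prediction over the OCR tokens), and sums them. The tensor shapes, temperature, and random stand-in logits are illustrative assumptions, not values from the paper.

  import torch
  import torch.nn.functional as F

  batch, dim, vocab, seq = 8, 768, 32000, 16
  img_eos = F.normalize(torch.randn(batch, dim), dim=-1)  # image EOS embeddings
  txt_eos = F.normalize(torch.randn(batch, dim), dim=-1)  # OCR-text embeddings

  # RCR: the matching OCR text is the positive; other texts in the batch are negatives.
  temperature = 0.05
  logits = img_eos @ txt_eos.T / temperature
  loss_rcr = F.cross_entropy(logits, torch.arange(batch))

  # RCG: next-token prediction of the OCR text from the compressed image
  # representation (random logits stand in for the LVLM output).
  lm_logits = torch.randn(batch, seq, vocab)
  ocr_tokens = torch.randint(0, vocab, (batch, seq))
  loss_rcg = F.cross_entropy(lm_logits.reshape(-1, vocab), ocr_tokens.reshape(-1))

  loss = loss_rcr + loss_rcg  # full pre-training objective: sum of the two losses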

After pre-training, we first fine-tune the VDocRetriever with a contrastive learning objective on query-document pairs with in-batch negatives. Then, we apply the trained VDocRetriever to search the corpus and feed the top-k documents into the VDocGenerator. Finally, we train the VDocGenerator with next-token prediction; a sketch of the generator training loss follows.
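
The sketch below illustrates the generator training step: next-token prediction on the answer tokens only, with the question and retrieved-document positions masked out of the loss. The -100 ignore index follows the common PyTorch convention and is an implementation assumption, as are the toy shapes and random logits.

  import torch
  import torch.nn.functional as F

  vocab, prompt_len, answer_len = 32000, 32, 8

  # Token ids for the prompt (retrieved page images + question) and the gold answer.
  prompt_ids = torch.randint(0, vocab, (1, prompt_len))
  answer_ids = torch.randint(0, vocab, (1, answer_len))
  input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

  # Only answer positions contribute to the loss; prompt positions get -100.
  labels = torch.cat([torch.full_like(prompt_ids, -100), answer_ids], dim=1)

  # Random stand-in for the LVLM's next-token logits over the whole sequence.
  logits = torch.randn(1, input_ids.size(1), vocab)

  # Standard causal-LM shift: predict token t+1 from positions up to t.
  loss = F.cross_entropy(
      logits[:, :-1].reshape(-1, vocab),
      labels[:, 1:].reshape(-1),
      ignore_index=-100,
  )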

OpenDocVQA Dataset

OpenDocVQA is the first unified collection of open-domain DocumentVQA datasets encompassing a wide range of document types and formats, including 43K QA pairs over 200K document images. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting.

Performance

Retrieval Results

VDocRetriever exhibits superior zero-shot generalization on the unseen datasets ChartQA and SlideVQA, outperforming both off-the-shelf text retrievers and state-of-the-art visual document retrieval models.

RAG Results

VDocRAG significantly outperformed both the closed-book LLM and text-based RAG on the DocumentVQA task, even when all models started from the same initialization.

Output Comparison

VDocRAG demonstrates significant performance advantages in understanding layouts and visual content, such as tables, charts, figures, and diagrams. These findings highlight the critical role of representing documents as images to improve the performance of the RAG framework.

BibTeX


  @inproceedings{tanaka2025vdocrag,
    author      = {Ryota Tanaka and Taichi Iki and Taku Hasegawa and Kyosuke Nishida and Kuniko Saito and Jun Suzuki},
    title       = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
    booktitle   = {CVPR},
    year        = {2025}
  }
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.