SuDocu

SuDocu is an example-based personalized document summarization system. Users provide example summaries; SuDocu learns the summarization intent from those examples and produces summaries for new documents that reflect that intent.

Project Summary

Text document summarization is the task of producing a brief representation of a document for easy human consumption. Existing text summarization techniques mostly focus on generic summarization, but users often require personalized summarization that targets their specific preferences and needs. However, precisely expressing preferences is challenging, and current methods are often ambiguous, outside the user's control, or require costly training data. We propose a novel and effective way to express summarization intent (preferences) via examples: the user provides a few example summaries for a small number of documents in a collection, and the system summarizes the rest. We demonstrate SuDocu, an example-based personalized Document Summarization system. Through a simple interface, SuDocu lets users provide example summaries, learns the summarization intent from the examples, and produces summaries for new documents that reflect that intent. SuDocu further explains the captured summarization intent in the form of a package query, an extension of a traditional SQL query that handles complex constraints and preferences over answer sets. SuDocu combines topic modeling, semantic similarity discovery, and in-database optimization in a novel way to achieve example-driven document summarization. We demonstrate how SuDocu can detect complex summarization intents from a few example summaries and produce accurate summaries for new documents effectively and efficiently.
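To make the idea of example-driven summarization concrete, here is a minimal sketch of the general pattern: derive per-topic bounds from example summaries, then select sentences from a new document whose topic totals fall within those bounds. This is a toy illustration under stated assumptions, not SuDocu's implementation. The function names (`learn_bounds`, `summarize`), the hand-made topic scores, the fixed summary size, the objective, and the brute-force search are all hypothetical; the real system uses topic modeling and in-database optimization instead.

```python
from itertools import combinations

EPS = 1e-9  # tolerance for floating-point comparisons


def learn_bounds(example_summaries):
    """Return a (min, max) pair per topic, taken over the summed topic
    scores of each example summary. Assumption: each sentence is a tuple
    of per-topic scores (hand-made here, not produced by a topic model)."""
    n_topics = len(example_summaries[0][0])
    totals = [
        [sum(sentence[t] for sentence in summary) for t in range(n_topics)]
        for summary in example_summaries
    ]
    return [
        (min(tot[t] for tot in totals), max(tot[t] for tot in totals))
        for t in range(n_topics)
    ]


def summarize(sentences, bounds, size):
    """Pick `size` sentence indices whose per-topic totals stay within the
    learned bounds, maximizing the overall topic score (a hypothetical
    objective). Brute force: only suitable for tiny inputs."""
    best, best_score = None, float("-inf")
    for combo in combinations(range(len(sentences)), size):
        totals = [sum(sentences[i][t] for i in combo) for t in range(len(bounds))]
        feasible = all(lo - EPS <= tot <= hi + EPS
                       for (lo, hi), tot in zip(bounds, totals))
        if feasible and sum(totals) > best_score:
            best, best_score = combo, sum(totals)
    return best  # None if no selection satisfies the constraints


# Two example summaries; each sentence is (history_score, geography_score).
examples = [
    [(0.75, 0.0), (0.5, 0.25)],
    [(1.0, 0.0), (0.5, 0.0)],
]
# Sentences of a new, unsummarized document with the same two topics.
document = [(0.75, 0.0), (0.0, 1.0), (0.5, 0.25), (0.25, 0.75), (0.75, 0.25)]

bounds = learn_bounds(examples)
chosen = summarize(document, bounds, size=2)
print(bounds)
print(chosen)
```

The learned bounds play a role conceptually similar to the constraints over an answer set that a package query expresses: the summary is judged as a whole set of sentences rather than one sentence at a time.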

VLDB 2020 Talk

NewSum@EMNLP 2021 Talk

System Architecture

People

Lead Students
BS and MS Students
- Oscar Youngquist [Winter 2021 - to date]
- Julian Killingback [Winter 2021]
- Genglin Liu [Summer 2020]
- Kanchi Masalia [Summer & Fall 2020]
Faculty
- Peter J. Haas
- Alexandra Meliou

Dataset

SubSumE Dataset

Publications

Nishant Yadav*, Matteo Brucato*, Anna Fariha*, Oscar Youngquist, Julian Killingback, Alexandra Meliou, and Peter J. Haas. SubSumE: A Dataset for Subjective Summary Extraction from Wikipedia Documents. NewSum@EMNLP, 2021.
Anna Fariha, Matteo Brucato, Peter J. Haas, and Alexandra Meliou. SuDocu: Summarizing Documents by Example. PVLDB, 13(12), 2020.
Won the best demonstration runner-up award at VLDB 2020.

Acknowledgement

This work was supported by the NSF under grants IIS-1453543, IIS-1943971, and CCF-1763423, and by a Microsoft Research Dissertation Grant.