72,796 research outputs found
Exploratory Analysis of Highly Heterogeneous Document Collections
We present an effective multifaceted system for exploratory analysis of
highly heterogeneous document collections. Our system is based on intelligently
tagging individual documents in a purely automated fashion and exploiting these
tags in a powerful faceted browsing framework. Tagging strategies employed
include both unsupervised and supervised approaches based on machine learning
and natural language processing. As one of our key tagging strategies, we
introduce the KERA algorithm (Keyword Extraction for Reports and Articles).
KERA extracts topic-representative terms from individual documents in a purely
unsupervised fashion and is revealed to be significantly more effective than
state-of-the-art methods. Finally, we evaluate our system in its ability to
help users locate documents pertaining to military critical technologies buried
deep in a large heterogeneous sea of information.Comment: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery
and Data Minin
Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration
Building document-grounded dialogue systems have received growing interest as
documents convey a wealth of human knowledge and commonly exist in enterprises.
Wherein, how to comprehend and retrieve information from documents is a
challenging research problem. Previous work ignores the visual property of
documents and treats them as plain text, resulting in incomplete modality. In
this paper, we propose a Layout-aware document-level Information Extraction
dataset, LIE, to facilitate the study of extracting both structural and
semantic knowledge from visually rich documents (VRDs), so as to generate
accurate responses in dialogue systems. LIE contains 62k annotations of three
extraction tasks from 4,061 pages in product and official documents, becoming
the largest VRD-based information extraction dataset to the best of our
knowledge. We also develop benchmark methods that extend the token-based
language model to consider layout features like humans. Empirical results show
that layout is critical for VRD-based extraction, and system demonstration also
verifies that the extracted knowledge can help locate the answers that users
care about.Comment: Accepted to ACM Multimedia (MM) Industry Track 202
Multi-document Summarization Based on Sentence Clustering Improved Using Topic Words
Informasi dalam bentuk teks berita telah menjadi salah satu komoditas yang paling penting dalam era informasi ini. Ada banyak berita yang dihasilkan sehari-hari, tetapi berita-berita ini sering memberikan konten kontekstual yang sama dengan narasi berbeda. Oleh karena itu, diperlukan metode untuk mengumpulkan informasi ini ke dalam ringkasan sederhana. Di antara sejumlah subtugas yang terlibat dalam peringkasan multi-dokumen termasuk ekstraksi kalimat, deteksi topik, ekstraksi kalimat representatif, dan kalimat rep-resentatif. Dalam tulisan ini, kami mengusulkan metode baru untuk merepresentasikan kalimat ber-dasarkan kata kunci dari topic teks menggunakan Latent Dirichlet Allocation (LDA). Metode ini terdiri dari tiga langkah dasar. Pertama, kami mengelompokkan kalimat di set dokumen menggunakan kesamaan histogram pengelompokan (SHC). Selanjutnya, peringkat cluster menggunakan klaster penting. Terakhir, kalimat perwakilan yang dipilih oleh topik diidentifikasi pada LDA. Metode yang diusulkan diuji pada dataset DUC2004. Hasil penelitian menunjukkan rata-rata 0,3419 dan 0,0766 untuk ROUGE-1 dan ROUGE-2, masing-masing. Selain itu, dari pembaca prespective, metode kami diusulkan menyajikan pengaturan yang koheren dan baik dalam memesan kalimat representatif, sehingga dapat mempermudah pemahaman bacaan dan mengurangi waktu yang dibutuhkan untuk membaca ringkasan
- …