98,480 research outputs found
Concept Extraction and Clustering for Topic Digital Library Construction
This paper is to introduce a new approach to build
topic digital library using concept extraction and
document clustering. Firstly, documents in a special
domain are automatically produced by document
classification approach. Then, the keywords of each
document are extracted using the machine learning
approach. The keywords are used to cluster the
documents subset. The clustered result is the taxonomy
of the subset. Lastly, the taxonomy is modified to the
hierarchical structure for user navigation by manual
adjustments. The topic digital library is constructed
after combining the full-text retrieval and hierarchical
navigation function
Automatic tagging and geotagging in video collections and communities
Automatically generated tags and geotags hold great promise
to improve access to video collections and online communi-
ties. We overview three tasks offered in the MediaEval 2010
benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features
Simple vs. sophisticated approaches for patent prior-art search
Patent prior-art search is concerned with finding all filed patents relevant to a given patent application. We report a comparison between two search approaches representing the state-of-the-art in patent prior-art search. The first approach uses simple and straightforward information retrieval (IR) techniques, while the second uses much more sophisticated techniques which try to model the steps taken by a patent examiner in patent search. Experiments show that the retrieval effectiveness using both techniques is statistically indistinguishable when patent applications contain some initial citations. However, the advanced search technique is statistically better when no initial citations are provided. Our findings suggest that less time and effort can be exerted by applying simple IR approaches when initial citations are provided
Multi-document Summarization Based on Sentence Clustering Improved Using Topic Words
Informasi dalam bentuk teks berita telah menjadi salah satu komoditas yang paling penting dalam era informasi ini. Ada banyak berita yang dihasilkan sehari-hari, tetapi berita-berita ini sering memberikan konten kontekstual yang sama dengan narasi berbeda. Oleh karena itu, diperlukan metode untuk mengumpulkan informasi ini ke dalam ringkasan sederhana. Di antara sejumlah subtugas yang terlibat dalam peringkasan multi-dokumen termasuk ekstraksi kalimat, deteksi topik, ekstraksi kalimat representatif, dan kalimat rep-resentatif. Dalam tulisan ini, kami mengusulkan metode baru untuk merepresentasikan kalimat ber-dasarkan kata kunci dari topic teks menggunakan Latent Dirichlet Allocation (LDA). Metode ini terdiri dari tiga langkah dasar. Pertama, kami mengelompokkan kalimat di set dokumen menggunakan kesamaan histogram pengelompokan (SHC). Selanjutnya, peringkat cluster menggunakan klaster penting. Terakhir, kalimat perwakilan yang dipilih oleh topik diidentifikasi pada LDA. Metode yang diusulkan diuji pada dataset DUC2004. Hasil penelitian menunjukkan rata-rata 0,3419 dan 0,0766 untuk ROUGE-1 dan ROUGE-2, masing-masing. Selain itu, dari pembaca prespective, metode kami diusulkan menyajikan pengaturan yang koheren dan baik dalam memesan kalimat representatif, sehingga dapat mempermudah pemahaman bacaan dan mengurangi waktu yang dibutuhkan untuk membaca ringkasan
Finding Support Documents with a Logistic Regression Approach
Entity retrieval finds the relevant results for a user’s information needs at a finer unit called “entity”. To retrieve such entity, people usually first locate a small set of support documents which contain answer entities, and then further detect the answer entities in this set. In the literature, people view the support documents as relevant documents, and their findings as a conventional document retrieval problem. In this paper, we will state that finding support documents and that of relevant documents, although sounds similar, have important differences. Further, we propose a logistic regression approach to find support documents. Our experiment results show that the logistic regression method performs significantly better than a baseline system that treat the support document finding as a conventional document retrieval problem
Key Phrase Extraction of Lightly Filtered Broadcast News
This paper explores the impact of light filtering on automatic key phrase
extraction (AKE) applied to Broadcast News (BN). Key phrases are words and
expressions that best characterize the content of a document. Key phrases are
often used to index the document or as features in further processing. This
makes improvements in AKE accuracy particularly important. We hypothesized that
filtering out marginally relevant sentences from a document would improve AKE
accuracy. Our experiments confirmed this hypothesis. Elimination of as little
as 10% of the document sentences lead to a 2% improvement in AKE precision and
recall. AKE is built over MAUI toolkit that follows a supervised learning
approach. We trained and tested our AKE method on a gold standard made of 8 BN
programs containing 110 manually annotated news stories. The experiments were
conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio
news/programs, running daily, and monitoring 12 TV and 4 radio channels.Comment: In 15th International Conference on Text, Speech and Dialogue (TSD
2012
- …