38,400 research outputs found

Document Style Recognition Using Shallow Statistical Analysis

Author: Braslavski P.
Браславский П. И.
Publication date: 01/01/2004

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Automatic Genre Classification in Web Pages Applied to Web Comments

Author: Mathar Rudolf
Neunerdt Melanie
Reyer Michael
Publication date: 23/10/2014

Automatic Web comment detection could significantly facilitate information retrieval systems, e.g., a focused Web crawler. In this paper, we propose a text genre classifier for Web text segments as an intermediate step for Web comment detection in Web pages. Different feature types and classifiers are analyzed for this purpose. We compare the two-level approach to state-of-the-art techniques operating on the whole Web page text and show that accuracy can be improved significantly. Finally, we illustrate the applicability for information retrieval systems by evaluating our approach on Web pages retrieved by a Web crawler.

University of Hildesheim

Experiment on Style-Dependent Document Ranking

Author: Braslavski P.
Tselischev A.
Браславский П. И.
Publication venue: n.p. (no publisher stated)
Publication date: 01/01/2005

The paper reports on experiments aimed at incorporating style-dependent parameters into ranking schemata in information retrieval tasks. We use the ROMIP Web collection and the ROMIP-2003 ad-hoc track results in the analysis. Factor analysis techniques were used to extract factors that reflect stylistic properties of documents. The obtained style-dependent parameters and their derived ranks are compared. A simple schema for rank aggregation is proposed. Evaluation of the results shows only moderate improvement of relevance ranking.

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin
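The abstract above mentions a simple schema for rank aggregation but does not describe it. As a rough illustration only, the sketch below combines a relevance ranking with a style-based ranking through a weighted sum of rank positions. The function name `aggregate_ranks`, the `style_weight` parameter, and the tie-breaking rule are all assumptions for this example and are not taken from the paper.

```python
def aggregate_ranks(relevance_rank, style_rank, style_weight=0.2):
    """Merge two rankings by a weighted sum of rank positions.

    Both arguments are lists of document ids, best first.
    style_weight is a hypothetical tuning knob, not a value from the paper.
    """
    rel_pos = {doc: i for i, doc in enumerate(relevance_rank)}
    sty_pos = {doc: i for i, doc in enumerate(style_rank)}
    # Only documents present in both rankings can be scored.
    docs = set(rel_pos) & set(sty_pos)
    score = {
        doc: (1 - style_weight) * rel_pos[doc] + style_weight * sty_pos[doc]
        for doc in docs
    }
    # Lower score is better; ties fall back to the relevance order.
    return sorted(docs, key=lambda doc: (score[doc], rel_pos[doc]))


if __name__ == "__main__":
    relevance = ["a", "b", "c"]
    style = ["b", "a", "c"]
    print(aggregate_ranks(relevance, style, style_weight=0.6))
```

With `style_weight=0` the relevance order is returned unchanged, which makes it easy to compare an aggregated ranking against the baseline.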

Arabic Documents classification method a Step towards Efficient Documents Summarization

Author: Hesham Ahmed Hassan, Mohamed Yehia Dahab, Khaled Bahnassy, Amira M. Idrees, Fatma Gamal
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 31/01/2015

The massive growth of online information has made thorough research on automatic text summarization a necessity within the Natural Language Processing (NLP) community. To reach this goal, different approaches should be integrated. One of these approaches is the classification of documents. Therefore, the aim of this paper is to propose a successful framework for agricultural document classification as a step towards a language-independent automatic summarization approach. The main target of our research series is to propose a complete, novel framework that not only responds to the question but also gives the user an opportunity to find additional information related to the question. We implemented the proposed method. As a case study, the implemented method is applied to Arabic text in the agriculture field. The implemented approach succeeded in classifying the documents submitted by the user. The results have been evaluated using Recall, Precision and F-score measures. DOI: 10.17762/ijritcc2321-8169.15017

International Journal on Recent and Innovation Trends in Computing and Communication
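The entry above reports evaluation with Recall, Precision and F-score. For readers unfamiliar with these measures, here is a minimal sketch of the standard per-class definitions. The label names in the usage example are hypothetical and no figures from the paper are reproduced.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Standard per-class precision, recall and F1 for one target label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    # Guard against empty denominators when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    truth = ["agri", "agri", "other", "agri", "other"]
    guess = ["agri", "other", "agri", "agri", "other"]
    print(precision_recall_f1(truth, guess, "agri"))
```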

"Yes, user!": compiling a corpus according to what the user wants

Author: Aires Rachel
Aluísio Sandra
Santos Diana
Publication date: 31/10/2006

Repositório Comum

Genre Classification of Websites Using Search Engine Snippets

Author: Becker Hila
Gupta Suhit
Kaiser Gail E.
Stolfo Salvatore
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2005

Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from the actual content. Automatic extraction of 'useful and relevant' content from web pages has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. Prior work has led to the development of Crunch, a framework which employs various heuristics in the form of filters and filter settings for content extraction. Crunch allows users to tune these settings, essentially the thresholds for applying each filter. However, in order to reduce human involvement in selecting these heuristic settings, we have extended this work to utilize a website's classification, defined by its genre and physical layout. In particular, Crunch would then obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings, which in practice produces better content extraction results than a single one-size-fits-all set of default settings. In this paper, we present our approach to clustering a large corpus of websites by their genre, utilizing the snippets generated by sending the website's domain name to search engines as well as the website's own text. We find that exploiting these snippets not only increases the frequency of function words that directly assist in detecting the genre of a website, but also allows for easier clustering of websites. We use existing techniques like the Manhattan distance measure and hierarchical clustering, with some modifications, to pre-classify websites into genres. Our clustering method does not require prior knowledge of the set of genres that websites fit into, but instead discovers these relationships among websites. Subsequently, we are able to classify newly encountered websites in linear time, and then apply the corresponding filter settings, with no noticeable delay introduced for the content-extracting web proxy.

Columbia University Academic Commons
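The abstract above names function-word frequencies, the Manhattan distance and hierarchical clustering as its building blocks, with unspecified modifications. The sketch below shows one plain way these pieces could fit together. The `FUNCTION_WORDS` list, the average-linkage rule and the stopping `threshold` are assumptions made for illustration and do not reproduce the authors' modified method.

```python
from itertools import combinations

# Hypothetical function-word vocabulary; the paper's list is not given here.
FUNCTION_WORDS = ("the", "of", "and", "you", "your", "we")


def word_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word in whitespace-split text."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    return [tokens.count(word) / total for word in vocabulary]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster(vectors, threshold):
    """Agglomerative clustering with average linkage (an assumed choice).

    Merging stops once the closest pair of clusters is farther apart than
    threshold, so the number of clusters is not fixed in advance.
    Returns a list of clusters, each a list of indices into vectors.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(manhattan(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


if __name__ == "__main__":
    sites = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
    print(cluster(sites, threshold=0.5))
```

Stopping on a distance threshold rather than a target cluster count mirrors the abstract's point that the set of genres does not need to be known beforehand.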

A Genre-based Clustering Approach to Content Extraction

Author: Becker Hila
Gupta Suhit
Kaiser Gail E.
Stolfo Salvatore
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2005

The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter (defined as cosmetic features such as animations, menus, sidebars, obtrusive banners). Automatic content extraction has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. We have developed a framework, Crunch, which employs various heuristics for content extraction in the form of filters applied to the webpage's DOM tree; the filters aim to prune or transform the clutter, leaving only the content. Crunch allows users to tune what we call 'settings', consisting of thresholds for applying a particular filter and/or for toggling a filter on/off, because the HTML components that characterize clutter can vary significantly from website to website. However, we have found that the same settings tend to work well across different websites of the same genre, e.g., news or shopping, since the designers often employ similar page layouts. In particular, Crunch could obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings. We present our approach to clustering a large corpus of websites into genres, using their pre-extraction textual material augmented by the snippets generated by searching for the website's domain name in web search engines. Including these snippets increases the frequency of function words needed for clustering. We use the existing Manhattan distance measure and hierarchical clustering techniques, with some modifications, to pre-classify the corpus into genres offline. Our method does not require prior knowledge of the set of genres that websites fit into, but to be useful, a priori settings must be available for some member of each cluster or a nearby cluster (otherwise defaults are used). Crunch classifies newly encountered websites online in linear time, and then applies the corresponding filter settings, with no noticeable delay added by our content-extracting web proxy.

Columbia University Academic Commons
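The abstract above describes classifying a newly encountered website online and falling back to defaults when no cluster settings are available. One simple way to picture that step is a nearest-centroid lookup, which is linear in the number of clusters. Everything named below (`DEFAULT_SETTINGS`, `settings_for`, `max_distance`, and the setting keys) is hypothetical and is not Crunch's actual interface.

```python
# Hypothetical default filter settings; real Crunch settings are not shown here.
DEFAULT_SETTINGS = {"remove_ads": True, "link_text_ratio": 0.5}


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def settings_for(site_vector, centroids, cluster_settings, max_distance):
    """Pick filter settings for a new site from its nearest cluster.

    centroids: one feature vector per known cluster.
    cluster_settings: dict mapping cluster index to tuned settings.
    Falls back to DEFAULT_SETTINGS when the nearest cluster is too far
    away or has no tuned settings. One pass over the centroids.
    """
    best = min(range(len(centroids)),
               key=lambda k: manhattan(site_vector, centroids[k]))
    if manhattan(site_vector, centroids[best]) > max_distance:
        return DEFAULT_SETTINGS
    return cluster_settings.get(best, DEFAULT_SETTINGS)


if __name__ == "__main__":
    cents = [centroid([[0.0, 0.0], [0.2, 0.0]]),
             centroid([[1.0, 1.0], [1.0, 0.8]])]
    tuned = {0: {"remove_ads": False, "link_text_ratio": 0.3}}
    print(settings_for([0.1, 0.1], cents, tuned, max_distance=0.5))
```

Precomputing the centroids offline keeps the online step to a single distance computation per cluster, which is one plausible reading of the linear-time claim.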
