Search CORE

6,318 research outputs found

Recommended from our members

A Genre-based Clustering Approach to Content Extraction

Author: Gupta Suhit
Becker Hila
Kaiser Gail E.
Stolfo Salvatore
Publication venue: Department of Computer Science, Columbia University
Publication date: 26/02/2004
Field of study

The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter (defined as cosmetic features such as animations, menus, sidebars, obtrusive banners). Automatic content extraction has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. We have developed a framework, Crunch, which employs various heuristics for content extraction in the form of filters applied to the webpage's DOM tree; the filters aim to prune or transform the clutter, leaving only the content. Crunch allows users to tune what we call 'settings', consisting of thresholds for applying a particular filter and/or for toggling a filter on/off, because the HTML components that characterize clutter can vary significantly from website to website. However, we have found that the same settings tend to work well across different websites of the same genre, e.g., news or shopping, since the designers often employ similar page layouts. In particular, Crunch could obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings. We present our approach to clustering a large corpus of websites into genres, using their pre-extraction textual material augmented by the snippets generated by searching for the website's domain name in web search engines. Including these snippets increases the frequency of function words needed for clustering. We use existing Manhattan distance measure and hierarchical clustering techniques, with some modifications, to pre-classify the corpus into genres offline. Our method does not require prior knowledge of the set of genres that websites fit into, but to be useful a priori settings must be available for some member of each cluster or a nearby cluster (otherwise defaults are used). Crunch classifies newly encountered websites online in linear-time, and then applies the corresponding filter settings, with no noticeable delay added by our content-extracting web proxy

Columbia University Academic Commons

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

Recommended from our members

A Genre-based Clustering Approach to Content Extraction

Author: Becker Hila
Gupta Suhit
Kaiser Gail E.
Stolfo Salvatore
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2005
Field of study

Columbia University Academic Commons

Visual and computational analysis of structure-activity relationships in high-throughput screening data

Author: Agrafiotis
Agrafiotis
Ahlberg
Ajay
Ajay
Bayada
Bemis
Bernard
Bonabeau
Brown
Calvert
Card
Chen
Chen
Cho
Christianini
Clark
Clark
Cox
Duda
Edwards
Engels
Frimurer
Gao
Garrido
Ghose
Gillet
Gillet
Hand
Hann
Haupts
Hayward
Hertzberg
Izrailev
Jiang
Jones-Hertzog
Kirew
Kobayashi
Kohonen
Ladd
Lee
Lepre
Martin
Mason
Mello
Meyer
Miller
Mitchell
Oprea
Peter Gedeck
Peter Willett
Poroikov
Rhodes
Roberts
Roberts
Ros
Rusinko
Sadowski
Sadowski
Scherf
Sheridan
Shi
Stanton
Su
Teague
Thompson
Tropsha
Tufte
Tufte
Wagener
Walters
Wang
Wedin
Xie
Xu
Zupan
Publication venue: 'Elsevier BV'
Publication date: 01/08/2001
Field of study

Novel analytic methods are required to assimilate the large volumes of structural and bioassay data generated by combinatorial chemistry and high-throughput screening programmes in the pharmaceutical and agrochemical industries. This paper reviews recent work in visualisation and data mining that can be used to develop structure-activity relationships from such chemical/biological datasets

Crossref

White Rose Research Online

QuickCSG: Fast Arbitrary Boolean Combinations of N Solids

Author: Douze Matthijs
Franco Jean-Sébastien
Raffin Bruno
Publication venue
Publication date: 05/06/2017
Field of study

QuickCSG computes the result for general N-polyhedron boolean expressions without an intermediate tree of solids. We propose a vertex-centric view of the problem, which simplifies the identification of final geometric contributions, and facilitates its spatial decomposition. The problem is then cast in a single KD-tree exploration, geared toward the result by early pruning of any region of space not contributing to the final surface. We assume strong regularity properties on the input meshes and that they are in general position. This simplifying assumption, in combination with our vertex-centric approach, improves the speed of the approach. Complemented with a task-stealing parallelization, the algorithm achieves breakthrough performance, one to two orders of magnitude speedups with respect to state-of-the-art CPU algorithms, on boolean operations over two to dozens of polyhedra. The algorithm also outperforms GPU implementations with approximate discretizations, while producing an output without redundant facets. Despite the restrictive assumptions on the input, we show the usefulness of QuickCSG for applications with large CSG problems and strong temporal constraints, e.g. modeling for 3D printers, reconstruction from visual hulls and collision detection

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

QuickCSG: Fast Arbitrary Boolean Combinations of N Solids

Author: Douze Matthijs
Franco Jean-Sébastien
Raffin Bruno
Publication venue
Publication date: 01/01/1760
Field of study

arXiv.org e-Print Archive

Biblioteca Digital de la Comunidad de Madrid

Galiciana

Text Classification: A Review, Empirical, and Experimental Evaluation

Author: Taha Aya
Taha Kamal
Yeun Chan
Yoo Paul D.
Publication venue
Publication date: 11/01/2024
Field of study

The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize algorithms for text classification into broad classes, which can lead to the misclassification of unrelated algorithms and incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy includes methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to utilize this methodological taxonomy for classifying algorithms for text classification. Furthermore, our study also conducts empirical evaluation and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and categorie

arXiv.org e-Print Archive

A plan classifier based on Chi-square distribution tests

Author: Iglesias Martínez José Antonio
Kaminka Gal
Ledezma Espino Agapito Ismael
Sanchis de Miguel María Araceli
Publication venue: 'IOS Press'
Publication date: 01/01/2011
Field of study

To make good decisions in a social context, humans often need to recognize the plan underlying the behavior of others, and make predictions based on this recognition. This process, when carried out by software agents or robots, is known as plan recognition, or agent modeling. Most existing techniques for plan recognition assume the availability of carefully hand-crafted plan libraries, which encode the a-priori known behavioral repertoire of the observed agents; during run-time, plan recognition algorithms match the observed behavior of the agents against the plan-libraries, and matches are reported as hypotheses. Unfortunately, techniques for automatically acquiring plan-libraries from observations, e.g., by learning or data-mining, are only beginning to emerge. We present an approach for automatically creating the model of an agent behavior based on the observation and analysis of its atomic behaviors. In this approach, observations of an agent behavior are transformed into a sequence of atomic behaviors (events). This stream is analyzed in order to get the corresponding behavior model, represented by a distribution of relevant events. Once the model has been created, the proposed approach presents a method using a statistical test for classifying an observed behavior. Therefore, in this research, the problem of behavior classification is examined as a problem of learning to characterize the behavior of an agent in terms of sequences of atomic behaviors. The experiment results of this paper show that a system based on our approach can efficiently recognize different behaviors in different domains, in particular UNIX command-line data, and RoboCup soccer simulationThis work has been partially supported by the Spanish Government under project TRA2007-67374-C02-0

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo