22 research outputs found

    Psoriasis prediction from genome-wide SNP profiles

    Abstract

    Background: With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of single nucleotide polymorphisms (SNPs) for disease susceptibility prediction is a challenging task. This study aimed to use SNPs selected from GWAS data to predict psoriasis.

    Methods: In total, the data comprised 2,798 samples and 451,724 SNPs. The search for a set of SNPs predictive of psoriasis susceptibility consisted of two steps: first, selecting the top 1,000 SNPs with the highest prediction accuracy from the GWAS dataset; second, searching this pool for an optimal SNP subset for predicting psoriasis. The sequential Information Bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance.

    Results: The best test harmonic mean of sensitivity and specificity for predicting psoriasis with sIB was 0.674 (95% CI: 0.650-0.698), versus 0.520 (95% CI: 0.472-0.524) with LDA. These results indicate that the sIB classifier outperforms LDA in this study.

    Conclusions: The fact that a small set of SNPs can predict disease status with an average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
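    The evaluation metric reported above, the harmonic mean of sensitivity and specificity, can be sketched as follows; the confusion-matrix counts are illustrative, not taken from the study.

```python
def harmonic_mean_sens_spec(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# Hypothetical counts: 80% sensitivity, 60% specificity.
score = harmonic_mean_sens_spec(tp=80, fn=20, tn=60, fp=40)
print(round(score, 3))  # 0.686
```

    Like the F1 score for precision and recall, this metric penalizes a classifier that trades one rate sharply against the other.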

    Document Classification Using a Combination of Principal Component Analysis and SVM

    ABSTRACT Text document classification is a simple but important problem, and its benefits are considerable given that the number of documents grows every day. However, most existing document classification techniques require a large number of labeled documents for the training and testing stages. In this final project, document classification is performed using the Principal Component Analysis (PCA) algorithm combined with Support Vector Machines (SVM) for supervised documents. PCA is a technique for extracting the structure of high-dimensional data without discarding significant information about the data as a whole. An algorithm is then needed to produce predictions and accuracy for these documents, namely SVM. SVM is a machine learning method based on the principle of Structural Risk Minimization (SRM), whose goal is to find the best hyperplane separating two classes in the input space. The best separating hyperplane between the two classes is found by measuring the margins of candidate hyperplanes and maximizing the margin. In system tests, data reduced by PCA yielded slightly lower accuracy on certain datasets than data used without PCA. The data used is the R8 subset of the Reuters-21578 Text Categorization Collection Data Set. The best accuracy in this study was obtained with the SVM method, with an average accuracy of 98.95%, while the SVM + PCA method achieved an average accuracy of 96.7866%. Keywords: document classification, Principal Component Analysis, Support Vector Machines
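    The PCA reduction step described above can be sketched with plain NumPy; the toy matrix stands in for document feature vectors and is not the R8/Reuters-21578 data, and the classifier trained on the reduced features (SVM in the paper) is omitted.

```python
# Sketch: reduce high-dimensional document vectors with PCA before
# training a classifier on the projected features.
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top n_components principal components."""
    X_centered = X - X.mean(axis=0)          # PCA requires centered data
    # SVD of the centered matrix; rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))               # 100 "documents", 50 features
Z = pca_reduce(X, n_components=10)
print(Z.shape)  # (100, 10)
```

    An SVM would then be fit on `Z` instead of `X`; the abstract's result (96.79% vs. 98.95%) shows the compression can cost a little accuracy on some datasets.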

    Integration of TDOA Features in Information Bottleneck Framework for Fast Speaker Diarization

    In this paper we address the combination of multiple feature streams in a fast speaker diarization system for meeting recordings. Whenever Multiple Distant Microphones (MDM) are used, it is possible to estimate the Time Delay of Arrival (TDOA) between channels. Previous work has shown that TDOA values can be used as additional features alongside conventional spectral features to improve speaker diarization. We investigate here the combination of TDOA and spectral features in a fast diarization system based on the Information Bottleneck principle. The algorithm is evaluated on the NIST RT06 diarization task. Adding TDOA features to spectral features reduces the speaker error by 3% absolute. Results are comparable to those of conventional HMM/GMM-based systems, with a consistent reduction in computational complexity.
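    A minimal sketch of TDOA estimation between two microphone channels via cross-correlation; the signals are synthetic, and real diarization front-ends typically use GCC-PHAT on windowed audio frames rather than raw correlation.

```python
# Sketch: estimate the Time Delay of Arrival (TDOA) of one channel
# relative to another from the peak of their cross-correlation.
import numpy as np

def tdoa_samples(ref, other):
    """Delay (in samples) of `other` relative to `ref`."""
    corr = np.correlate(other, ref, mode="full")
    return np.argmax(corr) - (len(ref) - 1)

rng = np.random.default_rng(1)
sig = rng.normal(size=1000)
delayed = np.roll(sig, 5)          # simulate a 5-sample propagation delay
print(tdoa_samples(sig, delayed))  # 5
```

    Per-channel delays like this, computed for each frame, form the TDOA feature stream that is combined with spectral features in the diarization system.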

    COMBINATION OF AGGLOMERATIVE AND SEQUENTIAL CLUSTERING FOR SPEAKER DIARIZATION

    This paper investigates the use of sequential clustering for speaker diarization. Conventional diarization systems are based on parametric models and agglomerative clustering. In our previous work we proposed a non-parametric method based on the agglomerative Information Bottleneck for very fast diarization. Here we consider the combination of sequential and agglomerative clustering for avoiding local maxima of the objective function and for cluster purification. Experiments are run on the RT06 evaluation data. Sequential clustering with oracle model selection can reduce the speaker error by 10% w.r.t. agglomerative clustering. When model selection is based on the Normalized Mutual Information criterion, a relative improvement of 5% is obtained using a combination of agglomerative and sequential clustering.
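    The Normalized Mutual Information criterion mentioned above can be sketched generically as follows; this uses one common normalization (geometric mean of the entropies) and is not the paper's model-selection code.

```python
# Sketch: NMI between two label sequences over the same segments.
import numpy as np
from collections import Counter

def nmi(labels_a, labels_b):
    """NMI = I(A;B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * np.log(c / n) for c in pa.values())
    h_b = -sum(c / n * np.log(c / n) for c in pb.values())
    mi = sum(c / n * np.log((c / n) / (pa[a] / n * pb[b] / n))
             for (a, b), c in pab.items())
    return mi / np.sqrt(h_a * h_b)

# Identical partitions up to relabeling score 1.0.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # 1.0
```

    Because NMI is invariant to label permutation, it can compare cluster counts and assignments across candidate models without an oracle.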

    AGGLOMERATIVE INFORMATION BOTTLENECK FOR SPEAKER DIARIZATION OF MEETINGS DATA

    In this paper, we investigate the use of agglomerative Information Bottleneck (aIB) clustering for the speaker diarization task on meeting data. In contrast to state-of-the-art diarization systems, which model individual speakers with Gaussian Mixture Models, the proposed algorithm is completely non-parametric. Both the clustering and model selection issues of non-parametric models are addressed in this work. The proposed algorithm is evaluated on the RT06 meeting evaluation data set. The system achieves Diarization Error Rates comparable to those of state-of-the-art systems at a much lower computational complexity.
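    One greedy step of agglomerative Information Bottleneck clustering can be sketched as below: merge the pair of clusters whose fusion loses the least information about the relevance variable. The priors and distributions are toy values, not diarization features, and this is a generic aIB step rather than the paper's implementation.

```python
# Sketch: a single aIB merge. Each cluster is (prior p, relevance
# distribution q); the merge cost is a weighted Jensen-Shannon divergence.
import numpy as np

def merge_cost(p_i, q_i, p_j, q_j):
    """Information lost by merging two clusters."""
    p = p_i + p_j
    q = (p_i * q_i + p_j * q_j) / p          # merged relevance distribution
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return p_i * kl(q_i, q) + p_j * kl(q_j, q)

def aib_merge_once(priors, dists):
    """Merge the cheapest pair; return the new priors and distributions."""
    n = len(priors)
    pairs = [(merge_cost(priors[i], dists[i], priors[j], dists[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    _, i, j = min(pairs)                      # least informative merge
    p = priors[i] + priors[j]
    q = (priors[i] * dists[i] + priors[j] * dists[j]) / p
    keep = [k for k in range(n) if k not in (i, j)]
    return [priors[k] for k in keep] + [p], [dists[k] for k in keep] + [q]

priors = [0.4, 0.3, 0.3]
dists = [np.array([0.8, 0.2]), np.array([0.7, 0.3]), np.array([0.1, 0.9])]
priors, dists = aib_merge_once(priors, dists)
print(len(priors))  # 2 clusters remain
```

    Iterating this step from one-segment-per-cluster down to the selected model size is what makes the method non-parametric: no per-speaker GMM is ever trained.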
