Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data.

Author: Turney Peter
Publication date: 01/01/2001

A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that capture the article's main topics. Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are conceptually related to keyphrase-frequency and I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive. The new features are generated by issuing queries to a Web search engine, based on the candidate phrases in the input document. The feature values are calculated from the number of hits for the queries (the number of matching Web pages). In essence, these new features are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases.
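The abstract above names the frequency and location of a candidate phrase as the most important classification features. The following is a minimal sketch of how such features might be computed, not the paper's actual feature definitions: the function name `candidate_features`, the tokenization rule, and the normalization by document length are all assumptions made for illustration.

```python
import re


def candidate_features(text, phrase):
    """Return (relative frequency, relative first position) of a phrase.

    Hypothetical feature definitions for illustration only:
    - frequency: occurrences of the phrase divided by document length in words
    - first position: index of the first occurrence divided by document length
      (0.0 means the very start; 1.0 is used when the phrase never occurs)
    """
    # Assumed tokenization: lowercase runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    target = re.findall(r"[a-z0-9]+", phrase.lower())
    n = len(target)
    if not words or n == 0:
        return 0.0, 1.0
    # Every start index where the phrase matches word for word.
    positions = [
        i for i in range(len(words) - n + 1) if words[i:i + n] == target
    ]
    if not positions:
        return 0.0, 1.0
    frequency = len(positions) / len(words)
    first_position = positions[0] / len(words)
    return frequency, first_position
```

A supervised classifier could then be trained on such feature pairs, with manually assigned keyphrases as positive labels. The Web hit-count features described in the abstract are not sketched here, because the abstract does not state how hit counts are turned into feature values.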

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

CogPrints Cognitive Sciences Eprint Archive

Diet composition and food habits of demersal and pelagic marine fishes from Terengganu waters, east coast of Peninsular Malaysia

Author: Bachok Z.
Mansor M.I.
Noordin R.M.
Publication date: 01/01/2004

Stomachs from 18 demersal and pelagic fish species from the coast of Terengganu in Malaysia were examined. The components of the fishes’ diets varied in number, weight, and frequency of occurrence. The major food items in the stomachs of each species were determined using an Index of Relative Importance. A conceptual food web structure indicates that fish species in the study area can be classified into three predatory groups: (1) predators on largely planktivorous or pelagic species; (2) predators on largely benthophagous or demersal species; and (3) mixed feeders that consume both pelagic and demersal species.
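The abstract above does not spell out how the Index of Relative Importance is computed. A commonly used formulation combines the three measures it mentions as IRI = (%N + %W) × %FO, where %N is the percentage by number, %W the percentage by weight, and %FO the percentage frequency of occurrence. The sketch below assumes that formulation; the function name `iri_by_prey` and the input layout are hypothetical, and the study itself may have used a variant.

```python
def iri_by_prey(stomachs):
    """Compute IRI = (%N + %W) * %FO for each prey item.

    `stomachs` is a list with one dict per examined stomach, mapping a prey
    name to a (count, weight) tuple. This layout is an assumption made for
    the example, not the data format of the study.
    """
    if not stomachs:
        return {}
    total_count = sum(c for s in stomachs for c, _ in s.values())
    total_weight = sum(w for s in stomachs for _, w in s.values())
    if total_count == 0 or total_weight == 0:
        return {}
    result = {}
    for prey in {p for s in stomachs for p in s}:
        count = sum(s[prey][0] for s in stomachs if prey in s)
        weight = sum(s[prey][1] for s in stomachs if prey in s)
        occurrences = sum(1 for s in stomachs if prey in s)
        pct_number = 100.0 * count / total_count
        pct_weight = 100.0 * weight / total_weight
        pct_occurrence = 100.0 * occurrences / len(stomachs)
        result[prey] = (pct_number + pct_weight) * pct_occurrence
    return result
```

Ranking prey items by the returned value gives one way to identify the major food items of a species.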

Aquatic Commons

A Quasi-Bayesian Perspective to Online Clustering

Author: Guedj Benjamin
Li Le
Loustau Sébastien
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2018

When faced with high-frequency streams of data, clustering raises theoretical and algorithmic pitfalls. We introduce a new and adaptive online clustering algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e., time-dependent) estimation of the (unknown and changing) number of clusters. We prove that our approach is supported by minimax regret bounds. We also provide an RJMCMC-flavored implementation (called PACBO, see https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a convergence guarantee. Finally, numerical experiments illustrate the potential of our procedure.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

UCL Discovery

Response Function of the Fractional Quantized Hall State on a Sphere I: Fermion Chern-Simons Theory

Author: A. L. Fetter
A. Lopez
A. R. Edmonds
B. I. Halperin
Bertrand I. Halperin
C. Gros
D. Pines
E. Weinberg
F. D. M. Haldane
G. Dev
H. A. Olsen
I. S. Gradshteyn
I. Tamm
J. K. Jain
M. Fierz
N. d'Ambrumenil
P. A. M. Dirac
S. H. Simon
S. He
Steven H. Simon
T. T. Wu
X. G. Wen
Y. H. Chen
Publication venue: American Physical Society (APS)
Publication date: 21/02/1994

Using a well-known singular gauge transformation, certain fractional quantized Hall states can be modeled as integer quantized Hall states of transformed fermions interacting with a Chern-Simons field. In previous work we have calculated the electromagnetic response function of these states at arbitrary frequency and wavevector by using the Random Phase Approximation (RPA) in combination with a Landau Fermi Liquid approach. We now adapt these calculations to a spherical geometry in order to facilitate comparison with exact diagonalizations performed on finite-size systems. Comment: 39 pages (REVTeX 3.0). A PostScript file for this paper is available on the World Wide Web at http://cmtw.harvard.edu/~simon/ ; Preprint number HU-CMT-94S0

arXiv.org e-Print Archive

Crossref

Empirical study of error behavior in Web servers

Author: Singh Ajay Deep
Publication venue: The Research Repository @ WVU
Publication date: 01/12/2005

The World Wide Web has been a huge success, bringing the Internet to widespread popularity. For Web-based systems to deal effectively with an increasing number of Web clients, it is very important to understand the fundamentals of Web workload and error characteristics. In this thesis we focus on a detailed empirical analysis of Web server error characteristics and reliability based on data extracted from eleven different Web servers. First, we address the data collection process and describe the methods for extracting workload and error data from Web logs. Then, we analyze the Web error characteristics, which include unique errors, the frequency of occurrence of unique errors, and the top files causing errors. Furthermore, we analyze the relationship between errors and Web workload and estimate request-based and session-based reliability. The discussion presented in this thesis shows that session-based reliability is a better indicator of user perception of Web quality than request-based reliability. Finally, we analyze and develop heuristic search criteria to identify sessions that indicate unusual server behavior, such as extremely long sessions and sessions with a large number of server errors. The results of our study provide valuable measures for tuning and maintaining Web servers.
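The difference between request-based and session-based reliability mentioned in the abstract above can be illustrated with a small sketch. It assumes, without confirmation from the abstract, that request-based reliability is the fraction of requests that do not fail and that session-based reliability is the fraction of sessions containing no failed request. Treating HTTP status codes of 400 and above as errors, and representing each session as a list of status codes, are also assumptions of this example.

```python
def is_error(status):
    """Assumed error rule: any 4xx or 5xx HTTP status counts as an error."""
    return status >= 400


def request_based_reliability(sessions):
    """Fraction of individual requests that completed without an error."""
    total = sum(len(session) for session in sessions)
    if total == 0:
        return 1.0  # no requests observed, so no failures observed
    failed = sum(1 for session in sessions for s in session if is_error(s))
    return 1.0 - failed / total


def session_based_reliability(sessions):
    """Fraction of sessions in which no request produced an error."""
    if not sessions:
        return 1.0  # no sessions observed, so no failures observed
    failed = sum(1 for session in sessions if any(is_error(s) for s in session))
    return 1.0 - failed / len(sessions)
```

With the three sessions `[[200, 200, 404], [200], [200, 500, 200, 200]]`, only 2 of 8 requests fail, but 2 of 3 sessions contain a failure, so the session-based figure is much lower. This gap is one reason a session-level measure may track user perception more closely.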

The Research Repository @ WVU (West Virginia University)

Hybrid Information Retrieval Model For Web Images

Author: Bassil Youssef
Publication date: 20/02/2012

The Big Bang of the Internet in the early 1990s dramatically increased the number of images being distributed and shared over the Web. As a result, image information retrieval systems were developed to index and retrieve image files spread over the Internet. Most of these systems are keyword-based: they search for images using their textual metadata and are therefore imprecise, because human language describes an image only vaguely. There are also content-based image retrieval systems, which search for images based on their visual information. However, content-based systems are still immature and not very effective, as they suffer from low retrieval recall and precision. This paper proposes a new hybrid image information retrieval model for indexing and retrieving Web images published in HTML documents. The distinguishing mark of the proposed model is that it is based on both graphical content and textual metadata. The graphical content is denoted by the color features and color histogram of the image, while the textual metadata are denoted by the terms that surround the image in the HTML document, more particularly the terms that appear in the tags p, h1, and h2, in addition to the terms that appear in the image's alt attribute, filename, and class-label. Moreover, this paper presents a new term weighting scheme called VTF-IDF, short for Variable Term Frequency-Inverse Document Frequency, which, unlike traditional schemes, exploits the HTML tag structure and assigns an extra bonus weight to terms that appear within particular HTML tags that are correlated with the semantics of the image. Experiments conducted to evaluate the proposed IR model showed a high retrieval precision rate that outpaced other current models. Comment: LACSC - Lebanese Association for Computational Sciences, http://www.lacsc.org/; International Journal of Computer Science & Emerging Technologies (IJCSET), Vol. 3, No. 1, February 201
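The idea of giving a bonus weight to terms found in particular HTML tags can be sketched as a tag-weighted variant of TF-IDF. This is not the paper's VTF-IDF formula, which the abstract does not state. The bonus values in `TAG_BONUS`, the function names, and the plain logarithmic IDF are all assumptions chosen for illustration.

```python
import math

# Hypothetical bonus weights per HTML source; the paper's values are not
# given in the abstract.
TAG_BONUS = {"alt": 3.0, "h1": 2.0, "h2": 1.5, "p": 1.0}


def weighted_term_frequency(tagged_terms):
    """Sum a tag-dependent weight for each (tag, term) occurrence."""
    tf = {}
    for tag, term in tagged_terms:
        tf[term] = tf.get(term, 0.0) + TAG_BONUS.get(tag, 1.0)
    return tf


def tag_weighted_tfidf(documents):
    """Return one {term: score} dict per document.

    Each document is a list of (tag, term) pairs describing one image's
    surrounding text. The score is weighted TF times log(N / df).
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in {term for _, term in doc}:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for doc in documents:
        tf = weighted_term_frequency(doc)
        scores.append(
            {t: w * math.log(n_docs / doc_freq[t]) for t, w in tf.items()}
        )
    return scores
```

In this sketch a term that appears once in an image's alt attribute contributes three times as much as the same term in a paragraph, while terms shared by every document still score zero through the IDF factor.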

arXiv.org e-Print Archive

CiteSeerX

ExcelingTech Publishing Company (E-Journals)