Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data.
A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that capture the article's main topics. Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are conceptually related to keyphrase-frequency and I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive. The new features are generated by issuing queries to a Web search engine, based on the candidate phrases in the input document.
The feature values are calculated from the number of hits for the queries (the number of matching Web pages). In essence, these new features are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases.
Diet composition and food habits of demersal and pelagic marine fishes from Terengganu waters, east coast of Peninsular Malaysia
Fish stomachs from 18 demersal and pelagic fishes from the coast of Terengganu in Malaysia were examined. The components of the fishes’ diets varied in number, weight, and frequency of occurrence. The major food items in the stomachs of each species were determined using an Index of Relative Importance. A conceptual food web structure indicates that fish species in the study area can be classified into three predatory groups: (1) predators on largely planktivorous or pelagic species; (2) predators on largely benthophagous or demersal species; and (3) mixed feeders that consume both pelagic and demersal species.
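The abstract does not state how the Index of Relative Importance was computed. A minimal sketch, assuming the commonly used form IRI = (%N + %W) × %F, where %N, %W, and %F are a prey item's percentage by number, by weight, and by frequency of occurrence, could look like this. The function name, the input layout, and the sample values are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def iri_by_prey(stomachs):
    """Return IRI = (%N + %W) * %F for each prey category.

    stomachs: non-empty list of dicts, one per examined stomach,
    mapping prey name -> (count, weight). Hypothetical input layout.
    """
    counts = defaultdict(float)     # total individuals per prey item
    weights = defaultdict(float)    # total weight per prey item
    occurrences = defaultdict(int)  # number of stomachs containing the item
    for stomach in stomachs:
        for prey, (n, w) in stomach.items():
            counts[prey] += n
            weights[prey] += w
            occurrences[prey] += 1

    total_n = sum(counts.values())
    total_w = sum(weights.values())
    scores = {}
    for prey in counts:
        pct_n = 100.0 * counts[prey] / total_n
        pct_w = 100.0 * weights[prey] / total_w
        pct_f = 100.0 * occurrences[prey] / len(stomachs)
        scores[prey] = (pct_n + pct_w) * pct_f
    return scores

# Made-up sample: two stomachs, two prey categories.
sample = [
    {"shrimp": (3, 2.0), "fish": (1, 6.0)},
    {"shrimp": (1, 2.0)},
]
scores = iri_by_prey(sample)
```

In this made-up sample, shrimp outranks fish despite weighing less in total, because it is more numerous and occurs in every stomach.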
A Quasi-Bayesian Perspective to Online Clustering
When faced with high frequency streams of data, clustering raises theoretical
and algorithmic pitfalls. We introduce a new and adaptive online clustering
algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e.,
time-dependent) estimation of the (unknown and changing) number of clusters. We
prove that our approach is supported by minimax regret bounds. We also provide
an RJMCMC-flavored implementation (called PACBO, see
https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a
convergence guarantee. Finally, numerical experiments illustrate the potential
of our procedure.
Response Function of the Fractional Quantized Hall State on a Sphere I: Fermion Chern-Simons Theory
Using a well known singular gauge transformation, certain fractional
quantized Hall states can be modeled as integer quantized Hall states of
transformed fermions interacting with a Chern-Simons field. In previous work we
have calculated the electromagnetic response function of these states at
arbitrary frequency and wavevector by using the Random Phase Approximation
(RPA) in combination with a Landau Fermi Liquid approach. We now adapt these
calculations to a spherical geometry in order to facilitate comparison with
exact diagonalizations performed on finite size systems.
Comment: 39 pages (REVTeX 3.0). Postscript files for this paper are available
on the World Wide Web at http://cmtw.harvard.edu/~simon/ ; Preprint number
HU-CMT-94S0
Empirical study of error behavior in Web servers
The World Wide Web has been a huge success, bringing the Internet to widespread popularity. For Web-based systems to deal effectively with an increasing number of Web clients, it is very important to understand the fundamentals of Web workload and error characteristics. In this thesis we focus on a detailed empirical analysis of Web server error characteristics and reliability based on data extracted from eleven different Web servers. First, we address the data collection process and describe the methods for extracting workload and error data from Web logs. Then, we analyze the Web error characteristics, which include unique errors, the frequency of occurrence of unique errors, and the top files causing errors. Furthermore, we analyze the relationship between errors and Web workload and estimate request-based and session-based reliability. The discussion presented in this thesis shows that session-based reliability is a better indicator of user perception of Web quality than request-based reliability. Finally, we analyze and develop heuristic search criteria to identify sessions that indicate unusual server behavior, such as extremely long sessions and sessions with a large number of server errors. The results of our study provide valuable measures for tuning and maintaining Web servers.
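The abstract contrasts request-based and session-based reliability without defining them. The sketch below assumes one plausible reading: request-based reliability is the fraction of requests that succeed, and session-based reliability is the fraction of sessions containing no failed request. Treating any HTTP status of 400 or above as an error, the (session_id, status) record layout, and the function names are all assumptions made for illustration, not the thesis's definitions.

```python
def request_reliability(records):
    """Fraction of requests without an error.

    records: non-empty list of (session_id, http_status) tuples
    (hypothetical layout; a status >= 400 is assumed to be an error).
    """
    errors = sum(1 for _, status in records if status >= 400)
    return 1.0 - errors / len(records)

def session_reliability(records):
    """Fraction of sessions in which no request failed."""
    session_failed = {}
    for session_id, status in records:
        failed = status >= 400
        session_failed[session_id] = session_failed.get(session_id, False) or failed
    failed_sessions = sum(1 for failed in session_failed.values() if failed)
    return 1.0 - failed_sessions / len(session_failed)

# Made-up log: three sessions, six requests, one error.
log = [
    ("a", 200), ("a", 200), ("a", 404),
    ("b", 200), ("b", 200),
    ("c", 200),
]
req_rel = request_reliability(log)   # 5 of 6 requests succeed
sess_rel = session_reliability(log)  # 2 of 3 sessions are error-free
```

Under these assumptions a single error lowers the session-based figure more than the request-based one, since it marks the whole session as failed.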
Hybrid Information Retrieval Model For Web Images
The Big Bang of the Internet in the early 90's dramatically increased the
number of images being distributed and shared over the web. As a result, image
information retrieval systems were developed to index and retrieve image files
spread over the Internet. Most of these systems are keyword-based: they search
for images based on their textual metadata, and thus they are imprecise, since
describing an image in a human language is inherently vague. There also exist
content-based image retrieval systems, which search for images based on their
visual information. However, content-based systems are still immature and
not very effective, as they suffer from low retrieval recall and precision rates.
This paper proposes a new hybrid image information retrieval model for indexing
and retrieving web images published in HTML documents. The distinguishing mark
of the proposed model is that it is based on both graphical content and textual
metadata. The graphical content is denoted by color features and color
histogram of the image; while textual metadata are denoted by the terms that
surround the image in the HTML document, more particularly, the terms that
appear in the tags p, h1, and h2, in addition to the terms that appear in the
image's alt attribute, filename, and class-label. Moreover, this paper presents
a new term weighting scheme called VTF-IDF, short for Variable Term
Frequency-Inverse Document Frequency, which, unlike traditional schemes,
exploits the HTML tag structure and assigns an extra bonus weight to terms
that appear within particular HTML tags that are correlated to the
semantics of the image. Experiments conducted to evaluate the proposed IR model
showed a high retrieval precision rate that outpaced other current models.
Comment: LACSC - Lebanese Association for Computational Sciences,
http://www.lacsc.org/; International Journal of Computer Science & Emerging
Technologies (IJCSET), Vol. 3, No. 1, February 201
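The abstract above describes VTF-IDF only in outline: terms found in certain HTML tags earn a bonus weight. The sketch below illustrates that general idea and is not the paper's formula. The bonus values in TAG_BONUS, the (tag, term) input layout, and the plain log(N/df) IDF are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical per-tag bonus weights; the paper's actual values are not
# given in the abstract.
TAG_BONUS = {"alt": 3.0, "filename": 3.0, "h1": 2.0, "h2": 1.5, "p": 1.0}

def weighted_tf(tagged_terms):
    """Sum a tag-dependent weight for every occurrence of each term.

    tagged_terms: list of (tag, term) pairs extracted around one image.
    """
    tf = Counter()
    for tag, term in tagged_terms:
        tf[term] += TAG_BONUS.get(tag, 1.0)
    return tf

def tag_weighted_tf_idf(docs):
    """Return one {term: weight} dict per document (non-empty list)."""
    tfs = [weighted_tf(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tf in tfs for term in tf)
    return [
        {term: w * math.log(n_docs / df[term]) for term, w in tf.items()}
        for tf in tfs
    ]

# Made-up example: terms surrounding two images.
docs = [
    [("alt", "sunset"), ("p", "beach"), ("p", "photo")],
    [("h1", "city"), ("p", "photo")],
]
weights = tag_weighted_tf_idf(docs)
```

In this example "sunset" outweighs "beach" in the first document only because it comes from the alt attribute, while "photo" scores zero because it appears in every document.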