10,514 research outputs found
Automatic Classification of Text Databases through Query Probing
Many text databases on the web are "hidden" behind search interfaces, and
their documents are only accessible through querying. Search engines typically
ignore the contents of such search-only databases. Recently, Yahoo-like
directories have started to manually organize these databases into categories
that users can browse to find these valuable resources. We propose a novel
strategy to automate the classification of search-only text databases. Our
technique starts by training a rule-based document classifier, and then uses
the classifier's rules to generate probing queries. The queries are sent to the
text databases, which are then classified based on the number of matches that
they produce for each query. We report some initial exploratory experiments
that show that our approach is promising to automatically characterize the
contents of text databases accessible on the web.Comment: 7 pages, 1 figur
Towards the Automatic Classification of Documents in User-generated Classifications
There is a huge amount of information scattered on the World Wide Web. As the information flow occurs at a high speed in the WWW, there is a need to organize it in the right manner so that a user can access it very easily. Previously the organization of information was generally done manually, by matching the document contents to some pre-defined categories. There are two approaches for this text-based categorization: manual and automatic. In the manual approach, a human expert performs the classification task, and in the second case supervised classifiers are used to automatically classify resources. In a supervised classification, manual interaction is required to create some training data before the automatic classification task takes place. In our new approach, we intend to propose automatic classification of documents through semantic keywords and building the formulas generation by these keywords. Thus we can reduce this human participation by combining the knowledge of a given classification and the knowledge extracted from the data. The main focus of this PhD thesis, supervised by Prof. Fausto Giunchiglia, is the automatic classification of documents into user-generated classifications. The key benefits foreseen from this automatic document classification is not only related to search engines, but also to many other fields like, document organization, text filtering, semantic index managing
Graph-Sparse LDA: A Topic Model with Structured Sparsity
Originally designed to model text, topic modeling has become a powerful tool
for uncovering latent structure in domains including medicine, finance, and
vision. The goals for the model vary depending on the application: in some
cases, the discovered topics may be used for prediction or some other
downstream task. In other cases, the content of the topic itself may be of
intrinsic scientific interest.
Unfortunately, even using modern sparse techniques, the discovered topics are
often difficult to interpret due to the high dimensionality of the underlying
space. To improve topic interpretability, we introduce Graph-Sparse LDA, a
hierarchical topic model that leverages knowledge of relationships between
words (e.g., as encoded by an ontology). In our model, topics are summarized by
a few latent concept-words from the underlying graph that explain the observed
words. Graph-Sparse LDA recovers sparse, interpretable summaries on two
real-world biomedical datasets while matching state-of-the-art prediction
performance
Enriching very large ontologies using the WWW
This paper explores the possibility to exploit text on the world wide web in
order to enrich the concepts in existing ontologies. First, a method to
retrieve documents from the WWW related to a concept is described. These
document collections are used 1) to construct topic signatures (lists of
topically related words) for each concept in WordNet, and 2) to build
hierarchical clusters of the concepts (the word senses) that lexicalize a given
word. The overall goal is to overcome two shortcomings of WordNet: the lack of
topical links among concepts, and the proliferation of senses. Topic signatures
are validated on a word sense disambiguation task with good results, which are
improved when the hierarchical clusters are used.Comment: 6 page
A Route Confidence Evaluation Method for Reliable Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) is becoming increasingly important
with the rapidly growing amount of text data available in the World Wide Web.
Among the different strategies proposed to cope with HTC, the Local Classifier
per Node (LCN) approach attains good performance by mirroring the underlying
class hierarchy while enforcing a top-down strategy in the testing step.
However, the problem of embedding hierarchical information (parent-child
relationship) to improve the performance of HTC systems still remains open. A
confidence evaluation method for a selected route in the hierarchy is proposed
to evaluate the reliability of the final candidate labels in an HTC system. In
order to take into account the information embedded in the hierarchy, weight
factors are used to take into account the importance of each level. An
acceptance/rejection strategy in the top-down decision making process is
proposed, which improves the overall categorization accuracy by rejecting a few
percentage of samples, i.e., those with low reliability score. Experimental
results on the Reuters benchmark dataset (RCV1- v2) confirm the effectiveness
of the proposed method, compared to other state-of-the art HTC methods
- …