Reservoir of Diverse Adaptive Learners and Stacking Fast Hoeffding Drift Detection Methods for Evolving Data Streams
The last decade has seen a surge of interest in adaptive learning algorithms
for data stream classification, with applications ranging from predicting ozone
level peaks and learning stock market indicators to detecting computer security
violations. In addition, a number of methods have been developed to detect
concept drifts in these streams. Consider a scenario where we have a number of
classifiers with diverse learning styles and different drift detectors.
Intuitively, the current 'best' (classifier, detector) pair is application
dependent and may change as a result of the stream evolution. Our research
builds on this observation. We introduce the Tornado framework that
implements a reservoir of diverse classifiers, together with a variety of drift
detection algorithms. In our framework, all (classifier, detector) pairs
proceed, in parallel, to construct models against the evolving data streams. At
any point in time, we select the pair which currently yields the best
performance. We further incorporate two novel stacking-based drift detection
methods, namely the FHDDMS and FHDDMS_add approaches. The
experimental evaluation confirms that the current 'best' (classifier, detector)
pair is not only heavily dependent on the characteristics of the stream, but
also that this selection evolves as the stream flows. Further, our
FHDDMS variants detect concept drifts accurately and in a timely fashion
while outperforming the state of the art.
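
To make the selection mechanism concrete, the following is a minimal, hypothetical sketch (the duck-typed classifier/detector interfaces and all names are assumptions, not the authors' Tornado code): all (classifier, detector) pairs are trained prequentially in parallel, and the pair with the best running accuracy is reported as the current 'best'.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    classifier: object  # assumed interface: predict / partial_fit / reset
    detector: object    # assumed interface: add(correct: bool) -> drift flag
    correct: int = 0
    seen: int = 0

    def accuracy(self) -> float:
        return self.correct / self.seen if self.seen else 0.0

def run(pairs, stream):
    """Prequential test-then-train loop over a stream of (x, y) examples."""
    for x, y in stream:
        for p in pairs:
            ok = p.classifier.predict(x) == y
            p.seen += 1
            p.correct += int(ok)
            if p.detector.add(ok):           # detector signals a drift,
                p.classifier.reset()         # so discard the stale model
            p.classifier.partial_fit(x, y)   # then continue learning
        yield max(pairs, key=Pair.accuracy)  # the current 'best' pair
```

Running all pairs in parallel is what allows the 'best' pair to change as the stream evolves.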
Term frequency-information content for focused crawling to predict relevant web pages
With the rapid growth of the Web, finding desirable information on the Internet is a tedious and time-consuming task. Focused crawlers address this issue by mining Web content, and a variety of such methods have been devised and implemented. However, many of these methods, rooted in an information retrieval viewpoint, are not biased towards the more informative terms in multi-term topics (topics with more than one keyword). In this paper, we propose the Term Frequency-Information Content (TF-IC) method, which takes each term's information content into account and thereby assigns an appropriate weight to each term in a multi-term topic. In our experiments, we compare our method against Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Semantic Indexing (LSI). The results show that our method outperforms both by retrieving more relevant pages for multi-term topics.
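
To illustrate the weighting idea, here is a minimal sketch assuming a score of the form TF(t) * IC(t) with IC(t) = -log P(t) estimated from a background corpus; the function name, the smoothing, and the exact combination are illustrative assumptions rather than the paper's formula.

```python
import math
from collections import Counter

def tf_ic_score(page_tokens, topic_terms, background_counts, corpus_size):
    """Score a page against a multi-term topic: sum of TF(t) * IC(t)."""
    tf = Counter(page_tokens)
    score = 0.0
    for t in topic_terms:
        p_t = background_counts.get(t, 1) / corpus_size  # smoothed P(t)
        score += tf[t] * -math.log(p_t)                  # rare terms weigh more
    return score

# The rarer (more informative) topic term dominates the page score:
print(tf_ic_score(["crawler", "web", "crawler"], ["focused", "crawler"],
                  {"focused": 5000, "crawler": 50}, 100_000))
```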
Augmenting concept definition in gloss vector semantic relatedness measure using Wikipedia articles
Semantic relatedness measures are widely used in text mining and information retrieval applications. In this paper, we attempt to improve the Gloss Vector relatedness measure for a more accurate estimation of the relatedness between two given concepts. This measure constructs concept definitions (glosses) from a thesaurus and computes relatedness from the angle between the concepts' gloss vectors. Constructing definitions is challenging, however, as thesauruses do not provide full coverage of expressive definitions, particularly for specialized concepts. We therefore employ Wikipedia articles and other external resources to augment these concepts' definitions. Applying both definition types to the biomedical domain, using MEDLINE as the corpus, UMLS as the default thesaurus, and a reference standard of 68 concept pairs manually rated for relatedness, we show that exploiting available resources on the Web has a positive impact on the final measurement of semantic relatedness.
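
The underlying computation can be sketched as below; the toy co-occurrence model and all names are hypothetical simplifications of the Gloss Vector idea, not the paper's code. A concept's gloss vector is the sum of the co-occurrence vectors of the words in its (possibly Wikipedia-augmented) definition, and relatedness is the cosine of the angle between two gloss vectors.

```python
import math
from collections import Counter

def gloss_vector(gloss_words, cooccurrence):
    """Sum the co-occurrence vectors of the words in a concept's gloss."""
    vec = Counter()
    for w in gloss_words:
        vec.update(cooccurrence.get(w, {}))
    return vec

def relatedness(v1, v2):
    """Cosine of the angle between two gloss vectors."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(x * x for x in v1.values()))
            * math.sqrt(sum(x * x for x in v2.values())))
    return dot / norm if norm else 0.0

# Augmenting a short thesaurus gloss with Wikipedia text simply adds words,
# and hence co-occurrence mass, to the concept's vector:
cooc = {"heart": {"blood": 3, "muscle": 2}, "pump": {"blood": 4}}
print(relatedness(gloss_vector(["heart", "pump"], cooc),
                  gloss_vector(["heart"], cooc)))
```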
Self-Supervised Contrastive BERT Fine-tuning for Fusion-based Reviewed-Item Retrieval
As natural language interfaces enable users to express increasingly complex
natural language queries, there is a parallel explosion of user review content
that can allow users to better find items such as restaurants, books, or movies
that match these expressive queries. While Neural Information Retrieval (IR)
methods have provided state-of-the-art results for matching queries to
documents, they have not been extended to the task of Reviewed-Item Retrieval
(RIR), where query-review scores must be aggregated (or fused) into item-level
scores for ranking. In the absence of labeled RIR datasets, we extend Neural IR
methodology to RIR by leveraging self-supervised methods for contrastive
learning of BERT embeddings for both queries and reviews. Specifically,
contrastive learning requires a choice of positive and negative samples, where
the unique two-level structure of our item-review data combined with meta-data
affords us a rich structure for the selection of these samples. For contrastive
learning in a Late Fusion scenario, we investigate the use of positive review
samples from the same item and/or with the same rating, selection of hard
positive samples by choosing the least similar reviews from the same anchor
item, and selection of hard negative samples by choosing the most similar
reviews from different items. We also explore anchor sub-sampling and
augmenting with meta-data. For a more end-to-end Early Fusion approach, we
introduce contrastive item embedding learning to fuse reviews into single item
embeddings. Experimental results show that Late Fusion contrastive learning for
Neural RIR outperforms all other contrastive IR configurations, Neural IR, and
sparse retrieval baselines, thus demonstrating the power of exploiting the
two-level structure in Neural RIR approaches as well as the importance of
preserving the nuance of individual review content via Late Fusion methods.
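
As a concrete picture of the Late Fusion step, here is a minimal sketch assuming unit-normalized BERT embeddings; the top-k mean fusion, names, and shapes are illustrative assumptions, not the paper's exact aggregation.

```python
import numpy as np

def late_fusion_rank(query_emb, review_embs_by_item, top_k=5):
    """Rank items by fusing per-review cosine scores into item-level scores.

    review_embs_by_item: {item_id: (n_reviews, d) array of unit vectors}.
    """
    q = query_emb / np.linalg.norm(query_emb)
    item_scores = {}
    for item, reviews in review_embs_by_item.items():
        sims = reviews @ q                     # one score per review
        top = np.sort(sims)[-top_k:]           # keep the k best reviews
        item_scores[item] = float(top.mean())  # fuse into an item score
    return sorted(item_scores, key=item_scores.get, reverse=True)
```

Scoring reviews individually before fusing is what preserves the nuance of each review, in contrast to an Early Fusion design that first merges reviews into a single item embedding.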
Improving Gloss Vector Semantic Relatedness Measure by Integrating Pointwise Mutual Information: Optimizing Second-Order Co-occurrence Vectors Computed from Biomedical Corpus and UMLS
Methods of semantic relatedness are essential for a wide range of tasks such as information retrieval and text mining. This paper attempts to improve the Gloss Vector semantic relatedness measure for a more reliable estimation of the relatedness between two input concepts. Generally, this measure applies a frequency cut-off to bigrams in order to remove low- and high-frequency words, which usually do not end up being significant features. However, this naive cutting approach can lead to the loss of valuable information. By employing pointwise mutual information (PMI) as a measure of association between features, we perform this elimination step in a statistical fashion. Applying both approaches to the biomedical domain, using MEDLINE as the corpus, MeSH as the thesaurus, and an available reference standard of 311 concept pairs manually rated for semantic relatedness, we show that using PMI to remove insignificant features is more effective than a frequency cut-off.
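
A minimal sketch of the PMI filtering step, assuming the standard definition PMI(w1, w2) = log(P(w1, w2) / (P(w1) P(w2))); the threshold value and helper names are illustrative assumptions rather than the paper's settings.

```python
import math

def pmi(pair_count, count1, count2, total):
    """Pointwise mutual information of a bigram from corpus counts."""
    p_xy = pair_count / total
    p_x, p_y = count1 / total, count2 / total
    return math.log(p_xy / (p_x * p_y))

def keep_feature(pair_count, count1, count2, total, threshold=2.0):
    """Keep a bigram feature only if its PMI exceeds the threshold."""
    return pmi(pair_count, count1, count2, total) >= threshold

# A pair that co-occurs far more often than chance survives the filter,
# regardless of its raw frequency:
print(keep_feature(pair_count=40, count1=100, count2=120, total=100_000))
```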