Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines
A cross-disciplinary examination of the user behaviours involved in seeking
and evaluating data is surprisingly absent from the research data discussion.
This review explores the data retrieval literature to identify commonalities in
how users search for and evaluate observational research data. Two analytical
frameworks rooted in information retrieval and science technology studies are
used to identify key similarities in practices as a first step toward
developing a model describing data retrieval.
Medical Image Classification via SVM using LBP Features from Saliency-Based Folded Data
Good results on image classification and retrieval using support vector
machines (SVM) with local binary patterns (LBPs) as features have been
extensively reported in the literature where an entire image is retrieved or
classified. In contrast, in medical imaging, not all parts of the image may be
equally significant or relevant to the image retrieval application at hand. For
instance, in a lung x-ray image, the lung region may contain a tumour and hence
be highly significant, whereas the surrounding area does not contain
significant information from a medical diagnosis perspective. In this paper, we
propose to detect salient regions of images during training and fold the data
to reduce the effect of irrelevant regions. As a result, smaller image areas
are used for LBP feature calculation and, consequently, for classification by
SVM. We use the IRMA 2009 dataset with 14,410 x-ray images to verify the
performance of the proposed approach. The results demonstrate the benefits of
the saliency-based folding approach, which delivers classification accuracies
comparable with the state of the art but exhibits lower computational cost and
storage requirements, factors highly important for big data analytics.
Comment: To appear in proceedings of The 14th International Conference on
Machine Learning and Applications (IEEE ICMLA 2015), Miami, Florida, USA,
2015
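As a rough illustration of the feature step this abstract describes, the sketch below computes a basic 3x3 LBP histogram in plain Python and restricts it to a region of interest. It is not the authors' method: the `crop_to_mask` helper is a hypothetical stand-in that simply crops to the bounding box of a given mask, whereas the paper's saliency detection and folding are not specified in the abstract. The resulting histogram is the kind of vector that would then be passed to an SVM.

```python
def lbp_histogram(image):
    """Normalised 256-bin histogram of basic 3x3 LBP codes.

    `image` is a 2-D list of grey levels; border pixels are skipped.
    """
    # Eight neighbours, clockwise from top-left; each one sets one bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            centre = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if image[y + dy][x + dx] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist


def crop_to_mask(image, mask):
    """Crop `image` to the bounding box of truthy cells in `mask`.

    Assumption: a simple placeholder for the paper's saliency-based folding.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        return image
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]
```

Computing `lbp_histogram(crop_to_mask(image, mask))` instead of `lbp_histogram(image)` processes fewer pixels, which is the intuition behind the lower computational cost the abstract reports.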
Big data analytics: Computational intelligence techniques and application areas
Big Data has a significant impact in developing functional smart cities and supporting modern societies. In this paper, we investigate the importance of Big Data in modern life and economy, and discuss challenges arising from Big Data utilization. Different computational intelligence techniques have been considered as tools for Big Data analytics. We also explore the powerful combination of Big Data and Computational Intelligence (CI) and identify a number of areas where novel applications in real-world smart city problems can be developed by utilizing these powerful tools and techniques. We present a case study for intelligent transportation in the context of a smart city, and a novel data modelling methodology based on a biologically inspired universal generative modelling approach called Hierarchical Spatial-Temporal State Machine (HSTSM). We further discuss various implications of policy, protection, valuation and commercialization related to Big Data, its applications and deployment.
Improving average ranking precision in user searches for biomedical research datasets
Availability of research datasets is a keystone for health and life science
study reproducibility and scientific progress. Due to the heterogeneity and
complexity of these data, a main challenge to be overcome by research data
management systems is to provide users with the best answers for their search
queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we
investigate a novel ranking pipeline to improve the search of datasets used in
biomedical experiments. Our system comprises a query expansion model based on
word embeddings, a similarity measure algorithm that takes into consideration
the relevance of the query terms, and a dataset categorisation method that
boosts the rank of datasets matching query constraints. The system was
evaluated using a corpus with 800k datasets and 21 annotated user queries. Our
system provides competitive results when compared to the other challenge
participants. In the official run, it achieved the highest infAP among the
participants, being +22.3% higher than the median infAP of the participants'
best submissions. Overall, it is ranked in the top 2 if an aggregated metric using
the best official measures per participant is considered. The query expansion
method showed positive impact on the system's performance increasing our
baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively.
Our similarity measure algorithm seems to be robust, in particular compared to
the Divergence From Randomness framework, having smaller performance variations
under different training conditions. Finally, the result categorization did not
have significant impact on the system's performance. We believe that our
solution could be used to enhance biomedical dataset management systems. In
particular, the use of data driven query expansion methods could be an
alternative to the complexity of biomedical terminologies.
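The query expansion idea mentioned in this abstract can be sketched as follows, under stated assumptions: the embedding table `TOY_EMBEDDINGS`, the function name `expand_query`, and the `k` and `threshold` parameters are all hypothetical and chosen for illustration. The abstract does not say how the authors select or weight expansion terms, so this is a generic nearest-neighbour expansion by cosine similarity, not their pipeline.

```python
import math

# Hypothetical 3-dimensional vectors; real systems would load trained embeddings.
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumour": [0.85, 0.2, 0.0],
    "carcinoma": [0.8, 0.15, 0.1],
    "mouse": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2, threshold=0.8):
    """Append up to `k` nearest vocabulary words for each known query term."""
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue  # unknown terms are kept but not expanded
        scored = sorted(
            ((cosine(embeddings[term], vec), word)
             for word, vec in embeddings.items() if word not in expanded),
            reverse=True,
        )
        expanded.extend(word for score, word in scored[:k] if score >= threshold)
    return expanded
```

With the toy table, `expand_query(["cancer"], TOY_EMBEDDINGS)` adds "tumour" and "carcinoma" but not "mouse", since only the first two clear the similarity threshold.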
Unleashing the Power of Hashtags in Tweet Analytics with Distributed Framework on Apache Storm
Twitter is a popular social network platform where users can interact and
post texts of up to 280 characters called tweets. Hashtags, hyperlinked words
in tweets, have increasingly become crucial for tweet retrieval and search.
Using hashtags for tweet topic classification is a challenging problem because
of context dependence among words, slang, abbreviations and emoticons in a short
tweet, along with the evolving use of hashtags. Since Twitter generates millions of
tweets daily, tweet analytics is a fundamental big data stream problem that
often requires real-time distributed processing. This paper proposes a
distributed online approach to tweet topic classification with hashtags. Being
implemented on Apache Storm, a distributed real time framework, our approach
incrementally identifies and updates a set of strong predictors in the Naïve
Bayes model for classifying each incoming tweet instance. Preliminary
experiments show promising results with up to 97% accuracy and 37% increase in
throughput on eight processors.
Comment: IEEE International Conference on Big Data 201
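To make the incremental classification idea concrete, here is a minimal single-process sketch of a Naïve Bayes model that is updated one tweet at a time. It makes several assumptions not taken from the abstract: the class name `OnlineNaiveBayes` is hypothetical, all tokens (hashtags included) are treated alike rather than selecting "strong predictors", Laplace smoothing with `alpha` is used, and nothing here is distributed or tied to Apache Storm.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated per labelled tweet."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.class_counts = Counter()             # tweets seen per topic
        self.token_counts = defaultdict(Counter)  # token counts per topic
        self.vocabulary = set()

    def update(self, tokens, label):
        """Fold one labelled tweet into the running counts."""
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the topic with the highest log-posterior, or None if untrained."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.token_counts[label]
            # One extra vocabulary slot is reserved for unseen tokens.
            denom = sum(counts.values()) + self.alpha * (len(self.vocabulary) + 1)
            score = math.log(count / total)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because `update` only increments counters, each incoming tweet can be classified with `predict` and then folded into the model without retraining from scratch, which is the property a streaming deployment relies on.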