On Refining Twitter Lists as Ground Truth Data for Multi-Community User Classification
To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground truth datasets obtained using such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground truth dataset generated using Twitter lists. We discuss how some categories of users collected from these Twitter public lists could negatively affect the classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profile can indeed result in a 10% increase in F1 performance
Learning Sentence-internal Temporal Relations
In this paper we propose a data intensive approach for inferring
sentence-internal temporal relations. Temporal inference is relevant for
practical NLP applications which either extract or synthesize temporal
information (e.g., summarisation, question answering). Our method bypasses the
need for manual coding by exploiting the presence of markers like "after", which
overtly signal a temporal relation. We first show that models trained on main
and subordinate clauses connected with a temporal marker achieve good
performance on a pseudo-disambiguation task simulating temporal inference
(during testing the temporal marker is treated as unseen and the models must
select the right marker from a set of possible candidates). Secondly, we assess
whether the proposed approach holds promise for the semi-automatic creation of
temporal annotations. Specifically, we use a model trained on noisy and
approximate data (i.e., main and subordinate clauses) to predict
intra-sentential relations present in TimeBank, a corpus annotated with rich
temporal information. Our experiments compare and contrast several
probabilistic models differing in their feature space, linguistic assumptions
and data requirements. We evaluate performance against gold standard corpora
and also against human subjects
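The pseudo-disambiguation task described above can be illustrated with a minimal sketch. It assumes a naive Bayes model over bag-of-words features, which is only one of many possible probabilistic models and is not claimed to be the paper's; the training sentences and the names `TRAIN`, `features`, `train` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (main clause, subordinate clause, temporal marker).
TRAIN = [
    ("she left the office", "she finished the report", "after"),
    ("he went home", "the meeting ended", "after"),
    ("they locked the door", "they left the house", "before"),
    ("she checked the map", "she started the trip", "before"),
    ("he read a book", "he waited for the train", "while"),
    ("she sang", "she cooked dinner", "while"),
]

def features(main, sub):
    # Bag-of-words features, tagged by the clause they come from.
    return [f"M:{w}" for w in main.split()] + [f"S:{w}" for w in sub.split()]

def train(examples):
    marker_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for main, sub, marker in examples:
        marker_counts[marker] += 1
        for f in features(main, sub):
            feat_counts[marker][f] += 1
            vocab.add(f)
    return marker_counts, feat_counts, vocab

def predict(model, main, sub, candidates):
    # The marker is hidden at test time; pick the best-scoring candidate.
    # Every candidate must have been seen during training.
    marker_counts, feat_counts, vocab = model
    total = sum(marker_counts.values())

    def score(marker):
        s = math.log(marker_counts[marker] / total)
        denom = sum(feat_counts[marker].values()) + len(vocab)
        for f in features(main, sub):
            # Add-one smoothing so unseen features do not zero out the score.
            s += math.log((feat_counts[marker][f] + 1) / denom)
        return s

    return max(candidates, key=score)

model = train(TRAIN)
guess = predict(model, "he went home", "the meeting ended",
                ["after", "before", "while"])
```

Because the training signal comes from the marker that already occurs in each sentence, no manual annotation is needed to build `TRAIN`.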
Finding kernel function for stock market prediction with support vector regression
Stock market prediction is one of the most fascinating problems in stock market research. Accurate stock prediction is the biggest challenge in the investment industry because the distribution of stock data changes over time. Time series forecasting, neural networks (NN) and support vector machines (SVM) are commonly used to predict stock prices. In this study, the data mining operation called time series forecasting is implemented. A large amount of stock data collected from the Kuala Lumpur Stock Exchange (KLSE) is used in experiments to test the validity of SVM regression. SVM is a new machine learning technique based on the principle of structural risk minimization, which has greater generalization ability and has proved successful in time series prediction. Two kernel functions, namely the Radial Basis Function and the polynomial kernel, are compared to find which gives the more accurate predictions. In addition, a backpropagation neural network is used to compare prediction performance. Several experiments are conducted and the experimental results are analysed. The results show that SVM with polynomial kernels provides a promising alternative tool for KLSE stock market prediction
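For readers unfamiliar with the two kernels being compared, a minimal sketch of their standard textbook definitions follows. The parameter names and default values (`gamma`, `degree`, `coef0`) are assumptions for illustration; the abstract does not report the settings used in the study.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # Radial Basis Function kernel: exp(-gamma * ||x - y||^2).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # Polynomial kernel: (x . y + coef0) ** degree.
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree

# Two hypothetical feature vectors, e.g. lagged and scaled prices.
u = [1.0, 2.0]
v = [3.0, 4.0]
k_rbf = rbf_kernel(u, v)
k_poly = polynomial_kernel(u, v)
```

The RBF kernel depends only on the distance between the two inputs, whereas the polynomial kernel depends on their inner product, so the two can rank the same pair of samples quite differently.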
Towards machine learning approach for digital-health intervention program
Digital-health interventions (DHIs) are used by health care providers to promote engagement within the community. Effective assignment of participants to DHI programs helps increase the benefit gained from the most suitable intervention. A major challenge in the roll-out and implementation of DHIs is assigning participants to different interventions. The use of the biopsychosocial model [18] for this purpose is not widespread, owing to the limited number of personalized interventions built on evidence-based, data-driven models. Machine learning has changed the way data extraction and interpretation work by providing automatic, generic methods that have replaced traditional statistical techniques. In this paper, we investigate the relevance of machine learning for this purpose by studying different non-linear classifiers and comparing their prediction accuracy to evaluate their suitability. Further, as a novel contribution, real-life biopsychosocial features are used as input in this study. The results help in developing an appropriate predictive classification model to assign participants to the most suitable DHI. We analyze biopsychosocial data generated from a DHI program and study the feature characteristics using scatter plots. While scatter plots are unable to reveal the linear relationships in the data set, classifiers can successfully identify which features are suitable predictors of mental ill health
Multimodal Content Analysis for Effective Advertisements on YouTube
The rapid advances in e-commerce and Web 2.0 technologies have greatly
increased the impact of commercial advertisements on the general public. As a
key enabling technology, a multitude of recommender systems exist which
analyze user features and browsing patterns to recommend appealing
advertisements to users. In this work, we seek to study the characteristics or
attributes that characterize an effective advertisement and recommend a useful
set of features to aid the designing and production processes of commercial
advertisements. We analyze the temporal patterns from multimedia content of
advertisement videos including auditory, visual and textual components, and
study their individual roles and synergies in the success of an advertisement.
The objective of this work is then to measure the effectiveness of an
advertisement, and to recommend a useful set of features to advertisement
designers to make it more successful and approachable to users. Our proposed
framework employs the signal processing technique of cross modality feature
learning where data streams from different components are employed to train
separate neural network models and are then fused together to learn a shared
representation. Subsequently, a neural network model trained on this joint
feature embedding representation is utilized as a classifier to predict
advertisement effectiveness. We validate our approach using subjective ratings
from a dedicated user study, the sentiment strength of online viewer comments,
and a viewer opinion metric of the ratio of the Likes and Views received by
each advertisement from an online platform.
Comment: 11 pages, 5 figures, ICDM 201
ActiveRemediation: The Search for Lead Pipes in Flint, Michigan
We detail our ongoing work in Flint, Michigan to detect pipes made of lead
and other hazardous metals. After elevated levels of lead were detected in
residents' drinking water, followed by an increase in blood lead levels in area
children, the state and federal governments directed over $125 million to
replace water service lines, the pipes connecting each home to the water
system. In the absence of accurate records, and with the high cost of
determining buried pipe materials, we put forth a number of predictive and
procedural tools to aid in the search and removal of lead infrastructure.
Alongside these statistical and machine learning approaches, we describe our
interactions with government officials in recommending homes for both
inspection and replacement, with a focus on the statistical model that adapts
to incoming information. Finally, in light of discussions about increased
spending on infrastructure development by the federal government, we explore
how our approach generalizes beyond Flint to other municipalities nationwide.
Comment: 10 pages, 10 figures, To appear in KDD 2018, For associated
promotional video, see https://www.youtube.com/watch?v=YbIn_axYu9
Bagging and boosting classification trees to predict churn.
In this paper, bagging and boosting techniques are proposed as effective tools for churn prediction. These methods consist of sequentially applying a classification algorithm to resampled or reweighted versions of the data set. We apply these algorithms to a customer database of an anonymous U.S. wireless telecom company. Bagging is easy to put into practice and, like boosting, leads to a significant increase in classification performance when applied to the customer database. Furthermore, we compare bagged and boosted classifiers computed, respectively, from a balanced versus a proportional sample to predict a rare event (here, churn), and propose a simple correction method for classifiers constructed from balanced training samples.
Keywords: Algorithms; Bagging; Boosting; Churn; Classification; Classifiers; Companies; Data; Gini coefficient; Methods; Performance; Rare events; Sampling; Top decile; Training
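The resampling idea behind bagging can be sketched as follows. This is a minimal illustration that assumes one-dimensional decision stumps as the base classifier and toy data; the paper itself bags classification trees on a telecom customer database, and the names `fit_stump`, `bagged_stumps` and `predict` are hypothetical.

```python
import random

def fit_stump(xs, ys):
    # Choose the threshold t that minimises training errors
    # for the rule "predict 1 if x > t".
    best_t, best_err = float("-inf"), len(xs) + 1
    for t in [float("-inf")] + sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_stumps(xs, ys, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict(stumps, x):
    # Majority vote over the individual stumps.
    votes = sum(x > t for t in stumps)
    return int(votes * 2 > len(stumps))

# Toy data (assumed): small x means "stays" (0), large x means "churns" (1).
xs = list(range(10))
ys = [0] * 5 + [1] * 5
stumps = bagged_stumps(xs, ys)
```

Boosting differs in that each round reweights the training examples according to the errors of the previous round instead of drawing an independent bootstrap sample.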
Enhancing Fund Selection Using Supervised Machine Learning : Evidence From the Nordic Mutual Fund Market
In this research we aim to extend the literature on the performance predictability in actively
managed mutual funds. We use the Nordic mutual fund market as our laboratory. We
develop a performance-enhancing system to assist retail investors in selecting mutual funds
by utilizing gradient boosting, random forest, and deep neural networks. Furthermore, we
seek to obtain positive abnormal returns from our predicted quintile portfolios. We thus
retrieve data free of survivorship bias for 2748 Nordic mutual funds from Morningstar
Direct. First, we run the algorithms to test the possibility of classifying alphas. Secondly,
we create a ranking system that categorizes funds based on predicted alpha, enabling us
to separate the best from the worst-performing mutual funds. Lastly, we benchmark
our findings against Morningstar’s acknowledged rating platform to examine whether
our top quintile portfolios manage to outperform Morningstar’s top quintile portfolio.
We find that our models can classify the sign of the alpha coefficient, with gradient
boosting and random forest doing this exceptionally well. Further, we manage to create
a categorization system significantly outperforming both an equally weighted and asset
weighted benchmark on risk-adjusted returns. Finally, our best performing portfolios
generate risk-adjusted returns in excess of Morningstar, although only significantly for
gradient boosting. Results are further robust to changes in risk-adjustment models for
both equity funds and fixed income funds. The findings are consistent with the current
machine learning literature and enable us to state that machine learning algorithms can
be used to select successful mutual funds.
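The ranking step that turns predicted alphas into quintile portfolios can be sketched as below. This is a minimal illustration under assumed inputs: the fund identifiers and alpha values are made up, and the function name `quintile_portfolios` is hypothetical, not taken from the thesis.

```python
def quintile_portfolios(predicted_alpha):
    # Rank funds from highest to lowest predicted alpha,
    # then cut the ranking into five groups of (near) equal size.
    ranked = sorted(predicted_alpha, key=predicted_alpha.get, reverse=True)
    n = len(ranked)
    return [ranked[i * n // 5:(i + 1) * n // 5] for i in range(5)]

# Hypothetical predicted alphas for ten funds, F0 (best) to F9 (worst).
alphas = {f"F{i}": 0.05 - 0.01 * i for i in range(10)}
quintiles = quintile_portfolios(alphas)
top_quintile = quintiles[0]
bottom_quintile = quintiles[4]
```

The top quintile would then be held as a portfolio and its risk-adjusted return compared with a benchmark.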
Assessing Convolutional Neural Network Animal Classification Models for Practical Applications in Wildlife Conservation
Convolutional neural network (CNN) models can successfully identify animal species in camera-trap images in simplified testing environments. CNN performance in more complex, realistic environments is understudied. Here the Wellington Camera Traps dataset was used to simulate a wildlife conservation project to detect invasive species at low population levels using camera-trap images and CNN models. Ten CNNs were developed and analyzed with seven testing datasets, simulating 13 possible project scenarios. Model performance was measured using standard computer science metrics (top-1 and top-5 accuracy) and two novel performance metrics developed for this research to directly reflect wildlife conservation goals (false alarm rate and missed invasive rate). The highest performing models achieved 91.8% top-1 and 99.6% top-5 accuracy; however, these models also had the highest missed invasive rates. This effect was related to the ratio of native to invasive species in the model's training images. As this ratio increased, so did the model's top-1 and top-5 accuracy, but also the missed invasive rate. Thus, to achieve optimal performance when selecting or training a CNN for use in a wildlife camera-trap project, the metric used to judge the performance of the model must be tailored to the specific goals of the project, and the distribution of species in the model's training images must match the distribution that will be seen in the project's camera-trap images
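A small sketch can show why accuracy and a conservation-oriented metric may disagree. Top-k accuracy below follows its standard definition. The abstract does not define the missed invasive rate, so the version here is an assumed definition (the share of invasive-species images whose top-1 prediction is not an invasive species). The species and scores are made up.

```python
def top_k_accuracy(scores, labels, k):
    # Fraction of images whose true label is among the k highest-scoring classes.
    hits = 0
    for class_scores, true_label in zip(scores, labels):
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)

def missed_invasive_rate(scores, labels, invasive):
    # ASSUMED definition, not taken from the paper: share of invasive-species
    # images whose top-1 prediction is not an invasive species.
    rows = [(s, y) for s, y in zip(scores, labels) if y in invasive]
    missed = sum(1 for s, _ in rows if max(s, key=s.get) not in invasive)
    return missed / len(rows)

# Hypothetical per-image class scores and true labels.
scores = [
    {"bird": 0.7, "cat": 0.2, "rat": 0.1},
    {"bird": 0.6, "cat": 0.1, "rat": 0.3},
    {"bird": 0.2, "cat": 0.1, "rat": 0.7},
    {"bird": 0.5, "cat": 0.4, "rat": 0.1},
]
labels = ["bird", "rat", "rat", "bird"]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
missed = missed_invasive_rate(scores, labels, invasive={"rat"})
```

In this toy example the model is correct on three of four images overall, yet it misses half of the invasive animals, which is the kind of gap the abstract warns about.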