
    On Refining Twitter Lists as Ground Truth Data for Multi-Community User Classification

    To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground truth datasets obtained using such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground truth dataset generated using Twitter lists. We discuss how some categories of users collected from these public Twitter lists could negatively affect the classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profile can indeed result in a 10% increase in F1 performance.

    Learning Sentence-internal Temporal Relations

    In this paper we propose a data-intensive approach for inferring sentence-internal temporal relations. Temporal inference is relevant for practical NLP applications which either extract or synthesize temporal information (e.g., summarisation, question answering). Our method bypasses the need for manual coding by exploiting the presence of markers like "after", which overtly signal a temporal relation. We first show that models trained on main and subordinate clauses connected with a temporal marker achieve good performance on a pseudo-disambiguation task simulating temporal inference (during testing the temporal marker is treated as unseen and the models must select the right marker from a set of possible candidates). Secondly, we assess whether the proposed approach holds promise for the semi-automatic creation of temporal annotations. Specifically, we use a model trained on noisy and approximate data (i.e., main and subordinate clauses) to predict intra-sentential relations present in TimeBank, a corpus annotated with rich temporal information. Our experiments compare and contrast several probabilistic models differing in their feature space, linguistic assumptions and data requirements. We evaluate performance against gold standard corpora and also against human subjects.

    Finding kernel function for stock market prediction with support vector regression

    Stock market prediction is one of the fascinating issues of stock market research. Accurate stock prediction has become the biggest challenge in the investment industry because the distribution of stock data changes over time. Time series forecasting, neural networks (NN) and support vector machines (SVM) are commonly used for stock price prediction. In this study, the data mining operation called time series forecasting is implemented. A large amount of stock data collected from the Kuala Lumpur Stock Exchange (KLSE) is used in the experiment to test the validity of SVM regression. SVM is a new machine learning technique based on the principle of structural risk minimization, which has greater generalization ability and has proved successful in time series prediction. Two kernel functions, namely the Radial Basis Function and the polynomial kernel, are compared to find the more accurate prediction values. In addition, a backpropagation neural network is used to compare prediction performance. Several experiments are conducted and the experimental results are analysed. The results show that SVM with polynomial kernels provides a promising alternative tool for KLSE stock market prediction.
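
The two kernel functions compared in this abstract can be illustrated with a minimal sketch. It uses the textbook definitions of the Radial Basis Function and polynomial kernels; the parameter values (`gamma`, `degree`, `coef0`) and the example feature vectors are assumptions for illustration and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial Basis Function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Hypothetical feature vectors, e.g. windows of normalised closing prices.
window_a = [0.2, 0.4, 0.1]
window_b = [0.3, 0.1, 0.5]

print("RBF similarity:", rbf_kernel(window_a, window_b))
print("Polynomial similarity:", polynomial_kernel(window_a, window_b))
```

A support vector regression model would evaluate one of these functions on pairs of training windows; which kernel predicts better is an empirical question of the kind the study examines.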

    Towards machine learning approach for digital-health intervention program

    Digital-health interventions (DHIs) are used by health care providers to promote engagement within the community. Effective assignment of participants to DHI programs helps increase the benefit gained from the most suitable intervention. A major challenge in the roll-out and implementation of DHIs is assigning participants to different interventions. The use of the biopsychosocial model [18] for this purpose is not widespread, due to the limited number of personalized interventions based on evidence-based, data-driven models. Machine learning has changed the way data extraction and interpretation work by introducing automatic sets of generic methods that have replaced traditional statistical techniques. In this paper, we investigate the relevance of machine learning for this purpose by studying different non-linear classifiers and comparing their prediction accuracy to evaluate their suitability. Further, as a novel contribution, real-life biopsychosocial features are used as input in this study. The results help in developing an appropriate predictive classification model to assign participants to the most suitable DHI. We analyze biopsychosocial data generated from a DHI program and study their feature characteristics using scatter plots. While scatter plots are unable to reveal the linear relationships in the dataset, the use of classifiers can successfully identify which features are suitable predictors of mental ill health.

    Multimodal Content Analysis for Effective Advertisements on YouTube

    The rapid advances in e-commerce and Web 2.0 technologies have greatly increased the impact of commercial advertisements on the general public. As a key enabling technology, a multitude of recommender systems exist which analyze user features and browsing patterns to recommend appealing advertisements to users. In this work, we seek to study the characteristics or attributes that characterize an effective advertisement and recommend a useful set of features to aid the designing and production processes of commercial advertisements. We analyze the temporal patterns from multimedia content of advertisement videos including auditory, visual and textual components, and study their individual roles and synergies in the success of an advertisement. The objective of this work is then to measure the effectiveness of an advertisement, and to recommend a useful set of features to advertisement designers to make it more successful and approachable to users. Our proposed framework employs the signal processing technique of cross modality feature learning where data streams from different components are employed to train separate neural network models and are then fused together to learn a shared representation. Subsequently, a neural network model trained on this joint feature embedding representation is utilized as a classifier to predict advertisement effectiveness. We validate our approach using subjective ratings from a dedicated user study, the sentiment strength of online viewer comments, and a viewer opinion metric of the ratio of the Likes and Views received by each advertisement from an online platform. Comment: 11 pages, 5 figures, ICDM 201

    ActiveRemediation: The Search for Lead Pipes in Flint, Michigan

    We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents' drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence of accurate records, and with the high cost of determining buried pipe materials, we put forth a number of predictive and procedural tools to aid in the search and removal of lead infrastructure. Alongside these statistical and machine learning approaches, we describe our interactions with government officials in recommending homes for both inspection and replacement, with a focus on the statistical model that adapts to incoming information. Finally, in light of discussions about increased spending on infrastructure development by the federal government, we explore how our approach generalizes beyond Flint to other municipalities nationwide. Comment: 10 pages, 10 figures, To appear in KDD 2018, For associated promotional video, see https://www.youtube.com/watch?v=YbIn_axYu9

    Bagging and boosting classification trees to predict churn.

    In this paper, bagging and boosting techniques are proposed as high-performing tools for churn prediction. These methods consist of sequentially applying a classification algorithm to resampled or reweighted versions of the data set. We apply these algorithms to a customer database of an anonymous U.S. wireless telecom company. Bagging is easy to put into practice and, like boosting, leads to a significant increase in classification performance when applied to the customer database. Furthermore, we compare bagged and boosted classifiers computed, respectively, from a balanced versus a proportional sample to predict a rare event (here, churn), and propose a simple correction method for classifiers constructed from balanced training samples. Keywords: Algorithms; Bagging; Boosting; Churn; Classification; Classifiers; Companies; Data; Gini coefficient; Methods; Performance; Rare events; Sampling; Top decile; Training
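
To make concrete why a classifier trained on a balanced sample needs correcting when the event is rare, here is a minimal sketch of the standard prior-shift adjustment of predicted probabilities. This is an assumption chosen for illustration: the abstract does not specify the paper's own correction method, and the function name, the 2% churn rate and the 50% sample rate are hypothetical.

```python
def correct_balanced_probability(p_balanced, true_rate, sample_rate=0.5):
    """Rescale a probability from a model trained on a resampled set.

    p_balanced:  churn probability predicted by the model
    true_rate:   churn rate in the real population (assumed known)
    sample_rate: churn rate in the resampled training set
    """
    # Reweight each class by the ratio of its true prior to its sampled prior.
    pos = p_balanced * true_rate / sample_rate
    neg = (1.0 - p_balanced) * (1.0 - true_rate) / (1.0 - sample_rate)
    return pos / (pos + neg)


# Hypothetical scores from a model trained on a 50/50 balanced sample,
# mapped back to a population where 2% of customers churn.
for score in (0.2, 0.5, 0.9):
    print(score, "->", round(correct_balanced_probability(score, 0.02), 4))
```

The adjustment is monotonic, so it leaves the ranking of customers unchanged and affects only the probability scale and any fixed classification threshold.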

    Enhancing Fund Selection Using Supervised Machine Learning : Evidence From the Nordic Mutual Fund Market

    In this research we aim to extend the literature on performance predictability in actively managed mutual funds. We use the Nordic mutual fund market as our laboratory. We develop a performance-enhancing system to assist retail investors in selecting mutual funds by utilizing gradient boosting, random forest, and deep neural networks. Furthermore, we seek to obtain positive abnormal returns from our predicted quintile portfolios. We thus retrieve data free of survivorship bias for 2748 Nordic mutual funds from Morningstar Direct. First, we run the algorithms to test the possibility of classifying alphas. Secondly, we create a ranking system that categorizes funds based on predicted alpha, enabling us to separate the best from the worst-performing mutual funds. Lastly, we benchmark our findings against Morningstar’s acknowledged rating platform to examine whether our top quintile portfolios manage to outperform Morningstar’s top quintile portfolio. We find that our models can classify the sign of the alpha coefficient, with gradient boosting and random forest doing this exceptionally well. Further, we manage to create a categorization system significantly outperforming both an equally weighted and an asset-weighted benchmark on risk-adjusted returns. Finally, our best performing portfolios generate risk-adjusted returns in excess of Morningstar’s, although only significantly for gradient boosting. The results are also robust to changes in risk-adjustment models for both equity funds and fixed income funds. The findings are consistent with the current machine learning literature and enable us to state that machine learning algorithms can be used to select successful mutual funds.
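
The ranking step described in this abstract, sorting funds into quintiles by predicted alpha, can be sketched as follows. This is only an illustration under assumptions: the fund names and alpha values are invented, and the abstract does not state how the authors handle ties or uneven group sizes.

```python
def assign_quintiles(predicted_alpha):
    """Map each fund to a quintile: 1 = highest predicted alpha, 5 = lowest."""
    # Rank funds from highest to lowest predicted alpha.
    ranked = sorted(predicted_alpha, key=predicted_alpha.get, reverse=True)
    n = len(ranked)
    # Integer arithmetic splits the ranking into five near-equal groups.
    return {fund: (i * 5) // n + 1 for i, fund in enumerate(ranked)}


# Hypothetical model output: predicted alpha per fund.
predictions = {
    "Fund A": 0.021, "Fund B": -0.004, "Fund C": 0.013, "Fund D": 0.002,
    "Fund E": -0.011, "Fund F": 0.008, "Fund G": -0.001, "Fund H": 0.017,
    "Fund I": -0.007, "Fund J": 0.005,
}

quintiles = assign_quintiles(predictions)
top_portfolio = sorted(f for f, q in quintiles.items() if q == 1)
print("Top quintile portfolio:", top_portfolio)
```

The funds in quintile 1 would form the top portfolio whose returns are then compared against a benchmark.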

    Assessing Convolutional Neural Network Animal Classification Models for Practical Applications in Wildlife Conservation

    Convolutional neural network (CNN) models can successfully identify animal species in camera-trap images in simplified testing environments. CNN performance in more complex, realistic environments is understudied. Here the Wellington Camera Traps dataset was used to simulate a wildlife conservation project to detect invasive species at low population levels using camera-trap images and CNN models. Ten CNNs were developed and analyzed with seven testing datasets, simulating 13 possible project scenarios. Model performance was measured using two standard computer science metrics, top-1 and top-5 accuracy, and two novel performance metrics developed for this research to directly reflect wildlife conservation goals: false alarm rate and missed invasive rate. The highest performing models achieved 91.8% and 99.6% top-1 and top-5 accuracy; however, these models also had the highest missed invasive rates. This effect was related to the ratio of native to invasive species in the model’s training images. As this ratio increased, so did the model’s top-1 and top-5 accuracy, but also the missed invasive rate. Thus, to achieve optimal performance when selecting or training a CNN for use in a wildlife camera-trap project, the metric used to judge the performance of the model must be tailored to the specific goals of the project, and the distribution of species in the model’s training images must match the distribution that will be seen in the project’s camera-trap images.
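
The contrast between accuracy and the conservation-oriented metrics can be sketched in a few lines. Top-k accuracy follows its standard definition. The abstract does not define false alarm rate or missed invasive rate, so the definitions below are assumptions: a false alarm is a native-species image whose top-1 prediction is an invasive species, and a miss is an invasive-species image whose top-1 prediction is not. The species names are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of images whose true label is among the top k predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


def conservation_rates(top1_predictions, true_labels, invasive):
    """Return (false_alarm_rate, missed_invasive_rate) under ASSUMED definitions."""
    invasive_total = native_total = missed = false_alarms = 0
    for pred, truth in zip(top1_predictions, true_labels):
        if truth in invasive:
            invasive_total += 1
            missed += pred not in invasive       # invasive animal not flagged
        else:
            native_total += 1
            false_alarms += pred in invasive     # native animal flagged
    false_alarm_rate = false_alarms / native_total if native_total else 0.0
    missed_rate = missed / invasive_total if invasive_total else 0.0
    return false_alarm_rate, missed_rate


# Hypothetical ranked model outputs for four camera-trap images.
ranked = [["kiwi", "tui", "rat"], ["tui", "kiwi", "rat"],
          ["kiwi", "rat", "tui"], ["rat", "possum", "kiwi"]]
truth = ["kiwi", "kiwi", "rat", "possum"]
top1 = [r[0] for r in ranked]

print("top-1:", top_k_accuracy(ranked, truth, 1))
print("top-2:", top_k_accuracy(ranked, truth, 2))
print("rates:", conservation_rates(top1, truth, {"rat", "possum"}))
```

In this toy data the fourth image counts against top-1 accuracy (a possum predicted as a rat) but not against the missed invasive rate, since an invasive species was still flagged, so the two kinds of metric can rank models differently.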