8 research outputs found

    Unionization method for changing opinion in sentiment classification using machine learning

    Sentiment classification aims to determine whether an opinionated text expresses a positive, negative or neutral opinion. Most existing sentiment classification approaches have focused on supervised text classification techniques. One critical problem of sentiment classification is that a text collection may contain tens or hundreds of thousands of features, i.e. high dimensionality, which can be addressed by dimension reduction. Although feature selection, as a dimension reduction method, reduces the feature space to a smaller feature subset, that subset commonly requires further reduction. In this research, a novel dimension reduction approach called feature unionization is proposed to construct a more compact feature subset. The approach combines several features to create a single, more informative feature. Another challenge of sentiment classification is handling concept drift in the learning step. Users’ opinions change as target entities evolve over time. However, existing sentiment classification approaches do not consider the evolution of users’ opinions. They assume that instances are independent, identically distributed and generated from a stationary distribution, even though they are actually generated from a stream distribution. In this study, a stream sentiment classification method is proposed to deal with changing opinions and imbalanced data distributions using ensemble learning and instance selection methods. A related issue is the handling of feature drift in sentiment classification. To handle feature drift, relevant features need to be detected so that classifiers can be updated. Since the proposed feature unionization method is very effective at constructing more relevant features, it is further used to handle feature drift. 
Thus, a method to deal with concept and feature drift in stream sentiment classification was proposed. The effectiveness of the feature unionization method was compared with feature selection on fourteen publicly available datasets in the sentiment classification domain using three typical classifiers. The experimental results showed that the proposed approach is more effective than current feature selection approaches. In addition, the experimental results showed the effectiveness of the proposed stream sentiment classification method in comparison to static sentiment classification. The experiments, conducted on four datasets, showed that the proposed algorithm achieved better results, demonstrating the effectiveness of the proposed method.
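
The abstract says only that feature unionization combines several features into one more informative feature; it does not say how the features are grouped or merged. The sketch below is therefore an illustration under stated assumptions, not the thesis's method: documents are sets of terms, the term groups are supplied by hand, and a group is merged with a logical OR. The names `unionize`, `docs` and `groups` are hypothetical.

```python
def unionize(docs, groups):
    """Map each document (a set of terms) to a binary vector with one
    entry per feature group. An entry is 1 if the document contains at
    least one term of that group (logical OR over the group).
    Assumption: OR-merging is an illustrative choice only."""
    return [
        [int(any(term in doc for term in group)) for group in groups]
        for doc in docs
    ]

# Three toy documents and two hand-picked (hypothetical) term groups.
docs = [{"great", "plot"}, {"awful", "acting"}, {"superb", "cast"}]
groups = [("great", "superb", "excellent"), ("awful", "terrible")]

# Five original term features collapse into two unioned features.
vectors = unionize(docs, groups)
```

In this toy setting the feature space shrinks from five term features to two unioned ones, which is the kind of further reduction the abstract describes.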

    Learning in the Real World: Constraints on Cost, Space, and Privacy

    The sheer demand for machine learning in fields as varied as healthcare, web-search ranking, factory automation, collision prediction, spam filtering, and many others frequently outpaces the intended use-case of machine learning models. In fact, a growing number of companies hire machine learning researchers to rectify this very problem: to tailor existing models, or design new state-of-the-art ones, for the setting at hand. However, we can generalize a large set of the machine learning problems encountered in practical settings into three categories: cost, space, and privacy. The first category (cost) considers problems that need to balance the accuracy of a machine learning model with the cost required to evaluate it. These include problems in web-search, where results need to be delivered to a user in under a second and be as accurate as possible. The second category (space) collects problems that require running machine learning algorithms on low-memory computing devices. For instance, in search-and-rescue operations we may opt to use many small unmanned aerial vehicles (UAVs) equipped with machine learning algorithms for object detection to find a desired search target. These algorithms should be small enough to fit within the physical memory limits of the UAV (and be energy efficient) while reliably detecting objects. The third category (privacy) considers problems where one wishes to run machine learning algorithms on sensitive data. It has been shown that seemingly innocuous analyses on such data can be exploited to reveal data individuals would prefer to keep private. Thus, nearly any algorithm that runs on patient or economic data falls under this set of problems. We devise solutions for each of these problem categories, including (i) a fast tree-based model for explicitly trading off accuracy and model evaluation time, (ii) a compression method for the k-nearest neighbor classifier, and (iii) a private causal inference algorithm that protects sensitive data.

    Techniques for data pattern selection and abstraction

    This thesis concerns the problem of prototype reduction in instance-based learning. To deal with problems such as storage requirements, sensitivity to noise and computational complexity, various algorithms have been presented that reduce the number of stored prototypes while maintaining competent classification accuracy. Instance selection, which retains a smaller subset of the original training set, is the most widely used technique for instance reduction. However, prototype abstraction, which generates new prototypes to replace the initial ones, has also gained a lot of interest recently. The major contribution of this work is the proposal of four novel frameworks for performing prototype reduction: the Class Boundary Preserving algorithm (CBP), a hybrid method that uses both selection and generation of prototypes; Instance Seriation for Prototype Abstraction (ISPA), which is an abstraction algorithm; and two selective techniques, Spectral Instance Reduction (SIR) and Direct Weight Optimization (DWO). CBP is a multi-stage method based on a simple heuristic that is very effective in identifying samples close to class borders. A noise filter removes harmful instances, while the heuristic determines the geometrical distribution of patterns around every instance. Together with the concepts of nearest enemy pairs and mean shift clustering, this algorithm decides on the final set of retained prototypes. DWO is a selection model whose output set of prototypes is decided by a set of binary weights. These weights are computed according to an objective function composed of the ratio between the nearest friend and nearest enemy of every sample. To obtain good-quality results, DWO is optimized using a genetic algorithm. ISPA is an abstraction technique that employs the concept of data seriation to organize instances in an arrangement that favours merging between them. As a result, a new set of prototypes is created. 
Results show that CBP, SIR and DWO, the three major algorithms presented in this thesis, are competent and efficient in terms of at least one of the two basic objectives, classification accuracy and condensation ratio. The comparison against other successful condensation algorithms illustrates the competitiveness of the proposed models. The SIR algorithm presents a set of border discriminating features (BDFs) that depicts the local distribution of friends and enemies of all samples. These are then used along with spectral graph theory to partition the training set into border and internal instances.
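
The nearest friend / nearest enemy ratio that the DWO objective is built from can be illustrated with a minimal sketch. It assumes Euclidean distance and computes only the per-sample ratio; how the thesis combines these ratios with the binary weights and the genetic algorithm is not given in the abstract and is not reproduced here. The function name `friend_enemy_ratios` is hypothetical.

```python
import math

def friend_enemy_ratios(points, labels):
    """For every sample, return d(nearest friend) / d(nearest enemy),
    where a friend shares the sample's label and an enemy does not.
    Assumptions: Euclidean distance, and every sample has at least one
    friend and one enemy at non-zero distance."""
    ratios = []
    for i, (p, y) in enumerate(zip(points, labels)):
        friend = min(
            math.dist(p, q)
            for j, (q, z) in enumerate(zip(points, labels))
            if j != i and z == y
        )
        enemy = min(
            math.dist(p, q) for q, z in zip(points, labels) if z != y
        )
        ratios.append(friend / enemy)
    return ratios

# Two classes on a line: samples at x=1 and x=4 sit next to the border.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
labels = ["a", "a", "b", "b"]
ratios = friend_enemy_ratios(points, labels)
```

Samples closer to the class border get a larger ratio than interior samples, which is why such a ratio is a natural ingredient for deciding which prototypes to keep.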

    State of the Art in Face Recognition

    Notwithstanding the tremendous effort to solve the face recognition problem, it is not yet possible to design a face recognition system with a potential close to human performance. New computer vision and pattern recognition approaches need to be investigated. New knowledge and perspectives from other fields, such as psychology and neuroscience, must also be incorporated into face recognition research to design a robust face recognition system. Indeed, much more effort is required to arrive at a human-like face recognition system. This book attempts to narrow the gap between the previous state of face recognition research and its future state.

    Artificial neural network models: data selection and online adaptation

    Energy consumption has been increasing steadily due to globalization and industrialization. Studies have shown that buildings account for the largest share of energy consumption; for example, in European Union countries, buildings represent around 40% of total energy consumption. Hence, this PhD aimed at managing the energy consumed by Heating, Ventilating and Air Conditioning (HVAC) systems in buildings using the Model Predictive Control (MPC) technique. To achieve this goal, artificial intelligence models such as neural networks and Support Vector Machines (SVM) have been proposed because of their high potential for performing accurate nonlinear mappings between inputs and outputs in real environments, which are not noise-free. In this PhD, Radial Basis Function Neural Networks (RBFNN), a promising class of Artificial Neural Networks (ANN), were considered to model a sequence of time series processes, where the RBFNN models were built using a Multi Objective Genetic Algorithm (MOGA) as a design platform. Regarding the design of such models, two main challenges were tackled: data selection and model adaptation. Since RBFNNs are data-driven models, their performance relies, to a good extent, on selecting proper data throughout the design phase, covering the whole input-output range in which they will be employed. Convex hull algorithms can be applied as data selection methods; however, conventional implementations of these methods are not feasible in high dimensions because of their high complexity. In the first phase of this PhD, a new randomized approximation convex hull algorithm called ApproxHull was proposed for high dimensions so that it can be used with an acceptable execution time and low memory requirements. 
Simulation results showed that applying ApproxHull as a filter data selection method (i.e., an unsupervised data selection method) could improve the performance of classification and regression models in comparison with random data selection. In addition, ApproxHull was employed in real applications in three case studies. The first two involved applying predictive models for energy saving. The last case study was related to the segmentation of lesion areas in brain Computed Tomography (CT) images. The evaluation results showed that applying ApproxHull in MOGA could result in models with an acceptable level of accuracy. Specifically, the results obtained from the third case study demonstrated that ApproxHull can be applied to large data sets in high dimensions. Besides the random selection method, it was also compared with an entropy-based unsupervised data selection method and a hybrid method involving ApproxHull and the entropy-based method. Based on the simulation results, in most cases ApproxHull and the hybrid method achieved better performance than the others. In the second phase of this PhD, a new convex-hull-based sliding window online adaptation method was proposed. The goal was to update the offline predictive RBFNN models used in the HVAC MPC technique, where these models are applied to processes in which the input-output data range changes over time. The idea behind the proposed method is to detect, at each time instant, a newly arriving point that reflects a new range of data by comparing the point with the current convex hull obtained via ApproxHull. In this situation, the underlying model’s parameters are updated based on the new point and a sliding window of some past points. 
The simulation results showed not only that the proposed method could update the model efficiently while maintaining a good level of accuracy, but also that it was comparable with other methods.
Energy consumption has increased continuously due to industrialization and globalization. Research on consumption shows that buildings consume the largest share of energy. For example, in European Union countries this share corresponds to around 40% of all energy consumed. This PhD thesis therefore has the practical objective of helping to improve the management of the energy consumed by Heating, Ventilating and Air Conditioning (HVAC) systems in buildings, within a model-based predictive control strategy. In this context, models based on artificial neural networks and support vector machines, to mention only a few, have already been proposed. These techniques have a great capacity for modelling nonlinear relationships between system inputs and outputs, and are applicable in operating environments, which are subject to various forms of noise. In this thesis, radial basis function neural networks, a well-established technique in time series modelling, were considered. A tool based on a multi-objective genetic algorithm was used to design these networks. Regarding the design process of these models, this thesis addresses two less-studied aspects: data selection and online model adaptation. Since artificial neural networks are data-driven models, their performance depends to a good extent on the availability of appropriate data that are representative of the system or process and cover the whole range of values generated by its input/output representation. Algorithms that determine the geometric figure enclosing all the data, known as convex hull algorithms, can be applied to the data selection task. 
However, using conventional implementations of these algorithms on high-dimensional problems is not viable in practice. In the first phase of this work, a new randomized convex hull approximation method, named ApproxHull, was proposed; it is suitable for high-dimensional data sets and thus viable for practical applications. The experimental results showed that applying ApproxHull as a filter-type (i.e., unsupervised) data selection method can improve model performance in classification and regression problems when compared with random data selection. ApproxHull was also applied in three case studies involving real applications. The first two concerned the development of predictive models for systems in the area of energy efficiency. The third case study consisted of developing classification models for the segmentation of lesions in computed tomography images. The results revealed that applying the proposed method produced models with acceptable accuracy. In terms of applicability, the results showed that ApproxHull can be used on large, high-dimensional data sets. Besides the comparison with random data selection, the method was also compared with a data selection method based on the concept of entropy and with a hybrid method that combines ApproxHull with the entropy-based method. The experimental results showed that in most of the cases studied the hybrid method achieved better performance than the others. In the second phase of the work, a new online adaptation method based on the ApproxHull algorithm and a sliding time window was proposed. 
Since the processes and systems surrounding the HVAC system are time-varying and dynamic, the objective was to apply the proposed method to adapt online the models that were first obtained offline. The basic idea of the proposed method is to compare each new input/output pair with the known convex hull and determine whether the new pair contains data lying outside the known range. In that case, the model parameters are updated based on the new point and a set of points within a given sliding time window. The experimental results demonstrated not only that the new method is efficient in updating the models and keeping them at a good level of accuracy, but also that it was comparable to other existing methods.
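
The convex-hull-based sliding window adaptation described in this abstract can be sketched roughly as follows. This is a simplified illustration under explicit assumptions: it uses an exact 2-D hull (Andrew's monotone chain) in place of ApproxHull, which is a randomized approximation for high dimensions, and it stands in for the RBFNN parameter update with a hypothetical `retrain` callback. The class and function names are invented for the example.

```python
from collections import deque

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Exact 2-D hull (monotone chain), vertices in counter-clockwise
    order. Assumption: stands in for ApproxHull, which is approximate."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p is inside or on the hull (needs >= 3 hull vertices)."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

class SlidingWindowAdapter:
    """Hypothetical wrapper: triggers a model update only when a new
    point falls outside the hull of the data seen so far."""

    def __init__(self, training_points, window_size, retrain):
        self.hull = convex_hull(training_points)
        self.window = deque(maxlen=window_size)  # most recent points
        self.retrain = retrain                   # hypothetical callback

    def observe(self, point):
        self.window.append(point)
        if inside(self.hull, point):
            return False                         # known range: no update
        # New range of data: grow the hull, update on the recent window.
        self.hull = convex_hull(self.hull + [point])
        self.retrain(list(self.window))
        return True

updates = []  # records the data passed to each simulated model update
adapter = SlidingWindowAdapter(
    [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)], window_size=3,
    retrain=updates.append,
)
flags = [adapter.observe(p) for p in [(1, 1), (2, 3), (6, 2), (5, 2)]]
```

Only the third point lies outside the initial hull, so it alone triggers an update; the fourth point falls inside the enlarged hull and is absorbed without retraining.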