
    A DATA DRIVEN APPROACH TO IDENTIFY JOURNALISTIC 5WS FROM TEXT DOCUMENTS

    Textual understanding is the process of automatically extracting accurate, high-quality information from text. The amount of textual data available from sources such as news, blogs and social media is growing exponentially. These data encode significant latent information which, if extracted accurately, can be valuable in a variety of applications such as medical report analysis, news understanding and societal studies. Natural language processing techniques are often employed to develop customized algorithms that extract such latent information from text. The journalistic 5Ws refer to the basic information in news articles that describes an event: where, when, who, what and why. Extracting them accurately can facilitate better understanding of many social processes, including social unrest, human rights violations, propaganda spread and population migration. Furthermore, the 5Ws can be combined with socio-economic and demographic data to analyze the state and trajectory of these processes. In this thesis, a data-driven pipeline is developed to extract the 5Ws from text using syntactic and semantic cues. First, a classifier is developed to identify articles specifically related to social unrest; it is trained on a dataset of over 80K news articles. NLP algorithms are then used to generate a set of candidate answers for the 5Ws. Next, a series of heuristic algorithms is developed to extract the 5Ws; these leverage specific words and parts of speech, customized for each W, to compute candidate scores. The heuristics are based on the syntactic structure of the document as well as syntactic and semantic representations of individual words and sentences. The scores are then combined and ranked to obtain the best answers to the journalistic 5Ws. The classification accuracy of the algorithms is validated on a manually annotated dataset of news articles.
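
    The candidate-scoring stage lends itself to a short illustration. Everything concrete in the sketch below is assumed for illustration (the Candidate structure, the cue-word lists and the 0.6/0.4 weights are invented, not taken from the thesis): each W gets a cue-word score that is combined with a sentence-position score, and candidates are ranked by the combined score.

        # Hypothetical sketch of per-W candidate scoring and ranking; cue lists
        # and weights are illustrative stand-ins for the thesis's POS- and
        # syntax-based heuristics.
        import re
        from dataclasses import dataclass

        @dataclass
        class Candidate:
            text: str
            sentence_index: int  # position of the source sentence in the article

        CUES = {
            "where": {"in", "at", "near"},
            "when":  {"on", "during", "yesterday", "today"},
        }

        def cue_score(c: Candidate, w: str) -> float:
            tokens = set(re.findall(r"\w+", c.text.lower()))
            return len(tokens & CUES.get(w, set())) / max(len(tokens), 1)

        def position_score(c: Candidate) -> float:
            # News articles front-load the 5Ws, so earlier sentences score higher.
            return 1.0 / (1 + c.sentence_index)

        def rank(candidates: list[Candidate], w: str) -> list[Candidate]:
            return sorted(candidates,
                          key=lambda c: 0.6 * cue_score(c, w) + 0.4 * position_score(c),
                          reverse=True)

        cands = [Candidate("in Cairo on Friday", 0), Candidate("a protest march", 2)]
        print(rank(cands, "where")[0].text)  # -> "in Cairo on Friday"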

    Improving single classifiers prediction accuracy for underground water pump station in a gold mine using ensemble techniques

    Abstract: In this paper, six single classifiers (support vector machine, artificial neural network, naïve Bayesian classifier, decision trees, radial basis function and k-nearest neighbors) were utilized to predict water dam levels in a deep gold mine underground pump station. Bagging and Boosting ensemble techniques were also used to increase the prediction accuracy of the single classifiers. To enhance the prediction accuracy further, a mutual information ensemble approach is introduced to improve on the single classifiers and on the Bagging and Boosting results. This ensemble is used to classify, and thereby monitor and predict, the underground water dam levels at a single-pump-station deep gold mine in South Africa. Mutual information theory is used to determine the optimum number of classifiers for building the most accurate ensemble. In terms of prediction accuracy, the results show that the mutual information ensemble outperformed the other ensembles and the single classifiers, and is more efficient for classifying underground water dam levels. However, constructing this ensemble is more complicated than applying the Bagging and Boosting techniques.
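
    As a rough illustration of the mutual information idea, the sketch below (synthetic data; the selection criterion is a plausible reading, not the paper's exact procedure) ranks trained classifiers by the mutual information between their validation predictions and the true labels, then grows a majority-vote ensemble until validation accuracy stops improving.

        # Hedged sketch: order classifiers by MI(prediction; label), then
        # pick the ensemble size that maximizes validation accuracy.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mutual_info_score, accuracy_score
        from sklearn.svm import SVC
        from sklearn.neural_network import MLPClassifier
        from sklearn.naive_bayes import GaussianNB
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.neighbors import KNeighborsClassifier

        X, y = make_classification(n_samples=600, random_state=0)
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

        models = [SVC(), MLPClassifier(max_iter=500), GaussianNB(),
                  DecisionTreeClassifier(), KNeighborsClassifier()]
        preds = [m.fit(X_tr, y_tr).predict(X_val) for m in models]

        # Rank classifiers by mutual information with the true labels.
        order = np.argsort([-mutual_info_score(y_val, p) for p in preds])

        best_k, best_acc = 1, 0.0
        for k in range(1, len(models) + 1):
            vote = (np.mean([preds[i] for i in order[:k]], axis=0) >= 0.5).astype(int)
            acc = accuracy_score(y_val, vote)
            if acc > best_acc:
                best_k, best_acc = k, acc
        print(f"optimum ensemble size: {best_k} (validation accuracy {best_acc:.3f})")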

    Gold mine dam levels and energy consumption classification using artificial intelligence methods

    Abstract: In this paper, a comparison between two single classifier methods (support vector machine, artificial neural network) and two ensemble methods (bagging and boosting) is applied to a real-world mining problem. The four methods are used to classify, and thereby monitor, underground dam levels and underground pump energy consumption at a double-pump-station deep gold mine in South Africa. In terms of misclassification error, the results show support vector machines (SVM) to be more efficient than artificial neural networks (ANN) for classification of underground pump energy consumption...
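
    The core comparison reduces to fitting both models on the same data and measuring misclassification error; a minimal sketch follows, with synthetic data standing in for the mine's dam-level and energy-consumption records.

        # Minimal sketch of the SVM-vs-ANN comparison on misclassification error.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import zero_one_loss
        from sklearn.svm import SVC
        from sklearn.neural_network import MLPClassifier

        X, y = make_classification(n_samples=500, random_state=1)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

        for name, model in [("SVM", SVC()), ("ANN", MLPClassifier(max_iter=500))]:
            err = zero_one_loss(y_te, model.fit(X_tr, y_tr).predict(X_te))
            print(f"{name} misclassification error: {err:.3f}")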

    Applications of artificial intelligence in powerline communications in terms of noise detection and reduction : a review

    Abstract: The technology that utilizes the power line as a medium for transferring information, known as powerline communication (PLC), has been in existence for over a hundred years. It is attractive because it avoids new installation, using the existing electrical power infrastructure to transmit data. However, transmission of data signals through a power line channel usually encounters challenges including impulsive noise, frequency selectivity, high channel attenuation and low line impedance. Impulsive noise exhibits a power spectral density in the range of 10-15 dB above the background noise, which can cause severe problems in a communication system. For the PLC system to perform well, this noise must be detected and suppressed. This paper reviews various techniques used to detect and mitigate impulsive noise in PLC and suggests the application of machine learning algorithms for the detection and removal of impulsive noise in powerline communication systems.
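
    To make the noise model concrete, the sketch below simulates Bernoulli-Gaussian impulsive noise roughly 12 dB above the background (within the 10-15 dB range cited above) and flags impulsive samples with a simple threshold; the impulse probability, gain and threshold are assumptions, and the surveyed ML detectors would replace the threshold rule.

        # Illustrative simulation of impulsive noise detection in a PLC channel.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 10_000
        background = rng.normal(0.0, 1.0, n)        # background noise (0 dB reference)
        impulse_mask = rng.random(n) < 0.01         # ~1% of samples are impulsive
        impulse_gain = 10 ** (12 / 20)              # amplitude ~12 dB above background
        noise = background + impulse_mask * rng.normal(0.0, impulse_gain, n)

        detected = np.abs(noise) > 3.0              # simple 3-sigma threshold detector
        tp = np.sum(detected & impulse_mask)
        print(f"recall: {tp / impulse_mask.sum():.2f}, "
              f"false alarms: {np.sum(detected & ~impulse_mask)}")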

    Outside The Machine Learning Blackbox: Supporting Analysts Before And After The Learning Algorithm

    Applying machine learning to real problems is non-trivial because many important steps are needed to prepare for learning and to interpret the results after learning. This dissertation investigates four problems that arise before and after applying learning algorithms. First, how can we verify that a dataset contains "good" information? I propose cross-data validation for quantifying the quality of a dataset relative to a benchmark dataset and define a data efficiency ratio that measures how efficiently the dataset in question collects information (relative to the benchmark). Using these methods I demonstrate the quality of bird observations collected by the eBird citizen science project, which has few quality controls. Second, can off-the-shelf algorithms learn a model with good task-specific performance, or must the user have expertise both in the domain and in machine learning? In many applications, standard performance metrics are inappropriate, and most analysts lack the expertise or time to customize algorithms to optimize task-specific metrics. Ensemble selection offers a potential solution: build an ensemble to optimize the desired metric. I evaluate ensemble selection's ability to optimize domain-specific metrics on natural language processing tasks and show that ensemble selection usually improves performance but sometimes overfits. Third, how can we understand complex models? Understanding a model is often as important as its accuracy. I propose and evaluate statistics for measuring the importance of inputs used by a decision tree ensemble. The statistics agree with sensitivity analysis and, in an application to bird distribution models, are 500 times faster to compute. The statistics have been used to study hundreds of bird distribution models. Fourth, how should data be pre-processed when learning a high-performing ensemble? I examine the behavior of variable selection and bagging using a bias-variance analysis of error. The results show that the most accurate variable subset corresponds to the best bias-variance trade-off point. Often, this is not the point separating relevant from irrelevant inputs. Variable selection should be viewed as a variance reduction method and is thus often redundant for low-variance methods like bagging. The best bagged model performance is usually obtained using all available inputs.
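
    Ensemble selection, the proposed solution to the second problem, is a simple greedy procedure (forward selection of models with replacement, in the style of Caruana et al.); the sketch below is a toy version with placeholder model predictions and accuracy standing in for the task-specific metric.

        # Hedged sketch of greedy ensemble selection to optimize a chosen metric.
        import numpy as np

        def ensemble_selection(probs, y, metric, rounds=50):
            """probs: list of per-model probability vectors on a hillclimb set."""
            chosen, total = [], np.zeros_like(probs[0])
            for _ in range(rounds):
                # Try adding each model (with replacement); keep the best addition.
                scores = [metric(y, (total + p) / (len(chosen) + 1)) for p in probs]
                best = int(np.argmax(scores))
                chosen.append(best)
                total += probs[best]
            return chosen, total / len(chosen)

        # Toy usage: three models of varying noise, accuracy as the metric.
        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, 200)
        probs = [np.clip(y + rng.normal(0, s, 200), 0, 1) for s in (0.3, 0.5, 0.9)]
        acc = lambda y_true, p: np.mean((p >= 0.5) == y_true)
        chosen, avg = ensemble_selection(probs, y, acc)
        print("selected models:", chosen[:10], "ensemble accuracy:", acc(y, avg))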

    Biomolecular Event Extraction using Natural Language Processing

    Biomedical research and discoveries are communicated through scholarly publications, and this literature is voluminous, rich in scientific text and growing exponentially by the day. Biomedical journals publish nearly three thousand research articles daily, making literature search a challenging proposition for researchers. Biomolecular events involve genes, proteins, metabolites and enzymes, and provide invaluable insights into biological processes and their physiological functional mechanisms. Text mining (TM), the automatic extraction of such events from this big data, is the only quick and viable way to gather useful information. Events extracted from the biological literature have a broad range of applications, such as database curation, ontology construction, semantic web search and interactive systems. However, automatic extraction is challenging on account of the ambiguous and diverse nature of natural language and associated linguistic phenomena such as speculation and negation, which commonly occur in biomedical texts and lead to erroneous interpretation. In the last decade, many strategies have been proposed in this field, using paradigms such as biomedical natural language processing (BioNLP), machine learning and deep learning. In addition, new parallel computing architectures such as graphics processing units (GPUs) have emerged as candidates to accelerate the event extraction pipeline. This paper reviews and summarizes the key approaches to complex biomolecular big-data event extraction and recommends an architecture balanced in terms of accuracy, speed, computational cost and memory usage for developing a robust GPU-accelerated BioNLP system.
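
    Most systems in this area follow a two-stage pipeline: detect event trigger words, then attach protein arguments. The sketch below is a deliberately simplified stand-in; the trigger lexicon, protein list and nearest-mention heuristic replace the learned classifiers (and the NER step) a real BioNLP system would use.

        # Toy two-stage biomolecular event extraction: trigger lookup, then
        # argument attachment by proximity (preferring mentions after the trigger).
        TRIGGERS = {"phosphorylation": "Phosphorylation", "expression": "Gene_expression"}
        PROTEINS = {"STAT3", "TRAF2", "IL-4"}   # would come from an NER step

        def extract_events(sentence: str):
            tokens = sentence.split()
            events = []
            for i, tok in enumerate(tokens):
                etype = TRIGGERS.get(tok.lower())
                if not etype:
                    continue
                # Nearest protein mention as Theme; ties prefer the one after the trigger.
                theme = min((t for t in tokens if t in PROTEINS),
                            key=lambda t: (abs(tokens.index(t) - i), tokens.index(t) < i),
                            default=None)
                if theme:
                    events.append({"type": etype, "trigger": tok, "theme": theme})
            return events

        print(extract_events("TRAF2 mediates phosphorylation of STAT3"))
        # -> [{'type': 'Phosphorylation', 'trigger': 'phosphorylation', 'theme': 'STAT3'}]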

    Optimization issues in machine learning of coreference resolution

    Enhancing Multimodal Information Retrieval Through Integrating Data Mining and Deep Learning Techniques

    Multimodal information retrieval, the task of retrieving relevant information from heterogeneous data sources such as text, images, and videos, has gained significant attention in recent years due to the proliferation of multimedia content on the internet. This paper proposes an approach to enhance multimodal information retrieval by integrating data mining and deep learning techniques. Traditional information retrieval systems often struggle to effectively handle multimodal data due to the inherent complexity and diversity of such data sources. In this study, we leverage data mining techniques to preprocess and structure multimodal data efficiently. Data mining methods enable us to extract valuable patterns, relationships, and features from different modalities, providing a solid foundation for subsequent retrieval tasks. To further enhance the performance of multimodal information retrieval, deep learning techniques are employed. Deep neural networks have demonstrated their effectiveness in various multimedia tasks, including image recognition, natural language processing, and video analysis. By integrating deep learning models into our retrieval framework, we aim to capture complex intermodal dependencies and semantically rich representations, enabling more accurate and context-aware retrieval.
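
    The retrieval core of such a framework can be sketched briefly: modality-specific encoders map items into one shared embedding space, and retrieval ranks indexed items by cosine similarity to the query embedding. The random "encoders" below are placeholders for trained deep networks (e.g. CLIP-style models); nothing here is the paper's actual architecture.

        # Hedged sketch of cross-modal retrieval in a shared embedding space.
        import numpy as np

        rng = np.random.default_rng(0)
        D = 64                                  # shared embedding dimension

        def encode_text(texts):                 # placeholder for a trained text encoder
            return rng.normal(size=(len(texts), D))

        def encode_image(images):               # placeholder for a trained image encoder
            return rng.normal(size=(len(images), D))

        def retrieve(query_vec, index_vecs, k=3):
            q = query_vec / np.linalg.norm(query_vec)
            idx = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
            return np.argsort(-(idx @ q))[:k]   # top-k most similar item indices

        corpus = encode_image([f"img_{i}" for i in range(10)])   # indexed images
        query = encode_text(["sunset over mountains"])[0]        # text query
        print("top matches:", retrieve(query, corpus))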