10 research outputs found

    Delineating Knowledge Domains in Scientific Literature using Machine Learning (ML)

    Get PDF
    Recent years have witnessed an upsurge in the number of published documents, and organizations are showing increased interest in text classification for effective use of this information. Manual text classification can be fruitful for a handful of documents, but it loses credibility as the number of documents grows, besides being laborious and time-consuming. Text-mining techniques assign text strings to categories, making classification fast, accurate, and hence reliable. This paper classifies chemistry documents using machine learning and statistical methods. The text classification procedure is described in chronological order: data preparation, followed by processing, transformation, and the application of classification techniques, culminating in validation of the results.
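    The pipeline stages listed above can be sketched end to end. The toy corpus, the labels, and the nearest-centroid classifier below are illustrative assumptions, not the paper's actual data or methods:

```python
import math
from collections import Counter

def tokenize(text):
    return [w for w in text.lower().split() if w.isalpha()]

def tf_vector(tokens):
    # Transformation: relative term-frequency vector.
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    acc = Counter()
    for v in vectors:
        acc.update(v)
    return {w: s / len(vectors) for w, s in acc.items()}

# Data preparation: hypothetical labelled training documents.
train = [
    ("the acid reacted with the base in solution", "chemistry"),
    ("titration of the sample gave the molar concentration", "chemistry"),
    ("the court ruled on the contract dispute", "law"),
    ("the judge dismissed the appeal in the case", "law"),
]

# Processing and transformation: one centroid vector per class.
by_label = {}
for text, label in train:
    by_label.setdefault(label, []).append(tf_vector(tokenize(text)))
centroids = {lbl: centroid(vecs) for lbl, vecs in by_label.items()}

# Classification: assign a new document to the most similar class centroid.
def classify(text):
    v = tf_vector(tokenize(text))
    return max(centroids, key=lambda lbl: cosine(v, centroids[lbl]))

print(classify("the acid solution was titrated"))  # -> chemistry
```

    Validation, the final stage, would repeat `classify` over a held-out labelled set and report accuracy.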

    Text Categorization and Machine Learning Methods: Current State Of The Art

    Get PDF
    In this information age, many documents are available in digital form and require text classification. To address this problem, researchers have focused on machine learning techniques: a general inductive process automatically builds a classifier by learning the characteristics of the categories from a set of pre-classified documents. The main benefit of this approach over the manual definition of a classifier by domain experts is that it achieves comparable effectiveness with less expert labor and straightforward portability to different domains. The paper examines the main approaches to text categorization within the machine learning paradigm and surveys the current state of the art. Various issues pertaining to three text similarity problems, namely semantic, conceptual, and contextual similarity, are also discussed.
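    Of the three similarity notions the survey distinguishes, the simplest baseline is purely lexical. As an illustrative assumption (not a method from the paper), Jaccard overlap between token sets captures surface similarity while missing semantic relatedness entirely:

```python
def jaccard(a, b):
    """Lexical similarity: shared tokens over total distinct tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Shares "learning" and "for" but not the topic: lexical score is low-ish,
# and a semantically related pair with no shared words would score 0.
print(jaccard("machine learning for text", "deep learning for images"))
```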

    Advanced Text Analytics and Machine Learning Approach for Document Classification

    Get PDF
    Text classification is used in information extraction and retrieval from text, and it is an important step in managing the vast and expanding number of records available in digital form. This thesis addresses the problem of classifying patent documents into fifteen categories, some of which overlap for practical reasons. To develop the classification model using machine learning techniques, useful features were extracted from the documents; these features are used both to classify patent documents and to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management by developing a set of automated tools that can assist NASA in managing and marketing its portfolio of intellectual property (IP) and enable users to discover relevant IP more easily. We identified an array of applicable methods: k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms, Random Forest and J48. The major research steps consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing candidate models with effective classifiers. Further, the obstacles associated with imbalanced data were mitigated by adding synthetic data where appropriate, which resulted in a superior SVM-based classifier.
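    Information gain, one of the filtering criteria named above for variable selection, scores a term by how much knowing its presence reduces class-label entropy. A minimal sketch with hypothetical data:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(has_term, labels):
    """IG of a binary term-occurrence feature with respect to class labels."""
    gain = entropy(labels)
    n = len(labels)
    for value in (True, False):
        subset = [lbl for flag, lbl in zip(has_term, labels) if flag is value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical documents: term occurrence flags and document classes.
labels = ["patent", "patent", "other", "other"]
print(information_gain([True, True, False, False], labels))  # 1.0 bit: perfectly predictive
print(information_gain([True, False, True, False], labels))  # 0.0 bits: uninformative
```

    Ranking all candidate terms by this score and keeping the top-k is the usual filtering step before training.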

    Utilizing Artificial Intelligence in the Analysis of Nuclear Power Plant Requirements

    Get PDF
    Nuclear power plant projects are often time-consuming and capital-intensive. Current challenges include the descriptive and non-harmonized requirements of the nuclear power industry, which make adaptation to a new licensing domain data-intensive, laborious, and slow; the sheer volume of these requirements poses a further challenge. By utilizing artificial intelligence in the analysis of nuclear power plant requirements, licensing and engineering could be facilitated and errors in the allocation of requirements reduced. This Master’s thesis develops a natural-language algorithm that classifies nuclear power plant requirements into predefined categories using supervised machine learning. The study was performed in close cooperation with an AI company, Selko Technologies Oy, which was responsible for developing the algorithm based on the classified set of requirements and the needs of Fortum. The algorithm consists of a nuclear-industry-specific language model based on a long short-term memory (LSTM) network and a classifier based on a feedforward neural network. The language model and classifier were trained on the YVL Guides issued by the Finnish Radiation and Nuclear Safety Authority (STUK). To train the classifier, a small selection of the requirements was classified according to a predefined two-level hierarchy. The algorithm was tested on the selected YVL Guides and on a set of requirements issued by the Office for Nuclear Regulation in the United Kingdom. The results include the predetermined requirements hierarchy, the content of the categories, the natural language processing algorithm, requirements classified by both the experts and the algorithm, and model accuracies in each test case. The classification accuracies are promising, indicating that the current methods are suitable for categorizing natural language as long as qualified and sufficient training data is in place. The conclusions suggest further research into the models’ capability in other requirements-analysis tasks, such as splitting long requirements into shorter, more clearly defined statements and combining similar requirements into one.
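    Classification into a two-level predefined hierarchy can be sketched as routing: pick a top-level category first, then a subcategory within it. The category names and the keyword scorers below are hypothetical stand-ins; the thesis itself uses an LSTM language model with a feedforward classifier rather than keyword matching:

```python
# Hypothetical two-level category hierarchy with keyword vocabularies.
TOP = {
    "safety": {"containment", "shutdown", "radiation", "reactor"},
    "quality": {"audit", "documentation", "traceability"},
}
SUB = {
    "safety": {
        "reactor systems": {"shutdown", "reactor"},
        "shielding": {"radiation", "containment"},
    },
    "quality": {
        "records": {"documentation", "traceability"},
        "process": {"audit"},
    },
}

def score(tokens, vocab):
    return len(tokens & vocab)

def classify(requirement):
    tokens = set(requirement.lower().split())
    top = max(TOP, key=lambda c: score(tokens, TOP[c]))              # level 1
    sub = max(SUB[top], key=lambda c: score(tokens, SUB[top][c]))    # level 2
    return top, sub

print(classify("The reactor shutdown system shall remain operable"))
```

    Routing keeps the second-level decision conditioned on the first, which is what a two-level hierarchy requires regardless of the underlying model.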

    VENCE: A High-Performance Extractive Summarization Model Based on a Machine Learning Approach Enhanced with Ontological Knowledge

    Get PDF
    Several methods and techniques of artificial intelligence for information extraction, pattern recognition, and data mining are used to extract summaries automatically. In particular, new semi-supervised machine learning models augmented with ontological knowledge make it possible to select the sentences of a corpus according to their information content. The corpus is treated as a set of sentences to which optimization methods are applied to identify the most important attributes. These form a training set from which a learning algorithm can infer a classification function able to discriminate the sentences of new corpora according to their information content. Currently, even though the results are interesting, the effectiveness of models based on this approach remains low, especially regarding the discriminating power of the classification functions. In this thesis, a new machine-learning-based model is proposed whose effectiveness is improved by adding ontological knowledge to the training set. The originality of this model is described in three journal articles. The first article shows how linear techniques can be applied in an original way to optimize the feature space in the context of extractive summarization. The second article explains how to insert ontological knowledge to significantly improve the performance of the classification functions; this insertion is done by adding lexical chains, extracted from ontological knowledge bases, to the training set. The third article describes VENCE, the new machine learning model that extracts the most information-bearing sentences in order to produce summaries. The performance of VENCE was evaluated by comparing its results with those produced by current commercial and public software, as well as with those published in very recent scientific articles. The usual recall, precision, and F-measure metrics, together with the ROUGE toolkit, showed the superiority of VENCE. The model could benefit other information extraction contexts, for instance in defining models for sentiment analysis.
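    Extractive summarization, the task VENCE addresses, selects existing sentences rather than generating new ones. As a frequency-based stand-in (a classic Luhn-style heuristic, not the VENCE model itself), each sentence can be scored by the average corpus frequency of its words:

```python
from collections import Counter

def summarize(text, k=1):
    """Return the k sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freqs = Counter(w for s in sentences for w in s.lower().split())

    def score(sentence):
        words = sentence.lower().split()
        return sum(freqs[w] for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:k]

doc = ("Machine learning extracts summaries. "
       "Summaries condense machine learning papers. "
       "The weather was pleasant")
print(summarize(doc, k=1))  # ['Machine learning extracts summaries']
```

    Outputs of such a system are typically scored against reference summaries with recall, precision, F-measure, and ROUGE, as in the evaluation described above.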

    Accuracy Improvement of Automatic Text Classification Based on Feature Transformation and Multi-classifier Combination

    No full text
    In this paper, we describe a comparative study of feature transformation and classification techniques for improving the accuracy of automatic text classification. Normalization to relative word frequency, principal component analysis (the Karhunen-Loève transformation), and the power transformation were applied to the feature vectors, which were then classified by the Euclidean distance, the linear discriminant function, the projection distance, the modified projection distance, and an SVM. To further improve classification accuracy, multi-classifier combination by majority vote was employed.
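    Majority-vote combination, the final step above, simply takes the most common label among the individual classifiers' predictions for each sample. A minimal sketch (the classifier outputs below are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per classifier for a single sample."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of five base classifiers for one document.
print(majority_vote(["sports", "sports", "politics", "sports", "politics"]))  # sports
```

    With an odd number of classifiers over two classes, ties cannot occur; with more classes, a tie-breaking rule (e.g. classifier confidence) would be needed.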