
    A graph theoretical perspective for the unsupervised clustering of free text corpora

    This thesis introduces a robust end-to-end topic discovery framework that extracts a set of coherent topics stemming intrinsically from document similarities. Some topic clustering methods can support embedded vectors instead of the traditional Bag-of-Words (BoW) representation, some can be free from the number-of-topics hyperparameter, and others can extract multi-scale relations between topics. However, no topic clustering method supports all these properties together. This thesis addresses this gap in the literature by designing a framework that supports any type of document-level features, especially embedded vectors. The framework does not require any uninformed decision making about the underlying data, such as the number of topics; instead, it extracts topics at multiple resolutions. To achieve this goal, we combine existing methods from natural language processing (NLP) for feature generation with methods from graph theory, first for graph construction based on semantic document similarities, then for graph partitioning to extract the corresponding topics at multiple resolutions. Finally, we use specific methods from statistical machine learning to obtain highly generalisable supervised models, which are deployed as topic classifiers for real-time topic extraction. Our applications on both a noisy, specialised corpus of medical records (i.e., descriptions of patient incidents within the NHS) and public news articles in everyday language show that our framework extracts coherent topics with better quantitative benchmark scores than other methods in most cases. The resulting multi-scale topics in both applications enable us to capture specific details more easily and to choose the resolution relevant to the specific objective. This study contributes to the topic clustering literature by introducing a novel graph theoretical perspective that provides a combination of new properties: multiple resolutions, independence from uninformed decisions about the corpus, and use of recent NLP features such as vector embeddings.
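
    A minimal sketch of the kind of pipeline the abstract describes (document-level features, a semantic similarity graph, and multi-resolution partitioning) is given below. It assumes TF-IDF features, a k-nearest-neighbour graph, and Louvain community detection with a resolution sweep; the thesis's actual feature generation and partitioning methods are not specified here, so these choices are illustrative only.

```python
# Illustrative sketch (not the thesis's exact pipeline): embed documents,
# build a k-nearest-neighbour similarity graph, and partition it at several
# resolutions with Louvain community detection.
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def multiresolution_topics(documents, k=10, resolutions=(0.5, 1.0, 2.0)):
    # Any document-level features work here; TF-IDF keeps the example self-contained.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    sim = cosine_similarity(vectors)

    # Keep only each document's k strongest neighbours to sparsify the graph.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(documents)))
    for i, row in enumerate(sim):
        for j in np.argsort(row)[::-1][1:k + 1]:
            graph.add_edge(i, int(j), weight=float(row[j]))

    # One partition per resolution: coarser topics at low values, finer topics at high values.
    return {
        r: louvain_communities(graph, weight="weight", resolution=r, seed=0)
        for r in resolutions
    }
```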

    Arabic web page clustering: a review

    Clustering is the method employed to group Web pages containing related information into clusters, which facilitates locating relevant information. Clustering performance depends largely on the characteristics of the text features. The Arabic language has a complex morphology and is highly inflected; thus, selecting appropriate features affects clustering performance positively. Many studies have addressed the clustering problem for Web pages with Arabic content. There are three main challenges in applying text clustering to Arabic Web page content. The first challenge concerns the difficulty of identifying significant term features that represent the original content while taking hidden knowledge into account. The second challenge is reducing data dimensionality without losing essential information. The third challenge is designing a suitable clustering model for Arabic text that is capable of improving clustering performance. This paper presents an overview of existing Arabic Web page clustering methods, with the goals of clarifying existing problems and examining feature selection and reduction techniques for solving clustering difficulties. In line with the objectives and scope of this study, the present research is a joint effort to improve feature selection and vectorization frameworks in order to enhance current text analysis techniques that can be applied to Arabic Web pages.
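
    As a hedged illustration of the pipeline the review discusses (term normalisation, vectorisation, dimensionality reduction, clustering), the sketch below uses NLTK's ISRI stemmer for Arabic light stemming, TF-IDF features, truncated SVD, and k-means; these component choices and parameters are assumptions, not recommendations drawn from the surveyed papers.

```python
# Illustrative sketch of the pipeline discussed in the review: Arabic term
# normalisation (stemming), TF-IDF vectorisation, dimensionality reduction,
# then clustering. Library and parameter choices are assumptions.
from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

stemmer = ISRIStemmer()

def stem_arabic(text):
    # Reduce each token to a stem so that inflected forms share one feature.
    return " ".join(stemmer.stem(token) for token in text.split())

def cluster_arabic_pages(pages, n_clusters=8, n_components=100):
    # Assumes a corpus large enough that n_components < number of distinct terms.
    tfidf = TfidfVectorizer(preprocessor=stem_arabic).fit_transform(pages)
    # Truncated SVD addresses the dimensionality problem the review raises.
    reduced = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
```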

    State of the art document clustering algorithms based on semantic similarity

    The continued growth of the Internet has caused the number of text documents in electronic form to increase enormously. Techniques for grouping these documents into meaningful clusters have therefore become a critical task. Traditional clustering methods are based on statistical features, clustering documents on syntactic rather than semantic grounds. However, these techniques can place dissimilar documents in the same group because of polysemy and synonymy. An important solution to this issue is document clustering based on semantic similarity, in which documents are grouped according to meaning rather than keywords. In this research, eighty papers that use semantic similarity in different fields are reviewed; forty of them, which apply semantic similarity to document clustering and were published between 2014 and 2020, are selected for deeper study. A comprehensive literature review of all the selected papers is given, with detailed analysis and comparison of their clustering algorithms, tools, and evaluation methods. This supports the implementation and evaluation of document clustering and informs the direction of the proposed research. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.
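
    As a hedged illustration of the meaning-based clustering the review surveys, the sketch below embeds documents with a pretrained sentence encoder and clusters them by cosine distance; the model name `all-MiniLM-L6-v2`, the `sentence_transformers` dependency, and the distance threshold are assumptions for demonstration and are not drawn from the reviewed papers.

```python
# Illustrative sketch of meaning-based (rather than keyword-based) document
# clustering: pretrained sentence embeddings + agglomerative clustering on
# cosine distance. The model name is an assumption for demonstration only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def semantic_clusters(documents, distance_threshold=0.4):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(documents, normalize_embeddings=True)
    # Cosine distance on embeddings groups documents that share meaning even
    # when they use different surface vocabulary (the synonymy problem noted above).
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(embeddings)
```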

    Predicting credit rating change using machine learning and natural language processing

    Corporate credit ratings provide standardized third-party information for market participants. They offer many benefits for issuers, intermediaries and investors, and generally increase trust and efficiency in the market. Credit ratings are provided by credit rating agencies. In addition to quantitative information about companies (e.g. financial statements), the qualitative information in company-related textual documents is known to be a determinant in the credit rating process. However, the way in which the credit rating agencies interpret this data is not public information. The purpose of this thesis is to develop a supervised machine learning model that predicts credit rating changes as a binary classification problem, based on Form 10-K annual reports of public U.S. companies. Before being used in the classification task, the Form 10-K reports are pre-processed using natural language processing methods. More generally, this thesis aims to answer to what extent a change in a company's credit rating can be predicted from Form 10-K reports, and whether the use of topic modeling can improve the results. A total of five machine learning algorithms are used for the binary classification task, and their performances are compared. These algorithms are support vector machine, logistic regression, decision tree, random forest and naïve Bayes classifier. Topic modeling is implemented using latent semantic analysis. The studies of Hajek et al. (2016) and Chen et al. (2017) are the main sources of inspiration for this thesis, and the methods used here are for the most part similar to theirs. This thesis adds value to the findings of these studies by investigating how the credit rating prediction methods in Hajek et al. (2016), the binary classification methods in Chen et al. (2017), and the utilization of Form 10-K annual reports (used in both Hajek et al. (2016) and Chen et al. (2017)) can be combined into a binary credit rating classifier. The results of the study show that credit rating change can be predicted using Form 10-K data, but the predictions are not very accurate. The best classification results were obtained using a support vector machine, with an accuracy of 69.4% and an AUC of 0.6744. No significant improvement in classification performance was obtained using topic modeling.
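
    A minimal sketch of the setup the abstract describes (TF-IDF features from 10-K text, LSA topic features via truncated SVD, and an SVM classifier) is shown below using scikit-learn; the number of topics, kernel, and other parameters are assumptions rather than the thesis's actual configuration.

```python
# Minimal sketch of the described setup: TF-IDF features from 10-K text,
# LSA topic features via truncated SVD, and an SVM predicting whether the
# credit rating changes. Dimensions and parameters are assumptions.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def rating_change_classifier(n_topics=100):
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", max_features=20000)),
        ("lsa", TruncatedSVD(n_components=n_topics, random_state=0)),
        ("svm", SVC(kernel="rbf", probability=True, class_weight="balanced")),
    ])

# texts: list of 10-K narrative sections; labels: 1 if the rating changed, else 0.
# scores = cross_val_score(rating_change_classifier(), texts, labels,
#                          cv=5, scoring="roc_auc")
```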

    Finding the online cry for help : automatic text classification for suicide prevention

    Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content. In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and of the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance. The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60%, it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected. We found a variety of information sources to be informative for both tasks, including token and character n-gram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus. To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions to this complex search problem, and resulted in more robust models. Experiments with cascaded classification of the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall. At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system, using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection and, more generally, for text classification on user-generated content.
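
    The sketch below illustrates, under stated assumptions, the kind of high-precision setup the thesis reports: token and character n-gram bag-of-words features feeding a linear classifier, with a raised decision threshold to keep false alarms low. The exact feature set, learner, and threshold used in the thesis are not reproduced here.

```python
# Sketch in the spirit of the thesis: token and character n-gram bag-of-words
# features feeding a linear classifier, with a raised decision threshold to
# favour precision over recall. Feature set and threshold are assumptions.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

detector = Pipeline([
    ("features", FeatureUnion([
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ])),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

def flag_high_risk(posts, trained_detector, threshold=0.8):
    # A high threshold keeps the number of false alarms low, which the thesis
    # reports as essential for use in a real prevention workflow.
    probabilities = trained_detector.predict_proba(posts)[:, 1]
    return [post for post, p in zip(posts, probabilities) if p >= threshold]
```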

    COST292 experimental framework for TRECVID 2008

    In this paper, we give an overview of the four tasks submitted to TRECVID 2008 by COST292. The high-level feature extraction framework comprises four systems. The first system transforms a set of low-level descriptors into the semantic space using Latent Semantic Analysis and utilises neural networks for feature detection. The second system uses a multi-modal classifier based on SVMs and several descriptors. The third system uses three image classifiers based on ant colony optimisation, particle swarm optimisation and a multi-objective learning algorithm. The fourth system uses a Gaussian model for singing detection and a person detection algorithm. The search task is based on an interactive retrieval application combining retrieval functionalities in various modalities with a user interface supporting automatic and interactive search over all queries submitted. The rushes task submission is based on a spectral clustering approach for removing similar scenes, based on the eigenvalues of the frame similarity matrix, and a redundancy removal strategy that depends on the extraction of semantic features such as camera motion and faces. Finally, the submission to the copy detection task is conducted by two different systems. The first system consists of a video module and an audio module. The second system is based on mid-level features that are related to the temporal structure of videos.
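
    As a rough illustration of the rushes-task idea, the sketch below applies spectral clustering to a precomputed frame-similarity matrix and keeps one representative frame per cluster; the cluster count and the choice of representative are assumptions, not details from the COST292 submission.

```python
# Sketch of the rushes-task idea: spectral clustering over a precomputed
# frame-similarity matrix, keeping one representative frame per cluster to
# drop redundant scenes. Cluster count and representative choice are assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

def deduplicate_frames(similarity, n_scenes=20):
    # 'similarity' is a symmetric (n_frames x n_frames) matrix, e.g. built from
    # colour-histogram or descriptor comparisons between frames.
    labels = SpectralClustering(
        n_clusters=n_scenes, affinity="precomputed", random_state=0
    ).fit_predict(similarity)

    representatives = []
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        # Keep the frame most similar to the rest of its cluster.
        best = members[np.argmax(similarity[members][:, members].sum(axis=1))]
        representatives.append(int(best))
    return sorted(representatives)
```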

    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onwards, seemingly uncontained. The annual cost of fraud in the UK is estimated to be as high as £193bn a year [1]. From a data science perspective, and one hitherto less explored, this thesis demonstrates how the use of linguistic features to drive data mining algorithms can aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual report/10-K narrative sections from firms formally indicted for FSF, juxtaposed with 306 non-fraud firms of similar size and industrial grouping. Differently from other similar studies, this thesis uniquely takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine 'what' was said as opposed to 'how'. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts of keywords unearthed in a previous content analysis study of financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine whether they aid in discriminating a fraud firm from a non-fraud firm. The results derived from the battery of models built typically exceed a classification accuracy of 70%. The above process is amalgamated into a framework. The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis could aid in fraud detection, and also constitutes a unique contribution to deception detection studies.
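
    To make the feature-driven approach concrete, the sketch below computes a few linguistic features of the kind the thesis describes (average sentence length, a crude readability proxy, word-list counts) and feeds them to a classifier; the word lists, feature set, and model here are illustrative assumptions rather than the thesis's actual configuration.

```python
# Sketch of how simple linguistic features of the kind described (a readability
# proxy, word-list counts, length ratios) can drive a fraud/non-fraud classifier.
# The word lists, features, and model below are illustrative assumptions.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

POSITIVE_TONE = {"growth", "strong", "improved", "record", "confident"}
UNCERTAINTY = {"may", "approximately", "possibly", "uncertain", "believe"}

def linguistic_features(report):
    words = re.findall(r"[a-z']+", report.lower())
    sentences = max(1, len(re.findall(r"[.!?]+", report)))
    long_words = sum(1 for w in words if len(w) >= 7)
    return [
        len(words) / sentences,                  # average sentence length
        long_words / max(1, len(words)),         # crude readability proxy
        sum(w in POSITIVE_TONE for w in words),  # tone word-list hits
        sum(w in UNCERTAINTY for w in words),    # uncertainty word-list hits
    ]

def train_fraud_classifier(reports, labels):
    # labels: 1 for indicted (FSF) firms, 0 for matched non-fraud firms.
    features = np.array([linguistic_features(r) for r in reports])
    return RandomForestClassifier(n_estimators=300, random_state=0).fit(features, labels)
```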