
    Supervised Identification of Writer's Native Language Based on Their English Word Usage

    In this paper, we investigate the possibility of constructing an automated tool for detecting a writer's first language from a document written in their second language. Since English is the contemporary lingua franca, commonly used by non-native speakers, we chose it as the second language to study. We examine English texts from computer science, a field related to mathematics; more generally, we wanted to study texts from a domain that operates with formal rules. We achieved a high classification rate, about 90%, using a relatively simple model (n-grams with logistic regression). We trained the model to distinguish twelve nationality groups/first languages based on our dataset. The classification mechanism was implemented using logistic regression with L1 regularisation, which performs well on sparse document-term matrices. The experiment showed that vocabulary alone can be used to detect the first language with high accuracy.
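    A minimal sketch of the kind of pipeline this abstract describes (word n-gram counts fed to L1-regularised logistic regression), assuming scikit-learn; the texts, labels, and parameter values below are placeholders, not the authors' data or settings.

```python
# Sketch: n-gram features + L1-regularised logistic regression.
# `texts` and `native_languages` are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "We consider a graph G with n vertices and prove the bound holds.",
    "Let f be a function such that the following property is satisfied.",
]
native_languages = ["polish", "german"]  # one first-language label per document

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(
        penalty="l1",
        solver="liblinear",  # a solver that supports the L1 penalty
        C=1.0,               # inverse regularisation strength
    ),
)
model.fit(texts, native_languages)
print(model.predict(["Suppose that the theorem holds for all such graphs."]))
```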

    Enhanced ontology-based text classification algorithm for structurally organized documents

    Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to the set of tags given in advance. Most TC algorithms represent a document by its terms alone, without considering the relations among those terms: every word is treated as a dimension of the representation space. Such representations are highly dimensional, which harms classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating suitable feature vectors and reducing the dimensionality of the data, thereby enhancing classification accuracy. This research combines ontology with text representation for classification by developing five algorithms. The first and second algorithms, Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent the document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify a document into its related set of classes. These proposed algorithms were tested on five different scientific-paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed CFV_TC and SFV_TC algorithms showed better average precision, recall, F-measure, and accuracy compared against SVM and RSS approaches. The work in this study contributes to exploring related documents in information retrieval and text mining research by using ontology in TC.
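    As an illustration of the concept-feature-vector idea (not the thesis's actual CFV algorithm), the toy sketch below maps terms to ontology concepts before counting, so synonymous terms collapse into a single dimension; the term-to-concept table is invented.

```python
from collections import Counter

# Hypothetical toy ontology: maps terms to the concepts they denote.
TERM_TO_CONCEPT = {
    "car": "vehicle", "automobile": "vehicle", "train": "vehicle",
    "tumour": "disease", "cancer": "disease",
}

def concept_feature_vector(tokens):
    """Count ontology concepts instead of raw terms, shrinking dimensionality."""
    return Counter(TERM_TO_CONCEPT.get(t, t) for t in tokens)

print(concept_feature_vector(["car", "automobile", "road"]))
# Counter({'vehicle': 2, 'road': 1}) -- two synonyms share one dimension
```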

    Detecting suicidality on Twitter

    Twitter is increasingly investigated as a means of detecting mental health status, including depression and suicidality, in the population. However, validated and reliable methods are not yet fully established. This study aimed to examine whether the level of concern for a suicide-related post on Twitter could be determined based solely on the content of the post, as judged by human coders and then replicated by machine learning. From 18th February 2014 to 23rd April 2014, Twitter was monitored for a series of suicide-related phrases and terms using the public Application Program Interface (API). Matching tweets were stored in a data annotation tool developed by the Commonwealth Scientific and Industrial Research Organisation (CSIRO). During this time, 14,701 suicide-related tweets were collected: 14% (n = 2,000) were randomly selected and divided into two equal sets (Set A and B) for coding by human researchers. Overall, 14% of suicide-related tweets were classified as ‘strongly concerning’, with the majority coded as ‘possibly concerning’ (56%) and the remainder (29%) considered ‘safe to ignore’. The overall agreement rate among the human coders was 76% (average κ = 0.55). Machine learning processes were subsequently applied to assess whether a ‘strongly concerning’ tweet could be identified automatically. The computer classifier correctly identified 80% of ‘strongly concerning’ tweets and showed increasing gains in accuracy; however, future improvements are necessary, as a plateau was not reached as the amount of data increased. The current study demonstrated that it is possible to distinguish the level of concern among suicide-related tweets, using both human coders and an automatic machine classifier. Importantly, the machine classifier replicated the accuracy of the human coders. The findings confirmed that Twitter is used by individuals to express suicidality and that such posts evoked a level of concern that warranted further investigation. However, the predictive power for actual suicidal behaviour is not yet known, and the findings do not directly identify targets for intervention. This project was supported in part by funding from the NSW Mental Health Commission and the NHMRC John Cade Fellowship 1056964. PJB and ALC are supported by NHMRC Early Career Fellowships 1035262 and 1013199.
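    A minimal sketch of the inter-coder agreement statistics reported above (raw agreement and Cohen's kappa), assuming scikit-learn; the coded labels below are invented examples, not the study's annotated tweets.

```python
from sklearn.metrics import cohen_kappa_score

LEVELS = ["strongly concerning", "possibly concerning", "safe to ignore"]

# Hypothetical codings of the same four tweets by two human coders.
coder_a = [LEVELS[0], LEVELS[1], LEVELS[2], LEVELS[1]]
coder_b = [LEVELS[0], LEVELS[1], LEVELS[1], LEVELS[1]]

# Raw agreement: fraction of tweets on which the coders gave the same label.
agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
# Cohen's kappa corrects that fraction for chance agreement.
kappa = cohen_kappa_score(coder_a, coder_b, labels=LEVELS)
print(f"raw agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```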

    Topic-enhanced Models for Speech Recognition and Retrieval

    This thesis aims to examine ways in which topical information can be used to improve recognition and retrieval of spoken documents. We consider the interrelated concepts of locality, repetition, and ‘subject of discourse’ in the context of speech processing applications: speech recognition, speech retrieval, and topic identification of speech. This work demonstrates how supervised and unsupervised models of topics, applicable to any language, can improve accuracy in accessing spoken content. It examines the complementary aspects of topic information in lexical content in terms of local context (locality or repetition of word usage) and broad context (the typical ‘subject matter’ definition of a topic). By augmenting speech processing language models with topic information, we demonstrate consistent improvements in performance on a number of metrics. We add locality to bag-of-words topic identification models, we quantify the relationship between topic information and keyword retrieval, and we consider word repetition both in terms of keyword-based retrieval and language modeling. We then combine these concepts and develop joint models of local and broad context via latent topic models. We present a latent topic model framework that treats documents as arising from an underlying topic sequence combined with a cache-based repetition model. We analyze the proposed model both for its ability to capture word repetition via the cache and for its suitability as a language model for speech recognition and retrieval. We show that this model, augmented with the cache, captures intuitive repetition behavior across languages and exhibits lower perplexity than regular LDA on held-out data in multiple languages. Lastly, we show that our joint model improves speech retrieval performance beyond n-grams or latent topics alone when applied to a term detection task in all languages considered.
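    A toy sketch of the cache-based repetition idea mentioned above, interpolating a static background unigram model with a cache of recently seen words; this is not the thesis's latent-topic framework, and the interpolation weight, cache size, and text are invented.

```python
import math
from collections import Counter, deque

def cache_lm_perplexity(words, background, cache_size=50, lam=0.3):
    """Perplexity of a background unigram model interpolated with a recency cache."""
    cache = deque(maxlen=cache_size)
    log_prob = 0.0
    for w in words:
        p_bg = background.get(w, 1e-6)        # static background probability
        p_cache = cache.count(w) / len(cache) if cache else 0.0
        p = (1 - lam) * p_bg + lam * p_cache  # linear interpolation
        log_prob += math.log(p)
        cache.append(w)                       # recently seen words enter the cache
    return math.exp(-log_prob / len(words))

text = "the cat sat on the mat and the cat slept".split()
background = {w: c / len(text) for w, c in Counter(text).items()}
print(cache_lm_perplexity(text, background))  # repeated words get a cache boost
```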

    Breakthrough Inventions and Migrating Clusters of Innovation

    We investigate the speed at which clusters of invention for a technology migrate spatially following breakthrough inventions. We identify breakthrough inventions as the top one percent of US inventions for a technology during 1975-1984 in terms of subsequent citations. Patenting growth is significantly higher in cities and technologies where breakthrough inventions occur after 1984, relative to peer locations that do not experience breakthrough inventions. This growth differential in turn depends on the mobility of the technology's labor force, which we model through the extent to which technologies depend upon immigrant scientists and engineers. Spatial adjustments are faster for technologies that depend heavily on immigrant inventors. The results qualitatively confirm the mechanism of industry migration proposed in models like [Duranton, G., 2007. Urban evolutions: The fast, the slow, and the still. American Economic Review 97, 197-221].
    Keywords: Agglomeration, Clusters, Entrepreneurship, Invention, Mobility, Reallocation, R&D, Patents, Scientists, Engineers, Immigration.
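    A hypothetical sketch of the breakthrough definition used above (the top one percent of inventions within a technology by subsequent citations), assuming pandas; the column names and toy data are assumptions, not the authors' patent schema.

```python
import pandas as pd

# Toy patent data; the paper uses US patents granted during 1975-1984.
patents = pd.DataFrame({
    "technology": ["A", "A", "A", "B", "B"],
    "patent_id": [1, 2, 3, 4, 5],
    "citations": [120, 3, 7, 95, 2],
})

# Flag patents at or above the 99th citation percentile within each technology.
threshold = patents.groupby("technology")["citations"].transform(
    lambda c: c.quantile(0.99)
)
patents["breakthrough"] = patents["citations"] >= threshold
print(patents)
```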

    Wikipedia-based hybrid document representation for textual news classification

    The sheer number of news items published every day makes it worthwhile to automate their classification. The common approach consists in representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts, or units of meaning, have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In reality, when classifying news items, the BoW representation has proven to be remarkably strong, with several studies reporting it to perform above different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier, comparing it with BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created expressly for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher in the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items. Funding: Atlantic Research Center for Information and Communication Technologies; Xunta de Galicia Ref. R2014/034 (RedPlir); Xunta de Galicia Ref. R2014/029 (TELGalicia).
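    A minimal sketch of the hybrid representation described above (bag-of-words features concatenated with concept features), assuming scikit-learn; the `concept_lookup` table is a hypothetical stand-in for the Wikipedia-based semantic analysis, which is far richer in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical word-to-concept table standing in for Wikipedia lookups.
concept_lookup = {
    "soccer": "sport", "football": "sport",
    "election": "politics", "vote": "politics",
}

def to_concepts(doc):
    """Rewrite a document as the concepts its words map to."""
    return " ".join(concept_lookup.get(w, "") for w in doc.lower().split())

hybrid = FeatureUnion([
    ("bow", CountVectorizer()),                               # raw word counts
    ("concepts", CountVectorizer(preprocessor=to_concepts)),  # concept counts
])
X = hybrid.fit_transform(["Soccer fans vote", "Election results announced"])
print(X.toarray())  # BoW columns followed by concept columns
```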
