326 research outputs found

    Analyzing Prosody with Legendre Polynomial Coefficients

    This investigation demonstrates the effectiveness of Legendre polynomial coefficients for representing prosodic contours in two different tasks: nativeness classification and sarcasm detection. By using accurate representations of prosodic contours to answer fundamental linguistic questions, we contribute to the body of research on analyzing prosody in linguistics as well as on modeling prosody for machine learning tasks. Using Legendre polynomial coefficient representations of prosodic contours, we examine differences in prosody between native English speakers and non-native English speakers whose first language is Mandarin, and we characterize prosodic qualities of sarcastic speech. We additionally perform machine learning classification for both tasks, achieving an accuracy of 72.3% for nativeness classification and 81.57% for sarcasm detection. We recommend that linguists looking to analyze prosodic contours use Legendre polynomial coefficient modeling; the accuracy and quality of the resulting contour representations make them highly interpretable for linguistic analysis.
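
    As a minimal illustration of the representation described above, the sketch below fits a low-order Legendre expansion to a pitch contour with NumPy; the f0 values, frame rate and polynomial order are illustrative assumptions, not the study's actual data or settings.

        import numpy as np
        from numpy.polynomial import legendre

        # Illustrative f0 contour in Hz (e.g. one value per 10 ms frame); not real data.
        f0 = np.array([180, 185, 192, 200, 210, 215, 212, 205, 195, 188, 182, 178], dtype=float)

        # Legendre polynomials are orthogonal on [-1, 1], so rescale the time axis.
        t = np.linspace(-1, 1, len(f0))

        # Fit a low-order expansion; the leading coefficients summarise the mean level,
        # slope and curvature of the contour and can serve as classifier features.
        coeffs = legendre.legfit(t, f0, deg=4)

        # Reconstruct the smoothed contour from the coefficients.
        f0_hat = legendre.legval(t, coeffs)

        print("Legendre coefficients:", np.round(coeffs, 2))
        print("RMS fit error (Hz):", round(float(np.sqrt(np.mean((f0 - f0_hat) ** 2))), 2))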

    Hyperlink Network System and Image of Global Cities: Webpages and Their Contents

    A distinctive trend of globalization research is a conceptual expansion that mirrors the penetration of globalization into various aspects of life. The World Wide Web has become the primary platform for creating and disseminating information in this era of globalization. Although the importance of web-based information is widely acknowledged, it has not yet been used extensively in global city research. The purpose of this research is therefore to extend the concept of globalization to the efficiency of information networks and the thematic dimensionality of the images conveyed by webpages. To this end, 264 global and globalizing cities are selected. A hyperlink network is constructed from the web crawling results for each city, and hyperlink network analysis measures the effectiveness of these networks. Textual content is also extracted from the crawled webpages, and its thematic dimensionality is measured by quantified content analysis and multidimensional scaling. The efficiency of a hyperlink network in information flow is confirmed to be a new consideration that shapes the globality of cities. Cities with highly efficient connections have faster and easier access, which provides a better structure for city image formation. Social networking websites in particular are at the center of this information flow, indicating that social interactions on the Web play a crucial role in forming the images of cities. Beyond the positivity or negativity of a city's image, the position of cities in the thematic space denotes how they are expressed, discussed, and shared on the Web. The image status along dimensions of globalization is an important starting point for city branding. It is concluded that a research framework handling information networks and images simultaneously deepens the understanding of how the structure and contents of the Web affect the formation and maintenance of global city networks. Overall, this research demonstrates the usefulness of information networks and images of cities on the Web for overcoming data inconsistency and scarcity in global city research.
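
    As a rough illustration of the kind of network-efficiency measure mentioned above, the sketch below computes global efficiency for a toy hyperlink graph with networkx; the sites, links and symmetrisation step are assumptions made for illustration, not the study's actual data or procedure.

        import networkx as nx

        # Toy hyperlink network: nodes are websites, edges are hyperlinks found by a crawl.
        # The sites and links are purely illustrative.
        links = [
            ("cityhall.example", "tourism.example"),
            ("tourism.example", "socialnetwork.example"),
            ("socialnetwork.example", "cityhall.example"),
            ("socialnetwork.example", "news.example"),
            ("news.example", "tourism.example"),
        ]
        G = nx.DiGraph(links)

        # Global efficiency (average inverse shortest-path length) as a proxy for how
        # quickly information can flow through the network. networkx implements it for
        # undirected graphs, so the directed crawl graph is symmetrised here.
        efficiency = nx.global_efficiency(G.to_undirected())
        print(f"Global efficiency: {efficiency:.3f}")

        # In-degree centrality can highlight hubs such as social networking sites.
        centrality = nx.in_degree_centrality(G)
        print(sorted(centrality.items(), key=lambda kv: -kv[1]))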

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel training data, and phrases that are not present in that data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system, not by adding more parallel data but by using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on English-French and French-English translation, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add other types of information to the model.
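
    A minimal sketch of how such a similarity score might look is given below; the lemma-match rule, the character-level ratio and the threshold are illustrative assumptions, not the paper's actual scoring function.

        from difflib import SequenceMatcher

        def surface_similarity(a: str, b: str) -> float:
            """Character-level similarity between two word forms (0..1)."""
            return SequenceMatcher(None, a, b).ratio()

        def is_morphological_variant(form_a, lemma_a, form_b, lemma_b, threshold=0.7):
            """Treat two forms as variants if they share a lemma and their surface
            forms are close enough; both the rule and the threshold are illustrative."""
            return lemma_a == lemma_b and surface_similarity(form_a, form_b) >= threshold

        # Example: deciding whether "translated" can seed a new phrase entry for "translating".
        print(is_morphological_variant("translated", "translate", "translating", "translate"))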

    Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop


    Deliverable D2.7 Final Linked Media Layer and Evaluation

    This deliverable presents the evaluation of the content annotation and content enrichment systems that are part of the final tool set developed within the LinkedTV consortium. The evaluations were performed on both the Linked News and Linked Culture trial content, as well as on other content annotated for this purpose. The evaluation spans three languages: German (Linked News), Dutch (Linked Culture) and English. Selected algorithms and tools were also benchmarked in two international contests: MediaEval 2014 and TAC’14. Additionally, the Microposts 2015 NEEL Challenge is being organized with the support of LinkedTV.

    Textual Analysis of Intangible Information

    Traditionally, equity investors have relied on the information reported in firms’ financial accounts to make their investment decisions. Due to the conservative nature of accounting standards, firms cannot value intangible assets such as corporate culture, brand value and reputation. Investors’ efforts to collect such information have been hampered by the voluntary nature of Corporate Social Responsibility (CSR) reporting standards, which has resulted in the publication of inconsistent, stale and incomplete information across firms. In short, information on intangible assets is less salient to investors than accounting information because it is more costly to collect, process and analyse. In this thesis, we design an automated approach to collecting and quantifying information on firms’ intangible assets by drawing on techniques commonly adopted in the fields of Natural Language Processing (NLP) and Information Retrieval. The exploitation of unstructured data available on the Web holds promise for investors seeking to integrate a wider variety of information into their investment processes. The objectives of this research are: 1) to draw on textual analysis methodologies to measure intangible information from a range of unstructured data sources, 2) to integrate intangible information and accounting information into an investment analysis framework, and 3) to evaluate the merits of unstructured data for predicting firms’ future earnings.
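
    As a hedged illustration of the first objective, the sketch below scores documents against a small dictionary of intangible-asset terms using TF-IDF weighting with scikit-learn; the term list, documents and scoring rule are assumptions made for illustration, not the thesis's actual methodology.

        from sklearn.feature_extraction.text import TfidfVectorizer

        # Illustrative snippets standing in for CSR reports or news articles.
        docs = [
            "Our corporate culture rewards integrity and employee wellbeing.",
            "Quarterly revenue grew 4% on stronger hardware sales.",
            "Brand reputation improved after the sustainability initiative.",
        ]

        # Hypothetical dictionary of intangible-asset terms.
        intangible_terms = ["culture", "reputation", "brand", "sustainability", "integrity"]

        vectorizer = TfidfVectorizer(vocabulary=intangible_terms, lowercase=True)
        scores = vectorizer.fit_transform(docs)

        # A simple per-document intangible-information score: sum of TF-IDF weights.
        doc_scores = scores.toarray().sum(axis=1)
        for doc, score in zip(docs, doc_scores):
            print(f"{score:.3f}  {doc}")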

    Cross-lingual genre classification

    Automated classification of texts into genres can benefit NLP applications, in that the structure, location and even interpretation of information within a text are dictated by its genre. Cross-lingual methods promise such benefits to languages that lack genre-annotated training data. While there has been work on genre classification for over two decades, none had considered cross-lingual methods before the start of this project. My research aims to fill this gap. It follows previous approaches to monolingual genre classification that exploit simple, low-level text features, many of which can be extracted in different languages and serve similar functions. This contrasts with work on cross-lingual topic or sentiment classification, which typically uses word frequencies as features; these have been shown to be of limited use for genres. Many such methods also assume cross-lingual resources, such as machine translation, which limits the range of their application. A selection of these approaches is used as baselines in my experiments. I report the results of two semi-supervised methods for exploiting genre-labelled source-language texts and unlabelled target-language texts. The first is a relatively simple algorithm that bridges the language gap by exploiting cross-lingual features and then iteratively re-trains a classification model on previously predicted target texts. My results show that this approach works well where only a few cross-lingual resources are available and texts are to be classified into broad genre categories. It is also shown that further improvements can be achieved through multilingual training or cross-lingual feature selection if genre-annotated texts are available in several source languages. The second is a variant of the label propagation algorithm. This graph-based classifier learns genre-specific feature-set weights from both source- and target-language texts and uses them to adjust the propagation channels for each text. This allows further feature sets to be added as additional resources, such as Part-of-Speech taggers, become available. While the method performs well even with basic text features, it is shown to benefit from additional feature sets. Results also indicate that it handles fine-grained genre classes better than the iterative re-labelling method.
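
    The sketch below outlines a generic self-training loop of the kind described for the first method, assuming language-independent features have already been extracted into numeric arrays; the classifier, confidence criterion and stopping rule are illustrative assumptions, not the thesis's exact algorithm.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def self_train(X_src, y_src, X_tgt, n_rounds=5, top_k=10):
            """Illustrative self-training loop: repeatedly add the most confidently
            labelled target texts to the training set and refit. X_src, y_src and
            X_tgt are assumed to be NumPy arrays of cross-lingual features/labels."""
            X_train, y_train = X_src.copy(), y_src.copy()
            clf = LogisticRegression(max_iter=1000)
            remaining = np.arange(len(X_tgt))
            for _ in range(n_rounds):
                if len(remaining) == 0:
                    break
                clf.fit(X_train, y_train)
                proba = clf.predict_proba(X_tgt[remaining])
                confidence = proba.max(axis=1)
                # Move the top_k most confident target texts into the training set.
                picked = remaining[np.argsort(-confidence)[:top_k]]
                X_train = np.vstack([X_train, X_tgt[picked]])
                y_train = np.concatenate([y_train, clf.predict(X_tgt[picked])])
                remaining = np.setdiff1d(remaining, picked)
            return clf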

    B!SON: A Tool for Open Access Journal Recommendation

    Finding a suitable open access journal in which to publish scientific work is a complex task: researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, is built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on the title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.
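
    For intuition only, the sketch below ranks hypothetical journals by TF-IDF cosine similarity between their profile text and a submitted title/abstract; the journal names, profile texts and similarity measure are assumptions and do not reflect B!SON's actual implementation.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Hypothetical journal profiles built from the titles/abstracts of their articles.
        journal_texts = {
            "Journal of Open Information Science": "information retrieval open access metadata indexing",
            "Computational Linguistics Letters": "parsing machine translation language models corpora",
            "Scientometrics Review": "citation analysis bibliometrics research evaluation impact",
        }

        submission = "A recommender system suggesting open access venues from title and abstract"

        names = list(journal_texts)
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(list(journal_texts.values()) + [submission])

        # Rank journals by cosine similarity to the submitted text.
        query_vec = matrix[len(names):]
        journal_vecs = matrix[:len(names)]
        sims = cosine_similarity(query_vec, journal_vecs).ravel()
        for name, score in sorted(zip(names, sims), key=lambda kv: -kv[1]):
            print(f"{score:.3f}  {name}")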

    Improving Clustering Methods By Exploiting Richness Of Text Data

    Clustering is an unsupervised machine learning technique that involves discovering clusters (groups) of similar objects in unlabeled data and is generally considered an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can have a ripple effect across research in multiple fields. Clustering any type of data is challenging, and many research questions remain open. The problem is exacerbated for text data because of additional challenges such as capturing the semantics of a document, handling the rich features of text data and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address them with five new text clustering methods: Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC) and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five methods show that exploiting the rich features of text allows clustering to outperform the existing state-of-the-art text clustering methods. The first method, QSC, exploits user queries (one of the rich features of text data) to generate better-quality clusters and cluster labels. The second, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves clustering results. The third, MMOEA, is based on a multi-objective evolutionary algorithm; it exploits rich features to generate a diverse set of candidate clustering solutions and forms a better clustering solution using a cluster-oriented approach. The fourth and fifth methods, MDC and MOMVEC, address the limitations of MMOEA and differ in how their multi-objective evolutionary approaches are implemented. All five methods are compared with existing state-of-the-art methods, and the results show that the new methods outperform existing ones, achieving up to 16% improvement in some comparisons. In general, almost all of the new clustering algorithms show statistically significant improvements over existing methods. The key findings of the thesis are that exploiting user queries improves Search Result Clustering (SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensembles. The new text clustering methods introduced in this thesis can be widely applied in domains that involve the analysis of text data, and the contributions will help not only researchers in data mining but also a wide range of researchers in other fields.
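
    As a small illustration of the feature-weighted distance idea behind DWKM, the sketch below assigns toy document vectors to the nearest centroid under a per-feature weighted squared distance; the vectors, centroids and weights are illustrative assumptions, not the DWKM formulation itself.

        import numpy as np

        def weighted_sqdist(x, centroid, weights):
            """Squared distance with per-feature weights: features judged more
            informative (e.g. semantically important terms) pull the assignment more."""
            return np.sum(weights * (x - centroid) ** 2)

        # Illustrative document vectors (e.g. tf-idf over a tiny vocabulary).
        docs = np.array([[0.9, 0.1, 0.0],
                         [0.8, 0.2, 0.1],
                         [0.1, 0.9, 0.7]])
        centroids = np.array([[0.85, 0.15, 0.05],
                              [0.10, 0.85, 0.70]])
        weights = np.array([0.5, 0.3, 0.2])  # hypothetical semantic weights

        # Assign each document to the centroid with the smallest weighted distance.
        assignments = [int(np.argmin([weighted_sqdist(d, c, weights) for c in centroids]))
                       for d in docs]
        print(assignments)  # -> [0, 0, 1]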