
    Generating adaptive hypertext content from the semantic web

    Accessing and extracting knowledge from online documents is crucial for the realisation of the Semantic Web and the provision of advanced knowledge services. The Artequakt project is an ongoing investigation tackling these issues to facilitate the creation of tailored biographies from information harvested from the web. In this paper we present the methods we currently use to model, consolidate and store knowledge extracted from the web so that it can be re-purposed as adaptive content. We look at how Semantic Web technology could be used within this process, and also how such techniques might be used to provide content to be published via the Semantic Web.
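
    As a rough illustration of what "modelling, consolidating and storing" harvested knowledge with Semantic Web technology can look like, the sketch below records a few biography facts as RDF triples using the rdflib Python library. The ontology terms, URIs and the example artist are illustrative assumptions, not Artequakt's actual schema or knowledge store.

```python
# Minimal sketch of storing facts harvested from the Web as RDF triples so they
# can later be queried and re-purposed as adaptive content. rdflib stands in
# for the project's actual knowledge store; the properties and URIs below are
# illustrative assumptions, not Artequakt's schema.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/artist/")

g = Graph()
g.bind("foaf", FOAF)

artist = URIRef(EX["RembrandtVanRijn"])
g.add((artist, RDF.type, FOAF.Person))
g.add((artist, FOAF.name, Literal("Rembrandt van Rijn")))
g.add((artist, EX.birthPlace, Literal("Leiden")))  # illustrative property

# Consolidated triples can then be serialised or queried (e.g. with SPARQL)
# to assemble a tailored biography.
print(g.serialize(format="turtle"))
```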

    Distantly Supervised Web Relation Extraction for Knowledge Base Population

    Extracting information from Web pages for populating large, cross-domain knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise, and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results, one of those approaches being distant supervision. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. In this paper we propose the use of distant supervision for relation extraction from the Web. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co-reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases we present and evaluate several information integration strategies and show that those benefit immensely from additional relation mentions extracted using co-reference resolution, increasing precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%.
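
    To make the distant supervision idea concrete, the sketch below labels a sentence with a relation whenever both entities of a knowledge-base triple occur in it, producing (noisy) training data for a relation classifier. The triples, sentences and plain string matching are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch of distant-supervision labelling: any sentence that mentions
# both entities of a knowledge-base triple is treated as a (possibly noisy)
# training example for that triple's relation. Entity matching here is plain
# string containment; the paper's approach is far more robust.

KB_TRIPLES = [
    # (subject, relation, object) -- illustrative facts, not from the paper
    ("Ada Lovelace", "birthPlace", "London"),
    ("Ada Lovelace", "field", "mathematics"),
]

def label_sentences(sentences, kb_triples):
    """Return (sentence, relation) pairs for sentences mentioning both entities."""
    training_data = []
    for sentence in sentences:
        for subj, relation, obj in kb_triples:
            if subj in sentence and obj in sentence:
                training_data.append((sentence, relation))
    return training_data

sentences = [
    "Ada Lovelace was born in London in 1815.",
    "Ada Lovelace is celebrated for her work in mathematics.",
    "London is the capital of the United Kingdom.",
]

for sentence, relation in label_sentences(sentences, KB_TRIPLES):
    print(f"{relation}: {sentence}")
```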

    Mining Web usage using FRS

    Web Usage Mining (WUM) is the application of data mining methods to extract potentially useful information from web usage data. Its applications include improving website design, personalised services, target marketing, etc. Outstanding research issues in WUM include inefficiency in mining large weblogs, extracted patterns that are not representative of actual user behavior, and mining results that are too general, uninteresting and lacking in insight. This paper attempts to address these problems using a mining method, based on the notion of regularity, that captures user traversal activities more effectively. A mining algorithm using a vertical database approach is introduced. The experiments suggest that the method is efficient, scalable, and able to address the confusion caused by a large number of extracted patterns.
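
    The vertical database idea mentioned above can be illustrated as follows: each page is mapped to the set of sessions in which it occurs, and the support of a set of pages is computed by intersecting those session-ID sets. This is only a sketch of the representation, not the paper's regularity-based (FRS) algorithm.

```python
# Minimal sketch of a vertical layout for web usage mining: each page is mapped
# to the set of session IDs containing it, and the support of a page set is the
# size of the intersection of those session-ID sets. Illustration only; this is
# not the paper's FRS method.

from collections import defaultdict
from functools import reduce

# Illustrative sessions: session_id -> sequence of visited pages
sessions = {
    "s1": ["/home", "/products", "/cart"],
    "s2": ["/home", "/blog"],
    "s3": ["/home", "/products", "/checkout"],
}

# Build the vertical layout: page -> set of session IDs containing it
vertical = defaultdict(set)
for sid, pages in sessions.items():
    for page in pages:
        vertical[page].add(sid)

def support(pageset):
    """Number of sessions containing every page in pageset."""
    tid_sets = [vertical[p] for p in pageset]
    return len(reduce(set.intersection, tid_sets)) if tid_sets else 0

print(support({"/home", "/products"}))  # 2 (sessions s1 and s3)
```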

    Generating Paired Transliterated-cognates Using Multiple Pronunciation Characteristics from Web corpora

    A novel approach to automatically extracting paired transliterated-cognates from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking multiple pronunciation characteristics into account. Terms from different languages may be pronounced very differently, and incorporating knowledge of word origin may improve the pronunciation accuracy of terms. The accuracy of the generated phonetic information has an important impact on term transliteration and hence on transliterated-term extraction. Transliterated-term extraction, a fundamental task in natural language processing, extracts paired transliterated terms for the study of term transliteration. An experiment on transliterated-term extraction from two kinds of Web resources, Web pages and anchor texts, has been conducted and evaluated. The experimental results show that many transliterated-term pairs which cannot be extracted by an approach exploiting only English pronunciation characteristics are successfully extracted by the proposed approach. Taking multiple language-specific pronunciation transformations into account may further improve the output of transliterated-term extraction.
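
    The sketch below illustrates the basic pairing step: candidate terms are compared with a source term by similarity of rough phoneme sequences, and the closest candidate above a threshold is accepted as its transliteration. The placeholder "pronunciation", the similarity measure and the threshold are illustrative assumptions; the paper's multi-characteristic pronunciation modelling is considerably more elaborate.

```python
# Minimal sketch of pairing a source term with candidate transliterations by
# phonetic similarity. The phoneme mapping and scoring below are purely
# illustrative; the paper models multiple language-specific pronunciation
# characteristics far more carefully.

from difflib import SequenceMatcher

def rough_phonemes(term):
    """Crude placeholder 'pronunciation': lowercase letters only."""
    return [c for c in term.lower() if c.isalpha()]

def phonetic_similarity(a, b):
    """Similarity in [0, 1] between two rough phoneme sequences."""
    return SequenceMatcher(None, rough_phonemes(a), rough_phonemes(b)).ratio()

def best_transliteration(source_term, candidates, threshold=0.5):
    """Pick the candidate most phonetically similar to the source term."""
    scored = [(phonetic_similarity(source_term, c), c) for c in candidates]
    score, best = max(scored)
    return (source_term, best) if score >= threshold else None

# Candidates are romanised here for readability; real candidates would come
# from Web pages and anchor texts in the target script.
print(best_transliteration("clinton", ["ke-lin-dun", "ba-li", "lun-dun"]))
```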

    A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining

    This chapter addresses two crucial issues that arise when one applies Web-mining techniques to extract relevant information. The first is the acquisition of useful knowledge from textual data; the second stems from the fact that a web page often contains a considerable amount of 'noise' with respect to the sections that are truly informative for the user's purposes. The novel contribution of this work is a framework that tackles both tasks at the same time, supporting text summarization and page segmentation. The approach achieves this goal by exploiting semantic networks to map natural language into an abstract representation, which eventually supports the identification of the topics addressed in a text source. A heuristic algorithm uses the abstract representation to highlight the relevant segments of text in the original document. The approach was verified on a publicly available benchmark, the DUC 2002 dataset, and satisfactory results confirmed the method's effectiveness.
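
    As a toy illustration of the general idea of separating informative segments from noise by comparing them against a document-level topic representation, the sketch below uses plain word overlap in place of the chapter's semantic-network representation. The heuristic, stop-word list and example segments are illustrative assumptions.

```python
# Minimal sketch: score page segments against a crude 'topic' representation of
# the whole document and keep the relevant ones. Word overlap stands in for the
# semantic-network mapping used in the chapter; illustration only.

import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def topic_terms(document, k=5):
    """Most frequent content words act as a crude 'topic' representation."""
    stop = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}
    counts = Counter(t for t in tokens(document) if t not in stop)
    return {term for term, _ in counts.most_common(k)}

def relevant_segments(segments, topics, min_overlap=1):
    """Keep segments sharing at least min_overlap terms with the topics."""
    return [s for s in segments if len(set(tokens(s)) & topics) >= min_overlap]

page_segments = [
    "Semantic networks map natural language into an abstract representation.",
    "Click here to subscribe to our newsletter!",
    "The representation supports identifying the topics of a text source.",
]

topics = topic_terms(" ".join(page_segments))
print(relevant_segments(page_segments, topics))
```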

    Hacking an Ambiguity Detection Tool to Extract Variation Points: an Experience Report

    Natural language (NL) requirements documents can be a precious source of variability information. This information can later be used to define feature models from which different systems can be instantiated. In this paper, we are interested in validating the approach we have recently proposed to extract variability issues from the ambiguity defects found in NL requirements documents. To this end, we single out ambiguities using an available NL analysis tool, QuARS, and we classify the ambiguities returned by the tool by distinguishing among false positives, real ambiguities, and variation points. We consider three medium-sized requirements documents from different domains, namely train control, social web, and home automation. We report the results of the assessment in this paper. Although the validation set is not very large, the results obtained are quite uniform and permit us to draw some interesting conclusions. Starting from these results, we can foresee the tailoring of an NL analysis tool for extracting variability from NL requirements documents.
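
    The sketch below illustrates the triage step described above: sentences flagged by an ambiguity detection tool are sorted into false positives, real ambiguities and variation-point candidates. The indicator word lists and the finding format are illustrative assumptions and do not reflect QuARS's actual output or the paper's classification criteria.

```python
# Minimal sketch of triaging flagged requirement sentences into false
# positives, genuine ambiguities, and variation-point candidates. The indicator
# lists and finding format are assumptions, not QuARS's output format.

VARIABILITY_INDICATORS = {"optionally", "either", "or", "alternatively", "may"}
VAGUENESS_INDICATORS = {"appropriate", "adequate", "user-friendly", "fast"}

def triage(finding):
    """Classify one flagged requirement sentence."""
    words = {w.strip(".,;:") for w in finding["sentence"].lower().split()}
    if words & VARIABILITY_INDICATORS:
        return "variation point"
    if words & VAGUENESS_INDICATORS:
        return "real ambiguity"
    return "false positive"

findings = [
    {"sentence": "The system shall optionally log every door command."},
    {"sentence": "The interface shall be user-friendly."},
    {"sentence": "The train shall stop within 200 m."},
]

for f in findings:
    print(triage(f), "-", f["sentence"])
```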

    Sentiment analysis on UTHM issues with big data

    Nowadays, social media platforms such as Twitter, WhatsApp, Facebook and its Messenger, as well as Instagram, play a very important role in society. Twitter is a micro-blogging platform that provides a remarkable amount of data which can be used in a number of sentiment analysis applications such as predictions, reviews, and elections. Sentiment analysis is the process of extracting information about issues or specific topics from an enormous amount of data and categorizing it into different classes. The main target of this project is to classify Twitter data collected on Universiti Tun Hussein Onn Malaysia (UTHM) issues into sentiment values: positive, neutral or negative. The sentiment was classified using a sentiment classifier, with the data trained on a Naïve Bayes classifier from the TextBlob Python library. Lastly, results were displayed to the user through a web application using Jupyter Notebook. This study found that the percentages of positive, neutral and negative tweets regarding UTHM issues were 74%, 26% and 0% for English tweets, and 17%, 82% and 1% for Bahasa Melayu tweets, respectively. The positive and neutral sentiment results show a positive perception of the university's products and services, thus promoting and branding UTHM worldwide.
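
    The abstract names TextBlob's Naïve Bayes classifier, so the sketch below shows one way such tweet-level classification could look. The rule for deriving the 'neutral' class and the handling of Bahasa Melayu tweets are not described, so the probability margin and the example tweets are assumptions; TextBlob's NaiveBayesAnalyzer itself only distinguishes 'pos' from 'neg'.

```python
# Minimal sketch of classifying tweets with TextBlob's Naive Bayes analyzer, as
# mentioned in the abstract. The neutral margin and example tweets are assumed.
# Requires TextBlob's corpora (python -m textblob.download_corpora).

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

analyzer = NaiveBayesAnalyzer()

def classify(tweet, neutral_margin=0.1):
    """Return 'positive', 'negative' or 'neutral' for one tweet."""
    s = TextBlob(tweet, analyzer=analyzer).sentiment
    if abs(s.p_pos - s.p_neg) < neutral_margin:  # assumed neutral rule
        return "neutral"
    return "positive" if s.classification == "pos" else "negative"

tweets = [
    "Great facilities and helpful lecturers at UTHM!",
    "The registration system was down again today.",
]
for t in tweets:
    print(classify(t), "-", t)
```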

    Understanding Information and Knowledge Sharing in Online Communities: Emerging Research Approaches

    Social media have become an important component of contemporary information ecosystems. People use social media systems such as Twitter, Facebook, YouTube, and Tumblr to communicate ideas and information needs, seek advice and solve problems, and show appreciation for, or disagreement with, a person or issue. These tools facilitate the emergence of communities, often resembling the communities of practice that arise in workplaces and educational institutions, where a common interest, identity, and set of norms and structures for communicating develop through interaction. But while it seems easy to pull in data streams from social media to understand online communities, making sense of the resulting vast data sets has been challenging. The issues include not just the tools and methods for extracting and synthesizing large data sets like the Twitter Firehose, but also the ethical and responsible use and reporting of this data for academic and commercial purposes. This panel will focus on methodological approaches and research strategies for the study of social media communities, in particular web 2.0 tools that play an important role in the North American cultural landscape.