250 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    Blogs as Infrastructure for Scholarly Communication.

    This project systematically analyzes digital humanities blogs as an infrastructure for scholarly communication. This exploratory research maps the discourses of a scholarly community to understand the infrastructural dynamics of blogs and the Open Web. The text contents of 106,804 individual blog posts from a corpus of 396 blogs were analyzed using a mix of computational and qualitative methods. Analysis uses an experimental methodology (trace ethnography) combined with unsupervised machine learning (topic modeling), to perform an interpretive analysis at scale. Methodological findings show topic modeling can be integrated with qualitative and interpretive analysis. Special attention must be paid to data fitness, or the shape and re-shaping practices involved with preparing data for machine learning algorithms. Quantitative analysis of computationally generated topics indicates that while the community writes about diverse subject matter, individual scholars focus their attention on only a couple of topics. Four categories of informal scholarly communication emerged from the qualitative analysis: quasi-academic, para-academic, meta-academic, and extra-academic. The quasi and para-academic categories represent discourse with scholarly value within the digital humanities community, but do not necessarily have an obvious path into formal publication and preservation. A conceptual model, the (in)visible college, is introduced for situating scholarly communication on blogs and the Open Web. An (in)visible college is a kind of scholarly communication that is informal, yet visible at scale. This combination of factors opens up a new space for the study of scholarly communities and communication. While (in)visible colleges are programmatically observable, care must be taken with any effort to count and measure knowledge work in these spaces. 
This is the first systematic, data-driven analysis of the digital humanities and lays the groundwork for subsequent social studies of digital humanities.
PhD, Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111592/1/mcburton_1.pd

    Webometrics benefitting from web mining? An investigation of methods and applications of two research fields

    Webometrics and web mining are two fields where research is focused on quantitative analyses of the web. This literature review outlines definitions of the fields, and then focuses on their methods and applications. It also discusses the potential of closer contact and collaboration between them. A key difference between the fields is that webometrics has focused on exploratory studies, whereas web mining has been dominated by studies focusing on development of methods and algorithms. Differences in type of data can also be seen, with webometrics more focused on analyses of the structure of the web and web mining more focused on web content and usage, even though both fields have been embracing the possibilities of user generated content. It is concluded that research problems where big data is needed can benefit from collaboration between webometricians, with their tradition of exploratory studies, and web miners, with their tradition of developing methods and algorithms.

    SWKM 2008: Social Web and Knowledge Management, Proceedings: CEUR Workshop Proceedings

    A treatise on Web 2.0 with a case study from the financial markets

    There has been much hype in vocational and academic circles surrounding the emergence of web 2.0 or social media; however, relatively little work was dedicated to substantiating the actual concept of web 2.0. Many have dismissed it as not deserving of this new title, since the term web 2.0 assumes a certain interpretation of web history, including enough progress in a certain direction to trigger a succession [i.e. web 1.0 → web 2.0]. Others provided arguments in support of this development, and there has been a considerable amount of enthusiasm in the literature. Much research has been busy evaluating current use of web 2.0, and analysis of the user generated content, but an objective and thorough assessment of what web 2.0 really stands for has been to a large extent overlooked. More recently the idea of collective intelligence facilitated via web 2.0, and its potential applications, have raised interest with researchers, yet a more unified approach and work in the area of collective intelligence is needed. This thesis identifies and critically evaluates a wider context for the web 2.0 environment, and what caused it to emerge, providing a rich literature review on the topic, a review of existing taxonomies, a quantitative and qualitative evaluation of the concept itself, and an investigation of the collective intelligence potential that emerges from application usage. Finally, a framework for harnessing collective intelligence in a more systematic manner is proposed. In addition to the presented results, novel methodologies are also introduced throughout this work. In order to provide interesting insight but also to illustrate analysis, a case study of the recent financial crisis is considered. Some interesting results relating to the crisis are revealed within user generated content data, and relevant issues are discussed where appropriate.

    Application of Common Sense Computing for the Development of a Novel Knowledge-Based Opinion Mining Engine

    The ways people express their opinions and sentiments have radically changed in the past few years thanks to the advent of social networks, web communities, blogs, wikis and other online collaborative media. The distillation of knowledge from this huge amount of unstructured information can be a key factor for marketers who want to create an image or identity in the minds of their customers for their product, brand, or organisation. These online social data, however, remain hardly accessible to computers, as they are specifically meant for human consumption. The automatic analysis of online opinions, in fact, involves a deep understanding of natural language text by machines, from which we are still very far. Hitherto, online information retrieval has been mainly based on algorithms relying on the textual representation of web-pages. Such algorithms are very good at retrieving texts, splitting them into parts, checking the spelling and counting their words. But when it comes to interpreting sentences and extracting meaningful information, their capabilities are known to be very limited. Existing approaches to opinion mining and sentiment analysis, in particular, can be grouped into three main categories: keyword spotting, in which text is classified into categories based on the presence of fairly unambiguous affect words; lexical affinity, which assigns arbitrary words a probabilistic affinity for a particular emotion; and statistical methods, which calculate the valence of affective keywords and word co-occurrence frequencies on the basis of a large training corpus. Early works aimed to classify entire documents as containing overall positive or negative polarity, or rating scores of reviews. Such systems were mainly based on supervised approaches relying on manually labelled samples, such as movie or product reviews where the opinionist’s overall positive or negative attitude was explicitly indicated. 
However, opinions and sentiments do not occur only at document level, nor are they limited to a single valence or target. Contrary or complementary attitudes toward the same topic or multiple topics can be present across the span of a document. In more recent works, text analysis granularity has been taken down to segment and sentence level, e.g., by using presence of opinion-bearing lexical items (single words or n-grams) to detect subjective sentences, or by exploiting association rule mining for a feature-based analysis of product reviews. These approaches, however, are still far from being able to infer the cognitive and affective information associated with natural language as they mainly rely on knowledge bases that are still too limited to efficiently process text at sentence level. In this thesis, common sense computing techniques are further developed and applied to bridge the semantic gap between word-level natural language data and the concept-level opinions conveyed by these. In particular, the ensemble application of graph mining and multi-dimensionality reduction techniques on two common sense knowledge bases was exploited to develop a novel intelligent engine for open-domain opinion mining and sentiment analysis. The proposed approach, termed sentic computing, performs a clause-level semantic analysis of text, which allows the inference of both the conceptual and emotional information associated with natural language opinions and, hence, a more efficient passage from (unstructured) textual information to (structured) machine-processable data. The engine was tested on three different resources, namely a Twitter hashtag repository, a LiveJournal database and a PatientOpinion dataset, and its performance compared both with results obtained using standard sentiment analysis techniques and using different state-of-the-art knowledge bases such as Princeton’s WordNet, MIT’s ConceptNet and Microsoft’s Probase. 
Unlike most currently available opinion mining services, the developed engine does not base its analysis on a limited set of affect words and their co-occurrence frequencies, but rather on common sense concepts and the cognitive and affective valence conveyed by these. This allows the engine to be domain-independent and, hence, to be embedded in any opinion mining system for the development of intelligent applications in multiple fields such as Social Web, HCI and e-health. Looking ahead, the combined novel use of different knowledge bases and of common sense reasoning techniques for opinion mining proposed in this work will eventually pave the way for the development of more bio-inspired approaches to the design of natural language processing systems capable of handling knowledge, retrieving it when necessary, making analogies and learning from experience.