160 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    A series of case studies to enhance the social utility of RSS

    Get PDF
    RSS (really simple syndication, rich site summary or RDF site summary) is a dialect of XML that provides a method of syndicating on-line content, where postings consist of frequently updated news items, blog entries and multimedia. RSS feeds, produced by organisations or individuals, are often aggregated, and delivered to users for consumption via readers. The semi-structured format of RSS also allows the delivery/exchange of machine-readable content between different platforms and systems. Articles on web pages frequently include icons that represent social media services which facilitate social data. Amongst these, RSS feeds deliver data which is typically presented in the journalistic style of headline, story and snapshot(s). Consequently, applications and academic research have employed RSS on this basis. Therefore, within the context of social media, the question arises: can the social function, i.e. utility, of RSS be enhanced by producing from it data which is actionable and effective? This thesis is based upon the hypothesis that the fluctuations in the keyword frequencies present in RSS can be mined to produce actionable and effective data, to enhance the technology's social utility. To this end, we present a series of laboratory-based case studies which demonstrate two novel and logically consistent RSS-mining paradigms. Our first paradigm allows users to define mining rules to mine data from feeds. The second paradigm employs a semi-automated classification of feeds and correlates this with sentiment. We visualise the outputs produced by the case studies for these paradigms, where they can benefit users in real-world scenarios, varying from statistics and trend analysis to mining financial and sporting data. The contributions of this thesis to web engineering and text mining are the demonstration of the proof of concept of our paradigms, through the integration of an array of open-source, third-party products into a coherent and innovative, alpha-version prototype software implemented in a Java JSP/servlet-based web application architecture

    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    Get PDF
    Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system with new protocols and applications that arise at a constant pace. All these characteristics designate the Internet a valuable and challenging data source and application domain for a research activity, both looking at Transport layer, analyzing network tra c flows, and going up to Application layer, focusing on the ever-growing next generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.). In this thesis work we focus on the study, design and development of novel algorithms and frameworks to support large scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as data source and targeting network tra c classification, on-line social network analysis, recommendation systems and cloud services and Big data

    Recent Developments in Smart Healthcare

    Get PDF
    Medicine is undergoing a sector-wide transformation thanks to the advances in computing and networking technologies. Healthcare is changing from reactive and hospital-centered to preventive and personalized, from disease focused to well-being centered. In essence, the healthcare systems, as well as fundamental medicine research, are becoming smarter. We anticipate significant improvements in areas ranging from molecular genomics and proteomics to decision support for healthcare professionals through big data analytics, to support behavior changes through technology-enabled self-management, and social and motivational support. Furthermore, with smart technologies, healthcare delivery could also be made more efficient, higher quality, and lower cost. In this special issue, we received a total 45 submissions and accepted 19 outstanding papers that roughly span across several interesting topics on smart healthcare, including public health, health information technology (Health IT), and smart medicine

    A series of case studies to enhance the social utility of RSS

    Get PDF
    RSS (really simple syndication, rich site summary or RDF site summary) is a dialect of XML that provides a method of syndicating on-line content, where postings consist of frequently updated news items, blog entries and multimedia. RSS feeds, produced by organisations or individuals, are often aggregated, and delivered to users for consumption via readers. The semi-structured format of RSS also allows the delivery/exchange of machine-readable content between different platforms and systems. Articles on web pages frequently include icons that represent social media services which facilitate social data. Amongst these, RSS feeds deliver data which is typically presented in the journalistic style of headline, story and snapshot(s). Consequently, applications and academic research have employed RSS on this basis. Therefore, within the context of social media, the question arises: can the social function, i.e. utility, of RSS be enhanced by producing from it data which is actionable and effective? This thesis is based upon the hypothesis that the fluctuations in the keyword frequencies present in RSS can be mined to produce actionable and effective data, to enhance the technology's social utility. To this end, we present a series of laboratory-based case studies which demonstrate two novel and logically consistent RSS-mining paradigms. Our first paradigm allows users to define mining rules to mine data from feeds. The second paradigm employs a semi-automated classification of feeds and correlates this with sentiment. We visualise the outputs produced by the case studies for these paradigms, where they can benefit users in real-world scenarios, varying from statistics and trend analysis to mining financial and sporting data. The contributions of this thesis to web engineering and text mining are the demonstration of the proof of concept of our paradigms, through the integration of an array of open-source, third-party products into a coherent and innovative, alpha-version prototype software implemented in a Java JSP/servlet-based web application architecture

    Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features.

    Full text link
    Relational data refers to data that contains explicit relations among objects. Nowadays, relational data are universal and have a broad appeal in many different application domains. The problem of estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP) and Information Retrieval (IR) problems such as clustering, classiffication, word sense disambiguation, etc. Traditional machine learning approaches represent the data using simple, concise representations such as feature vectors. While this works very well for homogeneous data, i.e, data with a single feature type such as text, it does not exploit the availability of dfferent feature types fully. For example, scientic publications have text, citations, authorship information, venue information. Each of the features can be used for estimating similarity. Representing such objects has been a key issue in efficient mining (Getoor and Taskar, 2007). In this thesis, we propose natural representations for relational data using multiple, connected layers of graphs; one for each feature type. Also, we propose novel algorithms for estimating similarity using multiple heterogeneous features. Also, we present novel algorithms for tasks like topic detection and music recommendation using the estimated similarity measure. We demonstrate superior performance of the proposed algorithms (root mean squared error of 24.81 on the Yahoo! KDD Music recommendation data set and classiffication accuracy of 88% on the ACL Anthology Network data set) over many of the state of the art algorithms, such as Latent Semantic Analysis (LSA), Multiple Kernel Learning (MKL) and spectral clustering and baselines on large, standard data sets.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/89824/1/mpradeep_1.pd

    Acta Cybernetica : Volume 18. Number 2.

    Get PDF

    Combining SOA and BPM Technologies for Cross-System Process Automation

    Get PDF
    This paper summarizes the results of an industry case study that introduced a cross-system business process automation solution based on a combination of SOA and BPM standard technologies (i.e., BPMN, BPEL, WSDL). Besides discussing major weaknesses of the existing, custom-built, solution and comparing them against experiences with the developed prototype, the paper presents a course of action for transforming the current solution into the proposed solution. This includes a general approach, consisting of four distinct steps, as well as specific action items that are to be performed for every step. The discussion also covers language and tool support and challenges arising from the transformation
    corecore