
    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
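    A minimal sketch of the kind of microdata extraction the abstract points to, using BeautifulSoup on invented schema.org BlogPosting markup (illustrative only, not the BlogForever implementation):

    from bs4 import BeautifulSoup

    html = """
    <article itemscope itemtype="http://schema.org/BlogPosting">
      <h1 itemprop="headline">Post title</h1>
      <time itemprop="datePublished" datetime="2012-05-01">1 May 2012</time>
      <div itemprop="articleBody">Post content ...</div>
    </article>
    """

    soup = BeautifulSoup(html, "html.parser")
    for scope in soup.find_all(itemscope=True):
        item = {"type": scope.get("itemtype")}
        for prop in scope.find_all(itemprop=True):
            # prefer machine-readable attributes over the visible text
            value = prop.get("datetime") or prop.get("content") or prop.get_text(strip=True)
            item[prop["itemprop"]] = value
        print(item)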

    Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis

    No full text
    The Market Blended Insight project has the objective of improving UK business-to-business marketing performance using semantic web technologies. In this project, we are implementing an ontology-driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. It deals with both semi-structured data and unstructured text on the web, to annotate and then translate the extracted data according to the backend schema.
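    As a rough illustration of the "translate to the backend schema" step, the sketch below maps one extracted record into RDF triples with rdflib; the MBI namespace and the record are invented stand-ins for the project's actual schema and data:

    from rdflib import Graph, Literal, Namespace, RDF

    MBI = Namespace("http://example.org/mbi/")  # invented stand-in for the backend schema
    g = Graph()

    # a record extracted from a semi-structured company page (invented example)
    record = {"name": "Acme Ltd", "sector": "Manufacturing", "location": "Sheffield"}

    company = MBI["company/acme-ltd"]
    g.add((company, RDF.type, MBI.Company))
    g.add((company, MBI.name, Literal(record["name"])))
    g.add((company, MBI.sector, Literal(record["sector"])))
    g.add((company, MBI.locatedIn, Literal(record["location"])))

    print(g.serialize(format="turtle"))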

    Topic driven testing

    Get PDF
    Modern interactive applications offer so many interaction opportunities that automated exploration and testing become practically impossible without some domain-specific guidance towards relevant functionality. In this dissertation, we present a novel fundamental graphical user interface testing method called topic-driven testing. We mine the semantic meaning of interactive elements, guide testing, and identify core functionality of applications. The semantic interpretation is close to human understanding and allows us to learn specifications and transfer knowledge across multiple applications independent of the underlying device, platform, programming language, or technology stack; to the best of our knowledge, this is a unique feature of our technique. Our tool ATTABOY is able to take an existing web application test suite, say from Amazon, execute it on eBay, and thus guide testing to relevant core functionality. Tested on different application domains such as eCommerce, news pages, and mail clients, it can transfer on average sixty percent of the tested application behavior to new apps, without any human intervention. On top of that, topic-driven testing can work with even vaguer instructions such as how-to descriptions or use-case descriptions. Given an instruction, say "add item to shopping cart", it tests the specified behavior in an application, both in a browser and in mobile apps. It thus improves state-of-the-art UI testing frameworks, creates change-resilient UI tests, and lays the foundation for learning, transferring, and enforcing common application behavior. The prototype is up to five times faster than existing random testing frameworks and tests functions that are hard to cover by non-trained approaches.
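    The core matching idea can be illustrated with a toy example: score each interactive element against the instruction and pick the best match. The sketch below uses crude lexical overlap where the dissertation uses a learned semantic model, and the element labels are invented; it is not the ATTABOY implementation:

    def tokens(text):
        return set(text.lower().split())

    def similarity(a, b):
        # Jaccard overlap as a crude stand-in for semantic similarity
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    instruction = "add item to shopping cart"
    ui_elements = ["Add to Cart", "Proceed to Checkout", "Sign in", "View basket"]

    best = max(ui_elements, key=lambda label: similarity(instruction, label))
    print(best)  # -> "Add to Cart"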

    Which one is better: presentation-based or content-based math search?

    Full text link
    Mathematical content is a valuable information source, and retrieving this content has become an important issue. This paper compares two search strategies for math expressions: presentation-based and content-based approaches. Presentation-based search uses a state-of-the-art math search system, while content-based search uses semantic enrichment of math expressions to convert them into their content forms, with searching done over these content-based expressions. By considering the meaning of math expressions, the quality of the search system is improved over presentation-based systems.
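    SymPy can stand in for the semantic-enrichment step: two presentationally different inputs parse into the same content-level tree, so matching on content ignores surface form (illustrative only; the paper works on MathML rather than SymPy):

    from sympy import sympify, srepr

    # two presentationally different inputs with identical mathematical content
    a = sympify("x**2 + 2*x + 1")
    b = sympify("1 + 2*x + x**2")

    print(srepr(a))  # content-level tree: Add(Pow(Symbol('x'), Integer(2)), ...)
    print(a == b)    # True: the content form abstracts away presentation order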

    Business Ontology for Evaluating Corporate Social Responsibility

    Get PDF
    This paper presents a software solution that is developed to automatically classify companies by taking into account their level of social responsibility. The application is based on ontologies and on intelligent agents. In order to obtain the data needed to evaluate companies, we developed a web crawling module that analyzes the company’s website and the documents that are available online, such as the social responsibility report, mission statement, employment structure, etc. Based on a predefined CSR ontology, the web crawling module extracts the terms that are linked to corporate social responsibility. By taking into account the extracted qualitative data, an intelligent agent, previously trained on a set of companies, computes the qualitative values, which are then included in the classification model based on neural networks. The proposed ontology takes into consideration the guidelines proposed by the “ISO 26000 Standard for Social Responsibility”. Having this model, and being aware of the positive relationship between Corporate Social Responsibility and financial performance, an overall perspective on each company’s activity can be configured, this being useful not only to the company’s creditors, auditors, and stockholders, but also to its consumers.
    Keywords: corporate social responsibility, ISO 26000 Standard for Social Responsibility, ontology, web crawling, intelligent agent, corporate performance, POS tagging, opinion mining, sentiment analysis.
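    A toy version of the term-extraction step might look as follows; the mini-ontology and page text are invented, and the real module works against a full ISO 26000-based CSR ontology rather than a hand-written dictionary:

    # invented mini-ontology of CSR indicator terms, grouped by dimension
    CSR_TERMS = {
        "environment": ["emissions", "recycling", "renewable energy"],
        "labour": ["working conditions", "diversity", "employee training"],
        "community": ["donation", "volunteering", "local community"],
    }

    def score_page(text):
        text = text.lower()
        return {dim: sum(term in text for term in terms)
                for dim, terms in CSR_TERMS.items()}

    page = ("Our renewable energy programme cut emissions by 20% "
            "while employee training was expanded.")
    print(score_page(page))  # {'environment': 2, 'labour': 1, 'community': 0}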

    Online Popularity and Topical Interests through the Lens of Instagram

    Full text link
    Online socio-technical systems can be studied as proxies of the real world to investigate human behavior and social interactions at scale. Here we focus on Instagram, a media-sharing online platform whose popularity has risen to hundreds of millions of users. Instagram exhibits a mixture of features including social structure, social tagging and media sharing. The network of social interactions among users models various dynamics, including follower/followee relations and users' communication by means of posts/comments. Users can upload and tag media such as photos and pictures, and they can "like" and comment on each piece of information on the platform. In this work we investigate three major aspects of our Instagram dataset: (i) the structural characteristics of its network of heterogeneous interactions, to unveil the emergence of self-organization and topically-induced community structure; (ii) the dynamics of content production and consumption, to understand how global trends and popular users emerge; (iii) the behavior of users labeling media with tags, to determine how they devote their attention and to explore the variety of their topical interests. Our analysis provides clues to understand human behavior dynamics in socio-technical systems, specifically users and content popularity, the mechanisms of users' interactions in online environments, and how collective trends emerge from individuals' topical interests.
    Comment: 11 pages, 11 figures, Proceedings of ACM Hypertext 201
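    The first analysis, community structure in the interaction network, can be sketched with networkx on an invented toy graph; the paper's network is far larger and heterogeneous, so this only illustrates the operation:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # invented follower/comment interactions; nodes are users
    G = nx.Graph()
    G.add_edges_from([
        ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),  # one tightly knit group
        ("dan", "eve"), ("dan", "fay"), ("eve", "fay"),  # another
        ("cat", "dan"),                                  # a weak bridge between them
    ])

    for community in greedy_modularity_communities(G):
        print(sorted(community))  # two communities, split at the bridge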

    From Frequency to Meaning: Vector Space Models of Semantics

    Full text link
    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
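    A minimal instance of the first class the survey describes, a term-document matrix with cosine similarity between documents (scikit-learn, invented toy corpus):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck",
    ]

    X = CountVectorizer().fit_transform(docs)  # rows: documents, columns: terms
    print(cosine_similarity(X))                # pairwise document similarity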

    Advanced Data Mining Techniques for Compound Objects

    Get PDF
    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the KDD process is data mining, which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steadily growing amount of data caused by the enhanced performance of modern computer systems. However, with the growing amount of data, the complexity of data objects increases as well. Modern methods of KDD should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations, where each feature representation belongs to a different feature space.
    The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. The thesis therefore introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications.
    The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is data mining in CAD parts, e.g. the use of hierarchical clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares them by using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to efficiently process similarity queries. On the basis of this similarity search system, it is possible to run several distance-based data mining algorithms, like the hierarchical clustering algorithm OPTICS, to derive product hierarchies. The second important application is the classification of and search for complete websites in the World Wide Web (WWW). A website is a set of HTML documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining on websites, the thesis presents several methods to classify websites. After introducing naive methods that model websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing step that maps single HTML documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest-neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites. This crawler minimizes the number of retrieved HTML documents and increases the accuracy of website retrieval.
    The second part of the thesis is concerned with data mining in multi-represented objects. An important example application for this kind of complex object are proteins, which can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method for multi-represented objects is introduced that is based on the density-based clustering algorithm DBSCAN. This method uses all provided representations to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. for proteins. To map new objects into such an ontology, a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines.
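    To make the multi-instance idea concrete, the sketch below compares two objects that are each a set of feature vectors, using the sum-of-minimum-distances measure, one common multi-instance distance (the vectors are invented, and this is not necessarily the exact metric of the thesis):

    import math

    def smd(A, B):
        """Sum of minimum distances between two instance sets."""
        d_ab = sum(min(math.dist(a, b) for b in B) for a in A) / len(A)
        d_ba = sum(min(math.dist(b, a) for a in A) for b in B) / len(B)
        return 0.5 * (d_ab + d_ba)

    part1 = [(0.0, 0.0), (1.0, 0.0)]  # invented feature vectors decomposed from one CAD part
    part2 = [(0.1, 0.0), (0.9, 0.1)]
    print(smd(part1, part2))  # small value: the instance sets nearly match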

    D7.1. Criteria for evaluation of resources, technology and integration.

    Get PDF
    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large-scale resources, evaluation becomes a critical and challenging issue. Critical, because it is important to assess the quality of the results that should be delivered to users. Challenging, because we are prospecting rather new areas, and through a technical platform: some new methodologies will have to be explored, or old ones adapted.