2,752 research outputs found
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis
The Market Blended Insight project1 has the objective of improving the UK business to business marketing performance using the semantic web technologies. In this project, we are implementing an ontology driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. It deals with both the semi-structured data and the unstructured text on the web, to annotate and then translate the extracted data according to the backend schema
Topic driven testing
Modern interactive applications offer so many interaction opportunities that automated exploration and testing becomes practically impossible without some domain specific guidance towards relevant functionality. In this dissertation, we present a novel fundamental graphical user interface testing method called topic-driven testing. We mine the semantic meaning of interactive elements, guide testing, and identify core functionality of applications. The semantic interpretation is close to human understanding and allows us to learn specifications and transfer knowledge across multiple applications independent of the underlying device, platform, programming language, or technology stackâto the best of our knowledge a unique feature of our technique. Our tool ATTABOY is able to take an existing Web application test suite say from Amazon, execute it on ebay, and thus guide testing to relevant core functionality. Tested on different application domains such as eCommerce, news pages, mail clients, it can trans- fer on average sixty percent of the tested application behavior to new appsâwithout any human intervention. On top of that, topic-driven testing can go with even more vague instructions of how-to descriptions or use-case descriptions. Given an instruction, say âadd item to shopping cartâ, it tests the specified behavior in an applicationâboth in a browser as well as in mobile apps. It thus improves state-of-the-art UI testing frame- works, creates change resilient UI tests, and lays the foundation for learning, transfer- ring, and enforcing common application behavior. The prototype is up to five times faster than existing random testing frameworks and tests functions that are hard to cover by non-trained approaches.Moderne interaktive Anwendungen bieten so viele InteraktionsmĂśglichkeiten, dass eine vollständige automatische Exploration und das Testen aller Szenarien praktisch unmĂśglich ist. Stattdessen muss die Testprozedur auf relevante Kernfunktionalität ausgerichtet werden. Diese Arbeit stellt ein neues fundamentales Testprinzip genannt thematisches Testen vor, das beliebige Anwendungen u Ěber die graphische Oberfläche testet. Wir untersuchen die semantische Bedeutung von interagierbaren Elementen um die Kernfunktionenen von Anwendungen zu identifizieren und entsprechende Tests zu erzeugen. Statt typischen starren Testinstruktionen orientiert sich diese Art von Tests an menschlichen Anwendungsfällen in natĂźrlicher Sprache. Dies erlaubt es, Software Spezifikationen zu erlernen und Wissen von einer Anwendung auf andere zu Ăźbertragen unabhängig von der Anwendungsart, der Programmiersprache, dem Testgerät oder der -Plattform. Nach unserem Kenntnisstand ist unser Ansatz der Erste dieser Art. Wir präsentieren ATTABOY, ein Programm, das eine existierende Testsammlung fĂźr eine Webanwendung (z.B. fĂźr Amazon) nimmt und in einer beliebigen anderen Anwendung (sagen wir ebay) ausfĂźhrt. Dadurch werden Tests fĂźr Kernfunktionen generiert. Bei der ersten AusfĂźhrung auf Anwendungen aus den Domänen Online Shopping, Nachrichtenseiten und eMail, erzeugt der Prototyp sechzig Prozent der Tests automatisch. Ohne zusätzlichen manuellen Aufwand. DarĂźber hinaus interpretiert themen- getriebenes Testen auch vage Anweisungen beispielsweise von How-to Anleitungen oder Anwendungsbeschreibungen. Eine Anweisung wie "FĂźgen Sie das Produkt in den Warenkorb hinzu" testet das entsprechende Verhalten in der Anwendung. Sowohl im Browser, als auch in einer mobilen Anwendung. Die erzeugten Tests sind robuster und effektiver als vergleichbar erzeugte Tests. Der Prototyp testet die Zielfunktionalität fĂźnf mal schneller und testet dabei Funktionen die durch nicht spezialisierte Ansätze kaum zu erreichen sind
Which one is better: presentation-based or content-based math search?
Mathematical content is a valuable information source and retrieving this
content has become an important issue. This paper compares two searching
strategies for math expressions: presentation-based and content-based
approaches. Presentation-based search uses state-of-the-art math search system
while content-based search uses semantic enrichment of math expressions to
convert math expressions into their content forms and searching is done using
these content-based expressions. By considering the meaning of math
expressions, the quality of search system is improved over presentation-based
systems
Business Ontology for Evaluating Corporate Social Responsibility
This paper presents a software solution that is developed to automatically classify companies by taking into account their level of social responsibility. The application is based on ontologies and on intelligent agents. In order to obtain the data needed to evaluate companies, we developed a web crawling module that analyzes the companyâs website and the documents that are available online such as social responsibility report, mission statement, employment structure, etc. Based on a predefined CSR ontology, the web crawling module extracts the terms that are linked to corporate social responsibility. By taking into account the extracted qualitative data, an intelligent agent, previously trained on a set of companies, computes the qualitative values, which are then included in the classification model based on neural networks. The proposed ontology takes into consideration the guidelines proposed by the âISO 26000 Standard for Social Responsibilityâ. Having this model, and being aware of the positive relationship between Corporate Social Responsibility and financial performance, an overall perspective on each companyâs activity can be configured, this being useful not only to the companyâs creditors, auditors, stockholders, but also to its consumers.corporate social responsibility, ISO 26000 Standard for Social Responsibility, ontology, web crawling, intelligent agent, corporate performance, POS tagging, opinion mining, sentiment analysis
Online Popularity and Topical Interests through the Lens of Instagram
Online socio-technical systems can be studied as proxy of the real world to
investigate human behavior and social interactions at scale. Here we focus on
Instagram, a media-sharing online platform whose popularity has been rising up
to gathering hundred millions users. Instagram exhibits a mixture of features
including social structure, social tagging and media sharing. The network of
social interactions among users models various dynamics including
follower/followee relations and users' communication by means of
posts/comments. Users can upload and tag media such as photos and pictures, and
they can "like" and comment each piece of information on the platform. In this
work we investigate three major aspects on our Instagram dataset: (i) the
structural characteristics of its network of heterogeneous interactions, to
unveil the emergence of self organization and topically-induced community
structure; (ii) the dynamics of content production and consumption, to
understand how global trends and popular users emerge; (iii) the behavior of
users labeling media with tags, to determine how they devote their attention
and to explore the variety of their topical interests. Our analysis provides
clues to understand human behavior dynamics on socio-technical systems,
specifically users and content popularity, the mechanisms of users'
interactions in online environments and how collective trends emerge from
individuals' topical interests.Comment: 11 pages, 11 figures, Proceedings of ACM Hypertext 201
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
Advanced Data Mining Techniques for Compound Objects
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the process of KDD is data mining which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steady growing amount of data caused by the enhanced performance of modern computer systems. However, with the growing amount of data the complexity of data objects increases as well. Modern methods of KDD should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations where each feature representation belongs to a different feature space.
The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. Therefore, the thesis introduces solutions for real-world applications that are based on multi-instance and
multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications.
The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is the data mining in CAD parts, e.g. the use of hierarchic clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares them by using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to efficiently process similarity queries. On the basis of this similarity search system, it is possible to perform several distance based data mining algorithms like the hierarchical clustering algorithm OPTICS to derive product hierarchies.
The second important application is the classification and search for complete websites in the world wide web (WWW). A website is a set of HTML-documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining for websites, the thesis presents several methods to classify websites. After introducing naive methods modelling websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing that maps single HTML-documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites. This crawler minimizes the number of HTML-documents and increases the accuracy of website retrieval.
The second part of the thesis is concerned with the data mining in multi-represented objects. An important example application for this kind of complex objects are proteins that can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method for multi-represented objects is introduced that is based on the density based clustering algorithm DBSCAN. This method uses all representations that are provided to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. proteins. To map new objects into an ontology a new
method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines
D7.1. Criteria for evaluation of resources, technology and integration.
This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large scale resources, evaluation becomes a critical and challenging issue. Critical because it is important to assess the quality of the results that should be delivered to users. Challenging because we prospect rather new areas, and through a technical platform: some new methodologies will have to be explored or old ones to be adapted
- âŚ