160 research outputs found

    Scalable and Declarative Information Extraction in a Parallel Data Analytics System

    Get PDF
    Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analytics platform that is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and web analytics tasks, with which complex IE tasks can be expressed as declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Second, we survey the state of the art in optimizing non-relational dataflows and show that comprehensive optimization of UDFs remains a challenge. Based on this observation, an extensible logical optimizer (SOFA) is introduced that incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable comprehensive optimization of dataflows. SOFA can logically optimize arbitrary dataflows from different application areas, yielding significant runtime improvements compared with other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated: we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
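
The idea of expressing IE tasks by chaining configurable operators into a declarative dataflow can be illustrated with a minimal sketch. The operator names and the compose() helper below are hypothetical illustrations, not the actual API of the system described in the abstract.

```python
# Sketch: composing configurable IE operators into a small dataflow.
# Each operator takes a record (dict) and returns it enriched, or None
# to drop it; compose() chains operators declaratively.

def tokenize(record):
    record["tokens"] = record["text"].split()
    return record

def annotate_entities(record):
    # Toy "entity" annotator: keep capitalized tokens.
    record["entities"] = [t for t in record["tokens"] if t[:1].isupper()]
    return record

def filter_nonempty(record):
    return record if record["entities"] else None

def compose(*operators):
    """Chain operators into one dataflow; None short-circuits the record."""
    def flow(record):
        for op in operators:
            if record is None:
                return None
            record = op(record)
        return record
    return flow

pipeline = compose(tokenize, annotate_entities, filter_nonempty)
result = pipeline({"text": "Berlin hosts the conference"})
```

Because each operator declares what it does to a record rather than how the whole flow runs, an optimizer like the SOFA described above can, in principle, reorder or fuse such steps based on their properties.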

    Semantic web system for differential diagnosis recommendations

    Get PDF
    There is a growing realization that healthcare is a knowledge-intensive field. The ability to capture and leverage semantics via inference or query processing is crucial for supporting the various required processes in both primary care (e.g. disease diagnosis) and long-term care (e.g. predictive and preventive diagnosis). Given the wide canvas and the relatively frequent knowledge changes that occur in this area, we need to take advantage of the new trends in Semantic Web technologies. In particular, the power of ontologies allows us to share medical research and provide suitable support to physicians' practices. There is also a need to integrate these technologies within currently used healthcare practices. In particular, Semantic Web technologies are in high demand within clinicians' differential diagnosis processes and clinical-pathway disease-management procedures, as well as to aid the predictive/preventive measures used by healthcare professionals.

    Improving Android app security and privacy with developers

    Get PDF
    Existing research has uncovered many security vulnerabilities in Android applications (apps) caused by inexperienced and unmotivated developers. In particular, the lack of tool support makes it hard for developers to avoid common security and privacy problems in Android apps. As a result, apps ship with security vulnerabilities that expose end users to a multitude of attacks. This thesis presents a line of work that studies and supports Android developers in writing more secure code. We first studied to what extent tool support can help developers create more secure applications. To this end, we developed and evaluated an Android Studio extension that identifies common security problems in Android apps and suggests more secure alternatives to developers. Subsequently, we focused on the issue of outdated third-party libraries in apps, which is also the root cause of a variety of security vulnerabilities. We therefore analyzed all popular third-party libraries in the Android ecosystem and gave developers feedback and guidance, in the form of tool support within their development environment, to fix such security problems. In the second part of this thesis, we empirically studied and measured the impact of user reviews on the evolution of app security and privacy. We built a review classifier to identify security- and privacy-related reviews and performed regression analysis to measure their impact on the evolution of security and privacy in Android apps. Based on our results, we propose several suggestions to improve the security and privacy of Android apps by leveraging user feedback to create incentives for developers to improve their apps.
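
A first step toward the kind of review classifier described above is a keyword baseline that flags reviews mentioning security or privacy terms. The term list and threshold below are illustrative assumptions, not the classifier actually built in the thesis.

```python
# Sketch: a toy keyword baseline for flagging security/privacy-related
# app reviews. A real classifier would be trained on labeled reviews;
# this only illustrates the labeling task.

SECURITY_TERMS = {"password", "permission", "privacy", "tracking",
                  "leak", "malware", "secure"}

def is_security_related(review, min_hits=1):
    # Normalize tokens (lowercase, strip trailing punctuation) and count
    # how many fall into the security vocabulary.
    words = {w.strip(".,!?").lower() for w in review.split()}
    return len(words & SECURITY_TERMS) >= min_hits

reviews = [
    "Great app, love the new design!",
    "Why does this app need my location permission?",
    "It keeps tracking me even after I log out",
]
flagged = [r for r in reviews if is_security_related(r)]
```

Once reviews are flagged this way (or by a trained model), their counts per release can serve as the independent variable in the regression analysis the abstract mentions.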

    Large-scale sensor-rich video management and delivery

    Get PDF
    Ph.D. (Doctor of Philosophy)

    Crawling, Collecting, and Condensing News Comments

    Get PDF
    Traditionally, public opinion and policy are informed by surveys and censuses designed to measure what the public thinks about a certain topic. Within the past five years, social networks such as Facebook and Twitter have gained traction for collecting public opinion about current events. Academic research on Facebook data proves difficult since the platform is generally closed. Twitter, on the other hand, restricts the conversation of its users, making it difficult to extract large-scale concepts from the microblogging infrastructure. News comments provide a rich source of discourse from individuals who are passionate about an issue. Furthermore, due to the overhead of commenting, the population of commenters is necessarily biased towards individuals who have either strong opinions on a topic or in-depth knowledge of the given issue. Moreover, their comments are often a collection of insight derived from reading multiple articles on any given topic. Unfortunately, the commenting systems employed by news companies are not implemented by a single entity and are often stored and generated using AJAX, which causes traditional crawlers to ignore them. To make matters worse, they are often noisy, containing spam, poor grammar, and excessive typos. In addition, due to the anonymity of comment systems, conversations can often be derailed by malicious users or inherent biases in the commenters. In this thesis we discuss the design and creation of a crawler for extracting comments from domains across the internet. For practical purposes we create a semi-automatic parser generator and describe how our system employs user feedback to predict which remote procedure calls are used to load comments. By reducing comment systems to remote procedure calls, we simplify the internet into a much simpler space, where we can focus on the data almost independently of its presentation. Thus we are able to quickly create high-fidelity parsers to extract comments from a web page. Once we have our system, we demonstrate its usefulness by extracting meaningful opinions from the large collections we gather. Doing so in real time, however, is shown to foil traditional summarization systems, which are designed to handle dozens of well-formed documents. To solve this problem we create a new algorithm, KLSum+, that outperforms its competitors in efficiency while generally scoring well on the ROUGE-SU4 metric. This algorithm factors in background models to boost accuracy, yet performs over 50 times faster than alternatives. Furthermore, using the summaries we see that the data collected can provide useful insight into public opinion and even surface the key points of discourse.
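
The KL-divergence family of summarizers that KLSum+ builds on can be sketched compactly: greedily add the sentence whose inclusion makes the summary's unigram distribution closest (in KL divergence) to the document's. This is a sketch of the classic KLSum baseline, not the thesis's KLSum+ algorithm; the smoothing constant and length budget are illustrative choices.

```python
# Sketch: greedy KL-divergence extractive summarization (KLSum-style).
import math
from collections import Counter

def unigram_dist(words, vocab, eps=1e-6):
    # Smoothed unigram distribution over a fixed vocabulary.
    counts = Counter(words)
    total = len(words) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def klsum(sentences, budget=2):
    doc_words = [w for s in sentences for w in s.split()]
    vocab = set(doc_words)
    p_doc = unigram_dist(doc_words, vocab)
    summary, chosen = [], set()
    while len(summary) < min(budget, len(sentences)):
        # Pick the unchosen sentence minimizing KL(doc || summary+sentence).
        best = min(
            (i for i in range(len(sentences)) if i not in chosen),
            key=lambda i: kl(p_doc, unigram_dist(
                [w for j in list(chosen) + [i] for w in sentences[j].split()],
                vocab)),
        )
        chosen.add(best)
        summary.append(sentences[best])
    return summary

sents = ["the cat sat on the mat",
         "dogs bark loudly",
         "the cat chased the mouse"]
summary = klsum(sents, budget=2)
```

The quadratic cost of re-scoring every candidate at every step is exactly what makes naive KL summarization slow on large, noisy comment collections, motivating the efficiency focus of KLSum+.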

    Web manifestations of knowledge-based innovation systems in the UK

    Get PDF
    Innovation is widely recognised as essential to the modern economy. The term knowledge-based innovation system has been used to refer to innovation systems which recognise the importance of an economy’s knowledge base and the efficient interactions between important actors from the different sectors of society. Such interactions are thought to enable greater innovation by the system as a whole. Whilst it may not be possible to fully understand all the complex relationships involved within knowledge-based innovation systems, within the field of informetrics, bibliometric methodologies have emerged that allow us to analyse some of the relationships that contribute to the innovation process. However, due to the limitations of traditional bibliometric sources it is important to investigate new potential sources of information. The web is one such source. This thesis documents an investigation into the potential of the web to provide information about knowledge-based innovation systems in the United Kingdom. Within this thesis the link analysis methodologies that have previously been successfully applied to investigations of the academic community (Thelwall, 2004a) are applied to organisations from different sectors of society to determine whether link analysis of the web can provide a new source of information about knowledge-based innovation systems in the UK. This study makes the case that data may be collected ethically to provide information about the interconnections between web sites of various sizes and from different sectors of society; that there are significant differences in the linking practices of web sites within different sectors; and that reciprocal links provide a better indication of collaboration than uni-directional web links. Most importantly, the study shows that the web provides new information about the relationships between organisations, rather than just a repetition of the same information from an alternative source. Whilst the study has shown that there is a lot of potential for the web as a source of information on knowledge-based innovation systems, the same richness that makes it such a potentially useful source makes large-scale studies very labour intensive.EThOS - Electronic Theses Online Service, United Kingdom
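
The distinction between reciprocal and uni-directional links, which the study finds indicative of collaboration, reduces to a simple set operation over observed site-to-site links. The toy link data below is illustrative, not from the study.

```python
# Sketch: separating reciprocal links (A->B and B->A both present,
# suggesting collaboration) from one-way links in inter-site link data.

links = {
    ("uni.ac.uk", "lab.co.uk"),
    ("lab.co.uk", "uni.ac.uk"),
    ("uni.ac.uk", "gov.uk"),
}

# A reciprocal pair appears in both directions; frozenset collapses the
# two directions into a single unordered pair.
reciprocal = {frozenset(pair) for pair in links if pair[::-1] in links}
one_way = {pair for pair in links if pair[::-1] not in links}
```

Counting reciprocal pairs per sector pair is then a natural way to compare linking practices across the academic, government, and commercial sectors the thesis examines.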

    Statistical Extraction of Multilingual Natural Language Patterns for RDF Predicates: Algorithms and Applications

    Get PDF
    The Data Web has undergone a tremendous growth period. It currently consists of more than 3,300 publicly available knowledge bases describing millions of resources from various domains, such as life sciences, government, or geography, with over 89 billion facts. In the same way, the Document Web has grown to a state where approximately 4.55 billion websites exist, 300 million photos are uploaded to Facebook, and 3.5 billion Google searches are performed on average every day. However, there is a gap between the Document Web and the Data Web, since, for example, knowledge bases available on the Data Web are most commonly extracted from structured or semi-structured sources, while the majority of information available on the Web is contained in unstructured sources such as news articles, blog posts, photos, forum discussions, etc. As a result, data on the Data Web not only misses a significant fragment of information but also suffers from a lack of timeliness, since typical extraction methods are time-consuming and can only be carried out periodically. Furthermore, provenance information is rarely taken into consideration and therefore gets lost in the transformation process. In addition, users are accustomed to entering keyword queries to satisfy their information needs. With the availability of machine-readable knowledge bases, lay users could be empowered to issue more specific questions and get more precise answers. In this thesis, we address the problem of relation extraction, one of the key challenges in closing the gap between the Document Web and the Data Web, by four means. First, we present a distant supervision approach for finding multilingual natural language representations of formal relations already contained in the Data Web. We use these natural language representations to find sentences on the Document Web that contain unseen instances of a relation between two entities. Second, we address the problem of data timeliness by presenting a real-time RDF extraction framework for data streams and use this framework to extract RDF from RSS news feeds. Third, we present a novel fact validation algorithm, based on natural language representations, able not only to verify or falsify a given triple, but also to find trustworthy sources for it on the Web and to estimate a time scope in which the triple holds true. The features used by this algorithm to determine whether a website is trustworthy serve as provenance information and thereby help to create metadata for facts in the Data Web. Finally, we present a question answering system that uses the natural language representations to map natural language questions to formal SPARQL queries, allowing lay users to make use of the large amounts of data available on the Data Web to satisfy their information needs.
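
The core of the distant supervision approach can be sketched in a few lines: a known knowledge-base triple labels the sentences its two entities co-occur in, and the text between the entities is kept as a natural language pattern for the relation. The triple and sentences below are toy examples, and real systems handle entity linking, word order, and multilinguality far more carefully.

```python
# Sketch: extracting natural language patterns for a relation via
# distant supervision. A KB triple (subject, relation, object) labels
# sentences mentioning both entities; the span between them becomes a
# candidate pattern for the relation.
import re

def extract_patterns(subject, obj, sentences):
    patterns = []
    for sent in sentences:
        # Lazy match captures the shortest span between the two mentions.
        m = re.search(re.escape(subject) + r"\s+(.+?)\s+" + re.escape(obj),
                      sent)
        if m:
            patterns.append(m.group(1))
    return patterns

# Toy KB triple: (Einstein, birthPlace, Ulm)
sentences = [
    "Einstein was born in Ulm in 1879.",
    "Ulm is a city on the Danube.",
]
patterns = extract_patterns("Einstein", "Ulm", sentences)
```

Patterns harvested this way ("was born in") can then be matched against new sentences to find unseen instances of the same relation, which is the first of the four contributions listed above.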

    Android Application Security Scanning Process

    Get PDF
    This chapter presents the security scanning process for Android applications. The aim is to guide researchers and developers through the core phases/steps required to analyze Android applications, check their trustworthiness, and protect Android users and their devices from falling victim to malware attacks. The scanning process is comprehensive, explaining the main phases and how they are conducted, including (a) the download of the apps themselves; (b) Android application package (APK) reverse engineering; (c) app feature extraction, considering both static and dynamic analysis; (d) dataset creation and/or utilization; and (e) data analysis and data mining that result in detection systems, classification systems, and ranking systems. Furthermore, this chapter highlights the app features, evaluation metrics, mechanisms and tools, and datasets that are frequently used during the app security scanning process.
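
The five phases (a)-(e) form a pipeline, which can be sketched as a skeleton. The function bodies below are placeholders, since the real steps rely on external tools (app-store crawlers, APK decompilers, dynamic-analysis sandboxes) that cannot be shown inline.

```python
# Sketch: skeleton of the five-phase Android app security scanning
# pipeline described above. Each function stands in for a phase.

def download_apps(store_query):        # phase (a): obtain APKs
    return [f"{store_query}-sample.apk"]

def reverse_engineer(apk):             # phase (b): APK -> code/manifest
    return {"apk": apk, "manifest": "<manifest/>", "code": ""}

def extract_features(artifacts):       # phase (c): static + dynamic features
    return {"apk": artifacts["apk"], "permissions": [], "api_calls": []}

def build_dataset(feature_rows):       # phase (d): assemble dataset
    return list(feature_rows)

def analyze(dataset):                  # phase (e): detection/classification
    return {"n_samples": len(dataset)}

apks = download_apps("messaging")
dataset = build_dataset(extract_features(reverse_engineer(a)) for a in apks)
report = analyze(dataset)
```

Structuring the process this way makes each phase independently replaceable, e.g. swapping the static feature extractor without touching the downstream analysis.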