    About the exploration of data mining techniques using structured features for information extraction

    The World Wide Web is a huge source of information, and the amount of information available on it grows every day. It is impossible to handle this amount of information by hand. Special techniques have to be used to deliver smaller, manageable excerpts of information. Unfortunately, these techniques, such as search engines, deliver only a certain view of the information's original appearance. The delivered information is present in various types of files, such as websites, text documents, video clips, audio files and the like. Extracting relevant and interesting pieces of information from these files is very complex and time-consuming. This work analyzes special techniques that allow for the automatic extraction of interesting informational units. Such techniques are based on Machine Learning methods. In contrast to traditional Machine Learning tasks, the processing of text documents in this context requires specific techniques. The structure of natural language contained in text documents poses constraints which should be respected by the Machine Learning method. These constraints, and the specially tuned methods respecting them, are another important aspect of this work. After defining all the Machine Learning formalisms used in this work, I present multiple Machine Learning approaches applicable to the field of Information Extraction. I describe the historical development from the first approaches to Information Extraction through Named Entity Recognition to Relation Extraction. The possibilities of using linguistic resources to create feature sets for Information Extraction are presented. I show how Relation Extraction is formally defined and what kinds of methods are used for Relation Extraction in Machine Learning. 
I focus on Relation Extraction techniques which benefit, on the one hand, from minimum optimization and, on the other hand, from efficient data structures. Most of the experiments and implementations described in this work were done using RapidMiner, an open-source framework for Data Mining. To apply this framework to Information Extraction tasks, I developed an extension called the Information Extraction Plugin, which is described exhaustively. Finally, I present applications which explicitly benefit from the combination of Data Mining and Information Extraction.

    Well-Formed and Scalable Invasive Software Composition

    Software components provide essential means to structure and organize software effectively. Frequently, however, the required component abstractions are not available in a programming language or system, or cannot be adequately combined with each other. Invasive software composition (ISC) is a general approach to software composition that unifies component-like abstractions such as templates, aspects and macros. ISC is based on fragment composition, and composes programs and other software artifacts at the level of syntax trees. To this end, a unifying fragment component model is related to the context-free grammar of a language to identify extension and variation points in syntax trees as well as valid component types. Fragment components can then be composed by transformations at the respective extension and variation points, so that composition results are always valid with respect to the underlying context-free grammar. However, a composition result that conforms to a language’s context-free grammar may still be incorrect. Context-sensitive constraints such as type constraints may be violated so that the program cannot be compiled and/or interpreted correctly. While a compiler can detect such errors after composition, it is difficult to relate them back to the original transformation step in the composition system, especially in the case of complex compositions with several hundred such steps. To tackle this problem, this thesis proposes well-formed ISC, an extension to ISC that uses reference attribute grammars (RAGs) to specify fragment component models and fragment contracts to guard compositions with context-sensitive constraints. Additionally, well-formed ISC provides composition strategies as a means to configure composition algorithms and handle interferences between composition steps. Developing ISC systems for complex languages such as programming languages is a complex undertaking. 
Composition-system developers need to supply or develop adequate language and parser specifications that can be processed by an ISC composition engine. Moreover, the specifications may need to be extended with rules for the intended composition abstractions. Current approaches to ISC require complete grammars to be able to compose fragments in the respective languages. Hence, the specifications need to be developed exhaustively before any component model can be supplied. To tackle this problem, this thesis introduces scalable ISC, a variant of ISC that uses island component models as a means to define component models for partially specified languages while still supporting the whole language. Additionally, a scalable workflow for agile composition-system development is proposed, which supports the development of ISC systems in small increments using modular extensions. All theoretical concepts introduced in this thesis are implemented in the Skeletons and Application Templates framework SkAT. It supports “classic”, well-formed and scalable ISC by leveraging RAGs as its main specification and implementation language. Moreover, several composition systems based on SkAT are discussed, e.g., a well-formed composition system for Java and a C preprocessor-like macro language. In turn, those composition systems are used as composers in several example applications such as a library of parallel algorithmic skeletons.

    Proceedings of the Third Symposium on Programming Languages and Software Tools : Kääriku, Estonia, August 23-24 1993

    Report of the EAGLES Workshop on Implemented Formalisms at DFKI, Saarbrücken

    Case Retrieval Nets as a Model for Building Flexible Information Systems

    In this thesis, a specific memory structure is presented that has been developed for the retrieval task in Case-Based Reasoning systems, namely Case Retrieval Nets (CRNs). This model borrows from associative memories in that it suggests interpreting case retrieval as a process of reconstructing a stored case rather than searching for it in the traditional sense. Two major advantages of this model are efficiency and flexibility. Efficiency, on the one hand, is concerned with the ability to handle large case bases and still deliver retrieval results reasonably fast. This thesis includes a formal investigation of efficiency, but the main focus is on a more pragmatic view: retrieval should ideally be fast enough that users of a related system notice no delay. Flexibility, on the other hand, relates to the general applicability of a case memory to different types of tasks, different case representations, etc. For this, the concept of information completion is discussed, which makes it possible to capture the interactive nature of problem-solving methods, in particular when they are applied within a decision support system. Information completion thus covers more specific problem-solving types, such as classification and diagnosis. The formal model of CRNs is presented in detail and its properties are investigated. After that, some possible extensions are described. Besides these more theoretical aspects, a further focus is on applications that have been developed on the basis of the CRN model. Roughly speaking, two areas of application can be distinguished: electronic commerce applications, for which Case-Based Reasoning may provide intelligent sales support, and knowledge management based on textual documents, where the reuse of problem-solving knowledge plays a crucial role. For each of these areas, a single application is described in full detail and further case studies are briefly presented for illustration. Prior to the details of the applications, a more general framework is presented describing the design and implementation of an information system that makes use of the CRN model.
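
The abstract does not spell out the network structure, so the following is only a minimal sketch of retrieval as activation spreading rather than search. It assumes the formulation commonly associated with CRNs: nodes for information entities (IEs, e.g. attribute-value pairs), similarity arcs between IEs, and relevance arcs from IEs to cases. The function name `retrieve`, the dictionary layout, and all weights are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def retrieve(query_ies, similarity, relevance, top_k=3):
    """Rank cases by activation spread from the query's information entities.

    similarity: {ie: {neighbour_ie: weight}}  (assumed similarity arcs)
    relevance:  {ie: {case_id: weight}}       (assumed relevance arcs)
    """
    # Step 1: activate the IEs named in the query, then spread that
    # activation once along the similarity arcs.
    ie_activation = defaultdict(float)
    for ie in query_ies:
        ie_activation[ie] = max(ie_activation[ie], 1.0)
        for neighbour, sim in similarity.get(ie, {}).items():
            ie_activation[neighbour] = max(ie_activation[neighbour], sim)

    # Step 2: collect activation in the case nodes via the relevance arcs.
    case_activation = defaultdict(float)
    for ie, activation in ie_activation.items():
        for case_id, weight in relevance.get(ie, {}).items():
            case_activation[case_id] += activation * weight

    # Step 3: the most strongly activated cases are the retrieval result.
    ranked = sorted(case_activation.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy case base for a sales-support setting.
similarity = {"price:low": {"price:medium": 0.6}}
relevance = {
    "price:low": {"case_a": 1.0},
    "price:medium": {"case_b": 1.0},
    "region:coast": {"case_a": 1.0, "case_c": 1.0},
}
result = retrieve(["price:low", "region:coast"], similarity, relevance)
```

Because only arcs reachable from the query are visited, cases that share nothing with the query are never touched, which is one way to read the efficiency claim above. A partially specified query still yields a ranking, in line with the idea of information completion.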

    Intensional Cyberforensics

    This work focuses on the application of intensional logic to cyberforensic analysis and compares its benefits and difficulties with those of the finite-state-automata approach. It extends the use of the intensional programming paradigm to the modeling and implementation of a cyberforensics investigation process with backtracing of event reconstruction, in which evidence is modeled by multidimensional hierarchical contexts, and proofs or disproofs of claims are undertaken in an eductive manner of evaluation. This approach is a practical, context-aware improvement over the finite state automata (FSA) approach we have seen in previous work. As a base implementation language model, we use a new dialect of the Lucid programming language, called Forensic Lucid, and we focus on defining hierarchical contexts based on intensional logic for the distributed evaluation of cyberforensic expressions. We also augment the work with credibility factors surrounding digital evidence and witness accounts, which have not been previously modeled. The Forensic Lucid programming language, used for this intensional cyberforensic analysis, is formally presented through its syntax and operational semantics. In large part, the language is based on its predecessor and codecessor Lucid dialects, such as GIPL, Indexical Lucid, Lucx, Objective Lucid, and JOOIP, bound by the underlying intensional programming paradigm. Comment: 412 pages, 94 figures, 18 tables, 19 algorithms and listings; PhD thesis; v2 corrects some typos and refs; also available on Spectrum at http://spectrum.library.concordia.ca/977460

    Advanced Threat Intelligence: Interpretation of Anomalous Behavior in Ubiquitous Kernel Processes

    Targeted attacks on digital infrastructures are a rising threat against the confidentiality, integrity, and availability of both IT systems and sensitive data. With the emergence of advanced persistent threats (APTs), identifying and understanding such attacks has become an increasingly difficult task. Current signature-based systems are heavily reliant on fixed patterns that struggle with unknown or evasive applications, while behavior-based solutions usually leave most of the interpretative work to a human analyst. This thesis presents a multi-stage system able to detect and classify anomalous behavior within a user session by observing and analyzing ubiquitous kernel processes. Application candidates suitable for monitoring are initially selected through an adapted sentiment mining process using a score based on the log likelihood ratio (LLR). For transparent anomaly detection within a corpus of associated events, the author utilizes star structures, a bipartite representation designed to approximate the edit distance between graphs. Templates describing nominal behavior are generated automatically and are used for the computation of both an anomaly score and a report containing all deviating events. The extracted anomalies are classified using the Random Forest (RF) and Support Vector Machine (SVM) algorithms. Ultimately, the newly labeled patterns are mapped to a dedicated APT attacker–defender model that considers objectives, actions, actors, as well as assets, thereby bridging the gap between attack indicators and detailed threat semantics. This enables both risk assessment and decision support for mitigating targeted attacks. Results show that the prototype system is capable of identifying 99.8% of all star structure anomalies as benign or malicious. In multi-class scenarios that seek to associate each anomaly with a distinct attack pattern belonging to a particular APT stage we achieve a solid accuracy of 95.7%. 
Furthermore, we demonstrate that 88.3% of observed attacks could be identified by analyzing and classifying a single ubiquitous Windows process for a mere 10 seconds, thereby eliminating the necessity to monitor each and every (unknown) application running on a system. With its semantic take on threat detection and classification, the proposed system offers a formal as well as technical solution to an information security challenge of great significance. The financial support by the Christian Doppler Research Association, the Austrian Federal Ministry for Digital and Economic Affairs, and the National Foundation for Research, Technology and Development is gratefully acknowledged.
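
The abstract names a score based on the log likelihood ratio (LLR) but does not give its formula. As a rough illustration only, the sketch below computes the standard G² log-likelihood ratio statistic for a 2x2 contingency table of counts. How the thesis adapts the score, and what exactly is counted, is not stated in the abstract; the function name and the reading of the four cells are assumptions.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of non-negative counts.

    Assumed reading of the cells (hypothetical):
      k11: sessions of class A containing the candidate
      k12: sessions of class A without it
      k21: sessions of class B containing the candidate
      k22: sessions of class B without it
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# A candidate that occurs equally often in both classes scores 0;
# one that occurs in only one class scores high.
uninformative = log_likelihood_ratio(10, 10, 10, 10)
discriminative = log_likelihood_ratio(20, 0, 0, 20)
```

Higher values indicate that the candidate's occurrence depends more strongly on the class, which is the property a selection step of this kind would rank by.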

    Learning Ontology Relations by Combining Corpus-Based Techniques and Reasoning on Data from Semantic Web Sources

    The manual construction of formal domain conceptualizations (ontologies) is labor-intensive. Ontology learning, by contrast, provides (semi-)automatic ontology generation from input data such as domain text. This thesis proposes a novel approach for learning labels of non-taxonomic ontology relations. It combines corpus-based techniques with reasoning on Semantic Web data. Corpus-based methods apply vector space similarity of verbs co-occurring with labeled and unlabeled relations to calculate relation label suggestions from a set of candidates. A meta ontology in combination with Semantic Web sources such as DBpedia and OpenCyc allows reasoning to improve the suggested labels. An extensive formal evaluation demonstrates the superior accuracy of the presented hybrid approach.
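
The corpus-based step described above can be pictured with a small sketch. It assumes that each relation is represented by a bag of the verbs that co-occur with it, and that cosine similarity is the vector space measure; the abstract does not confirm either choice. The function names, labels, and verb lists are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse verb-count vectors."""
    dot = sum(a[verb] * b[verb] for verb in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_labels(unlabeled_verbs, labeled_verbs):
    """Rank candidate labels by how similar their verb vectors are
    to the verb vector of the unlabeled relation."""
    target = Counter(unlabeled_verbs)
    scores = {
        label: cosine(target, Counter(verbs))
        for label, verbs in labeled_verbs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical verbs observed with already labeled relations ...
labeled = {
    "produces": ["emit", "release", "generate", "emit"],
    "regulates": ["limit", "control", "restrict"],
}
# ... and with the relation that still needs a label.
ranked = suggest_labels(["emit", "generate", "control"], labeled)
```

In the approach described in the abstract, a ranking of this kind would only be the first stage; reasoning over a meta ontology and Semantic Web sources is then used to improve the suggestions.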