11 research outputs found

XML Schema Clustering with Semantic and Hierarchical Similarity Measures

Author: Iryadi Wina
Nayak Richi
Publication venue: Elsevier BV
Publication date: 01/01/2007

With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual information of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

Crossref

Queensland University of Technology ePrints Archive

A Performance Comparison of Data Mining Algorithms Based Intrusion Detection System for Smart Grid

Author: Ghazi Hassan El
Kaabouch Naima
Mrabet Zakaria El
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 31/12/2019

Smart grid is an emerging and promising technology. It uses information technologies to deliver electrical power to customers intelligently, and it allows the integration of green technology to meet environmental requirements. Unfortunately, information technologies have inherent vulnerabilities and weaknesses that expose the smart grid to a wide variety of security risks. The intrusion detection system (IDS) plays an important role in securing smart grid networks and detecting malicious activity, yet it suffers from several limitations. Many research papers have been published to address these issues using several algorithms and techniques. Therefore, a detailed comparison between these algorithms is needed. This paper presents an overview of four data mining algorithms used by IDS in the smart grid. The performance of these algorithms is evaluated on several metrics, including the probability of detection, probability of false alarm, probability of miss detection, efficiency, and processing time. Results show that Random Forest outperforms the other three algorithms in detecting attacks, with a higher probability of detection, lower probability of false alarm, lower probability of miss detection, and higher accuracy. Comment: 6 pages, 6 figures

arXiv.org e-Print Archive

Crossref
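
The detection metrics named in the smart grid IDS abstract above can be computed from confusion-matrix counts. The following is a minimal sketch that assumes the conventional definitions (the abstract does not give the paper's exact formulas); the function name `detection_metrics` and the example counts are hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute common IDS evaluation rates from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    fp: normal traffic flagged as attack tn: normal traffic passed
    """
    attacks = tp + fn   # all truly malicious samples
    normal = fp + tn    # all truly benign samples
    if attacks == 0 or normal == 0:
        raise ValueError("need at least one attack and one normal sample")
    return {
        "p_detection": tp / attacks,    # fraction of attacks detected
        "p_false_alarm": fp / normal,   # fraction of benign traffic flagged
        "p_miss": fn / attacks,         # fraction of attacks missed
        "accuracy": (tp + tn) / (attacks + normal),
    }


# Hypothetical counts, for illustration only (not results from the paper).
m = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Under these definitions the probability of miss detection is the complement of the probability of detection, so a classifier cannot improve one without improving the other.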

Extending ontologies by finding siblings using set expansion techniques

Author: Ashburner
Balog
Bodenreider
Brunzel
Côté
Day-Richter
Doms
Etzioni
Frantzi
Götz Fabian
Hearst
Howe
Kozareva
Lin
Liu
Michael Schroeder
Ogren
Pantel
Paşca M.
Schober
Shi
Shi
Shinzato
Thomas Wächter
Wang
Whetzel
Whetzel
Wächter
Yao
Zhang
Publication venue: Oxford University Press

Motivation: Ontologies are an everyday tool in biomedicine to capture and represent knowledge. However, many ontologies lack a high degree of coverage in their domain and need to improve their overall quality and maturity. Automatically extending sets of existing terms will enable ontology engineers to systematically improve text-based ontologies level by level

Crossref

PubMed Central

Mining XML documents with association rule algorithms

Author: Gürel Görkem
Publication venue: Izmir Institute of Technology
Publication date: 01/01/2008

Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2008. Includes bibliographical references (leaves: 59-63). Text in English; abstract in Turkish and English. x, 63 leaves. With the increasing use of XML technology for data storage and data exchange between applications, mining XML documents has become a more important research topic. In this study, we consider the problem of mining association rules between items in XML documents. The principal purpose of this study is to apply association rule algorithms directly to XML documents using XQuery, a functional expression language that can be used to query or process XML data. We used three different algorithms: Apriori, AprioriTid and High Efficient AprioriTid. We compare the mining times of these three Apriori-like algorithms on XML documents using different support levels, different datasets and different dataset sizes.
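
The level-wise idea behind Apriori, the first of the three algorithms named in the thesis abstract above, can be sketched in a few lines. This is a plain-Python illustration of the general technique only, not the thesis's XQuery implementation; the function name `apriori`, the sample baskets, and the support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions for each surviving candidate.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_support:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
result = apriori(baskets, min_support=2)
```

With these baskets, every single item and every pair meets the threshold, while the three-item set appears in only one basket and is dropped. Association rules would then be derived from the frequent itemsets in a separate step, which is omitted here.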

Acquisition de liens sémantiques à partir d'éléments de mise en forme des textes: exploitation des structures énumératives

Author: Fauconnier Jean-Philippe
Publication venue: HAL CCSD
Publication date: 27/01/2016

The past decade witnessed significant advances in the field of relation extraction from text, facilitating the building of lexical or semantic resources. However, the methods proposed so far (supervised learning, kernel methods, distant supervision, etc.) don't fully exploit the texts: they are usually applied at the sentential level and they don't take into account the layout and the formatting of texts. In this context, this thesis aims at extending those methods and making them layout-aware so that they can extract relations expressed beyond sentence boundaries. For this purpose, we rely on the semantics conveyed by typographical (bullets, emphasis, etc.) and dispositional (visual indentations, carriage returns, etc.) features. Those features often substitute for purely discursive formulations. In particular, the study reported here deals with the relations carried by vertical enumerative structures. Although they display discontinuities between their various components, enumerative structures can be treated as a whole at the semantic level. They form textual structures prone to hierarchical relations. This study is divided into two parts. (i) The first part describes a model representing the hierarchical structure of documents. This model falls within the theoretical framework representing the textual architecture: it abstracts the layout and the formatting and maintains a strong connection with the rhetorical structure. However, our model focuses primarily on the efficiency of the analysis process rather than on the expressiveness of the representation. A bottom-up method for building this model automatically is presented and evaluated on a corpus of PDF documents. (ii) The second part aims at integrating this model into the process of relation extraction. In particular, we focused on vertical enumerative structures. A multidimensional typology for characterizing those structures was established and used in an annotation task. Based on corpus observations, we proposed a two-step supervised learning method for qualifying the nature of the relation and identifying its arguments. The evaluation of our method showed that exploiting the formatting and the layout of documents, in combination with standard lexico-syntactic features, improves those two tasks.

Thèses en Ligne

Scientific Publications of the University of Toulouse II Le Mirail

Thèses en ligne de l'Université Toulouse III - Paul Sabatier

Mining a Small Medical Data Set by Integrating the Decision Tree and t-test

Author: Chang Ming-Yang
Publication venue: Academy Publisher

Although several researchers have used statistical methods to prove that aspiration followed by the injection of 95% ethanol left in situ (retention) is an effective treatment for ovarian endometriomas, very few discuss the different conditions that could generate different recovery rates for the patients. Therefore, this study combines a statistical method with decision tree techniques to analyze the postoperative status of ovarian endometriosis patients under different conditions. Since our collected data set is small, containing only 212 records, we use all of these data as the training data. Therefore, instead of using a resultant tree to generate rules directly, we first use the value of each node as a cut point to generate all possible rules from the tree. Then, once all possible rules have been generated, we verify them with a t-test to discover useful descriptive rules. Experimental results show that our approach can find some new interesting knowledge about recurrent ovarian endometriomas under different conditions.

Tamkang University Institutional Repository

Indexing Heterogeneous XML for Full-Text Search

Author: Lehtonen Miro
Publication venue: Helsingfors universitet
Publication date: 01/01/2006

XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data: data requires a strictly regular, fixed-form structure, whereas the XML structure of text documents varies considerably. As a result, we are able to both reduce the size of the index by 5-6% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which have received little attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text, including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.

CiteSeerX

Helsingin yliopiston digitaalinen arkisto

Knowledge Discovery from XML documents: PAKDD 2006 Workshop Proceedings First International Workshop, KDXD 2006, Singapore, April 9, 2006.

Author: Nayak Richi
Zaki Mohammad
Publication venue: Springer
Publication date: 01/01/2006

The KDXD'06 (Knowledge Discovery from XML Documents) workshop is the first international workshop running this year in conjunction with the PAKDD'06 conference. The workshop provides an important forum for the dissemination and exchange of new ideas and research related to XML data discovery and retrieval. The eXtensible Markup Language (XML) has become a standard language for data representation and exchange. With the continuous growth in XML data sources, the ability to manage collections of XML documents and discover knowledge from them for decision support becomes increasingly important. Due to the inherent flexibility of XML, in both structure and semantics, inferring important knowledge from XML data is faced with new challenges as well as benefits. The objective of the workshop is to bring together researchers and practitioners to discuss all aspects of the emerging XML data management challenges. Thus, the topics of interest included, but were not limited to: XML data mining methods; XML data mining applications; XML data management emerging issues and challenges; XML in improving the knowledge discovery process; and benchmarks and mining performance using XML databases. The workshop received 26 submissions. We would like to thank all those who submitted their work to the workshop under relatively pressing deadlines. We have selected 10 high-quality full papers for discussion and presentation in the workshop and for inclusion in the proceedings after peer review by at least three members of the Program Committee. Accepted papers have been grouped in three sessions and allocated equal presentation time slots. The first session is on XML data mining methods of classification, clustering and association. The second session focuses on XML data reasoning and querying methods and query optimization. The last session is on XML data applications in transportation and security. Special thanks go to the program committee members who shared their expertise and time to make KDXD'06 a success. The final quality of selected papers depends on their efforts. Last but not least, we would like to thank the organizers of PAKDD 2006 for hosting KDXD'06.

Queensland University of Technology ePrints Archive

Knowledge Data Discovery

Author: Melichar Ladislav
Publication venue: Vysoké učení technické v Brně. Fakulta informačních technologií

Data mining is still a little-investigated area. This project first addresses knowledge discovery from structured data in general, and from data in XML format in particular. The tree-mining algorithm HybridTreeMiner is then presented with a view to applying it to knowledge discovery from XML documents. The practical part of the project is dedicated to designing the integration of the algorithm into the mining system developed at FIT. This system is implemented in Java, has a modular structure, and its parts communicate with each other by means of the DMSL language. The results achieved are presented and discussed at the end.

National Repository of Grey Literature