11 research outputs found

    XML Schema Clustering with Semantic and Hierarchical Similarity Measures

    With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistics and context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

    A Performance Comparison of Data Mining Algorithms Based Intrusion Detection System for Smart Grid

    Smart grid is an emerging and promising technology. It uses the power of information technologies to intelligently deliver electrical power to customers, and it allows the integration of green technology to meet environmental requirements. Unfortunately, information technologies have inherent vulnerabilities and weaknesses that expose the smart grid to a wide variety of security risks. The intrusion detection system (IDS) plays an important role in securing smart grid networks and detecting malicious activity, yet it suffers from several limitations. Many research papers have been published to address these issues using several algorithms and techniques, so a detailed comparison between these algorithms is needed. This paper presents an overview of four data mining algorithms used by IDS in smart grid. The performance of these algorithms is evaluated on several metrics, including the probability of detection, probability of false alarm, probability of miss detection, efficiency, and processing time. Results show that Random Forest outperforms the other three algorithms in detecting attacks, with a higher probability of detection, lower probability of false alarm, lower probability of miss detection, and higher accuracy. Comment: 6 pages, 6 figures.
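The abstract does not give formulas for its metrics. The sketch below assumes the usual confusion-matrix definitions (detection as true-positive rate, false alarm as false-positive rate, miss as false-negative rate); the class and function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # attacks correctly flagged as attacks
    fn: int  # attacks that were missed
    fp: int  # normal traffic wrongly flagged as attacks
    tn: int  # normal traffic correctly passed


def detection_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute IDS metrics under the assumed standard definitions."""
    attacks = c.tp + c.fn
    normals = c.fp + c.tn
    return {
        "p_detection": c.tp / attacks,
        "p_false_alarm": c.fp / normals,
        "p_miss": c.fn / attacks,
        "accuracy": (c.tp + c.tn) / (attacks + normals),
    }


# Illustrative counts only; these are not results from the paper.
example = detection_metrics(ConfusionCounts(tp=90, fn=10, fp=5, tn=95))
```

Under these definitions the probability of miss detection is always one minus the probability of detection, so the two carry the same information.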

    Extending ontologies by finding siblings using set expansion techniques

    Motivation: Ontologies are an everyday tool in biomedicine to capture and represent knowledge. However, many ontologies lack a high degree of coverage in their domain and need to improve their overall quality and maturity. Automatically extending sets of existing terms will enable ontology engineers to systematically improve text-based ontologies level by level

    Mining XML documents with association rule algorithms

    Thesis (Master), Izmir Institute of Technology, Computer Engineering, Izmir, 2008. Includes bibliographical references (leaves 59-63). Text in English; abstract in Turkish and English. x, 63 leaves. Following the increasing use of XML technology for data storage and data exchange between applications, mining XML documents has become a more actively researched and more important topic. In this study, we consider the problem of mining association rules between items in XML documents. The principal purpose of the study is to apply association rule algorithms directly to XML documents using XQuery, a functional expression language that can be used to query or process XML data. We used three different algorithms: Apriori, AprioriTid, and High Efficient AprioriTid. We compare the mining times of these three Apriori-like algorithms on XML documents using different support levels, different datasets, and different dataset sizes.
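For readers unfamiliar with Apriori, here is a minimal sketch of its frequent-itemset phase. It is plain Python over in-memory transactions, not the XQuery implementation the thesis describes. The function name and the use of an absolute support count are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is an absolute count of transactions (an assumption here;
    support is often given as a fraction instead).
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates are the single items.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its k-subsets were frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(baskets, min_support=3)
```

Association rules are then derived from the frequent itemsets by comparing the support of an itemset with the support of its subsets. That step is omitted here.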

    Acquisition de liens sémantiques à partir d'éléments de mise en forme des textes: exploitation des structures énumératives

    The past decade witnessed significant advances in the field of relation extraction from text, facilitating the building of lexical or semantic resources. However, the methods proposed so far (supervised learning, kernel methods, distant supervision, etc.) do not fully exploit the texts: they are usually applied at the sentence level and do not take into account the layout and formatting of texts. In this context, the thesis aims to extend those methods and make them layout-aware, so that they can extract relations expressed beyond sentence boundaries. For this purpose, we rely on the semantics conveyed by typographical features (bullets, emphasis, etc.) and dispositional features (visual indentation, carriage returns, etc.). Those features often substitute for purely discursive formulations. In particular, the study deals with the relations carried by vertical enumerative structures. Although they display discontinuities between their components, enumerative structures can be treated as a whole at the semantic level, and they form textual structures prone to hierarchical relations. The study is divided into two parts. (i) The first part describes a model representing the hierarchical structure of documents. The model falls within the theoretical framework of textual architecture: it abstracts the layout and formatting and connects strongly with the rhetorical structure. However, our model focuses primarily on the efficiency of the analysis process rather than on the expressiveness of the representation. A bottom-up method for building this model automatically is presented and evaluated on a corpus of PDF documents. (ii) The second part integrates this model into the relation extraction process, focusing on vertical enumerative structures. A multidimensional typology for characterizing those structures was established and used in an annotation task. Based on corpus observations, we propose a two-step supervised learning method that first qualifies the nature of the relation and then identifies its arguments. The evaluation shows that exploiting the formatting and layout of documents, in combination with standard lexico-syntactic features, improves both tasks.

    Mining a Small Medical Data Set by Integrating the Decision Tree and t-test

    Although several researchers have used statistical methods to show that aspiration followed by the injection of 95% ethanol left in situ (retention) is an effective treatment for ovarian endometriomas, very few discuss the different conditions that could generate different recovery rates for patients. This study therefore combines statistical methods with decision tree techniques to analyze the postoperative status of ovarian endometriosis patients under different conditions. Since our collected data set is small, containing only 212 records, we use all of the data as training data. Consequently, instead of using the resultant tree to generate rules directly, we first use the value of each node as a cut point to generate all possible rules from the tree. We then verify the rules with a t-test to discover useful descriptive rules. Experimental results show that our approach can find new and interesting knowledge about recurrent ovarian endometriomas under different conditions.
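The abstract does not say which t-test variant is used or how rules are encoded. The sketch below assumes a rule is a single numeric cut point that splits the records into two groups, and compares the groups' outcomes with Welch's t statistic. All names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def split_by_cut_point(records, cut):
    """Split (feature, outcome) pairs into outcomes at or below / above the cut."""
    left = [y for x, y in records if x <= cut]
    right = [y for x, y in records if x > cut]
    return left, right


def welch_t(a, b):
    """Welch's t statistic for two samples (assumed variant, unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Made-up (feature, outcome) pairs for illustration only.
records = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0),
           (6, 2.0), (7, 3.0), (8, 4.0), (9, 5.0), (10, 6.0)]
left, right = split_by_cut_point(records, cut=5)
t_stat = welch_t(left, right)
```

A candidate rule would be kept only when the statistic is significant at a chosen level. Computing the p-value needs the t distribution, which is outside the standard library and is left out of this sketch.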

    Indexing Heterogeneous XML for Full-Text Search

    XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. We therefore focus on content stored in XML format as we develop such indexing methods. Because XML is used for many kinds of content, ranging from records of data fields to narrative full-text, information retrieval methods face a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation between character content and XML tags in XML documents in order to separate full-text from data. As a result, we are able both to reduce the size of the index by 5-6% and to improve retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities that have received little attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but they can also be confused with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content, including titles, captions, and references, is associated with the indexed full-text. Experimental results show that, at least for certain types of document collections, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure beyond the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
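The abstract describes separating full-text from data by relating character content to XML tags, but it does not define the measure. The sketch below assumes a deliberately simple stand-in: average characters of text per element, with an arbitrary threshold. The function names and the threshold value are hypothetical and are not the thesis's method.

```python
import xml.etree.ElementTree as ET


def text_per_element(fragment: ET.Element) -> float:
    """Average number of text characters per element in the fragment."""
    # Collapse runs of whitespace so pretty-printing does not inflate the count.
    text = " ".join("".join(fragment.itertext()).split())
    n_elements = len(list(fragment.iter()))  # includes the fragment root
    return len(text) / n_elements


def looks_like_full_text(fragment: ET.Element, threshold: float = 40.0) -> bool:
    """Guess whether a fragment is narrative text (True) or record-like data."""
    return text_per_element(fragment) >= threshold


record = ET.fromstring("<row><id>17</id><qty>3</qty><price>9.50</price></row>")
prose = ET.fromstring(
    "<p>Indexing methods keep large document collections accessible, and "
    "<em>emphasised phrases</em> often describe the content well.</p>"
)
```

With this heuristic, densely tagged fragments with short values score low and prose with occasional inline markup scores high. A real system would need a measure validated on the target collection.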

    Knowledge Discovery from XML documents: PAKDD 2006 Workshop Proceedings First International Workshop, KDXD 2006, Singapore, April 9, 2006.

    The KDXD'06 (Knowledge Discovery from XML Documents) workshop is the first international workshop running this year in conjunction with the PAKDD'06 conference. The workshop provides an important forum for the dissemination and exchange of new ideas and research related to XML data discovery and retrieval. The eXtensible Markup Language (XML) has become a standard language for data representation and exchange. With the continuous growth in XML data sources, the ability to manage collections of XML documents and discover knowledge from them for decision support becomes increasingly important. Due to the inherent flexibility of XML, in both structure and semantics, inferring important knowledge from XML data brings new challenges as well as benefits. The objective of the workshop is to bring together researchers and practitioners to discuss all aspects of the emerging XML data management challenges. The topics of interest included, but were not limited to: XML data mining methods; XML data mining applications; emerging issues and challenges in XML data management; XML in improving the knowledge discovery process; and benchmarks and mining performance using XML databases. The workshop received 26 submissions. We would like to thank all those who submitted their work under relatively tight deadlines. After peer review by at least three members of the Program Committee, we selected 10 high-quality full papers for discussion and presentation at the workshop and for inclusion in the proceedings. Accepted papers have been grouped into three sessions and allocated equal presentation time slots. The first session is on XML data mining methods of classification, clustering, and association. The second session focuses on XML data reasoning and querying methods, including query optimization. The last session is on XML data applications in transportation and security. Special thanks go to the program committee members who shared their expertise and time to make KDXD'06 a success. The final quality of the selected papers depends on their efforts. Last but not least, we would like to thank the organizers of PAKDD 2006 for hosting KDXD'06.

    Knowledge Data Discovery

    Data mining is still a little-investigated area. This project first addresses knowledge discovery from structured data in general, and from data in XML format in particular. It then presents the tree mining algorithm HybridTreeMiner, with a view to applying it to knowledge discovery from XML documents. The practical part of the project is dedicated to designing the integration of the algorithm into the mining system developed at FIT. This system is implemented in Java, has a modular structure, and its parts communicate with each other by means of the DMSL language. The results achieved are presented and discussed at the end.