XML Matchers: approaches and challenges
Schema Matching, i.e. the process of discovering semantic correspondences
between concepts adopted in different data source schemas, has been a key topic
in Database and Artificial Intelligence research areas for many years. In the
past, it was largely investigated especially for classical database models
(e.g., E/R schemas, relational databases, etc.). However, in recent years,
the widespread adoption of XML in the most disparate application fields pushed
a growing number of researchers to design XML-specific Schema Matching
approaches, called XML Matchers, aiming at finding semantic matchings between
concepts defined in DTDs and XSDs. XML Matchers do not just take well-known
techniques originally designed for other data models and apply them on
DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical
structure of a DTD/XSD) to improve the performance of the Schema Matching
process. The design of XML Matchers is currently a well-established research
area. The main goal of this paper is to provide a detailed description and
classification of XML Matchers. We first describe to what extent the
specificities of DTDs/XSDs impact on the Schema Matching task. Then we
introduce a template, called XML Matcher Template, that describes the main
components of an XML Matcher, their role and behavior. We illustrate how each
of these components has been implemented in some popular XML Matchers. We
consider our XML Matcher Template as the baseline for objectively comparing
approaches that, at first glance, might appear as unrelated. The introduction
of this template can be useful in the design of future XML Matchers. Finally,
we analyze commercial tools implementing XML Matchers and introduce two
challenging issues strictly related to this topic, namely XML source clustering
and uncertainty management in XML Matchers. Comment: 34 pages, 8 tables, 7 figures
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources have data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Advanced similarity measurement techniques are often used to remove semantic duplicates from the integration result or to solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the integration result can already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This shows that our approach indeed reduces development effort (and does not merely shift the effort to rule definition and threshold tuning): setting rough safe thresholds and defining only a few rules suffices to produce a "good enough" integration that can be meaningfully used
Highlighting matched and mismatched segments in translation memory output through sub-tree alignment
In recent years, it has become increasingly clear that the
localisation industry does not have the necessary manpower to satisfy the increasing demand for high-quality translation. This has fuelled the search for new and existing technologies that would increase translator throughput. As Translation Memory (TM) systems are the tool most commonly employed by translators, a number of enhancements are
available to assist them in their job. One such enhancement would be to show the translator which parts of the sentence
that needs to be translated match which parts of the fuzzy
match suggested by the TM. For this information to be used,
however, the translators have to carry it over to the TM
translation themselves. In this paper, we present a novel methodology that can automatically detect and highlight
the segments that need to be modified in a TM-suggested
translation. We base it on state-of-the-art sub-tree alignment
technology (Zhechev, 2010) that can produce aligned
phrase-based-tree pairs from unannotated data. Our system
operates in a three-step process. First, the fuzzy match
selected by the TM and its translation are aligned. This
lets us know which segments of the source-language sentence
correspond to which segments in its translation. In the
second step, the fuzzy match is aligned to the input sentence that is currently being translated. This tells us
which parts of the input sentence are available in the fuzzy
match and which still need to be translated. In the third
step, the fuzzy match is used as an intermediary, through
which the alignments between the input sentence and the TM
translation are established. In this way, we can detect with
precision the segments in the suggested translation that the
translator needs to edit and highlight them appropriately to
set them apart from the segments that are already good translations for parts of the input sentence. Additionally,
we can show the alignments, as detected by our system, between
the input and the translation, which will make it even easier for the translator to post-edit the TM suggestion. This alignment information can additionally be used to
pre-translate the mismatched segments, further reducing the post-editing load
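The three-step process described in this abstract can be read as a composition of two alignments through the fuzzy match. The following is a minimal sketch under stated assumptions, not the authors' sub-tree aligner: alignments are assumed to be already available as sets of token-index pairs, and the function names (`compose`, `mismatched_target_tokens`) and the toy sentences are hypothetical.

```python
def compose(input_to_fuzzy, fuzzy_to_target):
    """Link input tokens to target tokens via the shared fuzzy-match indices."""
    by_fuzzy = {}
    for f, t in fuzzy_to_target:
        by_fuzzy.setdefault(f, set()).add(t)
    return {(i, t) for i, f in input_to_fuzzy for t in by_fuzzy.get(f, ())}


def mismatched_target_tokens(target_len, input_to_fuzzy, fuzzy_to_target):
    """Return indices of TM-translation tokens not supported by the input sentence."""
    covered = {t for _, t in compose(input_to_fuzzy, fuzzy_to_target)}
    return [t for t in range(target_len) if t not in covered]


# Toy data (hypothetical): the input differs from the fuzzy match in one word.
input_sentence = ["the", "red", "car", "stops"]
fuzzy_match = ["the", "blue", "car", "stops"]
tm_translation = ["la", "voiture", "bleue", "s'arrête"]

# Step 1: fuzzy match aligned to its TM translation (fuzzy index, target index).
fuzzy_to_target = {(0, 0), (1, 2), (2, 1), (3, 3)}
# Step 2: input aligned to the fuzzy match; "red" has no counterpart.
input_to_fuzzy = {(0, 0), (2, 2), (3, 3)}
# Step 3: compose both alignments and collect the uncovered target tokens.
to_edit = mismatched_target_tokens(
    len(tm_translation), input_to_fuzzy, fuzzy_to_target
)
words_to_highlight = [tm_translation[t] for t in to_edit]
```

In this toy case only the translation of "blue" is left uncovered, so it is the single token that would be highlighted for post-editing.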
Sample-based XPath Ranking for Web Information Extraction
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a "search, search result page, detail page" setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute
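The idea of ranking XPaths against sample data can be illustrated with a small sketch. The abstract does not give the paper's ranking function, so the score below (the share of extracted values that equal the known sample value, averaged over pages) is an assumption, as are the function name `rank_xpaths` and the toy pages. The standard-library parser used here supports only a subset of XPath and needs well-formed markup, so real HTML would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, samples, candidates):
    """Rank candidate XPaths by how precisely they extract the known sample values.

    pages:      well-formed XML/XHTML strings, one detail page per object
    samples:    the known attribute value for each page
    candidates: XPath expressions (ElementTree subset) to be ranked
    """
    scored = []
    for xpath in candidates:
        total = 0.0
        for page, expected in zip(pages, samples):
            root = ET.fromstring(page)
            texts = [(el.text or "").strip() for el in root.findall(xpath)]
            if expected in texts:
                # Reward a hit, but penalise XPaths that also extract noise.
                total += 1.0 / len(texts)
        scored.append((total / len(pages), xpath))
    # Highest score first; sorted() is stable, so ties keep candidate order.
    return [xpath for _, xpath in sorted(scored, key=lambda s: -s[0])]


# Toy detail pages (hypothetical) and the sample prices known for them.
pages = [
    "<html><body><div><span class='title'>Lamp</span>"
    "<span class='price'>12.50</span></div></body></html>",
    "<html><body><div><span class='title'>Desk</span>"
    "<span class='price'>80.00</span></div></body></html>",
]
samples = ["12.50", "80.00"]
candidates = [".//span[@class='title']", ".//span[@class='price']", ".//div/span"]

ranked = rank_xpaths(pages, samples, candidates)
```

With these toy pages the price-specific XPath ranks first, the broader `.//div/span` second (it finds the price but also the title), and the title XPath last.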
Structure and content semantic similarity detection of eXtensible markup language documents using keys
XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is a central issue in database management and information retrieval. In fact, although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detection of structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting XML similarity in content and structure using relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time is reduced dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys), is based on previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates.
Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity and thus to join XML document versions using a change detection mechanism. In this approach, subtree keys still play an important role in order to avoid unnecessary subtree comparisons within multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms --Abstract, page iv-v
Design of Automatically Adaptable Web Wrappers
Nowadays, the huge amount of information distributed through the Web motivates studying techniques to
be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises
developed several approaches to Web data extraction, for example using techniques of artificial intelligence or
machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision
of information extracted from Web pages, and, at the same time, have to prove robustness in order not to
compromise quality and reliability of data themselves.
In this paper we focus on some experimental aspects related to the robustness of the data extraction process
and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for
finding similarities between two different versions of a Web page, in order to handle modifications, avoiding
the failure of data extraction tasks and ensuring reliability of information extracted. Our purpose is to evaluate
performances, advantages and drawbacks of our novel system of automatic wrapper adaptation
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
Measuring the similarity of PML documents with RFID-based sensors
The Electronic Product Code (EPC) Network is an important part of the
Internet of Things. The Physical Mark-Up Language (PML) is used to represent and
describe data related to objects in the EPC Network. The PML documents that each
component uses to exchange data in the EPC Network system are XML documents based on the PML
Core schema. To manage the huge number of PML documents for tags captured
by Radio Frequency Identification (RFID) readers, it is necessary to develop
high-performance technology, such as filtering and integrating these tag
data. In this paper, we therefore propose an approach for measuring the similarity of
PML documents based on a Bayesian Network of several sensors. With respect to the
features of PML, while measuring the similarity, we first remove the
redundant data except the EPC information. On the basis of this, the Bayesian
Network model derived from the structure of the PML documents being compared is
constructed. Comment: International Journal of Ad Hoc and Ubiquitous Computing
Improving Assessment of Students through Semantic Space Construction
Assessment is one of the hardest tasks an Intelligent Tutoring System has to perform. It involves different and sometimes uncorrelated sub-tasks: building a student model to define her needs, defining tools and procedures to perform tests, understanding students' replies to system prompts, defining suitable procedures to evaluate the correctness of students' replies, and strategies to improve students' abilities after the assessment session.
In this work we present an improvement of our system, TutorJ, with particular attention to the assessment phase. Many tutoring systems offer only a limited set of assessment options like multiple-choice questions, fill-in-the-blanks tests or other types of predefined replies obtained through graphical widgets (radio-buttons, text-areas). This limited set of solutions makes interaction poor and unable to satisfy the users' needs. Our interest is to enrich interaction with dialog in natural language. In this respect, the assessment problem is strictly connected to natural language understanding. The preliminary step is indeed to understand questions and replies of the student.
We have reviewed the system design in the framework of a cognitive architecture with the aim to reach a double result: the reduction of the effort for the construction of the knowledge base and the improvement of the system capabilities in the assessment process. To this aim a new common semantic space has been defined and implemented. The entire architecture is oriented to intuitive and natural interaction