20,082 research outputs found
Tree model guided candidate generation for mining frequent subtrees from XML
Due to the inherent flexibilities in both structure and semantics, XML association rules mining faces few challenges, such as: a more complicated hierarchical data structure and ordered data context. Mining frequent patterns from XML documents can be recast as mining frequent tree structures from a database of XML documents. In this study, we model a database of XML documents as a database of rooted labeled ordered subtrees. In particular, we are mainly coneerned with mining frequent induced and embedded ordered subtrees. Our main contributions arc as follows. We describe our unique embedding list representation of the tree structure, which enables efficient implementation ofour Tree Model Guided (TMG) candidate generation. TMG is an optimal, non-redundant enumeration strategy which enumerates all the valid candidates that conform to the structural aspects of the data. We show through a mathematical model and experiments that TMG has better complexity compared to the commonly used join approach. In this paper, we propose two algorithms, MB3Miner and iMB3-Miner. MB3-Miner mines embedded subtrees. iMB3-Miner mines induced and/or embedded subtrees by using the maximum level of embedding constraint. Our experiments with both synthetic and real datasets against two well known algorithms for mining induced and embedded subtrees, demonstrate the effeetiveness and the efficiency of the proposed techniques
An Effective XML Keyword Search with User Search Intention over XML Documents
The extreme success of web search engines makes keyword search the most popular search model for ordinary users. Keyword search on XML is a user friendly way to query XML databases since it allows users to pose queries without the knowledge of complex query languages and the database schema. The three main challenges faces in XML keyword search: 1) Identify the user search intention, i.e., identify the XML node types that users want to search for and search via. 2) Resolve keyword ambiguity problems: a keyword can appear as both a tag name and a text value of some node; a keyword can appear as the text values of different XML node types and carry different meanings; a keyword can appear as the tag name of different XML node types with different meanings. 3) As the search results are sub trees of the XML documents, new scoring function is needed to estimate its relevance to a given query. However, existing methods cannot resolve these challenges, thus return low result quality in term of query relevance. In this paper, we propose an IR-style approach which basically utilizes the statistics of underlying XML data to address these challenges. We first propose specific guidelines that a search engine should meet in both search intention identification and relevance oriented ranking for search results over XML documents. Then, based on these guidelines, we design novel formulae to identify the search for nodes and search via nodes of a query, and present a novel XML TF*IDF ranking strategy to rank the individual matches of all possible search intentions over XML documents
The Topology ToolKit
This system paper presents the Topology ToolKit (TTK), a software platform
designed for topological data analysis in scientific visualization. TTK
provides a unified, generic, efficient, and robust implementation of key
algorithms for the topological analysis of scalar data, including: critical
points, integral lines, persistence diagrams, persistence curves, merge trees,
contour trees, Morse-Smale complexes, fiber surfaces, continuous scatterplots,
Jacobi sets, Reeb spaces, and more. TTK is easily accessible to end users due
to a tight integration with ParaView. It is also easily accessible to
developers through a variety of bindings (Python, VTK/C++) for fast prototyping
or through direct, dependence-free, C++, to ease integration into pre-existing
complex systems. While developing TTK, we faced several algorithmic and
software engineering challenges, which we document in this paper. In
particular, we present an algorithm for the construction of a discrete gradient
that complies to the critical points extracted in the piecewise-linear setting.
This algorithm guarantees a combinatorial consistency across the topological
abstractions supported by TTK, and importantly, a unified implementation of
topological data simplification for multi-scale exploration and analysis. We
also present a cached triangulation data structure, that supports time
efficient and generic traversals, which self-adjusts its memory usage on demand
for input simplicial meshes and which implicitly emulates a triangulation for
regular grids with no memory overhead. Finally, we describe an original
software architecture, which guarantees memory efficient and direct accesses to
TTK features, while still allowing for researchers powerful and easy bindings
and extensions. TTK is open source (BSD license) and its code, online
documentation and video tutorials are available on TTK's website
Towards structured sharing of raw and derived neuroimaging data across existing resources
Data sharing efforts increasingly contribute to the acceleration of
scientific discovery. Neuroimaging data is accumulating in distributed
domain-specific databases and there is currently no integrated access mechanism
nor an accepted format for the critically important meta-data that is necessary
for making use of the combined, available neuroimaging data. In this
manuscript, we present work from the Derived Data Working Group, an open-access
group sponsored by the Biomedical Informatics Research Network (BIRN) and the
International Neuroimaging Coordinating Facility (INCF) focused on practical
tools for distributed access to neuroimaging data. The working group develops
models and tools facilitating the structured interchange of neuroimaging
meta-data and is making progress towards a unified set of tools for such data
and meta-data exchange. We report on the key components required for integrated
access to raw and derived neuroimaging data as well as associated meta-data and
provenance across neuroimaging resources. The components include (1) a
structured terminology that provides semantic context to data, (2) a formal
data model for neuroimaging with robust tracking of data provenance, (3) a web
service-based application programming interface (API) that provides a
consistent mechanism to access and query the data model, and (4) a provenance
library that can be used for the extraction of provenance data by image
analysts and imaging software developers. We believe that the framework and set
of tools outlined in this manuscript have great potential for solving many of
the issues the neuroimaging community faces when sharing raw and derived
neuroimaging data across the various existing database systems for the purpose
of accelerating scientific discovery
Supporting text mining for e-Science: the challenges for Grid-enabled natural language processing
Over the last few years, language technology has moved rapidly from 'applied research' to 'engineering', and from small-scale to large-scale engineering. Applications such as advanced text mining systems are feasible, but very resource-intensive, while research seeking to address the underlying language processing questions faces very real practical and methodological limitations. The e-Science vision, and the creation of the e-Science Grid, promises the level of integrated large-scale technological support required to sustain this important and successful new technology area. In this paper, we discuss the foundations for the deployment of text mining and other language technology on the Grid - the protocols and tools required to build distributed large-scale language technology systems, meeting the needs of users, application builders and researchers
Information extraction from multimedia web documents: an open-source platform and testbed
The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques have been integrated into a single, but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval
Linked education: interlinking educational resources and the web of data
Research on interoperability of technology-enhanced learning (TEL) repositories throughout the last decade has led to a fragmented landscape of competing approaches, such as metadata schemas and interface mechanisms. However, so far Web-scale integration of resources is not facilitated, mainly due to the lack of take-up of shared principles, datasets and schemas. On the other hand, the Linked Data approach has emerged as the de-facto standard for sharing data on the Web and offers a large potential to solve interoperability issues in the field of TEL. In this paper, we describe a general approach to exploit the wealth of already existing TEL data on the Web by allowing its exposure as Linked Data and by taking into account automated enrichment and interlinking techniques to provide rich and well-interlinked data for the educational domain. This approach has been implemented in the context of the mEducator project where data from a number of open TEL data repositories has been integrated, exposed and enriched by following Linked Data principles
Investigation into Indexing XML Data Techniques
The rapid development of XML technology improves the WWW, since the XML data has many advantages and has become a common technology for transferring data cross the internet. Therefore, the objective of this research is to investigate and study the XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues.
Furthermore, this research considers most common XML indexing techniques and performs a comparison between them. Subsequently, this work makes an argument to find out these limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the
size and the efficiency of the indexes. So, all the indexes become large in order to perform well, and none of them is suitable for all users’ requirements. However, each one of these techniques has some advantages in somehow
- …