Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context
Mathematical formulae represent complex semantic information in a concise
form. Especially in Science, Technology, Engineering, and Mathematics,
mathematical formulae are crucial to communicate information, e.g., in
scientific papers, and to perform computations using computer algebra systems.
Enabling computers to access the information encoded in mathematical formulae
requires machine-readable formats that can represent both the presentation and
content, i.e., the semantics, of formulae. Exchanging such information between
systems additionally requires conversion methods for mathematical
representation formats. We analyze how the semantic enrichment of formulae
improves the format conversion process and show that considering the textual
context of formulae reduces the error rate of such conversions. Our main
contributions are: (1) providing an openly available benchmark dataset for the
mathematical format conversion task consisting of a newly created test
collection, an extensive, manually curated gold standard and task-specific
evaluation metrics; (2) performing a quantitative evaluation of
state-of-the-art tools for mathematical format conversions; (3) presenting a
new approach that considers the textual context of formulae to reduce the error
rate for mathematical format conversions. Our benchmark dataset facilitates
future research on mathematical format conversions as well as research on many
problems in mathematical information retrieval. Because we annotated and linked
all components of formulae, e.g., identifiers, operators and other entities, to
Wikidata entries, the gold standard can, for instance, be used to train methods
for formula concept discovery and recognition. Such methods can then be applied
to improve mathematical information retrieval systems, e.g., for semantic
formula search, recommendation of mathematical content, or detection of
mathematical plagiarism.
Comment: 10 pages, 4 figures
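The core idea of using textual context to disambiguate formula components can be sketched as follows. This is a minimal illustration, not the paper's actual method: the candidate lists, the example sentence, and the overlap-count scoring are all invented for the purpose of the sketch.

```python
# Hypothetical sketch: rank candidate semantic annotations for a formula
# identifier by counting how many of each candidate's description terms
# appear in the formula's surrounding text.

def score_candidates(context_words, candidates):
    """Pick the candidate meaning with the largest context overlap."""
    context = {w.lower() for w in context_words}
    scores = {}
    for name, description_terms in candidates.items():
        overlap = context.intersection(t.lower() for t in description_terms)
        scores[name] = len(overlap)
    return max(scores, key=scores.get)

# The identifier "E" in "E = mc^2" could denote energy or an expected
# value; the surrounding sentence mentions "energy" and "mass", so the
# textual context resolves the ambiguity.
context = "the energy of a body equals its mass times the speed of light squared".split()
candidates = {
    "energy": ["energy", "joule", "work"],
    "expected value": ["probability", "random", "distribution"],
}
print(score_candidates(context, candidates))  # → energy
```

A real system would of course use richer features than bag-of-words overlap, but the sketch shows why context reduces conversion errors: the same presentation symbol maps to different content symbols depending on the surrounding prose.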
XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual properties of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.
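As a rough illustration of combining the two similarity dimensions, the following sketch (not the paper's actual algorithm; the schemas, weights, and scoring functions are invented) mixes element-name similarity with overlap of parent/child paths:

```python
# Minimal sketch: a schema is summarized by its element names (linguistic
# dimension) and its parent/child paths (hierarchical dimension); the two
# scores are combined with illustrative weights.
from difflib import SequenceMatcher

def name_similarity(names_a, names_b):
    """Average best string-match score between the two element-name sets."""
    scores = [max(SequenceMatcher(None, a, b).ratio() for b in names_b)
              for a in names_a]
    return sum(scores) / len(scores)

def path_similarity(paths_a, paths_b):
    """Jaccard overlap of parent/child edges captures structural similarity."""
    a, b = set(paths_a), set(paths_b)
    return len(a & b) / len(a | b)

def schema_similarity(schema_a, schema_b, w_ling=0.5, w_struct=0.5):
    return (w_ling * name_similarity(schema_a["names"], schema_b["names"])
            + w_struct * path_similarity(schema_a["paths"], schema_b["paths"]))

book = {"names": ["book", "title", "author"],
        "paths": [("book", "title"), ("book", "author")]}
publication = {"names": ["publication", "title", "writer"],
               "paths": [("publication", "title"), ("publication", "writer")]}
print(round(schema_similarity(book, publication), 2))
```

A clustering step would then feed such pairwise similarities into any standard algorithm (e.g., hierarchical agglomerative clustering) to group the schemas.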
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques
make it possible to gather large amounts of structured data continuously
generated and disseminated by Web 2.0, Social Media and Online Social Network
users; this offers unprecedented opportunities to analyze human behavior at a
very large scale. We also discuss the potential for cross-fertilization, i.e.,
the possibility of re-using Web Data Extraction techniques originally designed
to work in a given domain in other domains.
Comment: Knowledge-Based Systems
Well-Formed and Scalable Invasive Software Composition
Software components provide essential means to structure and organize software effectively. Frequently, however, the required component abstractions are not available in a programming language or system, or cannot be adequately combined with each other. Invasive software composition (ISC) is a general approach to software composition that unifies component-like abstractions such as templates, aspects and macros. ISC is based on fragment composition and composes programs and other software artifacts at the level of syntax trees. To this end, a unifying fragment component model is related to the context-free grammar of a language to identify extension and variation points in syntax trees as well as valid component types. Fragment components can then be composed by transformations at the respective extension and variation points, so that composition always yields results that are valid with respect to the underlying context-free grammar. However, given a language's context-free grammar, the composition result may still be incorrect.
Context-sensitive constraints such as type constraints may be violated so that the program cannot be compiled and/or interpreted correctly. While a compiler can detect such errors after composition, it is difficult to relate them back to the original transformation step in the composition system, especially in the case of complex compositions with several hundreds of such steps. To tackle this problem, this thesis proposes well-formed ISC—an extension to ISC that uses reference attribute grammars (RAGs) to specify fragment component models and fragment contracts to guard compositions with context-sensitive constraints. Additionally, well-formed ISC provides composition strategies as a means to configure composition algorithms and handle interferences between composition steps.
Developing ISC systems for complex languages such as programming languages is a challenging undertaking. Composition-system developers need to supply or develop adequate language and parser specifications that can be processed by an ISC composition engine. Moreover, the specifications may need to be extended with rules for the intended composition abstractions.
Current approaches to ISC require complete grammars to be able to compose fragments in the respective languages. Hence, the specifications need to be developed exhaustively before any component model can be supplied. To tackle this problem, this thesis introduces scalable ISC, a variant of ISC that uses island component models as a means to define component models for partially specified languages while still supporting the whole language. Additionally, a scalable workflow for agile composition-system development is proposed, which supports developing ISC systems in small increments using modular extensions.
All theoretical concepts introduced in this thesis are implemented in the Skeletons and Application Templates framework SkAT. It supports "classic", well-formed and scalable ISC by leveraging RAGs as its main specification and implementation language. Moreover, several composition systems based on SkAT are discussed, e.g., a well-formed composition system for Java and a C-preprocessor-like macro language. In turn, those composition systems are used as composers in several example applications, such as a library of parallel algorithmic skeletons.
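The basic mechanism of fragment composition with a guarding contract can be sketched as follows. This is an invented illustration, not the SkAT API: the node representation, slot lookup, and contract callback are all assumptions made for the sketch.

```python
# Illustrative sketch of invasive composition: a fragment syntax tree is
# bound into a named variation point ("slot") of a host syntax tree, and
# a fragment contract checks a context-sensitive constraint before binding.

class Node:
    def __init__(self, kind, children=None, name=None):
        self.kind, self.name = kind, name
        self.children = children or []

def find_slot(tree, slot_name):
    """Depth-first search for a variation point with the given name."""
    if tree.kind == "slot" and tree.name == slot_name:
        return tree
    for child in tree.children:
        hit = find_slot(child, slot_name)
        if hit:
            return hit
    return None

def compose(host, slot_name, fragment, contract):
    """Replace a slot in-place with a fragment if the contract holds."""
    slot = find_slot(host, slot_name)
    if slot is None:
        raise ValueError(f"no slot named {slot_name}")
    if not contract(fragment):
        raise ValueError("fragment contract violated")
    slot.kind, slot.name = fragment.kind, fragment.name
    slot.children = fragment.children
    return host

# Host: a method with an open body slot; the contract demands the
# fragment be a statement, rejecting e.g. a bare expression fragment.
host = Node("method", [Node("slot", name="body")])
fragment = Node("stmt", [Node("expr")])
compose(host, "body", fragment, contract=lambda f: f.kind == "stmt")
print(host.children[0].kind)  # → stmt
```

The contract here plays the role the thesis assigns to fragment contracts: catching context-sensitive violations at composition time, rather than leaving them for the compiler to report long after the responsible composition step.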
Automatic construction and adaptation of wrappers for semi-structured web documents.
Wong Tak Lam. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 88-94). Abstracts in English and Chinese.
Contents:
Chapter 1: Introduction
  1.1 Wrapper Induction for Semi-structured Web Documents
  1.2 Adapting Wrappers to Unseen Web Sites
  1.3 Thesis Contributions
  1.4 Thesis Organization
Chapter 2: Related Work
  2.1 Related Work on Wrapper Induction
  2.2 Related Work on Wrapper Adaptation
Chapter 3: Automatic Construction of Hierarchical Wrappers
  3.1 Hierarchical Record Structure Inference
  3.2 Extraction Rule Induction
  3.3 Applying Hierarchical Wrappers
Chapter 4: Experimental Results for Wrapper Induction
Chapter 5: Adaptation of Wrappers for Unseen Web Sites
  5.1 Problem Definition
  5.2 Overview of Wrapper Adaptation Framework
  5.3 Potential Training Example Candidate Identification
    5.3.1 Useful Text Fragments
    5.3.2 Training Example Generation from the Unseen Web Site
    5.3.3 Modified Nearest Neighbour Classification
  5.4 Machine Annotated Training Example Discovery and New Wrapper Learning
    5.4.1 Text Fragment Classification
    5.4.2 New Wrapper Learning
Chapter 6: Case Study and Experimental Results for Wrapper Adaptation
  6.1 Case Study on Wrapper Adaptation
  6.2 Experimental Results
    6.2.1 Book Domain
    6.2.2 Consumer Electronic Appliance Domain
Chapter 7: Conclusions and Future Work
Bibliography
Appendix A: Detailed Performance of Wrapper Induction for Book Domain
Appendix B: Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain
Understanding Documents with Text Mining Methods
Keywords are terms that capture the topic of a document. They can be used to summarize a document or to optimize its retrieval. The aim of this work is to explore ways of improving keyword extraction, and to test whether keywords can improve the quality of document clustering and classification compared to unigrams. We further explore building an ontology from keywords and modelling keywords over time, and show that text-mining methods can, to a limited extent, also be used to predict the lifespan of an article. Finally, we design and implement a text-mining system that, via web services, extracts keywords from news articles and clusters the documents using the methods that were evaluated as most successful, as part of the back-end of a news web portal.
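A standard baseline for the keyword-extraction step is TF-IDF scoring, which the following minimal sketch illustrates (it is not the thesis's system; the toy corpus and the choice of top-2 terms are invented):

```python
# Minimal TF-IDF keyword extraction: terms frequent within one article
# but rare across the corpus are taken to characterize its topic.
import math
from collections import Counter

def tfidf_keywords(docs, k=2):
    """Return the top-k TF-IDF terms for each tokenized document."""
    # document frequency: in how many documents each term occurs
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    result = []
    for doc in docs:
        tf = Counter(doc)
        scored = {w: tf[w] * math.log(n / df[w]) for w in tf}
        result.append(sorted(scored, key=scored.get, reverse=True)[:k])
    return result

docs = [
    "election vote vote poll".split(),
    "match goal goal referee".split(),
]
print(tfidf_keywords(docs))
```

Clustering documents by such keyword vectors instead of raw unigram vectors is exactly the comparison the abstract describes.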
Automatic office document classification and information extraction
TEXPROS (TEXt PROcessing System) is a document processing system (DPS) that supports and assists office workers in their daily work in dealing with information and document management. In this thesis, document classification and information extraction, which are two of the major functional capabilities of TEXPROS, are investigated.
Based on the nature of its content, a document is divided into structured and unstructured (i.e., free-text) parts. Conceptual and content structures are introduced to capture the semantics of the structured and unstructured parts of the document, respectively. The document is classified, and information is extracted, based on analyses of the conceptual and content structures. In our approach, the layout structure of a document is used to assist these analyses. By nested segmentation of a document, its layout structure is represented by an ordered labeled tree, called the Layout Structure Tree (L-S-Tree). A sample-based classification mechanism is adopted for classifying documents: a set of pre-classified documents is stored in a document sample base in the form of sample trees. In the layout analysis, approximate tree matching is used to match the L-S-Tree of a document to be classified against the sample trees. The layout similarities between the document and the sample documents are evaluated based on the edit distance between the document's L-S-Tree and the sample trees. The document samples whose layout structure is similar to the document's are chosen for the conceptual analysis of the document.
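The layout-matching step can be sketched roughly as follows. This is not the thesis's exact algorithm: the nested-tuple tree encoding, the simplified edit-distance recursion, and the example layouts are all invented for illustration.

```python
# Rough sketch of sample-based layout classification: an L-S-Tree is a
# nested tuple (label, children), and an approximate edit distance aligns
# the child sequences of matched nodes with dynamic programming.

def tree_size(t):
    label, children = t
    return 1 + sum(tree_size(c) for c in children)

def tree_dist(a, b):
    """Label mismatch cost plus a DP alignment of the child sequences."""
    (la, ca), (lb, cb) = a, b
    cost = 0 if la == lb else 1
    # sequence edit distance over children: substituting child i for
    # child j recursively costs tree_dist, insert/delete costs subtree size
    m, n = len(ca), len(cb)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + tree_size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + tree_size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + tree_size(ca[i - 1]),
                          d[i][j - 1] + tree_size(cb[j - 1]),
                          d[i - 1][j - 1] + tree_dist(ca[i - 1], cb[j - 1]))
    return cost + d[m][n]

# An incoming document's layout is compared against two sample trees;
# it is closest to the "letter" sample, so that sample's class is used.
doc     = ("page", [("header", []), ("body", []), ("signature", [])])
letter  = ("page", [("header", []), ("body", []), ("signature", [])])
invoice = ("page", [("table", []), ("total", [])])
print(min([letter, invoice], key=lambda s: tree_dist(doc, s)) is letter)  # → True
```

Full approximate tree matching (e.g., Zhang-Shasha) handles node deletions that splice children into the parent, which this simplified recursion does not; the sketch only conveys how edit distance ranks layout samples.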
In the conceptual analysis of the document, based on the mapping between the document and the document samples found during the layout analysis, the conceptual similarities between the document and the sample documents are evaluated based on a degree of conceptual closeness. The document sample whose conceptual structure is most similar to the document's is chosen for extracting information. Extracting information from the structured part of the document is based on the layout locations of key terms appearing in the document and on string pattern matching. Based on the information extracted from the structured part, the type of the document is identified. In the content analysis of the document, bottom-up and top-down analyses of the free text are combined to extract information from the unstructured part. In the bottom-up analysis, the sentences of the free text are classified into those that are relevant or irrelevant to the extraction. The sentence classification is based on the semantic relationship between the phrases in the sentences and the attribute names in the corresponding content structure, determined by consulting a thesaurus. Then the thematic roles of the phrases in each relevant sentence are identified based on syntactic analysis and heuristic thematic analysis. In the top-down analysis, the appropriate content structure is identified based on the document type determined in the conceptual analysis. Then the information is extracted from the unstructured part of the document by evaluating the restrictions specified in the corresponding content structure against the result of the bottom-up analysis.
The information extracted from the structured and unstructured parts of the document is stored in the form of a frame-like structure (a frame instance) in the database for information retrieval in TEXPROS.