XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as a data representation language, collections of XML data have exploded in number, and methods are required to manage them and discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic content and context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.
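As a rough illustration of the kind of combined measure involved, the sketch below mixes a linguistic name similarity with a simple path-based structural similarity. The weighting scheme, helper functions, and example elements are our own illustration, not the paper's actual measures.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Linguistic similarity of two element names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def path_similarity(path_a: list[str], path_b: list[str]) -> float:
    """Structural similarity: overlap of the root-to-element paths."""
    common = len(set(path_a) & set(path_b))
    return common / max(len(path_a), len(path_b))

def element_similarity(a: dict, b: dict, w_ling=0.6, w_struct=0.4) -> float:
    """Weighted combination of linguistic and hierarchical similarity.
    The weights here are illustrative, not values from the paper."""
    return (w_ling * name_similarity(a["name"], b["name"])
            + w_struct * path_similarity(a["path"], b["path"]))

# Two elements from heterogeneous schemas describing the same concept.
e1 = {"name": "AuthorName", "path": ["book", "author", "AuthorName"]}
e2 = {"name": "author_name", "path": ["publication", "author_name"]}
print(element_similarity(e1, e2))
```

A clustering algorithm would then group elements (or whole schemas) whose pairwise scores exceed some threshold.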
Methodological considerations concerning manual annotation of musical audio in function of algorithm development
In research on musical audio-mining, annotated music databases are needed which allow the development of computational tools that extract from the musical audio stream the kind of high-level content that users can deal with in Music Information Retrieval (MIR) contexts. The notion of musical content, and therefore the notion of annotation, is ill-defined, however, in both the syntactic and the semantic sense. As a consequence, annotation has been approached from a variety of perspectives (but mainly linguistic-symbolic oriented), and a general methodology is lacking. This paper is a step towards the definition of a general framework for manual annotation of musical audio in function of a computational approach to musical audio-mining that is based on algorithms that learn from annotated data.
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
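To make the first of these classes concrete, here is a minimal sketch of a term-document style model using scikit-learn; the toy corpus and the use of tf-idf weighting are our assumptions, not taken from the survey.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: each document becomes one row of a document-term matrix
# (the transpose of the term-document view described in the survey).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "semantics of vector space models",
]

# Rows = documents, columns = terms, entries weighted by tf-idf.
matrix = TfidfVectorizer().fit_transform(docs)

# Document similarity falls out of the geometry: the angle between rows.
print(cosine_similarity(matrix[0], matrix[1]))  # related: shared vocabulary
print(cosine_similarity(matrix[0], matrix[2]))  # unrelated: near zero
```

The word-context and pair-pattern classes follow the same recipe with different row and column choices, which is the unifying perspective the survey takes.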
FilteredWeb: A Framework for the Automated Search-Based Discovery of Blocked URLs
Various methods have been proposed for creating and maintaining lists of potentially filtered URLs to allow for measurement of ongoing internet censorship around the world. Whilst testing a known resource for evidence of filtering can be relatively simple, given appropriate vantage points, discovering previously unknown filtered web resources remains an open challenge.

We present a new framework for automating the process of discovering filtered resources through the use of adaptive queries to well-known search engines. Our system applies information retrieval algorithms to isolate characteristic linguistic patterns in known filtered web pages; these are then used as the basis for web search queries. The results of these queries are then checked for evidence of filtering, and newly discovered filtered resources are fed back into the system to detect further filtered content.

Our implementation of this framework, applied to China as a case study, shows that this approach is demonstrably effective at detecting significant numbers of previously unknown filtered web pages, making a significant contribution to the ongoing detection of internet filtering as it develops.

Our tool is currently deployed and has been used to discover 1355 domains that are poisoned within China as of February 2017, 30 times more than are contained in the most widely used public filter list. Of these, 759 are outside the Alexa Top 1000 domains list, demonstrating the capability of this framework to find more obscure filtered content. Further, our initial analysis of filtered URLs, and the search terms that were used to discover them, gives further insight into the nature of the content currently being blocked in China.

Comment: To appear in "Network Traffic Measurement and Analysis Conference 2017" (TMA 2017).
A new web interface to facilitate access to corpora: development of the ASLLRP data access interface
A significant obstacle to broad utilization of corpora is the difficulty in gaining access to the specific subsets of data and annotations that may be relevant for particular types of research. With that in mind, we have developed a web-based Data Access Interface (DAI) to provide access to the expanding datasets of the American Sign Language Linguistic Research Project (ASLLRP). The DAI facilitates browsing the corpora, viewing videos and annotations, searching for phenomena of interest, and downloading selected materials from the website. Compared to providing videos and annotation files off-line, the web interface also greatly increases access for people who have no prior experience in working with linguistic annotation tools, and it opens the door to integrating the data with third-party applications on the desktop and in the mobile space. In this paper we give an overview of the available videos, annotations, and search functionality of the DAI, as well as plans for future enhancements. We also summarize best practices and key lessons learned that are crucial to the success of similar projects.
Measures of lexical distance between languages
The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville \cite{Urv}. He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a common historical origin. Recently, we proposed a new automated method which uses the normalized Levenshtein distance among words with the same meaning and averages over the words contained in a list. Another group of scholars \cite{Bak, Hol} has since proposed a refinement of our definition that includes a second normalization. In this paper we compare the information content of our definition with that of the refined version in order to decide which of the two can be applied with greater success to resolve relationships among languages.
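A minimal sketch of the averaged, normalized Levenshtein measure described above, assuming the two word lists are aligned by meaning; the refined version's second normalization is omitted, and the example lists are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lexical_distance(list_a, list_b):
    """Average, over aligned meanings, of the edit distance normalized by
    the longer word's length (the definition sketched in the abstract)."""
    pairs = zip(list_a, list_b)
    return sum(levenshtein(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(list_a)

# Tiny illustrative Swadesh-style lists (same meanings, two languages).
italian = ["acqua", "fuoco", "notte"]
spanish = ["agua", "fuego", "noche"]
print(lexical_distance(italian, spanish))
```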
Do Linguistic Style and Readability of Scientific Abstracts Affect Their Virality?
Reactions to textual content posted in an online social network show different dynamics depending on the linguistic style and readability of the submitted content. Do similar dynamics exist for responses to scientific articles? Our intuition, supported by previous research, suggests that the success of a scientific article depends on its content rather than on its linguistic style. In this article, we examine a corpus of scientific abstracts and three forms of associated reactions: article downloads, citations, and bookmarks. Through a class-based psycholinguistic analysis and readability index tests, we show that certain stylistic and readability features of abstracts clearly concur in determining the success and viral capability of a scientific article.

Comment: Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM 2012), 4-8 June 2012, Dublin, Ireland.
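As one concrete example of such a readability index, the sketch below computes the standard Flesch Reading Ease score with a naive syllable counter; the paper's exact feature set is not reproduced here, and the syllable heuristic is only an approximation.

```python
import re

def count_syllables(word: str) -> int:
    """Crude approximation: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

abstract = "Computers understand very little of the meaning of human language."
print(round(flesch_reading_ease(abstract), 1))
```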
A Flexible Shallow Approach to Text Generation
In order to support the efficient development of NL generation systems, two orthogonal methods are currently pursued with emphasis: (1) reusable, general, and linguistically motivated surface realization components, and (2) simple, task-oriented template-based techniques. In this paper we argue that, from an application-oriented perspective, the benefits of both are still limited. In order to improve this situation, we suggest and evaluate shallow generation methods associated with increased flexibility. We advocate a close connection between domain-motivated and linguistic ontologies that supports quick adaptation to new tasks and domains, rather than the reuse of general resources. Our method is especially designed for generating reports with limited linguistic variations.

Comment: LaTeX, 10 pages.
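To make the template-based end of that spectrum concrete, here is a toy report template; the slot names and example facts are invented for illustration, and the paper's shallow methods sit between this and full surface realization.

```python
from string import Template

# A task-oriented template: fast to write, but inflexible across domains.
report = Template("On $date, $station recorded a $trend in $quantity of $value.")

facts = {"date": "12 May", "station": "station A", "trend": "rise",
         "quantity": "ozone concentration", "value": "12%"}
print(report.substitute(facts))
```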