184,743 research outputs found
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalizedis a general way to tap the amorphous low-grade knowledge available
for free on the Internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search-queries. In the
paper introducing the NWD it was called `normalized Google distance (NGD),' but
since Google doesn't allow computer searches anymore, we opt for the more
neutral and descriptive NWD. web distance (NWD) method to determine similarity
between words and phrases. ItComment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
MANULEX: a grade-level lexical database from French elementary school readers.
International audienceThis article presents MANULEX, a Web-accessible database that provides grade-level word frequency lists of nonlemmatized and lemmatized words (48,886 and 23,812 entries, respectively) computed from the 1.9 million words taken from 54 French elementary school readers. Word frequencies are provided for four levels: first grade (G1), second grade (G2), third to fifth grades (G3-5), and all grades (G1-5). The frequencies were computed following the methods described by Carroll, Davies, and Richman (1971) and Zeno, Ivenz, Millard, and Duvvuri (1995), with four statistics at each level (F, overall word frequency; D, index of dispersion across the selected readers; U, estimated frequency per million words; and SFI, standard frequency index). The database also provides the number of letters in the word and syntactic category information. MANULEX is intended to be a useful tool for studying language development through the selection of stimuli based on precise frequency norms. Researchers in artificial intelligence can also use it as a source of information on natural language processing to simulate written language acquisition in children. Finally, it may serve an educational purpose by providing basic vocabulary lists
ATLAS: A flexible and extensible architecture for linguistic annotation
We describe a formal model for annotating linguistic artifacts, from which we
derive an application programming interface (API) to a suite of tools for
manipulating these annotations. The abstract logical model provides for a range
of storage formats and promotes the reuse of tools that interact through this
API. We focus first on ``Annotation Graphs,'' a graph model for annotations on
linear signals (such as text and speech) indexed by intervals, for which
efficient database storage and querying techniques are applicable. We note how
a wide range of existing annotated corpora can be mapped to this annotation
graph model. This model is then generalized to encompass a wider variety of
linguistic ``signals,'' including both naturally occuring phenomena (as
recorded in images, video, multi-modal interactions, etc.), as well as the
derived resources that are increasingly important to the engineering of natural
language processing systems (such as word lists, dictionaries, aligned
bilingual corpora, etc.). We conclude with a review of the current efforts
towards implementing key pieces of this architecture.Comment: 8 pages, 9 figure
Investigating Vector Space Embeddings for Database Schema Management
Text generation in the area of natural language processing as part of the artificial intelligence field has been greatly improving over the last several years. Here we examine the application of vector space word embeddings to provide additional information and context during the text generation process as a way to improve the resultant output through the lens of database normalization. It is known that words encoded into vector space that are closer together in distance generally share meaning or have some semantic or symbolic relationship. This knowledge, paired with the known ability of recurrent neural networks in learning sequences, will be used to examine how vectorizing words can benefit text generation. While the majority of database normalization has been automated, the naming of the generated normalized tables has not. This work seeks to use word embeddings, generated from the data columns of a database table, to give context to a recurrent neural network model while it learns to generate database table names. Using real world data, a recurrent neural network based artificial intelligence model will be paired with a context vector made of word embeddings to observe how effective word embeddings are at providing additional context information during the learning and generation processes. Several methods for generating the context vector will be examined, such as how the word embeddings are generated and how they are combined. The exploration of these methods yielded very promising results in line with the overall goals of the performed work. The benefit of incorporating word embeddings to supply additional information during the text generation process allows for better learning with the goal of generating more human-useful names for newly normalized database tables from their data column titles
SMaTTS: standard malay text to speech system
This paper presents a rule-based text- to- speech
(TTS) Synthesis System for Standard Malay, namely SMaTTS. The
proposed system using sinusoidal method and some pre- recorded
wave files in generating speech for the system. The use of phone
database significantly decreases the amount of computer memory
space used, thus making the system very light and embeddable. The
overall system was comprised of two phases the Natural Language
Processing (NLP) that consisted of the high-level processing of text
analysis, phonetic analysis, text normalization and morphophonemic
module. The module was designed specially for SM to overcome
few problems in defining the rules for SM orthography system before
it can be passed to the DSP module. The second phase is the Digital
Signal Processing (DSP) which operated on the low-level process of
the speech waveform generation. A developed an intelligible and
adequately natural sounding formant-based speech synthesis system
with a light and user-friendly Graphical User Interface (GUI) is
introduced. A Standard Malay Language (SM) phoneme set and an
inclusive set of phone database have been constructed carefully for
this phone-based speech synthesizer. By applying the generative
phonology, a comprehensive letter-to-sound (LTS) rules and a
pronunciation lexicon have been invented for SMaTTS. As for the
evaluation tests, a set of Diagnostic Rhyme Test (DRT) word list was
compiled and several experiments have been performed to evaluate
the quality of the synthesized speech by analyzing the Mean Opinion
Score (MOS) obtained. The overall performance of the system as
well as the room for improvements was thoroughly discussed
- …