    Process Improvement of LSA for Semantic Relatedness Computing

    Computing semantic relatedness in Tang poetry is critical in many applications, such as search, clustering, and automatic poetry generation. To increase the efficiency and accuracy of semantic relatedness computation, we improved the process of latent semantic analysis (LSA). In this paper, we adopt “representation of words semantic” instead of “words-by-poems” to represent word semantics, based on the finding that words with similar distributions across poetry categories are almost always semantically related. We designed an experiment that obtained segmented words from more than 40000 poems and computed relatedness as the cosine value calculated from the co-occurrence matrix decomposed with the Singular Value Decomposition (SVD) method. The experimental results show that this method works well for analyzing the semantic and emotional relatedness of words in Tang poetry. Associated words and the relevance of poetry categories can also be found by matrix manipulation of the decomposed matrices.
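
    The pipeline this abstract describes (a co-occurrence matrix, SVD, then cosine values) can be illustrated with a minimal sketch. It is not the authors' implementation: the word-by-category layout, the toy counts, the choice of k, and the function name `lsa_relatedness` are all assumptions made for illustration.

```python
import numpy as np

def lsa_relatedness(counts, k):
    """Cosine similarity between word vectors after a rank-k truncated SVD.

    `counts` is assumed to be a words-by-categories co-occurrence matrix
    (rows: words, columns: poetry categories). This layout is an assumption.
    """
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    vecs = U[:, :k] * s[:k]                  # word coordinates in the latent space
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero for empty rows
    unit = vecs / norms
    return unit @ unit.T                     # pairwise cosine values

# Hypothetical toy data: four words counted over three poetry categories.
words = ["moon", "frost", "sword", "horse"]
counts = np.array([
    [8.0, 1.0, 0.0],
    [6.0, 2.0, 0.0],
    [0.0, 1.0, 7.0],
    [1.0, 0.0, 9.0],
])

sim = lsa_relatedness(counts, k=2)
print(f"moon~frost: {sim[0, 1]:.3f}  moon~sword: {sim[0, 2]:.3f}")
```

    In this toy matrix, words distributed similarly across categories (moon and frost) receive a higher cosine value than words with different distributions (moon and sword), which matches the finding the abstract relies on.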

    Segmentation of lecture videos based on text: A method combining multiple linguistic features

    In multimedia-based e-Learning systems, there is a strong need to segment lecture videos into topic units in order to organize the videos for browsing and to provide search capability. Automatic segmentation is highly desirable because of the high cost of manual segmentation. While a lot of research has been conducted on topic segmentation of transcribed spoken text, most attempts rely on domain-specific cues and a formal presentation format, and require extensive training; none of these features exist in lecture videos with unscripted and spontaneous speech. In addition, lecture videos usually have few scene changes, which implies that the visual information that most video segmentation methods rely on is not available. Furthermore, even when there are scene changes, they do not match the topic transitions. In this paper, we make use of the transcribed speech text extracted from the audio track of the video to segment lecture videos into topics. We review related research and propose a new segmentation approach. Our approach utilizes features such as noun phrases and combines multiple content-based and discourse-based features. Our preliminary results show that noun phrases are salient features and that combining multiple features is a promising way to improve segmentation accuracy.

    Verb similarity: comparing corpus and psycholinguistic data

    Similarity, which plays a key role in fields like cognitive science, psycholinguistics and natural language processing, is a broad and multifaceted concept. In this work we analyse how two approaches that belong to different perspectives, the corpus view and the psycholinguistic view, articulate similarity between verb senses in Spanish. Specifically, we compare the similarity between verb senses based on their argument structure, which is captured through semantic roles, with their similarity defined by word associations. We address the question of whether verb argument structure, which reflects the expression of the events, and word associations, which are related to the speakers' organization of the mental lexicon, shape similarity between verbs in a congruent manner, a topic which has not been explored previously. While we find significant correlations between verb sense similarities obtained from these two approaches, our findings also highlight some discrepancies between them and the importance of the degree of abstraction of the corpus annotation and psycholinguistic representations.

    An Innovative Aim for Collecting and Retrieving Documents from Web Domain Using SSARC (Spontaneous Sorting and Retrieving Clock) Algorithm

    This paper presents an algorithm for collecting and grouping documents from the web. In recent years, owing to the wide availability of large document collections and the need to operate on them effectively (for instance, to navigate, analyze, query, and summarize them), there has been an increased emphasis on developing efficient and effective clustering algorithms for large document collections. Our novel algorithm collects all the documents from the web, sorts them in alphabetical order, and stores them in a clockwise structure from which the documents related to the user's query can be retrieved easily. The algorithm is called SSARC, which stands for "Spontaneous Sorting and Retrieving Clock". We propose the overall architecture and describe two innovative algorithms that produce a notable improvement over traditional clustering algorithms and form the basis for query scrutiny and exploration of this algorithm.

    An experiment in determining the semantic similarity of Hungarian words using the Magyar szókincstár

    Research and experiments aimed at determining the semantic similarity of words have grown rapidly over the past decade, following roughly half a century of associative psycholinguistic experiments. The reasons for this growth are the spectacular development of natural language processing technology and the creation of large electronic language databases that are now widely available (monolingual dictionaries, thesauri, corpora, WordNet). In our talk we present an experiment in which we used the Magyar szókincstár [Kiss 1998], more precisely the 42976 synonym sets found under its 2S787 headwords, as the initial linguistic knowledge base for determining the semantic similarity of word pairs (distinguished by their individual sub-senses). We describe how semantic similarity matrices, broken down by part of speech, were built from the similarity scores of the word pairs. We also carried out an experiment on how conceptual domains can be generated automatically from the semantic similarity matrices by applying singular value decomposition (SVD).

    A domain-independent semantic tagger for the study of meaning associations in English text

    A comparison of semantic tagging with syntactic Part-of-Speech tagging leads us to propose that a domain-independent semantic tagger for English corpora should not aim to annotate each word with an atomic 'sem-tag'; instead, semantic tagging should attach to each word a set of semantic primitive attributes or features. These features should include: (1) lemma or root, grouping together inflected and derived forms of the same lexical item; (2) broad subject categories where applicable; (3) selectional restrictions; (4) a meaning definition, stated in terms of a restricted Defining Vocabulary, and processed to remove stoplist-words and repetitions. A semantic tagger meeting this description can be derived from the Longman Dictionary of Contemporary English, if combined with a robust lemmatiser, allowing automated semantic tagging of large English corpora such as LOB and BNC.
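
    The feature set proposed in this abstract can be pictured as a small record attached to each word. The sketch below is illustrative only: the class `SemTag`, the helper names, the stoplist, the toy lemmatiser, and the one-entry lexicon are all hypothetical and are not taken from the Longman Dictionary of Contemporary English or from the authors' tagger.

```python
from dataclasses import dataclass, field

# Hypothetical stoplist; a real tagger would use a much fuller one.
STOPLIST = {"a", "an", "the", "of", "or", "to", "that", "is", "for", "and", "in", "with"}

def definition_features(definition, stoplist=STOPLIST):
    """Reduce a definition to its content words: drop stoplist words and repetitions."""
    seen = set()
    kept = []
    for token in definition.lower().split():
        word = token.strip(".,;:()'\"")
        if word and word not in stoplist and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept

@dataclass
class SemTag:
    """A set of semantic features for one word, instead of one atomic tag."""
    lemma: str
    subject_categories: list = field(default_factory=list)
    selectional_restrictions: list = field(default_factory=list)
    definition_words: list = field(default_factory=list)

def tag_word(word, lexicon, lemmatise):
    """Look up the lemma in a dictionary-like lexicon and build its feature set."""
    lemma = lemmatise(word)
    entry = lexicon.get(lemma)
    if entry is None:
        return SemTag(lemma=lemma)           # unknown word: lemma only
    return SemTag(
        lemma=lemma,
        subject_categories=list(entry["subjects"]),
        selectional_restrictions=list(entry["restrictions"]),
        definition_words=definition_features(entry["definition"]),
    )

# Hypothetical one-entry lexicon and a deliberately naive lemmatiser.
TOY_LEXICON = {
    "river": {
        "subjects": ["geography"],
        "restrictions": ["inanimate"],
        "definition": "a wide natural stream of water that flows to the sea or to a lake",
    },
}

def toy_lemmatise(word):
    lowered = word.lower()
    return lowered[:-1] if lowered.endswith("s") else lowered

tag = tag_word("Rivers", TOY_LEXICON, toy_lemmatise)
print(tag)
```

    Keeping the features in separate fields, rather than collapsing them into a single label, lets later processing compare words on any one attribute, for example on shared definition words.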