
    NLP-based Metadata Extraction for Legal Text Consolidation

    The paper describes a system for the automatic consolidation of Italian legislative texts, intended to support editorial consolidation work and dealing with the following types of textual amendment: repeal, substitution, and integration. The focus of the paper is on the semantic analysis of textual amendment provisions and the formalized representation of the amendments in terms of metadata. The proposed approach to consolidation is metadata-oriented and based on Natural Language Processing (NLP) techniques: we use XML-based standards for metadata annotation of legislative acts and a flexible NLP architecture for extracting metadata from parsed texts. An evaluation of the achieved results is also provided.
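    As a rough illustration of the metadata-oriented idea (not the paper's actual NLP pipeline), the sketch below classifies an amendment provision as a repeal, substitution, or integration with simple pattern matching and emits a small XML metadata record; the patterns, element names, and the reference format are all assumptions made for the example.

```python
# Minimal sketch of metadata-oriented amendment extraction (hypothetical
# patterns and element names; not the paper's actual NLP architecture).
import re
from xml.etree.ElementTree import Element, SubElement, tostring

AMENDMENT_PATTERNS = {
    "repeal": re.compile(r"\bis (hereby )?repealed\b", re.I),
    "substitution": re.compile(r"\bis replaced by\b|\bis substituted\b", re.I),
    "integration": re.compile(r"\bis inserted\b|\bthe following .* is added\b", re.I),
}

def extract_amendment_metadata(provision_text: str, target_ref: str) -> bytes:
    """Classify a (normalized) amendment provision and serialize it as a
    small XML metadata record."""
    kind = next(
        (name for name, pat in AMENDMENT_PATTERNS.items() if pat.search(provision_text)),
        "unknown",
    )
    record = Element("amendment", attrib={"type": kind})
    SubElement(record, "target").text = target_ref   # e.g. an article reference
    SubElement(record, "source").text = provision_text
    return tostring(record, encoding="utf-8")

print(extract_amendment_metadata(
    "Article 3 of Law no. 398/1990 is replaced by the following: ...",
    "act:1990-12-07;398#art3",
))
```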

    A Robust Framework for Mining YouTube Data

    YouTube is currently the most popular and successful video sharing website. As YouTube has a broad and profound social impact, YouTube analytics has become a hot research area. The videos on YouTube have become a treasure of data. However, getting access to the immense and massive YouTube data is a challenge. Previous research, studies, and analyses have so far been conducted only on very small volumes of YouTube video data. To date, there exists no mechanism to systematically and continuously collect, process, and store the rich set of YouTube data. This thesis presents a methodology to systematically and continuously mine and store YouTube data. The methodology has two modules: video discovery and video metadata collection. YouTube provides an API to conduct search requests analogous to the search performed by a user on the YouTube website. However, the YouTube API's 'search' operation was not designed to return large volumes of data and only provides limited search results (metadata) that can easily be handled by a human. The proposed discovery process makes the search process scalable, robust, and swift by (a) initially taking a few carefully selected video IDs (seeds) as input from each of the video categories in YouTube (so as to get wider coverage), and (b) later, using each of them to find related videos over multiple generations. The thesis employs in-memory data management in the discovery process to suppress redundancy explosion and rapidly find new videos. Further, a batch-caching mechanism is introduced to ensure that the high-velocity data generated by the discovery process does not result in memory explosion, thereby increasing the reliability of the methodology. The performance of the proposed methodology was gauged over a period of two months. Within two months, 16,000,000 videos were discovered and complete metadata of more than 42,000 videos was mined. The thesis also explores several new dimensions that could be extensions to the proposed framework. The two most prominent dimensions are (a) channel discovery: every YouTube user that has ever made a comment contributes to a channel. A channel can hold hundreds of YouTube videos and related metadata. Discovering channels can speed up video discovery by up to 100-fold; and (b) channel metadata collection: since the volume of videos in channels is massive, a mechanism needs to be developed to use multiple machines running software agents that can collaborate and communicate with each other to collect metadata of billions of videos in a distributed fashion.
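    The multi-generation discovery loop described above can be pictured as a breadth-first expansion from seed video IDs, with an in-memory seen set for redundancy suppression and periodic batch flushes to bound memory. In the sketch below, fetch_related is a placeholder for whatever related-video lookup the API provides, and all names are illustrative rather than the thesis's actual implementation.

```python
# Sketch of seed-based, multi-generation video discovery with in-memory
# de-duplication and batch flushing (illustrative only; fetch_related is a
# placeholder for a real related-video lookup against the YouTube API).
from collections import deque
from typing import Callable, Iterable, List, Set

def discover(seeds: Iterable[str],
             fetch_related: Callable[[str], List[str]],
             generations: int = 3,
             batch_size: int = 10_000,
             flush: Callable[[List[str]], None] = lambda batch: None) -> Set[str]:
    seen: Set[str] = set(seeds)          # suppresses redundancy explosion
    frontier = deque(seeds)
    batch: List[str] = []
    for _ in range(generations):         # expand one generation at a time
        next_frontier: deque = deque()
        while frontier:
            vid = frontier.popleft()
            for related in fetch_related(vid):
                if related not in seen:
                    seen.add(related)
                    next_frontier.append(related)
                    batch.append(related)
                    if len(batch) >= batch_size:
                        flush(batch)     # batch-caching: spill to disk/store
                        batch.clear()
        frontier = next_frontier
    if batch:
        flush(batch)
    return seen
```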

    Corpora compilation for prosody-informed speech processing

    Research on speech technologies requires spoken data, which is usually obtained from recordings of read speech and specifically adapted to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural and conversational speech, which is usually costly and difficult to obtain. This paper presents a machine-learning-oriented toolkit for collecting, handling, and visualizing speech data using prosodic heuristics. We present two corpora resulting from these methodologies: the PANTED corpus, containing 250 h of English speech from TED Talks, and the Heroes corpus, containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep-learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.
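    One simple example of the kind of prosodic heuristic such a toolkit might apply (purely illustrative, not the toolkit's actual code) is segmenting a stream of time-aligned words on long silent pauses before feeding the segments to a punctuation-restoration model; the threshold and data layout below are assumptions.

```python
# Illustrative prosodic heuristic: break time-aligned words into candidate
# sentence-like units wherever the silent gap exceeds a threshold.
from typing import List, Tuple

Word = Tuple[str, float, float]  # (token, start_sec, end_sec)

def segment_by_pause(words: List[Word], min_pause: float = 0.5) -> List[List[str]]:
    segments, current = [], []
    for i, (token, start, end) in enumerate(words):
        current.append(token)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        if next_start is None or next_start - end >= min_pause:
            segments.append(current)
            current = []
    return segments

print(segment_by_pause([("so", 0.0, 0.2), ("yes", 0.3, 0.5), ("right", 1.4, 1.7)]))
# -> [['so', 'yes'], ['right']]
```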

    TEXT-TO-SPEECH CONVERSION (FOR BAHASA MELAYU)

    Text-to-Speech (TTS) is an application that helps users by reading given text aloud. This project focuses on creating a TTS system that reads text in Standard Malay (Bahasa Melayu). The lack of computer-aided learning (CAL) tools that emphasize Malay linguistics, and the misconception that English-based TTS can be used to read Bahasa Melayu text, motivated the development of this project. The end result is a TTS conversion prototype for Bahasa Melayu that reads by syllable, using a syllabification technique based on the Maximum Onset Principle (MOP) and producing syllable-level speech through a syllable-to-sound mapping method.
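    A minimal sketch of syllabification under the Maximum Onset Principle is shown below, assuming a small, simplified inventory of legal Malay onsets; the prototype's actual rules and lexicon may differ.

```python
# Sketch of Maximum Onset Principle (MOP) syllabification for Malay-like
# words. LEGAL_ONSETS is a simplified, assumed onset inventory and the
# vowel set covers only the basic Malay vowel letters.
import re

VOWELS = "aeiou"
LEGAL_ONSETS = {"", "b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n",
                "p", "r", "s", "t", "w", "y", "z", "ng", "ny", "sy", "kh"}

def syllabify(word: str) -> list:
    """Give as many inter-vocalic consonants as legally possible to the
    onset of the following syllable (Maximum Onset Principle)."""
    chunks = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word.lower())
    syllables, onset, i = [], "", 0
    while i < len(chunks):
        if chunks[i][0] not in VOWELS:        # word-initial consonant cluster
            onset, i = chunks[i], i + 1
            continue
        nucleus, coda, next_onset = chunks[i], "", ""
        if i + 1 < len(chunks):
            cluster = chunks[i + 1]
            if i + 2 < len(chunks):           # cluster sits between two vowels
                for k in range(len(cluster)): # longest legal suffix -> next onset
                    if cluster[k:] in LEGAL_ONSETS:
                        coda, next_onset = cluster[:k], cluster[k:]
                        break
                else:
                    coda = cluster
            else:
                coda = cluster                # word-final cluster -> coda
            i += 2
        else:
            i += 1
        syllables.append(onset + nucleus + coda)
        onset = next_onset
    return syllables

print(syllabify("pengajar"))  # -> ['pe', 'nga', 'jar']
print(syllabify("banyak"))    # -> ['ba', 'nyak']
```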

    Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation

    This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, like lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis, training, and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages: the first consists of integrating the relevant manual annotations into the speech recognition data, and the second consists of further enriching the previous output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially used for automatic detection of punctuation marks and for capitalization recovery from speech data, it has also recently been used for studying the characterization of disfluencies in speech. It has already been applied to several domains of Portuguese corpora, as well as to English and Spanish Broadcast News corpora.
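    As a toy illustration of the first stage (integrating manual annotations into recognizer output), the sketch below merges punctuation and capitalization from a reference transcript into time-aligned ASR words after a naive 1:1 alignment, and derives a simple pause feature as a stand-in for prosodic enrichment; the record layout and field names are assumptions, not the framework's actual format.

```python
# Toy illustration of enriching ASR output with information from a manual
# transcript (field names and the 1:1 alignment are simplifying assumptions).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    word: str                          # recognizer word (lowercased, unpunctuated)
    start: float                       # start time in seconds
    end: float                         # end time in seconds
    cap: bool = False                  # capitalization recovered from the manual transcript
    punct_after: Optional[str] = None  # punctuation mark following the word
    pause_after: float = 0.0           # prosodic feature: silence before the next word

def enrich(asr: List[Token], manual_words: List[str]) -> List[Token]:
    for tok, ref in zip(asr, manual_words):      # naive 1:1 alignment
        bare = ref.rstrip(".,?!;:")
        tok.cap = bare[:1].isupper()
        tok.punct_after = ref[len(bare):] or None
    for cur, nxt in zip(asr, asr[1:]):
        cur.pause_after = max(0.0, nxt.start - cur.end)
    return asr

tokens = [Token("hello", 0.0, 0.4), Token("world", 1.1, 1.5)]
print(enrich(tokens, ["Hello", "world."]))
```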

    Questions Generated by Japanese Students of English


    Incorporating Punctuation Into the Sentence Grammar: A Lexicalized Tree Adjoining Grammar Perspective

    Punctuation helps us to structure, and thus to understand, texts. Many uses of punctuation straddle the line between syntax and discourse, because they serve to combine multiple propositions within a single orthographic sentence. They allow us to insert discourse-level relations at the level of a single sentence. Just as people make use of information from punctuation in processing what they read, computers can use information from punctuation in processing texts automatically. Most current natural language processing systems fail to take punctuation into account at all, losing a valuable source of information about the text. Those which do mostly do so in a superficial way, again failing to fully exploit the information conveyed by punctuation. To be able to make use of such information in a computational system, we must first characterize its uses and find a suitable representation for encoding them. The work here focuses on extending a syntactic grammar to handle phenomena occurring within a single sentence which have punctuation as an integral component. Punctuation marks are treated as full-fledged lexical items in a Lexicalized Tree Adjoining Grammar (LTAG), which is an extremely well-suited formalism for encoding punctuation in the sentence grammar. Each mark anchors its own elementary trees and imposes constraints on the surrounding lexical items. I have analyzed data representing a wide variety of constructions, and added treatments of them to the large English grammar which is part of the XTAG system. The advantages of using LTAG are that its elementary units are structured trees of a suitable size for stating the constraints we are interested in, and the derivation histories it produces contain information the discourse grammar will need about which elementary units have been used and how they have been combined. I also consider in detail a few particularly interesting constructions where the sentence and discourse grammars meet: appositives, reported speech, and uses of parentheses. My results confirm that punctuation can be used in analyzing sentences to increase the coverage of the grammar, reduce the ambiguity of certain word sequences, and facilitate discourse-level processing of the texts.
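    A very rough way to picture a punctuation mark anchoring its own elementary tree is as a structured object whose lexical anchor is the mark itself and whose frontier slots carry constraints on the surrounding material. The sketch below is illustrative only; the XTAG grammar's actual tree families, node labels, and feature constraints are far richer.

```python
# Rough sketch of LTAG-style elementary trees anchored by punctuation marks.
# Tree names, node labels, and constraints are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ElementaryTree:
    name: str
    anchor: str                          # lexical anchor (here, a punctuation mark)
    frontier: List[Tuple[str, str]]      # (node label, operation) slots
    constraint: str                      # informal constraint on surrounding items

# Comma pair anchoring an appositive NP that adjoins to a host NP.
appositive = ElementaryTree(
    name="betaNPcommaNPcomma",
    anchor=",",
    frontier=[("NP*", "adjoin"), ("NP", "substitute")],
    constraint="appositive NP must be flanked by commas and follow the host NP",
)

# Colon anchoring an elaboration of the preceding clause.
colon_expansion = ElementaryTree(
    name="betaScolonS",
    anchor=":",
    frontier=[("S*", "adjoin"), ("S", "substitute")],
    constraint="material after the colon elaborates the preceding clause",
)

print(appositive)
```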