18 research outputs found

    Dataset: Gronings

    No full text
    Dataset for evaluation of feature extraction methods for query-by-example spoken term detection with low-resource languages. See the full README at https://github.com/fauxneticien/qbe-std_feats_eval

    Features for QbE-STD: wav2vec 2.0 LibriSpeech 960 Transformer Layer 11 (Archive III: top performing features)

    No full text
    Best-performing features extracted from audio using Transformer layer 11 of the wav2vec 2.0 LibriSpeech 960h model on all 10 datasets (eng-mav, gbb-lg, gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, wbp-jk, wrl-mb, wrm-pd)

    Features for QbE-STD: MFCC, BNF, wav2vec 2.0 LibriSpeech 960 (Archive I)

    No full text
    Features extracted from audio using BNF, MFCC, and wav2vec 2.0 feature extractors for eng-mav, gbb-lg, wbp-jk, and wrl-mb datasets

    Features for QbE-STD, wav2vec 2.0 XLSR-53 (Archive II)

    No full text
    Features extracted from audio using the wav2vec 2.0 feature extractor with the XLSR-53 checkpoint for gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, and wrm-pd datasets

    Features for QbE-STD, wav2vec 2.0 XLSR-53 (Archive I)

    No full text
    Features extracted from audio using wav2vec 2.0 feature extractor with the XLSR-53 checkpoint for eng-mav, gbb-lg, wbp-jk, and wrl-mb datasets.

    Features for QbE-STD: MFCC, BNF, wav2vec 2.0 LibriSpeech 960 (Archive II)

    No full text
    Features extracted from audio using BNF, MFCC, and wav2vec 2.0 feature extractors for gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, and wrm-pd datasets

    Experiment artefacts: DTW search and evaluation

    No full text
    Results from query-by-example spoken term detection (QbE-STD) tasks using a Dynamic Time Warping (DTW) search (see description of procedure and associated experiment code) from the main experiments with 54 feature extraction methods (BNF; MFCC; wav2vec 2.0 English/XLSR Encoder, Quantiser, and Transformer Layers 1-24)

    Textual features and metadata for DBNL novels 1800-2000

    No full text
    This dataset contains a corpus of 1346 novels from DBNL. Included are metadata, word counts, and syntactic features for the novels. The metadata includes variables related to canonicity: library information, secondary references, Wikipedia mentions, etc. The titles have been selected using the following criteria:
    - Novels and novellas
    - Originally written in Dutch
    - First published 1800-2000
    - TEI from titles available on https://www.DBNL.org
    Acknowledgements: Information from libraries was contributed by Trudie Stoutjesdijk and Eddie de Kok from Data Warehouse

    EXCEPTIUS Corpus

    No full text
    EXCEPTIUS Corpus v1.0, containing the following data:
    - raw documents for 21 countries at national level
    - pre-processed data with spacy-udpipe v1.0
    - automatically annotated documents for the identification of exceptional measures at sentence level
    Country list (ISO 3166-1 alpha-2): AT, BE, HR, CY, CZ, DK, FR, DE, HU, IE, IT, LV, LT, NL, NO, PL, SI, SE, CH, UK
    Folder structure: each country has a dedicated folder. Inside each folder you will find the following subfolders:
    - raw_text: the raw text data (.txt format)
    - processed: the output of spacy-udpipe v1.0; each line is a sentence, containing the following info: tokens, lemma, POS, UD dependency relations
    - model: the predictions of the trained model (XML pre@36 as reported in Table 4 of the paper). Each line is a sentence, separated by 9 tabs, one for each exceptional measure class. A value of 1 signals the presence of a class.
    The Italy and Norway folders are missing the predictions of the models