    Czech Text Document Corpus v 2.0

    This paper introduces "Czech Text Document Corpus v 2.0", a collection of text documents for automatic document classification in the Czech language. It consists of text documents provided by the Czech News Agency and is freely available for research purposes at http://ctdc.kiv.zcu.cz/. The corpus was created to facilitate straightforward comparison of document classification approaches on Czech data. It is particularly suited to the evaluation of multi-label document classification approaches, because a document is usually assigned more than one label. Besides the document classes, the corpus is also annotated at the morphological layer. The paper further reports the results of selected state-of-the-art methods on this corpus to allow easy comparison with these approaches. Comment: Accepted for LREC 201

    Dialogue Act Recognition using Visual Information

    Cross-border Cooperation Program Czech Republic - Free State of Bavaria ETS Objective 2014-2020 (project no. 211) and by Grant No.SGS-2019-01

    Well-calibrated Confidence Measures for Multi-label Text Classification with a Large Number of Labels

    We extend our previous work on Inductive Conformal Prediction (ICP) for multi-label text classification and present a novel approach for addressing the computational inefficiency of the Label Powerset (LP) ICP, which arises when dealing with a high number of unique labels. We present experimental results using the original and the proposed efficient LP-ICP on two English and one Czech language data-sets. Specifically, we apply the LP-ICP to three deep Artificial Neural Network (ANN) classifiers of two types: one based on contextualised (BERT) and two on non-contextualised (word2vec) word-embeddings. In the LP-ICP setting we assign nonconformity scores to label-sets, from which the corresponding p-values and prediction-sets are determined. Our approach deals with the increased computational burden of LP by eliminating from consideration a significant number of label-sets that will surely have p-values below the specified significance level. This dramatically reduces the computational complexity of the approach while fully respecting the standard CP guarantees. Our experimental results show that the contextualised-based classifier surpasses the non-contextualised-based ones and obtains state-of-the-art performance for all data-sets examined. The good performance of the underlying classifiers is carried over to their ICP counterparts without any significant accuracy loss, but with the added benefits of ICP, i.e. the confidence information encapsulated in the prediction sets. We experimentally demonstrate that the resulting prediction sets can be tight enough to be practically useful even though the set of all possible label-sets contains more than 1e+16 combinations. Additionally, the empirical error rates of the obtained prediction-sets confirm that our outputs are well-calibrated.
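
    The LP-ICP step described above (nonconformity scores for label-sets, then p-values, then a prediction set) can be illustrated with a minimal sketch. It shows only the exhaustive Label Powerset variant, not the efficient elimination of label-sets proposed in the paper. The nonconformity function, the label names, and the calibration scores are assumptions made for illustration; the p-value is the standard ICP formula.

```python
from itertools import combinations


def icp_p_value(calibration_scores, candidate_score):
    """Standard ICP p-value: fraction of scores (calibration plus the
    candidate itself) that are at least as nonconforming as the candidate."""
    at_least = sum(1 for a in calibration_scores if a >= candidate_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def all_label_sets(labels):
    """Enumerate the full label powerset (exponential in len(labels))."""
    for size in range(len(labels) + 1):
        for subset in combinations(labels, size):
            yield frozenset(subset)


def nonconformity(label_probs, label_set):
    """Hypothetical score (an assumption, not the paper's measure): the
    largest disagreement between per-label outputs and the candidate set."""
    return max(
        (1.0 - p) if label in label_set else p
        for label, p in label_probs.items()
    )


def prediction_set(label_probs, calibration_scores, significance):
    """Keep every label-set whose p-value exceeds the significance level."""
    labels = sorted(label_probs)
    return [
        s
        for s in all_label_sets(labels)
        if icp_p_value(calibration_scores, nonconformity(label_probs, s))
        > significance
    ]


if __name__ == "__main__":
    # Made-up calibration scores and classifier outputs.
    calibration = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
    probs = {"sport": 0.9, "politics": 0.2}
    for s in prediction_set(probs, calibration, significance=0.15):
        print(sorted(s))
```

    Because the enumeration visits every label-set, this sketch is feasible only for a handful of labels, which is the inefficiency that the abstract's approach is designed to avoid.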

    Historical Map Toponym Extraction for Efficient Information Retrieval

    ERDF ”Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)” (no.:CZ.02.1.01/0.0/0.0/17 048/0007267), Grant No. SGS-2022-016 ”Advanced methods of data processing and analysis

    Face Recognition under Real-World Conditions

    This thesis deals with automatic face recognition under real-world conditions. Its main goal is to propose a complete face recognition system intended for use by the Czech News Agency (ČTK) for automatic annotation of photographs from its photo bank. The first step is to build a corpus, i.e. a gallery of known faces, from annotated photographs, choosing the images that best represent each person and are suitable for creating a face model. The first contribution of this work is an algorithm for automatic corpus creation; a novel face dataset created with this algorithm is freely available for research purposes. The second step is the recognition itself. Here our contribution is the proposal of several methods based on Gabor wavelets and on the Scale Invariant Feature Transform (SIFT). The methods were tested on the ORL and FERET databases and on the newly created ČTK database. Based on these tests, we chose the adapted SIFT-based Kepenekci method as the best candidate for our system. The final step of the system is a confidence measure, which estimates the probability that the recognition result is correct. We propose a novel two-step confidence measure approach for face recognition. The final outcome of this work is thus a complete face recognition system capable of handling real-world photographs. Discussions about deploying the system at ČTK are currently under way. Department of Computer Science and Engineering. Defended.

    Enhancement of Face Recognition Methods Based on POEM Descriptors

    POEM descriptors are usually used by constructing features in regular, non-overlapping rectangular regions covering the whole image. The features created in the regions are then concatenated into one long vector representing the face image. We propose an enhancement of this method that uses automatically detected key-points: the image features are created at the detected key-points. We also employ a more complex matching procedure that compares the features individually. The method is particularly useful when the number of training samples is small and neural network based methods therefore fail for lack of training data. The proposed approach is evaluated on three standard face corpora. The results show that combining POEM features with automatic key-point identification and a more sophisticated matching algorithm brings a significant improvement over the baseline method.
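
    The difference between comparing one concatenated vector and comparing features individually can be sketched as follows. This is only an illustration under stated assumptions: the chi-square distance, the best-match averaging, and the toy histograms are hypothetical choices, not the matching procedure of the paper.

```python
def chi_square(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def concatenated_distance(features_a, features_b):
    """Baseline-style comparison: concatenate all features into one long
    vector per image and compare the two vectors position by position."""
    vec_a = [v for feature in features_a for v in feature]
    vec_b = [v for feature in features_b for v in feature]
    return chi_square(vec_a, vec_b)


def individual_distance(features_a, features_b):
    """Individual comparison (assumed scheme): match each feature of image A
    to its closest feature in image B and average the best distances."""
    best = [min(chi_square(fa, fb) for fb in features_b) for fa in features_a]
    return sum(best) / len(best)


if __name__ == "__main__":
    # Toy example: the same two features, found in a different order.
    image_a = [[1, 0], [0, 1]]
    image_b = [[0, 1], [1, 0]]
    print(concatenated_distance(image_a, image_b))
    print(individual_distance(image_a, image_b))
```

    In this toy case the concatenated comparison penalises the changed order, while the individual comparison still finds a perfect match for each feature. This is one intuition for why per-feature matching may suit key-points whose positions vary between images.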

    SAPKOS: Experimental Czech Multi-label Document Classification and Analysis System

    Part 6: Classification, Clustering, and Reasoning. This paper presents an experimental multi-label document classification and analysis system called SAPKOS. The system, which integrates state-of-the-art machine learning and natural language processing approaches, is intended to be used by the Czech News Agency (ČTK). Its main purpose is to save human resources in the task of annotating newspaper articles with topics. Another important functionality is the automatic comparison of the ČTK production with popular Czech media. The results of this analysis will be used to adapt the ČTK production so that it better corresponds to today's market requirements. An interesting contribution is that, to the best of our knowledge, no other automatic Czech document classification system exists. The system also achieves very high accuracy, which is due to its unique architecture integrating a maximum entropy based classification engine with a novel confidence measure method.

    Novel Matching Methods for Automatic Face Recognition Using SIFT

    Part 6: Classification Pattern Recognition. This paper addresses Automatic Face Recognition (AFR). The usual methods need a labeled corpus, and the number of training examples plays a crucial role in recognition accuracy. Unfortunately, corpus creation is a very expensive and time-consuming task. The motivation of this work is therefore to propose and implement new AFR approaches that address this issue and perform well even with few training examples. Our approaches extend the successful method based on the Scale Invariant Feature Transform (SIFT) proposed by Aly. We propose and evaluate two methods: the Lenc-Kral matching and the SIFT based Kepenekci approach [7]. Our approaches are evaluated on two face data-sets: the ORL database and the Czech News Agency (ČTK) corpus. We experimentally show that the proposed approaches significantly outperform the baseline Aly method on both corpora.