
    Soft Biometric Analysis: Multi-Person and Real-Time Pedestrian Attribute Recognition in Crowded Urban Environments

    Traditionally, recognition systems were based only on human hard biometrics. However, ubiquitous CCTV cameras have raised the desire to analyze human biometrics from far distances, without people's participation in the acquisition process. High-resolution face close-shots are rarely available at far distances, so face-based systems cannot provide reliable results in surveillance applications. Human soft biometrics such as body and clothing attributes are believed to be more effective in analyzing human data collected by security cameras. This thesis contributes to human soft biometric analysis in uncontrolled environments and mainly focuses on two tasks: Pedestrian Attribute Recognition (PAR) and person re-identification (re-id). We first review the literature of both tasks and highlight the history of advancements, recent developments, and the existing benchmarks. The difficulties of PAR and person re-id are due to significant distances between intra-class samples, which originate from variations in several factors such as body pose, illumination, background, occlusion, and data resolution. Recent state-of-the-art approaches present end-to-end models that can extract discriminative and comprehensive feature representations from people. Modeling the correlation between different regions of the body and dealing with limited learning data are also the objectives of many recent works. Moreover, class imbalance and correlation between human attributes are specific challenges associated with the PAR problem. We collect a large surveillance dataset to train a novel gender recognition model suitable for uncontrolled environments. We propose a deep residual network that extracts several pose-wise patches from samples and obtains a comprehensive feature representation. In the next step, we develop a model for recognizing multiple attributes at once. 
To account for the correlation between human semantic attributes and for class imbalance, we use a multi-task model and a weighted loss function, respectively. We also propose a multiplication layer on top of the backbone feature extraction layers to exclude background features from the final representation of samples and draw the model's attention to the foreground area. We address the problem of person re-id by implicitly defining the receptive fields of deep learning classification frameworks. The receptive fields of deep learning models determine the most significant regions of the input data for providing correct decisions. Therefore, we synthesize a set of learning data in which the destructive regions (e.g., background) in each pair of instances are interchanged. A segmentation module determines the destructive and useful regions in each sample, and the label of each synthesized instance is inherited from the sample that contributed the useful regions to the synthesized image. The synthesized learning data are then used in the learning phase and help the model rapidly learn that the identity and background regions are not correlated. Meanwhile, the proposed solution can be seen as a data augmentation approach that fully preserves the label information and is compatible with other data augmentation techniques. When re-id methods are trained in scenarios where the target person appears with identical garments in the gallery, the visual appearance of clothes is given the most importance in the final feature representation. Cloth-based representations are not reliable in long-term re-id settings, as people may change their clothes. Therefore, solutions that ignore clothing cues and focus on identity-relevant features are in demand. We transform the original data such that the identity-relevant information of people (e.g., face and body shape) is removed, while the identity-unrelated cues (i.e., color and texture of clothes) remain unchanged. 
A model trained on the synthesized dataset predicts the identity-unrelated cues (short-term features). Therefore, we train a second model, coupled with the first, that learns the embeddings of the original data such that the similarity between the embeddings of the original and synthesized data is minimized. This way, the second model predicts based on the identity-related (long-term) representation of people. To evaluate the performance of the proposed models, we use PAR and person re-id datasets, namely BIODI, PETA, RAP, Market-1501, MSMT-V2, PRCC, LTCC, and MIT, and compare our experimental results with state-of-the-art methods in the field. In conclusion, the data collected from surveillance cameras have low resolution, such that the extraction of hard biometric features is not possible, and face-based approaches produce poor results. In contrast, soft biometrics are robust to variations in data quality. So, we propose approaches for both PAR and person re-id that learn discriminative features from each instance, and we evaluate our proposed solutions on several publicly available benchmarks. This thesis was prepared at the University of Beira Interior, IT Instituto de Telecomunicações, Soft Computing and Image Analysis Laboratory (SOCIA Lab), Covilhã Delegation, and was submitted to the University of Beira Interior for defense in a public examination session.
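
The background-interchange augmentation described in this abstract can be illustrated with a minimal sketch. It assumes images are small nested lists of pixel values and that a binary person mask (1 = useful region, 0 = background) is already available from some segmentation module; the function name `swap_background` and the data layout are hypothetical and are not taken from the thesis.

```python
def swap_background(img_a, mask_a, img_b):
    """Keep the person pixels of img_a and fill everything else from img_b.

    img_a, img_b: 2-D lists of pixel values with the same shape.
    mask_a: 2-D list of 0/1 flags, 1 where img_a shows the person.
    The synthesized image inherits the identity label of img_a, because
    img_a contributes the useful (person) region.
    Simplification: any person pixels of img_b outside mask_a are copied too.
    """
    return [
        [a if keep else b for a, keep, b in zip(row_a, row_m, row_b)]
        for row_a, row_m, row_b in zip(img_a, mask_a, img_b)
    ]


# Toy usage: a 2x2 "person" image, its mask, and a donor background image.
person = [[1, 1], [1, 1]]
mask = [[1, 0], [0, 1]]
donor = [[9, 9], [9, 9]]
synthesized = swap_background(person, mask, donor)
```

Because the label follows the foreground only, the same identity appears against many backgrounds, which is the cue the abstract says helps the model learn that identity and background are uncorrelated.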

    Recognizing Visual Categories by Commonality and Diversity

    Visual categories refer to categories of objects or scenes in the computer vision literature. Building a well-performing classifier for visual categories is challenging because it requires a high level of generalization, as the categories have large within-class variability. We present several methods to build generalizable classifiers for visual categories by exploiting the commonality and diversity of labeled samples and the category definitions to improve category classification accuracy. First, we describe a method to discover and add unlabeled samples from auxiliary sources to categories of interest for building better classifiers. In the literature, given a pool of unlabeled samples, the samples to be added are usually discovered based on low-level visual signatures such as edge statistics, shape, or color by an unsupervised or semi-supervised learning framework. This method is inexpensive as it does not require human intervention, but it generally does not provide useful information for accuracy improvement, as the selected samples are visually similar to the existing set of samples. The samples added by active learning, on the other hand, provide different visual aspects to categories and contribute to learning a better classifier, but are expensive as they need human labeling. To obtain high-quality samples with less annotation cost, we present a method to discover and add samples from unlabeled image pools that are visually diverse but coherent with the category definition by using higher-level visual aspects, captured by a set of learned attributes. The method significantly improves the classification accuracy over the baselines without human intervention. Second, we describe how to learn an ensemble of classifiers that captures both commonly shared information and diversity among the training samples. To learn such ensemble classifiers, we first discover discriminative sub-categories of the labeled samples for diversity. 
We then learn an ensemble of discriminative classifiers with a constraint that minimizes the rank of the stacked matrix of classifiers. The resulting set of classifiers both shares the category-wide commonality and preserves the diversity of subcategories. The proposed ensemble classifier improves recognition accuracy significantly over the baselines and state-of-the-art subcategory-based ensemble classifiers, especially for the challenging categories. Third, we explore the commonality and diversity of semantic relationships of category definitions to improve classification accuracy in an efficient manner. Specifically, our classification model identifies the most helpful relational semantic queries to discriminatively refine the model with a small amount of semantic feedback in interactive iterations. We improve the classification accuracy on challenging categories that have very small numbers of training samples via knowledge transferred from other related categories that have a larger number of training samples by solving a semantically constrained transfer learning optimization problem. Finally, we summarize the ideas presented and discuss possible future work.
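
As a rough illustration of the rank constraint on the stacked classifier matrix mentioned in this abstract (a sketch under assumptions, not the thesis's actual formulation): let $w_k$ be the weight vector of the $k$-th sub-category classifier, $W = [w_1, \dots, w_K]$ the stacked matrix, $\ell_k$ a hypothetical per-classifier training loss, and $\lambda$ an assumed trade-off weight. Since rank is non-convex, a common relaxation replaces it with the nuclear norm $\|W\|_{*}$, the sum of the singular values $\sigma_i(W)$:

```latex
\min_{W = [w_1, \dots, w_K]} \; \sum_{k=1}^{K} \ell_k(w_k) \;+\; \lambda \, \|W\|_{*},
\qquad
\|W\|_{*} = \sum_{i} \sigma_i(W)
```

A low-rank $W$ forces the classifiers into a shared low-dimensional subspace (commonality), while each column $w_k$ is still fitted to its own sub-category (diversity).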

    Semantic knowledge integration for learning from semantically imprecise data

    Low availability of labeled training data often poses a fundamental limit to the accuracy of computer vision applications using machine learning methods. While these methods are improved continuously, e.g., through better neural network architectures, there cannot be a single methodical change that increases the accuracy on all possible tasks. This statement, known as the no free lunch theorem, suggests that we should consider aspects of machine learning other than learning algorithms for opportunities to escape the limits set by the available training data. In this thesis, we focus on two main aspects, namely the nature of the training data, where we introduce structure into the label set using concept hierarchies, and the learning paradigm, which we change in accordance with requirements of real-world applications as opposed to more academic setups. Concept hierarchies represent semantic relations, which are sets of statements such as "a bird is an animal." We propose a hierarchical classifier to integrate this domain knowledge in a pre-existing task, thereby increasing the information the classifier has access to. While the hierarchy's leaf nodes correspond to the original set of classes, the inner nodes are "new" concepts that do not exist in the original training data. However, we posit that such "imprecise" labels are valuable and should occur naturally, e.g., as an annotator's way of expressing their uncertainty. Furthermore, the increased number of concepts leads to more possible search terms when assembling a web-crawled dataset or using an image search. We propose CHILLAX, a method that learns from semantically imprecise training data, while still offering precise predictions to integrate seamlessly into a pre-existing application.
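
To make the idea of a concept hierarchy with imprecise labels concrete, here is a minimal sketch. The tiny hierarchy, the `PARENT` table, and both helper functions are hypothetical illustrations; they are not CHILLAX and not the thesis's classifier. The sketch only shows how an inner-node label such as "bird" implies its ancestors and stays compatible with several precise leaf classes.

```python
# Hypothetical "is-a" hierarchy: child -> parent (None marks the root).
PARENT = {
    "animal": None,
    "bird": "animal",
    "cat": "animal",
    "sparrow": "bird",
    "eagle": "bird",
}


def implied_concepts(label):
    """Return the label and all of its ancestors, e.g. a sparrow is a bird is an animal."""
    chain = []
    while label is not None:
        chain.append(label)
        label = PARENT[label]
    return chain


def leaves_under(label):
    """Return the precise (leaf) classes that an imprecise label is compatible with."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    stack, leaves = [label], []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)   # inner node: keep descending
        else:
            leaves.append(node)  # leaf node: one of the original classes
    return sorted(leaves)
```

For example, an annotator who is unsure of the species can label an image "bird"; `leaves_under("bird")` then lists the precise classes that remain possible, while `implied_concepts("sparrow")` lists every concept a precise label asserts.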

    Deep Learning-based Solutions to Improve Diagnosis in Wireless Capsule Endoscopy

    Deep Learning (DL) models have gained extensive attention due to their remarkable performance in a wide range of real-world applications, particularly in computer vision. This achievement, combined with the increase in available medical records, has opened up new opportunities for analyzing and interpreting healthcare data. This symbiotic relationship can enhance the diagnostic process by identifying abnormalities, patterns, and trends, resulting in more precise, personalized, and effective healthcare for patients. Wireless Capsule Endoscopy (WCE) is a non-invasive medical imaging technique used to visualize the entire Gastrointestinal (GI) tract. To date, physicians meticulously review the captured frames to identify pathologies and diagnose patients. This manual process is time-consuming and prone to errors due to the challenges of interpreting the complex nature of WCE procedures. Thus, it demands a high level of attention, expertise, and experience. To overcome these drawbacks, shorten the screening process, and improve the diagnosis, efficient and accurate DL methods are required. This thesis proposes DL solutions to the following problems encountered in the analysis of WCE studies: pathology detection, anatomical landmark identification, and Out-of-Distribution (OOD) sample handling. These solutions aim to achieve robust systems that minimize the duration of the video analysis and reduce the number of undetected lesions. Throughout their development, several DL drawbacks appeared, including small and imbalanced datasets. These limitations have also been addressed so that they do not hinder the generalization of the neural networks or lead to suboptimal performance and overfitting. To address the WCE problems above and overcome the DL challenges, the proposed systems adopt various strategies that exploit the advantages of Triplet Loss (TL) and Self-Supervised Learning (SSL) techniques. 
Mainly, TL has been used to improve the generalization of the models, while SSL methods have been employed to leverage unlabeled data to obtain useful representations. The presented methods achieve state-of-the-art results in the aforementioned medical problems and contribute to the ongoing research to improve the diagnosis of WCE studies.
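
For readers unfamiliar with the Triplet Loss named in this abstract, the standard formulation is $\max(0,\; d(a, p) - d(a, n) + m)$ for an anchor $a$, a positive $p$ of the same class, a negative $n$ of a different class, and a margin $m$. The sketch below is a generic pure-Python version with Euclidean distance; the function names and the default margin of 1.0 are assumptions for illustration, and the thesis's exact variant and sampling strategy are not given in the abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Well-separated negative: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Negative only slightly farther than the positive: positive penalty.
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

The loss only pulls on triplets that violate the margin, which is one reason it is often paired with small or imbalanced datasets: every informative triplet contributes, regardless of how many samples each class has.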

    Representation learning for few-shot image classification

    As the current state-of-the-art machine learning algorithms, deep neural networks require many examples to perform well on a learning task. Gathering and annotating many samples requires significant human labor, and it is even impossible in most real-world problems such as biomedical data analysis. In the context of computer vision, few-shot image classification aims to capture the human ability to learn new concepts with little supervision. In this respect, the general idea is to transfer knowledge from base categories with more supervision to novel classes with few examples. In particular, current few-shot learning approaches pre-train a model on the available base classes to generalize to the novel classes, perhaps with fine-tuning. However, the generalization of current models is limited because of some assumptions in the pre-training stage and restrictions in the fine-tuning stage. This thesis aims to relax three assumptions of current few-shot learning models, and we propose representation learning for few-shot image classification. First, freezing a pre-trained model looks inevitable in the fine-tuning stage due to the high possibility of overfitting on a few examples. 
Unfortunately, transfer learning with a frozen model limits the model capacity, since the model is not updated with any knowledge of the novel classes. In contrast to freezing a model, we propose associative alignment, which enables fine-tuning and updating the network on novel categories. Specifically, we present two strategies that detect the base categories most related to the novel classes and align the novel classes to them. While the first strategy pushes the distribution of the novel classes toward the center of their related base categories, the second strategy performs distribution matching using an adversarial training algorithm. Overall, our associative alignment aims to prevent overfitting and increase the model capacity by refining the model using novel examples and related base samples. Second, current few-shot learning approaches transfer knowledge to distinctive novel classes under the uni-modal assumption, where all the examples of a single class are represented with a single cluster. Instead, we propose a mixture-based feature space learning (MixtFSL) approach to infer a multi-modal representation. While a previous mixture-model-based work of Allen et al. [1] relies on a classical, non-differentiable clustering method, our MixtFSL is a new end-to-end, multi-modal, and fully differentiable model. MixtFSL captures the multi-modality of base classes without any classical clustering algorithm using a two-stage framework. The first stage, called initial training, aims to learn a preliminary mixture representation with a pair of loss functions. The second stage, called progressive following, then stabilizes the training with a teacher-student training framework using a single loss function. 
Third, unlike current few-shot techniques, which represent each input example with a single feature at the end of the network, we propose a set feature extractor and feature-set matching, which relax the typical single-feature assumption by reasoning on feature sets. Here, we hypothesize that the single-feature assumption is problematic in few-shot image classification since the novel classes are different from the pre-trained base classes. To this end, we propose a new deep learning set feature extractor based on hybrid convolution-attention neural networks. Additionally, we offer three non-parametric set-to-set metrics to infer the class of the given input. This thesis employs several standard benchmarks from the few-shot learning literature and several network backbones to evaluate our proposed methods.
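
The abstract does not name its three set-to-set metrics, so the following is only one plausible example of a non-parametric set-to-set score, written to show what "reasoning on feature sets" can look like. The sum-of-minimum-distances rule, the function names, and the toy 2-D features are all assumptions for illustration and should not be read as the thesis's metrics.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def set_distance(query_set, class_set):
    """For each query feature, take the distance to its closest feature in the
    class set, then sum over the query features (no learned parameters)."""
    return sum(min(sq_dist(q, c) for c in class_set) for q in query_set)


def classify(query_set, support):
    """Pick the class whose support feature set is closest to the query set.

    support: dict mapping a class label to a list of feature vectors.
    """
    return min(support, key=lambda label: set_distance(query_set, support[label]))


# Toy usage: each example is described by a set of two 2-D features.
support = {
    "cat": [[0.0, 0.0], [1.0, 0.0]],
    "dog": [[5.0, 5.0], [6.0, 5.0]],
}
prediction = classify([[0.5, 0.0], [1.0, 0.2]], support)
```

Compared with a single feature vector per image, a set lets different features match different aspects of a class, which is the relaxation the abstract argues for.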

    Distributed Spectral Graph Methods for Analyzing Large-Scale Unstructured Biomedical Data

    There is an ever-expanding body of biological data, growing in size and complexity, outstripping the capabilities of standard database tools or traditional analysis techniques. Examples include molecular dynamics simulations, drug-target interactions, gene regulatory networks, and high-throughput imaging. Large-scale acquisition and curation of biological data has already yielded results in the form of lower costs for genome sequencing and greater coverage in databases such as GenBank, and is viewed as the future of biocuration. The “big data” philosophy and its associated paradigms and frameworks have the potential to uncover solutions to problems otherwise intractable with more traditional investigative techniques. Here, we focus on two biological systems whose data form large, undirected graphs. First, we develop a quantitative model of ciliary motion phenotypes, using spectral graph methods for unsupervised latent pattern discovery. Second, we apply similar techniques to identify a mapping between physiochemical structure and odor percept in human olfaction. In both cases, we experienced computational bottlenecks in our statistical machinery, necessitating the creation of a new analysis framework. At the core of this framework is a distributed hierarchical eigensolver, which we compare directly to other popular solvers. We demonstrate its essential role in enabling the discovery of novel ciliary motion phenotypes and in identifying physiochemical-perceptual associations.
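
To show what a spectral graph method computes, the sketch below finds the Fiedler vector (the eigenvector of the graph Laplacian $L = D - A$ with the second-smallest eigenvalue) of a small undirected graph; its signs split the graph into two loosely connected groups. This is a single-machine, pure-Python power iteration chosen for readability. It is not the distributed hierarchical eigensolver from the abstract, and the function name, iteration count, and toy graph are assumptions.

```python
import math


def fiedler_vector(n, edges, iters=500):
    """Approximate the Fiedler vector of an undirected graph with n nodes.

    Runs power iteration on (shift*I - L), where shift = 2 * max degree is an
    upper bound on the largest Laplacian eigenvalue, and removes the mean at
    every step to project out the constant eigenvector (eigenvalue 0).
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(neigh) for neigh in adj]
    shift = 2.0 * max(deg)

    x = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # (L x)_i = deg_i * x_i - sum of x over the neighbours of i
        y = [shift * x[i] - (deg[i] * x[i] - sum(x[j] for j in adj[i]))
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
vec = fiedler_vector(6, edges)
groups = [v > 0 for v in vec]
```

On graphs with millions of nodes this dense, sequential loop is exactly the kind of bottleneck the abstract describes, which is why a distributed eigensolver becomes necessary.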

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g., sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
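
The evaluation protocol in this abstract perturbs test images with Gaussian and uniform noise at several levels. A minimal sketch of such a perturbation step is shown below, assuming grayscale images stored as nested lists with pixel values in [0, 1]; the function name `add_noise`, the meaning of `level` (standard deviation for Gaussian noise, half-width for uniform noise), and the fixed seed are assumptions, and the CORF operator itself is not implemented here.

```python
import random


def add_noise(image, kind, level, seed=0):
    """Return a noisy copy of `image`, clipped back to the range [0, 1].

    kind:  "gaussian" (zero mean, std = level) or
           "uniform"  (drawn from [-level, +level]).
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible

    def sample():
        if kind == "gaussian":
            return rng.gauss(0.0, level)
        if kind == "uniform":
            return rng.uniform(-level, level)
        raise ValueError("kind must be 'gaussian' or 'uniform'")

    return [[min(1.0, max(0.0, p + sample())) for p in row] for row in image]


# Toy usage on a 2x2 image; real test sets would loop over all images and levels.
clean = [[0.0, 0.5], [1.0, 0.25]]
noisy_gauss = add_noise(clean, "gaussian", 0.1)
noisy_unif = add_noise(clean, "uniform", 0.3)
```

Applying the noise only at test time, as the abstract describes, measures robustness to perturbations the classifier never saw during training.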

    Proceedings of the Seventh Congress of the European Society for Research in Mathematics Education

    This volume contains the Proceedings of the Seventh Congress of the European Society for Research in Mathematics Education (ERME), which took place 9-13 February 2011, at Rzeszów in Poland.