225 research outputs found

    Convolutional Neural Network on Three Orthogonal Planes for Dynamic Texture Classification

    Dynamic Textures (DTs) are sequences of images of moving scenes, such as smoke, vegetation and fire, that exhibit certain stationarity properties in time. The analysis of DTs is important for recognition, segmentation, synthesis and retrieval in a range of applications including surveillance, medical imaging and remote sensing. Deep learning methods have shown impressive results and are now the state of the art for a wide range of computer vision tasks, including image and video recognition and segmentation. In particular, Convolutional Neural Networks (CNNs) have recently proven to be well suited for texture analysis, with a design similar to a filter bank approach. In this paper, we develop a new approach to DT analysis based on a CNN method applied on three orthogonal planes xy, xt and yt. We train CNNs on spatial frames and temporal slices extracted from the DT sequences and combine their outputs to obtain a competitive DT classifier. Our results on a wide range of commonly used DT classification benchmark datasets prove the robustness of our approach. Significant improvement over the state of the art is shown on the larger datasets. Comment: 19 pages, 10 figures
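    The three orthogonal planes can be illustrated with a minimal NumPy sketch; the video tensor, its size, and the fixed slice indices below are hypothetical, and the actual method trains a separate CNN per plane and fuses their outputs:

```python
import numpy as np

# Hypothetical video volume: 16 frames of 64 x 64 grayscale images, axes (t, y, x).
video = np.random.rand(16, 64, 64)

# xy plane: an ordinary spatial frame at a fixed time t.
xy_slice = video[8, :, :]   # shape (64, 64)

# xt plane: a temporal slice at a fixed row y — how each column evolves over time.
xt_slice = video[:, 32, :]  # shape (16, 64)

# yt plane: a temporal slice at a fixed column x.
yt_slice = video[:, :, 32]  # shape (16, 64)
```

Each family of slices would feed its own CNN, with the per-plane class scores combined (e.g., averaged) into the final DT prediction.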

    Time-slice analysis of dyadic human activity

    Recognizing human activities from video data is routinely leveraged for surveillance and human-computer interaction applications. The main focus has been classifying videos into one of k action classes from fully observed videos. However, intelligent systems must make decisions under uncertainty and based on incomplete information. This need motivates us to introduce the problem of analysing the uncertainty associated with human activities and to move to a new level of generality in the action analysis problem. We also present the problem of time-slice activity recognition, which aims to explore human activity at a small temporal granularity. Time-slice recognition is able to infer human behaviours from a short temporal window. It has been shown that temporal slice analysis is helpful for motion characterization and for video content representation in general. These studies motivate us to consider time-slices for analysing the uncertainty associated with human activities. We report to what degree of certainty each activity is occurring throughout the video, from definitely not occurring to definitely occurring. In this research, we propose three frameworks for time-slice analysis of dyadic human activity under uncertainty. i) We present a new family of spatio-temporal descriptors which are optimized for early prediction with time-slice action annotations. Our predictive spatio-temporal interest point (Predict-STIP) representation is based on the intuition of temporal contingency between time-slices. ii) We exploit state-of-the-art techniques to extract interest points in order to represent time-slices. We also present an accumulative uncertainty measure to depict the uncertainty associated with partially observed videos for the task of early activity recognition. iii) We use Convolutional Neural Network-based unary and pairwise relations between human body joints in each time-slice. The unary term captures the local appearance of the joints while the pairwise term captures the local contextual relations between the parts. We extract these features from each frame in a time-slice and examine different temporal aggregations to generate a descriptor for the whole time-slice. Furthermore, we create a novel dataset which is annotated at multiple short temporal windows, allowing the modelling of the inherent uncertainty in time-slice activity recognition. All three methods have been evaluated on the TAP dataset. Experimental results demonstrate the effectiveness of our framework in the analysis of dyadic activities under uncertainty.
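    The temporal aggregation of per-frame features into a single time-slice descriptor can be sketched as follows; the feature dimensions and the choice of mean/max pooling are illustrative assumptions, not the exact aggregations evaluated in the thesis:

```python
import numpy as np

# Hypothetical per-frame feature vectors for one time-slice: 8 frames, 128-d each.
frame_features = np.random.rand(8, 128)

# Two simple temporal aggregations over the frames of the slice:
mean_descriptor = frame_features.mean(axis=0)  # average pooling
max_descriptor = frame_features.max(axis=0)    # max pooling

# Concatenating both pooled views yields one descriptor for the whole time-slice.
slice_descriptor = np.concatenate([mean_descriptor, max_descriptor])
```

A classifier would then score each slice descriptor independently, producing the per-slice certainty estimates the abstract describes.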

    Abnormal behavior detection using sparse representations through sequential generalization of k-means

    The potential capability to automatically detect and classify human behavior as either normal or abnormal events is an important aspect of intelligent monitoring/surveillance systems. This study presents a new high-performance framework for detecting behavioral abnormalities in video streams by utilizing only the patterns of normal behaviors. We use a hybrid descriptor, called foreground optical flow energy (FGOFE), which makes use of two effective motion techniques to extract the most descriptive spatiotemporal features in video sequences. The FGOFE descriptor can effectively capture both weak and sudden incidents in a scene. The sequential generalization of k-means (SGK) algorithm was applied to generate the dictionary set that can sparsely represent each signal; in addition, the orthogonal matching pursuit algorithm was utilized to recover high-dimensional sparse features from a small number of noisy linear measurements. Using SGK yields a simpler and faster implementation than other dictionary learning methods. We conducted comprehensive experiments to analyze and evaluate the ability of our framework to detect abnormalities using several public benchmarks, which contain different abnormal samples and various contextual compositions. The experimental results show that the proposed framework achieves high detection accuracy (up to 95.33%) and low frame processing time (31 ms on average) compared to the relevant related work.
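    To illustrate the sparse-recovery side of the pipeline, here is a toy orthogonal matching pursuit implementation; the dictionary is random rather than SGK-trained, and all sizes and coefficients are invented for the example:

```python
import numpy as np

def omp(D, y, k):
    """Greedy Orthogonal Matching Pursuit: find a k-sparse x with y ~= D @ x.

    D is a dictionary with unit-norm columns, shape (n, n_atoms)."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit on the selected atoms, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

# Toy check: a 2-sparse signal over a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 100))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(100)
x_true[[3, 17]] = [1.5, -2.0]
y = D @ x_true
x_hat = omp(D, y, k=2)
```

In the paper's setting, D would come from SGK training on normal-behavior descriptors, and a large reconstruction residual for a test signal would flag an abnormal event.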

    ConvMixerSeg: Weakly Supervised Semantic Segmentation for CT Liver Images

    The predictive power of modern deep learning approaches is poised to revolutionize the medical imaging field; however, their usefulness and applicability are severely limited by the lack of well-annotated data. Liver segmentation in CT images is an application that could benefit particularly from less data-hungry methods and potentially lead to better liver volume estimation and tumor detection. To this end, we propose a new semantic segmentation model called ConvMixerSeg and experimentally show that it outperforms an FCN with a ResNet-50 backbone when trained to segment livers on a subset of the Liver Tumor Segmentation Benchmark dataset (LiTS). We have further developed a novel Class Activation Map (CAM) based method to train semantic segmentation models with image-level labels without adding parameters. The proposed CAM method includes a Neighborhood Correlation Enforcement module using Gaussian smoothing that reduces part domination and prediction noise. Additionally, our experiments show that the proposed CAM method outperforms the original CAM method for both classification and segmentation with high statistical significance given the same ConvMixerSeg backbone.
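    A minimal sketch of the CAM idea with Gaussian smoothing, assuming a generic classifier: random feature maps and class weights stand in for a trained network, and the smoothing step only illustrates the spirit of neighborhood enforcement, not the authors' exact module:

```python
import numpy as np

# Hypothetical final conv feature maps (C channels, H x W) and one class's
# classifier weights; the classic CAM is the class-weighted sum of the maps.
rng = np.random.default_rng(1)
feature_maps = rng.random((8, 16, 16))  # (C, H, W)
class_weights = rng.random(8)           # one weight per channel

cam = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)

# Gaussian smoothing: a separable convolution with a small 1-D Gaussian kernel
# suppresses isolated peaks so neighbouring pixels get correlated activations.
t = np.arange(-2, 3)
g = np.exp(-t**2 / 2.0)
g /= g.sum()
smoothed = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, cam)
smoothed = np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, smoothed)

# Normalize to [0, 1], then threshold into a pseudo-mask for segmentation training.
smoothed = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min())
mask = smoothed > 0.5
```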

    REVITALIZING WATERFRONT: The Sinking City

    The existing waterfront condition presents a separation between the water and the urban fabric. I propose that this separation become an interactive space between urban and water. A waterfront constructed in this way protects the city from flooding, while the creation of public space, ports, and a water filtration facility will blend the separated condition. Thus, architecture serves as a vehicle for establishing physical and visual connections across the dichotomous edge created by the waterfront amid rapid, stratified urbanization and industrialization.

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Efficient and Accurate Segmentation of Defects in Industrial CT Scans

    Industrial computed tomography (CT) is an elementary tool for the non-destructive inspection of cast light-metal or plastic parts. Comprehensive testing not only helps to ensure the stability and durability of a part; it also reduces the rejection rate by supporting the optimization of the casting process, and saves material (and weight) by enabling equivalent but more filigree structures. With a CT scan it is theoretically possible to locate any defect in the part under examination and to exactly determine its shape, which in turn helps to draw conclusions about its harmfulness. However, most of the time the data quality is not good enough to allow segmenting the defects with simple filter-based methods which operate directly on the gray values, especially when the inspection is expanded to the entire production. In such in-line inspection scenarios, the tight cycle times further limit the available time for the acquisition of the CT scan, which renders the scans noisy and prone to various artifacts. In recent years, dramatic advances in deep learning (and convolutional neural networks in particular) have made even the reliable detection of small objects in cluttered scenes possible. These methods are a promising approach to quickly yield a reliable and accurate defect segmentation even in unfavorable CT scans. The huge drawback: a lot of precisely labeled training data is required, which is utterly challenging to obtain, particularly for the detection of tiny defects in huge, highly artifact-afflicted, three-dimensional voxel data sets. Hence, a significant part of this work deals with the acquisition of precisely labeled training data.
Firstly, we consider facilitating the manual labeling process: our experts annotate high-quality CT scans with high spatial and contrast resolution, and we then transfer these labels to an aligned "normal" CT scan of the same part, which exhibits all the challenging aspects we expect in production use. Nonetheless, due to the indecisiveness of the labeling experts about what to annotate as defective, the labels remain fuzzy. Thus, we additionally explore different approaches to generating artificial training data, for which a precise ground truth can be computed. We find accurate labeling to be crucial for proper training. We evaluate (i) domain randomization, which simulates a super-set of reality with simple transformations, (ii) generative models, which are trained to produce samples of the real-world data distribution, and (iii) realistic simulations, which capture the essential aspects of real CT scans. Here, we develop a fully automated simulation pipeline which provides us with an arbitrary amount of precisely labeled training data. First, we procedurally generate virtual cast parts in which we place plausible artificial casting defects. Then, we realistically simulate CT scans, including typical CT artifacts such as scatter, noise, cupping, and ring artifacts. Finally, we compute a precise ground truth by determining, for each voxel, its overlap with the defect mesh. To determine whether our realistically simulated CT data is eligible to serve as training data for machine learning methods, we compare the prediction performance of learning-based and non-learning-based defect recognition algorithms on the simulated data and on real CT scans. In an extensive evaluation, we compare our novel deep learning method to a baseline of image processing and traditional machine learning algorithms. This evaluation shows how much defect detection benefits from learning-based approaches.
In particular, we compare (i) a filter-based anomaly detection method which finds defect indications by subtracting the original CT data from a generated "defect-free" version, (ii) a pixel-classification method which, based on densely extracted hand-designed features, lets a random forest decide whether an image element is part of a defect, and (iii) a novel deep learning method which combines a U-Net-like encoder-decoder pair of three-dimensional convolutions with an additional refinement step. The encoder-decoder pair yields a high recall, which allows us to detect even very small defect instances. The refinement step yields a high precision by sorting out the false-positive responses. We extensively evaluate these models on our realistically simulated CT scans as well as on real CT scans in terms of their probability of detection, which tells us with which probability a defect of a given size can be found in a CT scan of a given quality, and their intersection over union, which tells us how precise the segmentation mask is in general. While the learning-based methods clearly outperform the image processing method, the deep learning method in particular convinces by its inference speed and its prediction performance on challenging CT scans, as they occur, for example, in in-line scenarios. Finally, we further explore the possibilities and limitations of combining our fully automated simulation pipeline with our deep learning model. Since the deep learning method yields reliable results for CT scans of low data quality, we examine by how much we can reduce the scan time while still maintaining proper segmentation results. Then, we look at the transferability of the promising results to CT scans of parts of different materials and manufacturing techniques, including plastic injection molding, iron casting, additive manufacturing, and composed multi-material parts.
Each of these tasks comes with its own challenges, such as an increased artifact level or different types of defects which are occasionally hard to detect even for the human eye. We tackle these challenges by employing our simulation pipeline to produce virtual counterparts that capture the tricky aspects, and by fine-tuning the deep learning method on this additional training data. With that, we can tailor our approach towards specific tasks, achieving reliable and robust segmentation results even for challenging data. Lastly, we examine whether the deep learning method, based on our realistically simulated training data, can be trained to distinguish between different types of defects (the reason we require a precise segmentation in the first place) and whether it can detect out-of-distribution data for which its predictions become less trustworthy, i.e., whether it can provide an uncertainty estimate.
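    The intersection-over-union metric used in the evaluation can be sketched on toy voxel masks; the volume size and defect placement below are invented for illustration:

```python
import numpy as np

def voxel_iou(pred, truth):
    """Intersection over union of two boolean voxel masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

# Toy volumes: a cubic "defect" ground truth and a prediction shifted by one voxel.
truth = np.zeros((10, 10, 10), dtype=bool)
truth[2:6, 2:6, 2:6] = True   # 4x4x4 defect = 64 voxels
pred = np.zeros((10, 10, 10), dtype=bool)
pred[3:7, 2:6, 2:6] = True    # same size, shifted along the first axis

iou = voxel_iou(pred, truth)  # overlap 3*4*4 = 48, union 64 + 64 - 48 = 80
```

Probability of detection would then be estimated over many such defects by counting how often the predicted mask overlaps a ground-truth defect of a given size at all.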

    Expression Recognition with Deep Features Extracted from Holistic and Part-based Models

    Facial expression recognition aims to accurately interpret facial muscle movements in affective states (emotions). Previous studies have proposed holistic analysis of the face, as well as the extraction of features pertaining only to specific facial regions, for expression recognition. While the latter has classically shown better performance, we here explore this in the context of deep learning. In particular, this work provides a performance comparison of holistic and part-based deep learning models for expression recognition. In addition, we showcase the effectiveness of skip connections, which allow a network to infer from both low- and high-level feature maps. Our results suggest that holistic models outperform part-based models in the absence of skip connections. Finally, based on our findings, we propose a data augmentation scheme, which we incorporate in a part-based model. The proposed multi-face multi-part (MFMP) model leverages the rich information from part-based data augmentation, where we train the network using facial parts extracted from different face samples of the same expression class. Extensive experiments on publicly available datasets show a significant improvement in facial expression classification with the proposed MFMP framework.
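    A skip connection in the sense described above can be sketched as follows; the additive fusion, the shapes, and the single-layer "deep" path are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical low-level feature map and a deeper transformation of it.
rng = np.random.default_rng(2)
low = rng.standard_normal((16, 16))      # early-layer features
W = rng.standard_normal((16, 16)) * 0.1  # toy weights for the deeper path
high = relu(low @ W)                     # high-level features

# The skip connection: downstream layers see low- and high-level maps together.
fused = high + low
```

Concatenation along the channel axis is the other common fusion choice; either way the later layers can draw on both levels of abstraction.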
    • 

    corecore