Multi-view video plus depth representation with saliency depth video
Saliency represents a region where viewers tend to focus more than on other regions of an image or video. Although many saliency models are available, very few exploit saliency based on depth video sequences. This paper proposes a saliency-based depth video constructed by selecting saliency maps and fusing them into the depth video sequences. The proposed saliency-based depth model is used with multi-view video plus depth (MVD) and compressed using the latest High Efficiency Video Coding (HEVC) compression method. The proposed method showed a notable quality improvement in the virtual view video compared to other saliency models such as the frequency-tuned saliency model.
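The abstract does not spell out how the selected saliency maps are fused into the depth frames. The snippet below is a minimal Python sketch assuming a simple per-pixel weighted blend, with an illustrative weight `alpha`; it is not the paper's actual fusion rule.

```python
import numpy as np

def fuse_saliency_into_depth(depth, saliency, alpha=0.7):
    """Blend a per-frame saliency map into a depth frame.

    depth, saliency: 2-D float arrays of the same shape, values in [0, 1].
    alpha: weight kept for the original depth (assumed parameter; the
    paper's actual fusion rule is not given in the abstract)."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return alpha * depth + (1.0 - alpha) * s

# Fuse every frame of a short synthetic depth video sequence.
depth_video = np.random.rand(10, 480, 640)
saliency_video = np.random.rand(10, 480, 640)
fused_video = np.stack([fuse_saliency_into_depth(d, s)
                        for d, s in zip(depth_video, saliency_video)])
```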
Stock Trend Prediction With Neural Network Techniques
This thesis presents a study and implementation of stock trend prediction using neural network techniques. The multilayer perceptron (MLP) and radial basis function (RBF) network are compared with a newer technique, the Support Vector Machine (SVM). In this study the stock trend is defined as the maximum excess return from the stock index closing level observed within the next 10 days.
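As a rough illustration of the target definition quoted above, the following sketch computes, for each trading day, the maximum return over the current closing level observed within the next 10 days. This is one reading of the label; "excess" may instead be measured against a benchmark that the abstract does not name.

```python
import pandas as pd

def max_excess_return_label(close: pd.Series, horizon: int = 10) -> pd.Series:
    """Maximum return over the current closing level within the next
    `horizon` trading days (one reading of the thesis' label definition)."""
    # Max of close[t+1 .. t+horizon] for every t, via a reversed rolling window.
    future_max = close[::-1].rolling(horizon, min_periods=1).max()[::-1].shift(-1)
    return (future_max - close) / close

# Toy usage with a short synthetic closing-price series.
prices = pd.Series([100, 102, 101, 105, 107, 103, 110, 108, 112, 111, 115, 113.0])
labels = max_excess_return_label(prices)
print(labels.head())
```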
Fusion of Appearance and Motion Features for Daily Activity Recognition from Egocentric Perspective
Videos from a first-person or egocentric perspective offer a promising tool for recognizing various activities of daily living. In the egocentric perspective, the video is obtained from a wearable camera, which captures the person's activities from a consistent viewpoint. Recognizing activity with a wearable sensor is challenging for various reasons, such as motion blur and large variations. Existing methods are based on extracting handcrafted features from video frames to represent the contents; these features are domain-dependent, so features suitable for a specific dataset may not be suitable for others. In this paper, we propose a novel solution to recognize daily living activities from a pre-segmented video clip. The pre-trained convolutional neural network (CNN) model VGG16 is used to extract visual features from sampled video frames, which are then aggregated by the proposed pooling scheme. The proposed solution combines appearance and motion features extracted from video frames and optical flow images, respectively. Mean and max spatial pooling (MMSP) and max mean temporal pyramid (TPMM) pooling are proposed to compose the final video descriptor. The descriptor is fed to a linear support vector machine (SVM) to recognize the type of activity observed in the video clip. The proposed solution was evaluated on three public benchmark datasets, with studies showing the advantage of aggregating appearance and motion features for daily activity recognition. The results show that the proposed solution is promising for recognizing activities of daily living. Compared to several methods on the three public datasets, the proposed MMSP–TPMM method produces higher classification performance in terms of accuracy (90.38% on the LENA dataset, 75.37% on the ADL dataset, 96.08% on the FPPA dataset) and average per-class precision (AP) (58.42% on the ADL dataset and 96.11% on the FPPA dataset).
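The exact MMSP and TPMM formulations are not given in the abstract. The sketch below shows a simplified, single-level version of the idea, using random arrays in place of VGG16 convolutional feature maps and a linear SVM as the final classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def spatial_mean_max_pool(frame_maps):
    """frame_maps: (T, H, W, C) conv feature maps for T sampled frames.
    Returns (T, 2C): per-frame concatenation of spatial mean and max pooling
    (a simplified reading of the paper's MMSP scheme)."""
    return np.concatenate([frame_maps.mean(axis=(1, 2)),
                           frame_maps.max(axis=(1, 2))], axis=1)

def temporal_max_mean_pool(per_frame):
    """per_frame: (T, D) per-frame descriptors.
    Returns (2D,): temporal max and mean pooling, a single-level stand-in
    for the paper's TPMM pyramid."""
    return np.concatenate([per_frame.max(axis=0), per_frame.mean(axis=0)])

# Toy example: random "VGG16 block5" maps for 4 clips of 8 frames each.
rng = np.random.default_rng(0)
X = np.stack([temporal_max_mean_pool(spatial_mean_max_pool(
        rng.random((8, 7, 7, 512)))) for _ in range(4)])
y = np.array([0, 1, 0, 1])                 # toy activity labels
clf = LinearSVC().fit(X, y)                # linear SVM on video descriptors
```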
Content-based image retrieval system for marine life images using gradient vector flow
Content-Based Image Retrieval (CBIR) has been an active and fast-growing research area in both image processing and data mining. Malaysia is recognized for its rich marine ecosystem. The challenges of these images include low resolution and the need for invariance to translation and other transformations. In this paper, we designed an automated CBIR system to characterize the species for future research. Gradient vector flow (GVF) has been implemented in many image processing applications. Inspired by its fast image restoration algorithms, we applied GVF to marine images. We evaluated different automated segmentation techniques and found that GVF produced better retrieval results than the other automated segmentation techniques.
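GVF is a classical external force field for active contours. The numpy sketch below shows the standard iterative GVF computation in the style of Xu and Prince, which this CBIR system reportedly uses as its segmentation front end; the parameter values are illustrative defaults, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import laplace

def gradient_vector_flow(edge_map, mu=0.2, dt=0.5, iters=80):
    """Minimal gradient vector flow (GVF) field computation.

    edge_map: 2-D float edge-strength image.  mu, dt and the iteration
    count are illustrative, not the settings used in the paper."""
    fy, fx = np.gradient(edge_map)          # edge-map gradients
    mag2 = fx**2 + fy**2                    # gradient magnitude squared
    u, v = fx.copy(), fy.copy()             # initialise the GVF field
    for _ in range(iters):
        u += dt * (mu * laplace(u) - mag2 * (u - fx))
        v += dt * (mu * laplace(v) - mag2 * (v - fy))
    return u, v

# Toy edge map: a bright square on a dark background.
edge = np.zeros((64, 64))
edge[20:44, 20:44] = 1.0
u, v = gradient_vector_flow(edge)
```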
Semantic facial scores and compact deep transferred descriptors for scalable face image retrieval
Face retrieval systems aim to locate the indices of faces identical to a given query face. The performance of these systems relies heavily on careful analysis of different facial attributes (gender, race, etc.), since these attributes can tolerate some degree of geometric distortion, expression, and occlusion. However, employing facial scores alone fails to add scalability. Besides, owing to the discriminative power of CNN (convolutional neural network) features, recent works have employed a complete set of deep transferred CNN features with a large dimensionality to obtain enhancement; yet such systems require high computational power and are very resource-demanding. This study exploits the distinctive capability of semantic facial attributes, with the retrieval results refined by a proposed subset feature selection that reduces the dimensionality of the deep transferred descriptors. The constructed compact deep transferred descriptors (CDTD) not only have a greatly reduced dimension but also more discriminative power. Lastly, we propose a new performance metric tailored to face retrieval, called ImAP (Individual Mean Average Precision), which is used to evaluate the retrieval results. Multiple experiments on two scalable face datasets demonstrated the superior performance of the proposed CDTD model, outperforming state-of-the-art face retrieval results.
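The combination of semantic attribute scores with compact deep descriptors can be sketched as a two-stage lookup: filter candidates by attributes, then rank by descriptor similarity. The code below is our own simplified reading of that pipeline; the paper's actual subset feature selection and its ImAP metric are not reproduced here.

```python
import numpy as np

def retrieve(query_desc, gallery_descs, query_attrs, gallery_attrs, top_k=10):
    """Rank gallery faces by cosine similarity of compact descriptors,
    after restricting the search to faces whose semantic attribute scores
    match the query (a simplified reading of the CDTD pipeline)."""
    # Stage 1: attribute filtering, keep candidates whose binary attributes agree.
    mask = (gallery_attrs == query_attrs).all(axis=1)
    idx = np.where(mask)[0]
    g = gallery_descs[idx]
    # Stage 2: cosine similarity on L2-normalised compact descriptors.
    q = query_desc / np.linalg.norm(query_desc)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    order = np.argsort(-(g @ q))[:top_k]
    return idx[order]

# Toy gallery: 100 faces with 64-D descriptors and 3 binary attributes.
rng = np.random.default_rng(1)
gallery = rng.random((100, 64))
attrs = rng.integers(0, 2, size=(100, 3))
hits = retrieve(gallery[0], gallery, attrs[0], attrs, top_k=5)
```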
Spectrogram-Based Classification of Spoken Foul Language Using Deep CNN
Excessive profanity in audio and video files has been shown to shape one's character and behavior. Currently, conventional methods of manual detection and censorship are used; manual censorship is time-consuming and prone to misdetection of foul language. This paper proposes an intelligent model for foul language censorship through automated and robust detection by deep convolutional neural networks (CNNs). A dataset of foul language was collected and processed to compute audio spectrogram images, which serve as the input for classifying foul language. The proposed model was first tested on a 2-class (foul vs. normal) classification problem; the foul class was then further decomposed into a 10-class classification problem for exact detection of profanity. Experimental results show the viability of the proposed system, demonstrating high performance in curse-word classification with a 1.24-2.71 Error Rate (ER) for the 2-class problem and a 5.49-8.30 F1-score. The proposed ResNet50 architecture outperforms the other models in terms of accuracy, sensitivity, specificity, and F1-score.
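The abstract describes classifying audio spectrogram images with deep CNNs. The sketch below is one plausible pipeline for the 2-class case, assuming librosa for log-mel spectrograms and a torchvision ResNet50 with a 2-class head; the spectrogram settings and input size are our assumptions, not the paper's.

```python
import librosa
import numpy as np
import torch
from torchvision.models import resnet50

def spectrogram_image(path, sr=16000, n_mels=128):
    """Log-mel spectrogram rendered as a 3-channel image tensor.
    Sample rate and mel-band count are assumed values."""
    y, _ = librosa.load(path, sr=sr)
    s = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    s = librosa.power_to_db(s, ref=np.max)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)           # scale to [0, 1]
    img = torch.tensor(s, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
    return torch.nn.functional.interpolate(img.unsqueeze(0),
                                           size=(224, 224)).squeeze(0)

# ResNet50 backbone with a binary foul/normal classification head.
model = resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
# logits = model(spectrogram_image("clip.wav").unsqueeze(0))  # placeholder path
```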
Deep Learning-Based Detection of Inappropriate Speech Content for Film Censorship
Audible content has become an effective tool for shaping one's personality and character, owing to the ease of access to a huge volume of audible content, whether as standalone audio files or as the audio of online videos, movies, and television programs. There is a strong need to filter the inappropriate audible content of easily accessible videos and films that are likely to contain inappropriate speech. With this in view, broadcasting and online video/audio platform companies hire considerable manpower to detect foul voices prior to censorship. This process is costly in terms of manpower, time, and financial resources, and detection of foul voices can be inaccurate owing to fatigue and the weakness of the human visual and auditory systems during long, monotonous tasks. As such, this paper proposes an intelligent deep learning-based system for film censorship through a fast and accurate detection and localization approach using advanced deep convolutional neural networks (CNNs). A dataset of foul language containing isolated word samples and continuous speech was collected, annotated, processed, and analyzed for the development of automated detection of inappropriate speech content. The results indicated the feasibility of the suggested system, which detected a high proportion of inappropriate spoken terms. The proposed system outperformed state-of-the-art baseline algorithms on the novel foul language dataset in terms of macro-average AUC (93.85%), weighted-average AUC (94.58%), and all other metrics, including F1-score. Additionally, the proposed acoustic system outperformed an ASR-based system for profanity detection on evaluation metrics including AUC, accuracy, precision, and F1-score, and it was shown to be faster than manual human screening of audible content for film censorship.
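For the continuous-speech case, the abstract mentions detection and localization but not the mechanism. The sketch below assumes a simple sliding-window scheme in which a window-level classifier (such as the spectrogram CNN sketched earlier) produces per-window foul probabilities that are converted into time-stamped detections; window length, hop, and threshold are illustrative.

```python
import numpy as np

def localize_foul_segments(scores, window_s=1.0, hop_s=0.5, threshold=0.5):
    """Turn per-window 'foul' probabilities into time-stamped detections.

    scores: 1-D array of probabilities, one per analysis window, produced
    by any window-level classifier.  Window and hop lengths are assumed
    values, not the paper's localization scheme."""
    detections = []
    for i, p in enumerate(scores):
        if p >= threshold:
            start = i * hop_s
            detections.append((start, start + window_s, float(p)))
    return detections

# Example: windows 3 and 4 exceed the threshold and are flagged for censorship.
print(localize_foul_segments(np.array([0.1, 0.2, 0.3, 0.9, 0.8, 0.2])))
```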
Deep Active Learning for Pornography Recognition Using ResNet
The demand for nudity and pornographic content detection is increasing due to the prevalence of media products containing sexually explicit content, with the Internet being the main source. Recent literature has proven the effectiveness of deep learning techniques for adult image and video detection. However, the requirement for a huge dataset with labeled examples restricts practical use. Several studies have shown that training deep models within an active learning framework can reduce the annotation effort, but this approach had yet to be applied to pornography detection. In this paper, the classification efficiency and annotation requirement of a fine-tuned ResNet50V2 model trained in an active learning framework for pornographic image recognition were explored by comparing the method's performance under three sampling strategies (random sampling, least-confidence sampling, and entropy sampling). The baseline for comparison was a fully supervised learning method. The video frames of the public NPDI dataset were used to run a 5-fold cross-validation. The experiments demonstrated that a similar average test accuracy across the five folds could be obtained with the deep active learning method using only 60% of the training samples labeled, compared to 100% annotated samples in fully supervised learning.
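The three sampling strategies named above can be sketched directly from a model's softmax outputs. The implementation below is a minimal version of our own, not the authors' code; the scoring rules are the standard least-confidence and entropy formulations.

```python
import numpy as np

def select_batch(probs, k, strategy="entropy"):
    """Pick k unlabeled samples to annotate next, given softmax outputs
    `probs` of shape (N, num_classes).  Mirrors the three strategies
    compared in the paper: random, least-confidence, and entropy."""
    n = len(probs)
    if strategy == "random":
        return np.random.choice(n, size=k, replace=False)
    if strategy == "least_confidence":
        score = 1.0 - probs.max(axis=1)               # low top-class confidence
    else:                                             # "entropy"
        score = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-score)[:k]                     # most uncertain first

# Toy usage: pick the 2 most uncertain of 5 unlabeled frames.
p = np.array([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.8, 0.2], [0.55, 0.45]])
print(select_batch(p, k=2, strategy="least_confidence"))
```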
Transfer Detection of YOLO to Focus CNN’s Attention on Nude Regions for Adult Content Detection
Video pornography and nudity detection aims to detect people in videos and classify them as nude or normal for censorship purposes. Recent literature has demonstrated pornography detection utilising a convolutional neural network (CNN) to extract features directly from whole frames and a support vector machine (SVM) to classify the extracted features into the two categories. However, existing methods were not able to detect small-scale pornographic and nude content in frames with diverse backgrounds. This limitation has led to a high false-negative rate (FNR) and misclassification of nude frames as normal ones. To address this, this paper tackles the limitation of existing convolutional-only approaches by focusing the visual attention of the CNN on the expected nude regions inside the frames to reduce the FNR. The You Only Look Once (YOLO) object detector was transferred to the pornography and nudity detection application to detect persons as regions of interest (ROIs), which were then passed to a CNN and SVM for nude/normal classification. Several experiments were conducted to compare the performance of various CNNs and classifiers on our proposed dataset. ResNet101 with a random forest outperformed the other models, with an F1-score of 90.03% and an accuracy of 87.75%. Furthermore, an ablation study was performed to demonstrate the impact of adding the YOLO detector before the CNN: YOLO–CNN was shown to outperform CNN-only in terms of accuracy, which increased from 85.5% to 89.5%. Additionally, a new benchmark dataset with challenging content, including various human sizes and backgrounds, was proposed.
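The ROI-based pipeline described above (YOLO person detections, CNN features, classical classifier) can be sketched as follows. The person boxes are assumed to come from any YOLO-style detector, and ResNet101 with a random forest mirrors the best-reported combination; preprocessing details and hyper-parameters are our assumptions.

```python
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.ensemble import RandomForestClassifier

# ResNet101 backbone as a fixed feature extractor (classification head removed).
backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def roi_features(frame, person_boxes):
    """Crop person ROIs (supplied by any YOLO-style detector) from an RGB
    frame (H, W, 3 uint8) and return one 2048-D ResNet101 feature per ROI.
    `person_boxes` is a list of (x1, y1, x2, y2) pixel boxes."""
    feats = []
    with torch.no_grad():
        for x1, y1, x2, y2 in person_boxes:
            crop = preprocess(frame[y1:y2, x1:x2]).unsqueeze(0)
            feats.append(backbone(crop).squeeze(0).numpy())
    return np.stack(feats)

# Nude/normal classification of ROI features with a random forest.
clf = RandomForestClassifier(n_estimators=200)
# clf.fit(train_features, train_labels)   # features produced by roi_features(...)
```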