
    Utilizing Wearable Devices To Design Personal Thermal Comfort Model

    Apart from common environmental factors such as relative humidity, radiant temperature, and ambient temperature, studies have confirmed that thermal comfort depends significantly on internal personal parameters such as metabolic rate, age, and health status. This manifests as differences in comfort levels between people residing under the same roof; hence, no general comfort model can satisfy everyone. Current and newly emerging advancements in state-of-the-art wearable technology have made it possible to acquire biometric information continuously. This work proposes to access and exploit such data to build a personal thermal comfort model. Relying on various supervised machine learning methods, a personal thermal comfort model is produced and compared to a general model to demonstrate its superior performance.
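    As a rough illustration of the kind of personal model the abstract describes, the sketch below trains a tiny k-nearest-neighbour classifier on wearable readings (skin temperature, heart rate). The features, labels, and data are invented for illustration and are not taken from the study.

```python
# Minimal sketch (not the paper's method or data): a k-nearest-neighbour
# classifier over synthetic wearable features (skin temperature in deg C,
# heart rate in bpm) predicting one occupant's comfort vote.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of ((skin_temp, heart_rate), label); returns the majority label."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled readings for one occupant.
train = [
    ((31.0, 62), "cool"), ((31.5, 64), "cool"),
    ((33.0, 70), "comfortable"), ((33.4, 72), "comfortable"),
    ((35.0, 85), "warm"), ((35.5, 88), "warm"),
]
print(knn_predict(train, (33.2, 71)))  # expected: comfortable
```

    A per-person training set like this is what makes the model "personal": the same readings could map to different labels for a different occupant.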

    A review of homogenous ensemble methods on the classification of breast cancer data

    In recent decades, data mining technology has been introduced to assist humankind in making informed decisions. Data mining is a set of techniques established by computer scientists to support reliable classification and inference from data. In the medical field, data mining methods can assist in various diagnoses, including breast cancer. To achieve better classification performance, ensemble methods have been proposed; this technique combines multiple classifiers in a single model. This review of homogeneous ensemble methods for breast cancer classification was carried out to identify their overall performance. The results of the reviewed ensemble techniques, such as Random Forest and XGBoost, show that ensemble methods can outperform single-classifier methods. The reviewed ensemble methods have their own pros and cons and are useful for solving breast cancer classification problems; they are discussed thoroughly to examine their overall classification performance.
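    The homogeneous-ensemble idea the review examines can be sketched in a few lines: train several weak learners of the same type on bootstrap resamples and let them vote. Everything below (the single-feature stump, the synthetic data) is an illustrative assumption, not the reviewed pipeline.

```python
# Sketch of a homogeneous ensemble: many copies of one classifier type
# (threshold stumps), each fitted on a bootstrap resample, combined by
# majority vote. Data are synthetic, not a breast cancer dataset.
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-feature threshold classifier at the midpoint of the class means."""
    values0 = [x[0] for x, y in sample if y == 0]
    values1 = [x[0] for x, y in sample if y == 1]
    mean0 = sum(values0) / len(values0) if values0 else 0.0
    mean1 = sum(values1) / len(values1) if values1 else 0.0
    return (mean0 + mean1) / 2.0

def predict_stump(threshold, x):
    """Classify 1 (e.g. malignant) if the feature exceeds the threshold."""
    return 1 if x[0] > threshold else 0

def bagged_ensemble(data, n_stumps=25, seed=0):
    """Homogeneous ensemble: each stump sees its own bootstrap resample."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_stumps)]

def vote(stumps, x):
    """Majority vote across the ensemble."""
    return Counter(predict_stump(t, x) for t in stumps).most_common(1)[0][0]

# Hypothetical one-feature data (label 1 = malignant, larger values).
data = [((v,), 0) for v in (1.0, 1.2, 1.5, 2.0)] + [((v,), 1) for v in (3.0, 3.2, 3.5, 4.0)]
stumps = bagged_ensemble(data)
print(vote(stumps, (3.3,)))  # expected: 1
```

    Random Forest follows the same recipe with full decision trees plus random feature subsampling; boosting methods such as XGBoost instead fit each learner to the errors of the previous ones.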

    Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies

    Background: All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis will actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences.
    Results: The average accuracy based on leave-one-out (LOO) cross-validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprised of 103 amyloidogenic (AM) and 28 non-amyloidogenic sequences. The LOO cross-validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross-validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%.
    Conclusions: This exploratory study indicates that both classification methods may be promising in providing straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and expanding it to include not only more derivatives but also more alignments would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.
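    The leave-one-out protocol reported above is straightforward to sketch; the toy below pairs it with a minimal Bernoulli naive Bayes over binary features. The data and features are synthetic stand-ins, not the immunoglobulin sequences of the study.

```python
# Leave-one-out cross-validation with a toy Laplace-smoothed Bernoulli
# naive Bayes. Features and labels are invented; "AM" = amyloidogenic.
import math

def train_nb(data):
    """Per-class prior and Laplace-smoothed Bernoulli feature likelihoods."""
    model = {}
    for label in {y for _, y in data}:
        rows = [x for x, y in data if y == label]
        prior = len(rows) / len(data)
        likes = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(data[0][0]))]
        model[label] = (prior, likes)
    return model

def predict_nb(model, x):
    def score(prior, likes):
        return math.log(prior) + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, likes))
    return max(model, key=lambda c: score(*model[c]))

def loo_accuracy(data):
    """Hold out each instance in turn, train on the rest, score the prediction."""
    hits = sum(predict_nb(train_nb(data[:i] + data[i + 1:]), x) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

toy = [([1, 1, 0], "AM"), ([1, 0, 0], "AM"), ([1, 1, 1], "AM"),
       ([0, 0, 1], "non-AM"), ([0, 1, 1], "non-AM"), ([0, 0, 0], "non-AM")]
print(loo_accuracy(toy))  # expected: 1.0
```

    LOO is the natural protocol when, as the authors note, the number of usable sequences is small: every instance is tested exactly once without sacrificing training data.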

    Boosting bonsai trees for handwritten/printed text discrimination

    Boosting over decision stumps has proved its efficiency in Natural Language Processing, essentially with symbolic features, and its good properties (fast; few, non-critical parameters; not sensitive to overfitting) could be of great interest in the numeric world of pixel images. In this article we investigate the use of boosting over small decision trees in image classification, for the discrimination of handwritten versus printed text. We then conduct experiments comparing it to the usual SVM-based classification, revealing convincing results: very close performance, but with faster predictions and behaving far less like a black box. These promising results encourage the use of this classifier in more complex recognition tasks such as multiclass problems.
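    Boosting over decision stumps can be sketched as classic AdaBoost; the 1-D toy data below is an illustrative assumption (the article works on features extracted from document images, not on a toy line).

```python
# Classic AdaBoost over decision stumps on a 1-D toy set that no single
# threshold can separate (labels +1/-1). Illustrative only.
import math

def best_stump(points, weights):
    """Pick the (threshold, polarity) stump with the lowest weighted error."""
    best = None
    for thr in sorted({x for x, _ in points}):
        for pol in (1, -1):
            err = sum(w for (x, y), w in zip(points, weights)
                      if (pol if x > thr else -pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(points, rounds=5):
    n = len(points)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, thr, pol = best_stump(points, weights)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, pol))
        # Re-weight: misclassified points gain weight for the next round.
        weights = [w * math.exp(-alpha * y * (pol if x > thr else -pol))
                   for (x, y), w in zip(points, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    score = sum(a * (pol if x > thr else -pol) for a, thr, pol in model)
    return 1 if score > 0 else -1

toy = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
print([predict(adaboost(toy), x) for x, _ in toy])  # expected: [1, 1, -1, -1, 1, 1]
```

    The weighted vote of a handful of stumps fits the + + - - + + pattern that defeats any single stump, which is the property the article exploits with slightly deeper "bonsai" trees.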

    Profiling Instances in Noise Reduction

    The dependency on the quality of the training data has led to significant work in noise reduction for instance-based learning algorithms. This paper presents an empirical evaluation of current noise reduction techniques, not just from the perspective of their comparative performance, but from the perspective of investigating the types of instances that they focus on for removal. A novel instance profiling technique known as RDCL profiling allows the structure of a training set to be analysed at the instance level, categorising each instance based on modelling its local competence properties. This profiling approach offers the opportunity of investigating the types of instances removed by the noise reduction techniques currently in use in instance-based learning. The paper also considers the effect of removing instances with specific profiles from a dataset, and shows that a very simple approach, removing instances that are misclassified by the training set and that cause other instances in the dataset to be misclassified, is an effective noise reduction technique.
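    The first half of that simple approach, removing instances that the rest of the training set misclassifies, can be sketched as Wilson-style editing with a 1-NN rule. The 1-D data and the mislabelled point below are invented for illustration; the paper's RDCL profiling is richer than this.

```python
# Wilson-style editing sketch: drop any training instance whose held-out
# nearest neighbour disagrees with its label. Synthetic 1-D data.
def nn_label(data, x, skip=None):
    """Label of the nearest neighbour of x, optionally skipping one index."""
    best = None
    for i, (xi, yi) in enumerate(data):
        if i == skip:
            continue
        d = abs(xi - x)
        if best is None or d < best[0]:
            best = (d, yi)
    return best[1]

def edit_noise(data):
    """Keep only instances whose held-out nearest neighbour agrees with them."""
    return [(x, y) for i, (x, y) in enumerate(data)
            if nn_label(data, x, skip=i) == y]

# Two clean clusters plus one mislabelled point (2.7 labelled "a" sits in "b" territory).
data = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (2.7, "a"),
        (3.0, "b"), (3.2, "b"), (3.4, "b")]
print(edit_noise(data))  # the (2.7, "a") point is removed
```

    The paper's stronger criterion additionally checks whether the candidate instance causes *other* instances to be misclassified before removing it.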

    Data Driven Approach to Thermal Comfort Model Design

    Apart from the dominant environmental factors such as relative humidity, radiant temperature, and ambient temperature, studies have confirmed that thermal comfort significantly depends on internal personal parameters such as metabolic rate, age, and health status. This study reviews the sensitivity of the Predicted Mean Vote (PMV) thermal comfort model to its environmental and personal parameters for a group of people in a space. The PMV model equations adapted in ASHRAE Standard 55 (Thermal Environmental Conditions for Human Occupancy) are used in this investigation to conduct a parametric study by generating and analyzing multi-dimensional comfort zone plots. It has been found that personal parameters such as metabolic rate and clothing have the highest impact. Current and newly emerging advancements in state-of-the-art wearable technology have made it possible to continuously acquire biometric information. This work proposes to access and exploit this data to build a new, innovative thermal comfort model. Relying on various supervised machine learning methods, a thermal comfort model has been produced and compared to a general model to show its superior performance. Finally, the study presents an architecture for employing the new thermal comfort model in an inexpensive, responsive, and extensible smart home service. Advisor: Fadi Alsalee
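    For reference, the parametric study revolves around Fanger's PMV model; its outer form, as standardized in ISO 7730 and adapted in ASHRAE 55, is:

```latex
% Outer form of Fanger's PMV equation (ISO 7730, adapted in ASHRAE 55).
% M: metabolic rate (W/m^2); L: thermal load on the body, i.e. the difference
% between internal heat production and heat loss to the actual environment.
\mathrm{PMV} = \left(0.303\, e^{-0.036\,M} + 0.028\right) L
```

    The personal parameters the study finds most influential, metabolic rate and clothing insulation, enter through M directly and through the clothing terms inside the thermal load L, which is why the comfort zone plots are so sensitive to them.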

    Data Augmentation Techniques for Natural Language Processing Using Deep Learning-Based Generative Models

    Doctoral dissertation, Department of Computer Science and Engineering, College of Engineering, Seoul National University Graduate School, February 2020. Advisor: Sang-goo Lee.
    Recent advances in the generation capability of deep learning models have spurred interest in utilizing deep generative models for unsupervised generative data augmentation (GDA). Generative data augmentation aims to improve the performance of a downstream machine learning model by augmenting the original dataset with samples generated from a deep latent variable model. This data augmentation approach is attractive to the natural language processing community because (1) there is a shortage of text augmentation techniques that require little supervision and (2) resource scarcity is prevalent. In this dissertation, we explore the feasibility of exploiting deep latent variable models for data augmentation on three NLP tasks: sentence classification, spoken language understanding (SLU), and dialogue state tracking (DST). These represent NLP tasks of various complexities and properties: SLU requires multi-task learning of text classification and sequence tagging, while DST requires the understanding of hierarchical and recurrent data structures. For each of the three tasks, we propose a task-specific latent variable model based on conditional, hierarchical, and sequential variational autoencoders (VAE) for multi-modal joint modeling of linguistic features and the relevant annotations. We conduct extensive experiments to statistically justify our hypothesis that deep generative data augmentation is beneficial for all subject tasks. Our experiments show that deep generative data augmentation is effective for the selected tasks, supporting the idea that the technique can potentially be utilized for a wider range of NLP tasks. Ablation and qualitative studies reveal deeper insight into the underlying mechanisms of generative data augmentation.
    As a secondary contribution, we also shed light on the recurring posterior collapse phenomenon in autoregressive VAEs and, subsequently, propose novel techniques to reduce the model risk, which is crucial for proper training of complex VAE models, enabling them to synthesize better samples for data augmentation. In summary, this work demonstrates and analyzes the effectiveness of unsupervised generative data augmentation in NLP. Ultimately, our approach enables standardized adoption of generative data augmentation, which can be applied orthogonally to existing regularization techniques.
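    The VAE variants underlying all three task-specific models maximize the usual evidence lower bound (ELBO); posterior collapse, discussed above, is the failure mode in which the KL term is driven to zero and the latent variable z is ignored by the decoder:

```latex
% Evidence lower bound maximized by a VAE: reconstruction term minus the
% KL divergence from the approximate posterior q to the prior p(z).
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

    Samples for augmentation are drawn by sampling z from the prior (or a condition-dependent prior) and decoding; if the posterior collapses, those samples carry no usable variation, which is why the dissertation's mitigation techniques matter for augmentation quality.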

    Multispectral in-field sensors observations to estimate corn leaf nitrogen concentration and grain yield using machine learning

    Nitrogen (N) is the most critical fertilizer-applied nutrient for supporting plant growth. It is a key part of photosynthesis as a component of chlorophyll, and hence a key indicator of plant health. In recent years, the rapid development of multispectral sensing technology and machine learning (ML) methods has made it possible to estimate leaf chemical components such as N and to predict yield spatially and temporally. The objective of this study was to compare the relationships between canopy reflectance and corn (Zea mays L.) leaf N concentration acquired by two multispectral sensors: a red-edge multispectral camera mounted on an unmanned aerial vehicle (UAV) and a Crop Circle ACS-430. Four fertilizer N rates were applied, ranging from deficient to excessive, in order to obtain a broad range of plant N status. Spectral information was collected at different phenological stages of corn to calculate vegetation indices (VIs) for each stage, and leaf samples were taken simultaneously to determine N concentration. Different ML methods (Multi-Layer Perceptron (MLP), Support Vector Machines (SVMs), Random Forest regression, regularized regression models, and Gradient Boosting) were used to estimate leaf N% and to predict yield from the VIs. Random Forest regression was also utilized as a feature selection method to choose the best combination of variables for different stages and to interpret the relationships between VIs and corn leaf N concentration and grain yield. The Simplified Canopy Chlorophyll Content Index (SCCCI) and Red-edge Ratio Vegetation Index (RERVI) were selected as the most efficient VIs for leaf N estimation, while SCCCI, the red-edge chlorophyll index (CIRE), RERVI, the Soil Adjusted Vegetation Index (SAVI), and the Normalized Difference Vegetation Index (NDVI) were chosen as the most effective VIs for predicting corn grain yield.
    The results derived from the red-edge multispectral camera showed that SCCCI was the most suitable index for predicting yield at most of the phenological stages, and that Gradient Boosting was the best-fitted model for estimating leaf N%, with a coefficient of determination of 0.80. With the Crop Circle ACS-430, the Support Vector Regression (SVR) model achieved better performance than the other models tested for predicting leaf N concentration.
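    Several of the VIs named above have standard closed forms that can be computed directly from band reflectances. The sketch below uses the common definitions (SCCCI as NDRE normalised by NDVI, CIRE as NIR/RE - 1); the reflectance values are hypothetical, not measurements from the study.

```python
# Standard vegetation-index formulas from band reflectances (fractions 0..1).
# nir = near-infrared, red = red, rededge = red-edge band.
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def ndre(nir, rededge):
    """Normalized Difference Red Edge index."""
    return (nir - rededge) / (nir + rededge)

def sccci(nir, red, rededge):
    """Simplified Canopy Chlorophyll Content Index: NDRE normalised by NDVI."""
    return ndre(nir, rededge) / ndvi(nir, red)

def cire(nir, rededge):
    """Red-edge chlorophyll index: NIR / red-edge reflectance, minus one."""
    return nir / rededge - 1.0

def savi(nir, red, soil_factor=0.5):
    """Soil Adjusted Vegetation Index with the usual soil factor L = 0.5."""
    return (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)

# Hypothetical reflectances for a healthy corn canopy.
nir, red, re = 0.45, 0.05, 0.20
print(round(ndvi(nir, red), 3), round(sccci(nir, red, re), 3))  # expected: 0.8 0.481
```

    Normalising NDRE by NDVI is what lets SCCCI track chlorophyll (and hence N) with less sensitivity to canopy cover, one plausible reason it ranks highly in the study's feature selection.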

    Stories in the sediment : an analysis of phytoplankton pigments within lake sediment to predict and retrodict water quality in New Zealand lakes : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Ecology at Massey University, Manawatū, New Zealand

    When lakes experience an increase in nutrient availability, the phytoplankton and the primary productivity of the lake will also increase. This increase provides a robust means of signifying fluctuations in lake trophic state. Phytoplankton from the surface of the lake (the photic zone) sediment out, leading to the accumulation of both planktonic and benthic phytoplankton remains at the bottom of the water column. These past fluctuations in phytoplankton biomass, which accumulate in the lake sediments, provide indications of past environmental conditions and lake health. This research aimed to assess the potential of phytoplankton pigments preserved within sediments as indicators of lake water quality in both neolimnological and paleolimnological contexts. A dataset of 223 New Zealand lakes (≈ 6% of the lakes in New Zealand) was used for the analysis of surface sediments. These lakes ranged from low elevation (<10 m) to high elevation (up to 1,839 m) and included a range of geomorphic classifications; catchments ranged from 35,288 m² to 704,470,618 m², and depths from shallow (<10 m) to deep (up to 445 m). The research also assessed hyperspectral imaging (HSI) as a method for detecting phytoplankton pigments within sediments. Calibrating chlorophyll-a (chl-a) detected by HSI in lake core sediment samples against chl-a quantified by analytical chemistry methods (high-performance liquid chromatography and spectrophotometry) showed that spectrophotometry without acidification provided more consistent results (with an error rate of less than 7.5%) than spectrophotometry with acidification. Additionally, the use of spectrophotometry without acidification for chl-a calibration revealed the potential for a universal calibration equation to be developed. Within lakes, high trophic levels are positively correlated with cyanobacterial dominance.
    One of the complications of high trophic levels is cyanobacterial blooms, which can be toxic. A reliable pigment indicator for the presence of cyanobacteria is phycocyanin. Therefore, HSI was assessed as an analytical technique for detecting and quantifying the concentration of phycocyanin within lake core sediment samples. This study revealed that phycocyanin could not be detected within the lake sediments examined, suggesting that phycocyanin was not incorporated into the sediment of the lakes assessed, and that the HSI signal thought to detect phycocyanin is potentially measuring chlorophyll-a within the lake core instead. To predict lake water quality through the lake Trophic Level Index (TLI), several machine learning models were created (regression trees, random forest models, and boosted regression trees). The random forest model, built from the quantification of key phytoplankton pigments within surface sediments plus five static lake physical characteristics, was the most accurate (within 10% of the TLI). This model provides a predictive tool to assess lake TLI using a single sample of surface sediment. It was then applied to lake core sediment samples to retrodict lake water quality. Many degraded lakes throughout New Zealand are assumed to have degraded through anthropogenic influence. The retrodicted TLIs suggest that, while anthropogenic influence is exacerbating the degradation of the lakes, prior to this the trophic levels of these lakes did not fluctuate beyond one trophic level (i.e., moving from oligotrophic to mesotrophic). Additionally, apparent in the retrodiction is the relatively recent integration of cyanobacteria indicator pigments into the sediments, coinciding with the arrival of Europeans to the respective areas.