7 research outputs found

    A User-Guided Bayesian Framework for Ensemble Feature Selection in Life Science Applications (UBayFS)

    Feature selection represents a measure to reduce the complexity of high-dimensional datasets and gain insights into the systematic variation in the data. This aspect is of particular importance in domains that rely on model interpretability, such as the life sciences. We propose UBayFS, an ensemble feature selection technique embedded in a Bayesian statistical framework. Our approach considers two sources of information: data and domain knowledge. We build a meta-model from an ensemble of elementary feature selectors and aggregate this information in a multinomial likelihood. The user guides UBayFS by weighting features and penalizing specific feature blocks or combinations, implemented via a Dirichlet-type prior distribution and a regularization term. In a quantitative evaluation, we demonstrate that our framework (a) allows for a balanced trade-off between user knowledge and data observations, and (b) achieves competitive performance with state-of-the-art methods.
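    To make the counts-plus-prior idea concrete, the following minimal sketch (not the UBayFS package API; the elementary selector, ensemble size, and uniform prior are illustrative assumptions, and constraint handling via the regularization term is omitted) aggregates selections from bootstrapped univariate selectors and combines them with Dirichlet-type prior weights:

```python
# Illustrative sketch (not the UBayFS package API): combine selection counts
# from an ensemble of elementary feature selectors with user-defined
# Dirichlet-type prior weights. Side constraints are omitted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
M, k = 50, 10                      # ensemble size, features kept per selector
rng = np.random.default_rng(0)

counts = np.zeros(X.shape[1])      # multinomial-style selection counts
for _ in range(M):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
    sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
    counts += sel.get_support()

alpha_prior = np.ones(X.shape[1])  # user-defined prior weights (uniform here)
posterior_score = (counts + alpha_prior) / (counts + alpha_prior).sum()
print("top features:", np.argsort(posterior_score)[::-1][:k])
```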

    Beta hebbian learning: definition and analysis of a new family of learning rules for exploratory projection pursuit

    This thesis investigates the derivation of learning rules for artificial neural networks from probabilistic criteria. • Beta Hebbian Learning (BHL). First, a new family of learning rules is derived by maximising the likelihood of the residual of a negative feedback network when that residual is assumed to follow the Beta distribution. The resulting algorithm, Beta Hebbian Learning, outperforms current neural algorithms for Exploratory Projection Pursuit. • Beta-Scale Invariant Map (Beta-SIM). Second, Beta Hebbian Learning is applied to a well-known Topology Preserving Map algorithm, the Scale Invariant Map (SIM), to design a new version called the Beta-Scale Invariant Map (Beta-SIM). It is developed to facilitate the clustering and visualization of the internal structure of high-dimensional complex datasets effectively and efficiently, especially those characterized by an internal radial distribution. The behaviour of Beta-SIM is thoroughly analysed by comparing its results, in terms of performance quality measures, with those of other well-known topology-preserving models. • Weighted Voting Superposition Beta-Scale Invariant Map (WeVoS-Beta-SIM). Finally, the use of ensembles such as the Weighted Voting Superposition (WeVoS) is tested on the novel Beta-SIM algorithm in order to improve its stability and to generate accurate topology maps for complex datasets. The resulting WeVoS-Beta-Scale Invariant Map (WeVoS-Beta-SIM) is presented, analysed, and compared with other well-known topology-preserving models. All algorithms have been successfully tested on artificial datasets to corroborate their properties, as well as on highly complex real datasets.
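    As a rough illustration of the negative-feedback mechanism described above, the sketch below updates projection weights with a Hebbian term driven by the gradient of a Beta log-density of the residual; the data, hyperparameters, and update rule are illustrative assumptions and do not reproduce the exact BHL rule derived in the thesis:

```python
# Minimal illustrative sketch of a negative-feedback projection network whose
# Hebbian update is driven by the residual likelihood (here the gradient of a
# Beta log-density). The exact Beta Hebbian Learning rule derived in the
# thesis differs; this only illustrates the general mechanism.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.05, 0.95, size=(500, 10))    # data scaled into (0, 1)
n_out, eta, alpha, beta = 3, 1e-3, 3.0, 4.0
W = rng.normal(scale=0.1, size=(n_out, X.shape[1]))

for x in X:
    y = W @ x                                   # feed-forward activation
    e = x - W.T @ y                             # negative-feedback residual
    e = np.clip(e, 1e-3, 1 - 1e-3)              # keep residual in Beta support
    f = (alpha - 1) / e - (beta - 1) / (1 - e)  # gradient of Beta log-density
    W += eta * np.outer(y, f)                   # Hebbian-style weight update
```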

    Data- og ekspertdreven variabelseleksjon for prediktive modeller i helsevesenet: mot økt tolkbarhet i underbestemte maskinlæringsproblemer (Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems)

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capabilities to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. 
Experimental evaluations demonstrate favorable predictive performance, stability, and interpretability of results, which carries high potential for better data-driven decision support in clinical practice.
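    A rough sketch of the RENT idea (repeated elastic-net regularized models on subsamples, with features kept by selection frequency) is given below; it is not the published RENT implementation, and the model settings and frequency cutoff are illustrative assumptions:

```python
# Rough sketch of the RENT idea (not the published RENT implementation):
# repeatedly fit elastic-net regularized models on subsamples and keep
# features whose coefficients are non-zero in a large fraction of models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=150, n_features=50, n_informative=8,
                           random_state=0)
K, tau = 100, 0.8                      # ensemble size, selection-frequency cutoff
freq = np.zeros(X.shape[1])

for k in range(K):
    X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.8, random_state=k)
    model = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=0.5, max_iter=5000)
    model.fit(X_sub, y_sub)
    freq += (model.coef_.ravel() != 0)

selected = np.where(freq / K >= tau)[0]
print("stable features:", selected)
```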

    Future Prediction and Unknown Activity Recognition for Nursing Care Applications

    This doctoral thesis builds on the concepts of ubiquitous computing research and aims at applications in the nursing care domain, where the declining birthrate and aging population make the adoption of state-of-the-art technology necessary. The thesis addresses three challenges with machine learning: (Challenge 1) problems that arise outside the care facility, (Challenge 2) monitoring of elderly residents, and (Challenge 3) the shortage of care workers. Chapter 1 defines these challenges, outlines countermeasures, and summarizes the three studies and their contributions: (Study 1) future prediction from the record data of a call center that introduces nursing care facilities, as an investigation of problems arising before facility use; (Study 2) data collection in a care facility and correlation analysis of residents' nighttime and daytime behavior, aiming to predict residents' behavior several hours ahead; and (Study 3) a method for estimating unknown activity classes, aiming at the practical application of sensor-based activity recognition. Chapter 2 reviews ubiquitous computing research, which follows a cycle of collecting real-world data, analyzing it, and applying the results back to the real world, and positions each study within it: Studies 1 and 2 belong to data analysis with "future prediction" models, while Study 3 proposes an "unknown activity recognition" method and belongs to real-world application. All three studies use machine learning, but the models differ with the purpose (prediction, recognition, application) and the data, so each approach is described together with its related work. Chapter 3 presents a prediction model of callers' behavior built from the record data of a call-center service that introduces nursing care facilities. The model combines text analysis with ensemble learning, and it is examined by visualizing the importance of the explanatory variables. The evaluation showed that whether a caller would visit a facility could be predicted with 96.8% accuracy, and the analysis produced about ten useful findings for the call center, the facilities, and the callers, for example that callers without their own means of transportation tend not to visit facilities; as a technical contribution, it demonstrated how record data can be exploited through such analysis and ensemble learning. Chapter 4 presents a study on predicting residents' behavior from data collected by in-bed monitoring devices in a nursing home; specifically, the correlation between daytime and nighttime activity is analyzed with machine learning. The evaluation showed that whether a resident would exercise during the day could be predicted from their late-night sleep with 92% accuracy, and the analysis yielded findings that can improve care services, such as a correlation between daytime exercise and bedtime; as a technical contribution, it demonstrated how data from an IoT bed-monitoring product can be combined with nursing-care records. Chapter 5 presents a study aimed at the practical deployment of activity recognition. Existing activity recognition requires collecting training data for every class, which is costly and an obstacle to automating care records, so a method is proposed for estimating unknown classes for which no training data exist. The evaluation showed up to a 16% improvement in prediction accuracy over existing methods, and, considering the effort of data generation, the proposed method is more practical. This work tackles the practicality of recognizing activities from sensor data, which is expected to be adopted in care settings for privacy and cost reasons, and thus contributes on both the application and the technical side. Chapter 6 gives an overall discussion: current efforts and future prospects in the nursing care domain, the use of data from the research perspective of this thesis beyond nursing care, and, as remaining issues, the small amount of available data and individual differences in human behavior. Chapter 7 concludes the thesis.
    Doctoral dissertation, Kyushu Institute of Technology (degree number: 工博甲第499号; degree conferred March 25, 2020). Chapter 1: Introduction | Chapter 2: Related Work | Chapter 3: Trend Analysis of Nursing-Care-Facility Call-Center Records Using Ensemble Learning | Chapter 4: Sensing Experiments for Correlation Analysis of Sleep and Daily Activities of the Elderly | Chapter 5: A Zero-shot Learning Method for Sensor-Based Activity Recognition | Chapter 6: Overall Discussion | Chapter 7: Conclusion.
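    As a hedged illustration of the Chapter 3 workflow (text-derived features, an ensemble classifier, and inspection of explanatory-variable importance), the toy sketch below uses hypothetical call-center-style records; it is not the thesis's actual data or pipeline:

```python
# Hedged illustration of the Chapter 3 workflow (text features + ensemble
# classifier + importance inspection); the records and labels here are
# hypothetical toys, not the thesis's actual call-center data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

records = ["no car, asked about access", "wants a tour next week",
           "asked about fees only", "family will drive, tour requested"]
visited = [0, 1, 0, 1]                        # toy label: visited a facility?

vec = TfidfVectorizer()
X = vec.fit_transform(records)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, visited)

# inspect which terms drive the prediction
order = np.argsort(clf.feature_importances_)[::-1]
terms = np.array(vec.get_feature_names_out())
print(list(zip(terms[order][:5], clf.feature_importances_[order][:5])))
```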

    Unsupervised Feature Selection with Ensemble Learning

    In this paper, we show that the way internal estimates are used to measure variable importance in Random Forests is also applicable to feature selection in unsupervised learning. We propose a new method, called Random Cluster Ensemble (RCE for short), that estimates out-of-bag feature importance from an ensemble of partitions. Each partition is constructed using a different bootstrap sample and a random subset of the features. We provide empirical results on nineteen benchmark data sets indicating that RCE, boosted with a recursive feature elimination scheme (RFE), can lead to significant improvements in clustering accuracy over several state-of-the-art supervised and unsupervised algorithms, using a very limited subset of features. The method shows promise for dealing with very large domains. All results, datasets, and algorithms are available online.
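    The sketch below illustrates the RCE idea as stated in the abstract (clustering bootstrap samples on random feature subsets and scoring features by an out-of-bag permutation test); the exact importance estimate and the RFE loop of the paper are not reproduced, and all settings are illustrative assumptions:

```python
# Illustrative sketch of the Random Cluster Ensemble (RCE) idea: each ensemble
# member clusters a bootstrap sample on a random feature subset, and
# out-of-bag permutation tests score feature importance. The paper's exact
# importance estimate and RFE loop are not reproduced here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=8, centers=3, random_state=0)
rng = np.random.default_rng(0)
B, m, k = 30, 4, 3                         # ensemble size, subset size, clusters
importance = np.zeros(X.shape[1])

for _ in range(B):
    boot = rng.choice(len(X), size=len(X), replace=True)
    oob = np.setdiff1d(np.arange(len(X)), boot)
    feats = rng.choice(X.shape[1], size=m, replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[boot][:, feats])
    base = km.predict(X[oob][:, feats])
    for j_pos, j in enumerate(feats):      # permute one feature at a time
        X_perm = X[oob][:, feats].copy()
        X_perm[:, j_pos] = rng.permutation(X_perm[:, j_pos])
        importance[j] += np.mean(km.predict(X_perm) != base)

print("feature importance:", importance / B)
```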
