1,087 research outputs found

    Nearest Centroid: A bridge between statistics and machine learning

    Comparing Algorithms for Predictive Data Analytics

    The master's degree thesis is composed of theoretical and practical parts. The theoretical part describes the basics of predictive data analytics and machine learning algorithms for classification such as Logistic Regression, Decision Tree, Random Forest, SVM, and KNN. We also describe different evaluation metrics such as Recall, Precision, Accuracy, F1 Score, Cohen's Kappa, Hamming Loss, and Jaccard Index that are used to measure the performance of these algorithms. Additionally, we record the time taken for the training and prediction processes to provide insights into algorithm scalability. The key part of the master's thesis is the practical part, which compares these algorithms with a self-implemented tool that reports results for the different evaluation metrics on seven datasets. First, we describe the implementation of an application for testing, in which we measure the evaluation metric scores. We tested these algorithms on all seven datasets using Python libraries such as scikit-learn. Finally, we analyze the results obtained and provide final conclusions.
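
    As a minimal sketch of how such metric scores can be obtained with scikit-learn (the library named above), the snippet below computes each metric on two hypothetical label arrays; the data are placeholders, not results from the thesis:

```python
# Minimal sketch: the evaluation metrics named above, computed with scikit-learn.
# y_true / y_pred are hypothetical label arrays, not data from the thesis.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, hamming_loss,
                             jaccard_score)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print("Accuracy     :", accuracy_score(y_true, y_pred))
print("Precision    :", precision_score(y_true, y_pred))
print("Recall       :", recall_score(y_true, y_pred))
print("F1 Score     :", f1_score(y_true, y_pred))
print("Cohen's Kappa:", cohen_kappa_score(y_true, y_pred))
print("Hamming Loss :", hamming_loss(y_true, y_pred))
print("Jaccard      :", jaccard_score(y_true, y_pred))
```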

    Chest x-ray clues to osteoporosis: criteria, correlations, and consistency

    The purpose of this study was to determine whether radiologists could accurately assess osteopenia on chest plain films. Two chest radiologists evaluated lateral chest films from 100 patients (80 female and 20 male), ranging in age from 16 to 86 years, for osteopenia and its associated findings. Intra- and interobserver agreement was determined using weighted kappa statistics, and accuracy was assessed by comparison to bone mineral density as measured by the non-invasive gold standard of dual-energy x-ray absorptiometry (DXA). Overall, radiologists were good at identifying signs of late, but not early, disease. Intraobserver consistency was substantial for fish vertebrae (Kw1=0.638; Kw2=0.712) with moderate interobserver agreement (Kw=0.45). Similarly, for wedged vertebrae, intraobserver consistency was substantial to moderate (Kw1=0.654; Kw2=0.533) with substantial interobserver agreement (Kw=0.622). These radiographic signs correlated with true disease, as shown by high specificity values. Therefore, this study indicates that if osteopenia is suspected (i.e., there is a wedge or fish vertebra) or its associated features are seen on a CXR, it is crucial for radiologists to comment on it. The literature suggests that referring physicians do not pay attention to such findings in radiology reports. Radiologists could effect change in clinical treatment by not burying these findings in the report body, but instead placing them in the impression, along with a recommendation that the finding be followed up with DXA. Because effective interventions for women with osteoporosis exist, the results of this study will contribute to a major change in the practice of chest radiology and improve women's health by preventing the devastating disability associated with osteoporosis.
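
    As an illustration of the weighted kappa statistic used for the agreement analysis, the sketch below computes it with scikit-learn for two hypothetical raters grading an ordinal severity scale; the ratings are invented, not the study data:

```python
# Illustrative only: weighted Cohen's kappa for two raters grading osteopenia
# severity on an ordinal scale (0 = none, 1 = mild, 2 = severe).
# The ratings below are invented for the example, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_1 = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
rater_2 = [0, 1, 1, 1, 0, 2, 2, 0, 2, 1]

kw = cohen_kappa_score(rater_1, rater_2, weights="linear")
print(f"Weighted kappa (linear): {kw:.3f}")
```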

    Developing a machine learning model for tumor cell quantification in standard histology images of lung cancer

    Background: Tumor purity estimation plays a crucial role in genomic profiling and is traditionally carried out manually by pathologists. This manual approach has several disadvantages, including potential inaccuracies due to human error, inconsistency in evaluation criteria among different pathologists, and the time-consuming nature of the process. These issues may be addressed by adopting a digital approach. In this thesis, we employ a machine learning (ML)-based, cell-based classifier to estimate tumor purity in lung cancer tissues. Materials and methods: In this study, conducted as part of the subsequent clinical trial TNM-I, we included 61 patients diagnosed with non-small cell lung cancer (NSCLC). Tumor purity was initially estimated manually by two pathologists. The digital estimation of tumor purity was performed using an ML-based classifier in QuPath. To determine the level of agreement and inter-rater reliability between the two pathologists, as well as between the manual and digital estimations, we computed the Intraclass Correlation Coefficient (ICC) and Cohen's Kappa using SPSS. Results: The ICC when comparing the tumor purity estimations of the two pathologists was 0.833, indicating good reliability. According to Cohen's Kappa, the inter-rater reliability between the pathologists was moderate, with a value of 0.534. The ICC when comparing the manual and digital tumor purity estimations was 0.838, which indicates good reliability. Cohen's Kappa was 0.563, indicating moderate inter-rater reliability between the tumor purity estimations done manually and digitally. All results were statistically significant. Conclusion: In summary, we have successfully developed an ML classifier that estimates tumor purity in lung cancer tissue. Our findings align with previous research and demonstrate strong correlation with traditional detection methods. These results underscore the importance of continued research into enhancing ML-based strategies for tumor purity estimation.
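
    The study computed ICC and Cohen's Kappa in SPSS; as a hedged sketch, comparable agreement statistics could be obtained in Python with the pingouin and scikit-learn packages. The tumor purity values below are invented placeholders, not the study's measurements:

```python
# Sketch: inter-rater agreement between manual and digital tumor purity estimates.
# Purity values are invented placeholders, not the study's measurements.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Long-format table: one row per (case, rater) pair; purity in percent.
df = pd.DataFrame({
    "case":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater":  ["pathologist", "QuPath"] * 6,
    "purity": [60, 55, 30, 35, 80, 75, 45, 50, 70, 65, 20, 25],
})

icc = pg.intraclass_corr(data=df, targets="case", raters="rater", ratings="purity")
print(icc[["Type", "ICC"]])

# Cohen's Kappa on the same estimates binned into 10%-wide ordinal categories.
manual  = [v // 10 for v in df[df.rater == "pathologist"]["purity"]]
digital = [v // 10 for v in df[df.rater == "QuPath"]["purity"]]
print("Cohen's Kappa:", cohen_kappa_score(manual, digital))
```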

    Probabilistic Graphical Models for ERP-Based Brain Computer Interfaces

    An event-related potential (ERP) is an electrical potential recorded from the nervous system of humans or other animals. An ERP is observed after the presentation of a stimulus. Some examples of ERPs are the P300 and the N400, among others. Although ERPs are used very often in neuroscience, their generation is not yet well understood and different theories have been proposed to explain the phenomenon. ERPs could be generated by changes in the alpha rhythm, by an internal neural control that resets the ongoing oscillations in the brain, or by separate and distinct additive neuronal phenomena. When different repetitions of the same stimulus are averaged, a coherent addition of the oscillations is obtained, which explains the increase in amplitude of the signals. Two ERPs are most studied: the N400 and the P300. N400 signals arise when a subject attempts semantic operations that involve neural circuits for explicit memory. N400 potentials have been observed mostly in the rhinal cortex. P300 signals are related to attention and memory operations. When a new stimulus appears, a P300 ERP (named P3a) is generated in the frontal lobe. In contrast, when a subject perceives an expected stimulus, a P300 ERP (named P3b) is generated in the temporal-parietal areas. This implies that P3a and P3b are related, suggesting a circuit pathway between the frontal and temporal-parietal regions whose existence has not been verified. A brain-computer interface (BCI) is a communication system in which the messages or commands that a subject sends to the external world come from brain signals rather than from peripheral nerves and muscles. A BCI uses sensorimotor rhythms or ERP signals, so a classifier is needed to distinguish between correct and incorrect stimuli. In this work, we propose using probabilistic graphical models to model the temporal and spatial dynamics of brain signals, with applications to BCIs. Graphical models were selected for their flexibility and capacity to incorporate prior information; this flexibility has previously been used to model only the temporal dynamics. By including both spatial and temporal information, we expect the model to reflect aspects of brain function related to ERPs.
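
    As a deliberately reduced illustration of a probabilistic (generative) classifier for ERP epochs, and not the spatio-temporal graphical model proposed in the work, the sketch below fits Gaussian class-conditional densities to simulated target vs. non-target epoch features:

```python
# Toy sketch: a generative Gaussian classifier separating target (P300-like)
# epochs from non-target epochs. This is a simple stand-in for the richer
# spatio-temporal graphical models proposed in the work; all data are simulated.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_epochs, n_features = 200, 16        # e.g., 16 time-averaged channel amplitudes

# Simulated epochs: targets carry a small positive deflection on every channel.
X_nontarget = rng.normal(0.0, 1.0, size=(n_epochs, n_features))
X_target    = rng.normal(0.8, 1.0, size=(n_epochs, n_features))
X = np.vstack([X_nontarget, X_target])
y = np.array([0] * n_epochs + [1] * n_epochs)

clf = GaussianNB().fit(X, y)
print("Posterior P(target) for 5 epochs:", clf.predict_proba(X[:5])[:, 1].round(3))
```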

    LightGBM: An Effective and Scalable Algorithm for Prediction of Chemical Toxicity – Application to the Tox21 and Mutagenicity Datasets

    Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because of their faster speed and lower cost compared to experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application to predicting large compound libraries or developing in silico predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherits its high predictivity but resolves its scalability and computational-time limitations by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and the computational time of LightGBM to deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on the publicly available Tox21 and mutagenicity datasets using a Bayesian-optimization-integrated nested 10-fold cross-validation scheme that performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm, offering the best predictive performance while requiring significantly less computational time than the other investigated algorithms across all Tox21 and mutagenicity datasets. We recommend LightGBM for applications in in silico safety assessment and in other areas of cheminformatics to fulfill the ever-growing demand for accurate and rapid prediction of the various toxicity- or activity-related endpoints of the large compound libraries present in the pharmaceutical and chemical industry.
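
    A minimal sketch of training and cross-validating a LightGBM classifier in Python is shown below; the synthetic data, hyperparameters, and single-level 10-fold scheme are placeholders and not the paper's Bayesian-optimization-integrated nested protocol:

```python
# Minimal sketch: a LightGBM binary classifier evaluated with stratified 10-fold CV.
# Data and hyperparameters are placeholders, not the paper's Tox21/mutagenicity setup.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))       # e.g., 100 molecular descriptors per compound
y = rng.integers(0, 2, size=500)      # binary toxicity label

model = LGBMClassifier(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,                    # leaf-wise growth is controlled via num_leaves
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUC over 10 folds: {scores.mean():.3f}")
```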

    A Multi-temporal fusion-based approach for land cover mapping in support of nuclear incident response

    An increasingly important application of remote sensing is to provide decision support during emergency response and disaster management efforts. Land cover maps constitute one such useful product during disaster events; if generated rapidly after a disaster, such map products can contribute to the efficacy of the response effort. In light of recent nuclear incidents, e.g., after the earthquake/tsunami in Japan (2011), our research focuses on constructing rapid and accurate land cover maps of the impacted area in the case of an accidental nuclear release. The methodology integrates classification results from two different remote sensing modalities, namely coarse spatial resolution multi-temporal imagery and fine spatial resolution imagery, to increase classification accuracy. Although advanced methods have been developed for classification using high spatial or temporal resolution imagery, only a limited amount of work has been done on the fusion of these two approaches. The data used included RapidEye and MODIS scenes over the Nine Mile Point Nuclear Power Station in Oswego (New York, USA). The first step in the process was the construction of land cover maps from freely available, high temporal resolution, low spatial resolution MODIS imagery using a time-series approach, exploiting the variability in temporal signatures among different land cover classes. The time-series-specific features were defined by various physical properties of a pixel, such as variation in vegetation cover and water content over time. The pixels were classified into four land cover classes (forest, urban, water, and vegetation) using Euclidean and Mahalanobis distance metrics. In addition, a high spatial resolution commercial satellite, such as RapidEye, can be tasked to capture images over the affected area in the case of a nuclear event; this imagery served as a second source of data to augment results from the time-series approach. The classifications from the two approaches were integrated using an a posteriori probability-based fusion approach, established through a relationship between the classes obtained from the classification of the two data sources. Despite the coarse spatial resolution of MODIS pixels, acceptable accuracies were obtained using time-series features. The overall accuracies of the fusion-based approach were in the neighborhood of 80% when compared with GIS data sets from New York State. The fusion thus refined classification accuracy and offered additional advantages, such as correcting for cloud cover and providing robustness against point-in-time seasonal anomalies, owing to the inclusion of multi-temporal data.
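
    To illustrate the minimum-distance step of the time-series classification, the sketch below assigns a pixel's temporal feature vector to the nearest class mean under the Mahalanobis metric; the class statistics and pixel values are synthetic, and the a posteriori fusion with RapidEye classifications is not shown:

```python
# Sketch: minimum-distance classification of per-pixel temporal feature vectors
# using the Mahalanobis metric. Class statistics and the pixel are synthetic.
import numpy as np

classes = ["forest", "urban", "water", "vegetation"]
n_features = 12                       # e.g., 12 monthly vegetation/water-index values

rng = np.random.default_rng(1)
# Synthetic per-class mean temporal signatures and a shared covariance matrix.
class_means = {c: rng.normal(loc=i, scale=0.5, size=n_features)
               for i, c in enumerate(classes)}
cov = np.cov(rng.normal(size=(200, n_features)), rowvar=False)
cov_inv = np.linalg.inv(cov)

def classify_pixel(x):
    """Return the class whose mean signature is nearest in Mahalanobis distance."""
    def sq_dist(mu):
        diff = x - mu
        return diff @ cov_inv @ diff
    return min(classes, key=lambda c: sq_dist(class_means[c]))

pixel = rng.normal(loc=2.0, scale=0.5, size=n_features)   # synthetic pixel
print("Assigned class:", classify_pixel(pixel))
```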

    Taking Decisions about Information Value
