3 research outputs found

    PRACTICAL DATA SCIENCE: EXAMINING THE CORRELATIONS BETWEEN STRUCTURAL AND ELECTRONIC PROPERTIES OF DIFFERENT PHASES OF TiO2 NANOPARTICLES

    Abstract: In this work, we use data science techniques to analyze the correlations between the structural and electronic properties of the anatase, brookite, and rutile phases of TiO2 nanoparticles (NPs). For this purpose, we use the geometries of NPs of the three phases under heat treatment, obtained from molecular dynamics (MD) simulations with the DFTB+ code. We investigate the relationships among the electronic properties of TiO2, the order parameter [...], and the nearest number contacts [...]. Within this framework, the correlations among HOMO, LUMO, energy gap [...], Fermi energy [...], [...], and [...] have been analyzed. Our results show a moderate negative correlation between [...] and [...] in the brookite and rutile phases, but a strong linear correlation between these two variables in the anatase phase. Additionally, in the brookite phase, the positive linear correlation between [...] and [...] is noteworthy. A moderate linear correlation was observed in the anatase phase and a positive one in the rutile phase. The positive linear dependence of [...] and [...] in the brookite phase is remarkable. No strong correlation between [...] and [...] was observed in any phase. In the brookite phase, [...] has an almost perfect negative correlation with [...]
    Keywords: Data science, Statistical learning, Materials Science, Nanoparticles, Data analytics.
    Turkish title: PRATİK VERİ BİLİMİ: TiO2 NANOPARTİKÜLLERİNİN FARKLI FAZLARININ YAPISAL VE ELEKTRONİK ÖZELLİKLERİ ARASINDAKİ İLİŞKİLERİN İNCELENMESİ
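The abstract does not say which correlation measure was used. As a minimal sketch, assuming the Pearson coefficient, the pairwise analysis of electronic properties could look like the following. The column names and all numbers are placeholders invented for illustration; they are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Placeholder values only, NOT data from the paper: one entry per MD snapshot.
# Column names are hypothetical; gap_eV is simply lumo_eV - homo_eV.
snapshots = {
    "homo_eV": [-7.9, -7.7, -7.6, -7.4, -7.1],
    "lumo_eV": [-4.1, -4.2, -4.0, -4.3, -4.4],
    "gap_eV": [3.8, 3.5, 3.6, 3.1, 2.7],
}

matrix = correlation_matrix(snapshots)
for (a, b), r in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: r = {r:+.2f}")
```

Running the same function separately on the snapshots of each phase would give one matrix per phase, which is the kind of comparison the abstract describes between anatase, brookite, and rutile.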

    Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

    Data collection for scientific applications is increasing exponentially and is forecast to soon reach peta- and exabyte scales. Applications that process and analyze scientific data must be scalable and focus on execution performance to keep pace. In radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection, written in Scala, that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5x over a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the RandomForest machine learning algorithm by an average of 54%, with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated using two real-world radio astronomy data sets.
    Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages.
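The partition, map, and collect shape of parallel candidate identification can be sketched in a few lines. This is not the authors' Scala/Spark implementation, and the local-maximum detector below is a hypothetical stand-in for their identification algorithm. The function names, the chunk size, and the threshold are all assumptions. Threads in CPython will not give a real CPU speedup here; the sketch only shows how the work is split so that each partition can be processed independently.

```python
from concurrent.futures import ThreadPoolExecutor


def find_peaks(partition, threshold):
    """Return (global_index, value) for strict local maxima >= threshold.

    `partition` is (offset, values); the first and last samples are only
    used as neighbours and are never reported as peaks themselves.
    """
    offset, values = partition
    peaks = []
    for i in range(1, len(values) - 1):
        if values[i] >= threshold and values[i - 1] < values[i] > values[i + 1]:
            peaks.append((offset + i, values[i]))
    return peaks


def split(values, size):
    """Cut `values` into chunks that overlap each neighbour by one sample.

    The one-sample overlap lets a peak on a chunk boundary see both of its
    neighbours while still being reported by exactly one chunk.
    """
    for start in range(0, len(values), size):
        lo = max(0, start - 1)
        hi = min(len(values), start + size + 1)
        yield lo, values[lo:hi]


def identify(values, size, threshold, workers=4):
    """Map the peak finder over all partitions and concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda p: find_peaks(p, threshold), split(values, size))
        return [peak for part in parts for peak in part]


# Toy signal, invented for illustration.
signal = [0, 1, 0, 0, 5, 0, 0, 7, 0, 3, 0, 0]
print(identify(signal, size=4, threshold=3))
```

In a Spark job the chunks would typically be the partitions of a distributed dataset and the per-chunk function would run on the executors, for example through `mapPartitions`; the overlap handling is the part that needs care in either setting.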

    Searching for Needles in the Cosmic Haystack

    Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare; for example, only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification, with no consensus on a best practice. This dissertation focuses on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with the Synthetic Minority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution of RandomForest by 54% on average, with less than a 2% average reduction in classification performance.
    Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
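The core idea behind SMOTE, one of the imbalance treatments named above, is to create synthetic minority samples by interpolating between a minority instance and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step, not the dissertation's pipeline and not a full SMOTE implementation. The function name, the parameters `n_new`, `k`, and `seed`, and the toy points are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy two-feature minority class, invented for illustration.
pulsars = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote(pulsars, n_new=5)
print(len(pulsars) + len(new_points), "minority samples after oversampling")
```

Oversampling of this kind is applied to the training split only; evaluating recall on synthetic points would overstate how well a classifier finds real pulsars.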