PRACTICAL DATA SCIENCE: EXAMINING THE CORRELATIONS BETWEEN STRUCTURAL AND ELECTRONIC PROPERTIES OF DIFFERENT PHASES OF TiO2 NANOPARTICLES
Abstract: In this work, we analyze the correlations between the structural and electronic properties of anatase, brookite and rutile phase TiO2 nanoparticles (NPs) using data science techniques. For this purpose, we use the geometries of TiO2 NPs in the three phases under heat treatment, obtained from molecular dynamics (MD) simulations with the DFTB+ code. We investigate the relationships between the electronic properties of TiO2 and the order parameter nearest number contacts. In this framework, the correlations among HOMO, LUMO, energy gap, Fermi energy, and [...] have been analyzed. Our results show that there is a moderate negative correlation between [...] and [...] in the brookite and rutile phases, but a strong linear correlation between these two variables in the anatase phase. Additionally, in the brookite phase, the positive linear correlation between [...] and [...] is noteworthy. A moderate linear correlation was observed in the anatase phase and a positive one in the rutile phase. The positive linear dependence of [...] and [...] in the brookite phase is remarkable. No strong correlation was observed in any phase between [...] and [...]. In the brookite phase, [...] has an almost perfect negative correlation with [...].
Keywords: Data science, Statistical learning, Materials Science, Nanoparticles, Data analytics.
Turkish title: PRATİK VERİ BİLİMİ: TiO2 NANOPARTİKÜLLERİNİN FARKLI FAZLARININ YAPISAL VE ELEKTRONİK ÖZELLİKLERİ ARASINDAKİ İLİŞKİLERİN İNCELENMESİ
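The abstract above reports linear correlations between pairs of variables but does not name the statistic used. As a minimal sketch, assuming the Pearson coefficient is meant, the calculation can be illustrated as follows. The function name `pearson_r` and the sample values are hypothetical and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (NOT taken from the paper):
# an order parameter and an energy gap sampled at five MD snapshots.
order_parameter = [5.1, 5.3, 5.6, 5.8, 6.0]
energy_gap_ev = [3.4, 3.3, 3.1, 3.0, 2.9]

r = pearson_r(order_parameter, energy_gap_ev)
print(f"r = {r:.3f}")  # a negative r means the gap falls as the order parameter rises
```

A value of r near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates none. The thresholds for "moderate" and "strong" vary by convention and are not stated in the abstract.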
Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Data collection for scientific applications is increasing exponentially and
is forecasted to soon reach peta- and exabyte scales. Applications which
process and analyze scientific data must be scalable and focus on execution
performance to keep pace. In the field of radio astronomy, in addition to
increasingly large datasets, tasks such as the identification of transient
radio signals from extrasolar sources are computationally expensive. We present
a scalable approach to radio pulsar detection written in Scala that
parallelizes candidate identification to take advantage of in-memory task
processing using Apache Spark on a YARN distributed system. Furthermore, we
introduce a novel automated multiclass supervised machine learning technique
that we combine with feature selection to reduce the time required for
candidate classification. Experimental testing on a Beowulf cluster with 15
data nodes shows that the parallel implementation of the identification
algorithm offers a speedup of up to 5X over a similar multithreaded
implementation. Further, we show that the combination of automated multiclass
classification and feature selection speeds up the execution performance of the
RandomForest machine learning algorithm by an average of 54% with less than a
2% average reduction in the algorithm's ability to correctly classify pulsars.
The generalizability of these results is demonstrated by using two real-world
radio astronomy data sets.
Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages.
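The partition-then-merge structure behind parallel candidate identification, as described in the abstract above, can be sketched in a few lines. This is not the paper's algorithm and it does not use Spark or Scala. A thread pool from the Python standard library stands in for the cluster. The event layout `(time_s, dm, snr)`, the function names, the fixed time windows, and the SNR threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_window(events, window):
    """Group (time_s, dm, snr) events into fixed-width time windows."""
    parts = defaultdict(list)
    for event in events:
        parts[int(event[0] // window)].append(event)
    return [parts[key] for key in sorted(parts)]


def strongest_candidate(partition, min_snr):
    """Return the highest-SNR event in a partition, or None if too weak."""
    best = max(partition, key=lambda e: e[2])
    return best if best[2] >= min_snr else None


def identify_candidates(events, window=1.0, min_snr=6.0, workers=4):
    """Process each partition independently, then merge the results."""
    partitions = partition_by_window(events, window)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: strongest_candidate(p, min_snr), partitions)
        return [r for r in results if r is not None]


# Hypothetical single pulse events: (time in s, dispersion measure, SNR).
events = [
    (0.1, 10.0, 5.0),
    (0.4, 10.5, 8.2),
    (1.2, 30.0, 4.0),
    (2.3, 55.0, 7.1),
    (2.9, 55.5, 6.5),
]
print(identify_candidates(events))
```

Because each partition is processed without reference to the others, the same structure maps onto a distributed framework, where partitions live on different nodes. In CPython, threads do not speed up CPU-bound work, so this sketch shows only the shape of the computation, not the reported speedup.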
Searching for Needles in the Cosmic Haystack
Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare: in one data set, only 0.05% of the instances represent pulsars. Furthermore, there are many different approaches to candidate classification, with no consensus on a best practice. This dissertation focuses on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with the Synthetic Minority Over-sampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average, with less than a 2% average reduction in classification performance.
Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
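The emphasis on recall in the abstract above follows from the class imbalance it describes. As a minimal sketch, consider a hypothetical benchmark with the 0.05% pulsar rate quoted above (5 pulsars among 10,000 candidates) and a trivial classifier that labels everything as non-pulsar. The labels and function names below are made up for illustration and are not from the dissertation.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, false positives, true negatives."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def recall(y_true, y_pred):
    """Fraction of actual positives that were found: TP / (TP + FN)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all instances labeled correctly."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical benchmark: 5 pulsars (label 1) among 10,000 candidates.
y_true = [1] * 5 + [0] * 9995
always_negative = [0] * 10000  # a classifier that never predicts "pulsar"

print(accuracy(y_true, always_negative))  # very high, yet useless
print(recall(y_true, always_negative))    # no pulsar is ever found
```

The trivial classifier is right on almost every instance and still misses every pulsar, which is why accuracy alone says little on such benchmarks and recall is the more informative figure.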