151 research outputs found

    GRAPH BASED WORD SENSE DISAMBIGUATION FOR CLINICAL ABBREVIATIONS USING APACHE SPARK

    Identification of the correct sense of an ambiguous word is one of the major challenges for language processing in all domains. Word Sense Disambiguation is the task of identifying the correct sense of an ambiguous word by referencing the word's surrounding context. Like narrative documents, clinical documents suffer from ambiguity issues that hamper automatic extraction of the correct sense from the document. In this project, we propose a graph-based solution for word sense disambiguation, based on an algorithm originally implemented by Osmar R. Zaïane et al., focusing specifically on clinical text. The algorithm uses the UMLS Metathesaurus as its source of knowledge. As an enhancement to the existing implementation of the algorithm, this project uses Apache Spark, a Big Data technology, for cluster-based distributed processing and performance optimization.
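The graph algorithm itself is not detailed in the abstract; as a hedged illustration of the underlying task, the following stdlib-only Python sketch scores candidate senses of a clinical abbreviation by gloss–context word overlap (a simplified Lesk-style baseline, not the authors' method; the abbreviation, senses and glosses are invented examples):

```python
# Lesk-style baseline for clinical abbreviation disambiguation.
# NOTE: a simplified stand-in for the graph-based algorithm in the
# abstract; sense names and glosses are invented examples.

def disambiguate(context_words, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_score = None, -1
    for sense, gloss in sense_glosses.items():
        score = len(context & {w.lower() for w in gloss.split()})
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# "RA" in a cardiology note: right atrium vs. rheumatoid arthritis
senses = {
    "right atrium": "chamber of the heart receiving venous blood",
    "rheumatoid arthritis": "autoimmune disease causing joint inflammation",
}
note = "echo shows an enlarged RA and a dilated heart chamber".split()
print(disambiguate(note, senses))  # → right atrium
```

In a Spark version, a scoring function like this would be applied to document partitions in parallel (e.g. via an RDD `map`), which is where the cluster-based speedup would come from.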

    Machine learning and data-parallel processing for viral metagenomics

    More than 2 million cancer cases around the world each year are caused by viruses. In addition, there are epidemiological indications that other cancer-associated viruses may also exist. However, the identification of highly divergent and yet unknown viruses in human biospecimens is one of the biggest challenges in bioinformatics. Modern-day Next Generation Sequencing (NGS) technologies can be used to directly sequence biospecimens from clinical cohorts with unprecedented speed and depth. These technologies can generate billions of bases at rapidly decreasing cost, but current bioinformatics tools are too inefficient to process these massive datasets effectively. Thus, the objective of this thesis was to facilitate both the detection of highly divergent viruses among generated sequences and large-scale analysis of human metagenomic datasets. To re-analyze human sample-derived sequences that were classified as being of “unknown” origin by conventional alignment-based methods, we used a methodology based on profile Hidden Markov Models (HMMs), which can capture evolutionary changes by using multiple sequence alignments. We thus identified 510 sequences that were classified as distantly related to viruses. Many of these sequences were homologous to large viruses such as Herpesviridae and Mimiviridae, but some were also related to small circular viruses such as Circoviridae. We found that bioinformatics analysis using viral profile HMMs is capable of extending the classification of previously unknown sequences and consequently the detection of viruses in human biospecimens. Different organisms use synonymous codons differently to encode the same amino acids. To investigate whether codon usage bias could predict the presence of virus in metagenomic sequencing data originating from human samples, we trained Random Forests and Artificial Neural Networks on Relative Synonymous Codon Usage (RSCU) frequencies.
Our analysis showed that machine learning techniques based on RSCU could identify putative viral sequences with an area under the ROC curve of 0.79 and provide important information for taxonomic classification. For identification of viral genomes among raw metagenomic sequences, we developed the tool ViraMiner, a deep learning-based method which uses Convolutional Neural Networks with two convolutional branches. Using 300 base-pair sequences, ViraMiner achieved 0.923 area under the ROC curve, which is considerably improved performance compared with previous machine learning methods for virus sequence classification. The proposed architecture is, to the best of our knowledge, the first deep learning tool which can detect viral genomes in raw metagenomic sequences originating from a variety of human samples. To enable large-scale analysis of massive metagenomic sequencing data, we used Apache Hadoop and Apache Spark to develop ViraPipe, a scalable parallel bioinformatics pipeline for viral metagenomics. ViraPipe (executed on 23 nodes) was 11 times faster in the metagenome analysis than the sequential pipeline (executed on a single node). The new distributed workflow contains several standard bioinformatics tools and can scale to terabytes of data by accessing more computing power from the nodes. To analyze terabytes of RNA-seq data originating from head and neck squamous cell carcinoma samples, we used our parallel bioinformatics pipeline ViraPipe and the most recent version of the HPV sequence database. We detected transcription of HPV viral oncogenes in 92/500 cancers. HPV 16 was the most common HPV type, followed by HPV 33 as the second most common infection. If these cancers are indeed caused by HPV, we estimated that vaccination might prevent about 36 000 head and neck cancer cases in the United States every year.
In conclusion, the work in this thesis improves the prospects for biomedical researchers to classify the sequence contents of ultra-deep datasets, conduct large-scale analysis of metagenome studies, and detect the presence of viral genomes in human biospecimens. Hopefully, this work will contribute to our understanding of the biodiversity of viruses in humans, which in turn can help explore infectious causes of human disease.
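As a hedged sketch of the RSCU features mentioned above: for a codon, RSCU is its observed count scaled by the number of synonymous codons for that amino acid, divided by the total count over those synonyms. The codon table below is truncated to two amino acids for illustration; a real feature vector covers all synonymous codons.

```python
from collections import Counter

# Minimal Relative Synonymous Codon Usage (RSCU) sketch. The codon
# table is deliberately truncated to two amino acids for illustration.
SYNONYMS = {
    "F": ["TTT", "TTC"],                               # phenylalanine
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],   # leucine
}

def rscu(sequence):
    """RSCU = codon count * (number of synonyms) / total synonymous count."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    counts = Counter(codons)
    features = {}
    for aa, syn in SYNONYMS.items():
        total = sum(counts[c] for c in syn)
        for c in syn:
            features[c] = counts[c] * len(syn) / total if total else 0.0
    return features

feats = rscu("TTTTTCTTTTTA")   # codons: TTT TTC TTT TTA (F F F L)
print(feats["TTT"])            # 2 of 3 F codons → 2 * 2 / 3 ≈ 1.333
```

A vector of such per-codon values per sequence is the kind of input a Random Forest or neural network, as in the thesis, could be trained on.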

    LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms

    In this paper we introduce LDBC Graphalytics, a new industrial-grade benchmark for graph analysis platforms. It consists of six deterministic algorithms, standard datasets, synthetic dataset generators, and reference outputs that enable the objective comparison of graph analysis platforms. Its test harness produces deep metrics that quantify multiple kinds of system scalability, such as horizontal/vertical and weak/strong, and of robustness, such as failures and performance variability. The benchmark comes with open-source software for generating data and monitoring performance. We describe and analyze six implementations of the benchmark (three from the community, three from industry), providing insights into the strengths and weaknesses of the platforms. Key to our contribution, vendors perform the tuning and benchmarking of their own platforms.
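As an illustration of the strong-scaling metrics such a harness can report (the runtimes below are invented, not Graphalytics results): speedup is the single-node runtime divided by the multi-node runtime, and parallel efficiency divides that speedup by the node count.

```python
# Hedged sketch of strong-scaling metrics on a fixed workload.
# The timings are made-up placeholders, not benchmark output.

def strong_scaling(baseline_seconds, timings):
    """Map node count -> (speedup, efficiency) versus the 1-node baseline."""
    return {
        nodes: (baseline_seconds / t, baseline_seconds / (t * nodes))
        for nodes, t in timings.items()
    }

runs = {1: 120.0, 4: 35.0, 16: 12.0}   # hypothetical runtimes in seconds
metrics = strong_scaling(runs[1], runs)
speedup, efficiency = metrics[16]
print(round(speedup, 1), round(efficiency, 3))  # → 10.0 0.625
```

Weak scaling would instead grow the dataset with the node count and track how far the runtime stays flat.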

    Principal Patterns on Graphs: Discovering Coherent Structures in Datasets

    Graphs are now ubiquitous in almost every field of research. Recently, new research areas devoted to the analysis of graphs and of the data associated with their vertices have emerged. Focusing on dynamical processes, we propose a fast, robust and scalable framework for retrieving and analyzing recurring patterns of activity on graphs. Our method relies on a novel type of multilayer graph that encodes the spreading or propagation of events between successive time steps. We demonstrate the versatility of our method by applying it to three different real-world examples. Firstly, we study how a rumor spreads on a social network. Secondly, we reveal congestion patterns of pedestrians in a train station. Finally, we show how patterns of audio playlists can be used in a recommender system. In each example, relevant information previously hidden in the data is extracted in a very efficient manner, emphasizing the scalability of our method. With a parallel implementation scaling linearly with the size of the dataset, our framework easily handles millions of nodes on a single commodity server.
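A minimal sketch of the time-layered idea (not the paper's exact construction): connect a vertex active at step t to its active neighbors at step t+1, so that paths in the layered graph trace propagation events. The toy graph and activity sets below are invented.

```python
# Hypothetical layered-graph construction for spreading events.
# Each layer is one time step; edges only go from layer t to t+1.

def spreading_edges(adjacency, activity):
    """activity[t] is the set of vertices active at step t."""
    edges = []
    for t in range(len(activity) - 1):
        for u in activity[t]:
            for v in activity[t + 1]:
                if v in adjacency.get(u, ()):
                    edges.append(((u, t), (v, t + 1)))
    return edges

graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
steps = [{"a"}, {"b"}, {"c"}]   # e.g. a rumor hopping a → b → c
print(spreading_edges(graph, steps))
```

Recurring patterns of activity then correspond to frequently reappearing substructures in this layered graph.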

    A Distributed Algorithm For Large-Scale Graph Partitioning

    This degree project is sited at Skeppsbron/Skeppsbrokajen in central Stockholm. The focus I chose was to draw up a proposal for a fish market to be placed at this location. My work has been inspired by Sweden's largest and most famous fish market, Feskekörkan, in Gothenburg. Analyses of Feskekörkan's organization and its floor plan have, in my work, led to a tectonically constructed structure in which materials and building construction were important elements, including a fish-scale facade of brass and a load-bearing skeleton of large-scale glulam beams. The building is situated on a well-visited promenade with a broad, long quay edge that is used frequently both by tourists visiting the Old Town and the Royal Palace and by Stockholmers moving between Södermalm and Norrmalm. I have chosen to build on the site in a way that both preserves the beautiful promenade and gives visitors the opportunity to stop and take part in the fish market.

    Large Scale Feature Extraction from Linked Web Data

    Data available on the web is evolving, and the way it is represented is changing as well. Linked data has made information on the web understandable to machines. In this thesis we develop a proof-of-concept pipeline that extracts linked data from web crawls and performs feature extraction on it. The end goal of this pipeline is to provide input to machine learning models that are used for company credit scoring. The use case focuses on extracting product linked data and connecting it with the company that offers the product. The built solution attempts to detect whether two products from different websites are the same, in order to use one representation for both. Information about companies and products is represented as a graph on which network metrics are calculated. Network metrics from multiple web crawls are stored as time series that show changes in the graph over time. We then calculate derivatives of the values in the time series. The developed pipeline is designed to handle terabytes of data and is built with scalability in mind. We use Apache Spark to process huge amounts of data quickly and to be ready if the input data grows 100-fold.
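The metric-to-time-series step can be sketched as follows: a graph metric (here, a company node's degree) computed per crawl becomes a time series, and its first differences become model features. All names and edges below are invented, and a real run would use Spark rather than plain Python; this just keeps the idea visible.

```python
# Hypothetical sketch: per-crawl graph metric -> time series -> deltas.
# The (company, product) edges are made-up illustration data.

def degree(edges, node):
    """Count edges incident to a node in an undirected edge list."""
    return sum(1 for a, b in edges if node in (a, b))

crawls = [  # edges observed in three successive web crawls
    [("acme", "p1")],
    [("acme", "p1"), ("acme", "p2")],
    [("acme", "p1"), ("acme", "p2"), ("acme", "p3")],
]
series = [degree(e, "acme") for e in crawls]          # [1, 2, 3]
deltas = [b - a for a, b in zip(series, series[1:])]  # [1, 1]
print(series, deltas)
```

A steadily growing product degree like this is the kind of signal a credit-scoring model downstream might pick up.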

    Data mining and predictive analytics application on cellular networks to monitor and optimize quality of service and customer experience

    This research study focuses on the application of Data Mining and Machine Learning models to cellular network traffic, with the objective of arming Mobile Network Operators with a full view of the performance branches (services, devices, subscribers). The purpose is to optimize and minimize the time needed to detect service and subscriber behaviour patterns. Different data mining techniques and predictive algorithms have been applied to real cellular network datasets to uncover data usage patterns using specific Key Performance Indicators (KPIs) and Key Quality Indicators (KQIs). The following tools were used to develop the concept: RStudio for machine learning and process visualization, Apache Spark and SparkSQL for data and big data processing, and clicData for service visualization. Two use cases have been studied during this research. In the first, data and predictive analytics are applied in the field of telecommunications to efficiently address users' experience, with the goal of increasing customer loyalty and decreasing churn (customer attrition). Using real cellular network transactions, predictive analytics are used to identify customers who are likely to churn, which can result in revenue loss. Prediction algorithms and models including Classification Trees, Random Forests, Neural Networks and Gradient Boosting have been used, together with an exploratory data analysis determining the relationships between predictor variables. The data is segmented into two sets: a training set to train the model and a testing set to test it. The evaluation of the best-performing model is based on prediction accuracy, sensitivity, specificity and the confusion matrix on the test set.
The second use case analyses Service Quality Management using modern data mining techniques and the advantages of in-memory big data processing with Apache Spark and SparkSQL to save cost on tool investment; thus, a low-cost Service Quality Management model is proposed and analyzed. With the increase in smartphone adoption and access to mobile internet services, applications such as streaming and interactive chat require a certain service level to ensure customer satisfaction. As a result, an SQM framework is developed with a Service Quality Index (SQI) and a Key Performance Index (KPI). The research concludes with recommendations and future studies around modern technology applications in telecommunications, including the Internet of Things (IoT), cloud computing and recommender systems. Cellular networks have evolved and are still evolving, from traditional circuit-switched GSM (Global System for Mobile Communication), which supported only voice services and extremely low data rates, to all-packet LTE networks accommodating high-speed data used for service applications such as video streaming, video conferencing and heavy torrent downloads; and, in the near future, the roll-out of fifth-generation (5G) cellular networks, intended to support complex technologies such as IoT, high-definition video streaming and massive amounts of data. With high demand for network services and easy access to mobile phones, billions of transactions are performed by subscribers. The transactions appear in the form of SMSs, handovers, voice calls, web browsing activities, video and audio streaming, and heavy downloads and uploads. Nevertheless, the stormy growth in data traffic and the high requirements of new services introduce bigger challenges for Mobile Network Operators (MNOs) in analysing the big data traffic flowing in the network. Therefore, Quality of Service (QoS) and Quality of Experience (QoE) turn into a challenge.
Inefficiency in mining and analysing data and in applying predictive intelligence to network traffic can produce a high rate of unhappy customers or subscribers, revenue loss and a negative perception of services. Researchers and service providers are investing in data mining, machine learning and AI (Artificial Intelligence) methods to manage services and experience.
Electrical and Mining Engineering, M. Tech (Electrical Engineering)
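The churn-model evaluation described in the first use case can be sketched with a confusion matrix and the metrics derived from it. The labels below are invented (1 = churned, 0 = stayed); any real evaluation would use the held-out test set from the study.

```python
# Hedged sketch: confusion matrix plus accuracy, sensitivity and
# specificity for a binary churn prediction. Labels are made up.

def confusion(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy    = (tp + tn) / len(y_true)   # overall correctness
sensitivity = tp / (tp + fn)            # recall on actual churners
specificity = tn / (tn + fp)            # recall on actual stayers
print(accuracy, sensitivity, specificity)  # → 0.75 0.75 0.75
```

Comparing these three numbers across Classification Trees, Random Forests, Neural Networks and Gradient Boosting is how the best-performing model in the study would be selected.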