
    Deep Learning Data and Indexes in a Database

    A database is used to store and retrieve data and is a critical component of any software application. Databases require configuration for efficiency; however, there are tens of configuration parameters, and manually configuring a database is a challenging task. Furthermore, a database must be reconfigured on a regular basis to keep up with newer data and workloads. The goal of this thesis is to use the query workload history to autonomously configure the database and improve its performance. The proposed work is carried out in four stages: (i) we develop an index recommender for a standalone database using deep reinforcement learning and evaluate its effectiveness against several state-of-the-art approaches; (ii) we build a real-time index recommender that can dynamically create and remove indexes for better performance in response to sudden changes in the query workload; (iii) we develop a database advisor framework that learns latent patterns from a workload and is able to enhance a query, recommend interesting queries, and summarize a workload; (iv) we develop LinkSocial, a fast, scalable, and accurate framework to gain deeper insights from heterogeneous data.
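
    As a rough illustration of stage (i), the sketch below stands in for the thesis' deep reinforcement learning agent with plain tabular Q-learning and a toy cost model; the candidate index names, cost figures, and hyperparameters are invented for the example and are not taken from the thesis.

        import random
        from collections import defaultdict

        # Hypothetical candidate indexes and a toy stand-in for the optimizer's
        # estimated workload cost; a real recommender would query the DBMS.
        CANDIDATES = ("users.email", "orders.date", "orders.user_id")
        SAVING = {"users.email": 30.0, "orders.date": 20.0, "orders.user_id": 25.0}

        def workload_cost(indexes):
            # Base cost minus per-index savings, plus a maintenance penalty.
            return 100.0 - sum(SAVING[i] for i in indexes) + 5.0 * len(indexes)

        def apply_action(state, action):
            # An action toggles one candidate index; the reward is the cost reduction.
            nxt = state - {action} if action in state else state | {action}
            return frozenset(nxt), workload_cost(state) - workload_cost(nxt)

        Q = defaultdict(float)                      # Q[(state, action)]
        alpha, gamma, epsilon = 0.1, 0.9, 0.2
        state = frozenset()
        for _ in range(2000):
            if random.random() < epsilon:           # epsilon-greedy exploration
                action = random.choice(CANDIDATES)
            else:
                action = max(CANDIDATES, key=lambda a: Q[(state, a)])
            nxt, reward = apply_action(state, action)
            best_next = max(Q[(nxt, a)] for a in CANDIDATES)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = nxt

        print("final index configuration:", sorted(state))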

    Evaluation and Improvement of Semantically-Enhanced Tagging System

    The Social Web, or 'Web 2.0', is focused on the interaction and collaboration between web site users. It is credited for the existence of tagging systems, amongst other things such as blogs and wikis. Tagging systems like YouTube and Flickr offer their users simplicity and freedom in creating and sharing their own content, and folksonomy is thus a very active research area in which many improvements are proposed to overcome existing disadvantages such as the lack of semantic meaning, ambiguity, and inconsistency. TE is a tagging system proposing solutions to the problems of multilingualism, lack of semantic meaning, and shorthand writing (which is very common in the social web) through the aid of semantic and social resources. The current research presents an addition to the TE system in the form of an embedded stemming component to provide a solution to the problem of differing lexical forms. Prior to this, the TE system had to be explored thoroughly and its efficiency determined in order to decide on the practicality of embedding any additional components as performance enhancements. This involved analysing the algorithm's efficiency using an analytical approach to determine its time and space complexity. The TE system had a time growth rate of O(N²), which is polynomial, so the algorithm is considered efficient; nonetheless, recommended modifications like patch SQL execution can improve this. Regarding space complexity, the number of tags per photo represents the problem size, and as it grows the required memory space increases linearly. Based on the findings above, the TE system is re-implemented on Flickr instead of YouTube because of a recent YouTube restriction; this is of greater benefit to a multilingual tagging system since the language barrier is immaterial in this case. The re-implementation is achieved using 'flickrj' (a Java interface for the Flickr APIs). Next, the stemming component is added to perform tag normalisation prior to querying the ontologies. The component is embedded using a Java implementation of the Porter2 stemmer, which supports many languages including Italian. The impact of the stemming component on the performance of the TE system, in terms of the size of the index table and the number of retrieved results, is investigated in an experiment that showed a 48% reduction in the size of the index table. This also means that search queries have fewer system tags to compare against the search keywords, which can speed up the search. Furthermore, the experiment ran similar search trials on two versions of the TE system, one without and one with the stemming component, and found that the latter produced more results under the conditions of working with valid words and valid stems. Embedding the stemming component in the new TE system has lessened the storage overhead of the generated system tags by reducing the size of the index table, which makes the system suited to many applications such as text classification, summarization, email filtering, and machine translation.
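
    A minimal sketch of the added normalisation step, assuming NLTK's Snowball implementation of Porter2 rather than the Java encoding embedded in TE; the tag list is invented, and the printed reduction applies only to this toy input, not to the 48% measured in the experiment.

        from nltk.stem.snowball import SnowballStemmer

        # Normalise photo tags to their stems before querying the ontologies,
        # so that lexical variants collapse to a single index entry.
        # SnowballStemmer also supports e.g. "italian", matching TE's multilingual setting.
        def normalise(tags, language="english"):
            stemmer = SnowballStemmer(language)
            return {stemmer.stem(tag.lower()) for tag in tags}

        photo_tags = ["Running", "runs", "runner", "connected", "connection"]
        index_before = {tag.lower() for tag in photo_tags}
        index_after = normalise(photo_tags)

        reduction = 1 - len(index_after) / len(index_before)
        print(index_after, f"index entries reduced by {reduction:.0%}")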

    Unveiling the frontiers of deep learning: innovations shaping diverse domains

    Deep learning (DL) enables the development of computer models that are capable of learning, visualizing, optimizing, refining, and predicting data. In recent years, DL has been applied in a range of fields, including audio-visual data processing, agriculture, transportation prediction, natural language processing, biomedicine, disaster management, bioinformatics, drug design, genomics, face recognition, and ecology. To explore the current state of deep learning, it is necessary to investigate its latest developments and applications in these disciplines. However, the literature lacks a survey exploring the applications of deep learning across all potential sectors. This paper thus extensively investigates the potential applications of deep learning across all major fields of study as well as the associated benefits and challenges. As evidenced in the literature, DL exhibits accuracy in prediction and analysis, which makes it a powerful computational tool, and its ability to learn and optimize on its own makes it effective in processing data with little prior training. At the same time, deep learning necessitates massive amounts of data for effective analysis and processing. To handle the challenge of compiling huge amounts of medical, scientific, healthcare, and environmental data for use in deep learning, gated architectures like LSTMs and GRUs can be utilized. For multimodal learning, a network needs neurons shared across all activities as well as specialized neurons for particular tasks. Comment: 64 pages, 3 figures, 3 tables
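
    The gated-architecture and shared-versus-specialised layout mentioned above can be sketched as follows. This is an illustrative PyTorch model; the modalities (text and image), layer sizes, and two-task setup are assumptions made for the example and are not taken from any of the surveyed papers.

        import torch
        from torch import nn

        class MultimodalNet(nn.Module):
            """Gated (GRU) encoder for sequential input, a trunk of neurons shared
            by all tasks, and small task-specific heads."""
            def __init__(self, text_dim=300, image_dim=512, hidden=128, n_tasks=2):
                super().__init__()
                self.text_enc = nn.GRU(text_dim, hidden, batch_first=True)
                self.image_enc = nn.Linear(image_dim, hidden)
                self.shared = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
                self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_tasks))

            def forward(self, text_seq, image_feat):
                _, h = self.text_enc(text_seq)               # h: (1, batch, hidden)
                fused = torch.cat([h[-1], self.image_enc(image_feat)], dim=-1)
                z = self.shared(fused)                       # representation shared by all tasks
                return [head(z) for head in self.heads]      # one output per task

        model = MultimodalNet()
        outputs = model(torch.randn(4, 10, 300), torch.randn(4, 512))
        print([o.shape for o in outputs])                    # two (4, 1) tensors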

    Analyzing epigenomic data in a large-scale context

    While large amounts of epigenomic data are publicly available, their retrieval in a form suitable for downstream analysis is a bottleneck in current research. In a typical analysis, users are required to download huge files that span the entire genome, even if they are only interested in a small subset (e.g., promoter regions) or an aggregation thereof. Moreover, complex operations on genome-level data are not always feasible on a local computer due to resource limitations. The DeepBlue Epigenomic Data Server mitigates this issue by providing a robust server that affords a powerful API for searching, filtering, transforming, aggregating, enriching, and downloading data from several epigenomic consortia. Furthermore, its main component implements data storage and manipulation methods that scale with the increasing amount of epigenetic data, thereby making it the ideal resource for researchers who seek to integrate epigenomic data into their analysis workflow. This work also presents companion tools that utilize the DeepBlue API to enable users not proficient in scripting or programming languages to analyze epigenomic data in a user-friendly way: (i) an R/Bioconductor package that integrates DeepBlue into the R analysis workflow. The extracted data are automatically converted into suitable R data structures for downstream analysis and visualization within the Bioconductor framework; (ii) a web portal that enables users to search, select, filter, and download the epigenomic data available on the DeepBlue Server. This interface provides elements such as data tables, grids, and data selections developed to help users find the required epigenomic data in a straightforward interface; (iii) DIVE, a web data analysis tool that allows researchers to perform large-scale epigenomic data analysis in a programming-free environment. DIVE enables users to compare their datasets to the datasets available on the DeepBlue Server in an intuitive interface that summarizes the comparison of hundreds of datasets in a simple chart. Given the large amount of data available in DIVE, methods are provided that suggest the most similar datasets for a comparative analysis. Furthermore, these tools are integrated and capable of sharing results among themselves, creating a powerful large-scale epigenomic data analysis environment. The DeepBlue Epigenomic Data Server and its ecosystem were well received by the International Human Epigenome Consortium and have already attracted much attention from the epigenomic research community, with currently 160 registered users and more than three million anonymous workflow processing requests since release.
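
    To make the kind of server-side operation concrete, the sketch below filters a small region table down to promoter overlaps and aggregates a signal per chromosome with pandas; the column names, coordinates, and promoter intervals are invented for the example, and this is a local illustration of "filter" and "aggregate", not the DeepBlue API itself.

        import pandas as pd

        # Invented example data: genomic regions with a signal value, plus one
        # promoter interval per chromosome.
        regions = pd.DataFrame({
            "chrom": ["chr1", "chr1", "chr2", "chr2"],
            "start": [100, 5_000, 200, 9_000],
            "end":   [900, 5_800, 950, 9_700],
            "signal": [3.2, 7.5, 1.1, 4.4],
        })
        promoters = {"chr1": (4_500, 6_000), "chr2": (8_500, 10_000)}

        def in_promoter(row):
            p_start, p_end = promoters[row["chrom"]]
            return row["start"] < p_end and row["end"] > p_start   # interval overlap

        filtered = regions[regions.apply(in_promoter, axis=1)]      # "filter"
        summary = filtered.groupby("chrom")["signal"].mean()        # "aggregate"
        print(summary)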

    What we leave behind : reproducibility in chromatin analysis within and across species

    Epigenetics is the field of biology that investigates heritable factors regulating gene expression without being directly encoded in the genome of an organism. The human genome is densely packed inside a cell's nucleus in the form of chromatin. Certain constituents of chromatin play a vital role as epigenetic factors in the dynamic regulation of gene expression. Epigenetic changes on the chromatin level are thus an integral part of the mechanisms governing the development of the functionally diverse cell types in multicellular species such as human. Studying these mechanisms is not only important to understand the biology of healthy cells, but also necessary to comprehend the epigenetic component in the formation of many complex diseases. Modern wet lab technology enables scientists to probe the epigenome with high throughput and in extensive detail. The fast generation of epigenetic datasets burdens computational researchers with the challenge of rapidly performing elaborate analyses without compromising the scientific reproducibility of the reported findings. To facilitate reproducible computational research in epigenomics, this thesis proposes a task-oriented metadata model, relying on web technology and supported by database engineering, that aims at consistent and human-readable documentation of standardized computational workflows. The suggested approach features, e.g., computational validation of metadata records, automatic error detection, and progress monitoring of multi-step analyses, and was successfully field-tested as part of a large epigenome research consortium. This work leaves aside theoretical considerations and intentionally emphasizes the practical need to provide scientists with tools that assist them in performing reproducible research. Irrespective of technological progress, the dynamic and cell-type-specific nature of the epigenome commonly requires restricting the number of analyzed samples due to resource limitations. The second project of this thesis introduces the software tool SCIDDO, which has been developed for the differential chromatin analysis of cellular samples with potentially limited availability. By combining statistics, algorithmics, and best practices for robust software development, SCIDDO can quickly identify biologically meaningful regions of differential chromatin marking between cell types. We demonstrate SCIDDO's usefulness in an exemplary study in which we identify regions that establish a link between chromatin and gene expression changes. SCIDDO's quantitative approach to differential chromatin analysis is user-customizable, providing the necessary flexibility to adapt SCIDDO to specific research tasks. Given the functional diversity of cell types and the dynamics of the epigenome in response to environmental changes, it is hardly realistic to map the complete epigenome even for a single organism like human or mouse. For non-model organisms, e.g., cow, pig, or dog, epigenome data is particularly scarce. The third project of this thesis investigates to what extent bioinformatics methods can compensate for the comparatively little effort that is invested in charting the epigenome of non-model species. This study implements a large integrative analysis pipeline, including state-of-the-art machine learning, to transfer chromatin data for predictive modeling between 13 species. 
The evidence presented here indicates that a partial regulatory epigenetic signal is stably retained even over millions of years of evolutionary distance between the considered species. This finding suggests complementary and cost-effective ways for bioinformatics to contribute to comparative epigenome analysis across species boundaries.
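
    As a minimal sketch of the computational metadata validation described for the first project, the snippet below checks a workflow record against a JSON Schema using the jsonschema library; the schema fields and the record are hypothetical and do not reflect the consortium's actual metadata model.

        import jsonschema

        # Hypothetical schema for one analysis step of a standardized workflow.
        STEP_SCHEMA = {
            "type": "object",
            "required": ["analysis", "input_files", "genome_assembly"],
            "properties": {
                "analysis": {"type": "string"},
                "input_files": {"type": "array", "items": {"type": "string"}, "minItems": 1},
                "genome_assembly": {"enum": ["GRCh37", "GRCh38", "GRCm38"]},
            },
        }

        record = {
            "analysis": "peak_calling",
            "input_files": ["sample1.bam"],
            "genome_assembly": "GRCh38",
        }

        try:
            jsonschema.validate(instance=record, schema=STEP_SCHEMA)
            print("metadata record is valid")
        except jsonschema.ValidationError as err:     # automatic error detection
            print("invalid metadata:", err.message)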

    The doctoral research abstracts. Vol:6 2014 / Institute of Graduate Studies, UiTM

    Congratulations to the Institute of Graduate Studies on its continuous efforts to publish this 6th issue of the Doctoral Research Abstracts, which ranges across the disciplines of science and technology, business and administration, and social science and humanities. This issue captures the novelty of research from 52 PhD doctorates receiving their scrolls at UiTM's 81st Convocation. This convocation is very significant for UiTM, since we are celebrating the success of 52 PhD graduands, the highest number ever conferred at any one time. To the 52 doctorates, I would like it to be known that you have most certainly done UiTM proud by journeying through the scholastic path with its endless challenges and impediments, and by persevering right till the very end. This convocation should not be regarded as the end of your highest scholarly achievement and contribution to the body of knowledge, but rather as the beginning of embarking on more innovative research, for the community and country, from the knowledge gained during this academic journey. As alumni of UiTM, we hold you dear to our hearts. The relationship that was once between student and supervisor has now matured into one of comrades, forging and exploring together beyond the frontier of knowledge. We wish you all the best in your endeavours, and may I offer my congratulations to all the graduands. 'UiTM sentiasa dihati ku' ('UiTM is always in my heart'). Tan Sri Dato' Sri Prof Ir Dr Sahol Hamid Abu Bakar, FASc, PEng, Vice Chancellor, Universiti Teknologi MARA

    Detection of 3D Genome Folding at Multiple Scales

    Understanding 3D genome structure is crucial to learning how chromatin folds and how genes are regulated through the spatial organization of regulatory elements. Various technologies have been developed to investigate genome architecture. These include ligation-based 3C methodologies such as Hi-C and Micro-C, ligation-based pull-down methods such as Proximity Ligation-Assisted ChIP-seq (PLAC-seq) and Chromatin Interaction Analysis with Paired-End Tag sequencing (ChIA-PET), and ligation-free methods such as Split-Pool Recognition of Interactions by Tag Extension (SPRITE) and Genome Architecture Mapping (GAM). Although these technologies have provided great insight into chromatin organization, a systematic evaluation of them has been lacking. Among these technologies, Hi-C has been one of the most widely used methods to map genome-wide chromatin interactions for over a decade. To understand how the choice of experimental parameters determines the ability to detect and quantify the features of chromosome folding, we first systematically evaluated two critical parameters in the Hi-C protocol: cross-linking and digestion of chromatin. We found that different protocols capture distinct 3D genome features with different efficiencies depending on the cell type (Chapter 2). The updated Hi-C protocol with new parameters, which we call Hi-C 3.0, was subsequently evaluated and found to provide the best loop detection of all previous Hi-C protocols as well as better compartment quantification than Micro-C (Chapter 3). Finally, to understand how the aforementioned technologies that measure 3D organization (Hi-C, Micro-C, PLAC-seq, ChIA-PET, SPRITE, GAM) could provide a comprehensive understanding of genome structure, we performed a comparison of these technologies. We found that each of these methods captures different aspects of chromatin folding (Chapter 4). Collectively, these studies suggest that improving the 3D methodologies and integrative analyses of these methods will reveal unprecedented details of genome structure and function.
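
    As an illustration of one standard way to quantify compartments from a contact map (not necessarily the exact pipeline used in these chapters), the sketch below distance-normalizes a synthetic Hi-C-like matrix and takes the leading eigenvector of its correlation matrix; the sign of that eigenvector is commonly read as the A/B compartment assignment. The matrix itself is randomly generated for the example.

        import numpy as np

        rng = np.random.default_rng(0)

        # Synthetic symmetric contact matrix with a checkerboard (compartment) pattern
        # and a distance-dependent decay, mimicking a small Hi-C map.
        n = 40
        labels = np.where(np.arange(n) % 10 < 5, 1, -1)          # alternating blocks
        base = 1.0 + 0.5 * np.outer(labels, labels)              # more contacts within a compartment
        dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        contacts = base / (1.0 + dist) + 0.01 * rng.random((n, n))
        contacts = (contacts + contacts.T) / 2

        # Observed/expected normalization: divide each diagonal by its mean.
        expected = np.array([contacts.diagonal(d).mean() for d in range(n)])
        obs_exp = contacts / expected[dist]

        # First eigenvector of the Pearson correlation matrix -> compartment track.
        corr = np.corrcoef(obs_exp)
        _, eigvecs = np.linalg.eigh(corr)
        compartment = np.sign(eigvecs[:, -1])                    # eigh sorts ascending, so last column
        print(compartment)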

    Artificial intelligence (AI) in rare diseases: is the future brighter?

    The amount of data collected and managed in (bio)medicine is ever-increasing. Thus, there is a need to rapidly and efficiently collect, analyze, and characterize all this information. Artificial intelligence (AI), with an emphasis on deep learning, holds great promise in this area and is already being successfully applied to basic research, diagnosis, drug discovery, and clinical trials. Rare diseases (RDs), which are severely underrepresented in basic and clinical research, can particularly benefit from AI technologies. Of the more than 7000 RDs described worldwide, only 5% have a treatment. The ability of AI technologies to integrate and analyze data from different sources (e.g., multi-omics, patient registries, and so on) can be used to overcome RDs' challenges (e.g., low diagnostic rates, small numbers of patients, geographical dispersion, and so on). Ultimately, AI-mediated knowledge of RDs could significantly boost therapy development. AI approaches are already being used in RDs, and this review aims to collect and summarize these advances. A section dedicated to congenital disorders of glycosylation (CDG), a particular group of orphan RDs that can serve as a potential study model for other common diseases and RDs, has also been included.