2,199 research outputs found

    Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

    Get PDF
    Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity, leading to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, the cancer subtypes currently used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increasing availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a more fine-grained characterization of cancer subtypes harbors the potential of substantially expanding treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.
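
    To make the integration step concrete, the following is a minimal sketch of the general multiple-kernel pattern the thesis builds on: one kernel per omics data type, a convex combination into a joint kernel, and dimensionality reduction plus clustering on the result. The toy data, RBF kernels and uniform weights are illustrative assumptions; the regularized graph embedding and the adapted kernel PCA objective developed in the thesis are not reproduced here.

```python
# Minimal sketch: combine per-omics kernels into a joint kernel, embed the
# patients with kernel PCA, and cluster them. Uniform weights and RBF kernels
# are illustrative assumptions, not the thesis's regularized MKL method.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_patients = 60
# Toy stand-ins for two omics layers measured on the same patients.
expression = rng.normal(size=(n_patients, 200))
methylation = rng.normal(size=(n_patients, 150))

# One kernel per data type, then a convex combination into a joint kernel.
kernels = [rbf_kernel(expression), rbf_kernel(methylation)]
weights = np.ones(len(kernels)) / len(kernels)  # uniform; MKL would learn these
K = sum(w * k for w, k in zip(weights, kernels))

# Dimensionality reduction on the combined kernel, then patient clustering.
embedding = KernelPCA(n_components=2, kernel="precomputed").fit_transform(K)
subgroups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(subgroups))  # sizes of the resulting patient subgroups
```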

    Full Issue

    Get PDF

    Quality by Design through multivariate latent structures

    Full text link
    The present Ph.D. thesis is motivated by the growing need in most companies, especially (but not solely) those in the pharmaceutical, chemical, food and bioprocess fields, to increase the flexibility of their operating conditions in order to reduce production costs while maintaining or even improving the quality of their products. To this end, this thesis focuses on the application of the concepts of Quality by Design to the exploitation and extension of already existing methodologies and to the development of new algorithms aimed at the proper implementation of tools for the design of experiments, multivariate data analysis and process optimization, especially (but not only) in the context of mixture design.
    Part I - Preface presents a summary of the research work, the main goals it aims at and their justification, and introduces some of the most relevant concepts related to the work developed in subsequent chapters, such as design of experiments and latent variable-based multivariate data analysis techniques.
    Part II - Mixture design optimization provides a review of existing mixture design tools for the design of experiments and data analysis via traditional approaches, as well as of some latent variable-based techniques such as Partial Least Squares (PLS) regression. A kernel-based extension of PLS for mixture design data analysis is also proposed, and the different available methods are compared to each other. Finally, the software MiDAs is briefly presented; it was developed to provide users with a tool to easily approach mixture design problems, construct designs of experiments, analyze the data with different methods and compare them.
    Part III - Design space and optimization through the latent space addresses one of the fundamental issues within the Quality by Design philosophy: the definition of the so-called 'design space', i.e. the subspace comprising all possible combinations of process operating conditions, raw materials, etc. that guarantee a product meeting the required quality standard. The proper definition of the optimization problem is also tackled, not only as a tool for quality improvement but also for exploration and process flexibilisation purposes, in order to establish an efficient and robust optimization method in accordance with the nature of the different problems that may require such optimization.
    Part IV - Epilogue draws the final conclusions, reviews the achievement of the objectives, suggests future lines of research and includes the annexes.
    Palací López, DG. (2018). Quality by Design through multivariate latent structures [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/115489
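
    As a concrete illustration of the latent variable-based analysis at the heart of Part II, the following is a minimal sketch of ordinary PLS fitted to simulated mixture data. The three-component mixture, the hypothetical quality response and all constants are illustrative assumptions; the kernel-based PLS extension proposed in the thesis is not reproduced here.

```python
# Minimal sketch: PLS regression on simulated three-component mixture data.
# The mixture, the response model and the constants are illustrative
# assumptions, not the thesis's kernel-based PLS extension.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n_runs = 30
# Mixture proportions: three components constrained to sum to one.
X = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=n_runs)
# Hypothetical quality response with a synergistic blending term plus noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 4.0 * X[:, 0] * X[:, 2] \
    + rng.normal(0.0, 0.05, n_runs)

# Two latent variables suffice here: the sum-to-one constraint leaves the
# three mixture proportions with only two free dimensions.
pls = PLSRegression(n_components=2).fit(X, y)
print("R^2 on training data:", round(pls.score(X, y), 3))
blend = np.array([[0.5, 0.25, 0.25]])
print("Predicted quality for a 50/25/25 blend:",
      float(pls.predict(blend).ravel()[0]))
```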

    Assessing the competency of seafarers using simulators in bridge resource management (BRM) training

    Get PDF

    Autonomous supervision and optimization of product quality in a multi-stage manufacturing process based on self-adaptive prediction models

    Get PDF
    In modern manufacturing facilities, there are basically two essential phases for assuring high production quality with low (or even zero) defects and waste in order to save costs for companies. The first phase concerns the early recognition of potentially arising problems in product quality; the second concerns proper reactions upon the recognition of such problems. In this paper, we present a holistic approach for handling both issues consecutively within a predictive maintenance framework at an on-line production system. We thereby address multi-stage functionality based on (i) data-driven forecast models for (measurable) product quality criteria (QCs) at a later stage, which are established and executed through process values (and their time series trends) recorded at an early stage of production (describing its progress), and (ii) process optimization cycles whose outputs are suggestions for proper reactions at an earlier stage in the case of forecast downtrends or violations of allowed boundaries in product quality. The data-driven forecast models are established through a high-dimensional batch time-series modeling problem, in which we employ a non-linear version of PLSR (partial least squares regression) by coupling PLS with generalized Takagi–Sugeno fuzzy systems (termed PLS-fuzzy). The models are able to self-adapt over time based on recursive parameter adaptation and rule evolution functionalities. Two concepts for increased flexibility during model updates are proposed: (i) a dynamic outweighing strategy for older samples with an adaptive update of the forgetting factor (steering forgetting intensity), and (ii) an incremental update of the latent variable space spanned by the directions (loading vectors) obtained through PLS; the whole model update approach is termed SAFM-IF (self-adaptive forecast models with increased flexibility). Process optimization is achieved through multi-objective optimization using evolutionary techniques, where the (trained and updated) forecast models serve as surrogate models to guide the optimization process to Pareto fronts (containing solution candidates) of high quality. A new influence analysis between process values and QCs is suggested, based on the PLS-fuzzy forecast models, in order to reduce the dimensionality of the optimization space and thus to guarantee high(er) quality of solutions within a reasonable amount of time, allowing better use in on-line mode. The methodologies have been comprehensively evaluated on real on-line process data from a (micro-fluidic) chip production system, where the early stage comprises the injection molding process and the later stage the bonding process. The results show remarkable performance in terms of low prediction errors of the PLS-fuzzy forecast models (mostly lower than those achieved by other model architectures) as well as in terms of Pareto fronts containing individuals (solutions) whose fitness is close to the optimal values of the three most important target QCs used for supervision: flatness, void events and RMSE of the chips. Suggestions can thus be provided to experts/operators on how best to change process values and associated machining parameters in the injection molding process in order to achieve significantly higher product quality for the final chips at the end of the bonding process.
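
    As a rough illustration of concept (i), the following is a minimal sketch of recursive least squares with an adaptive forgetting factor, in which older samples are dynamically outweighed and the forgetting intensity reacts to the current prediction error. The adaptation rule and all constants are illustrative assumptions, not the SAFM-IF update or the PLS-fuzzy model from the paper.

```python
# Minimal sketch: recursive least squares with an error-driven forgetting
# factor, illustrating dynamic outweighing of older samples. The adaptation
# rule and constants are illustrative assumptions, not SAFM-IF.
import numpy as np

def rls_adaptive_forgetting(X, y, lam_nom=0.999, lam_min=0.95, gain=0.5):
    n, d = X.shape
    theta = np.zeros(d)      # model parameters
    P = np.eye(d) * 1e3      # inverse covariance estimate
    for x, target in zip(X, y):
        err = target - x @ theta
        # Larger errors shrink lambda toward lam_min -> faster forgetting
        # (more flexibility); small errors keep lambda near lam_nom.
        lam = max(lam_min, lam_nom - gain * min(err ** 2, 0.1))
        k = P @ x / (lam + x @ P @ x)        # Kalman-style gain
        theta = theta + k * err              # parameter update
        P = (P - np.outer(k, x @ P)) / lam   # covariance update with forgetting
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0.0, 0.1, 500)
print(rls_adaptive_forgetting(X, y).round(2))  # should approach true_w
```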

    Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey

    Full text link
    Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles, caused by the heterogeneity of domain data, the sophistication of domain knowledge, the uniqueness of domain objectives, and the diversity of the constraints (e.g., various social norms, cultural conformity, religious beliefs, and ethical standards in the domain applications). Domain specialization techniques are therefore key to making large language models disruptive in many applications. To overcome these hurdles, there has been a notable increase in research and practice on the domain specialization of LLMs in recent years. This emerging field of study, with its substantial potential for impact, necessitates a comprehensive and systematic review to better summarize and guide ongoing work in this area. In this article, we present a comprehensive survey of domain specialization techniques for large language models, an emerging direction critical for large language model applications. First, we propose a systematic taxonomy that categorizes LLM domain-specialization techniques based on the accessibility of the LLM and summarizes the framework for all the subcategories as well as their relations and differences to each other. Second, we present an extensive taxonomy of critical application domains that can benefit dramatically from specialized LLMs, discussing their practical significance and open challenges. Last, we offer our insights into the current research status and future trends in this area.

    Research and Development of a General Purpose Instrument DAQ-Monitoring Platform applied to the CLOUD/CERN experiment

    Get PDF
    In the current scientific environment, experimentalists and system administrators allocate large amounts of time to data access, parsing and gathering, as well as to instrument management. This is a growing challenge, since there is an increasing number of large collaborations with a significant amount of instrument resources, remote instrumentation sites, and continuously improved and upgraded scientific instruments. DAQBroker is a new software platform designed to monitor networks of scientific instruments while also providing simple data access methods for any user. Data can be stored in one or several local or remote databases running on any of the most popular relational database engines (MySQL, PostgreSQL, Oracle). The platform also provides the necessary tools for creating and editing the metadata associated with different instruments, manipulating the collected data, and generating events based on instrument measurements, regardless of the user's familiarity with the individual instruments. Time series stored in a DAQBroker database also benefit from several statistical methods for time series classification, comparison and event detection, as well as from multivariate time series analysis methods that determine the most statistically relevant time series, rank the most influential ones, and identify the periods of most activity within specific experimental periods. This thesis presents the architecture behind the framework, assesses its performance under controlled conditions, and presents a use case from the CLOUD experiment at CERN, Switzerland. The univariate and multivariate time series statistical methods applied within this framework are also studied.
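
    The following is a minimal sketch of the storage-plus-events pattern described above: instrument readings stored in a relational table and a user-defined rule that generates events from the measurements. The schema, names and bounds are hypothetical illustrations, not DAQBroker's actual data model or API.

```python
# Minimal sketch: instrument readings in a relational table plus a
# threshold rule that generates events. Schema, table and column names
# are hypothetical, not DAQBroker's actual data model.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    instrument TEXT, channel TEXT, ts REAL, value REAL)""")
samples = [("chamber_psu", "voltage", 0.0, 11.9),
           ("chamber_psu", "voltage", 1.0, 12.1),
           ("chamber_psu", "voltage", 2.0, 13.7)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", samples)

# Event rule: flag any reading outside user-defined bounds for a channel.
LOW, HIGH = 11.0, 13.0
events = conn.execute(
    "SELECT instrument, channel, ts, value FROM readings "
    "WHERE value < ? OR value > ? ORDER BY ts", (LOW, HIGH)).fetchall()
for inst, chan, ts, val in events:
    print(f"event: {inst}/{chan} out of bounds at t={ts}: {val}")
```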

    Processing Rank-Aware Queries in Schema-Based P2P Systems

    Get PDF
    In recent years, there has been considerable research on query processing in data integration and P2P systems. Conventional data integration systems consist of multiple sources with possibly different schemas, adhere to a hierarchical structure, and have a central component (the mediator) that manages a global schema. Queries are formulated against this global schema, and the mediator processes them by retrieving relevant data from the sources transparently to the user. Arising from these systems, Peer Data Management Systems (PDMSs), or schema-based P2P systems, have attracted attention. Peers participating in a PDMS can act both as mediators and as data sources, are autonomous, and might leave or join the network at will. For these reasons, peers often hold incomplete or erroneous data sets and mappings. The possibly huge amount of data available in such a network often results in large query result sets that are hard to manage; retrieving the complete result set is therefore in most cases difficult or even impossible. Applying rank-aware query operators such as top-N and skyline, possibly in conjunction with approximation techniques, is a remedy to these problems, as these operators select only those result records that are most relevant to the user. Since in most cases only a small fraction of the complete result set is actually output to the user, retrieving the complete set before evaluating such operators is obviously inefficient. The questions we want to answer in this dissertation are therefore how to process such queries in PDMSs and how to do so efficiently. We propose strategies for efficient query processing in PDMSs that exploit the characteristics of rank-aware queries and optionally apply approximation techniques. A peer's relevance is determined on two levels: the schema level and the data level. According to its relevance, a peer is either considered for query processing or excluded from it. Because of the heterogeneity of the peers, queries need to be rewritten, enabling cooperation between peers that use different schemas. As existing query rewriting techniques mostly consider conjunctive queries only, we present an extension that allows for rewriting queries involving rank-aware query operators. As PDMSs are dynamic systems and participating peers might change their data at any time, this dissertation addresses not only how routing indexes are used to determine a peer's relevance on the data level but also how they can be kept up-to-date. Finally, we provide a system-level evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems), a system created in the context of this dissertation that implements all the presented techniques.
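
    As a concrete illustration of the two rank-aware operators at the core of this work, the following is a minimal sketch of top-N selection under a user-defined ranking function and of the skyline of non-dominated records. The toy records, the scoring function and the two-dimensional dominance relation are illustrative assumptions, independent of the P2P setting.

```python
# Minimal sketch: top-N and skyline operators over a small result set. The
# records, ranking function and dominance relation are illustrative
# assumptions, not the dissertation's P2P processing strategies.
import heapq

# Toy results: (price, distance) -- lower is better in both dimensions.
records = [(120, 2.0), (80, 5.0), (95, 1.5), (80, 4.0), (200, 0.5)]

def top_n(recs, n, score):
    """Return the n records with the smallest score."""
    return heapq.nsmallest(n, recs, key=score)

def dominates(a, b):
    """a dominates b if a is no worse in every dimension and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(recs):
    """Records not dominated by any other record."""
    return [r for r in recs if not any(dominates(o, r) for o in recs if o != r)]

print(top_n(records, 2, score=lambda r: 0.5 * r[0] + 10 * r[1]))
print(skyline(records))  # -> [(95, 1.5), (80, 4.0), (200, 0.5)]
```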