
    Low-cost and efficient fault detection and diagnosis schemes for modern cores

    Continuous improvements in transistor scaling together with microarchitectural advances have made possible the widespread adoption of high-performance processors across all market segments. However, the growing reliability threats induced by technology scaling and by the complexity of designs are challenging the production of cheap yet robust systems. Soft error trends are alarming, especially for combinational logic, and parity and ECC codes are becoming insufficient as combinational logic turns into the dominant source of soft errors. Furthermore, experts are warning about the need to also address intermittent and permanent faults during processor runtime, as increasing temperatures and device variations will accelerate inherent aging phenomena. These challenges especially threaten the commodity segments, which impose requirements that existing fault tolerance mechanisms cannot meet. Current techniques based on redundant execution were devised in a time when high penalties were accepted for the sake of high reliability levels. Novel lightweight techniques are therefore needed to enable fault protection in the mass-market segments. The complexity of designs is also making post-silicon validation extremely expensive. Validation costs exceed design costs, and the number of discovered bugs is growing, both during validation and once products hit the market. Fault localization and diagnosis are the biggest bottlenecks, magnified by huge detection latencies, limited internal observability, and the costly server farms needed to generate test outputs. This thesis explores two directions to address some of the critical challenges introduced by unreliable technologies and by the limitations of current validation approaches. We first explore mechanisms for comprehensively detecting multiple sources of failures in modern processors during their lifetime (including transient, intermittent and permanent faults, and also design bugs). 
Our solutions embrace a paradigm where fault tolerance is built by exploiting high-level microarchitectural invariants that are reusable across designs, rather than by relying on re-execution or ad hoc block-level protection. To do so, we decompose the basic functionalities of processors into high-level tasks and propose three novel runtime verification solutions that, combined, enable global error detection: a computation/register dataflow checker, a memory dataflow checker, and a control flow checker. The techniques use the concept of end-to-end signatures and allow designers to adjust the fault coverage to their needs by trading off area, power and performance. Our fault injection studies reveal that our methods provide high coverage levels while incurring significantly lower performance, power and area costs than existing techniques. This thesis then extends the applicability of the proposed error detection schemes to the validation phases. We present a fault localization and diagnosis solution for the memory dataflow that combines our error detection mechanism, a new low-cost logging mechanism and a diagnosis program. Selected internal activity is continuously traced and kept in a memory-resident log whose capacity can be expanded to suit validation needs. The solution can catch undiscovered bugs, reducing the dependence on simulation farms that compute golden outputs. Upon error detection, the diagnosis algorithm analyzes the log to automatically locate the bug and determine its root cause. Our evaluations show that very high localization coverage and diagnosis accuracy can be obtained at very low performance and area costs. The net result is a simplification of current debugging practices, which are extremely manual, time-consuming and cumbersome. 
Altogether, the integrated solutions proposed in this thesis enable the industry to deliver more reliable and correct processors as technology evolves into more complex designs and more vulnerable transistors.
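To make the signature idea concrete, the toy sketch below precomputes a per-basic-block signature and checks it when the block's instructions retire. The CRC, the 32-bit instruction words, and the block table are illustrative stand-ins only; the thesis's end-to-end signatures are hardware mechanisms spread across pipeline stages, not this code:

```python
import zlib

def block_signature(instr_words):
    """Compile-time side: a running CRC over a basic block's instruction
    encodings, standing in for an end-to-end signature."""
    sig = 0
    for w in instr_words:
        sig = zlib.crc32(w.to_bytes(4, "little"), sig)
    return sig

class ControlFlowChecker:
    """Runtime side: accumulate the same CRC as instructions retire and
    compare it with the precomputed signature at each block boundary.
    A mismatch flags a control-flow or fetch corruption."""
    def __init__(self, expected):
        self.expected = expected   # block id -> precomputed signature
        self.sig = 0
    def retire(self, word):
        self.sig = zlib.crc32(word.to_bytes(4, "little"), self.sig)
    def end_block(self, block_id):
        ok = (self.sig == self.expected[block_id])
        self.sig = 0               # start fresh for the next block
        return ok
```

A corrupted or wrongly fetched instruction changes the runtime CRC and is caught at the next block boundary, without re-executing anything.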

    High-performance low-Vcc in-order core

    Power density grows in new technology nodes, thus requiring Vcc to scale, especially in mobile platforms where energy is critical. This paper presents a novel approach to decrease Vcc while keeping the operating frequency high. Our mechanism is referred to as immediate read after write (IRAW) avoidance. We propose an implementation of the mechanism for an Intel® Silverthorne™ in-order core. Furthermore, we show that our mechanism can be adapted dynamically to provide the highest performance and lowest energy-delay product (EDP) at each Vcc level. Results show that IRAW avoidance increases the operating frequency by 57% at 500mV and 99% at 400mV with negligible area and power overhead (below 1%), which translates into large speedups (48% at 500mV and 90% at 400mV) and EDP reductions (a relative EDP of 0.61 at 500mV and 0.33 at 400mV).

    Genetic algorithm based schedulers for grid computing systems

    In this paper we present Genetic Algorithm (GA) based schedulers for efficiently allocating jobs to resources in a Grid system. Scheduling is a key problem in emergent computational systems, such as Grid and P2P, in order to benefit from the large computing capacity of such systems. We present an extensive study on the usefulness of GAs for designing efficient Grid schedulers when makespan and flowtime are minimized. Two encoding schemes have been considered, and most of the GA operators for each of them are implemented and empirically studied. The extensive experimental study showed that our GA-based schedulers outperform existing GA implementations in the literature for the problem, and also revealed their efficiency when makespan and flowtime are minimized either in a hierarchical or a simultaneous optimization mode; previous approaches considered only the minimization of the makespan. Moreover, we were able to identify which GA versions work best under certain Grid characteristics, which is very useful for real Grids. Our GA-based schedulers are very fast and hence can be used to dynamically schedule jobs arriving in the Grid system by running in batch mode for a short time.
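As a rough illustration of the scheduler family studied above, the sketch below evolves the direct job-to-machine encoding against the two objectives named in the abstract and compares schedules hierarchically, makespan first with flowtime as tie-breaker. The ETC-style cost matrix, the operators and every parameter value are illustrative assumptions, not the paper's tuned configuration:

```python
import random

def makespan_flowtime(assignment, etc):
    """Return (makespan, flowtime) of one schedule. assignment[j] is the
    machine chosen for job j; etc[j][m] is the expected time to compute
    job j on machine m (an ETC-style matrix)."""
    finish = [0.0] * len(etc[0])   # per-machine completion time
    flowtime = 0.0
    for j, m in enumerate(assignment):
        finish[m] += etc[j][m]
        flowtime += finish[m]      # a job finishes when its machine does
    return max(finish), flowtime

def hierarchic_better(a, b, etc):
    """Hierarchical mode: makespan is primary, flowtime breaks ties
    (lexicographic comparison of the objective tuples)."""
    return makespan_flowtime(a, etc) < makespan_flowtime(b, etc)

def ga_schedule(etc, pop_size=20, generations=50, seed=0):
    """Tiny GA over the direct (job -> machine) encoding: tournament
    selection, one-point crossover, single-gene mutation, elitist survival."""
    rng = random.Random(seed)
    n_jobs, n_machines = len(etc), len(etc[0])
    key = lambda ind: makespan_flowtime(ind, etc)
    pop = [[rng.randrange(n_machines) for _ in range(n_jobs)]
           for _ in range(pop_size)]
    for _ in range(generations):
        parents = [min(rng.sample(pop, 3), key=key) for _ in range(pop_size)]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, n_jobs)           # one-point crossover
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        for child in children:                       # mutation: reassign a job
            if rng.random() < 0.2:
                child[rng.randrange(n_jobs)] = rng.randrange(n_machines)
        pop = sorted(pop + children, key=key)[:pop_size]  # elitist replacement
    return min(pop, key=key)
```

A simultaneous optimization mode would instead score each individual with a combined function of both objectives rather than the lexicographic comparison.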

    Design and evaluation of a tabu search method for job scheduling in distributed environments

    The efficient allocation of jobs to grid resources is indispensable for high-performance grid-based applications. The scheduling problem is computationally hard even when there are no dependencies among jobs. We therefore present in this paper a new tabu search (TS) algorithm for the problem of batch job scheduling on computational grids. We consider job scheduling as a bi-objective optimization problem consisting of the minimization of makespan and flowtime. The bi-objectivity is tackled through a hierarchic approach in which makespan is the primary objective and flowtime a secondary one. An extensive experimental study was first conducted in order to fine-tune the parameters of our TS algorithm. Then, our tuned TS was compared against two well-known TS algorithms in the literature for the problem (one of them hybridized with an ant colony optimization algorithm). The computational results show that our TS implementation clearly outperforms the compared algorithms. Finally, we evaluated the performance of our TS algorithm on a new set of instances that better fit the concept of a computational grid. These instances are composed of a larger number of heterogeneous machines (up to 256) and emulate the dynamic behavior of these systems.
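In the same spirit, the core loop of a tabu search for this scheduling problem can be sketched as follows. The move type (transfer one job to another machine), the tabu tenure, the stopping rule and the cost model are illustrative guesses, not the tuned algorithm evaluated in the paper:

```python
import random
from collections import deque

def objectives(assignment, etc):
    """(makespan, flowtime) of one schedule; etc[j][m] is the expected
    time to compute job j on machine m."""
    finish = [0.0] * len(etc[0])
    flow = 0.0
    for j, m in enumerate(assignment):
        finish[m] += etc[j][m]
        flow += finish[m]
    return max(finish), flow

def tabu_schedule(etc, iters=200, tenure=3, seed=0):
    """Minimal tabu search over job-transfer moves: each step applies the
    best non-tabu move (even if worsening, to escape local optima) and
    forbids the reverse move for `tenure` iterations. Makespan is the
    primary objective; flowtime breaks ties."""
    rng = random.Random(seed)
    n_jobs, n_machines = len(etc), len(etc[0])
    current = [rng.randrange(n_machines) for _ in range(n_jobs)]
    best, best_val = current[:], objectives(current, etc)
    tabu = deque(maxlen=tenure)        # recently forbidden (job, machine) moves
    for _ in range(iters):
        candidates = []
        for j in range(n_jobs):
            for m in range(n_machines):
                if m == current[j] or (j, m) in tabu:
                    continue
                neigh = current[:]
                neigh[j] = m
                candidates.append((objectives(neigh, etc), (j, current[j]), neigh))
        if not candidates:
            break
        val, reverse_move, neigh = min(candidates)
        tabu.append(reverse_move)      # forbid undoing this move for a while
        current = neigh
        if val < best_val:             # keep the best schedule seen so far
            best, best_val = neigh[:], val
    return best, best_val
```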

    Online error detection and correction of erratic bits in register files

    The aggressive voltage scaling needed for low power in each new process generation causes large deviations in the threshold voltage of minimally sized devices of the 6T SRAM cell. Gate oxide scaling can cause large transient gate leakage (a trap in the gate oxide), which is known as the erratic bit phenomenon. Register file protection is necessary to prevent errors from quickly spreading to different parts of the system, which may cause applications to crash or suffer silent data corruption. This paper proposes a simple and cost-effective mechanism that increases the resiliency of register files to erratic bits. Our mechanism detects which registers have erratic bits, recovers from the error and quarantines the faulty registers. After the quarantine period, it is able to detect, with low overhead, whether they are fully operational again.
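The detect/quarantine/retest flow described above can be caricatured in software. Everything below, the quarantine interval, the march-style retest and the free-list model, is an illustrative assumption; the paper's actual mechanism is a microarchitectural design, not this code:

```python
class RegisterGuard:
    """Illustrative quarantine bookkeeping for a physical register file:
    a register reported as erratic is pulled from the free list and only
    released after it survives a write/read retest once its quarantine
    period has elapsed."""
    QUARANTINE = 1000   # retest interval in 'cycles' (arbitrary value)

    def __init__(self, n_regs):
        self.free = set(range(n_regs))
        self.quarantined = {}          # reg -> cycle when retest is allowed

    def report_error(self, reg, now):
        """Detection/recovery happened elsewhere; here we just quarantine."""
        self.free.discard(reg)
        self.quarantined[reg] = now + self.QUARANTINE

    def retest(self, reg, now, storage):
        """March-like test: if the cell holds both all-0 and all-1 patterns,
        the erratic phase has passed and the register is released."""
        if now < self.quarantined.get(reg, float("inf")):
            return False
        for pattern in (0x00000000, 0xFFFFFFFF):
            storage[reg] = pattern
            if storage[reg] != pattern:        # still erratic: re-quarantine
                self.quarantined[reg] = now + self.QUARANTINE
                return False
        del self.quarantined[reg]
        self.free.add(reg)
        return True
```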

    A Customized Pigmentation SNP Array Identifies a Novel SNP Associated with Melanoma Predisposition in the SLC45A2 Gene

    As the incidence of Malignant Melanoma (MM) reflects an interaction between skin colour and UV exposure, variations in genes implicated in pigmentation and the tanning response to UV may be associated with susceptibility to MM. In this study, 363 SNPs in 65 gene regions belonging to the pigmentation pathway were successfully genotyped using a SNP array. Five hundred and ninety MM cases and 507 controls were analyzed in a discovery phase I. Ten candidate SNPs based on a p-value threshold of 0.01 were identified. Two of them, rs35414 (SLC45A2) and rs2069398 (SILV/CDK2), were statistically significant after conservative Bonferroni correction. The best six SNPs were further tested in an independent Spanish series (624 MM cases and 789 controls). A novel SNP located in the SLC45A2 gene (rs35414) was found to be significantly associated with melanoma in both phase I and phase II (p < 0.0001). None of the other five SNPs were replicated in this second phase of the study. However, three SNPs in the TYR, SILV/CDK2 and ADAMTS20 genes (rs17793678, rs2069398 and rs1510521, respectively) had an overall p-value < 0.05 when considering the whole DNA collection (1214 MM cases and 1296 controls). Both the SLC45A2 and the SILV/CDK2 variants behave as protective alleles, while the TYR and ADAMTS20 variants seem to function as risk alleles. Cumulative effects were detected when these four variants were considered together. Furthermore, individuals carrying two or more mutations in MC1R, a well-known low-penetrance melanoma-predisposing gene, had a decreased MM risk if concurrently bearing the SLC45A2 protective variant. To our knowledge, this is the largest study on Spanish sporadic MM cases to date.

    Impact of the first wave of the SARS-CoV-2 pandemic on the outcome of neurosurgical patients: A nationwide study in Spain

    Objective To assess the effect of the first wave of the SARS-CoV-2 pandemic on the outcome of neurosurgical patients in Spain. Settings The initial flood of COVID-19 patients overwhelmed an unprepared healthcare system, and different measures were taken to deal with this overburden. The effect of these measures on neurosurgical patients, as well as the effect of COVID-19 itself, has not been thoroughly studied. Participants This was a multicentre, nationwide, observational retrospective study of patients who underwent any neurosurgical operation from March to July 2020. Interventions An exploratory factorial analysis was performed to select the most relevant variables of the sample. Primary and secondary outcome measures Univariate and multivariate analyses were performed to identify independent predictors of mortality and of postoperative SARS-CoV-2 infection. Results Sixteen hospitals registered 1677 operated patients. The overall mortality was 6.4%, and 2.9% (44 patients) suffered a perioperative SARS-CoV-2 infection; 24 of those infections were diagnosed postoperatively. Age (OR 1.05), perioperative SARS-CoV-2 infection (OR 4.7), community COVID-19 incidence (cases/10⁵ people/week) (OR 1.006), postoperative neurological worsening (OR 5.9), postoperative need for airway support (OR 5.38), ASA grade ≥3 (OR 2.5) and preoperative GCS 3-8 (OR 2.82) were independently associated with mortality. For postoperative SARS-CoV-2 infection, a screening swab test <72 hours preoperatively (OR 0.76), community COVID-19 incidence (cases/10⁵ people/week) (OR 1.011), preoperative cognitive impairment (OR 2.784), postoperative sepsis (OR 3.807) and an absence of postoperative complications (OR 0.188) were independently associated. Conclusions Perioperative SARS-CoV-2 infection in neurosurgical patients was associated with an almost fivefold increase in mortality. Community COVID-19 incidence (cases/10⁵ people/week) was a statistically independent predictor of mortality. 
Trial registration number CEIM 20/217

    RESCUhE Project: Cultural Heritage vulnerability in a changing and directional climate

    The RESCUhE Project (Improving structural RESilience of Cultural HEritage to directional extreme hydro-meteorological events in the context of the Climate Change) is a coordinated IGME-UAM research project funded by the Spanish Government (MCIN/AEI/10.13039/501100011033). The framework of this research is the predicted increase in the climate change vulnerability of heritage sites and the current disconnect between environmental research on material decay and the practical aspects of designing preventive conservation measures.

    Epidemiological trends of HIV/HCV coinfection in Spain, 2015-2019

    Other funding: Spanish AIDS Research Network; European Funding for Regional Development (FEDER). Objectives: We assessed the prevalence of anti-hepatitis C virus (HCV) antibodies and active HCV infection (HCV-RNA-positive) in people living with HIV (PLWH) in Spain in 2019 and compared the results with those of four similar studies performed during 2015-2018. Methods: The study was performed in 41 centres. The sample size was estimated for an accuracy of 1%. Patients were selected by random sampling with proportional allocation. Results: The reference population comprised 41 973 PLWH, and the sample size was 1325. HCV serostatus was known in 1316 PLWH (99.3%), of whom 376 (28.6%) were HCV antibody (Ab)-positive (78.7% were prior injection drug users); 29 were HCV-RNA-positive (2.2%). Of the 29 HCV-RNA-positive PLWH, the infection was chronic in 24, acute/recent in one, and of unknown duration in four. Cirrhosis was present in 71 (5.4%) PLWH overall, in three (10.3%) HCV-RNA-positive patients and in 68 (23.4%) of those who cleared HCV after anti-HCV therapy (p = 0.04). The prevalence of anti-HCV antibodies decreased steadily from 37.7% in 2015 to 28.6% in 2019 (p < 0.001); the prevalence of active HCV infection decreased from 22.1% in 2015 to 2.2% in 2019 (p < 0.001). Uptake of anti-HCV treatment increased from 53.9% in 2015 to 95.0% in 2019 (p < 0.001). Conclusions: In Spain, the prevalence of active HCV infection among PLWH at the end of 2019 was 2.2%, i.e. 90.0% lower than in 2015. Increased exposure to direct-acting antiviral agents (DAAs) was probably the main reason for this sharp reduction. Despite the high coverage of DAA treatment, HCV-related cirrhosis remains significant in this population.

    CIBERER: Spanish national network for research on rare diseases: A highly productive collaborative initiative

    Other funding: Instituto de Salud Carlos III (ISCIII); Ministerio de Ciencia e Innovación. CIBER (Center for Biomedical Network Research; Centro de Investigación Biomédica En Red) is a public national consortium created in 2006 under the umbrella of the Spanish National Institute of Health Carlos III (ISCIII). This innovative research structure comprises 11 specific areas dedicated to the main public health priorities in the National Health System. CIBERER, the thematic area of CIBER focused on rare diseases (RDs), currently consists of 75 research groups belonging to universities, research centers and hospitals across the entire country. CIBERER's mission is to be a center that prioritizes and favors collaboration and cooperation between biomedical and clinical research groups, with special emphasis on the genetic, molecular, biochemical and cellular aspects of RD research. This research is the basis for providing new tools for the diagnosis and therapy of low-prevalence diseases, in line with the objectives of the International Rare Diseases Research Consortium (IRDiRC), thus favoring translational research between the scientific environment of the laboratory and the clinical setting of health centers. In this article, we review CIBERER's 15-year journey and summarize the main results obtained in terms of internationalization, scientific production, contributions toward the discovery of new therapies and of novel disease-associated genes, cooperation with patients' associations, and many other topics related to RD research.