Disk failure prediction based on multi-layer domain adaptive learning
Large-scale data storage is susceptible to failure. As disks are damaged and
replaced, traditional machine learning models, which rely on historical data to
make predictions, struggle to accurately predict disk failures. This paper
presents a novel method for predicting disk failures based on multi-layer
domain-adaptive learning. First, disk data with numerous faults is selected as
the source domain, and disk data with fewer faults as the target domain. A
feature extraction network is then trained on both domains; contrasting the
two domains facilitates the transfer of diagnostic knowledge from the source
domain to the target domain. Experimental results demonstrate that the
proposed technique produces a reliable prediction model and improves the
ability to predict failures on disk data with few failure samples
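The abstract does not specify the adaptation objective used to align the two domains. A common choice in domain-adaptive training is maximum mean discrepancy (MMD), which penalizes the distance between source and target feature distributions; the following NumPy sketch is illustrative only, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel between rows of a and b.
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def mmd2(source, target, gamma=1.0):
    """Squared maximum mean discrepancy between two feature batches.
    Smaller values mean the source and target feature distributions
    are better aligned, which is the goal of domain-adaptive training."""
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2 * k_st

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(64, 8))  # features from many-fault disks
tgt = rng.normal(0.5, 1.0, size=(64, 8))  # features from few-fault disks
print(mmd2(src, tgt) > mmd2(src, src))    # shifted domains score higher
```

In a full pipeline such a term would be added to the classification loss, so the feature extractor learns representations that both predict failures and look similar across domains.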
An improved CTGAN for data processing method of imbalanced disk failure
Disks generate little failure data, and the number of normal samples far
exceeds the number of failure samples. Existing Conditional Tabular Generative
Adversarial Network (CTGAN) deep learning methods have been proven effective
at addressing imbalanced disk failure data, but CTGAN cannot learn the
internal structure of the failure data well. This paper proposes a fault
diagnosis method based on an improved CTGAN, named Residual Conditional
Tabular Generative Adversarial Networks (RCTGAN), which adds a classifier for
category-specific discrimination and a discriminator built on a residual
network; the residual network is used to enhance the stability of the system.
First, RCTGAN uses a small amount of real failure data to synthesize fake
failure data; then, the synthesized data is mixed with the real data to
balance the numbers of normal and failure samples; finally, four classifier
models (multilayer perceptron, support vector machine, decision tree, random
forest) are trained on the balanced data set, and their performance is
evaluated using the G-mean. The experimental results show that the data
synthesized by RCTGAN further improves the fault diagnosis accuracy of the
classifiers
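The G-mean used for evaluation is the geometric mean of per-class recall (sensitivity and specificity in the binary case), which stays low whenever either class is poorly recognized. A minimal sketch of how it might be computed for binary disk labels:

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on failures) and
    specificity (recall on healthy disks), for binary labels
    where 1 = failure and 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# On an imbalanced set, predicting "healthy" for everything gets 95%
# accuracy but a G-mean of 0, exposing the useless classifier.
y_true = [0] * 95 + [1] * 5
print(g_mean(y_true, [0] * 100))  # 0.0
print(g_mean(y_true, y_true))     # 1.0
```

This is why G-mean, rather than plain accuracy, is the natural metric for imbalanced disk failure diagnosis.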
A Survey of Methods for Handling Disk Data Imbalance
Class imbalance exists in many classification problems, and because
classifiers are typically optimized for overall accuracy, imbalanced classes
cause the minority classes, which carry higher misclassification costs, to be
poorly classified. The Backblaze dataset, a widely used hard-disk dataset,
contains a small amount of failure data and a large amount of healthy data,
exhibiting serious class imbalance. This paper provides a comprehensive
overview of research on imbalanced data classification. The discussion is
organized into three main aspects: data-level methods, algorithm-level
methods, and hybrid methods. For each type of method, we summarize and analyze
the existing problems, algorithmic ideas, strengths, and weaknesses.
Additionally, the challenges of imbalanced data classification are discussed,
along with strategies to address them, making it convenient for researchers to
choose the appropriate method according to their needs
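As an illustration of the simplest data-level method such surveys cover, random oversampling duplicates minority-class samples until all classes are the same size (a sketch only; surveys of this area also cover more refined variants such as SMOTE and undersampling):

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (data-level balancing)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        picks = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(picks)
        out_y.extend([y] * target)
    return out_x, out_y

xs = [[i] for i in range(10)]
ys = [0] * 8 + [1] * 2           # 8 healthy vs 2 failed disks
bx, by = random_oversample(xs, ys)
print(by.count(0), by.count(1))  # 8 8
```

The trade-off noted in the literature is that exact duplication adds no new information and can encourage overfitting, which is what generative approaches like CTGAN aim to avoid.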
The 1993/1994 NASA Graduate Student Researchers Program
The NASA Graduate Student Researchers Program (GSRP) attempts to reach a culturally diverse group of promising U.S. graduate students whose research interests are compatible with NASA's programs in space science and aerospace technology. Each year we select approximately 100 new awardees based on competitive evaluation of their academic qualifications, their proposed research plan and/or plan of study, and their planned utilization of NASA research facilities. Fellowships of up to $22,000 are awarded for one year and are renewable, based on satisfactory progress, for a total of three years. Approximately 300 graduate students are, thus, supported by this program at any one time. Students may apply any time during their graduate career or prior to receiving their baccalaureate degree. An applicant must be sponsored by his/her graduate department chair or faculty advisor; this book discusses the GSRP in great detail
Internet Predictions
More than a dozen leading experts give their opinions on where the Internet is headed and where it will be in the next decade in terms of technology, policy, and applications. They cover topics ranging from the Internet of Things to climate change to the digital storage of the future. A summary of the articles is available in the Web extras section
The 1991/92 graduate student researchers program, including the underrepresented minority focus component
The Graduate Student Researchers Program (GSRP) was expanded in 1987 to include the Underrepresented Minority Focus Component (UMFC). This program was designed to increase minority participation in graduate study and research, and ultimately, in space science and aerospace technology careers. This booklet presents the areas of research activities at NASA facilities for the GSRP and summarizes and presents the objectives of the UMFC
The 1995 NASA guide to graduate support
The future of the United States is in the classrooms of America and tomorrow's scientific and technological capabilities are derived from today's investments in research. In 1980, NASA initiated the Graduate Student Researchers Program (GSRP) to cultivate additional research ties to the academic community and to support promising students pursuing advanced degrees in science and engineering. Since then, approximately 1300 students have completed the program's requirements. In 1987, the program was expanded to include the Underrepresented Minority and Disabled Focus (UMDF) Component. This program was designed to increase participation of underrepresented groups in graduate study and research and, ultimately, in space science and aerospace technology careers. Approximately 270 minority students have completed the program's requirements while making significant contributions to the nation's aerospace efforts. Continuing to expand fellowship opportunities, NASA announced the Graduate Student Fellowships in Global Change Research in 1990. Designed to support the rapid growth in the study of earth as a system, more than 250 fellowships have been awarded. And, in 1992, NASA announced opportunities in the multiagency High Performance Computing and Communications (HPCC) Program designed to accelerate the development and application of massively parallel processing. Approximately five new fellowships will be awarded yearly. This booklet will guide you in your efforts to participate in programs for graduate student support
Social network support for data delivery infrastructures
Network infrastructures often need to stage content so that it is accessible to consumers. The standard solution, deploying the content on a centralised server, can be inadequate in several situations.
Our thesis is that information encoded in social networks can be used to tailor content staging decisions to the user base and thereby build better data delivery infrastructures. This claim is supported by two case studies, which apply social information in challenging situations where traditional content staging is infeasible. Our approach works by examining empirical traces to identify relevant social properties, and then exploits them.
The first study looks at cost-effectively serving the "Long Tail" of rich-media user-generated content, which needs to be staged close to viewers to control latency and jitter. Our traces show that a preference for the unpopular tail items often spreads virally and is localised to some part of the social network. Exploiting this, we propose Buzztraq, which decreases replication costs by selectively copying items to locations favoured by viral spread. We also design SpinThrift, which separates popular and unpopular content based on the relative proportion of viral accesses, and opportunistically spins down disks containing unpopular content, thereby saving energy.
The second study examines whether human face-to-face contacts can efficiently create paths over time between arbitrary users. Here, content is staged by spreading it through intermediate users until the destination is reached. Flooding every node minimises delivery times but is not scalable. We show that the human contact network is resilient to individual path failures, and for unicast paths, can efficiently approximate flooding in delivery time distribution simply by randomly sampling a handful of paths found by it. Multicast by contained flooding within a community is also efficient. However, connectivity relies on rare contacts and frequent contacts are often not useful for data delivery.
Also, periods of similar duration could achieve different levels of connectivity; we devise a test to identify good periods. We finish by discussing how these properties influence routing algorithms. This work was supported by a St. John's College Benefactor's Scholarship and a Research Studentship from the Cambridge Philosophical Society
Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters
Cloud computing is becoming a fundamental facility of society today. Large-scale public and private cloud datacenters spanning millions of servers, acting as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only hurts both the operational and capital components of cost efficiency, but also becomes a scaling bottleneck due to the limits on electricity delivered by nearby utilities. It is critical, and challenging, to improve multi-resource efficiency in global datacenters.
Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads including online web services, batch processing, machine learning, streaming computing, interactive query and graph computation on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks.
We survey datacenter resource scheduling work from the last 15 years. Most previous work is designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop, and is not suitable for modern long-running workloads in systems such as microservices, Spark, Flink, Pregel, Storm, or TensorFlow. New, effective scheduling and resource allocation approaches are urgently needed to improve efficiency in large-scale enterprise datacenters.
This dissertation is the first work to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. These workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, and rescheduling, which pushes us to pursue more intelligent scheduling built on adequate predictive knowledge. We specify what intelligent scheduling is, what abilities are necessary to achieve it, and how it can be leveraged to transform NP-hard online scheduling problems into tractable offline ones.
We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, memory, network, disk I/O) while strictly guaranteeing service level agreements (SLAs) for long-running workloads.
Finally, we introduce large-scale co-location techniques for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group, which improve cluster utilization from 10% to an average of 50%. This goes far beyond scheduling, involving the evolution of IDC design, networking, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness through analysis of the newest Alibaba public cluster trace, from 2017, and are the first to reveal a global view of the scenarios, challenges, and status of Alibaba's large-scale global datacenters, including big promotion events like Double 11.
Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to maximize multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads
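The abstract does not detail the reservation estimator. One simple predictive-reservation idea consistent with the description is to reserve a high percentile of observed historical usage plus safety headroom, trading utilization against SLA risk (an illustrative sketch, not the dissertation's actual model):

```python
def predictive_reservation(usage_history, percentile=99, headroom=0.1):
    """Reserve capacity for a long-running workload: take a high
    percentile of historical usage and add headroom, so the
    reservation is tight (high utilization) but rarely violated (SLA)."""
    s = sorted(usage_history)
    # Nearest-rank percentile, clamped to valid indices.
    idx = max(0, min(len(s) - 1, round(percentile / 100 * len(s)) - 1))
    return s[idx] * (1 + headroom)

# Hourly CPU-core usage samples for a hypothetical long-running service.
history = [4.0, 4.2, 3.9, 4.1, 5.0, 4.3, 4.0, 6.5, 4.2, 4.1]
reserved = predictive_reservation(history)
print(reserved >= max(history))  # covers the observed peak with headroom
```

Lowering the percentile or headroom frees capacity for co-located batch work at the cost of a higher chance of SLA violation, which is exactly the trade-off a QoS-aware scheduler must manage.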