10,110 research outputs found
Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery
The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft
Reliable Messaging to Millions of Users with MigratoryData
Web-based notification services are used by a large range of businesses to
selectively distribute live updates to customers, following the
publish/subscribe (pub/sub) model. Typical deployments can involve millions of
subscribers expecting ordering and delivery guarantees together with low
latencies. Notification services must be vertically and horizontally scalable,
and adopt replication to provide a reliable service. We report our experience
building and operating MigratoryData, a highly-scalable notification service.
We discuss the typical requirements of MigratoryData customers, and describe
the architecture and design of the service, focusing on scalability and fault
tolerance. Our evaluation demonstrates the ability of MigratoryData to handle
millions of concurrent connections and support a reliable notification service
despite server failures and network disconnections
- …