UCL Discovery

Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems

Anarado, I; Andreopoulos, Y; (2016) Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems. IEEE Transactions on Multimedia, 18 (4) pp. 789-801. 10.1109/TMM.2016.2532603. Green open access

TMM-Jan-16-6556_2col_1space.pdf - Accepted Version


Abstract

The decreasing mean-time-to-failure estimates in cloud computing systems indicate that multimedia applications running on such environments should be able to mitigate an increasing number of core failures at runtime. We propose a new roll-forward failure-mitigation approach for integer sum-of-product computations, with emphasis on generic matrix multiplication (GEMM) and convolution/cross-correlation (CONV) routines. Our approach is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions that require a separate set of checksum (or duplicate) results. Our proposal imposes 37.5% reduction in the maximum output bitwidth supported in comparison to integer sum-of-product realizations performed on 32-bit integer representations, which is comparable to the bitwidth requirement of checksum methods for multiple core failure mitigation. Experiments with state-of-the-art GEMM and CONV routines running on a c4.8xlarge compute-optimized instance of Amazon Web Services Elastic Compute Cloud (AWS EC2) demonstrate that the proposed approach is able to mitigate up to one quad-core failure while achieving processing throughput that is: 1) comparable to that of the conventional, failure-intolerant, integer GEMM and CONV routines, 2) substantially superior to that of the equivalent roll-forward failure-mitigation method based on checksum streams. Furthermore, when used within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminatable) instances, our proposal leads to: 1) 16%-23% cost reduction against the equivalent checksum-based method and 2) more than 70% cost reduction against conventional failure-intolerant processing on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances.
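
The general idea of numerical packing mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the shift `K`, the two-stream layout, and the helper names `pack`, `dot`, and `unpack` are assumptions made only for illustration. Two input streams are packed into one integer stream, so a single sum-of-products yields both results in one output word; if each simulated core also carries the other core's stream in its upper bits, a lost result can be read back from the surviving output without a separate checksum stream. The price is reduced headroom: each result must fit in `K` bits.

```python
# Illustrative sketch only; K and all helper names are hypothetical,
# not taken from the paper.
K = 20  # packing shift: each unpacked result must satisfy |result| < 2**(K-1)

def pack(lo, hi, k=K):
    """Pack two integer streams into one: lo in the low bits, hi shifted up by k."""
    return [l + (h << k) for l, h in zip(lo, hi)]

def dot(x, y):
    """Plain integer sum-of-products."""
    return sum(a * b for a, b in zip(x, y))

def unpack(p, k=K):
    """Split a packed result into its signed (low, high) parts."""
    half = 1 << (k - 1)
    lo = ((p + half) % (1 << k)) - half  # signed low part
    hi = (p - lo) >> k                   # exact: p - lo is a multiple of 2**k
    return lo, hi

a0 = [3, -7, 12, 5]     # stream owned by simulated core 0
a1 = [-2, 9, 4, -11]    # stream owned by simulated core 1
b = [6, 1, -8, 10]      # shared operand

# Each core computes one packed sum-of-products: its own stream in the
# low bits, the other core's stream in the high bits.
core0_out = dot(pack(a0, a1), b)
core1_out = dot(pack(a1, a0), b)

# Simulate losing core 0: its result is recovered from core 1's high part.
own_result, recovered_s0 = unpack(core1_out)
print(own_result, recovered_s0)
```

In this sketch the redundancy comes from the packed representation itself, with the same number of multiply-accumulate steps per core as the unpacked version; the trade-off is the narrower usable output range, which is the kind of bitwidth reduction the abstract quantifies.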

Type: Article
Title: Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/TMM.2016.2532603
Publisher version: http://dx.doi.org/10.1109/TMM.2016.2532603
Language: English
Additional information: Copyright © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords: integer matrix products, convolution, core failures, multimedia cloud computing
UCL classification: UCL
UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Electronic and Electrical Eng
URI: https://discovery.ucl.ac.uk/id/eprint/1505955