Resilient neural network training for accelerators with computing errors

Xu, Dawen, Xing, Kouzi, Liu, Cheng, Wang, Ying, Dai, Yulin, Cheng, Long ORCID: 0000-0003-1638-059X, Li, Huawei and Zhang, Lei (2019) Resilient neural network training for accelerators with computing errors. In: 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 15-17 July 2019, New York, USA.

With the advancement of neural networks, customized accelerators are increasingly adopted in massive AI applications. To gain higher energy efficiency or performance, hardware design optimizations such as near-threshold logic or overclocking can be used. In these cases, computing errors may occur, and they are difficult to capture with conventional training on general-purpose processors (GPPs). Directly applying offline-trained neural network models to accelerators with errors may lead to considerable loss of prediction accuracy. To address this problem, we explore the resilience of neural network models and relax the accelerator design constraints to enable aggressive design options. First, we propose to train the neural network models using the accelerators' forward computing results so that the models learn both the data and the computing errors. In addition, we observe that some neural network layers are more sensitive to computing errors than others. Based on this observation, we schedule the most sensitive layer to the attached GPP to reduce the negative influence of the computing errors. According to the experiments, the neural network models obtained from the proposed training significantly outperform the original models when the CNN accelerators are affected by computing errors.
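
The training idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The error model (`faulty_matmul`, which perturbs a random fraction of outputs), the single-layer softmax classifier, and all names and parameter values are assumptions chosen for illustration only. The forward pass goes through the simulated error-prone accelerator, and the gradient step uses exact arithmetic, as it plausibly would on a general-purpose processor.

```python
import numpy as np


def faulty_matmul(x, w, rng, error_rate=0.01, magnitude=4.0):
    """Hypothetical stand-in for an error-prone accelerator.

    Computes an exact matrix product, then perturbs a random fraction
    (error_rate) of the outputs. The error model is an assumption.
    """
    out = x @ w
    mask = rng.random(out.shape) < error_rate
    noise = rng.normal(0.0, magnitude, size=out.shape)
    return np.where(mask, out + noise, out)


def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train(x, y, n_classes, rng, steps=200, lr=0.5, error_rate=0.01):
    """Train a linear softmax classifier on faulty forward results."""
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        # Forward pass on the simulated accelerator (may contain errors).
        logits = faulty_matmul(x, w, rng, error_rate)
        probs = softmax(logits)
        # Gradient computed with exact arithmetic from the faulty outputs.
        grad = x.T @ (probs - onehot) / len(x)
        w -= lr * grad
    return w


def accuracy(x, y, w, rng, error_rate):
    """Fraction of correct predictions at a given simulated error rate."""
    preds = faulty_matmul(x, w, rng, error_rate).argmax(axis=1)
    return float((preds == y).mean())
```

Calling `train` with a nonzero `error_rate` and then comparing `accuracy` at several error rates gives a rough feel for how a model exposed to errors during training behaves under them. The sketch does not model the layer-sensitivity scheduling mentioned in the abstract.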

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	Two dimensional displays; Hafnium; resilient training; CNN accelerator; relaxed design constraint; fault tolerance
Subjects:	UNSPECIFIED
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in:	2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE.
Publisher:	IEEE
Official URL:	http://dx.doi.org/10.1109/ASAP.2019.00-23
Copyright Information:	© 2019 IEEE
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	National Natural Science Foundation of China (No. 61874124), Chinese Academy of Sciences STS (No. KFJ-STS-SCYD-226), European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement (No. 799066)
ID Code:	24293
Deposited On:	20 Mar 2020 11:28 by Long Cheng. Last Modified 20 Mar 2020 11:28

DORAS | DCU Research Repository
