Proceedings of the 49th Annual International Symposium on Computer Architecture
DOI: 10.1145/3470496.3533042
research-article

AI accelerator on IBM Telum processor: industrial product

Published: 11 June 2022

Abstract

IBM Telum is the next-generation processor chip for IBM Z and LinuxONE systems. The Telum design focuses on enterprise-class workloads and achieves over 40% per-socket performance growth compared to IBM z15. IBM Telum is the first server-class chip with a dedicated on-chip AI accelerator, which enables clients to gain real-time insights from their data as it is being processed.
Seamlessly infusing AI into all enterprise workloads is highly desirable: it yields real business insight on every transaction and improves IT operation, security, and data privacy. Although this would undeniably provide significant additional value, in practice it faces hurdles ranging from low throughput if inference runs on-platform to security concerns and inconsistent latency if it runs off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low-latency, high-throughput inference capacity (over 200 TFLOPS in a 32-chip system) usable by all threads. The accelerator is memory coherent and directly connected to the fabric like any other general-purpose core, so it supports low-latency inference while meeting the system's transaction rate. A scalable architecture provides transparent access to AI accelerator functions via a non-privileged general-purpose core instruction, which further reduces software orchestration and library complexity and makes the AI functions extensible. On a global bank customer credit card fraud detection model, the AI accelerator achieves a 22× speedup in latency compared to a general-purpose core using vector execution units. For the same model, the AI accelerator achieves 116k inferences per second with a latency of only 1.1 ms. As the system is scaled up from one chip to 32 chips, it performs more than 3.5 million inferences per second, and the latency stays very low at only 1.2 ms.
This paper briefly introduces the IBM Telum chip and later describes the integrated AI accelerator. IBM Telum's AI accelerator architecture, microarchitecture, integration into the system stack, performance, and power are covered in detail.
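The headline figures in the abstract can be related to each other with some simple arithmetic. The sketch below is not from the paper: it treats the quoted "over" and "more than" values as lower bounds, assumes the quoted throughput and latency were measured under the same steady-state load, and applies Little's law (mean concurrency = throughput × latency) to estimate how many inferences are in flight at once. The function names are hypothetical and exist only for this illustration.

```python
# Figures quoted in the abstract. Values stated as "over" / "more than"
# are treated as lower bounds (an assumption of this sketch).
SYSTEM_TFLOPS = 200.0            # "over 200 TFLOPS" in a 32-chip system
CHIPS = 32
ONE_CHIP_INF_PER_SEC = 116_000   # 116k inferences per second on one chip
ONE_CHIP_LATENCY_S = 1.1e-3      # 1.1 ms
SYSTEM_INF_PER_SEC = 3_500_000   # "more than 3.5 million" on 32 chips
SYSTEM_LATENCY_S = 1.2e-3        # 1.2 ms


def per_chip_tflops(system_tflops: float, chips: int) -> float:
    """Even split of the system-level compute figure across chips."""
    return system_tflops / chips


def scaling_efficiency(system_rate: float, one_chip_rate: float, chips: int) -> float:
    """Measured system throughput relative to perfect linear scaling."""
    return system_rate / (one_chip_rate * chips)


def in_flight(rate_per_sec: float, latency_s: float) -> float:
    """Little's law: mean number of requests in the system at steady state."""
    return rate_per_sec * latency_s


if __name__ == "__main__":
    print(f"per-chip compute:     {per_chip_tflops(SYSTEM_TFLOPS, CHIPS):.2f} TFLOPS")
    eff = scaling_efficiency(SYSTEM_INF_PER_SEC, ONE_CHIP_INF_PER_SEC, CHIPS)
    print(f"scaling efficiency:   {eff:.1%} of linear")
    print(f"in flight (1 chip):   {in_flight(ONE_CHIP_INF_PER_SEC, ONE_CHIP_LATENCY_S):.1f}")
    print(f"in flight (32 chips): {in_flight(SYSTEM_INF_PER_SEC, SYSTEM_LATENCY_S):.0f}")
```

Under these assumptions the 32-chip throughput is at least about 94% of what perfect linear scaling from the single-chip figure would give, and roughly 128 inferences are in flight per chip at any moment. Both numbers are derived estimates, not results reported by the authors.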

[91]
Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, and Anand Raghunathan. 2017. ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24--28, 2017. ACM, 13--26.
[92]
Swagath Venkataramani, Vijayalakshmi Srinivasan, Wei Wang, Sanchari Sen, Jintao Zhang, Ankur Agrawal, Monodeep Kar, Shubham Jain, Alberto Mannari, Hoang Tran, Yulong Li, Eri Ogawa, Kazuaki Ishizaki, Hiroshi Inoue, Marcel Schaal, Mauricio J. Serrano, Jungwook Choi, Xiao Sun, Naigang Wang, Chia-Yu Chen, Allison Allain, James Bonanno, Nianzheng Cao, Robert Casatuta, Matthew Cohen, Bruce M. Fleischer, Michael Guillorn, Howard Haynie, Jinwook Jung, Mingu Kang, Kyu-Hyoun Kim, Siyu Koswatta, Sae Kyu Lee, Martin Lutz, Silvia Mueller, Jinwook Oh, Ashish Ranjan, Zhibin Ren, Scot Rider, Kerstin Schelm, Michael Scheuermann, Joel Silberman, Jie Yang, Vidhi Zalani, Xin Zhang, Ching Zhou, Matthew M. Ziegler, Vinay Shah, Moriyoshi Ohara, Pong-Fei Lu, Brian W. Curran, Sunil Shukla, Leland Chang, and Kailash Gopalakrishnan. 2021. RaPiD: AI Accelerator for Ultra-low Precision Training and Inference. In 48th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2021, Valencia, Spain, June 14--18, 2021. IEEE, 153--166.
[93]
Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to Sequence - Video to Text. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7--13, 2015. IEEE Computer Society, 4534--4542.
[94]
Christos Vezyrtzis, Thomas Strach, Pierce I-Jen Chuang, Preetham Lobo, Richard F. Rizzolo, Tobias Webel, Pawel Owczarczyk, Alper Buyuktosunoglu, Ramon Bertran, David T. Hui, Susan M. Eickhoff, Michael S. Floyd, Gerard Salem, Sean M. Carey, Stelios G. Tsapepas, and Phillip J. Restle. 2018. Droop mitigation using critical-path sensors and an on-chip distributed power supply estimation engine in the z14™ enterprise processor. In 2018 IEEE International Solid-State Circuits Conference, ISSCC 2018, San Francisco, CA, USA, February 11--15, 2018. IEEE, 300--302.
[95]
Craig R. Walters, David S. Hutton, Edward W. Chencinski, Christine Axnix, Ralf Winkelmann, Michael F. Fee, Anthony Saporito, and Christian Jacobi. 2018. Performance innovations in the IBM z14 platform. IBM J. Res. Dev. 62, 2/3 (2018), 7:1--7:11.
[96]
Craig R. Walters, Pak-kin Mak, D. P. D. Berger, Michael A. Blake, Tim Bronson, K. D. Klapproth, A. J. O'Neill, Robert J. Sonnelitter, and Vesselina K. Papazova. 2015. The IBM z13 processor cache subsystem. IBM J. Res. Dev. 59, 4/5 (2015), 5:1--5:11.
[97]
Charles F. Webb. 2021. Microprocessor Advances and the Mainframe Legacy. IEEE Micro 41, 6 (2021), 68--70.
[98]
Tobias Webel, Preetham M. Lobo, Ramon Bertran, Gerard Salem, Malcolm Allen-Ware, Richard F. Rizzolo, Sean M. Carey, Thomas Strach, Alper Buyuktosunoglu, Charles Lefurgy, Pradip Bose, Ricardo Nigaglioni, Timothy J. Slegel, Michael S. Floyd, and Brian W. Curran. 2015. Robust power management in the IBM z13. IBM J. Res. Dev. 59, 4/5 (2015), 16:1--16:12.
[99]
Tobias Webel, Preetham M. Lobo, Thomas Strach, Pradeep Bhadravati Parashurama, Srinivas Purushotham, Ramon Bertran, and Alper Buyuktosunoglu. 2020. Proactive power management in IBM z15. IBM J. Res. Dev. 64, 5/6 (2020), 15:1--15:12.
[100]
Ofri Wechsler, Michael Behar, and Bharat Daga. 2019. Spring Hill (NNP-I 1000) Intel's Data Center Inference Chip. In 2019 IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA, USA, August 18--20, 2019. IEEE, 1--12.
[101]
David Wolpert, Christopher J. Berry, Brian Bell, Adam Jatkowski, Jesse Surprise, John Isakson, Ofer Geva, Brian Deskin, Mark Cichanowski, Dina Hamid, Chris Cavitt, Gregory Fredeman, Dinesh Kannambadi, Anthony Saporito, Ashutosh Mishra, Alper Buyuktosunoglu, Tobias Webel, Preetham Lobo, Ramon Bertran, Pradeep Bhadravati Parashurama, Dureseti Chidambarrao, Brandon Bruen, Alan P. Wagstaff, Eric Lukes, Sean M. Carey, Hunter F. Shi, Michael Romain, Paul Logsdon, and Ishita Agarwal. 2021. Cores, Cache, Content, and Characterization: IBM's Second Generation 14-nm Product, z15. IEEE J. Solid State Circuits 56, 1 (2021), 98--111.
[102]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs/1609.08144 (2016), 1--23. arXiv:1609.08144 http://arxiv.org/abs/1609.08144
[103]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, February 22--24, 2015, George A. Constantinides and Deming Chen (Eds.). ACM, 161--170.
[104]
Jintao Zhang, Zhuo Wang, and Naveen Verma. 2017. In-Memory Computation of a Machine-Learning Classifier in a Standard 6T SRAM Array. IEEE J. Solid State Circuits 52, 4 (2017), 915--924.
[105]
Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David S. Kung, and Michael Picheny. 2019. A Highly Efficient Distributed Deep Learning System for Automatic Speech Recognition. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15--19 September 2019, Gernot Kubin and Zdravko Kacic (Eds.). ISCA, 2628--2632.
[106]
Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David S. Kung, and Michael Picheny. 2020. Improving Efficiency in Large-Scale Decentralized Distributed Training. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4--8, 2020. IEEE, 3022--3026.
[107]
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. Computer Vision Foundation / IEEE Computer Society, 6848--6856.
[108]
Matthew M. Ziegler, Ramon Bertran, Alper Buyuktosunoglu, and Pradip Bose. 2017. Machine learning techniques for taming the complexity of modern hardware design. IBM J. Res. Dev. 61, 4/5 (2017), 13:1--13:14.
[109]
Christian G. Zoellin, Oliver Draese, Jonathan D. Bradbury, Christian Jacobi, Aditya Puranik, Peter Sutton, and Robert Tokumaru. 2018. New database compression assists in the IBM z14 processor. IBM J. Res. Dev. 62, 2/3 (2018), 12:1--12:11.
[110]
Barret Zoph and Quoc V. Le. 2017. Neural Architecture Search with Reinforcement Learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Conference Track Proceedings. OpenReview.net, 1--16. https://openreview.net/forum?id=r1Ue8Hcxg

Cited By

  • (2024) Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. 10.1109/SC41406.2024.00100 (1-18). Online publication date: 17-Nov-2024
  • (2024) IBM z16 Modular Scalable Cache Hierarchy and Future Concept Applications. 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS). 10.1109/MWSCAS60917.2024.10658848 (893-897). Online publication date: 11-Aug-2024
  • (2024) Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA57654.2024.00083 (1043-1062). Online publication date: 2-Mar-2024

Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022
1097 pages
ISBN:9781450386104
DOI:10.1145/3470496
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS TCCA: IEEE Computer Society Technical Committee on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Author Tags

  1. AI on server-class processor
  2. Telum
  3. enterprise workload AI
  4. low-latency in-transaction inference
  5. on-chip AI accelerator
  6. z16

Qualifiers

  • Research-article

Conference

ISCA '22

Acceptance Rates

ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 298
  • Downloads (Last 6 weeks): 28
Reflects downloads up to 13 Nov 2024

Citations

  • (2024) Enterprise-Class Cache Compression Design. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA57654.2024.00080 (996-1011). Online publication date: 2-Mar-2024
  • (2024) CUTE. Journal of Systems Architecture: the EUROMICRO Journal. 10.1016/j.sysarc.2024.103106 149:C. Online publication date: 2-Jul-2024
  • (2023) Acceleration of Decision-Tree Ensemble Models on the IBM Telum Processor. 2023 IEEE International Symposium on Circuits and Systems (ISCAS). 10.1109/ISCAS46773.2023.10181908 (1-5). Online publication date: 21-May-2023
  • (2023) Characterization and Exploration of Latch Checkers for Efficient RAS Protection. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S). 10.1109/DSN-S58398.2023.00026 (63-69). Online publication date: Jun-2023
