Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators (2312.16436v1)

Published 27 Dec 2023 in cs.AR

Abstract: Chiplet technology enables the integration of an increasing number of transistors on a single accelerator with higher yield in the post-Moore era, addressing the immense computational demands arising from rapid AI advancements. However, it also introduces more expensive packaging and costly Die-to-Die (D2D) interfaces, which require more area, consume more power, and offer lower bandwidth than on-chip interconnects. Maximizing the benefits and minimizing the drawbacks of chiplet technology is crucial for developing large-scale DNN chiplet accelerators, which poses challenges to both architecture and mapping. Despite its importance in the post-Moore era, methods to address these challenges remain scarce.
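The yield argument in the abstract can be made concrete with the standard Poisson die-yield model: splitting one large die into several smaller chiplets raises per-die yield sharply, at the price of a packaging and D2D overhead. The sketch below is illustrative only; the defect density, die areas, and overhead figures are assumptions, not values from the paper.

```python
import math

def die_yield(area_cm2, defect_density):
    """Poisson yield model: fraction of dies with zero defects."""
    return math.exp(-area_cm2 * defect_density)

def cost_per_good_die(area_cm2, defect_density, cost_per_cm2=1.0):
    """Silicon cost amortized over the good dies only."""
    return area_cm2 * cost_per_cm2 / die_yield(area_cm2, defect_density)

D0 = 0.5  # defects per cm^2 (illustrative assumption)

# One monolithic 8 cm^2 die vs. four 2 cm^2 chiplets.
mono = cost_per_good_die(8.0, D0)
chiplet_silicon = 4 * cost_per_good_die(2.0, D0)
packaging_overhead = 1.5  # D2D interfaces + advanced packaging (assumed)
chiplet = chiplet_silicon + packaging_overhead

print(f"monolithic: {mono:.1f}, 4-chiplet: {chiplet:.1f}")
```

Even with a generous packaging overhead, the chiplet variant wins on cost here because yield decays exponentially with die area; the trade-off the paper targets is that the D2D links paying for this also constrain bandwidth, which is what couples the architecture and mapping problems.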
