In the last 5 years, the
remarkable performance achieved by machine learning in a variety of application areas
(natural language processing, computer vision, games, etc.) has
led to the emergence of heterogeneous architectures to accelerate
machine learning workloads. In parallel, production deployment,
model complexity and diversity pushed for higher productivity
systems, more powerful programming abstractions, software and
system architectures, dedicated runtime systems and numerical
libraries, deployment and analysis tools. Deep learning models are
generally memory and computationally intensive, for both training
and inference. Accelerating these operations has obvious
advantages: first, reducing energy consumption (e.g., in data
centers), and second, making these models usable on smaller
devices at the edge of the Internet. In addition, while
convolutional neural networks have motivated much of this effort,
numerous applications and models involve a wider variety of
operations, network architectures, and data processing. These
applications and models continually challenge computer
architecture, the system stack, and programming abstractions. The
high level of interest in these areas calls for a dedicated forum
to discuss emerging acceleration techniques and computation
paradigms for machine learning algorithms, as well as the
applications of machine learning to the construction of such systems.
The workshop brings together researchers and practitioners working on computing systems for machine learning, and using machine learning to build better computing systems. It also reaches out to a wider community interested in this rapidly growing area, to raise awareness of the existing efforts, to foster collaboration and the free exchange of ideas.
This builds on the success of the First AccML at HiPEAC 2020.
Topics of interest include (but are not limited to):
Novel ML systems: heterogeneous multi/many-core systems, GPUs and FPGAs;
Novel ML hardware accelerators and associated software;
Emerging semiconductor technologies with applications to ML hardware acceleration;
ML for the construction and tuning of systems;
Cloud and edge ML computing: hardware and software to accelerate training and inference;
Computing systems research addressing the privacy and security of ML-dominated systems;
May 8, 2020
Notification to authors: May 20, 2020
Papers should be in double-column IEEE format and between 4 and 8 pages long, including references. Papers should be uploaded as PDF and should not be anonymized.
Submissions can be made at easychair.org/conferences/?conf=2ndaccml.
Papers will be reviewed by the workshop's technical program committee according to criteria regarding a submission's quality, relevance to the workshop's topics, and, foremost, its potential to spark discussions about directions, insights, and solutions on the topics mentioned above. Research papers, case studies, and position papers are all welcome.
The workshop does not have formal proceedings, so acceptance does not preclude publishing the work at future conferences and/or in journals.
Universitat Politècnica de Catalunya
Title: Removing Ineffectual Computations in Neural Networks
There is a growing interest in extending computing devices with the ability to analyze and understand signals and data coming from a large variety of activities in our daily lives, and to provide real-time responses in complex situations, with the goal of emulating human perception and problem solving. Examples include personal assistants, self-driving cars, domestic robots and health-care devices, just to name a few. Neural networks have proven to be an effective approach to support many of these functionalities.
Most of these systems have very limited energy budgets, so the effectiveness of this approach depends strongly on the energy efficiency of the adopted solution. In this talk we present several alternative directions for improving the energy efficiency of neural networks based on identifying and removing ineffectual computations.
Antonio González (Ph.D. 1989) is a Full Professor at the Computer Architecture Department of the Universitat Politècnica de Catalunya, Barcelona (Spain), and the director of the Architecture and Compiler research group. He was the founding director of the Intel Barcelona Research Center from 2002 to 2014. His research has focused on computer architecture. In this area, Antonio holds 52 patents, has published over 370 research papers and has given over 120 invited talks. He has also made multiple contributions to the design of the architecture of several commercial microprocessors.
Antonio has been program chair for ICS, ISPASS, MICRO, HPCA and ISCA, and general chair for MICRO and HPCA among other symposia. He has served on the program committee for over 130 international symposia in the field of computer architecture, and has been Associate Editor of the IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Computer Architecture Letters, ACM Transactions on Architecture and Code Optimization, ACM Transactions on Parallel Computing, and Journal of Embedded Computing.
Antonio’s awards include the award to the best student in computer engineering in Spain, the Rosina Ribalta award as the advisor of the best PhD project in Information Technology and Communications, the Duran Farrell award for research in technology, the Aritmel National Award of Informatics to the Computer Engineer of the Year, the King James I award for his contributions in research on new technologies, and the ICREA Academia Award. He is an IEEE Fellow.
Title: Scaling Machine Learning Workloads on Today’s GPUs
Machine learning applications place large computational demands on hardware resources when performing classification, regression, clustering and training. What is common in many of these applications is that the quality of the outcome or model improves as we process more data. GPUs have been shown to be an effective platform for accelerating machine learning workloads, though there are limits to how much a single GPU can process. This talk will look at ongoing work in hardware compaction and multi-GPU acceleration, enabling further scaling of machine learning workloads.
David Kaeli received his BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He is presently a COE Distinguished Full Professor on the ECE faculty at Northeastern University, Boston, MA. Dr. Kaeli has published over 350 critically reviewed publications, 7 books, and 13 patents. He serves as the Editor in Chief of ACM Transactions on Architecture and Code Optimization, and an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems and the Journal of Parallel and Distributed Computing. Dr. Kaeli is an IEEE Fellow and an ACM Distinguished Scientist.
Title: A Communication-Centric Approach for Designing Flexible DNN Accelerators
Deep Neural Networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly.
DNNs are evolving at a rapid rate, leading to myriad layer types (convolution, attention, LSTM, MLP) of varying shape (regular and irregular). Given a DNN, there can be myriad computationally efficient implementations (e.g., via pruning), leading to structured and unstructured sparsity. Finally, a given DNN can be tiled and partitioned in myriad ways to exploit data reuse. All of the above can lead to irregular dataflow patterns within the accelerator substrate. Achieving high mapping efficiency in all these cases is highly challenging for today's accelerators, which are often tightly coupled 2D grids with rigid near-neighbor connectivity.
First, given a target DNN, we will demonstrate a systematic methodology for understanding data reuse opportunities within the algorithm and determining the cost versus benefit of exploiting them efficiently in hardware, using our dataflow + microarchitectural model called MAESTRO (MICRO 2019 + IEEE Micro Top Picks). Next, we present a systematic communication-centric methodology for accelerator design that can provide ~100% efficiency for arbitrary DNN shapes, sparsity ratios and mappings. We demonstrate instances of this approach with two accelerators, MAERI (ASPLOS 2018 + IEEE Micro Top Picks Honorable Mention) and SIGMA (HPCA 2020 + Best Paper Award), that show orders of magnitude better utilization over state-of-the-art baselines like NVIDIA's NVDLA and Google’s TPU.
Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering at Georgia Tech. He also holds the ON Semiconductor Junior Professorship. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), an M.S.E. in Electrical Engineering from Princeton University (2009), and a B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a post-doctoral researcher at Intel, Massachusetts.
Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC) and deep learning accelerators with a focus on optimizing data movement in modern computing systems. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and three have won best paper awards. He received the National Science Foundation (NSF) CRII award in 2018, and both a Google Faculty Award and a Facebook Faculty Award in 2019.
Title: Reflections on TPUs, Current Problems in Acceleration, and What's Next
Google's first TPU has been a remarkably successful accelerator, spawning a sequence of successors and inspiring a wave of new chips from established companies and startups. I'll start with some retrospection about what we got right and the ways in which we were lucky in building that first TPU. Then I'll pivot to the problems I think are currently hard and possibly underserved by our NN accelerator systems (to spoil: programmability, memory, and multi-tenancy). Lastly I'll speculate about where ML might take us: how much might the algorithms and computations change, the implications of the Accelerator Wall, and the virtuous feedback between algorithms and architecture that might be the basis of a true Golden Age for our field.
Cliff Young is a software engineer in the Google Brain team, where he works on codesign for deep learning accelerators. He is one of the designers of Google’s Tensor Processing Unit (TPU), which is used in production applications including Search, Maps, Photos, and Translate. TPUs also powered AlphaGo’s historic 4-1 victory over Go champion Lee Sedol. Previously, Cliff built special-purpose supercomputers for molecular dynamics at D. E. Shaw Research and worked at Bell Labs. Cliff holds AB, MS, and PhD degrees in computer science from Harvard University.
|Time (EDT/New York)||Virtual Event - 31st May 2020|
|9:00 AM – 9:10 AM||Welcome|
|9:10 AM – 9:50 AM||Invited talk: Removing Ineffectual Computations in Neural Networks. Antonio González, Universitat Politècnica de Catalunya|
|9:50 AM – 10:10 AM||Paper talk: You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference to ANN-Level Accuracy. Srivatsa P, Kyle Timothy Ng Chu, Yaswanth Tavva, Jibin Wu, Malu Zhang, Haizhou Li and Trevor E. Carlson|
|10:10 AM – 10:50 AM||Invited talk: Scaling Machine Learning Workloads on Today’s GPUs. David Kaeli, Northeastern University|
|10:50 AM – 11:10 AM||Paper talk: HCM: Hardware-Aware Complexity Metric for Neural Network Architectures. Alex Karbachevsky, Chaim Baskin, Evgenii Zheltonozhskii, Yevgeny Yermolin, Freddy Gabbay, Alexander Bronstein and Avi Mendelson|
|11:10 AM – 11:40 AM||Break|
|11:40 AM – 12:20 PM||Invited talk: A Communication-Centric Approach for Designing Flexible DNN Accelerators. Tushar Krishna, Georgia Tech|
|12:20 PM – 12:40 PM||Paper talk: STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators. Francisco Muñoz-Martínez, José L. Abellán, Manuel E. Acacio and Tushar Krishna|
|12:40 PM – 1:20 PM||Invited talk: Reflections on TPUs, Current Problems in Acceleration, and What's Next. Cliff Young, Google|
|1:20 PM – 1:40 PM||Paper talk: Statistical Robustness of MCMC Accelerators. Xiangyu Zhang, Ramin Bashizade, Yicheng Wang, Cheng Lyu, Sayan Mukherjee and Alvin R. Lebeck|
|1:40 PM – 2:00 PM||Paper talk: Acceleration Techniques for Sampling-based Machine Learning. Yanqi Liu, Ruth Iris Bahar and Giuseppe Calderoni|
|2:00 PM – 2:05 PM||Closing remarks|