Summary of the AI HARDWARE EUROPE SUMMIT (July 2020)
What I learned from the AI Hardware Europe Summit (July 2020): a very short summary of the presentations over 3 days. You can access all the presentations here.
Some of the problems FPGAs address:
- excessive heat
- electricity consumption
- sensitivity to environmental factors and motion
- limited lifespan
The goal is artificial general intelligence (AGI), and AI software and hardware must work together to achieve it. Much of the research in this area develops new algorithms and new hardware together to reach high performance. Most of the new hardware works with TensorFlow and PyTorch. In most cases, new hardware comes with a software solution that performs well in a specific use case or scenario. However, some of the new hardware, like Embedded FPGAs (eFPGAs), can be modified by programmers to implement their own requirements.
Some of the best presentations were: "Machine Intelligent Systems & Software" by Victoria Rege from Graphcore; "Leveraging sparsity to enable ultra-low latency inferences demonstrated using GrAI One" by Orlando Moreira and Remi Poittevin from GrAI Matter Labs; "Challenges for Using Machine Learning in Industrial Automation" by Ingo Thon from Siemens [convolutional autoencoder for anomaly detection]; "From Training to Production Inference for Automotive AI - Transforming Research into Reality" (from laboratory training to automotive inference: the realities of embedding AI) by Tony King-Smith and Marton Feher from AImotive; "Towards Embedded Intelligence" by Michaela Blott from Xilinx; and, finally, "MLIR: Accelerating Artificial Intelligence" by Albert Cohen from Google.
Some keywords for the event: FINN, an experimental framework from Xilinx Research Labs to explore deep neural network inference on FPGAs; the MLIR project, a novel approach to building reusable and extensible compiler infrastructure; fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services at mlperf.org; and the Graphcore IPU, a new type of processor for AI with very high performance.
Agenda:
Day 1:
OPENING KEYNOTE: Past Chip Childhood and System Teenage: Why We Need to Build a Mature Ecosystem
Olivier Temam - DeepMind
PRESENTATION: Power and Cost Efficient High Performance Inference at the Edge
Geoff Tate - Flex Logix
*PRESENTATION: Machine Intelligent Systems & Software
Victoria Rege - Graphcore
PANEL: High Performance and Low Energy Consumption: Developments in Performing at the Edge
Moderator - Luca Benini - Professor
Orr Danon - Hailo Technologies
Eric Flamand - GreenWaves Technologies
Ask an Analyst: Moderated Q&A and Group Discussion
Michael Azoff - Kisaco Research
Brett Simpson - Arete Research
**PRESENTATION: Leveraging sparsity to enable ultra-low latency inferences demonstrated using GrAI One
Orlando Moreira - GrAI Matter Labs
Remi Poittevin - GrAI Matter Labs
*PANEL: Investment Trends & Dynamics in AI Hardware and the Startup Ecosystem
Moderator - Brett Simpson - Arete Research
Christian Patze - M Ventures
Sascha Fritz - Robert Bosch Venture Capital GmbH
Day 2:
*PRESENTATION: Challenges for Using Machine Learning in Industrial Automation
Ingo Thon - Siemens
PANEL: Designing Safe, Power-Efficient and Affordable Autonomous Systems
Moderator - Robert Krutsch - Zenuity
Arnaud Van Den Bossche - NXP
Gordon Cooper - Synopsys
ASK AN EXPERT: Moderated Q&A and Group Discussion About Possibilities For Real Time AI enabled by Edge Compute
Eric Flamand - GreenWaves Technologies
*PRESENTATION: From Training to Production Inference for Automotive AI - Transforming Research into Reality
Tony King-Smith - AImotive
Marton Feher - AImotive
PRESENTATION: Neuromorphic Computing at BMW
Oliver Wick - BMW
PANEL: Applications of Neuromorphic Hardware in Industry, Automotive & Robotics
Moderator - Yulia Sandamirskaya - Intel
Christian Mayr - TU Dresden
Steve Furber - University of Manchester
Day 3:
PRESENTATION: Computer Architecture - The Next Step: Energy Efficient Machine Learning
Uri Weiser - Technion
PRESENTATION: Energy Efficient AI and Carbon Offsetting
Rick Calle - Microsoft
PRESENTATION: Edge Processors for Deep Learning - Practical Perspectives
Orr Danon - Hailo Technologies
PRESENTATION - Why heterogeneous multi-processing is a critical requirement for Edge Computing: Example of Automotive
Eric Baissus - Kalray
**PRESENTATION: Towards Embedded Intelligence
Michaela Blott - Xilinx
**PRESENTATION: MLIR: Accelerating Artificial Intelligence
Albert Cohen - Google
OPENING KEYNOTE: Past Chip Childhood and System Teenage: Why We Need to Build a Mature Ecosystem
Olivier Temam - DeepMind
Temam @google.com
AI progress comes at a heavy computing cost.
DRL, AutoML, and AGI have ever-higher computing requirements.
Challenges:
Hyper-focused on chips
Idiosyncratic hardware
AI algorithms evolve very fast
Efficiency/flexibility tradeoff
He talked about AI chips and said that AI progress comes at a heavy computing cost: algorithms like DRL, AutoML, and AGI need ever-higher computing resources. The challenges for AI chips are the tradeoff between efficiency and flexibility; AI algorithms evolving very fast, which changes the processing methods; idiosyncratic hardware for specific domains; and being hyper-focused on chips rather than on the whole system.
I was impressed by the "Accelerator Benchmarking on Real Edge-Inference Applications" presentation at the AI HW Summit, and I would like to see the live demo of InferX X1 in October. Do you have any benchmarks of InferX X1 for deep reinforcement learning?
PRESENTATION: Power and Cost Efficient High Performance Inference at the Edge
Geoff Tate - Flex Logix; geoff@flex-logix.com
Accelerator Benchmarking on Real Edge-Inference Applications
He talked about Embedded FPGAs (eFPGAs) and a new product, InferX X1, which is coming soon. It has a TDP of 7-13W, compared to 15W for the Nvidia Xavier NX. He mentioned that ResNet-50 with a 224x224 image size will not tell you about the robustness of the memory system required for megapixel images, which is why it is not good for comparison; the best model for customers to use and compare is YOLOv3. YOLOv3 intermediate activations peak at 64MB for 2-megapixel images, which stresses memory subsystems much more than ResNet-50. Their chip is efficient because of data packing and transposition: it allows efficient 3D convolutions with a dedicated RAM-to-compute-to-RAM path for each layer, and deep layer fusion reduces memory requirements while DRAM access is "hidden" in the background.
Embedded FPGA (eFPGA): clustering MACs with a reconfigurable interconnect delivers high inference throughput at low cost. InferX X1 is in fab now, TDP 7-13W (compare the Nvidia Xavier NX at 15W); no PyTorch support (yet). The best model for customers to use and compare is YOLOv3: its intermediate activations peak at 64MB for 2-megapixel images, which stresses memory subsystems much more than ResNet-50. The usual default image size is 224x224, but the intermediate activations (at the largest layer) grow rapidly as the image size increases.
ResNet-50 with a 224x224 image size will not tell you about the robustness of the memory system required for megapixel images; that is why it is not good for comparison.
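To make the scaling concrete, here is a back-of-the-envelope sketch (my own illustrative numbers and layer shapes, not Flex Logix's): the size of a convolutional activation map is height x width x channels, so it grows roughly linearly with the pixel count of the input.

```python
# Back-of-the-envelope activation sizes (illustrative shapes, not from the talk).
def activation_mb(height, width, channels, bytes_per_value=1):
    """Size of one activation tensor in megabytes (int8 by default)."""
    return height * width * channels * bytes_per_value / 2**20

# A hypothetical early layer with 64 channels at half the input resolution:
print(activation_mb(112, 112, 64))   # ~0.77 MB for a 224x224 input
print(activation_mb(540, 960, 64))   # ~31.6 MB for a ~2 MP (1080x1920) input
# The megapixel case no longer fits in typical on-chip SRAM, so DRAM
# bandwidth and tricks like layer fusion start to dominate performance.
```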
Nvidia Xavier NX has 3 inference engines (GPU plus 2x DLA): in a self-driving car, multiple models run simultaneously.
But in most AI systems there is one camera, one model, one system, processing images frame by frame: if the data stream comes in at 15 FPS, it is one image at a time at 15 FPS.
Key to InferX X1 efficiency: data packing and transposition; efficient 3D convolutions; a dedicated RAM-to-compute-to-RAM path for each layer; deep layer fusion, which reduces memory requirements; and DRAM access "hidden" in the background.
TSMC, GUC, Synopsys, Arteris, Analog Bits, Cadence, Mentor
I was impressed by the "Machine Intelligent Systems & Software" GRAPHCORE IPU presentation at the AI HW Summit. Do you have any benchmarks of the Graphcore IPU for deep reinforcement learning?
PRESENTATION: Machine Intelligent Systems & Software
Victoria Rege - Graphcore; victoria@graphcore.ai info@graphcore.ai ; @fleurdevie
This presentation was one of the best. She talked about the Graphcore IPU (Intelligence Processing Unit) and the Poplar SDK. The Graphcore IPU can train a sample model in 3 hours that requires 40 hours on a GPU. In another use case, IPU-accelerated medical imaging on Azure can process 2000 images/sec, compared to 166 images/sec on a GPU.
GRAPHCORE IPU (INTELLIGENCE PROCESSING UNIT)
The Poplar SDK
https://www.graphcore.ai/finance (40 hours on GPU down to 3 hours)
Running a PyTorch model on the Graphcore IPU: ResNeXt-101 example (see the sketch below)
IPU-accelerated medical imaging on Azure (IPU 2000 images/sec vs 166 images/sec)
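As a rough illustration of that PyTorch route, here is a minimal sketch using PopTorch, Graphcore's PyTorch integration (my own example, assuming the Poplar SDK and poptorch are installed; the model and options are illustrative, not taken from the talk):

```python
# Minimal PopTorch inference sketch (illustrative; requires the Poplar SDK).
import torch
import torchvision
import poptorch

model = torchvision.models.resnext101_32x8d(pretrained=True)
model.eval()

opts = poptorch.Options()                     # defaults: a single IPU
ipu_model = poptorch.inferenceModel(model, opts)

x = torch.randn(1, 3, 224, 224)               # dummy image batch
with torch.no_grad():
    logits = ipu_model(x)                     # first call compiles for the IPU
print(logits.argmax(dim=1))
```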
PANEL: High Performance and Low Energy Consumption: Developments in Performing at the Edge
Moderator - Luca Benini - Professor: ML processors
Orr Danon - Hailo Technologies
Eric Flamand - GreenWaves Technologies
Ask an Analyst: Moderated Q&A and Group Discussion
Michael Azoff - Kisaco Research
Brett Simpson - Arete Research
PRESENTATION: Leveraging sparsity to enable ultra-low latency inferences demonstrated using GrAI One
*Orlando Moreira - GrAI Matter Labs
Remi Poittevin - GrAI Matter Labs
***
Edge workloads involve real time: responsive smart devices, closely coupled feedback loops, autonomous systems.
Input data streams are continuous: video/audio feeds, industrial sensor ensembles, bio-signals (EEG, EKG, movement).
The data rate is much higher than the real information rate:
voice: 512 bits/s of data -> ~39 bits/s of information
UXGA video: 79 MB/s of data, of which roughly 95% is redundant
Frame-based processing: apply a single-frame algorithm independently to each input frame in a stream.
Advantages:
Many popular sensors are frame based
Simple, easy to scale: image -> video stream
Disadvantage:
Repeated and redundant data is processed over and over
Sparsity in video
Sparsity in structure: pruning of needless weights and kernels in the network
Sparsity in space: most pixels in an image have no relevant feature data; this results in 0-valued activations
Sparsity in time: the image changes little from instant to instant; why should we always re-process the whole frame?
Event-based computation of networks (process only the data that changes): single events in an input layer fan out (typically 1:9 to 1:49 per feature map); only the affected pixels in the convolutional layer need to be computed; locality of change is preserved downstream; events then fan in, typically 4:1, in pooling layers; additionally, an event is only ~25% likely to change the pixel state (typ. 2x2 max pooling).
SparNet: a sparse and event-based execution model
Exploits time-sparsity in a time series
Converts a frame-based network to event-based inference
Event-based: change is sent sporadically, so there is no frame structure to the input data
Only propagates changes, thus less work needs to be done
Requires resilient neuron state
Threshold: per neuron, defines how much change is needed to warrant propagation
To convert a CNN to SparNet, they set a threshold per neuron (a toy sketch of the idea follows below)
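Here is a toy sketch of that idea (my own, not GrAI Matter Labs' implementation): each neuron remembers the last value it sent, and only values whose change since then exceeds the per-neuron threshold are propagated as events.

```python
# Toy event-based propagation: only changes above a threshold become events.
import numpy as np

def event_updates(activations, last_sent, thresholds):
    """Return indices and values of neurons whose change warrants an event."""
    delta = np.abs(activations - last_sent)
    fired = delta > thresholds                  # mask of neurons that "fire"
    last_sent[fired] = activations[fired]       # update the stored state
    return np.nonzero(fired)[0], activations[fired]

rng = np.random.default_rng(0)
last_sent = rng.random(1000).astype(np.float32)    # persistent neuron state
thresholds = np.full(1000, 0.05, dtype=np.float32) # per-neuron thresholds

# A new "frame" that differs only slightly from the previous one:
frame = last_sent + rng.normal(0, 0.02, 1000).astype(np.float32)
idx, vals = event_updates(frame, last_sent, thresholds)
print(f"{len(vals)} of 1000 neurons propagate events")  # a small fraction
```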
PilotNet in SparNet:
Executing PilotNet with SparNet dramatically reduces the number of operations required.
The effect becomes dramatic at high FPS:
The same amount of change occurs per time interval,
but with frame-based processing the load increases linearly with frame rate.
Higher FPS => shorter sampling period => lower latency
Consequences of sparsity for computer architecture:
Requires that they store resilient neuron states
Suggests in/near memory computation
Frame structure is lost:
Event based scheduling of computation
Suggests data flow synchronization
Significantly fewer sequential memory accesses occur:
Reduced value of caching, network bursts, bulk DMA transfers
Reduced opportunities for latency hiding
Suggests in/near memory computation
Conclusion:
NeuronFlow is designed to exploit sparsity:
Sparsity in structure: NeuronFlow's event-based activation skips 0-weight (pruned) synapses/kernels.
Sparsity in space: in NeuronFlow, 0-valued activations are neither sent nor processed.
Sparsity in time: in NeuronFlow, if the change between frames is below the threshold, it is neither sent nor processed.
Neuron state to 200 to store the space/…..
PANEL: Investment Trends & Dynamics in AI Hardware and the Startup Ecosystem
Moderator - Brett Simpson - Arete Research
Christian Patze - M Ventures
Sascha Fritz - Robert Bosch Venture Capital GmbH
Day 2=========================================
PRESENTATION: Challenges for Using Machine Learning in Industrial Automation
Dr. Ingo Thon - Siemens; Ingo.Thon@siemens.com
He presented some challenges on the hardware side. Key notes: a chip for time-series data is missing, and AI should sit at the hardware level.
Imagine automating the unpredictable.
Drivers for new developments in industry: time to market, flexibility (PID), quality, efficiency.
AI should sit at the hardware level.
Visual quality inspection is solved, but still hard:
Convolutional autoencoder for anomaly detection (a sketch follows below); the Reptile algorithm.
The trick for a wide product line: after training, a pre-trained model can be adapted to other product types.
Cost comes down to reliability / performance / ease of use (manpower).
A chip for time-series data is missing.
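For reference, here is a minimal PyTorch sketch of the anomaly-detection idea mentioned above (the architecture is my own illustration, not Siemens'): train a convolutional autoencoder on images of good parts only, then flag images with unusually high reconstruction error as anomalies.

```python
# Minimal convolutional autoencoder for anomaly detection (illustrative).
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
x = torch.rand(8, 1, 64, 64)                 # a batch of "good part" images
recon = model(x)
loss = nn.functional.mse_loss(recon, x)      # training objective
# At inference time, per-image reconstruction error is the anomaly score:
scores = ((model(x) - x) ** 2).mean(dim=(1, 2, 3))
print(scores)                                # high score => likely anomaly
```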
PANEL: Designing Safe, Power-Efficient and Affordable Autonomous Systems
Moderator - Robert Krutsch - Zenuity
Arnaud Van Den Bossche - NXP
Gordon Cooper - Synopsys
ASK AN EXPERT: Moderated Q&A and Group Discussion About Possibilities For Real Time AI enabled by Edge Compute
Eric Flamand - GreenWaves Technologies
Watch again
PRESENTATION: From Training to Production Inference for Automotive AI - Transforming Research into Reality
Tony King-Smith - AImotive
Marton Feher - AImotive
From laboratory training to automotive inference: the realities of embedding AI
Convolution is not the same as matrix multiplication. Matrix multipliers are used extensively in GPUs and DSPs, so many algorithm implementations use them, BUT matrix multipliers need pre- and post-processing to reorder the data for convolution. We need to accelerate the NN algorithms, not just implementations of them. (See the im2col sketch below for the reordering overhead.)
Manual optimization; 5x5 convolution kernel; ReLU; 5x5 conv and 5x5 deconv.
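A small NumPy illustration of that reordering cost (my own, not AImotive's): to run a convolution on a matrix-multiply engine, the input is first rearranged into an "im2col" matrix, which duplicates every overlapping patch before the multiply and reshapes the result afterwards.

```python
# im2col-based convolution: the reorder is the pre/post-processing cost.
import numpy as np

def im2col(x, k):
    """Rearrange an HxW image into a matrix whose rows are k*k patches."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.random.rand(8, 8).astype(np.float32)
kernel = np.random.rand(5, 5).astype(np.float32)

patches = im2col(x, 5)          # pre-processing: reorder (and duplicate) data
y = patches @ kernel.ravel()    # the actual matrix multiply
y = y.reshape(4, 4)             # post-processing: back to an image layout

print(patches.shape)  # (16, 25): a 64-value image becomes 400 values,
                      # exactly the overhead a native convolution engine avoids
```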
PRESENTATION: Neuromorphic Computing at BMW
Oliver Wick - BMW; oliver.wick@bmw.de
Neuromorphic computing: building up neuromorphic-computing readiness at BMW.
PANEL: Applications of Neuromorphic Hardware in Industry, Automotive & Robotics
Moderator - Yulia Sandamirskaya - Intel
Christian Mayr - TU Dresden
Steve Furber - University of Manchester
Day 3=============================================
---PRESENTATION: Computer Architecture - The Next Step: Energy Efficient Machine Learning
Professor Uri Weiser - Technion ---
A technical hardware talk. Deep learning is everywhere: pedestrian detection, vehicle detection, collision avoidance, parking assist, speech understanding, plate/traffic-sign detection, passenger control, face recognition. We are at the beginning stages of machines comprehending the environment and where we are. In AI hardware, performance is king, and efficiency is the next step.
Spatial correlation and value prediction in convolutional neural networks
Non-blocking simultaneous multithreading: embracing the resiliency of deep neural networks
ML architectures are resilient to inaccuracies -> SMT is suitable in a DNN approximation environment
Why ResNet for benchmarking? https://mlperf.org/
Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.
PRESENTATION: Energy Efficient AI and Carbon Offsetting
Rick Calle - Microsoft; M12, the venture arm of Microsoft
What can AI industry do to reduce AI computational energy? (in a world of trillion X?)
Papers:
M12 meta-analysis of research papers: "Energy and Policy Considerations for Deep Learning in NLP"
"Language Models are Few-Shot Learners" (OpenAI GPT-3)
BERTology; "Learning and Evaluating General Linguistic Intelligence" (Google)
---PRESENTATION: Edge Processors for Deep Learning - Practical Perspectives
Orr Danon - Hailo Technologies
DNN accelerators in the wild: a use-case-driven overview
A video analytics platform's pipeline starts with a frame grab; then analysis stage 1, which consists of detection, quality classification, and grip orientation; then analysis stage 2, which consists of decision logic; and finally the act stage, which consists of pick and place. (A schematic sketch follows below.)
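A schematic sketch of that pipeline (the function names and stub values are mine, not Hailo's):

```python
# Schematic pick-and-place video-analytics pipeline (stubs are illustrative).
def frame_grab():
    return "frame"                            # stand-in for a camera image

def analyse_1(frame):
    # stage 1: detection, quality classification and grip orientation
    return {"object": "part", "quality": "ok", "grip_angle": 42.0}

def analyse_2(result):
    # stage 2: decision logic, e.g. act only on good-quality detections
    return result["quality"] == "ok"

def act(result):
    # final stage: pick and place
    print(f"pick and place at grip angle {result['grip_angle']}")

result = analyse_1(frame_grab())
if analyse_2(result):
    act(result)
```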
PRESENTATION - Why heterogenous multi-processing is a critical requirement for Edge Computing: Example of Automotive
Eric Baissus - Kalray: a new type of processor and solutions; 3rd-generation MPPA processor
Investors: NXP, Renault-Nissan-Mitsubishi, Safran, MBDA
Multicore and manycore processors: heterogeneous multicore processors (a mix of FPGAs, GPUs, ASICs, CPUs), GPGPU manycore processors, CPU-based manycore processors. Only 25% of usable data will reach a data centre; the other 75% needs to be analysed locally in real time.
I would like to know more about the FINN compiler.
PRESENTATION: Towards Embedded Intelligence: opportunities and challenges in the technology landscape
Michaela Blott - Xilinx
In-memory computing; wafer-scale computing; specialized architectures (DPUs)
How can we enable a broader spectrum of end users to specialize hardware architectures and co-design solutions?
https://arxiv.org/abs/2004.03021
FINN (10k-10M FPS); LogicNets (100M+ FPS)
Innovative architectures are emerging to address the needs of embedded intelligence; specialization of hardware architectures is key; with more flexibility comes more opportunity for customization (there is potential to exploit this with FPGAs and ACAPs, which allow specializing to the specifics of individual use cases; tools such as FINN are needed to address the complexity of design entry); for the future, a key challenge in the community remains how to compare solutions (focused on embedded; https://rcl-lab.github.io/QutibenchWeb/).
https://xilinx.github.io/finn/
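For context, FINN consumes networks trained with quantization-aware training in Brevitas, Xilinx's PyTorch library. A minimal sketch, assuming brevitas is installed (the tiny model and bit widths are illustrative, not from the talk):

```python
# Quantization-aware training entry point for FINN (illustrative model).
import torch
from brevitas.nn import QuantConv2d, QuantReLU

model = torch.nn.Sequential(
    QuantConv2d(1, 8, 3, weight_bit_width=2, bias=False),  # 2-bit weights
    QuantReLU(bit_width=2),                                # 2-bit activations
    QuantConv2d(8, 8, 3, weight_bit_width=2, bias=False),
    QuantReLU(bit_width=2),
)
x = torch.rand(1, 1, 28, 28)
print(model(x).shape)
# After training, such a network is exported to ONNX and handed to the FINN
# compiler, which turns it into a dataflow-style FPGA accelerator.
```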
PRESENTATION: MLIR: Accelerating Artificial Intelligence
Albert Cohen - Google
mlir-hiring@google.com
A new golden age for computer architecture: a call to action for software stack construction (compilers, execution environments, tools)
MLIR: Multi-Level Intermediate Representation
https://mlir.llvm.org/
===================
Compiler research; unification; https://llvm.org/
The full price is €999 + VAT, but you can access all 19 presentations and panel discussions on demand for only €149 + VAT; register online for immediate access.
Appendix:
Software and AI/ML:
Either for soft processors (MicroBlaze, NIOS) or physical microcontrollers.
Bare-metal, RTOS, or Linux-based applications.
Software-deployed neural networks.
Real-Time Operating Systems (RTOS): FreeRTOS, VxWorks, pSOS, eCos, Nucleus, proprietary; vast experience with RTOS.
Microprocessors/microcontrollers: x86, 68x, Freescale Power Architecture, ARM, MIPS, SuperH, XScale.
Embedded operating systems: Linux, WinCE, Windows Embedded, CE.NET 4.x, QNX, Symbian.
Application and kernel development, custom BSP and driver development: Embedded Linux, Windows CE, VxWorks, ThreadX, QNX.
Tailor-made custom driver development: network and communications, storage drivers, device drivers.
AI experience: detection/recognition of objects and faces, ADAS, security, data centres, and more; experts in AI/ML.
https://xilinx.github.io/finn/ : FINN is an experimental framework from Xilinx Research Labs to explore deep neural network inference on FPGAs.
https://mlir.llvm.org/ : The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain-specific compilers, and aid in connecting existing compilers together.
https://mlperf.org/ : Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.
Graphcore IPU: https://www.graphcore.ai/posts/microsoft-accelerates-resnext-50-medical-imaging-inference-with-the-ipu