AI-Hardware

What I learned from the AI HARDWARE EUROPE SUMMIT (July 2020): a very short summary of the presentations over 3 days. You can access all presentations here

Summary of AI HARDWARE EUROPE SUMMIT (July 2020)

Some of the problems FPGAs solve:

- excessive heat

- electricity consumption

- resistance to environmental factors and motion

- lifespan

The goal is artificial general intelligence (AGI). AI software and hardware should work together to achieve this goal. A lot of research in this area combines new algorithmic methods and new hardware to achieve high performance. Most new hardware works with TensorFlow and PyTorch. In most cases, new hardware comes with a software solution that delivers high performance in a specific use case or scenario. However, some new hardware, such as Embedded FPGAs (eFPGAs), can be modified by programmers to implement their own requirements.

Some of the best presentations were: "Machine Intelligent Systems & Software" by Victoria Rege from Graphcore; "Leveraging sparsity to enable ultra-low latency inferences demonstrated using GrAI One" by Orlando Moreira and Remi Poittevin from GrAI Matter Labs; "Challenges for Using Machine Learning in Industrial Automation" by Ingo Thon from Siemens [convolutional autoencoder for anomaly detection]; "From Training to Production Inference for Automotive AI - Transforming Research into Reality" (from laboratory training to automotive inference: the realities of embedding AI) by Tony King-Smith and Marton Feher from AIMotive; "Towards Embedded Intelligence" by Michaela Blott from Xilinx; and, last, "MLIR: Accelerating Artificial Intelligence" by Albert Cohen from Google.

Some keywords for the event: FINN, an experimental framework from Xilinx Research Labs to explore deep neural network inference on FPGAs; the MLIR project, a novel approach to building reusable and extensible compiler infrastructure; fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services at mlperf.org; and the Graphcore IPU, a new type of processor for AI with very high performance.

Day 1:

OPENING KEYNOTE: Past Chip Childhood and System Teenage: Why We Need to Build a Mature Ecosystem

Olivier Temam - DeepMind

PRESENTATION: Power and Cost Efficient High Performance Inference at the Edge

Geoff Tate - Flex Logix

*PRESENTATION: Machine Intelligent Systems & Software

Victoria Rege - Graphcore

PANEL: High Performance and Low Energy Consumption: Developments in Performing at the Edge

Moderator - Luca Benini - Professor

Orr Danon - Hailo Technologies

Eric Flamand - GreenWaves Technologies

Ask an Analyst: Moderated Q&A and Group Discussion

Michael Azoff - Kisaco Research

Brett Simpson - Arete Research

**PRESENTATION: Leveraging sparsity to enable ultra-low latency inferences demonstrated using GrAI One

Orlando Moreira - GrAI Matter Labs

Remi Poittevin - GrAI Matter Labs

*PANEL: Investment Trends & Dynamics in AI Hardware and the Startup Ecosystem

Moderator - Brett Simpson - Arete Research

Christian Patze - M Ventures

Sascha Fritz - Robert Bosch Venture Capital GmbH

Day 2:

*PRESENTATION: Challenges for Using Machine Learning in Industrial Automation

Ingo Thon - Siemens

PANEL: Designing Safe, Power-Efficient and Affordable Autonomous Systems

Moderator - Robert Krutsch - Zenuity

Arnaud Van Den Bossche - NXP

Gordon Cooper - Synopsys

ASK AN EXPERT: Moderated Q&A and Group Discussion About Possibilities For Real Time AI enabled by Edge Compute

Eric Flamand - GreenWaves Technologies

*PRESENTATION: From Training to Production Inference for Automotive AI - Transforming Research into Reality

Tony King-Smith - AIMotive

Marton Feher - AIMotive

PRESENTATION: Neuromorphic Computing at BMW

Oliver Wick - BMW

PANEL: Applications of Neuromorphic Hardware in Industry, Automotive & Robotics

Moderator - Yulia Sandamirskaya - Intel

Christian Mayr - TU Dresden

Steve Furber - University of Manchester

Day 3:

PRESENTATION: Computer Architecture - The Next Step: Energy Efficient Machine Learning

Uri Weiser - Technion

PRESENTATION: Energy Efficient AI and Carbon Offsetting

Rick Calle - Microsoft

PRESENTATION: Edge Processors for Deep Learning - Practical Perspectives

Orr Danon - Hailo Technologies

PRESENTATION - Why heterogenous multi-processing is a critical requirement for Edge Computing: Example of Automotive

Eric Baissus - Kalray

**PRESENTATION: Towards Embedded Intelligence

Michaela Blott - Xilinx

**PRESENTATION: MLIR: Accelerating Artificial Intelligence

Albert Cohen - Google

  1. OPENING KEYNOTE: Past Chip Childhood and System Teenage: Why We Need to Build a Mature Ecosystem

Olivier Temam - DeepMind

Temam@google.com

AI progress comes at a heavy computing cost.

DRL, AutoML, and AGI have ever-higher computing requirements.

Challenges:

  1. Hyper focused on chips

  2. Idiosyncratic hardware

  3. AI algorithms evolve very fast

  4. Efficiency/flexibility tradeoff

He talked about AI chips and said that AI progress comes at a heavy computing cost. Algorithms like DRL, AutoML, and AGI have ever-higher computing requirements. The challenges for AI chips are: the trade-off between efficiency and flexibility; AI algorithms evolving very fast, which changes the processing methods; idiosyncratic hardware built for specific domains; and being hyper-focused on chips rather than on the system.

I was impressed by the "Accelerator Benchmarking on Real Edge-Inference Applications" presentation at the AI HW Summit, and I would like to see the live demo of "InferX X1" in October. Do you have any benchmarks of InferX X1 for deep reinforcement learning?

  1. PRESENTATION: Power and Cost Efficient High Performance Inference at the Edge

Geoff Tate - Flex Logix; geoff@flex-logix.com

Accelerator Benchmarking on Real Edge-Inference Applications

He talked about Embedded FPGA (eFPGA) and a new product, "InferX X1", coming soon. It uses a TDP of 7-13W, compared to the Nvidia Xavier NX at 15W. He mentioned that ResNet-50 with a 224x224 image size will not tell you about the robustness of the memory system required for megapixel images; that is why it is not good for comparison. The best model for customers to use and compare is YOLOv3: its intermediate activations peak at 64MB for 2-megapixel images, which stresses memory subsystems much more than ResNet-50 does. Their chip's efficiency comes from data packing and transposition, which enables efficient 3D convolutions; each layer gets a dedicated RAM-to-compute-to-RAM path, deep layer fusion reduces the memory requirement, and DRAM access is "hidden" in the background.

Embedded FPGA (eFPGA): clustering MACs with a reconfigurable interconnect delivers high inference throughput at low cost. InferX X1 is in fab now, TDP 7-13W (compare to Nvidia Xavier NX at 15W); not PyTorch. The best model for customers to use and compare is YOLOv3. The usual default image size is 224x224, but the peak intermediate activation (largest layer) grows steeply as the image size increases; YOLOv3's intermediate activations peak at 64MB for 2-megapixel images, stressing memory subsystems much more than ResNet-50 does.
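A rough back-of-the-envelope check of those numbers (my arithmetic, not from the talk; it assumes 8-bit activations and a 1920x1080 frame): an early full-resolution conv layer of YOLOv3 alone is on the order of the quoted 64MB peak, while a 224x224 layer is tiny by comparison.

```python
# Back-of-the-envelope activation memory (my arithmetic; INT8 activations and
# the layer shapes are assumptions -- only the 64MB peak figure is from the talk).
h, w, c = 1920, 1080, 32                       # ~2 MP frame, 32-channel early conv layer
print(f"2 MP layer:    {h * w * c / 2**20:5.1f} MiB")  # ~63.3 MiB, same order as 64MB
h, w, c = 224, 224, 64                         # ResNet-50's default input scale
print(f"224x224 layer: {h * w * c / 2**20:5.1f} MiB")  # ~3.1 MiB, far less memory stress
```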

Nvidia Xavier NX has 3 inference engines (GPU, 2x DLA): in a self-driving car, multiple models run simultaneously.

But in most edge AI systems there is one camera, one model, and one system processing images frame by frame: a stream of data coming in at 15 FPS is processed one image at a time.

Keys to InferX X1 efficiency: data packing and transposition; efficient 3D convolutions; for each layer, a dedicated RAM-to-compute-to-RAM path; deep layer fusion, which reduces the memory requirement; DRAM access "hidden" in the background.

Partners: TSMC, GUC, Synopsys, Arteris, Analog Bits, Cadence, Mentor

I am impress about "Machine Intelligent Systems & Software GRAPHCORE IPU" presentation at AI HW summit. Do you have any benchmarking of GRAPHCORE IPU for the deep reinforcement learning?

  1. PRESENTATION: Machine Intelligent Systems & Software

Victoria Rege - Graphcore; victoria@graphcore.ai, info@graphcore.ai; @fleurdevie

This presentation was one of the best. She talked about the Graphcore IPU (Intelligence Processing Unit) and the Poplar SDK. The Graphcore IPU can run the training of a sample model in 3 hours that requires 40 hours on a GPU. In another use case, IPU-accelerated medical imaging on Azure can process 2,000 images/sec compared to 166 images/sec on a GPU.

GRAPHCORE IPU (INTELLIGENCE PROCESSING UNIT)

The Poplar SDK

https://www.graphcore.ai/finance (40 hours on GPU to 3 hours)

Running a PyTorch model on the Graphcore IPU: ResNeXt-101 example
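As a rough sketch of what running a PyTorch model on the IPU looks like (my illustration based on Graphcore's PopTorch library, not code from the talk; API details vary between SDK releases, and resnext101_32x8d stands in for the ResNeXt-101 variant used):

```python
# Minimal sketch: PyTorch inference on a Graphcore IPU via PopTorch
# (assumes the Poplar SDK and `poptorch` are installed; illustrative only).
import torch
import torchvision
import poptorch

model = torchvision.models.resnext101_32x8d(pretrained=True).eval()

opts = poptorch.Options()                      # default options target one IPU
ipu_model = poptorch.inferenceModel(model, opts)

x = torch.randn(1, 3, 224, 224)                # dummy input image batch
with torch.no_grad():
    logits = ipu_model(x)                      # compiled for and run on the IPU
print(logits.shape)                            # torch.Size([1, 1000])
```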

IPU-accelerated medical imaging on Azure (IPU: 2,000 images/sec vs GPU: 166 images/sec)

https://www.graphcore.ai/posts/microsoft-accelerates-resnext-50-medical-imaging-inference-with-the-ipu

  1. PANEL: High Performance and Low Energy Consumption: Developments in Performing at the Edge

Moderator - Luca Benini - Professor: ML processors

Orr Danon - Hailo Technologies

Eric Flamand - GreenWaves Technologies

  1. Ask an Analyst: Moderated Q&A and Group Discussion

Michael Azoff - Kisaco Research

Brett Simpson - Arete Research

  1. PRESENTATION: Leveraging sparsity to enable ultra-low latency inferences demonstrated using GrAI One

*Orlando Moreira - GrAI Matter Labs

Remi Poittevin - GrAI Matter Labs

***

Edge workloads involve real-time constraints: responsive smart devices, closely coupled feedback loops, autonomous systems.

Input data streams are continuous: video/Audio feeds, industrial sensor ensembles, bio signals (EEG,EKG, movement)

The data rate is much higher than the real information rate:

- voice: 512 -> information: 39 bits/s

- UXGA video: 79 MB/s -> information: 95%

Frame-based processing: apply a single-frame algorithm independently to each input frame in a stream.

Advantages:

Many popular sensors are frame based

Simple, easy to scale: image -> video stream

Disadvantage:

Repeated and redundant data is processed over and over

Sparsity in video

Sparsity in structure: pruning of needless weights and kernels in network

Sparsity in space: most pixels in an image have no relevant feature data; this results in 0-valued activations.

Sparsity in time: image changes little from instant to instant; why should we always re-process the whole frame?

Event-based computation of networks (process only the data that changes): single events in an input layer fan out (typically 1:9 to 1:49 per feature map); only the affected pixels in the convolutional layer need to be computed; locality of change is preserved downstream. Events then fan in, typically 4:1, in pooling layers; additionally, events are only 25% likely to change the pixel state (typ. 2x2 max pooling).

sparNet: sparse and event-based execution model

Exploits time-sparsity in a time series

Converts a frame-based network to event-based inference

Event based: change is sent sporadically, so no frame structure to input data

Only propagates changes, thus less work needs to be done

Requires resilient neuron state

Threshold: per neuron, defines how much change is needed to warrant propagation.

To convert a CNN to SparNet, they set a threshold per neuron (a minimal sketch of this update follows after the PilotNet notes below).

PilotNet in SparNet

Executing PilotNet with SparNet dramatically reduces the number of operations required.
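A minimal NumPy sketch of this thresholded, event-based update (my own illustration of the principle, not GrAI Matter Labs' actual SparNet implementation): each layer keeps its previous activations as resilient neuron state, and only input changes that exceed the threshold are propagated.

```python
# Sketch of time-sparse, event-based inference for one linear layer
# (assumed model of the technique, not GrAI Matter Labs' implementation).
import numpy as np

class EventLayer:
    def __init__(self, weights, threshold=0.05):
        self.w = weights                              # (out, in) weight matrix
        self.threshold = threshold                    # per-layer here; per-neuron in SparNet
        self.state_in = np.zeros(weights.shape[1])    # resilient input state
        self.state_out = np.zeros(weights.shape[0])   # resilient output state

    def step(self, x):
        delta = x - self.state_in
        # Event-based: only inputs whose change exceeds the threshold fire.
        events = np.abs(delta) > self.threshold
        if events.any():
            # Propagate only the changes, updating the stored output state.
            self.state_out += self.w[:, events] @ delta[events]
            self.state_in[events] = x[events]
        return self.state_out, int(events.sum())

rng = np.random.default_rng(0)
layer = EventLayer(rng.normal(size=(8, 100)))
frame = rng.normal(size=100)
_, n1 = layer.step(frame)                             # first frame: nearly all inputs fire
_, n2 = layer.step(frame + 0.01 * rng.normal(size=100))  # small change: few fire
print(n1, n2)                                         # e.g. ~96 events vs. a handful
```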

Effect becomes dramatic at high fps:

Same amount change per same time interval

But for frame based processing, load increases linearly with frame rate

Higher fps => lower sampling period => lower latency

Consequences of sparsity for computer architecture:

Requires that they store resilient neuron states

Suggests in/near memory computation

Frame structure is lost:

Event based scheduling of computation

Suggests data flow synchronization

Significantly fewer sequential memory accesses occur:

Reduced value of caching, network bursts, bulk DMA transfers

Reduced opportunities for latency hiding

Suggests in/near memory computation

Conclusion:

NeuronFlow is designed to exploit sparsity:

- Sparsity in structure: event-based activation skips 0-weight (pruned) synapses/kernels.

- Sparsity in space: 0-valued activations are neither sent nor processed.

- Sparsity in time: if the change between frames is below the threshold, it is neither sent nor processed.

Neuron state to 200 to store the space/…..

  1. PANEL: Investment Trends & Dynamics in AI Hardware and the Startup Ecosystem

Moderator - Brett Simpson - Arete Research

Christian Patze - M Ventures

Sascha Fritz - Robert Bosch Venture Capital GmbH

Day 2=========================================

PRESENTATION: Challenges for Using Machine Learning in Industrial Automation

Dr. Ingo Thon - Siemens; Ingo.Thon@siemens.com

He presented some hardware challenges. Key notes: a chip for time-series data is missing, and AI should sit at the hardware level.

Imagine automating the unpredictable.

Drivers for new developments in industry: time to market, flexibility (PID), quality, efficiency.

Visual quality inspection is solved but hard: a convolutional autoencoder for anomaly detection; the Reptile algorithm. The trick for a wide product line is that, once trained, the pre-trained model can be adapted to other product types (a sketch of such an autoencoder follows below).

Cost comes down to reliability, performance, and ease of use (manpower).

AI should sit at the hardware level.

A chip for time-series data is missing.
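To make the anomaly-detection idea concrete, here is a minimal PyTorch sketch of my own (not Siemens' production model): a convolutional autoencoder is trained only on defect-free images, so a high reconstruction error at inference time flags an anomaly.

```python
# Minimal convolutional autoencoder for visual anomaly detection (illustrative
# sketch, not Siemens' model). Train on defect-free images only; a large
# reconstruction error then indicates an anomaly.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

good_batch = torch.rand(8, 1, 64, 64)          # stand-in for defect-free images
for _ in range(5):                             # toy training loop
    opt.zero_grad()
    loss = loss_fn(model(good_batch), good_batch)
    loss.backward()
    opt.step()

test = torch.rand(1, 1, 64, 64)
error = loss_fn(model(test), test).item()      # anomaly score: reconstruction error
print("anomalous" if error > 0.05 else "normal", error)  # threshold is illustrative
```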

PANEL: Designing Safe, Power-Efficient and Affordable Autonomous Systems

Moderator - Robert Krutsch - Zenuity

Arnaud Van Den Bossche - NXP

Gordon Cooper - Synopsys

ASK AN EXPERT: Moderated Q&A and Group Discussion About Possibilities For Real Time AI enabled by Edge Compute

Eric Flamand - GreenWaves Technologies

Watch again

PRESENTATION: From Training to Production Inference for Automotive AI - Transforming Research into Reality

Tony King-Smith - AIMotive

Marton Feher - AIMotive

From laboratory training to automotive inference: the realities of embedding AI

Convolution is not the same as matrix multiplication. Matrix multipliers are used extensively in GPUs and DSPs, so many algorithm implementations use them, BUT matrix multipliers need pre- and post-processing to reorder data for convolution (see the im2col sketch below). We need to accelerate the NN algorithms, not just implementations of them.

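To illustrate that reordering cost (my example, not AIMotive's code): the standard im2col trick turns convolution into one big matrix multiplication by first duplicating and rearranging the input patches.

```python
# im2col: convolution expressed as matrix multiplication (illustrative example).
# The data-reordering step is exactly the pre-processing overhead the talk
# refers to.
import numpy as np

def conv2d_via_im2col(x, k):
    """Valid 2-D convolution (cross-correlation, as in most DNN libraries)."""
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    # Pre-processing: gather every kh*kw patch into a column (data duplication).
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # The convolution itself is now a single matrix multiply...
    out = k.ravel() @ cols
    # ...followed by post-processing to restore the spatial layout.
    return out.reshape(oh, ow)

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0                       # 3x3 mean filter
print(conv2d_via_im2col(x, k))                  # 3x3 output of local means

# Sanity check against a direct sliding-window implementation.
direct = np.array([[(x[i:i+3, j:j+3] * k).sum() for j in range(3)] for i in range(3)])
assert np.allclose(conv2d_via_im2col(x, k), direct)
```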

Manual optimization; 5x5 convolution kernel; ReLU; 5x5 conv and 5x5 deconv.

PRESENTATION: Neuromorphic Computing at BMW

Oliver Wick - BMW; oliver.wick@bmw.de

Neuromorphic computing: building up neuromorphic-computing readiness at BMW.

PANEL: Applications of Neuromorphic Hardware in Industry, Automotive & Robotics

Moderator - Yulia Sandamirskaya - Intel

Christian Mayr - TU Dresden

Steve Furber - University of Manchester

Day 3=============================================

PRESENTATION: Computer Architecture - The Next Step: Energy Efficient Machine Learning

Professor Uri Weiser - Technion

A technical hardware talk. Deep learning is everywhere: pedestrian detection, vehicle detection, collision avoidance, parking assist, speech understanding, plate/traffic-sign detection, passenger control, face recognition. We are at the beginning stages of comprehending the environment and where we are. In AI hardware, performance is king, and efficiency is the next step.

  1. Spatial correlation and value prediction in convolutional neural networks

  2. Non blocking simultaneous multithreading: embracing the resiliency of deep neural network

ML architectures are resilient to inaccuracies -> SMT is suitable in a DNN approximation environment

Why ResNet for benchmarking? https://mlperf.org/

Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.

PRESENTATION: Energy Efficient AI and Carbon Offsetting

Rick Calle - Microsoft; M12, the venture arm of Microsoft

What can the AI industry do to reduce AI's computational energy? (in a world of trillion X?)



Papers:

M12 meta-analysis of research papers: "Energy and Policy Considerations for Deep Learning in NLP"

"Language Models are Few-Shot Learners" (OpenAI GPT-3)

BERTology; "Learning and Evaluating General Linguistic Intelligence" (Google)

PRESENTATION: Edge Processors for Deep Learning - Practical Perspectives

Orr Danon - Hailo Technologies

DNN accelerators in the wild: a use-case-driven overview

Video analytics platform pipelines start with a frame grab; then analysis stage 1, which consists of detection, quality classification, and grip orientation; then analysis stage 2, which consists of decision logic; and finally the act stage, which consists of pick and place (a skeleton of this loop is sketched below).
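As a reading aid (my skeleton, not Hailo's code; all functions are hypothetical stand-ins), the four stages map naturally onto a simple per-frame loop:

```python
# Skeleton of the described pick-and-place video analytics pipeline
# (illustrative structure only; every function is a hypothetical stand-in).
from dataclasses import dataclass

@dataclass
class Analysis:
    detections: list      # stage 1: object detection results
    quality: str          # stage 1: quality classification
    grip_angle: float     # stage 1: grip orientation

def frame_grab():
    return "frame"                             # stand-in for a camera frame

def analyse_1(frame) -> Analysis:
    # Detection + quality classification + grip orientation (the DNN accelerator's work).
    return Analysis(detections=["part"], quality="ok", grip_angle=42.0)

def decision_logic(a: Analysis) -> bool:
    # Stage 2: decide whether to act on this frame.
    return a.quality == "ok" and bool(a.detections)

def pick_and_place(a: Analysis):
    print(f"picking at grip angle {a.grip_angle}")

for _ in range(3):                             # per-frame loop
    analysis = analyse_1(frame_grab())
    if decision_logic(analysis):
        pick_and_place(analysis)
```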

PRESENTATION - Why heterogenous multi-processing is a critical requirement for Edge Computing: Example of Automotive

Eric Baissus - Kalray: a new type of processor and solutions; 3rd-generation MPPA processor

Investors: NXP, Renault-Nissan-Mitsubishi, Safran, MBDA

Multicore and manycore processors: homogeneous multicore processors (a mix of FPGAs, GPUs, ASICs, CPUs), GPGPU manycore processors, CPU-based manycore processors. Only 25% of usable data will reach a data centre; the other 75% needs to be analysed locally in real time.

I would like to know more about the FINN compiler.

PRESENTATION: Towards Embedded Intelligence: opportunities and challenges in the technology landscape

Michaela Blott - Xilinx

In-memory computing; wafer-scale computing; specialized architectures (DPU)

How can we enable a broader spectrum of end users to specialize hardware architectures and co-design solutions?

https://arxiv.org/abs/2004.03021

FINN (10K-10M FPS); LogicNets (100M+ FPS)

Innovative architectures are emerging to address the needs of embedded intelligence; specialization of hardware architectures is key. More flexibility means more opportunity for customization (there is potential to exploit this with FPGAs and ACAPs, which allow specializing to the specifics of individual use cases; tools such as FINN are needed to address the complexity of design entry). Looking ahead, a key challenge for the community remains how to compare solutions (focused on embedded; https://rcl-lab.github.io/QutibenchWeb/).

https://xilinx.github.io/finn/
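The FINN flow consumes few-bit networks trained with Brevitas, Xilinx's quantization-aware PyTorch library. As a hedged sketch (layer sizes and bit widths are arbitrary choices of mine, and the Brevitas API may differ between versions), defining such a network might look like this:

```python
# Sketch of a low-bit-width MLP defined with Brevitas, the quantization-aware
# training library used in the FINN flow (illustrative; layer sizes and bit
# widths are arbitrary choices, not from the talk).
import torch.nn as nn
from brevitas.nn import QuantLinear, QuantReLU

def make_quant_mlp(in_features=784, hidden=64, classes=10, bits=2):
    return nn.Sequential(
        QuantLinear(in_features, hidden, bias=True, weight_bit_width=bits),
        QuantReLU(bit_width=bits),
        QuantLinear(hidden, hidden, bias=True, weight_bit_width=bits),
        QuantReLU(bit_width=bits),
        QuantLinear(hidden, classes, bias=True, weight_bit_width=bits),
    )

model = make_quant_mlp()
# After quantization-aware training, the model is exported (e.g. to ONNX) and
# handed to the FINN compiler, which generates a dataflow FPGA accelerator.
```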

PRESENTATION: MLIR: Accelerating Artificial Intelligence

Albert Cohen - Google

Mlir-hiring@google.com

A new golden age for computer architecture; a call to action for software-stack construction: compilers, execution environments, tools.

MLIR: Multi-Level Intermediate Representation

mlir.llvm.org

===================

Compiler research; unification; https://llvm.org/


Appendix:

SPARTRONIX - Software and AI/ML

Embedded software services, either for soft processors (MicroBlaze, NIOS) or physical microcontrollers; bare-metal, RTOS, or Linux-based applications; software-deployed neural networks.

Real-Time Operating Systems (RTOS): vast experience with FreeRTOS, VxWorks, pSOS, eCos, Nucleus, and proprietary systems.

Microprocessors/microcontrollers: x86, 68x, Freescale Power Architecture, ARM, MIPS, SuperH, XScale.

Embedded operating systems: Linux, WinCE/Windows Embedded, CE.NET 4.x, QNX, Symbian.

Application and kernel development, plus custom BSP and driver development (network and communications, storage drivers, device drivers), for embedded Linux, Windows CE, VxWorks, ThreadX, and QNX.

AI experience: detection/recognition of objects and faces, ADAS, security, data centers, and more.

  1. https://xilinx.github.io/finn/ : FINN is an experimental framework from Xilinx Research Labs to explore deep neural network inference on FPGAs.

  2. https://mlir.llvm.org : The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together.

  3. https://mlperf.org/ : Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.

  4. Graphcore IPU: https://www.graphcore.ai/posts/microsoft-accelerates-resnext-50-medical-imaging-inference-with-the-ipu