My paper: A Comprehensive Review on Deep Reinforcement Learning

The updates

Dear friends, 
I recently wrote a survey paper on "A Comprehensive Review on Deep Reinforcement Learning: A Survey", with some of the leading AI and DRL researchers (including): In this work, we covered top recent DRL works, grouped into several categories. We were lucky to have you, as the external reviewers of this work. I hope this is useful for the research community. Any feedback will be highly welcomed. You can find its summary here too. Imitation learning, expert (teacher), hierarchical, hybrid imitation, high performance parallelism,


Notes and info

  • training on unlabeled data, lifelong learning, and especially letting models explore a simulated environment before transferring what they learn to the real world

  • Lately, simulation has helped achieve impressive results in reinforcement learning, which is extremely data-intensive.

  • using reinforcement learning to train robots that reason about how their actions will affect their environment.

  • How is it that many people learn to drive a car fairly safely in 20 hours of practice, while current imitation learning algorithms take hundreds of thousands of hours, and reinforcement learning algorithms take millions of hours? Clearly we’re missing something big.

  • In 2021, I expect self-supervised methods to learn features of video and images. Could there be a similar revolution in high-dimensional continuous data like video?

  • One critical challenge is dealing with uncertainty. Models like BERT can’t tell if a missing word in a sentence is “cat” or “dog,” but they can produce a probability distribution vector. We don’t have a good model of probability distributions for images or video frames. But recent research is coming so close that we’re likely to find it soon.

  • Suddenly we’ll get really good performance predicting actions in videos with very few training samples, where it wasn’t possible before. That would make the coming year a very exciting time in AI.

  • DeepMind released the code, model, & dataset behind their groundbreaking "AlphaFold" system. It's predicts protein shapes from genomic data with apps in health, sustainability, & materials design

Reading List (Video, Conference, Workshop, Paper)

DeepMind Open-Sources Lab2D, A System For The Creation Of 2D Environments For Machine Learning

Google to release DeepMind's StreetLearn for teaching machine-learning agents to navigate cities

Scalable agent alignment via reward modeling – DeepMind Safety Research – Medium

Google's DeepMind Can Support, Defeat Humans in Quake III Arena - ExtremeTech | You searched for deep mind - ExtremeTech | DeepMind AI Moves on from Board Games to StarCraft II - ExtremeTech | DeepMind AI Challenges Pro StarCraft II Players, Wins Almost Every Match - ExtremeTech | Google's DeepMind AI gets a few new tricks to learn faster

Robot arm

There are 4 Courses in this Specialization


Fundamentals of Reinforcement Learning



801 ratings

205 reviews

Reinforcement Learning is a subfield of Machine Learning, but is also a general purpose formalism for automated decision-making and AI. This course introduces you to statistical learning techniques where an agent explicitly takes actions and interacts with the world. Understanding the importance and challenges of learning agents that make decisions is of vital importance today, with more and more companies interested in interactive agents and intelligent decision-making.

This course introduces you to the fundamentals of Reinforcement Learning. When you finish this course, you will: - Formalize problems as Markov Decision Processes - Understand basic exploration methods and the exploration/exploitation tradeoff - Understand value functions, as a general-purpose tool for optimal decision-making - Know how to implement dynamic programming as an efficient solution approach to an industrial control problem This course teaches you the key concepts of Reinforcement Learning, underlying classic and modern algorithms in RL. After completing this course, you will be able to start using RL for real problems, where you have or can specify the MDP. This is the first course of the Reinforcement Learning Specialization.



Sample-based Learning Methods



397 ratings

75 reviews

In this course, you will learn about several algorithms that can learn near optimal policies based on trial and error interaction with the environment---learning from the agent’s own experience. Learning from actual experience is striking because it requires no prior knowledge of the environment’s dynamics, yet can still attain optimal behavior. We will cover intuitively simple but powerful Monte Carlo methods, and temporal difference learning methods including Q-learning. We will wrap up this course investigating how we can get the best of both worlds: algorithms that can combine model-based planning (similar to dynamic programming) and temporal difference updates to radically accelerate learning.

By the end of this course you will be able to: - Understand Temporal-Difference learning and Monte Carlo as two strategies for estimating value functions from sampled experience - Understand the importance of exploration, when using sampled experience rather than dynamic programming sweeps within a model - Understand the connections between Monte Carlo and Dynamic Programming and TD. - Implement and apply the TD algorithm, for estimating value functions - Implement and apply Expected Sarsa and Q-learning (two TD methods for control) - Understand the difference between on-policy and off-policy control - Understand planning with simulated experience (as opposed to classic planning strategies) - Implement a model-based approach to RL, called Dyna, which uses simulated experience - Conduct an empirical study to see the improvements in sample efficiency when using Dyna



Prediction and Control with Function Approximation



252 ratings

40 reviews

In this course, you will learn how to solve problems with large, high-dimensional, and potentially infinite state spaces. You will see that estimating value functions can be cast as a supervised learning problem---function approximation---allowing you to build agents that carefully balance generalization and discrimination in order to maximize reward. We will begin this journey by investigating how our policy evaluation or prediction methods like Monte Carlo and TD can be extended to the function approximation setting. You will learn about feature construction techniques for RL, and representation learning via neural networks and backprop. We conclude this course with a deep-dive into policy gradient methods; a way to learn policies directly without learning a value function. In this course you will solve two continuous-state control tasks and investigate the benefits of policy gradient methods in a continuous-action environment.

Prerequisites: This course strongly builds on the fundamentals of Courses 1 and 2, and learners should have completed these before starting this course. Learners should also be comfortable with probabilities & expectations, basic linear algebra, basic calculus, Python 3.0 (at least 1 year), and implementing algorithms from pseudocode. By the end of this course, you will be able to: -Understand how to use supervised learning approaches to approximate value functions -Understand objectives for prediction (value estimation) under function approximation -Implement TD with function approximation (state aggregation), on an environment with an infinite state space (continuous state space) -Understand fixed basis and neural network approaches to feature construction -Implement TD with neural network function approximation in a continuous state environment -Understand new difficulties in exploration when moving to function approximation -Contrast discounted problem formulations for control versus an average reward problem formulation -Implement expected Sarsa and Q-learning with function approximation on a continuous state control task -Understand objectives for directly estimating policies (policy gradient objectives) -Implement a policy gradient method (called Actor-Critic) on a discrete state environment



A Complete Reinforcement Learning System (Capstone)



177 ratings

33 reviews

In this final course, you will put together your knowledge from Courses 1, 2 and 3 to implement a complete RL solution to a problem. This capstone will let you see how each component---problem formulation, algorithm selection, parameter selection and representation design---fits together into a complete solution, and how to make appropriate choices when deploying RL in the real world. This project will require you to implement both the environment to stimulate your problem, and a control agent with Neural Network function approximation. In addition, you will conduct a scientific study of your learning system to develop your ability to assess the robustness of RL agents. To use RL in the real world, it is critical to (a) appropriately formalize the problem as an MDP, (b) select appropriate algorithms, (c ) identify what choices in your implementation will have large impacts on performance and (d) validate the expected behaviour of your algorithms. This capstone is valuable for anyone who is planning on using RL to solve real problems.

To be successful in this course, you will need to have completed Courses 1, 2, and 3 of this Specialization or the equivalent. By the end of this course, you will be able to: Complete an RL solution to a problem, starting from problem formulation, appropriate algorithm selection and implementation and empirical study into the effectiveness of the solution.


Using pre trained model to train deeper and lager model

Imitation Learning

Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training. It also provides a standardized method of comparing algorithms and how well they avoid costly mistakes while learning. If deep reinforcement learning is applied to the real world, whether in robotics or internet-based tasks, it will be important to have algorithms that are safe even while learning—like a self-driving car that can learn to avoid accidents without actually having to experience them. Credit: Two Minute Papers, OpenAI Follow me for more AI/ Datascience posts:

OpenAI Safety Gym: A Safe Place For AIs To Learn 💪

DeepMind proposes novel way to train ‘safe’ reinforcement learning AI

The Batch Issue 35

Different Skills From Different Demos

Reinforcement learning trains models by trial and error. In batch reinforcement learning (BRL), models learn by observing many demonstrations by a variety of actors. For instance, a robot might learn how to fix ingrown toenails by watching hundreds of surgeons perform the procedure. But what if one doctor is handier with a scalpel while another excels at suturing? A new method lets models absorb the best skills from each.

What’s new: Ajay Mandlekar and collaborators at Nvidia, Stanford, and the University of Toronto devised a BRL technique that enables models to learn different portions of a task from different examples. This way, the model can gain useful information from inconsistent examples. Implicit Reinforcement without Interaction at Scale (IRIS) achieved state-of-the-art BRL performance in three tasks performed in a virtual environment.

Key insight: Learning from demonstrations is a double-edged sword. An agent gets to see how to complete a task, but the scope of its action is limited to the most complete demonstration of a given task. IRIS breaks down tasks into sequences of intermediate subgoals. Then it performs the actions required to accomplish each subgoal. In this way, the agent learns from the best parts of each demonstration and combines them to accomplish the task.

How it works: IRIS includes a subgoal selection model that predicts intermediate points on the way to accomplishing an assigned task. These subgoals are defined automatically by the algorithm, and may not correspond to parts of a task as humans would describe them. A controller network tries to replicate the optimal sequence of actions leading to a given subgoal.

  • The subgoal selection model is made up of a conditional variational autoencoder that produces a set of possible subgoals and a value function (trained via a BRL version of Q-learning) that predicts which next subgoal will lead to the highest reward.

  • The controller is a recurrent neural network that decides on the actions required to accomplish the current subgoal. It learns to predict how demonstrations tend to unfold, and to imitate short sequences of actions from specific demonstrations.

  • Once it’s trained, the subgoal selection model determines the next subgoal. The controller takes the requisite actions. Then the subgoal selection model evaluates the current state and computes a new subgoal, and so on.

Results: In the Robosuite's lifting and pick-and-place tasks, previous state-of-the-art BRL approaches couldn't pick up objects reliably, nor place them elsewhere at all. IRIS learned to pick up objects with over 80 percent success and placed them with 30 percent success.

Why it matters: Automatically identifying subgoals has been a holy grail in reinforcement learning, with active research in hierarchical RL and other areas. The method used in this paper applies to relatively simple tasks where things happen in a predictable sequence (such as picking and then placing), but might be a small step in an important direction.

We’re thinking: Batch reinforcement learning is useful when a model must be interpretable or safe — after all, a robotic surgeon shouldn’t experiment on living patients — but it hasn’t been terribly effective. IRIS could make it a viable option.

Dec 11, 2019

Issue 34

Seeing the World Blindfolded

In reinforcement learning, if researchers want an agent to have an internal representation of its environment, they’ll build and train a world model that it can refer to. New research shows that world models can emerge from standard training, rather than needing to be built separately.

What’s new: Google Brain researchers C. Daniel Freeman, Luke Metz, and David Ha enabled an agent to build a world model by blindfolding it as it learned to accomplish tasks. They call their approach observational dropout.

Key insight: Blocking an agent's observations of the world at random moments forces it to generate its own internal representation to fill in the gaps. The agent learns this representation without being instructed to predict how the environment will change in response to its actions.

How it works: At every timestep, the agent acts on either its observation (framed in red in the video above) or its prediction of what it wasn’t able to observe (imagery not framed in red). The agent contains a controller that decides on the most rewarding action. To compute the potential reward of a given action, the agent includes an additional deep net trained using the RL algorithm REINFORCE.

  • Observational dropout blocks the agent from observing the environment according to a user-defined probability. When this happens, the agent predicts an observation.

  • If random blindfolding blocks several observations in a row, the agent uses its most recent prediction to generate the next one.

  • This procedure over many iterations produces a sequence of observations and predictions. The agent learns from this sequence, and its ability to predict blocked observations is tantamount to a world model.

Results: Observational dropout solved the task known as Cartpole, in which the model must balance a pole upright on a rolling cart, even when its view of the world was blocked 90 percent of the time. In a more complex Car Racing task, in which a model must navigate a car around a track as fast as possible, the model performed almost equally well whether it was allowed to see its surroundings or blindfolded up to 60 percent of the time.

Why it matters: Modeling reality is often part art and part science. World models generated by observational dropout aren’t perfect representations, but they’re sufficient for some tasks. This work could lead to simple-but-effective world models of complex environments that are impractical to model completely.

We’re thinking: Technology being imperfect, observational dropout is a fact of life, not just a research technique. A self-driving car or auto-piloted airplane reliant on sensors that drop data points could create a catastrophe. This technique could make high-stakes RL models more robust.

Dec 4, 2019

Issue 33

Is AI Making Mastery Obsolete?

Is there any reason to continue playing games that AI has mastered? Ask the former champions who have been toppled by machines.

What happened: In 2016, International Go master Lee Sedol famously lost three out of four matches to DeepMind’s AlphaGo model. The 36-year-old announced his retirement from competition on November 27. “Even if I become the number one, there is an entity that cannot be defeated,” he told South Korean's Yonhap News Agency,

Stages of grief: Prior to the tournament, Lee predicted that he would defeat AlphaGo easily. But the model’s inexplicable — and indefatigable — playing style pushed him into fits of shock and disbelief. Afterward, he apologized for his failure to the South Korean public.

Reaching acceptance: Garry Kasparov, the former world-champion chess player, went through his own cycle of grief after being defeated by IBM’s DeepBlue in 1997. Although he didn’t retire, Kasparov did accuse IBM’s engineers of cheating. He later retracted the charge, and in 2017 wrote a book arguing that, if humans can overcome their feelings of being threatened by AI, they can learn from it. The book advocates an augmented intelligence in which humans and machines work together to solve problems.

The human element: Although AlphaGo won in the 2016 duel, its human opponent still managed to shine. During the fourth match, Sedol made a move so unconventional it defied AlphaGo’s expectation and led to his sole victory.

We’re thinking: Lee wasn't defeated by a machine alone. He was beaten by a machine built by humans under the direction of AlphaGo research lead David Silver. Human mastery is obsolete only if you ignore people like Silver and his team.