Overview

Robust robot motion planning under real-world constraints is difficult to handle with analytical methods. Much progress has been made using Reinforcement Learning techniques to give robots robust control in their interaction with the environment. Reinforcement Learning is a broad area of research that encompasses many different techniques. These can be classified broadly by whether their environment, actions and state are discrete or continuous, and deterministic or probabilistic. The robot arm's action is the input torque, which is a continuous action space. Its state is defined as the position and velocity of each joint, which is also a continuous state space. The environment is deterministic, since a specific action taken in a specific state will lead to a specific next state. An algorithm well suited to this combination of continuous action and state spaces with a deterministic environment is Deep Deterministic Policy Gradient (DDPG).

Dynamic equations relating input torque to motion were generated for the three-axis robot. Below is a visualization of the robot with its three independently moving axes.

Learning Algorithm

Deep Deterministic Policy Gradient (DDPG) is a fusion of two reinforcement learning techniques, Q-learning and Policy Gradient. A Q function assigns a value to a state-action pair. A Policy (mu) is a function that assigns the optimum action to a state, in other words the action that will give the highest Q value. The Q value in turn represents the total reward accumulated after taking this action from this state and then following the policy mu. In this way Q and mu depend on each other, which can lead to instability. To counter this, the algorithm makes use of a target policy and a target Q function, which are updated more slowly to stabilize the learning process.

A reward function is also defined, which assigns a reward to a state. In this situation it is the nearness of the end-effector to its target position. Note the difference between the Q function and the reward function. The reward function assigns a reward for being in a specific state, while the Q function assigns a value to a state-action pair followed by a trajectory defined by the policy. In other words, the reward function is local to this state, whereas the Q function predicts the value of the entire trajectory that results from taking this action and then following policy mu.
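The distinction between the two can be sketched in a few lines of Python. This is only an illustration under assumptions: the negative Euclidean distance used as the reward, the function names, and the discount factor gamma are not taken from the notebook, which says only that the reward reflects nearness to the target.

```python
import math

def reward(end_effector_pos, target_pos):
    """Local reward for one state: negative distance to the target (assumed form)."""
    return -math.dist(end_effector_pos, target_pos)

def discounted_return(rewards, gamma=0.99):
    """Quantity a Q function estimates: rewards summed along a whole trajectory.

    gamma is an assumed discount factor; the notebook does not state one.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

The reward looks at a single state, while the return accumulates the rewards of every state visited afterwards, which is what Q is trained to predict.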

Both Q and the Policy are approximated with deep neural networks and will be described in greater detail in the next section.

The learning algorithm is broken into two parts: simulation and training.

Simulation

In the simulation segment, data is collected into a replay buffer, which is then sampled to provide a batch for the training segment. The robot is given an initial state and the reward function assigns a value to that state. The state is fed into the Policy (mu) to give the next action, which is the set of torques to be applied to each axis. This state and action are fed to the dynamic model, which returns a new state. Finally, the new state is checked against the target state and flagged accordingly. These five values (state, action, reward, next state, and done flag) are then stored in the replay buffer. This process is repeated in a loop.
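The collection loop described above might look like the following minimal sketch. The class and function names are hypothetical, and the one-dimensional policy, dynamics and reward at the bottom are toy stand-ins for the robot model, not the notebook's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest entries drop out when full

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

def collect(buffer, policy, dynamics, reward_fn, is_target, state, max_steps):
    """Run one episode and record every transition in the buffer."""
    for _ in range(max_steps):
        r = reward_fn(state)                  # reward for the current state
        action = policy(state)                # torques chosen by mu
        next_state = dynamics(state, action)  # dynamic model gives the new state
        done = is_target(next_state)          # flag whether the target was reached
        buffer.add(state, action, r, next_state, done)
        state = next_state
        if done:
            break
    return state

# Toy 1-D stand-ins (assumptions): the state is a distance, the target is 0.
buffer = ReplayBuffer(capacity=1000)
collect(buffer,
        policy=lambda s: -0.5 * s,
        dynamics=lambda s, a: s + a,
        reward_fn=lambda s: -abs(s),
        is_target=lambda s: abs(s) < 0.01,
        state=1.0,
        max_steps=50)
```

A bounded deque is used so that old transitions are discarded automatically once the buffer is full; whether the notebook bounds its buffer is not stated.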

Training

In the training segment, a batch is sampled from the replay buffer and used to train the neural networks. Sampling from the buffer, rather than training only on the most recent transitions, ensures that the algorithm won't over-train on local conditions. A target Q value is calculated for each sample in the batch: the reward of this state plus all the subsequent rewards, as estimated by the target Q function when following the target policy mu. Each neural network is then updated independently in four steps.
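A sketch of the target calculation, under stated assumptions: the discount factor gamma and the choice to drop the bootstrap term on terminal transitions follow common DDPG practice and are not spelled out in the notebook. For brevity the target Q function here takes only (state, action), whereas the notebook's modified Q also receives the next state.

```python
def td_targets(batch, target_q, target_mu, gamma=0.99):
    """Compute one target Q value per sampled transition.

    batch     : iterable of (state, action, reward, next_state, done)
    target_q  : callable (state, action) -> float, the slowly updated Q
    target_mu : callable state -> action, the slowly updated policy
    gamma     : assumed discount factor
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            # No future rewards once the target state has been reached.
            targets.append(reward)
        else:
            next_action = target_mu(next_state)
            targets.append(reward + gamma * target_q(next_state, next_action))
    return targets
```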

Steps

1. First, the Q function is trained to minimize the mean square loss between its value and the calculated target value.
2. Next, the Policy network mu is trained to maximize the Q value.
3. Next, the target Q network weights are adjusted slightly in the direction of the Q network.
4. Finally, the target Mu network weights are adjusted slightly in the direction of the Mu network.
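Steps 3 and 4 are usually implemented as a weighted average of the two weight sets. The sketch below assumes the weights are available as a flat list of numbers and uses a hypothetical mixing rate tau; the value actually used in the notebook is not given.

```python
def soft_update(target_weights, online_weights, tau=0.001):
    """Move each target weight a small fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]
```

With a small tau the target networks trail the trained networks slowly, which is what keeps the target Q values from shifting abruptly between training steps.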

Neural Network Architecture

The Neural Network architecture used in the literature for the Value function Q and the Policy mu comprises two hidden layers of size 400 and 300 respectively. The activation layers are all element-wise ramp layers, with the exception of the output of mu, which is a tanh layer that bounds the output between the saturation torques. A slight modification was made to the classic Q function so that it takes the next state as well as the current state as input. It was hypothesized that this extra information would assist the learning process. Below is the architecture for the Q and Mu functions.

Q Neural Net

Out[]=

NetGraph



InputPorts
State:	vector (size: 10)
Action:	vector (size: 3)
NextState:	vector (size: 10)
OutputPort
QValue:	vector (size: 1)



Mu Neural Net

Out[]=

NetGraph



InputPort
State:	vector (size: 10)
OutputPort
Action:	vector (size: 3)

Action:

Port

Form:

vector

(size: 3)
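To make the shape of the Mu network concrete, here is a rough pure-Python forward pass with the layer sizes given above (10 inputs, hidden layers of 400 and 300, 3 outputs). The random initialisation, the helper names and the max_torque scaling are assumptions for illustration; the notebook builds its networks as Wolfram NetGraph objects, not with this code.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def ramp(values):
    """Element-wise ramp (ReLU) activation."""
    return [max(0.0, v) for v in values]

def init_layer(n_in, n_out, rng):
    """Random weights in a small uniform range (assumed initialisation)."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_mu(state_size=10, action_size=3, hidden=(400, 300), seed=0):
    rng = random.Random(seed)
    sizes = [state_size, *hidden, action_size]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def mu_forward(layers, state, max_torque=1.0):
    """Map a state to torques bounded in [-max_torque, max_torque]."""
    x = list(state)
    for weights, biases in layers[:-1]:
        x = ramp(dense(x, weights, biases))
    weights, biases = layers[-1]
    # tanh keeps each output in (-1, 1); scaling gives the saturation torque.
    return [max_torque * math.tanh(v) for v in dense(x, weights, biases)]
```

For example, `mu_forward(make_mu(), [0.1] * 10, max_torque=2.5)` returns three torques, each within plus or minus 2.5.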



Training Net

Two training nets were employed to implement steps 1 and 2 of the learning algorithm. The Q training net was trained with a mean square loss against the target value, which was calculated as the reward for this state plus the target Q value of the next state.

Out[]=

NetGraph



InputPorts
State:	vector (size: 10)
Action:	vector (size: 3)
NextState:	vector (size: 10)
Target:	vector (size: 1)
OutputPort
Loss:	real



The mu training net is a composition of mu and Q: the action output of mu feeds the action input of Q. Since the goal of this step is to maximize Q, a custom loss function was employed which zeroed the target, so that Mu is updated to minimize the negative Q value, which is the objective of step 2.

Out[]=

NetGraph



InputPorts
State:	vector (size: 10)
NextState:	vector (size: 10)
Target:	array
OutputPort
Output:	array
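The idea of training mu by minimizing the negative Q value can be shown on a deliberately tiny example. Everything here is a hypothetical stand-in: the critic is a fixed quadratic rather than a trained network, the policy has a single weight, and a finite-difference gradient replaces backpropagation through the composed net.

```python
def q_value(state, action):
    """Toy critic (assumed): highest when the action equals the state."""
    return -(action - state) ** 2

def policy_loss(weight, states):
    """Loss for the policy mu(s) = weight * s: the negative mean Q value."""
    return -sum(q_value(s, weight * s) for s in states) / len(states)

def train_policy(weight, states, learning_rate=0.1, steps=200, eps=1e-6):
    """Gradient descent on the negative Q value, with Q held fixed."""
    for _ in range(steps):
        # Central finite difference stands in for backprop through Q into mu.
        grad = (policy_loss(weight + eps, states)
                - policy_loss(weight - eps, states)) / (2 * eps)
        weight -= learning_rate * grad
    return weight
```

Starting from `weight = 0.0` with states such as `[0.5, 1.0, -1.5]`, the weight moves toward 1, the value at which this toy critic is maximized. Only the policy parameter changes during this step; the critic is left untouched, as in step 2.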



Results

Many implementations were tried, varying hyperparameters such as network architecture, reward function definition and training parameters. Some improved the overall trajectory reward but plateaued before reaching a good solution, and some got stuck in local maxima very early on. Below is a selection of several attempts by the algorithm to learn the motion needed to reach the target.

What's Next?

To achieve a successful implementation of DDPG, several things will be attempted. Larger and more complex network architectures will be explored, and the action noise threshold during learning will be raised to increase exploration. The reward function will also be varied to encourage different types of behaviour.
