In the simulation segment, data is collected into a replay buffer, which is then sampled to provide batches for the training segment. The robot starts in an initial state, and the reward function assigns a value to that state. The state is fed into the policy (mu), which outputs the next action: the set of torques to be applied to each axis. The state and action are passed to the dynamics model, which returns a new state. This new state is then checked against the target state and flagged with a done flag. These five values (state, action, reward, next state, and done flag) are stored in the replay buffer, and the process repeats in a loop.
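A minimal sketch of this collection loop is given below, assuming hypothetical stand-ins `policy`, `dynamics_model`, `reward_fn`, and `is_target_state` for the components named in the text; the buffer size and reset behavior are illustrative choices, not details taken from the source.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer storing (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted when full

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a batch of transitions for the training segment.
        return random.sample(self.buffer, batch_size)


def collect(initial_state, policy, dynamics_model, reward_fn,
            is_target_state, buffer, num_steps):
    """Run the simulation loop described above, filling the replay buffer."""
    state = initial_state
    for _ in range(num_steps):
        reward = reward_fn(state)                    # reward assigned to current state
        action = policy(state)                       # torques for each axis
        next_state = dynamics_model(state, action)   # new state from the dynamics model
        done = is_target_state(next_state)           # flag whether the target is reached
        buffer.store(state, action, reward, next_state, done)
        # Restart from the initial state once the target is reached (an assumption),
        # otherwise continue from the new state.
        state = initial_state if done else next_state
    return buffer
```

Once enough transitions have accumulated, `buffer.sample(batch_size)` provides the batch consumed by the training segment.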