Learning Locomotion Full Description
Detailed Description of Our Technical Approach
The highlights of our approach include the selection of salient state and action spaces, joint-trajectory free stance control, and effective teaching. The following description of our approach is excerpted from our Learning Locomotion proposal written in April 2005.
Base Static Gait Algorithm
The first algorithm we develop will be a static one-leg-at-a-time crawl gait. We will use a divide-and-conquer approach, dividing the problem into the separate subtasks of stance control, swing control, terrain footstep mapping, and terrain quality mapping. For each subtask, we will implement a controller that uses a reduced degree-of-freedom, salient state and action space and learns a solution to the subtask while attempting to minimize a penalty function.
Stance Phase Controller
The input to the stance phase controller is the desired body center of mass position with respect to the center of the feet support polygon. The goal of the stance phase controller is to quickly get the robot from whatever configuration it is in to the desired center of mass position while maintaining stability. The desired orientation and height of the body will be determined based on the foot locations. The desired yaw will be that which makes the robot face forward; the desired pitch will be that which matches the slope of the ground, based on the foot locations; the desired roll will be zero; and the desired height will be a constant parameter. This choice is invariant throughout this controller and reduces the space of solutions.
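As a concrete illustration of how the desired body pose might be derived from the foot locations, the following sketch fits a plane to the four feet and extracts the desired yaw, pitch, roll, and height. The function name, the plane-fitting approach, and the nominal-height parameter are illustrative assumptions, not details from the proposal.

```python
import numpy as np

def desired_body_pose(foot_positions, nominal_height=0.12):
    """Sketch: derive the desired body yaw, pitch, roll, and height from the
    four foot positions (rows of a 4x3 array ordered front-left, front-right,
    hind-left, hind-right).  Conventions here are illustrative assumptions."""
    feet = np.asarray(foot_positions, dtype=float)

    # Desired yaw: face "forward", taken here as the direction from the
    # hind-feet midpoint toward the front-feet midpoint.
    front_mid = feet[:2, :2].mean(axis=0)
    hind_mid = feet[2:, :2].mean(axis=0)
    heading = front_mid - hind_mid
    yaw = np.arctan2(heading[1], heading[0])

    # Desired pitch: match the ground slope along the heading, estimated from
    # a least-squares plane z = a*x + b*y + c through the four feet.
    A = np.c_[feet[:, 0], feet[:, 1], np.ones(4)]
    coeffs, *_ = np.linalg.lstsq(A, feet[:, 2], rcond=None)
    slope_along_heading = coeffs[0] * np.cos(yaw) + coeffs[1] * np.sin(yaw)
    pitch = -np.arctan(slope_along_heading)

    # Desired roll is zero and desired height is a constant parameter
    # above the average foot height.
    roll = 0.0
    height = feet[:, 2].mean() + nominal_height
    return yaw, pitch, roll, height
```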
The key feature of the stance phase controller is that we will transform the raw state and action spaces from joint positions, velocities, and torques to more salient, and reduced dimensional spaces. The transformed state vector will consist of: leg positions with respect to the center of the feet support polygon; body center of mass location and velocity with respect to the center of the feet support polygon; body pitch and roll; and pitch and roll velocities. The action vector will consist of the net vertical force from the legs, the net torques about the yaw, pitch, and roll axes, and the commanded center of pressure x and y location with respect to the center of mass. The yaw variables will be eliminated through a rotational transformation as the heading of the robot is irrelevant to the task of stance control. This leaves a state vector with 16 dimensions (18 when four legs are on the ground) and an action vector with 6 dimensions.
Since this is a large space, we will further divide the stance controller into subtasks of controlling body height, body orientation, and body x-y position. The height controller will use only body height as its state vector, vertical force as its action vector, and squared error between height and desired height as its penalty function to be minimized. The orientation controller will use support polygon pitch and body yaw, pitch, and roll and their derivatives as the state vector, net body torques as the action vector, and sum of the squared error between the desired orientation and the actual orientation as the penalty function to be minimized. The body x-y controller will have as the state vector the leg (x,y) positions and the body center of mass (x,y) location and velocity with respect to the center of the support polygon. It will use the center of pressure location with respect to the center of mass as the action vector. The penalty function will consist of the squared error between the desired and actual center of mass location, as well as a penalty for a low stability margin. The stability margin will be computed as the amount of sideways force that needs to be applied to the center of mass before it tips.
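To make the subtask decomposition concrete, the following sketch writes out the three penalty functions described above. The relative weights and the inverse-margin form of the stability penalty are illustrative assumptions; the proposal specifies only what is penalized, not the exact functional form.

```python
import numpy as np

def height_penalty(z, z_desired):
    """Height subtask: squared error between body height and desired height."""
    return (z - z_desired) ** 2

def orientation_penalty(rpy, rpy_desired):
    """Orientation subtask: sum of squared roll, pitch, and yaw errors."""
    err = np.asarray(rpy) - np.asarray(rpy_desired)
    return float(err @ err)

def xy_penalty(com_xy, com_xy_desired, stability_margin, w_margin=1.0):
    """Body x-y subtask: squared center-of-mass position error plus a penalty
    that grows as the stability margin (the sideways force needed to tip the
    robot) becomes small.  The weight w_margin is an assumed parameter."""
    err = np.asarray(com_xy) - np.asarray(com_xy_desired)
    return float(err @ err) + w_margin / max(stability_margin, 1e-6)
```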
Each of the stance controller subtasks will be learned using a standard temporal difference Reinforcement Learning algorithm, such as Q-Learning with value function approximation. Learning will occur first on a simulation of the robot, initialized with randomly selected foot locations, body position, and body orientation. In simulation, we will run on the order of 1 million episodes until the algorithm reliably performs stance control. The resultant learned value function will then be transferred to the real robot where learning will continue. We will present the robot with a large number of training examples consisting of varying leg positions and desired body position. The leg positions will be set simply by setting the robot on different parts of the Terrain Board and the desired body positions will be randomly generated.
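A minimal sketch of the kind of learner we have in mind for each stance subtask is shown below: episodic Q-learning with a linear value-function approximator over a feature vector and a discretized action set. The environment interface (reset/step), the feature map, and all hyperparameters are assumptions for illustration only.

```python
import numpy as np

def q_learning_with_features(env, featurize, n_actions, episodes=1_000_000,
                             alpha=0.01, gamma=0.98, epsilon=0.1, seed=0):
    """Sketch of episodic Q-learning with linear value-function approximation.
    `env` is assumed to expose reset() -> state and
    step(action) -> (state, penalty, done); `featurize` maps a state to a
    feature vector.  These interfaces are our assumptions."""
    rng = np.random.default_rng(seed)
    n_features = featurize(env.reset()).shape[0]
    weights = np.zeros((n_actions, n_features))   # one linear Q per action

    for _ in range(episodes):
        state = env.reset()                        # random feet/body configuration
        phi = featurize(state)
        done = False
        while not done:
            if rng.random() < epsilon:             # epsilon-greedy exploration
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(weights @ phi))
            state, penalty, done = env.step(action)
            phi_next = featurize(state)
            reward = -penalty                      # minimizing penalty = maximizing reward
            target = reward + (0.0 if done else gamma * np.max(weights @ phi_next))
            td_error = target - weights[action] @ phi
            weights[action] += alpha * td_error * phi
            phi = phi_next
    return weights
```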
The described stance phase controller should be successful due to our state representation and a sensible division of subtasks. It will be tractable since the highest state dimension is 10 for the x-y position subtask.
Swing Leg Controller
The swing leg controller will have as its input the desired footstep location of the swing leg and a bounding sphere of keep-out regions for the leg, representing the ground protrusions that need to be avoided. The state vector will be the Cartesian position and velocity of the foot with respect to its hip location. The action vector will be the Cartesian forces on the foot. We will use a standard Reinforcement Learning algorithm with the penalty function to minimize consisting of the squared error between the desired foot location and the actual foot location, and the sum of the squared Cartesian forces on the foot endpoint.
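A sketch of the swing-leg penalty and the keep-out check follows, under the assumption that the keep-out region is a single bounding sphere and that the relative weight on the force term is a free parameter.

```python
import numpy as np

def swing_penalty(foot_pos, foot_pos_desired, foot_force, w_force=0.001):
    """Sketch of the swing-leg penalty: squared error between the desired and
    actual Cartesian foot position plus the sum of squared Cartesian endpoint
    forces.  The relative weight w_force is an illustrative assumption."""
    pos_err = np.asarray(foot_pos) - np.asarray(foot_pos_desired)
    force = np.asarray(foot_force)
    return float(pos_err @ pos_err) + w_force * float(force @ force)

def violates_keepout(foot_pos, sphere_center, sphere_radius):
    """Check whether the foot is inside the bounding sphere of the keep-out
    region (an additional penalty could be applied when this returns True)."""
    return np.linalg.norm(np.asarray(foot_pos) - np.asarray(sphere_center)) < sphere_radius
```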
As with the stance controller, we will first perform a significant amount of training in simulation and then transfer the learned behavior to the real robot for further training. After this learning, the swing leg controller will be able to quickly swing the leg to the desired location while avoiding the keep-out region. Further learning will lead to reduced swing time and hence quicker walking speed.
Footstep Sequence Planner
To determine the input to the stance and swing controllers on rough terrain, we will perform footstep sequence planning on a grid at the resolution of the terrain datafile. We will use a footstep evaluation function, Q(s), which evaluates the quality of taking a footstep s, based on the stability margin during the footstep and the time taken for the step. This function will have two terms, Q(s) = Q_ideal(s) + Q_learned(s), where Q_ideal(s) is a closed-form idealized value based on the static stability margin and distance traveled during the step, and Q_learned(s) is a learned correction term based on the actual measured values of the steps that are encountered while walking. We use this correction term because it is difficult to model the true values without experiencing them.
We define the space S of all possible footsteps in terms of the four initial foot locations and the step taken by one of the feet: s = (p_1, p_2, p_3, p_4, p_1'), where each p_i is a three-dimensional foot location and p_1' is the new location of the stepping foot. The foot locations are ordered so that p_1 is the foot that is moved to p_1'. The space S is defined relative to one foot, so only three of the initial foot locations need to be specified. In addition, Q(s) is invariant with respect to yaw rotations, so we can reduce one more dimension, resulting in an 11-dimensional space. During planning, Q_ideal(s) and Q_learned(s) are fixed. Real-time re-planning can occur as Q_learned(s) is learned.
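The two-term footstep evaluation could be organized as in the sketch below, with the learned correction represented as a linear function of a hand-chosen step feature vector updated from measured step outcomes. The feature representation, the linear form, and the equal weighting inside the idealized term are illustrative assumptions.

```python
import numpy as np

class FootstepEvaluator:
    """Sketch of the two-term evaluation Q(s) = Q_ideal(s) + Q_learned(s)."""

    def __init__(self, n_features):
        self.weights = np.zeros(n_features)   # parameters of Q_learned

    @staticmethod
    def ideal_value(static_stability_margin, distance_traveled):
        """Closed-form idealized value from the static stability margin and
        the distance traveled during the step (assumed equal weighting)."""
        return static_stability_margin + distance_traveled

    def value(self, features, static_stability_margin, distance_traveled):
        return (self.ideal_value(static_stability_margin, distance_traveled)
                + float(self.weights @ np.asarray(features)))

    def update(self, features, measured_quality, static_stability_margin,
               distance_traveled, lr=0.01):
        """Move Q_learned toward the quality actually measured while walking."""
        features = np.asarray(features)
        error = measured_quality - self.value(features, static_stability_margin,
                                              distance_traveled)
        self.weights += lr * error * features
```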
We will search the space of footsteps quickly using a dynamic programming algorithm that focuses on potential paths in a “swath” around a preliminary path given by the High Level Path Generator described below. We use dynamic programming because the optimal sequence of footsteps can be broken into smaller optimal subsequences, allowing search through an exponentially large number of sequences in polynomial time. We define the quality of an optimal sequence of footsteps as the sum of the quality of two optimal subsequences: Q*(s_1, ..., s_n) = Q*(s_1, ..., s_k) + Q*(s_(k+1), ..., s_n).
We expect the rough terrain to constrain the number of candidate footsteps. To further reduce computational requirements, we will bias our search along an initial path provided by the High Level Path Generator described below. We will limit the space of footsteps to those within a certain distance (approximately 4 body widths) from this path. The resultant space of potential paths inside this “swath” will be tractable to search using dynamic programming.
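The dynamic-programming search over the swath could be organized as in the sketch below. Here we make the simplifying assumption that candidate stance configurations are grouped into stages along the preliminary path, and we search backward for the highest-quality sequence; the staging, the function signatures, and the reachability test are illustrative choices, not the proposal's exact formulation.

```python
def plan_footsteps(stage_candidates, step_quality, transition_ok):
    """Sketch of a dynamic-programming search over a 'swath' of candidates.
    stage_candidates[i] is the list of candidate stance configurations at
    stage i along the preliminary path; step_quality(a, b) scores moving from
    configuration a to b; transition_ok(a, b) limits transitions to reachable
    steps.  All three are assumed interfaces."""
    n = len(stage_candidates)
    # best[i][j] = best total quality from candidate j at stage i to the end
    best = [dict() for _ in range(n)]
    choice = [dict() for _ in range(n)]
    for j, _ in enumerate(stage_candidates[-1]):
        best[-1][j] = 0.0
    for i in range(n - 2, -1, -1):                 # backward induction
        for j, a in enumerate(stage_candidates[i]):
            best[i][j], choice[i][j] = float("-inf"), None
            for k, b in enumerate(stage_candidates[i + 1]):
                if not transition_ok(a, b):
                    continue
                q = step_quality(a, b) + best[i + 1][k]
                if q > best[i][j]:
                    best[i][j], choice[i][j] = q, k
    # Recover the best sequence starting from the best first-stage candidate.
    j = max(best[0], key=best[0].get)
    plan = [stage_candidates[0][j]]
    for i in range(n - 1):
        j = choice[i][j]
        if j is None:
            break
        plan.append(stage_candidates[i + 1][j])
    return plan
```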
High Level Path Generator
In order to drastically reduce the computational requirements of the Footstep Sequence Planner, we will determine a promising path with a high level path generator. This generator begins with the fine-resolution grid of terrain heights. From these data, we will perform the following steps (a sketch of the core steps appears after this list):
Produce a cell grid from the datafile, where each cell has a center position, a height, and a unit normal vector.
Produce a collection of terrain “patches”, where each patch represents an area of similar slope and no height discontinuities. To construct a patch, recursively examine the neighboring cells of the patch and add those whose height change is below a threshold and whose normal is within a threshold angle of the parent cell of the patch.
Produce an undirected graph where nodes represent patches and edges represent a possible traversal from one patch to another, based on maximum step distance from the patch edge. Assign node values based on the cost of traversing the patch (patch slope, size) and edge values based on the cost of traversing between patches (boundary height discontinuity, distance between patch centers, and “gap” distance if the patches are not adjacent).
Label the start and goal patches, and search the graph for a minimum cost path using a standard graph search algorithm.
Fit a smoothed spline to the patch centroids found by the search.
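A sketch of the two central steps of this pipeline, patch growing by flood fill and a minimum-cost graph search over patches, is given below. The grid representation, thresholds, and cost bookkeeping are illustrative assumptions; the construction of edge costs and the spline fit are omitted for brevity.

```python
import heapq
import numpy as np

def grow_patches(heights, normals, dz_max=0.03, angle_max=0.2):
    """Sketch of the patch-growing step: flood-fill cells whose height change
    and normal deviation from the parent cell stay below thresholds.
    heights is (rows, cols); normals is (rows, cols, 3) unit normals."""
    rows, cols = heights.shape
    patch_id = -np.ones((rows, cols), dtype=int)
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if patch_id[r, c] != -1:
                continue
            stack, patch_id[r, c] = [(r, c)], next_id
            while stack:
                i, j = stack.pop()
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < rows and 0 <= nj < cols) or patch_id[ni, nj] != -1:
                        continue
                    angle = np.arccos(np.clip(normals[ni, nj] @ normals[i, j], -1.0, 1.0))
                    if abs(heights[ni, nj] - heights[i, j]) < dz_max and angle < angle_max:
                        patch_id[ni, nj] = next_id
                        stack.append((ni, nj))
            next_id += 1
    return patch_id, next_id

def shortest_patch_path(edges, node_cost, start, goal):
    """Dijkstra over the patch graph; edges maps patch -> [(neighbor, edge_cost)]."""
    dist, prev = {start: node_cost[start]}, {}
    queue = [(dist[start], start)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges.get(u, []):
            nd = d + w + node_cost[v]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(queue, (nd, v))
    path, u = [goal], goal
    while u != start:
        u = prev[u]
        path.append(u)
    return path[::-1]
```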
The resultant path will be a promising candidate to search exhaustively for footsteps using the Footstep Sequence Planning algorithm.
Justification of Our Approach
Our approach to Learning Locomotion will be successful due to its use of salient state and action spaces, its divide-and-conquer decomposition into subtasks and subgoals, its choice of reduced-dimensional state and action subspaces for each subtask, and its use of relevant training examples.
Similar reductions of spaces are typical in non-trivial examples of machine learning. The theoretical alternative is that, given the state of the Little Dog robot (joint angles and velocities), the state of the world (terrain map), and the goal (get to the other end of the Terrain Board), one could use pure tabular Reinforcement Learning to find the optimal action (servo commands) for the current state. This is a well-posed reinforcement learning problem, and textbook techniques guarantee convergence. However, there are several “curses” that make this “Pure Reinforcement Learning” approach impossible in practice. As with nearly all successful examples of nontrivial machine learning, we can “break” these curses using prior knowledge of the problem, human intuition, and approximation. The art is to determine the best way to incorporate these elements without significantly degrading the quality of the solution.
Curse of Dimensionality. If we were to discretize each of the 12 joint angles, 12 joint velocities, and 3 orientation angles and 3 angular velocities into only 10 discrete values, there would be 10^30 states of the robot! This search space is too large to exhaustively search. To break the curse of dimensionality we will focus on promising regions of the state space, reduce the search space through the use of salient features as state variables, use approximation functions for generalization across regions of state space, and accept suboptimal but good solutions.
Needle in a Haystack Curse. In the space of control systems, the quantity of good controllers is very small compared to the quantity of all possible controllers. For systems such as legged robots, in which all bad controllers are equally bad with respect to the goal (how does one compare which of two robots fell down “better”?), the small subset of good controllers lives on “fitness plateaus” in the search space. Finding these plateaus can be a “needle in a haystack” problem. To break the needle in a haystack curse, we will guide learning onto plateaus in fitness space by providing examples, seeding learning routines, and narrowing the range of potential policies through the choice of control system structure.
Curse of Delayed Rewards with a Complex Task. The goal of the Learning Locomotion competition is to traverse the Terrain Board as quickly as possible. Thus no reward is given (and no learning occurs) in the pure Reinforcement Learning approach until a very complex sequence of actions is accomplished purely by chance. Crossing the board by pure chance would eventually occur in theory, but with the complexity of this problem it would take a prohibitively long time. To break the curse of delayed rewards, we will divide the problem into subtasks, provide intermediate rewards for partial work towards a goal, and seed value functions through examples.
Curse of the Expense of Real World Trials. The time cost of performing exploratory actions on a real robot is high. If each trial takes one minute, even a fully automated system can only perform 1440 trials per day. One cannot afford to waste these precious trials on unguided pure Reinforcement Learning in massive state spaces. To break the curse of the expense of real world trials, we will run numerous trials in simulation, and focus on potentially good solutions without exhaustive exploration of all possibilities.
In deciding which elements of expert knowledge to incorporate in the design of the learning algorithm, one must trade off reducing the time to find a good solution in the reduced space of possibilities against reducing the potential optimality of solutions. Incorporating those elements of human knowledge that are fundamental to the problem of quadrupedal locomotion will have the greatest cost-benefit. Restrictions that are incorporated simply to make the problem tractable, or that are based on misconceptions about the quadrupedal locomotion problem, will have the least cost-benefit. The trick is recognizing the difference and incorporating the most fundamental domain knowledge in the most appropriate way.
Fundamental Characteristics of Quadrupedal Locomotion
We will incorporate knowledge of the fundamentals of quadrupedal locomotion into our learning algorithms in a variety of ways, which will reduce learning time requirements while not reducing the quality of potentially learned solutions. The most fundamental aspect of legged locomotion, in its most compact form, is that the angular momentum about the center of pressure is changed only by gravity. This observation leads to our selection of state and action spaces, highlights the importance of the location of the center of pressure with respect to the center of mass, and justifies our selection of subtasks for our walking algorithm.
While high degree-of-freedom legged locomotion has extremely complicated dynamics, one fundamental aspect of the dynamics is that the rate of change of the total angular momentum of the robot (or animal) about the center of pressure of the feet is only affected by gravity:
dH/dt = r × (m g),    (1)

where dH/dt is the rate of change of the total angular momentum about the center of pressure, r is the vector from the center of pressure to the center of mass, m is the mass of the robot, and g is the gravity vector. This compact representation results in a number of corollaries:
- With the assumptions of a point mass body, and a straight leg, equation (1) reduces to the dynamic equations of an inverted pendulum.
- For statically stable gaits, both the momentum of the robot and the rate of change of momentum are small. For dH/dt ≈ 0, equation (1) requires r × g ≈ 0, so the vector from the center of pressure to the center of mass must be nearly parallel to gravity. In other words, the center of mass must remain almost directly over the center of pressure. Because the center of pressure by definition is always inside the support polygon of the feet, for statically stable gaits the center of mass must stay above the support polygon. Therefore, for a quadruped, the only statically stable gait is the “crawl” gait in which one leg is moved at a time.
- The angular momentum of a robot about its center of pressure is the sum of the angular momentum due to the center of mass rotating around the center of pressure, and internal angular momentum due to internal mass rotating about the center of mass. Thus one way to control forward velocity, and hence balance, is to “windmill” appendages.
- For the Little Dog platform, the legs are light. If one assumes that the leg mass cannot be “windmilled” to control balance, then the only opportunity for changing the momentum of the center of mass about the center of pressure is by controlling the center of pressure location with respect to the center of mass.
- If one assumes that the height of the center of mass of the robot is not accelerating quickly, then equation (1), converted to Cartesian coordinates, reduces to d²x/dt² = (g/z) x and d²y/dt² = (g/z) y, where x and y are the coordinates of the center of mass with respect to the center of pressure, z is the height of the center of mass above the center of pressure, and g is the gravity constant. Note that these dynamic equations are linear (see the sketch after this list).
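To illustrate the linearity, here is a minimal numerical sketch of these dynamics, with the commanded center of pressure as the control input. The explicit Euler integration and the parameter values are illustrative assumptions.

```python
def lipm_step(x, y, xdot, ydot, cop_x, cop_y, z=0.12, dt=0.01, g=9.81):
    """Sketch of the linearized stance dynamics above: with the center-of-mass
    height z roughly constant, the horizontal offset of the center of mass
    (x, y) from the commanded center of pressure (cop_x, cop_y) drives the
    horizontal acceleration, xddot = (g / z) * (x - cop_x)."""
    xddot = (g / z) * (x - cop_x)
    yddot = (g / z) * (y - cop_y)
    xdot, ydot = xdot + xddot * dt, ydot + yddot * dt
    return x + xdot * dt, y + ydot * dt, xdot, ydot
```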
This analysis affirms that locations of the center of pressure and of the center of mass are two salient features of the stance phase of legged locomotion. The normally complicated, highly nonlinear dynamics become much simpler and smoother when represented with those variables.
Encoding Salient Features in the State and Action Vector
Machine learning results can be drastically improved by a proper choice of state and action vectors that are salient to the problem at hand. For example, in a previous study we have shown that the learning speed with TD-Lambda Reinforcement Learning for a simple game of Pong is drastically increased if the state vector is augmented with the projected location where the ball will arrive. If the action vector is changed to paddle acceleration with respect to projected ball location, then the learning becomes trivial and an optimal policy can be found in only several trials. This is an extreme case in which the entire “answer” was encoded in the state and action vectors, but no new information was added and no potential solutions were eliminated.
In previous studies with Spring Flamingo, Agile Hexapod simulations, and ongoing studies with Pluto, we have shown that the choice of appropriate state and action spaces can significantly affect the complexity and quality of a control system. Figure 1 shows these robots walking over moderately rough terrain, sensing the terrain only through their feet. The traditional approach of high-gain position servos at each degree of freedom would result in suboptimal performance, including foot slipping, a condition typically observed in walking robots. Instead, for each of these robots, we transform the state vector to be the relative Cartesian position from each foot to the center of mass of the robot, and the action vector to be the Cartesian forces applied by each leg to the center of mass of the robot. We then use a force distribution algorithm that distributes a desired net Cartesian body force and torque vector among the legs that are currently on the ground. The resultant transformed state and action vectors are then relative Cartesian positions of the legs and net Cartesian forces and torques on the body. With these salient state and action spaces, linear proportional-derivative controllers can be used for controlling the body motion. Similarly, the swing legs are swung by commanding Cartesian endpoint forces based on Cartesian endpoint positions with respect to the body.
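The force distribution step can be sketched as a least-squares problem: stack the per-leg identity and cross-product blocks into a single matrix and take the minimum-norm solution. This pseudo-inverse formulation is one common way to do it and is shown only as an illustration; it is not necessarily the exact algorithm used on the robots above, and it omits unilateral-contact and friction-cone constraints that a real distribution algorithm must respect.

```python
import numpy as np

def distribute_forces(foot_positions, net_force, net_torque):
    """Sketch: find per-leg Cartesian forces applied at the stance feet
    (positions given relative to the center of mass) whose sum matches the
    desired net force and whose moments sum to the desired net torque."""
    feet = np.asarray(foot_positions, dtype=float)      # (n_feet, 3)
    n = feet.shape[0]
    # Build the 6 x 3n map from stacked leg forces to the net body wrench:
    # top block sums forces, bottom block sums moments r_i x f_i.
    A = np.zeros((6, 3 * n))
    for i, r in enumerate(feet):
        A[:3, 3 * i:3 * i + 3] = np.eye(3)
        A[3:, 3 * i:3 * i + 3] = np.array([[0.0, -r[2], r[1]],
                                           [r[2], 0.0, -r[0]],
                                           [-r[1], r[0], 0.0]])
    wrench = np.concatenate([net_force, net_torque])
    forces = np.linalg.pinv(A) @ wrench                 # minimum-norm solution
    return forces.reshape(n, 3)
```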
With the Little Dog platform, we propose to use a similar transformation of state and action spaces to a more salient set. Using Cartesian state vectors will be straightforward. However, using Cartesian forces requires fairly good torque control at the joints, which we suspect the Little Dog platform will not have due to the use of high reduction gear drives with their associated friction and backlash. However, because each foot contains a force sensor, we propose to command Cartesian forces among the legs through a combination of current control at the joint level and feedback of the force sensors, in the following manner:
- Given a desired center of pressure location, vertical force, and torque vector on the body, distribute that force and torque vector among the legs in contact with the ground.
- For each leg, transform from Cartesian forces to joint torques.
- Apply current to each motor in attempts to achieve these joint torques.
- Measure the ground reaction forces and compute the center of pressure location. Compare these to their expected values.
- Modify the current applied to each motor to reduce the discrepancy between desired and measured ground force and center of pressure location.
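One iteration of this loop for a single leg might look like the sketch below. The Jacobian-transpose mapping tau = J^T f is standard, but the linear torque-to-current model and the feedback gain are assumptions made for illustration.

```python
import numpy as np

def cartesian_force_step(jacobian, desired_force, measured_force,
                         current_per_torque, correction_gain=0.2):
    """Sketch of one iteration of the proposed force-control loop for a leg:
    map the desired Cartesian foot force to joint torques through the leg
    Jacobian, convert torques to motor currents, and trim the command using
    the foot force sensor to compensate for gearbox friction and backlash."""
    J = np.asarray(jacobian)                      # (3, n_joints) foot Jacobian
    force_error = np.asarray(desired_force) - np.asarray(measured_force)
    # Feedforward force plus a feedback correction from the force sensor.
    commanded_force = np.asarray(desired_force) + correction_gain * force_error
    joint_torques = J.T @ commanded_force         # tau = J^T f
    motor_currents = current_per_torque * joint_torques
    return joint_torques, motor_currents
```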
Divide and Conquer Approach
Quadrupedal animals smoothly transition from walk to trot to gallop gaits as speed increases. It is still debated whether a single mechanism accounts for each gait, or if there are several different mechanisms that are discretely switched as the gait changes. From an engineering point of view, the latter case allows for a system that is much easier to implement and will be the approach we take. We will develop separate learning algorithms for the crawl gait, trot walking, trot running, bounding, and jumping. The leg phasing for each gait will either be dictated by the controller structure or by penalizing non-conforming motion in reward functions.
Previous work on legged locomotion has shown that dividing the problem into subtasks, each with an independent controller, can result in successful locomotion, even though the dynamics are coupled. The most widely known example of this approach is Raibert’s “three part controller”, in which height, pitch, and speed are controlled by separate individual control laws. In our work, we have utilized a similar approach, but have extended it by a) using multiple interacting controllers for some parts, such as speed control, and b) increasing the region of convergence by augmenting the decoupled controllers with coupled controllers that were either hand-crafted or generated through machine learning techniques. In this project we will use a separate learning algorithm to optimize each subtask. The subtasks will be defined through the state and action selection and the reward function for each learning algorithm. These algorithms will learn simultaneously and interact with one another. As one improves on its subtask, the others will be exposed to that change and improve upon their own subtasks accordingly.
Comparison with Current Technology
Our project approach is primarily influenced by and similar to our previous and ongoing research in legged locomotion. This work includes (Figure 1) Spring Flamingo, a planar bipedal walking robot that can walk over rolling terrain with no prior information of the terrain; a simulated 12-dof 3D biped that can recover from significant disturbances; a simulated hexapod which can walk over rolling terrain while balancing a pendulum on its back; and simulations of Pluto, a 12-dof quadrupedal robot. The Pluto simulations can jump over one body length, climb standard sized stairs, dynamically trot-walk over rolling terrain and randomly placed blocks with no prior knowledge of the terrain, and run two body lengths per second.
Figure 1: Examples of our previous and ongoing work in legged locomotion on rough terrain.
In each of these projects, we have used modest amounts of machine learning, particularly genetic algorithms to tune controller parameters and supervised learning to learn dynamics models for model-based control approaches. The Learning Locomotion project is expanding on our previous work by extending the range of terrains to extreme terrain where deliberative planning must be used for selecting footholds and by incorporating massive amounts of machine learning, with a significant focus on Reinforcement Learning techniques.
Our machine learning approaches are influenced by a large number of previous researchers too numerous to list. Much of this prior work is succinctly described by Sutton and Barto (MIT Press 2002), two pioneers in the field of Reinforcement Learning. Our proposed approach to learning by demonstration is influenced by Atkeson and Schaal who have shown the utility of this approach in several robotic applications including bipedal walking (IROS ‘03) and robot juggling (ICRA ’97).
There are some elegant examples of machine learning applied to locomotion, including Miller and Kun (ICRA ’96), who learned optimal neural network weights as an augmented controller for a bipedal walking controller; Kimura and colleagues (Advanced Robotics ‘01) who controlled a quadrupedal robot using an adaptive neural system model; and Tedrake who applied Reinforcement Learning with an actor-critic architecture to extend the region of convergence of a biped from passive walking downhill to active walking on flat ground and gentle inclines. All of these studies were adept at incorporating prior knowledge effectively, but none of them were applied to extremely rough terrain.