Zambesi: Reinforcement Learning for Legend of the Greasepole


Robert Burke, 18 Apr/99

About The Pole Game

The Pole Game begins as 85 artificially intelligent frosh are tossed into the greasepit through a roaring crowd. You view the action through the eyes of a Frec (upper-year student) standing on the bank of the pit. The premise is that you and some of your co-Frecs have noticed that the frosh this year are particularly keen. They’ll likely climb the Greasepole in record time, and not learn the teamwork they’ll need to survive their upcoming Applied Science education. You have to stall the frosh for as long as possible as they attempt to climb the pole.

As the game progresses, the frosh will learn new tricks, become more resilient to your attempts to stall them, and learn to work together. Should they have problems, they will be assisted by Alan "Pop Boy" Burchell, an upper-year student who jumps into the pit and accelerates their learning process.

The Frosh Character and IntelliFrosh

IntelliFrosh is the behavior-based artificial intelligence engine, developed by Robert Burke, that governs the behavior of the frosh and other sprites in the game.

The actions of the frosh character break down into 31 behaviors that encompass over 5,000 lines of code. Although the details of these behaviors changed over the course of the project, the core of the artificial intelligence has changed very little since its inception in the winter of 1996. After working on IntelliFrosh for a year, the author had a chance to take notes in September of 1997 while watching Science ’01 climb their Greasepole. Sample pages of his notes – mud and lanolin stains included – are found on the Legend of the Greasepole CD. These observations were used to improve IntelliFrosh’s simulation of the transition from chaos to order.

Each of the 85 frosh "thinks" 24 times a second, and there is no "overmind" that controls the horde. They base their decisions on the interaction of over 15 internal characteristics that describe their motivations and their knowledge of the game world.

Overview of Frosh Behaviors

The frosh behaviors are best understood arranged in a four-tiered system. Each behavior consists of an "initialization" function and an "action" function. Every frosh sprite tracks a pointer to the function serving as their current behavior.
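The arrangement described above can be pictured with a small C++ sketch. It is only an illustration under assumed names: Frosh, Behavior, switchBehavior, think and the sample "free fall" functions are all hypothetical, and the sketch stores a pointer to a pair of functions rather than to a single function. The actual IntelliFrosh declarations are not reproduced in this document.

```cpp
// Hypothetical sketch: none of these names come from the IntelliFrosh source.
struct Frosh;

// A behavior pairs an "initialization" function with an "action" function.
struct Behavior {
    void (*init)(Frosh&);    // run once when the behavior is entered
    void (*action)(Frosh&);  // run on every "think" tick
};

struct Frosh {
    const Behavior* current = nullptr;  // the behavior currently exhibited
    int frame = 0;                      // frame of the current behavior
    double height = 0.0;                // illustrative state only
};

// Entering a behavior records it and runs its initialization function once.
inline void switchBehavior(Frosh& f, const Behavior& b) {
    f.current = &b;
    b.init(f);
}

// One "think" tick: run the action function of the current behavior.
inline void think(Frosh& f) {
    if (f.current != nullptr) f.current->action(f);
}

// Example behavior loosely modelled on behavior 1 (free-falling into the pit).
// The starting height and fall speed are made-up numbers.
inline void fallInit(Frosh& f)   { f.frame = 0; f.height = 10.0; }
inline void fallAction(Frosh& f) { ++f.frame; f.height -= 1.0; }
const Behavior kFreeFall{fallInit, fallAction};
```

Under this layout, changing what a frosh does is a matter of calling switchBehavior with a different Behavior; the per-tick loop never needs to know which behavior is active.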

The first group of behaviors, numbered 1 through 3, manages frosh under the influence of gravity. The frosh fall with little or no control over their actions.

The second tier – behaviors 4 through 7 – manages frosh in the greasepit water. Behavior 4 is a sort of "hub" for the artificial intelligence at this tier. A frosh may make a decision based on internal characteristics and perceived state of the game world to transfer between behavior 4 and behaviors 5, 6 and 7. A frosh exhibiting behavior 7 may choose to exhibit behavior 9 and climb out of the water as a function of their ambition, level of excitement, and knowledge of weight ratios. These characteristics all vary as the game progresses. For example, one way in which the artificial frosh mimic their human counterparts is that they start out keen to climb up the human pyramid. Just about everyone wants to be a hero, and the resulting human pyramid becomes top-heavy. As the game progresses, the frosh learn to exercise caution before climbing up.

The third tier – behaviors 9 through 14 – manages frosh dealing with the upper levels of the human pyramid. As they climb, they become influenced by weight on their shoulders, and they apply weight on the shoulders of those beneath them. A significant amount of the learning in the frosh pertains to how they handle situations encountered at this tier in the behavioral structure. Frosh need to know when to stay put and when to climb up. They need to know if they should beckon other frosh up, jump down to reduce the weight of the pyramid, or balance the weight across the level they are on. They need to know not to accept beer and pizza if it’s tossed to them, and they need to avoid putting too much weight on the shoulders of any of their compatriots below.

The fourth tier of behaviors manages the various incarnations of behavior 16 – tugging on the tam. The tam loosens as frosh yank on it in an attempt to get the nails out. A strong tug has a greater loosening effect, but also increases the chances that the frosh will slip. Gnawing on the tam can expedite the process.

The following is a complete list of the behaviors frosh are able to exhibit. Unused behaviors are a result of modifications made to the behavior list during development.

1 – Free-falling into pit

2 – Leaping into pit

3 – Underwater

4 – Wading through pit towards a goal (or target)

5: "Fun stuff"

    5A – Eating pizza

    5B – Drinking beer

    5C – Splashing ArtSci

    5D – Pushing/Splashing Commie

6: "Confusion"

    6A – Drunken Singing

    6B – Moshing

    6C – Swimming Through the Pit

    6D – Scratching Head / Communicating / Picking nose

    6E – Flying as cow-eagles (Iron Ring effect)

    6F – Grazing as sheep (Iron Ring effect)

7: "Pyramid Base"

    7A – Linking Arms at base of human pyramid

    7B – Bucking under pressure at base of human pyramid

8 – Unused

9: Climbing up

    9A – Climbing over shoulders to a level above the water

    9B – Unused

10 – Stumbling over necks towards the pole

11: Providing upper-level support

    11A – Arms up to support people above

    11B – Walking across the pyramid to balance weight

    11C – Beckoning other frosh up to this level

    11D – Eating pizza up high

    11E – Drinking beer up high

12 – Unused

13 – Unused

14 – Clinging desperately to the pole

15 – Unused

16: Hanging on to the tam

    16A – Reach and yank on tam

    16B – Lost grip and hanging off tam

    16C – Light tug

    16D – Heavy tug

    16E – Teeth tug

    16F – Holding aloft the tam… victorious!

The Old System

Overview

The existing system provides the Frosh with access to a comprehensive summary of the state of the game world, as well as a set of integer and Boolean internal characteristics. At "critical moments" when the Frosh need to choose which behavior to exhibit next, they rely on information hand-picked from this summary to make a decision.

In addition, sets of "knowledge boosters" allow the Frosh to significantly improve their overall performance as time progresses. These boosters "unlock" behaviors (such as balancing weight across the pyramid) that were formerly unavailable to the Frosh. The speed at which these boosters unlock is a function of a toggle in the game’s menu that allows the user to select between "lame" and "keen" frosh.

Integer-Based Internal Characteristics

Each frosh keeps track of eight integer-based and six Boolean internal characteristics. These range from their ability to sustain weight on their shoulders to their thoughtfulness when approaching a climbing decision.

The following listing of integer-based internal characteristics for the frosh provides some insight into the metrics on which they base behavioral decisions. Typically, each of these characteristics is adjusted multiple times a second for every frosh.

attrBehavior The number of the current behavior being exhibited by the frosh

attrGoal The frosh's goal (ranging from "Senseless Wandering" to "Get into a pyramid spot" to "go splash that ArtSci") [goalMINDLESS_WANDERING, goalPYRAMID_SPOT, goalCLIMBING_UP, goalBOOSTING_UP, goalBOOSTED_UP, goalCLARK, goalPIZZA, goalARTSCI, goalCOMMIE, goalTHINK, goalMOSH]

attrMotivation General motivation level [3..20]

attrStrength General strength and resilience [3..20]

attrFrame Frame of current behavior (if applicable)

attrPersonality Frosh’s personality (goofy, heavyweight, hoister, climber); adjusts their propensity to perform various actions.

attrUpperLevelGoal If the frosh gets to an upper level, what are they likely to do? Cling to the pole, climb higher, or support people above them? This changes their propensity to do any of the three. [upperGoalCling, upperGoalClimb, upperGoalSupport]

attrMindSet The current mindset of the frosh. Originally intended to drive a "thought bubble" above the head of a frosh, it takes into account their hunger, thirst and other drives and arrives at a predominant mindset. [mindsetMotivated; mindsetExcited; mindsetHungry; mindsetThirsty; mindsetDrunk]

attrPyramidLevel Which level of the human pyramid the frosh is at [0=none, 1 = base… 6=hanging on to tam].

attrEthnicity Static throughout game. Allows for different skin tones.

Boolean-Based Internal Characteristics

Similarly, these are the Boolean-based internal characteristics for the frosh. Several of these are functions of integer-based internal characteristics that are stored in Boolean format to expedite calculations.

attrExcited Is the frosh rowdy enough to be running with his or her tongue out?

attrLookingLeft Self-explanatory; used for graphics rendering

attrLookingAtScreen Self-explanatory; used for graphics rendering during climbing actions

attrWeightOnShoulders Is there a hurtful amount of weight on this frosh’s shoulders?

attrThirsty Is this frosh thirsty enough to even consider running for a mug of beer?

attrHungry Is this frosh hungry enough to even consider running for a slice of pizza?
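Taken together, the two lists above suggest a per-frosh record along the following lines. This is a sketch only: the attribute and enumerator names are copied from the lists, but the struct name FroshAttributes, the field types, the default values, the numeric values of the enumerators and the clampTrait helper are all assumptions made for illustration.

```cpp
// Hypothetical layout; only the attr*/goal*/upperGoal*/mindset* spellings
// come from the lists above. Types and defaults are assumptions.
enum Goal {
    goalMINDLESS_WANDERING, goalPYRAMID_SPOT, goalCLIMBING_UP, goalBOOSTING_UP,
    goalBOOSTED_UP, goalCLARK, goalPIZZA, goalARTSCI, goalCOMMIE, goalTHINK,
    goalMOSH
};
enum UpperLevelGoal { upperGoalCling, upperGoalClimb, upperGoalSupport };
enum MindSet {
    mindsetMotivated, mindsetExcited, mindsetHungry, mindsetThirsty, mindsetDrunk
};

struct FroshAttributes {
    // Integer-based internal characteristics
    int attrBehavior = 1;                     // number of the current behavior
    Goal attrGoal = goalMINDLESS_WANDERING;
    int attrMotivation = 3;                   // documented range [3..20]
    int attrStrength = 3;                     // documented range [3..20]
    int attrFrame = 0;
    int attrPersonality = 0;                  // goofy, heavyweight, hoister, climber
    UpperLevelGoal attrUpperLevelGoal = upperGoalSupport;
    MindSet attrMindSet = mindsetMotivated;
    int attrPyramidLevel = 0;                 // 0 = none, 1 = base ... 6 = on the tam
    int attrEthnicity = 0;                    // static throughout the game

    // Boolean-based internal characteristics
    bool attrExcited = false;
    bool attrLookingLeft = false;
    bool attrLookingAtScreen = false;
    bool attrWeightOnShoulders = false;
    bool attrThirsty = false;
    bool attrHungry = false;
};

// Keep motivation or strength inside the documented [3..20] range.
inline int clampTrait(int value) {
    return value < 3 ? 3 : (value > 20 ? 20 : value);
}
```

Storing the Booleans separately, even where they could be derived from the integers, matches the stated goal of expediting calculations that run many times a second for every frosh.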

Global Properties

In addition, there are a number of "global" properties that affect all of the frosh. Although this was not part of the original plan (because all the Frosh were meant to act as autonomous agents), the author elected to include these properties after studying non-artificial frosh climbing the pole. The influence of group psychology at the greasepit is undeniable. This is modeled in IntelliFrosh as a morale metric that adjusts the other internal characteristics of the frosh.

Improvements Over Time

Because the primary drive the frosh experience throughout the game is to satisfy their desire to climb the greasepole, the internal characteristics of the frosh tend to adjust themselves to facilitate more effective pole climbing. They are also affected by the actions of the player. For example:

A frosh that is being beaned with apples suffers a temporary reduction in strength; however, his or her strength will then rise to represent an increased resilience and ability to withstand attack.

A frosh that is fed pizza will no longer be hungry.

Frosh enjoy splashing ArtScis, but the thrill grows old fast. A high intelligence level or a high motivation level can negate the allure of an ArtSci in the pit.

Tossing the 114 Exam makes the frosh scatter, but their intelligence goes up a notch as they learn from it (by osmosis).

Knowledge "Boosters"

The frosh also benefit from eight different "boosters" that represent knowledge of advanced pole-climbing techniques. These techniques are based on the real climbing methods employed by the frosh at Greasepole ’97.

Here is an example of a moment at which a critical decision needs to be made: What should a frosh do when on an upper level of the pyramid with only one frosh above them and a number of individuals beside them? Climb up? Hold fast? Jump off and reduce the overall weight of the pyramid? Beckon others up?

It became clear as development progressed that these sorts of decisions are critical to the evolution of a "teamwork" model for the frosh. The performance boosts available to a frosh include the following:

Heightened ability to resist the temptation to drink beer or eat pizza.

Ability to make decision to jump from a high level of the pyramid to help reduce weight.

Better understanding of when to be keen and climb up, and when to stay put and support people above.

Ability to support additional weight on shoulders (coincides with increased strength values).

Ability to beckon other frosh up (passing them a "message" encouraging them to do so).

Critique of the Old System

From a purely functional point of view, the old system is fantastic. It allows the Frosh to exhibit teamwork and learning. Most importantly, the Frosh in the game act an awful lot like the Frosh at the real Greasepole.

On the other hand, the "booster" system restricts the ways in which Frosh can learn and adapt. Indeed, this author’s greatest lament is that the Frosh never surprise him any more! And why should they? In a manner of speaking, the Frosh represent the fixed result of a Reinforcement Learning network. The author acted as the critic for this network by continually watching the Frosh attempt to climb the Pole, and then tweaking their artificial intelligence to improve their overall technique. As a result, the Frosh aren’t really learning "new" techniques; rather, they have the pre-developed techniques revealed to them.

Zambesi: The New System

In considering what we could do to improve the system, let us first consider what happens at the real Greasepole. Each Frosh in the greasepit isn’t aware of exactly how many people are on each level of the pyramid. All they’re aware of is where they are, what’s going on in their immediate vicinity, and to what degree their external environment is causing them pain. With the exception of some of the "groupthink" characteristics described above – encouragement and banter from peers, crowd excitement, etc. – one might hypothesize that there are only a few external characteristics affecting the behavior of each frosh.

Let us return to our example of perhaps the most critical decision a Frosh can make, and generalize it further. When a Frosh isn’t goofing around and is serious about helping climb the Greasepole, what should they do when they finish executing a particular behavior? The current incarnation of the game’s artificial intelligence essentially limits a Frosh to the following options:

Climb up

Jump down

Stand fast

Walk across the pyramid (to balance the weight across the level)

Move towards the Greasepole

Zambesi is a reinforcement learning network that allows the Frosh to choose which of the above behaviors to exhibit.

Removal of the "Boosters"

The "boosters" functionality was removed entirely. Under Zambesi, Frosh have "knowledge" of all of the possible pole-climbing techniques right off the bat, and the Boosters are fixed as follows:

0 - Resilience to apple attack.

Originally started at 1 and ramped up to 20 as the game progressed; it indicated the power of the apple toss required to take out a Frosh. Locked at 10.

1 - Ability to withstand weight on shoulders.

Originally started at 30 and ramped up to 1500. A value of n represented a probability of 1/(100*n) that this Frosh would buckle under the pressure of someone on their shoulders during any iteration. Iterations occur at 24 Hz. Thus, at the game’s start (n = 30), the expected amount of time a Frosh could sustain weight was 3000 iterations, or just over 2 minutes.

A probability distribution that makes it increasingly likely that the Frosh will buckle under pressure would be more appropriate. But that’s another challenge! Locked at 90.

2, 3, 4, 5, 6, 7 – Aspects of the climbing process.

These were how the Frosh "learned" new tricks, and were removed completely.
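The arithmetic behind booster 1 can be checked with a few lines of C++. A constant per-iteration probability p gives a geometric distribution, whose mean is 1/p iterations; with p = 1/(100*n) that is 100*n iterations. The function names below are hypothetical and the random draw is only one plausible way to apply the probability each tick; the game's own code is not shown in this document.

```cpp
#include <random>

const double kIterationsPerSecond = 24.0;  // iterations occur at 24 Hz

// Per-iteration probability that a Frosh buckles, for booster value n.
inline double buckleProbability(int n) { return 1.0 / (100.0 * n); }

// Mean of a geometric distribution with per-trial probability 1/(100*n).
inline double expectedIterations(int n) { return 100.0 * n; }

// The same expectation expressed in seconds of game time.
inline double expectedSeconds(int n) {
    return expectedIterations(n) / kIterationsPerSecond;
}

// One simulated iteration: returns true if the Frosh buckles on this tick.
// (Assumed mechanism: an independent Bernoulli draw per iteration.)
inline bool bucklesThisIteration(int n, std::mt19937& rng) {
    std::bernoulli_distribution buckle(buckleProbability(n));
    return buckle(rng);
}
```

For n = 30 this gives 3000 iterations, or 125 seconds, matching the "just over 2 minutes" quoted above. For the locked value of 90 it gives 9000 iterations, or 375 seconds.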

Reinforcement Learning Network Inputs

Each Frosh has his or her own Reinforcement Learning network that consists of four "neuron"-like units. Each unit accepts the following stimulus signals as inputs:

x1 [Integer] How many people are immediately around me?

x2 [Integer] How many people am I supporting?

x3 [Integer] How many people are supporting me?

x4 [Integer] Which level of the pyramid am I at?

x5 [Integer] How many people immediately around me appear to be stable?

x6 [Double] How close am I to the Pole?

x7 [Integer] How many people are on my level?

These inputs are the final result of an experiment designed to determine the optimal set. The seventh input may be unnecessary. The fifth input is functionally different enough from the first input to warrant its inclusion. None of these parameters (with the potential exception of the seventh) represent information that wouldn’t be readily available to a real Frosh at the Greasepole.
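One way to picture the input vector is as a named record that is flattened into an array before being handed to the network. The names NUM_INPUTS and inputVectorComp appear in the Zambesi interface later in this document; their definitions here (7 and double) are assumptions, as are the Stimulus struct, its field names and the packInputs helper.

```cpp
const int NUM_INPUTS = 7;        // x1..x7 (assumed value)
typedef double inputVectorComp;  // assumed: x6 is a double, the rest fit in one

// Hypothetical named view of the seven stimulus signals.
struct Stimulus {
    int neighbours;          // x1: people immediately around me
    int supporting;          // x2: people I am supporting
    int supportedBy;         // x3: people supporting me
    int pyramidLevel;        // x4: which level of the pyramid I am at
    int stableNeighbours;    // x5: nearby people who appear stable
    double poleProximity;    // x6: how close I am to the Pole
    int peersOnLevel;        // x7: people on my level
};

// Flatten the named signals into the array form taken by the network.
inline void packInputs(const Stimulus& s, inputVectorComp v[NUM_INPUTS]) {
    v[0] = s.neighbours;
    v[1] = s.supporting;
    v[2] = s.supportedBy;
    v[3] = s.pyramidLevel;
    v[4] = s.stableNeighbours;
    v[5] = s.poleProximity;
    v[6] = s.peersOnLevel;
}
```

Keeping the named struct separate from the flat array makes it easy to drop or add a signal (for instance the questionable seventh input) without touching the code that gathers the others.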

The Critic

We must address how to model a Greasepole "critic." In his article "Reinforcement Learning," Andrew Barto suggests that if an action a_i is chosen on trial t and the critic’s feedback is "success," then p_i(t), the probability of choosing a_i, should be increased and the probabilities of the other actions decreased; whereas if the critic indicates "failure," then p_i(t) should be decreased and the probabilities of the other actions appropriately adjusted.
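A concrete instance of this idea is the linear reward-penalty rule from the learning-automata literature, sketched below. It is offered as one standard way to satisfy the description, not as the rule Zambesi uses, which this document does not spell out. The function name updateActionProbabilities and the parameter rate are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Linear reward-penalty update for a vector of action probabilities.
//   p       : action probabilities, assumed to sum to 1 (at least two actions)
//   chosen  : index of the action taken on this trial
//   success : the critic's feedback for this trial
//   rate    : step size, assumed to lie in (0, 1)
inline void updateActionProbabilities(std::vector<double>& p, std::size_t chosen,
                                      bool success, double rate) {
    const std::size_t k = p.size();
    if (k < 2 || chosen >= k) return;  // nothing sensible to update
    for (std::size_t j = 0; j < k; ++j) {
        if (success) {
            // Raise the chosen action towards 1, shrink the others towards 0.
            p[j] += (j == chosen) ? rate * (1.0 - p[j]) : -rate * p[j];
        } else {
            // Shrink the chosen action and pull each alternative
            // towards 1/(k-1).
            p[j] += (j == chosen) ? -rate * p[j]
                                  : rate * (1.0 / (k - 1) - p[j]);
        }
    }
}
```

Both branches leave the probabilities summing to 1, so the vector can be sampled from directly after every update.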

We face two problems using this technique. First, we need to define what constitutes a "success" in the context of the Greasepole. It would seem prudent that the Frosh evaluate the consequences of their actions immediately after they finish executing the behavior they selected. This will likely also be the moment at which they choose another behavior using their neural network. (If the Frosh have their behavior rudely interrupted due to unforeseen circumstances – like the human pyramid collapsing around them – clearly their action should be considered a failure.)

This leads us into the second problem – that of credit assignment. Barto notes that a scalar evaluation of a complex system’s behavior "does not indicate which of its many action components, both internal and external, were responsible for the evaluation. Thus, it is difficult to determine which of these components deserve the credit (or the blame) for the evaluation." A sensible approach for Greasepole would be to assign credit equally to all the Frosh. Hopefully, when we average in many variations of the behavior, the components that help produce the most effective behavior will be reinforced.

Testing

The input parameters, weighting scheme and output behaviors described above were combined into a reinforcement learning technique implementation called Zambesi. The "Roadster" class acts as a plug-in that upgrades the brain of every Frosh in the Greasepit.

ZambesiTest.cpp includes a number of tests designed to demonstrate the effective, robust nature of the Zambesi implementation. The Zambesi class interface consists of the following two functions:

// Decide what to do based on the input vector

outputBehaviour decide(critEval fCriticStartLevel, inputVectorComp vInputs[NUM_INPUTS]);

// If a decide() has occurred since the last train(), update the weights.

void train(critEval fCriticFinalLevel);

Each Frosh has his or her own Roadster. As demonstrated in ZambesiTest, the Roadster is robust enough to bounce back from accidental extra calls to either "train" or "decide." This helps make it safe for the insanely busy world of Legend of the Greasepole’s code.
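The tolerance to stray calls can be illustrated with a simple "pending decision" flag. The class below is a deliberately simplified stand-in, not the Roadster itself: the typedefs, the value of NUM_INPUTS, the greedy choice of unit, the weight update and the class name RoadsterSketch are all assumptions. Only the two function signatures and the 0.15 step size are taken from this document.

```cpp
const int NUM_INPUTS = 7;         // assumed: seven stimulus signals
const int NUM_UNITS = 4;          // four "neuron"-like units
typedef double critEval;          // assumed type
typedef double inputVectorComp;   // assumed type
typedef int outputBehaviour;      // assumed: index of the winning unit
const double kLearningRate = 0.15;

class RoadsterSketch {
public:
    RoadsterSketch() : pending_(false), choice_(0), startLevel_(0.0) {
        for (int u = 0; u < NUM_UNITS; ++u)
            for (int i = 0; i < NUM_INPUTS; ++i) weights_[u][i] = 0.0;
        for (int i = 0; i < NUM_INPUTS; ++i) inputs_[i] = 0.0;
    }

    // Decide what to do based on the input vector. A repeated decide()
    // simply replaces the pending decision.
    outputBehaviour decide(critEval fCriticStartLevel,
                           inputVectorComp vInputs[NUM_INPUTS]) {
        int best = 0;
        double bestSum = activation(0, vInputs);
        for (int u = 1; u < NUM_UNITS; ++u) {
            const double s = activation(u, vInputs);
            if (s > bestSum) { bestSum = s; best = u; }
        }
        for (int i = 0; i < NUM_INPUTS; ++i) inputs_[i] = vInputs[i];
        startLevel_ = fCriticStartLevel;
        choice_ = best;
        pending_ = true;
        return best;
    }

    // If a decide() has occurred since the last train(), update the weights.
    // Otherwise do nothing, so an accidental extra call is harmless.
    void train(critEval fCriticFinalLevel) {
        if (!pending_) return;
        const double reward = (fCriticFinalLevel > startLevel_) ? 1.0 : -1.0;
        for (int i = 0; i < NUM_INPUTS; ++i)
            weights_[choice_][i] += kLearningRate * reward * inputs_[i];
        pending_ = false;
    }

    double weight(int unit, int input) const { return weights_[unit][input]; }

private:
    // Weighted sum of the inputs for one unit.
    double activation(int unit, const inputVectorComp* x) const {
        double sum = 0.0;
        for (int i = 0; i < NUM_INPUTS; ++i) sum += weights_[unit][i] * x[i];
        return sum;
    }

    double weights_[NUM_UNITS][NUM_INPUTS];
    inputVectorComp inputs_[NUM_INPUTS];
    bool pending_;
    int choice_;
    critEval startLevel_;
};
```

A typical call sequence is decide() when a behavior finishes, then train() with the new critic level once the chosen behavior has played out. A real learner would also need some randomness in decide() so that it explores; the greedy choice here only keeps the sketch deterministic.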

Plugging Zambesi into Legend of the Greasepole’s code proved to be a challenge. A great deal of the code for the Frosh artificial intelligence had to be removed and replaced with calls to the "new brains."

Whether or not the condition (oldCriticValue==newCriticValue) constituted success or failure became very relevant. If it constituted success, the Frosh seemed to feel freer to take their time. If it constituted failure, they would tend to climb up frantically, often failing as they did so. In the implementation provided with this document, critic equality is considered a failure. When it is considered a success, the Frosh are rewarded for just standing around.
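The rule for turning two critic readings into a success or failure signal can be written down compactly. The sketch below follows the choices described in this document (an interrupted behavior is a failure, and equality counts as failure unless stated otherwise), but the enum, the function name evaluateOutcome and its parameters are hypothetical.

```cpp
enum CriticSignal { criticFailure = -1, criticSuccess = 1 };

// Compare the critic value before and after a behavior.
//   interrupted       : the behavior was cut short (e.g. the pyramid collapsed)
//   equalityIsSuccess : how to treat oldCriticValue == newCriticValue;
//                       false mirrors the implementation described above
inline CriticSignal evaluateOutcome(double oldCriticValue, double newCriticValue,
                                    bool interrupted = false,
                                    bool equalityIsSuccess = false) {
    if (interrupted) return criticFailure;
    if (newCriticValue > oldCriticValue) return criticSuccess;
    if (newCriticValue == oldCriticValue && equalityIsSuccess) return criticSuccess;
    return criticFailure;
}
```

Exposing the equality case as a parameter makes it easy to compare the two regimes: with equalityIsSuccess set to true the Frosh are rewarded for standing around, and with it set to false they are pushed to keep climbing.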

Conclusions

The Zambesi-powered Frosh exhibit a clear transition towards more logical and cohesive pole-climbing behavior. The fact that this behavior emerges from such a chaotic start is impressive. There are, however, times when a Frosh that seems so close to doing something exceptional makes an exceptionally poor decision.

The nature of these poor decisions (for example, climbing up without adequate support) suggests the following areas for potential future improvement:

The critic algorithm. It may not be robust enough. Perhaps a weighted average of past heights would be more appropriate. It also only rewards positive changes in height. From the time a Frosh reaches the top of the Greasepole until the time the human pyramid has toppled, the Frosh are all continually receiving "failure" signals. Surely a more clever model exists.

The nature of the inputs. "Proximity to the greasepole" was included as an input after it became clear that the Frosh had little sense of this concept (and the author could find no way to extract the information from the other inputs).

The value η (the learning rate). It is currently set at 0.15. Perhaps a higher learning rate would help expedite the process.

The weight magnitudes. An examination of the weights after a fifteen-minute period of learning revealed that some of the weights had risen above 10^5 and others had fallen below -10^5! Perhaps the magnitude of the weights should be limited.
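Two of the suggestions above lend themselves to a short sketch: limiting the weight magnitudes, and basing the critic on a weighted average of past heights. Neither has been tried in Zambesi as far as this document states. The names clampWeight and HeightAverage, the choice of an exponentially weighted average, and any particular limit or smoothing factor are assumptions for illustration.

```cpp
#include <algorithm>

// Limit a weight to the interval [-limit, +limit]; limit is assumed positive.
inline double clampWeight(double w, double limit) {
    return std::max(-limit, std::min(w, limit));
}

// Exponentially weighted average of past heights. A critic could compare the
// current height against this average instead of the single previous height.
struct HeightAverage {
    double value = 0.0;
    bool primed = false;

    // alpha in (0, 1]: larger values give more influence to recent heights.
    double update(double height, double alpha) {
        if (!primed) {
            value = height;  // the first sample seeds the average
            primed = true;
        } else {
            value = alpha * height + (1.0 - alpha) * value;
        }
        return value;
    }
};
```

Clamping would be applied immediately after each weight update. Comparing against an average rather than the last height would also soften the stream of "failure" signals that the Frosh receive once someone has reached the top.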

Trying to train eighty-five networks in a reasonable amount of time is a daunting task. It is not surprising that this set of Frosh would not make very exciting competitors in a "video game" situation. Nevertheless, the fact that the Frosh do learn and once again surprise the author with new tactics is a thrilling testament to the power of reinforcement learning networks.

References

Barto, Andrew G., "Reinforcement Learning" in The Handbook of Brain Theory and Neural Networks (Michael A. Arbib, Ed.) pp 804-809.

Barto, Andrew G., "Reinforcement Learning in Motor Control" in The Handbook of Brain Theory and Neural Networks (Michael A. Arbib, Ed.) pp 809-813.

Burke, Robert C., "The Legend of the Greasepole Project Summary." (http://engsoc.queensu.ca/polegame)