At its core, the set of principles that underlies StrongFit is a system of learning. As the system continues to evolve, so do the structures that shape it, and the concept of reinforcement learning offers a powerful allegory for how humans grow and adapt. Though easily buried beneath abstract algorithms, the basic idea of Q-learning is this: success is found in moving beyond our mistakes, and our intention within the present moment matters. This is part three of our breakdown of the Q-learning algorithm and what this advancement in machine learning can teach us about human nature.
The measure of rewards is just a scoring system. This is demonstrated in most education systems, where 9/10 = A, 8/10 = B, and so on. Competition likewise drives us to associate first with gold, second with silver, and third with bronze. Rewards create in us a measure of success calibrated to comparison and standards, driving us to take actions that gain points and avoid moves that cost us points.
In supervised learning, an agent is taught to memorize information. A successful student, under this model, is the one who can repeat the information best and adhere to the teaching style. Points are rarely awarded for creative or abstract thinking; in fact, that kind of thinking usually costs points, because mistakes are inherently associated with failure.
Model-free reinforcement learning, however, suggests that the best way for humans to learn is to measure our actions, no matter the state or circumstance, against our long-term goals. The best action for the agent to take is not always the one that accumulates the greatest momentary reward, but the one that offers the greatest long-term value.
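This trade-off can be made concrete with a minimal sketch of the Q-learning update rule (a toy setup I am assuming for illustration; the states, rewards, and learning rate here are not from the article). The value of an action blends its immediate reward with the estimated value of the state it leads to:

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    best_next = max(q[next_state].values())   # value of the best follow-up action
    target = reward + gamma * best_next       # immediate reward + discounted future value
    q[state][action] += alpha * (target - q[state][action])

# Two actions from state "s": "grind" pays little now but leads somewhere
# valuable; "quick_win" pays more now but leads to a dead end.
q = {
    "s":        {"grind": 0.0, "quick_win": 0.0},
    "good":     {"stay": 10.0},
    "dead_end": {"stay": 0.0},
}
for _ in range(50):
    q_update(q, "s", "grind", 1.0, "good")
    q_update(q, "s", "quick_win", 2.0, "dead_end")

# Despite its smaller immediate reward, "grind" ends up more valuable.
print(q["s"]["grind"] > q["s"]["quick_win"])
```

The agent learns to prefer the action with the smaller instant payoff because of where that action leads, which is exactly the point about long-term value.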
Accumulating points is part of the game, and you have to know the rules. Being the biggest, fastest, or strongest on the competition floor puts you on the podium. Hard work gets you the promotion and raises your social status and resume. Good grades and good quarterly merits mean more scholarships and bonus checks. But we often seek these rewards because we like recognition; we like recognition because it provides validation; and we need validation when we're not sure who we are or why we're working so hard for these rewards in the first place. Left unchecked, this cycle can be damaging mentally, physically, and emotionally. Chasing a 10 lb snatch PR when your shoulder is hurting carries no value once you're in an arm sling, and certificates of achievement won't guarantee your fulfillment. Measuring true value requires deeper vision.
In the Q-learning algorithm, the impact of future rewards is governed by a discount factor: a factor of 0 creates a short-sighted agent that considers only the current reward, while a factor of 1 makes it strive for the highest long-term reward. The discount factor measures how far ahead in time an agent still considers a reward to be of value. For example, we may be willing to pass up $100 today for $150 tomorrow, but might feel differently if the payout were a year from now. And if the dollar amount is the same, we would probably prefer $100 today rather than $100 tomorrow. The discount factor is a trade-off measure between the instantaneous reward of one state and the long-term payout of another.
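The dollar example above can be sketched in a few lines. This is a hedged illustration of how a discount factor gamma weighs future rewards: a reward arriving t steps from now is worth gamma**t times its face value today (treating "tomorrow" as one step and "a year" as 365 steps, which is my own framing, not the article's):

```python
def present_value(reward, steps_away, gamma):
    """Today's worth of a reward arriving steps_away steps in the future."""
    return (gamma ** steps_away) * reward

gamma = 0.9
# $150 one step away beats $100 now: 0.9 * 150 = 135 > 100.
print(present_value(150, 1, gamma) > 100)
# The same $150 a year (365 steps) away is worth almost nothing today.
print(present_value(150, 365, gamma) < 1)
# With gamma = 0 the agent is fully short-sighted: any future reward counts as zero.
print(present_value(150, 1, gamma=0) == 0)
```

Note how the same reward flips from attractive to worthless purely as a function of how far away it sits, which is the trade-off the discount factor encodes.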
Discovery requires newness, and newness mandates that we try new things. The necessity of randomness and chance in our pursuit of growth helps us find better ways of doing things, rather than chasing the same things that those before us have done or that we already know have worked. This, of course, inevitably means we will make mistakes, but the understanding must be that mistakes are not the same as failure.
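In reinforcement learning this balance between trying new things and repeating what works is often handled with an epsilon-greedy rule. The following is a minimal sketch (the action names and the 20% exploration rate are assumptions for illustration, not from the article): with probability epsilon the agent picks an action at random, otherwise it repeats the best action it already knows.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore: risk a "mistake"
    return max(q_values, key=q_values.get)     # exploit: the known best

q_values = {"known_routine": 5.0, "new_drill": 0.0}
random.seed(0)
chosen = [epsilon_greedy(q_values, epsilon=0.2) for _ in range(1000)]

# Mostly the known routine, but a meaningful share of picks still lands
# on the new drill, keeping discovery alive.
print(chosen.count("new_drill") > 0)
```

The occasional random pick is exactly the built-in "mistake" the paragraph describes: without it, the agent would lock onto whatever worked first and never discover anything better.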
If we stay within what rewards us now, we will never grow beyond the sphere of our comfort zone; nothing will ever change, and we will learn nothing. When we expand the vision of what we actually value, however, success follows its own course as we learn and grow along the way.