Value Functions and Learning Methods

At the core of reinforcement learning is the question: “How good is this situation, and what should I do next?” Agents answer this using value functions—mathematical estimates of how beneficial states or actions will be in the long run.

Like a skilled chess player sensing whether a position is favorable, agents learn through experience which situations and actions are likely to lead to high rewards. This sense isn't available upfront; value functions must be built from trial and error, guided by a few key mathematical update rules.

By capturing these learned estimates, value functions provide the foundation for decision-making in reinforcement learning. Please note that this chapter is a little more technical than what we've been doing so far.


What Value Functions Represent

Value functions assign numerical scores to situations (states) and potential actions, representing the agent's best estimate of future rewards. These aren't immediate rewards, but predictions of cumulative future success.

Consider a simple navigation task where an agent learns to reach a goal:

  • State values: How good is it to be in this particular location? A state near the goal might have high value, while states near obstacles have low value.
  • Action values: How good is it to take a particular action from this state? Moving toward the goal has higher value than moving away from it.

The mathematical representation captures these intuitions:

V(s) = \text{expected future rewards from state } s

Q(s,a) = \text{expected future rewards from taking action } a \text{ in state } s

These functions encode the agent's accumulated wisdom about which situations and actions lead to success. The power lies in using these learned values to make decisions: when facing a choice, pick the action with the highest predicted value.
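As a minimal sketch of that decision rule (the state, actions, and values here are invented for illustration), greedy selection simply picks the action whose estimated value is highest:

```python
# Hypothetical Q-values for one state of a navigation task:
# the agent's current estimates of long-run reward for each move.
q_values = {"up": 0.2, "down": -0.5, "left": 0.1, "right": 0.9}

# Greedy decision rule: choose the action with the highest predicted value.
best_action = max(q_values, key=q_values.get)
print(best_action)  # right
```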


Learning Values Through Experience

Value functions start as random guesses but improve through experience. Each time an agent takes an action and observes the results, it can update its estimates to be more accurate.

Value update rule:

\text{New Value} = \text{Old Value} + \text{Learning Rate} \times (\text{Actual Experience} - \text{Old Value})

This creates a gradual convergence toward accurate value estimates as the agent accumulates more experience.
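The update rule above fits in a one-line function. The numbers below are illustrative: an estimate starting at zero is nudged repeatedly toward an observed return of 10.

```python
def update_value(old_value, target, learning_rate):
    """Move the estimate a fraction of the way toward what was actually observed."""
    return old_value + learning_rate * (target - old_value)

# Estimate starts at 0.0; each update closes 10% of the gap to the observation.
v = 0.0
for _ in range(3):
    v = update_value(v, 10.0, 0.1)
print(round(v, 2))  # 2.71
```

Each step closes a fixed fraction of the remaining gap, which is why the estimate converges gradually rather than jumping straight to the observed value.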

🗺️ Navigation Example: An agent learning to navigate a maze initially assigns random values to different locations. After exploring, it discovers that certain paths lead to the goal while others lead to dead ends. Over time, locations closer to successful paths get higher values, while dead ends get lower values. The agent uses these learned values to choose better paths in future navigation attempts.

The learning rate controls how quickly values change—high learning rates adapt quickly but may be unstable, while low learning rates are stable but adapt slowly to changes.


Q-Learning: Learning Action Values

Q-learning represents one of the most influential methods for learning action values. The "Q" stands for "quality"—how good it is to take a particular action in a particular state.

The core insight is that the value of an action equals the immediate reward plus the value of whatever state you end up in:

Q(s,a) = \text{immediate reward} + \text{discounted value of next state}

Discount factor: Future rewards are typically worth less than immediate ones, so we multiply future values by a discount factor between 0 and 1:

Q(s,a) = r + \gamma \times \max_{a'} Q(s',a')

Where \gamma (gamma) is the discount factor, r is the immediate reward, and s' is the next state.

Q-learning update:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

This update rule gradually improves Q-value estimates based on actual experience.

🎮 Gaming Example: Consider an agent learning to play a simple game where moving right toward a goal gives +10 reward, while moving into a wall gives -1 reward. Initially, all Q-values start at zero. After experiencing the game, Q-values for "move right" in states near the goal become positive, while Q-values for "move into wall" become negative. The agent learns to prefer actions with higher Q-values.
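A game like this can be sketched as runnable tabular Q-learning. The corridor layout, rewards, and hyperparameters below are invented for illustration, but the update line is exactly the rule above:

```python
import random

random.seed(0)

# A tiny corridor: states 0..3, goal at state 3. Moving right from state 2
# reaches the goal (+10); moving left from state 0 bumps a wall (-1).
N_STATES, GOAL = 4, 3
ACTIONS = [-1, +1]  # left, right

def step(state, action):
    nxt = state + action
    if nxt < 0:
        return state, -1.0, False   # bumped the wall, stay put
    if nxt == GOAL:
        return nxt, 10.0, True      # reached the goal
    return nxt, 0.0, False

alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit current Q-values, occasionally explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # The Q-learning update from the formula above.
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Moving right near the goal ends up valued far higher than moving left,
# and bumping the wall is valued below moving away from it.
print(Q[(2, +1)], Q[(2, -1)])
print(Q[(0, -1)], Q[(0, +1)])
```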


From Tables to Neural Networks

Q-learning traditionally uses tables to store values for each state-action pair. This works well for simple problems but breaks down when the number of possible states becomes large.

Table limitations:

  • Chess has approximately 10^{43} possible board positions
  • Robotic control involves continuous sensor readings with infinite possible values
  • Image-based games have millions of possible pixel combinations

Neural network solution: Instead of storing individual values in tables, neural networks can approximate value functions for any input state or state-action pair.

Q(s,a) \approx \text{Neural Network}(s,a)

The network learns to map from states and actions to predicted values, enabling generalization to states never seen before.
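Before reaching for a deep network, the same idea can be sketched with the simplest function approximator: a linear model over hand-crafted features. Everything below (the features, targets, and learning rate) is a toy illustration, not a standard API:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(state, action):
    """Toy hand-crafted feature vector for a state-action pair."""
    return np.array([1.0, state, action, state * action])

w = np.zeros(4)  # learnable weights shared across ALL states and actions

def q_value(state, action):
    # Q(s, a) is approximated as a weighted sum of features, so similar
    # state-action pairs automatically get similar values (generalization).
    return features(state, action) @ w

def update(state, action, target, lr=0.01):
    """Semi-gradient step: nudge weights to reduce the prediction error."""
    global w
    error = target - q_value(state, action)
    w += lr * error * features(state, action)

for _ in range(500):
    s = rng.uniform(0, 1)       # continuous state: no table could enumerate these
    update(s, +1, 10.0 * s)     # pretend observed targets for each action
    update(s, -1, -1.0)

# The approximator generalizes to a state value it may never have seen exactly.
print(q_value(0.9, +1), q_value(0.9, -1))
```

A deep Q-network follows the same pattern but learns the features itself instead of relying on hand-crafted ones.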

Deep Q-Networks (DQN): This approach revolutionized reinforcement learning by combining Q-learning with deep neural networks. The network takes game states (like Atari game screens) as input and outputs Q-values for all possible actions.

The training process works by:

  1. Playing the game and collecting experiences
  2. Using these experiences to train the neural network to predict Q-values more accurately
  3. Using the improved Q-values to make better action choices
  4. Repeating the cycle to continuously improve performance

Policy Gradients: Learning Strategies Directly

While Q-learning learns values and derives actions from them, policy gradient methods learn action strategies directly. Instead of asking "how good is this action?" they ask "how likely should this action be?"

Policy representation: A policy can be represented as a probability distribution over actions:

\pi(a|s) = \text{probability of taking action } a \text{ in state } s

Policy gradient principle: Adjust the policy to make good actions more likely and bad actions less likely:

\text{Policy Update} \propto \text{Reward} \times \text{Action Probability Gradient}

This approach directly optimizes the decision-making strategy rather than learning values first.
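This can be sketched as a REINFORCE-style update on the simplest possible problem: a two-action choice with different average rewards (the rewards, learning rate, and softmax parameterization below are illustrative assumptions):

```python
import math
import random

random.seed(0)

# Two actions with different average rewards (invented for illustration).
true_rewards = {0: 1.0, 1: 5.0}

theta = [0.0, 0.0]  # one learnable preference score per action

def policy():
    """Softmax turns preference scores into action probabilities."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

lr = 0.1
for _ in range(500):
    probs = policy()
    a = random.choices([0, 1], weights=probs)[0]       # sample an action
    r = true_rewards[a] + random.gauss(0, 0.1)          # observe noisy reward
    # Policy gradient step: for a softmax policy, the gradient of log pi(a)
    # is (1 - pi(a)) for the chosen action and -pi(b) for the others, so
    # well-rewarded actions become more probable.
    for b in range(2):
        grad = (1.0 if b == a else 0.0) - probs[b]
        theta[b] += lr * r * grad

print(policy())  # the higher-reward action ends up dominating
```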

Advantages of policy methods:

  • Can handle continuous action spaces naturally
  • Can learn stochastic policies that mix different strategies
  • Often more stable for complex, high-dimensional problems

Actor-Critic methods: Combine both approaches by learning a policy (actor) and value function (critic) simultaneously. The critic evaluates how good the actor's choices are, while the actor adjusts its strategy based on the critic's feedback.


Handling Continuous and Complex Spaces

Real-world problems often involve continuous states and actions that can't be easily discretized into tables or simple categories.

Continuous states: Robot joint angles, stock prices, or sensor readings can take any value within a range. Neural networks naturally handle continuous inputs by learning smooth approximations.

Continuous actions: Steering angles, motor speeds, or bid amounts require precise numerical outputs. Policy gradient methods can output continuous action distributions.
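As a toy sketch, a Gaussian policy makes this concrete: the mean and standard deviation below stand in for what a policy network would output for the current state, and each decision samples a real-valued action from that distribution.

```python
import random

random.seed(0)

# A policy network would output these two numbers for the current state;
# here they are fixed, invented values for illustration.
mean_steering, std_steering = 0.2, 0.05   # steering angle in radians

# Each decision samples a concrete continuous action from the distribution.
action = random.gauss(mean_steering, std_steering)

# Over many samples, actions cluster around the policy's mean.
samples = [random.gauss(mean_steering, std_steering) for _ in range(10_000)]
avg = sum(samples) / len(samples)
print(round(avg, 2))  # 0.2
```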

Function approximation benefits:

  • Generalization to unseen states
  • Smooth behavior in continuous spaces
  • Ability to handle high-dimensional inputs like images
  • Efficient storage compared to large tables

🚗 Self-Driving Example: An autonomous vehicle's steering controller takes continuous sensor inputs (camera images, radar data, speed measurements) and outputs continuous steering angles. The value function estimates how good different steering actions are likely to be, considering both immediate safety and progress toward the destination.


Deep Reinforcement Learning in Practice

Deep reinforcement learning combines the decision-making framework of reinforcement learning with the representation learning power of deep neural networks.

Key components:

  • Experience replay: Store past experiences and learn from random samples to improve training stability
  • Target networks: Use separate networks for generating training targets to reduce learning instability
  • Exploration strategies: Balance trying new actions with exploiting current knowledge

Training process:

  1. Collect experience by interacting with the environment
  2. Store experiences in a replay buffer
  3. Sample random batches of experiences for training
  4. Update neural networks to better predict values or policies
  5. Use updated networks to collect new experiences
  6. Repeat until performance converges
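The replay-buffer part of this cycle (steps 2 and 3) is small enough to sketch directly. The capacity, batch size, and dummy experience tuples below are illustrative choices:

```python
import random
from collections import deque

random.seed(0)

class ReplayBuffer:
    """Bounded store of (state, action, reward, next_state, done) tuples,
    sampled at random to break the correlation between consecutive steps."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(1500):                      # collect more steps than capacity
    buf.add((t, 0, 0.0, t + 1, False))     # dummy experience tuples

print(len(buf))            # 1000 -- only the most recent experiences are kept
batch = buf.sample(32)     # a random minibatch for one training update
print(len(batch))          # 32
```

In a full agent, each sampled batch would feed a network update (step 4) before the improved network collects more experience.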

The combination of deep learning and reinforcement learning has enabled breakthroughs in complex domains like game-playing, robotics, and autonomous systems.


Final Takeaways

Value functions transform reinforcement learning from random trial-and-error into systematic improvement guided by learned predictions of future success. By estimating how good different states and actions are likely to be, agents can make increasingly intelligent decisions even in complex, uncertain environments.

The evolution from simple table-based methods to sophisticated neural network approaches has enabled reinforcement learning to tackle problems that seemed impossible just years ago. Understanding how value functions work provides insight into how agents develop the intuition and strategic thinking that enables impressive performance across diverse domains.

These mathematical foundations underpin the practical successes we see in game-playing systems, robotics, and other applications where learning from experience leads to sophisticated, adaptive behavior.