RL 101: Building An Environment

By David Okpare • December 15, 2025

In RL, we train an agent to make smart decisions. The agent learns by interacting with the environment (a.k.a. the world), which acts as a strict "judge" of its behavior. The agent's goal is to maximize the cumulative reward it collects over time.

To build an environment, we need to define three things:

  1. State: A description of the environment at the current moment (what the agent sees).
  2. Action: The action taken by the agent (what the agent can do).
  3. Reward: The reward received by the agent for taking the action in the current state (what the agent gets for doing something).
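
These three pieces come together in a simple interaction loop. Below is a minimal sketch, assuming an environment that exposes reset() and step(action) (the interface we build in the rest of this post) and a placeholder choose_action function standing in for the agent's decision-making:

state = env.reset()                         # State: what the agent currently sees
done = False

while not done:
    action = choose_action(state)           # Action: what the agent decides to do
    state, reward, done = env.step(action)  # Reward: feedback for that decision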

Let's start by building a simple environment: a grid navigator. It is a 4x4 grid where the agent can move up, down, left, and right. The agent starts at the top-left corner of the grid, and the goal is to reach the bottom-right corner.

We initialize the GridWorld class by defining the state, which consists of the board size, the agent's starting position (top-left, [0, 0]), and the goal (bottom-right, [3, 3]).

import numpy as np

class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.agent_pos = [0, 0]  # Start at top-left
        self.goal_pos = [size-1, size-1] # Goal at bottom-right

    def reset(self):
        # Send the agent back to the start and return the initial state
        self.agent_pos = [0, 0]
        return self.agent_pos
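
With just this much, we can create the environment and inspect the starting state:

env = GridWorld()
print(env.reset())   # [0, 0] -- the agent's starting position (top-left)
print(env.goal_pos)  # [3, 3] -- the goal position (bottom-right)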

To define the action space, we list the possible actions the agent can take. Up/Down moves change the row index (pos[0]), and Left/Right moves change the column index (pos[1]). We map each action to an integer: 0=Up, 1=Down, 2=Left, 3=Right. We also need boundary checks to ensure the agent never steps off the edge of the world.

def step(self, action):
    # This method lives inside the GridWorld class
    r, c = self.agent_pos

    if action == 0 and r > 0:                # Up
        r -= 1
    elif action == 1 and r < self.size - 1:  # Down
        r += 1
    elif action == 2 and c > 0:              # Left
        c -= 1
    elif action == 3 and c < self.size - 1:  # Right
        c += 1

    self.agent_pos = [r, c]
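
With the boundary checks in place, an illegal move simply leaves the agent where it is. Assuming the step method above has been added to GridWorld, a quick check might look like this:

env = GridWorld()
env.reset()
env.step(0)           # Up from [0, 0] is blocked by the top edge
print(env.agent_pos)  # still [0, 0]
env.step(1)           # Down is a legal move
print(env.agent_pos)  # [1, 0]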

The final step is to define what the agent earns when it moves. This brings up an important concept in reinforcement learning called reward shaping: giving the learning agent small intermediate "fake" rewards that help it converge more quickly. This is where we define the reward for each state and action. For example, if the agent reaches the goal, we can give a high reward, and for every other move we can give a small penalty.

Think about it like this:

  • Option A (Reward = 0): If stepping on an empty square costs nothing, the agent might walk in circles for 1,000 steps before finally hitting the goal. It has no reason to hurry.

  • Option B (Reward = -1): If every step is slightly "painful" (like losing battery power), the agent is motivated to find the shortest path to the goal to stop the pain.

So, to make our agent smart and efficient, we usually give a small negative reward for every step that doesn't reach the goal.
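
To make that trade-off concrete, here is a minimal sketch of the reward logic with a configurable step penalty. The compute_reward helper and step_penalty parameter are illustrative names, not part of the class we're building; setting step_penalty=0 gives Option A and step_penalty=-1 gives Option B.

def compute_reward(agent_pos, goal_pos, step_penalty=-1):
    # Reaching the goal ends the episode with a big positive reward
    if agent_pos == goal_pos:
        return 10, True
    # Every other step is mildly "painful", nudging the agent toward short paths
    return step_penalty, False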

Putting it all together, we have a class combining the setup, the movement, and the reward logic.

import numpy as np

class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size-1, size-1]

    def reset(self):
        self.agent_pos = [0, 0]
        return self.agent_pos

    def step(self, action):
        # 1. Update Position
        r, c = self.agent_pos
        if action == 0 and r > 0: r -= 1              # Up
        elif action == 1 and r < self.size - 1: r += 1 # Down
        elif action == 2 and c > 0: c -= 1            # Left
        elif action == 3 and c < self.size - 1: c += 1 # Right
        self.agent_pos = [r, c]

        # 2. Calculate Reward (Your logic!)
        reward = 0
        done = False
        if self.agent_pos == self.goal_pos:
            reward = 10
            done = True
        else:
            reward = -1
            
        return self.agent_pos, reward, done

We have just defined a reinforcement learning environment for a simple grid world. Agents can now be trained on this environment to learn the optimal policy. As a sanity check, let's step through an episode by hand and make sure the movement and reward logic behave as expected.

env = GridWorld()
obs = env.reset()
done = False

print(f"Start! Position: {obs}")

while not done:
    action = int(input("Move (0=Up, 1=Down, 2=Left, 3=Right): "))
    obs, reward, done = env.step(action)
    print(f"Pos: {obs}, Reward: {reward}, Done: {done}")

print("End!")

Voila! We now have a working environment that hands out states, rewards, and a done signal. It's a simple example, but it's a great start to understanding reinforcement learning.

The next article will explore Q-learning to train an agent to reach the goal in this environment.