From Classification to Decision Optimization

We're upgrading from a supervised classification system to a decision optimization system.

The model is not just predicting sentiment — it is learning to maximize financial reward.

Layer      | Before (Classification) | After (Decision System)
Output     | Class label             | Action (buy/sell/hold)
Loss       | Cross-entropy           | Reward-based
Evaluation | F1 score                | Cumulative profit
Threshold  | Static                  | Learnable policy

💡 The Key Shift

Old objective: minimize classification error. New objective: maximize expected profit. This changes everything — from loss function to evaluation metrics.
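
Written out, the two objectives look as follows. This is a sketch rather than the course's own notation: x is the input text, y the sentiment label, a the action, R the reward, and p_θ is an assumed name for the classifier's predicted label distribution.

```latex
\begin{aligned}
\text{Classification:}\quad & \min_\theta \; \mathbb{E}_{(x,y)}\big[-\log p_\theta(y \mid x)\big] \\
\text{Decision system:}\quad & \max_\theta \; \mathbb{E}_{x,\; a \sim \pi_\theta(\cdot \mid x)}\big[R(a)\big]
\end{aligned}
```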

Action Space Design

We convert sentiment into trading actions:

Sentiment | Action
Bullish   | BUY
Bearish   | SELL
Neutral   | HOLD
python
action_map = {
    0: "BUY",
    1: "SELL",
    2: "HOLD"
}

Reward Function Design (CRITICAL)

This is the most important design lever. The reward function is business logic encoded mathematically.

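Two Approaches

The first version rewards the raw price change. The second subtracts a transaction cost from every trade and adds a small penalty for holding.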

python
def compute_reward(action, price_t, price_t1):
    # Calculate profit based on action
    if action == "BUY":
        return price_t1 - price_t
    elif action == "SELL":
        return price_t - price_t1
    else:
        return 0
python
def compute_reward(action, price_t, price_t1, transaction_cost=0.01):
    # Risk-aware reward with transaction costs
    if action == "BUY":
        return (price_t1 - price_t) - transaction_cost
    elif action == "SELL":
        return (price_t - price_t1) - transaction_cost
    else:
        return -0.001  # Small penalty for inactivity

💡 Critical Insight

Reward function = business logic encoded mathematically. The quality of your reward function determines whether your RL agent learns useful behavior or exploits loopholes.
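
To see how much the reward definition matters, here is a minimal sketch that scores the same trade under both designs. The names `raw_reward` and `cost_aware_reward` and the prices are hypothetical, chosen only for this illustration. The price move is deliberately smaller than the transaction cost.

```python
def raw_reward(action, price_t, price_t1):
    """Raw profit: price change in the direction of the trade."""
    if action == "BUY":
        return price_t1 - price_t
    if action == "SELL":
        return price_t - price_t1
    return 0.0


def cost_aware_reward(action, price_t, price_t1, transaction_cost=0.01):
    """Raw profit minus a flat cost per trade, with a small HOLD penalty."""
    if action in ("BUY", "SELL"):
        return raw_reward(action, price_t, price_t1) - transaction_cost
    return -0.001


# Hypothetical prices: the move (0.004) is smaller than the cost (0.01).
price_t, price_t1 = 100.000, 100.004

raw = raw_reward("BUY", price_t, price_t1)
net = cost_aware_reward("BUY", price_t, price_t1)
hold = cost_aware_reward("HOLD", price_t, price_t1)

print(f"raw BUY reward: {raw:+.4f}")
print(f"net BUY reward: {net:+.4f}")
print(f"HOLD reward:    {hold:+.4f}")
```

Under the raw reward, BUY looks profitable. Once costs are included, the same trade scores worse than HOLD, so an agent trained on the raw reward would learn to over-trade on small moves.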

Policy Gradient (REINFORCE)

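Instead of predicting labels, the model outputs a probability distribution over actions and is trained with the REINFORCE loss:

L = -log π_θ(a|x) · R

Where:

π_θ(a|x) is the probability that the policy with parameters θ assigns to action a given input x
R is the reward received for taking action a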

Modify Model Output

python
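import torch

def get_action(logits):
    """Sample one action from the policy. Assumes a batch of size 1."""
    # Convert logits to probabilities
    probs = torch.softmax(logits, dim=1)

    # Create probability distribution
    dist = torch.distributions.Categorical(probs)

    # Sample action from distribution
    action = dist.sample()

    # .item() requires a single-element tensor, hence the batch-size-1 assumption
    return action.item(), dist.log_prob(action)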
Output
action: 0 (BUY)
log_prob: -0.693
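
A log-probability of -0.693 corresponds to a probability of about 0.5, since exp(-0.693) ≈ 0.5. The sketch below reproduces that arithmetic in pure Python, without torch. The helper names `softmax` and `sample_action` and the logits are assumptions, chosen so that BUY has probability exactly 0.5.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Sample an index in proportion to probs; return (index, log-probability)."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


# Hypothetical logits chosen so that BUY (index 0) has probability 0.5.
logits = [math.log(2.0), 0.0, 0.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so runs are repeatable
action, log_prob = sample_action(probs, rng)

print("probs:", [round(p, 3) for p in probs])
print("action:", action, "log_prob:", round(log_prob, 3))
```

Whichever action is sampled, the returned log-probability is the log of that action's softmax probability, which is the quantity the policy-gradient loss multiplies by the reward.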

The RL Training Loop

Here's the complete training loop with policy gradient:

python
def train_rl(model, dataloader, optimizer, price_series):
    model.train()
    total_reward = 0

    for i, batch in enumerate(dataloader):
        optimizer.zero_grad()

        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]

        # Forward pass: get action probabilities
        logits = model(input_ids, attention_mask)

        # Sample action from policy distribution
        action, log_prob = get_action(logits)

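        # Simulate market outcome. Assumes an unshuffled dataloader with
        # batch size 1 and len(price_series) >= len(dataloader) + 1.
        price_t = price_series[i]
        price_t1 = price_series[i + 1]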

        action_name = action_map[action]
        reward = compute_reward(action_name, price_t, price_t1)

        # Policy gradient loss: -log(prob) * reward
        loss = -log_prob * reward
        loss.backward()
        optimizer.step()

        total_reward += reward

    return total_reward
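
The loop above needs a model and a dataloader to run. To see the update rule in isolation, here is a minimal pure-Python sketch on an assumed toy problem: a single state, three logits as the whole policy, and a market that always rises. `ACTIONS`, `REWARDS`, and `reinforce_step` are hypothetical names, and the gradient is written by hand instead of using autograd.

```python
import math
import random

ACTIONS = ["BUY", "SELL", "HOLD"]
# Assumed toy market: price always rises by 1, so BUY earns +1 and SELL loses 1.
REWARDS = {"BUY": 1.0, "SELL": -1.0, "HOLD": 0.0}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, rng, lr=0.1):
    """One gradient-descent step on L = -log pi(a) * R for a sampled action."""
    probs = softmax(logits)
    a = rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]
    reward = REWARDS[ACTIONS[a]]
    # d(-log pi(a)) / d logit_j = probs[j] - 1[j == a], so descending L adds
    # lr * reward * (1[j == a] - probs[j]) to each logit.
    return [
        z + lr * reward * ((1.0 if j == a else 0.0) - probs[j])
        for j, z in enumerate(logits)
    ]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(2000):
    logits = reinforce_step(logits, rng)

final_probs = softmax(logits)
print({name: round(p, 3) for name, p in zip(ACTIONS, final_probs)})
```

In this toy setting BUY should end up with most of the probability mass: positive rewards push up the logit of the sampled action, and negative rewards push it down.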

How Data Flows Through RL Training

Text Input → Model → Action → Reward Signal

1. Text Input: News headline or social media post about a stock
2. Model: Outputs a probability distribution over BUY/SELL/HOLD
3. Action: Sample an action from the policy (e.g., BUY)
4. Reward: Calculate profit from the action, then update the policy

Evaluation: Profit Instead of Accuracy

We evaluate by cumulative profit, not F1 score:

python
def evaluate_profit(model, dataloader, price_series):
    model.eval()
    total_profit = 0

    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            logits = model(batch["input_ids"], batch["attention_mask"])
            
            # Choose most likely action
            action = torch.argmax(logits, dim=1).item()

            action_name = action_map[action]

            price_t = price_series[i]
            price_t1 = price_series[i + 1]

            total_profit += compute_reward(action_name, price_t, price_t1)

    print("Total Profit:", total_profit)
Output
Total Profit: 1247.53

⚠️ Important

A model with 70% accuracy can lose money. A model with 40% accuracy can make money. Accuracy ≠ Profit. Always evaluate on what matters.
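
A small worked example makes this concrete. The trades below are invented for illustration: Model A is right 70% of the time but only on small moves, while Model B is right 40% of the time but catches the large moves.

```python
# Hypothetical trades: (was the predicted direction correct?, absolute price move).
model_a = [(True, 1.0)] * 7 + [(False, 5.0)] * 3
model_b = [(True, 5.0)] * 4 + [(False, 1.0)] * 6


def accuracy(trades):
    return sum(correct for correct, _ in trades) / len(trades)


def profit(trades):
    # A correct call earns the move; a wrong call loses it.
    return sum(move if correct else -move for correct, move in trades)


for name, trades in [("A", model_a), ("B", model_b)]:
    print(f"Model {name}: accuracy={accuracy(trades):.0%}, profit={profit(trades):+.1f}")

# Model A: accuracy=70%, profit=-8.0
# Model B: accuracy=40%, profit=+14.0
```

Accuracy counts every prediction equally, while profit weights each prediction by the size of the move it was made on.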

Hybrid Training (Best Practice)

Pure RL is unstable, so we combine the two approaches:

Two-Phase Training

Phase 1: Supervised Pretraining
Train using labeled sentiment data. The model learns basic language understanding and sentiment classification before optimizing for profit.

Phase 2: RL Fine-Tuning
Fine-tune the pretrained model with the policy-gradient loop so that it optimizes for profit rather than label accuracy.
python
# Phase 1: Supervised pretraining
train(model, dataloader, optimizer)

# Phase 2: RL fine-tuning for profit
train_rl(model, dataloader, optimizer, price_series)

💡 Why Hybrid?

Supervised pretraining gives the model a good initialization. RL fine-tuning adapts it to the specific objective (profit). This is more stable than pure RL from scratch.

Advanced Design Levers

Where experts focus their effort:

Penalize volatility, reward consistency:

python
# Add penalty for high-variance returns
reward = profit - volatility_penalty
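
One possible way to define the penalty, as a sketch only: scale the standard deviation of recent returns by a weight. The function name, `recent_returns`, and `penalty_weight` are hypothetical, and the weight of 0.5 is arbitrary.

```python
import statistics


def volatility_adjusted_reward(profit, recent_returns, penalty_weight=0.5):
    """Subtract a penalty proportional to the standard deviation of recent returns."""
    if len(recent_returns) < 2:
        return profit  # not enough history to estimate volatility
    volatility_penalty = penalty_weight * statistics.stdev(recent_returns)
    return profit - volatility_penalty


# Both histories have the same mean return (0.5) but different volatility.
steady = [0.5, 0.5, 0.5, 0.5]
choppy = [3.0, -2.0, 3.0, -2.0]

steady_reward = volatility_adjusted_reward(0.5, steady)
choppy_reward = volatility_adjusted_reward(0.5, choppy)
print(steady_reward, choppy_reward)
```

With the same average profit, the steady history keeps its full reward while the choppy one is marked down.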

Add entropy bonus to encourage exploration:

text
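L = -log π(a|x) · R - β · H(π)

where H(π) = entropy of the policy and β > 0 weights the bonus. The entropy term is subtracted because L is minimized, so a higher-entropy (more exploratory) policy lowers the loss.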
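
The entropy H(π) can be computed directly from the action probabilities. A minimal sketch, assuming a discrete policy over three actions and natural-log units; the two example distributions are invented:

```python
import math


def entropy(probs):
    """Shannon entropy H(pi) = -sum p * log p, in nats (0 * log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [1 / 3, 1 / 3, 1 / 3]  # maximally exploratory over 3 actions
peaked = [0.98, 0.01, 0.01]      # nearly deterministic

print(round(entropy(uniform), 3))  # log(3), about 1.099
print(round(entropy(peaked), 3))
```

A policy that has collapsed onto one action has entropy near zero, so an entropy term in the objective gives it an incentive to keep some probability on the alternatives.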

Penalize large drawdowns:

python
if drawdown > threshold:
    reward -= penalty
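
The snippet above leaves `drawdown` undefined. One common definition is the largest peak-to-trough drop in account value so far. The sketch below assumes that definition; `max_drawdown`, `drawdown_adjusted_reward`, the threshold, and the penalty are all hypothetical.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough drop seen in a sequence of account values."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst


def drawdown_adjusted_reward(reward, equity_curve, threshold=5.0, penalty=1.0):
    """Apply a flat penalty once the drawdown exceeds a threshold."""
    if max_drawdown(equity_curve) > threshold:
        reward -= penalty
    return reward


equity = [100.0, 104.0, 97.0, 99.0, 103.0]    # hypothetical account values
print(max_drawdown(equity))                   # 104 -> 97 is a drop of 7.0
print(drawdown_adjusted_reward(2.0, equity))  # 7.0 > 5.0, so 2.0 - 1.0 = 1.0
```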

Use discounted rewards for long-term credit assignment:

text
R_t = Σ_{k≥0} γ^k · r_{t+k}

where γ = discount factor (e.g., 0.99)
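
Discounted returns for a whole episode can be computed in one backward pass using the recursion R_t = r_t + γ · R_{t+1}. A minimal sketch with invented rewards; γ = 0.5 is used instead of 0.99 only to keep the arithmetic easy to check by hand.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return R_t = sum_k gamma^k * r_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each R_t reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, -1.0, 2.0]  # hypothetical per-step rewards
print(discounted_returns(rewards, gamma=0.5))  # [1.0, 0.0, 0.0, 2.0]
```

In a policy-gradient loss, R_t would then replace the single-step reward for the action taken at step t.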

Critical Real-World Insight

This system will fail in its naive form, as the exchange below illustrates:

Student

I built the RL trading system but it's not making profit. The model keeps making bad trades.

Instructor

Most value comes from reward design + data alignment, not model architecture. How did you design your reward function?

Student

I just used raw profit... price difference between t and t+1.

Instructor

That's the naive form. You need transaction costs, risk penalties, and volatility adjustment. Any cost the reward ignores is a cost the agent will happily pay, and any loophole it leaves open is one the agent will exploit.

💡 The Truth

Most value comes from reward design + data alignment, not model architecture. A simple model with a well-designed reward will typically beat a complex model with a poorly designed one.

Test Your Understanding

1. What is the key difference between classification and decision optimization?


Key Takeaways