RL Trading Signal Optimizer

🎯 What You'll Learn

Design an action space for trading decisions (buy/sell/hold)
Build reward functions that encode business logic mathematically
Implement policy gradient training (REINFORCE algorithm)
Evaluate models by cumulative profit instead of accuracy
Apply hybrid training: supervised pretraining + RL fine-tuning

📋 Before You Begin

Understanding of basic machine learning concepts
Familiarity with PyTorch or similar frameworks
Basic knowledge of sentiment analysis
No prior RL experience required

From Classification to Decision Optimization

⚡ Intermediate ⏱ ~3 min

We're upgrading from a supervised classification system to a decision optimization system.

The model is not just predicting sentiment — it is learning to maximize financial reward.

Layer	Before (Classification)	After (Decision System)
Output	Class label	Action (buy/sell/hold)
Loss	Cross-entropy	Reward-based
Evaluation	F1 score	Cumulative profit
Threshold	Static	Learnable policy

💡 The Key Shift

Old objective: minimize classification error. New objective: maximize expected profit. This changes everything — from loss function to evaluation metrics.

Action Space Design

⚡ Beginner ⏱ ~2 min

We convert sentiment into trading actions:

Sentiment	Action
Bullish	BUY
Bearish	SELL
Neutral	HOLD

python

action_map = {
    0: "BUY",
    1: "SELL",
    2: "HOLD"
}

Reward Function Design (CRITICAL)

⚡ Advanced ⏱ ~5 min

This is the most important design lever. The reward function is business logic encoded mathematically.

Two Approaches

Approach 1: raw profit (naive baseline)

python

def compute_reward(action, price_t, price_t1):
    # Profit from holding the position for one step, in price units
    if action == "BUY":
        return price_t1 - price_t
    elif action == "SELL":
        return price_t - price_t1
    else:
        return 0

Approach 2: profit net of transaction costs, with a small penalty for inactivity

python

def compute_reward(action, price_t, price_t1, transaction_cost=0.01):
    # Cost-aware reward: every trade pays a flat transaction cost
    if action == "BUY":
        return (price_t1 - price_t) - transaction_cost
    elif action == "SELL":
        return (price_t - price_t1) - transaction_cost
    else:
        return -0.001  # Small penalty for inactivity
💡 Critical Insight

Reward function = business logic encoded mathematically. The quality of your reward function determines whether your RL agent learns useful behavior or exploits loopholes.
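To see how much the choice of reward matters, the sketch below scores the same tiny price move with both reward definitions. The names `raw_reward` and `cost_aware_reward` are illustrative renamings (the tutorial defines both as `compute_reward`), and the prices are made-up numbers, not market data.

```python
def raw_reward(action, price_t, price_t1):
    """Approach 1: raw price change."""
    if action == "BUY":
        return price_t1 - price_t
    if action == "SELL":
        return price_t - price_t1
    return 0.0


def cost_aware_reward(action, price_t, price_t1, transaction_cost=0.01):
    """Approach 2: price change minus a flat cost, small penalty for HOLD."""
    if action == "BUY":
        return (price_t1 - price_t) - transaction_cost
    if action == "SELL":
        return (price_t - price_t1) - transaction_cost
    return -0.001


# Hypothetical prices: a move smaller than the transaction cost.
price_t, price_t1 = 100.0, 100.005

for action in ("BUY", "SELL", "HOLD"):
    raw = raw_reward(action, price_t, price_t1)
    net = cost_aware_reward(action, price_t, price_t1)
    print(f"{action:4s} raw={raw:+.4f} cost_aware={net:+.4f}")
```

Under the raw reward, BUY looks like the best action on this step. Once the flat cost is charged, BUY turns negative and HOLD becomes the least bad choice, so the two rewards would push the policy in different directions on the same data.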

Policy Gradient (REINFORCE)

⚡ Advanced ⏱ ~4 min

Instead of predicting labels, the model outputs probabilities over actions:

L = -log π_θ(a|x) · R

Where:

π_θ(a|x) = probability of action a given input x
R = reward received
θ = model parameters
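A minimal pure-Python sketch of one update under this loss, assuming a policy that is just three logits passed through a softmax (no text input, no PyTorch). For a softmax policy, the gradient of log π_θ(a|x) with respect to the logits is onehot(a) - probs, so a gradient-descent step on L moves each logit by lr · R · (onehot(a) - probs). The helper names and the learning rate are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One gradient-descent step on L = -log pi(action) * reward."""
    probs = softmax(logits)
    new_logits = []
    for j, z in enumerate(logits):
        onehot = 1.0 if j == action else 0.0
        grad_log_prob = onehot - probs[j]  # d log pi(action) / d logit_j
        new_logits.append(z + lr * reward * grad_log_prob)
    return new_logits


logits = [0.0, 0.0, 0.0]  # uniform policy over BUY, SELL, HOLD
before = softmax(logits)[0]

after_gain = softmax(reinforce_step(logits, action=0, reward=1.0))[0]
after_loss = softmax(reinforce_step(logits, action=0, reward=-1.0))[0]

print(f"P(BUY) before:          {before:.3f}")
print(f"P(BUY) after reward +1: {after_gain:.3f}")
print(f"P(BUY) after reward -1: {after_loss:.3f}")
```

A positive reward raises the probability of the sampled action above 1/3, and a negative reward lowers it. That is all the loss does: it scales the push toward an action by the reward that action earned.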

Modify Model Output

python

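import torch

def get_action(logits):
    # logits: shape (1, num_actions); .item() below needs a single example
    # Convert logits to probabilities
    probs = torch.softmax(logits, dim=1)

    # Create probability distribution
    dist = torch.distributions.Categorical(probs)

    # Sample action from distribution
    action = dist.sample()

    # Return the action index and its log-probability (used in the loss)
    return action.item(), dist.log_prob(action)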

Example output (the action is sampled, so values vary from run to run)

action: 0 (BUY)
log_prob: -0.693

The RL Training Loop

⚡ Advanced ⏱ ~5 min

Here's the complete training loop with policy gradient:

python

def train_rl(model, dataloader, optimizer, price_series):
    model.train()
    total_reward = 0

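    # Assumes batch_size=1 and an unshuffled dataloader,
    # so batch i lines up with price_series[i]
    for i, batch in enumerate(dataloader):
        # Step i is scored with price_series[i + 1]; stop when prices run out
        if i + 1 >= len(price_series):
            break

        optimizer.zero_grad()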

        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]

        # Forward pass: get action probabilities
        logits = model(input_ids, attention_mask)

        # Sample action from policy distribution
        action, log_prob = get_action(logits)

        # Simulate market outcome
        price_t = price_series[i]
        price_t1 = price_series[i + 1]

        action_name = action_map[action]
        reward = compute_reward(action_name, price_t, price_t1)

        # Policy gradient loss: -log(prob) * reward
        loss = -log_prob * reward
        loss.backward()
        optimizer.step()

        total_reward += reward

    return total_reward
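To watch the loop's logic run end to end without a language model, the toy sketch below replaces the model with three trainable logits (so the policy ignores the input text) and uses a synthetic, steadily rising price series with the raw-profit reward. Everything here (the price series, learning rate, seed, and helper names) is an assumption made for illustration; it is not the tutorial's model and says nothing about real markets.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def raw_reward(action, price_t, price_t1):
    # Same rule as the raw-profit reward: 0=BUY, 1=SELL, 2=HOLD
    if action == 0:
        return price_t1 - price_t
    if action == 1:
        return price_t - price_t1
    return 0.0


def train_toy_policy(price_series, lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    total_reward = 0.0
    for i in range(len(price_series) - 1):  # the last price has no successor
        probs = softmax(logits)
        # Sample from the policy instead of taking the argmax
        action = rng.choices([0, 1, 2], weights=probs)[0]
        reward = raw_reward(action, price_series[i], price_series[i + 1])
        # Gradient-descent step on -log pi(action) * reward
        for j in range(3):
            onehot = 1.0 if j == action else 0.0
            logits[j] += lr * reward * (onehot - probs[j])
        total_reward += reward
    return softmax(logits), total_reward


prices = [100.0 + t for t in range(300)]  # synthetic: rises by 1 each step
probs, total = train_toy_policy(prices)
print("P(BUY), P(SELL), P(HOLD):", [round(p, 3) for p in probs])
print("Total reward during training:", total)
```

Because BUY is always rewarded and SELL is always penalized on this series, probability mass shifts toward BUY. Real price series give no such clean signal, which is where reward design and variance become the hard part.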

How Data Flows Through RL Training

Text Input → Model → Action → Reward Signal

1. Text Input: News headline or social media post about a stock
2. Model: Outputs a probability distribution over BUY/SELL/HOLD
3. Action: Sample an action from the policy (e.g., BUY)
4. Reward: Calculate the profit from the action, then update the policy

Evaluation: Profit Instead of Accuracy

⚡ Intermediate ⏱ ~3 min

We evaluate by cumulative profit, not F1 score:

python

def evaluate_profit(model, dataloader, price_series):
    model.eval()
    total_profit = 0

    with torch.no_grad():
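        for i, batch in enumerate(dataloader):
            # Step i is scored with price_series[i + 1]; stop when prices run out
            if i + 1 >= len(price_series):
                break
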
            logits = model(batch["input_ids"], batch["attention_mask"])
            
            # Choose most likely action
            action = torch.argmax(logits, dim=1).item()

            action_name = action_map[action]

            price_t = price_series[i]
            price_t1 = price_series[i + 1]

            total_profit += compute_reward(action_name, price_t, price_t1)

    print("Total Profit:", total_profit)
    return total_profit

Output

Total Profit: 1247.53

⚠️ Important

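A model with 70% accuracy can lose money, and a model with 40% accuracy can make money, because profit depends on the size of each win and loss, not just on how often the model is right. Accuracy ≠ Profit. Always evaluate on the metric you actually care about.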
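The arithmetic behind this claim can be sketched with two hypothetical traders. The win and loss sizes below are invented to make the point and are not results from any model.

```python
def total_profit(n_trades, accuracy, avg_win, avg_loss):
    """Net profit when a fraction `accuracy` of trades win avg_win
    and the rest lose avg_loss (both given as positive amounts)."""
    wins = round(n_trades * accuracy)
    losses = n_trades - wins
    return wins * avg_win - losses * avg_loss


# Hypothetical numbers chosen to make the point, not real results.
frequent_small_wins = total_profit(100, 0.70, avg_win=1.0, avg_loss=5.0)
rare_large_wins = total_profit(100, 0.40, avg_win=5.0, avg_loss=1.0)

print("70% accurate:", frequent_small_wins)  # 70 * 1 - 30 * 5 = -80.0
print("40% accurate:", rare_large_wins)      # 40 * 5 - 60 * 1 = 140.0
```

The more accurate trader loses money because each loss is five times the size of each win, while the less accurate one profits for the opposite reason.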

Hybrid Training (Best Practice)

⚡ Advanced ⏱ ~4 min

Pure RL training is unstable, so we combine the two approaches:

Two-Phase Training

Phase 1: Supervised Pretraining

Train using labeled sentiment data. The model learns basic language understanding and sentiment classification before optimizing for profit.

python

# Phase 1: Supervised pretraining
# (train is assumed to be the supervised cross-entropy loop on labeled sentiment data)
train(model, dataloader, optimizer)

# Phase 2: RL fine-tuning for profit
train_rl(model, dataloader, optimizer, price_series)

💡 Why Hybrid?

Supervised pretraining gives the model a good initialization. RL fine-tuning adapts it to the specific objective (profit). This is more stable than pure RL from scratch.

Advanced Design Levers

⚡ Advanced ⏱ ~4 min

Where experts focus their effort:

Penalize volatility, reward consistency:

python

# Add penalty for high-variance returns
reward = profit - volatility_penalty
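One possible way to make `volatility_penalty` concrete, assuming volatility is measured as the population standard deviation of a short window of recent per-step returns. The function name, the window contents, and `penalty_weight` are illustrative assumptions, not part of the tutorial.

```python
from statistics import pstdev


def volatility_adjusted_reward(profit, recent_returns, penalty_weight=0.1):
    """Profit minus a penalty proportional to the spread of recent returns."""
    if len(recent_returns) < 2:
        return profit  # not enough history to estimate volatility
    volatility_penalty = penalty_weight * pstdev(recent_returns)
    return profit - volatility_penalty


# Same profit on this step, different recent paths (hypothetical returns).
steady = volatility_adjusted_reward(0.5, [0.5, 0.5, 0.5, 0.5])
choppy = volatility_adjusted_reward(0.5, [2.0, -1.0, 2.0, -1.0])
print("steady:", steady)
print("choppy:", choppy)
```

Both windows average the same return, but the choppy one has a standard deviation of 1.5, so it is rewarded less than the steady one.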

Add entropy bonus to encourage exploration:

text

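L = -log π(a|x) · R - β · H(π)

where H(π) = entropy of the policy and β > 0 is the bonus weight. The entropy term is subtracted because L is minimized, so higher entropy lowers the loss.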
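A small sketch of the entropy term on plain probability lists, assuming natural-log (nats) entropy. Because the loss is minimized, the bonus enters with a minus sign here, so a more spread-out policy gets a lower loss. The function names and the value of `beta` are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi) in nats; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(probs, action, reward, beta=0.01):
    """REINFORCE loss minus beta * entropy (assumes probs[action] > 0)."""
    policy_loss = -math.log(probs[action]) * reward
    return policy_loss - beta * entropy(probs)


uniform = [1 / 3, 1 / 3, 1 / 3]
peaked = [0.98, 0.01, 0.01]
print("H(uniform):", entropy(uniform))  # the maximum for 3 actions, ln(3)
print("H(peaked): ", entropy(peaked))
```

A policy that has collapsed onto one action has entropy near zero and earns almost no bonus, which is the pressure that keeps the agent sampling SELL and HOLD occasionally instead of committing to BUY too early.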

Penalize large drawdowns:

python

if drawdown > threshold:
    reward -= penalty
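The snippet above leaves `drawdown`, `threshold`, and `penalty` undefined. One way to fill them in, assuming drawdown is the fractional drop of the latest account value from its running peak and that account values stay positive, is sketched below. The function names and default values are hypothetical.

```python
def current_drawdown(equity_curve):
    """Fractional drop of the latest value from the running peak."""
    peak = max(equity_curve)
    return (peak - equity_curve[-1]) / peak


def drawdown_penalized_reward(reward, equity_curve, threshold=0.05, penalty=1.0):
    """Subtract a fixed penalty once drawdown exceeds the threshold."""
    if current_drawdown(equity_curve) > threshold:
        reward -= penalty
    return reward


equity = [100.0, 110.0, 99.0]  # hypothetical account values
print(current_drawdown(equity))                # 11 / 110, i.e. 10% below the peak
print(drawdown_penalized_reward(0.5, equity))  # 0.5 - 1.0
```

A fixed penalty is the simplest choice. Scaling the penalty with the size of the drawdown is a common alternative when a hard cliff makes training too jumpy.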

Use discounted rewards for long-term credit assignment:

text

R_t = Σ_{k≥0} γ^k · r_{t+k}

where γ = discount factor (e.g., 0.99) and r_{t+k} = reward received k steps after t
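For a finite episode, the discounted return at every step can be computed in one backward pass using R_t = r_t + γ · R_{t+1}. The sketch below assumes the per-step rewards are already collected in a list; the function name is illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [R_0, R_1, ...] with R_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each R_t reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, -1.0], gamma=0.5))  # [0.75, -0.5, -1.0]
```

In the policy gradient loss, R_t would then replace the single-step reward R, so an action is credited for what happens after it and not only for the next price move.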

Critical Real-World Insight

⚡ Advanced ⏱ ~3 min

This system will fail in naive form because:

Markets are noisy
Rewards are delayed
Distribution is non-stationary (changes over time)

Student

I built the RL trading system but it's not making profit. The model keeps making bad trades.

Instructor

Most value comes from reward design + data alignment, not model architecture. How did you design your reward function?

Student

I just used raw profit... price difference between t and t+1.

Instructor

That's the naive form. You need transaction costs, risk penalties, volatility adjustment. The agent will exploit any loophole the reward leaves open, and the market will punish the simplification.

💡 The Truth

Most value comes from reward design + data alignment, not model architecture. A simple model with a well-designed reward usually beats a complex model with a poorly designed one.

Test Your Understanding

⚡ Intermediate ⏱ ~5 min

1. What is the key difference between classification and decision optimization?

2. Why is the reward function the "most important design lever" in RL?

3. What does the policy gradient loss function do?

4. Why use hybrid training (supervised pretraining + RL fine-tuning)?

5. Why might a model with 70% accuracy lose money while a 40% accuracy model makes money?

Key Takeaways

System upgrade: From predicting sentiment to maximizing profit — changes loss, output, and evaluation
Reward function is critical: Business logic encoded mathematically — the agent optimizes exactly what you specify
Policy gradient: Increase probability of high-reward actions using L = -log π(a|x) · R
Evaluate by profit: Accuracy ≠ Profit. A 40% accurate model can be profitable with good risk management
Hybrid training: Supervised pretraining + RL fine-tuning is more stable than pure RL
Advanced levers: Reward shaping, entropy bonus, risk constraints, discounted rewards
Real-world insight: Most value comes from reward design + data alignment, not model architecture