🎯 What You'll Learn
- Design an action space for trading decisions (buy/sell/hold)
- Build reward functions that encode business logic mathematically
- Implement policy gradient training (REINFORCE algorithm)
- Evaluate models by cumulative profit instead of accuracy
- Apply hybrid training: supervised pretraining + RL fine-tuning
📋 Before You Begin
- Understanding of basic machine learning concepts
- Familiarity with PyTorch or similar frameworks
- Basic knowledge of sentiment analysis
- No prior RL experience required
From Classification to Decision Optimization
We're upgrading from a supervised classification system to a decision optimization system.
The model is not just predicting sentiment — it is learning to maximize financial reward.
| Layer | Before (Classification) | After (Decision System) |
|---|---|---|
| Output | Class label | Action (buy/sell/hold) |
| Loss | Cross-entropy | Reward-based |
| Evaluation | F1 score | Cumulative profit |
| Threshold | Static | Learnable policy |
💡 The Key Shift
Old objective: minimize classification error. New objective: maximize expected profit. This changes everything — from loss function to evaluation metrics.
Action Space Design
We convert sentiment into trading actions:
| Sentiment | Action |
|---|---|
| Bullish | BUY |
| Bearish | SELL |
| Neutral | HOLD |
action_map = {
    0: "BUY",
    1: "SELL",
    2: "HOLD",
}
Reward Function Design (CRITICAL)
This is the most important design lever. The reward function is business logic encoded mathematically.
Two Approaches
Simple (raw profit):
def compute_reward(action, price_t, price_t1):
    # Profit from the price move in the direction of the trade
    if action == "BUY":
        return price_t1 - price_t
    elif action == "SELL":
        return price_t - price_t1
    else:
        return 0
Risk-aware (with transaction costs):
def compute_reward(action, price_t, price_t1, transaction_cost=0.01):
    # Same profit, minus a flat cost for every trade
    if action == "BUY":
        return (price_t1 - price_t) - transaction_cost
    elif action == "SELL":
        return (price_t - price_t1) - transaction_cost
    else:
        return -0.001  # Small penalty for inactivity
💡 Critical Insight
Reward function = business logic encoded mathematically. The quality of your reward function determines whether your RL agent learns useful behavior or exploits loopholes.
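To make this concrete, here is a minimal sketch that runs both reward definitions on one small price move. The names `reward_simple` and `reward_with_costs` are renamed here only so both versions can live in one snippet, and the prices are made up for illustration.

```python
def reward_simple(action, price_t, price_t1):
    """Raw profit: price change in the direction of the trade."""
    if action == "BUY":
        return price_t1 - price_t
    if action == "SELL":
        return price_t - price_t1
    return 0.0


def reward_with_costs(action, price_t, price_t1, transaction_cost=0.01):
    """Same profit, minus a flat cost per trade and a small HOLD penalty."""
    if action == "BUY":
        return (price_t1 - price_t) - transaction_cost
    if action == "SELL":
        return (price_t - price_t1) - transaction_cost
    return -0.001


# Hypothetical prices: an upward move that is smaller than the trading cost.
price_t, price_t1 = 100.00, 100.005

for action in ("BUY", "SELL", "HOLD"):
    simple = reward_simple(action, price_t, price_t1)
    costed = reward_with_costs(action, price_t, price_t1)
    print(f"{action:4s} simple={simple:+.4f} with_costs={costed:+.4f}")
```

Under the simple reward, BUY scores highest on this move. Once the cost is included, HOLD scores highest, so the two definitions would push the agent toward different behavior on the same data.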
Policy Gradient (REINFORCE)
Instead of predicting a label, the model outputs a probability distribution over actions, and training minimizes the policy gradient loss:
L = -log π_θ(a|x) · R
Where:
- π_θ(a|x) = probability of action a given input x
- R = reward received
- θ = model parameters
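The effect of this loss can be checked by hand on a tiny policy. The sketch below assumes the parameters θ are just three logits (one per action) and applies one gradient-descent step on L = -log π(a|x) · R; the helper names and the learning rate are illustrative, not part of the lesson's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One gradient-descent step on L = -log pi(a|x) * R w.r.t. the logits.

    For a softmax policy, d(log pi(a)) / d(logit_j) = 1[j == a] - pi(j),
    so dL/d(logit_j) = -R * (1[j == a] - pi(j)).
    """
    probs = softmax(logits)
    grads = [-reward * ((1.0 if j == action else 0.0) - p)
             for j, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0, 0.0]          # uniform policy over BUY / SELL / HOLD
before = softmax(logits)

after_win = softmax(reinforce_step(logits, action=0, reward=+1.0))
after_loss = softmax(reinforce_step(logits, action=0, reward=-1.0))

print("P(BUY) before:      ", round(before[0], 4))
print("P(BUY) after R = +1:", round(after_win[0], 4))
print("P(BUY) after R = -1:", round(after_loss[0], 4))
```

A positive reward raises the probability of the action that was taken, and a negative reward lowers it. That is the whole mechanism behind REINFORCE.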
Modify Model Output
import torch

def get_action(logits):
    # Convert logits of shape (1, num_actions) to probabilities
    probs = torch.softmax(logits, dim=1)
    # Build a categorical distribution over the actions
    dist = torch.distributions.Categorical(probs)
    # Sample an action from the distribution
    action = dist.sample()
    # .item() assumes a batch size of 1
    return action.item(), dist.log_prob(action)
action: 0 (BUY)
log_prob: -0.693
The RL Training Loop
Here's the complete training loop with policy gradient:
def train_rl(model, dataloader, optimizer, price_series):
    # Assumes batch size 1 and len(price_series) >= len(dataloader) + 1
    model.train()
    total_reward = 0
    for i, batch in enumerate(dataloader):
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        # Forward pass: get action logits
        logits = model(input_ids, attention_mask)
        # Sample action from policy distribution
        action, log_prob = get_action(logits)
        # Simulate market outcome
        price_t = price_series[i]
        price_t1 = price_series[i + 1]
        action_name = action_map[action]
        reward = compute_reward(action_name, price_t, price_t1)
        # Policy gradient loss: -log(prob) * reward
        loss = -log_prob * reward
        loss.backward()
        optimizer.step()
        total_reward += reward
    return total_reward
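The loop reads `price_series[i + 1]`, so the price series needs one more entry than there are batches. Here is a small sketch of a helper that checks this up front. `price_pairs` is a hypothetical name, not part of the original code, and the prices are invented.

```python
def price_pairs(price_series, n_steps):
    """Return [(price_t, price_t1), ...] for steps 0..n_steps-1.

    Raises ValueError early instead of hitting an IndexError on the
    last batch when the series is too short.
    """
    if len(price_series) < n_steps + 1:
        raise ValueError(
            f"need {n_steps + 1} prices for {n_steps} steps, "
            f"got {len(price_series)}"
        )
    return [(price_series[i], price_series[i + 1]) for i in range(n_steps)]


prices = [100.0, 101.5, 101.0, 102.25]      # hypothetical closing prices
for step, (price_t, price_t1) in enumerate(price_pairs(prices, n_steps=3)):
    print(step, price_t, price_t1, price_t1 - price_t)
```

Calling it once before training, with `n_steps` set to the number of batches, turns a crash on the final batch into an immediate and readable error.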
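How Data Flows Through RL Training
Input text → model logits → softmax probabilities → sampled action → reward from the price move → policy gradient loss → parameter update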
Evaluation: Profit Instead of Accuracy
We evaluate by cumulative profit, not F1 score:
def evaluate_profit(model, dataloader, price_series):
    model.eval()
    total_profit = 0
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            logits = model(batch["input_ids"], batch["attention_mask"])
            # Choose the most likely action (greedy, no sampling)
            action = torch.argmax(logits, dim=1).item()
            action_name = action_map[action]
            price_t = price_series[i]
            price_t1 = price_series[i + 1]
            total_profit += compute_reward(action_name, price_t, price_t1)
    print("Total Profit:", total_profit)
    return total_profit
Total Profit: 1247.53
⚠️ Important
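A model with 70% accuracy can lose money, and a model with 40% accuracy can make money, because profit depends on the size of each win and loss, not just on how often the model is right. Accuracy ≠ Profit. Always evaluate on the metric that matters.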
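A quick worked example shows how this happens. The per-trade profits below are invented purely for illustration, and `summarize` is a hypothetical helper.

```python
def summarize(trades):
    """Return (accuracy, total_profit) for a list of per-trade profits."""
    wins = sum(1 for p in trades if p > 0)
    return wins / len(trades), sum(trades)


# Strategy A: right 7 times out of 10, but the 3 losses are large.
strategy_a = [1.0] * 7 + [-4.0] * 3
# Strategy B: right 4 times out of 10, but wins are large and losses small.
strategy_b = [5.0] * 4 + [-1.0] * 6

for name, trades in (("A", strategy_a), ("B", strategy_b)):
    accuracy, profit = summarize(trades)
    print(f"Strategy {name}: accuracy={accuracy:.0%}, total profit={profit:+.1f}")
# Strategy A: accuracy=70%, total profit=-5.0
# Strategy B: accuracy=40%, total profit=+14.0
```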
Hybrid Training (Best Practice)
Pure RL training is unstable, so we combine it with supervised learning:
Two-Phase Training
# Phase 1: Supervised pretraining
train(model, dataloader, optimizer)
# Phase 2: RL fine-tuning for profit
train_rl(model, dataloader, optimizer, price_series)
💡 Why Hybrid?
Supervised pretraining gives the model a good initialization. RL fine-tuning adapts it to the specific objective (profit). This is more stable than pure RL from scratch.
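The `train` function used in Phase 1 is not shown in this section. Below is a minimal sketch of what it might look like, assuming each batch is a dict that also carries a `"labels"` tensor of class indices and that the model returns raw logits. `TinyPolicy` and the fake batches are hypothetical stand-ins so the snippet runs on its own.

```python
import torch
import torch.nn as nn


def train(model, dataloader, optimizer):
    """Phase 1 sketch: one epoch of supervised training with cross-entropy."""
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    total_loss = 0.0
    for batch in dataloader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"])
        loss = loss_fn(logits, batch["labels"])   # assumed "labels" key
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(dataloader), 1)


class TinyPolicy(nn.Module):
    """Hypothetical stand-in model: mean-pooled embeddings -> 3 logits."""

    def __init__(self, vocab_size=50, dim=8, n_actions=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, input_ids, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        summed = (self.embed(input_ids) * mask).sum(1)
        pooled = summed / mask.sum(1).clamp(min=1.0)
        return self.head(pooled)


torch.manual_seed(0)
# Two fake single-example batches standing in for a real DataLoader.
batches = [
    {"input_ids": torch.tensor([[1, 2, 3]]),
     "attention_mask": torch.tensor([[1, 1, 1]]),
     "labels": torch.tensor([0])},
    {"input_ids": torch.tensor([[4, 5, 0]]),
     "attention_mask": torch.tensor([[1, 1, 0]]),
     "labels": torch.tensor([1])},
]
model = TinyPolicy()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
losses = [train(model, batches, optimizer) for _ in range(20)]
print(f"first epoch loss {losses[0]:.3f}, last epoch loss {losses[-1]:.3f}")
```

Because both phases update the same model with the same three-way output, the logits learned here become the starting policy for the RL phase.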
Advanced Design Levers
Where experts focus their effort:
Reward shaping. Penalize volatility and reward consistency:
# Subtract a penalty for high-variance returns
reward = profit - volatility_penalty
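One possible way to compute such a penalty, sketched under the assumption that volatility is measured as the standard deviation of recent per-step returns. `shaped_reward`, `risk_weight`, and the sample windows are illustrative choices, not prescribed by the lesson.

```python
import statistics


def shaped_reward(profit, recent_returns, risk_weight=0.5):
    """Profit minus a penalty proportional to the std. dev. of recent returns."""
    if len(recent_returns) < 2:
        volatility_penalty = 0.0          # not enough history to estimate
    else:
        volatility_penalty = risk_weight * statistics.pstdev(recent_returns)
    return profit - volatility_penalty


steady = [0.5, 0.5, 0.5, 0.5]             # hypothetical recent returns
choppy = [3.0, -2.0, 3.0, -2.0]           # same mean, much higher variance

print(shaped_reward(0.5, steady))         # 0.5
print(shaped_reward(0.5, choppy))         # -0.75
```

Both windows have the same average return, but the choppy one is penalized, which nudges the agent toward steadier behavior.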
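Entropy bonus. Add an entropy term to encourage exploration. Because the loss is minimized, the bonus is subtracted:
L = -log π(a|x) · R - β · H(π)
where H(π) is the entropy of the policy and β > 0 sets the strength of the bonus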
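A short sketch of the entropy term in plain Python. Since the loss is minimized, the entropy is subtracted here so that a more spread-out policy gets a lower loss. The function names and the value of `beta` are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi) = -sum p * log p, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(probs, action, reward, beta=0.01):
    """L = -log pi(a|x) * R - beta * H(pi); higher entropy lowers the loss."""
    return -math.log(probs[action]) * reward - beta * entropy(probs)


uniform = [1 / 3, 1 / 3, 1 / 3]           # maximally exploratory policy
peaked = [0.98, 0.01, 0.01]               # almost deterministic policy

print("H(uniform) =", round(entropy(uniform), 4))
print("H(peaked)  =", round(entropy(peaked), 4))
```

With three actions the entropy is largest for the uniform policy (ln 3) and shrinks as the policy commits to one action, so the bonus resists premature collapse onto a single action.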
Risk constraints. Penalize large drawdowns:
if drawdown > threshold:
    reward -= penalty
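A minimal sketch of how `drawdown` could be computed, assuming it means the drop from the highest account value so far to the current value. The helper names, threshold, penalty, and equity values are all hypothetical.

```python
def current_drawdown(equity_curve):
    """Drop from the running peak to the latest value (0.0 at a new high)."""
    peak = max(equity_curve)
    return peak - equity_curve[-1]


def risk_constrained_reward(profit, equity_curve, threshold=5.0, penalty=1.0):
    """Subtract a fixed penalty whenever the drawdown exceeds the threshold."""
    reward = profit
    if current_drawdown(equity_curve) > threshold:
        reward -= penalty
    return reward


equity = [100.0, 104.0, 110.0, 103.0]     # hypothetical account values

print(current_drawdown(equity))                   # 7.0
print(risk_constrained_reward(-7.0, equity))      # -8.0
```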
Discounted rewards. Use discounted rewards for long-term credit assignment:
R_t = Σ_{k≥0} γ^k · r_{t+k}
where γ is the discount factor (e.g., 0.99)
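The discounted return can be computed for every step in one backward pass, using R_t = r_t + γ · R_{t+1}. This is a small sketch; `discounted_returns` is an illustrative name, and γ = 0.5 is used only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for every t in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))   # [1.5, 1.0, 2.0]
```

In the policy gradient loss, R_t would then replace the single-step reward, so an action gets credit for rewards that arrive several steps later.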
Critical Real-World Insight
In its naive form this system will fail because:
- Markets are noisy
- Rewards are delayed
- The data distribution is non-stationary (it changes over time)
Learner: I built the RL trading system but it's not making a profit. The model keeps making bad trades.
Mentor: Most value comes from reward design + data alignment, not model architecture. How did you design your reward function?
Learner: I just used raw profit... price difference between t and t+1.
Mentor: That's the naive form. You need transaction costs, risk penalties, volatility adjustment. The market will exploit any simplification.
💡 The Truth
Most value comes from reward design + data alignment, not model architecture. A simple model with a well-designed reward usually beats a complex model with a poorly designed one.
Test Your Understanding
1. What is the key difference between classification and decision optimization?
Key Takeaways
- System upgrade: From predicting sentiment to maximizing profit — changes loss, output, and evaluation
- Reward function is critical: Business logic encoded mathematically — the agent optimizes exactly what you specify
- Policy gradient: Increase probability of high-reward actions using L = -log π(a|x) · R
- Evaluate by profit: Accuracy ≠ Profit. A 40% accurate model can be profitable with good risk management
- Hybrid training: Supervised pretraining + RL fine-tuning is more stable than pure RL
- Advanced levers: Reward shaping, entropy bonus, risk constraints, discounted rewards
- Real-world insight: Most value comes from reward design + data alignment, not model architecture