
Optimizing RL Agents with Exit Rules in RLXBT

RLXBT
December 21, 2025

Introduction

One of the main challenges in training RL agents for trading is the "noisy" reward signal. It is hard for an agent to tell whether a trade was profitable because of a good entry or simply because of a lucky turn of events. Exit rules let us separate the entry logic (which the agent learns) from the risk-management logic (which is strictly defined).
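In practice this separation is expressed in the environment itself: the exit rules are handed to RlxEnv, and the agent is left with only the entry decision. A minimal sketch, based on the constructor arguments, the 0/1/2 (hold/long/short) action space, and the data path used in the full script at the end of this article:

from rlxbt import load_data, RlxEnv

data = load_data("data/BTCUSDT_1h_2020-12-12_2025-12-11.csv")

# Risk management lives in the environment, not in the agent
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=20,
    exit_rules={
        "hold_bars": 12,              # time-based exit
        "max_drawdown_percent": 5.0,  # stop-loss
        "min_profit_percent": 1.5,    # take-profit
    },
)

# The agent only chooses entries: 0 = hold, 1 = long, 2 = short
obs, _ = env.reset()
obs, reward, done, truncated, info = env.step(1)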

Exit Rules Configurations

In our experiment, we compared three approaches (the corresponding exit-rule dictionaries are sketched after the list):

  1. No Rules (Baseline): The agent decides when to close a position.
  2. Conservative: Strict 2% stop-loss, 3% take-profit, and a maximum holding time of 48 hours.
  3. Aggressive: 5% stop-loss, quick 1.5% take-profit, and holding for no more than 12 hours.
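In code, these three setups map to the exit-rule dictionaries from the full script below; passing None disables exit rules entirely:

# Configuration 1: No Exit Rules (baseline) -- the agent closes positions itself
no_rules = None

# Configuration 2: Conservative Risk Management
conservative_rules = {
    "hold_bars": 48,              # max 48 hourly bars (2 days)
    "max_drawdown_percent": 2.0,  # stop-loss at 2% drawdown
    "min_profit_percent": 3.0,    # take-profit at 3%
}

# Configuration 3: Aggressive Day Trading
aggressive_rules = {
    "hold_bars": 12,              # max 12 hourly bars
    "max_drawdown_percent": 5.0,  # allow up to 5% drawdown
    "min_profit_percent": 1.5,    # quick profit taking at 1.5%
}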

Experimental Results

Data: BTCUSDT, 1-hour timeframe (December 2020 to December 2025), split 70/15/15 into train, validation, and test sets.
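The split is chronological (no shuffling); this excerpt from the full script shows exactly how it is done:

train_size = int(len(data) * 0.7)
val_size = int(len(data) * 0.15)

train_data = data.iloc[:train_size].reset_index(drop=True)
val_data = data.iloc[train_size : train_size + val_size].reset_index(drop=True)
test_data = data.iloc[train_size + val_size :].reset_index(drop=True)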

Training Summary Table (PPO Agent)

Configuration                   Return   Sharpe Ratio   Max Drawdown   Total Trades
No Rules                        -5.14%        -0.0075         14.71%            744
Conservative (2% SL, 3% TP)     -3.27%        -0.1234          4.46%             17
Aggressive (5% SL, 1.5% TP)    +34.58%         0.0407         20.15%           1242

Exit Reason Analysis (for the best strategy)

For the aggressive strategy, which showed the best result, the distribution of position closing reasons is as follows:

  • Signal (Agent Signal): 88.6%
  • MaxBarsReached (Timeout): 6.5%
  • MinProfitReached (Take-Profit): 4.7%
  • MaxDrawdown (Stop-Loss): 0.2%
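These percentages are obtained by counting the exit_reason attribute over the trades of the backtest result, as in this excerpt from the full script below (analysis_env is the RlxEnv instance replayed with the trained agent):

backtest_result = analysis_env.get_backtest_result()

# Count how often each exit reason occurred
exit_reasons = {}
for trade in backtest_result.trades:
    reason = str(trade.exit_reason) if hasattr(trade, "exit_reason") else "Unknown"
    exit_reasons[reason] = exit_reasons.get(reason, 0) + 1

total_exits = sum(exit_reasons.values())
for reason, count in sorted(exit_reasons.items(), key=lambda x: -x[1]):
    print(f"{reason:<30} {count:>5} ({count / total_exits * 100:>5.1f}%)")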

Conclusion: The agent learned to effectively use short market impulses, while the exit rules provided a safety net during prolonged movements or sharp drawdowns.


Full Example Code

Below is the full script to reproduce the results. To run it, you will need the rlxbt and stable-baselines3 libraries installed.

#!/usr/bin/env python3
"""
RLX RL Environment Demo with Exit Rules

This demo shows how to:
1. Configure RlxEnv with custom exit rules
2. Train an RL agent (PPO) with risk management
3. Compare performance with/without exit rules
4. Generate detailed metrics and reports

Exit Rules Features:
- hold_bars: Maximum bars to hold a position
- max_drawdown_percent: Force exit if position drawdown exceeds threshold
- min_profit_percent: Take profit when minimum target reached
- exit_at_night: Close positions during night hours
- max_hold_minutes: Time-based exit

LICENSING:
- Set RLX_LICENSE_KEY environment variable or pass license_key parameter to RlxEnv
- For development builds (--features offline_license), license is not required
- Get your license at https://rlxbt.com/pricing
"""

import sys
import os
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Add project root to path
project_root = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.insert(0, project_root)

try:
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv
    from stable_baselines3.common.callbacks import BaseCallback

    HAS_SB3 = True
except ImportError:
    HAS_SB3 = False
    BaseCallback = object  # fallback so RewardCallback below can still be defined
    print("⚠️  stable_baselines3 not installed. Running simplified demo.")

try:
    from rlxbt import rlx, load_data, RlxEnv
except ImportError:
    print("āŒ Failed to import RLX. Please run 'maturin develop' first.")
    sys.exit(1)


class RewardCallback(BaseCallback):
    """Callback to track training progress."""

    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.episode_count = 0

    def _on_step(self) -> bool:
        if self.locals.get("dones", [False])[0]:
            self.episode_count += 1
            if self.episode_count % 10 == 0:
                info = self.locals.get("infos", [{}])[0]
                portfolio = info.get("portfolio_value", 100000)
                ret = (portfolio - 100000) / 100000 * 100
                print(
                    f"  Episode {self.episode_count}: Portfolio ${portfolio:,.0f} ({ret:+.2f}%)"
                )
        return True


def run_episode_manual(env, strategy="random"):
    """Run single episode with manual strategy (no RL library needed)."""
    obs, _ = env.reset()
    done = False
    total_reward = 0
    actions_taken = []

    while not done:
        if strategy == "random":
            action = np.random.choice([0, 1, 2])
        elif strategy == "always_long":
            action = 1
        elif strategy == "always_short":
            action = 2
        else:  # hold
            action = 0

        obs, reward, done, truncated, info = env.step(action)
        done = done or truncated  # treat truncation as end of episode
        total_reward += reward
        actions_taken.append(action)

    return total_reward, info, actions_taken


def main():
    print("=" * 70)
    print("šŸ¤– RLX RL ENVIRONMENT WITH EXIT RULES DEMO")
    print("=" * 70)

    # Check for license key
    license_key = os.environ.get("RLX_LICENSE_KEY")
    if license_key:
        print(f"šŸ”‘ Using license key: {license_key[:20]}...")
    else:
        print("ā„¹ļø  No RLX_LICENSE_KEY set (OK for development builds)")
        print('   For production: export RLX_LICENSE_KEY="rlx_pro_..."')

    # =========================================================================
    # 1. LOAD DATA
    # =========================================================================
    data_path = os.path.join(
        project_root, "data", "BTCUSDT_1h_2020-12-12_2025-12-11.csv"
    )

    if not os.path.exists(data_path):
        print(f"āŒ Data file not found: {data_path}")
        return

    print(f"\nšŸ“Š Loading data from: {os.path.basename(data_path)}")
    data = load_data(data_path)
    print(f"   Total bars: {len(data):,}")
    print(f"   Date range: {data['timestamp'].min()} to {data['timestamp'].max()}")

    # Split data
    train_size = int(len(data) * 0.7)
    val_size = int(len(data) * 0.15)

    train_data = data.iloc[:train_size].reset_index(drop=True)
    val_data = data.iloc[train_size : train_size + val_size].reset_index(drop=True)
    test_data = data.iloc[train_size + val_size :].reset_index(drop=True)

    print(f"\nšŸ“ˆ Data Split:")
    print(f"   Train: {len(train_data):,} bars (70%)")
    print(f"   Valid: {len(val_data):,} bars (15%)")
    print(f"   Test:  {len(test_data):,} bars (15%)")

    # =========================================================================
    # 2. DEFINE EXIT RULES CONFIGURATIONS
    # =========================================================================
    print("\n" + "=" * 70)
    print("āš™ļø  EXIT RULES CONFIGURATIONS")
    print("=" * 70)

    # Configuration 1: No Exit Rules (baseline)
    no_rules = None

    # Configuration 2: Conservative Risk Management
    conservative_rules = {
        "hold_bars": 48,  # Max 48 hours (2 days)
        "max_drawdown_percent": 2.0,  # Stop loss at 2% drawdown
        "min_profit_percent": 3.0,  # Take profit at 3%
    }

    # Configuration 3: Aggressive Day Trading
    aggressive_rules = {
        "hold_bars": 12,  # Max 12 hours
        "max_drawdown_percent": 5.0,  # Allow 5% drawdown
        "min_profit_percent": 1.5,  # Quick profit taking at 1.5%
    }

    # Configuration 4: Session-Based (Night Exit)
    session_rules = {
        "hold_bars": 24,  # Max 24 hours
        "exit_at_night": True,  # Close before night
        "night_start_hour": 22,  # Night starts at 22:00 UTC
        "night_end_hour": 6,  # Night ends at 06:00 UTC
        "max_drawdown_percent": 3.0,
    }

    configs = [
        ("No Rules (Baseline)", no_rules),
        ("Conservative", conservative_rules),
        ("Aggressive", aggressive_rules),
        ("Session-Based", session_rules),
    ]

    for name, rules in configs:
        print(f"\nšŸ“‹ {name}:")
        if rules:
            for k, v in rules.items():
                print(f"   {k}: {v}")
        else:
            print("   No exit rules applied")

    # =========================================================================
    # 3. TEST RANDOM AGENT WITH DIFFERENT CONFIGS
    # =========================================================================
    print("\n" + "=" * 70)
    print("šŸŽ² RANDOM AGENT COMPARISON (baseline)")
    print("=" * 70)

    random_results = []

    for config_name, exit_rules in configs:
        # License key is automatically read from RLX_LICENSE_KEY environment variable
        env = RlxEnv(
            data=test_data,
            initial_capital=100000.0,
            window_size=20,
            exit_rules=exit_rules,
        )

        # Run 5 episodes with random actions
        returns = []
        trades_list = []
        for _ in range(5):
            _, info, _ = run_episode_manual(env, strategy="random")
            returns.append(info.get("total_return", 0) * 100)
            trades_list.append(int(info.get("total_trades", 0)))

        avg_return = np.mean(returns)
        avg_trades = np.mean(trades_list)

        random_results.append(
            {
                "config": config_name,
                "avg_return": avg_return,
                "avg_trades": avg_trades,
                "std_return": np.std(returns),
            }
        )

        print(f"\n{config_name}:")
        print(f"   Avg Return: {avg_return:+.2f}% (±{np.std(returns):.2f}%)")
        print(f"   Avg Trades: {avg_trades:.0f}")

    # =========================================================================
    # 4. TRAIN RL AGENTS FOR EACH CONFIG (if stable_baselines3 available)
    # =========================================================================
    if HAS_SB3:
        print("\n" + "=" * 70)
        print("🧠 RL AGENT TRAINING (PPO) - Training separate agent per config")
        print("=" * 70)

        # Training configurations (only train with rules that make sense)
        train_configs = [
            ("No Rules", no_rules),
            ("Conservative (2% SL, 3% TP)", conservative_rules),
            ("Aggressive (5% SL, 1.5% TP)", aggressive_rules),
        ]

        eval_results = []
        trained_models = {}

        for config_name, exit_rules in train_configs:
            print(f"\nšŸ‹ļø Training PPO agent with: {config_name}")
            if exit_rules:
                print(f"   Exit Rules: {exit_rules}")

            # Create training environment
            # Use lambda with default argument to capture exit_rules correctly
            train_env = DummyVecEnv(
                [
                    lambda er=exit_rules: RlxEnv(
                        data=train_data,
                        initial_capital=100000.0,
                        window_size=32,  # Optimized window size
                        exit_rules=er,
                    )
                ]
            )

            # Create PPO model with optimized hyperparameters
            model = PPO(
                "MlpPolicy",
                train_env,
                verbose=0,
                learning_rate=3e-4,
                n_steps=1024,
                batch_size=64,
                n_epochs=10,
                gamma=0.99,
                ent_coef=0.02,  # Higher entropy for exploration
            )

            # Training
            print(f"   Training for 100,000 timesteps...")
            start_time = time.time()
            model.learn(total_timesteps=100_000)
            train_time = time.time() - start_time
            print(f"   Training completed in {train_time:.1f}s")

            trained_models[config_name] = model

            # =====================================================================
            # 5. EVALUATE ON TEST SET
            # =====================================================================
            test_env = RlxEnv(
                data=test_data,
                initial_capital=100000.0,
                window_size=32,
                exit_rules=exit_rules,
            )

            obs, _ = test_env.reset()
            done = False
            actions = {0: 0, 1: 0, 2: 0}

            while not done:
                action, _ = model.predict(obs, deterministic=True)
                action = int(action)
                actions[action] += 1
                obs, reward, done, truncated, info = test_env.step(action)
                done = done or truncated  # treat truncation as end of episode

            total_actions = sum(actions.values())

            result = {
                "config": config_name,
                "total_return": info.get("total_return", 0) * 100,
                "sharpe_ratio": info.get("sharpe_ratio", 0),
                "max_drawdown": info.get("max_drawdown", 0) * 100,
                "total_trades": int(info.get("total_trades", 0)),
                "win_rate": info.get("win_rate", 0) * 100
                if info.get("win_rate")
                else 0,
                "portfolio_value": info.get("portfolio_value", 100000),
                "hold_pct": actions[0] / total_actions * 100,
                "long_pct": actions[1] / total_actions * 100,
                "short_pct": actions[2] / total_actions * 100,
                "train_time": train_time,
            }
            eval_results.append(result)

            print(f"\n   šŸ“Š Test Results:")
            print(f"   Total Return:    {result['total_return']:+.2f}%")
            print(f"   Sharpe Ratio:    {result['sharpe_ratio']:.4f}")
            print(f"   Max Drawdown:    {result['max_drawdown']:.2f}%")
            print(f"   Total Trades:    {result['total_trades']}")
            print(
                f"   Actions: Hold={actions[0]} ({result['hold_pct']:.1f}%), "
                f"Long={actions[1]} ({result['long_pct']:.1f}%), "
                f"Short={actions[2]} ({result['short_pct']:.1f}%)"
            )

        # =====================================================================
        # 6. SUMMARY TABLE
        # =====================================================================
        print("\n" + "=" * 70)
        print("šŸ“Š RESULTS SUMMARY - Each agent trained with its own config")
        print("=" * 70)

        print("\nā”Œ" + "─" * 78 + "┐")
        print(
            f"│ {'Config':<32} {'Return':>10} {'Sharpe':>10} {'Drawdown':>10} {'Trades':>8} │"
        )
        print("ā”œ" + "─" * 78 + "┤")
        for r in eval_results:
            print(
                f"│ {r['config']:<32} {r['total_return']:>+9.2f}% {r['sharpe_ratio']:>10.4f} "
                f"{r['max_drawdown']:>9.2f}% {r['total_trades']:>8} │"
            )
        print("ā””" + "─" * 78 + "ā”˜")

        # Best config
        best = max(eval_results, key=lambda x: x["sharpe_ratio"])
        print(f"\nšŸ† Best Configuration: {best['config']}")
        print(f"   Sharpe Ratio: {best['sharpe_ratio']:.4f}")
        print(f"   Total Return: {best['total_return']:+.2f}%")
        print(f"   Max Drawdown: {best['max_drawdown']:.2f}%")

        # =====================================================================
        # 7. EXIT STATISTICS (using best model)
        # =====================================================================
        print("\n" + "=" * 70)
        print(f"šŸ“ˆ EXIT REASONS ANALYSIS ({best['config']})")
        print("=" * 70)

        # Use the best performing model for analysis
        best_model = trained_models.get(best["config"])
        best_rules = None
        for name, rules in train_configs:
            if name == best["config"]:
                best_rules = rules
                break

        if best_model and best_rules:
            analysis_env = RlxEnv(
                data=test_data,
                initial_capital=100000.0,
                window_size=32,
                exit_rules=best_rules,
            )

            obs, _ = analysis_env.reset()
            done = False

            while not done:
                action, _ = best_model.predict(obs, deterministic=True)
                obs, reward, done, truncated, info = analysis_env.step(int(action))
                done = done or truncated  # treat truncation as end of episode

            # Get backtest result for exit statistics
            try:
                backtest_result = analysis_env.get_backtest_result()

                # Count exit reasons
                exit_reasons = {}
                for trade in backtest_result.trades:
                    reason = (
                        str(trade.exit_reason)
                        if hasattr(trade, "exit_reason")
                        else "Unknown"
                    )
                    exit_reasons[reason] = exit_reasons.get(reason, 0) + 1

                if exit_reasons:
                    print("\nExit Reason Distribution:")
                    total_exits = sum(exit_reasons.values())
                    for reason, count in sorted(
                        exit_reasons.items(), key=lambda x: -x[1]
                    ):
                        pct = count / total_exits * 100
                        print(f"   {reason:<30} {count:>5} ({pct:>5.1f}%)")

            except Exception as e:
                print(f"   Could not get exit statistics: {e}")

    else:
        print("\nāš ļø  Skipping RL training (stable_baselines3 not installed)")
        print("   Install with: pip install stable-baselines3 shimmy gymnasium")

    # =========================================================================
    # 8. FINAL NOTES
    # =========================================================================
    print("\n" + "=" * 70)
    print("šŸ“ KEY TAKEAWAYS")
    print("=" * 70)
    print("""
1. EXIT RULES IMPACT:
   - Conservative rules (2% SL, 3% TP) reduce risk but may limit upside
   - Aggressive rules allow bigger swings, higher variance
   - Session-based rules useful for avoiding overnight gaps

2. RL + EXIT RULES SYNERGY:
   - RL agent learns WHEN to enter (signal timing)
   - Exit rules handle risk management (HOW to exit)
   - This separation allows cleaner learning signal

3. CONFIGURATION RECOMMENDATIONS:
   - Day Trading: aggressive_rules with short hold_bars
   - Swing Trading: conservative_rules with longer hold_bars
   - 24/7 Markets (Crypto): no night exit needed
   - Traditional Markets: session_rules with night exit

4. HYPERPARAMETER TUNING:
   - hold_bars: Match your trading timeframe
   - max_drawdown_percent: Set based on risk tolerance
   - min_profit_percent: Balance between taking profits and letting winners run
""")

    print("=" * 70)
    print("āœ… Demo completed!")
    print("=" * 70)


if __name__ == "__main__":
    main()

Key Takeaways

  1. Synergy of RL and Exit Rules: An RL agent trains better when it doesn't have to worry about catastrophic losses (which are handled by max_drawdown_percent).
  2. Conservatism vs. Aggressiveness: In this test, conservative rules limited the agent too much (only 17 trades), while aggressive rules allowed the PPO agent to realize its potential.
  3. Drawdown: The impact on maximum drawdown depends on the configuration. Conservative rules cut it to 4.46% (versus 14.71% for the "pure" RL agent), while the aggressive setup accepted a larger 20.15% drawdown in exchange for much higher returns.
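For markets with a session close, or to avoid holding positions overnight, the full script also defines a session-based configuration (not part of the training comparison above) that follows the same dictionary format:

# Configuration 4: Session-Based (Night Exit) -- defined in the full script, not trained above
session_rules = {
    "hold_bars": 24,             # max 24 hourly bars
    "exit_at_night": True,       # close positions before night
    "night_start_hour": 22,      # night starts at 22:00 UTC
    "night_end_hour": 6,         # night ends at 06:00 UTC
    "max_drawdown_percent": 3.0,
}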

Article prepared for the RLXBT community. More examples in the project repository.
