The Bottom Line: This article explains a rigorous, scientific approach to deciding when to enter and exit trades in stock markets. Unlike most trading advice, this framework accounts for the real costs of trading, uses proper statistical methods, and provides a clear formula for when action is justified. Whether you're a curious investor or a quantitative researcher, this guide bridges the gap between academic rigor and practical application.
Introduction: The Real Question Nobody Answers
Here's a scenario every investor faces: You've done your research. You believe a stock is going to rise. But when exactly should you buy it? Right now? At market open tomorrow? Should you wait for a dip?
Most trading books give you vague advice like "buy on weakness" or "follow the trend." But they never answer the fundamental question: How confident do you need to be before acting?
This article presents a complete framework that answers that question with mathematical precision—while remaining grounded in the messy reality of trading costs, uncertain information, and noisy markets.
What Makes This Different?
Traditional approaches treat timing as pattern recognition: "The market tends to go up on Mondays" or "Buy when price touches the 50-day moving average." But these approaches suffer from three fatal flaws:
- They ignore trading costs. A small statistical edge can easily be wiped out by the spread, fees, and market impact.
- They assume independence. Stock returns are not like coin flips—today's move affects tomorrow's.
- They never tell you when to act. A 55% win rate sounds good, but is it enough after costs?
Our framework addresses all three problems by treating timing as what it really is: a decision under uncertainty with frictions.
Part 1: Setting Up the Problem
The Prices You See vs. The Prices You Get
Before we can talk about timing, we need to be honest about something: the price you see on your screen is not the price you'll actually pay.
When you look at a stock quote, you typically see something called the mid-price—the average of the best bid (what buyers will pay) and the best ask (what sellers want):
$$m_t = \frac{a_t + b_t}{2}$$
where $m_t$ is the mid-price at time $t$, $a_t$ is the ask price, and $b_t$ is the bid price.
But here's the catch: You can't actually trade at the mid-price. If you want to buy immediately, you pay the ask. If you want to sell immediately, you receive the bid. The difference—called the spread—is your first cost of doing business.
What Is a Return, Really?
When we talk about how much a stock moved, we use returns—the percentage change in price. There are two common ways to measure this:
Simple return (what most people think of):
$$r_t^{\text{simple}} = \frac{m_t - m_{t-1}}{m_{t-1}}$$
Log return (what quants prefer because it has nicer mathematical properties):
$$r_t = \ln\!\left(\frac{m_t}{m_{t-1}}\right)$$
For small moves, these are nearly identical. Log returns have the advantage that they add up nicely over time: the log return from Monday to Wednesday equals the Monday-to-Tuesday return plus the Tuesday-to-Wednesday return.
For a prediction horizon of $H$ periods (say, 90 minutes), we define:
$$r_{t,t+H} = \ln\!\left(\frac{m_{t+H}}{m_t}\right)$$
This is what we're trying to predict: how much will the stock move over our chosen time horizon?
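To make this concrete, here's a minimal sketch of the horizon return as a prediction label, assuming mid-prices in a pandas Series at a fixed bar interval (the names `mid` and `H` are just illustrative):

```python
import numpy as np
import pandas as pd

def horizon_log_return(mid: pd.Series, H: int) -> pd.Series:
    """Log return over the next H bars: r_{t,t+H} = ln(m_{t+H} / m_t).

    The value at index t uses the *future* price m_{t+H}, so it is a
    prediction label, never a feature.
    """
    return np.log(mid.shift(-H) / mid)

# Usage sketch: 1-minute bars, 90-minute horizon
# mid = pd.Series(...)                 # mid-prices, one per minute
# y = horizon_log_return(mid, H=90)
```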
The Cardinal Rule: No Peeking at the Future
This sounds obvious, but it's where most timing research goes wrong: every calculation must use only information available at the time of the decision.
Mathematically, we express this using the concept of a filtration $\mathcal{F}_t$—a fancy term for "all the information you could possibly know at time $t$." This includes past prices, past trades, news that's already been released, and any indicators you've computed from historical data.
A valid timing signal must be $\mathcal{F}_t$-measurable, meaning it depends only on information available at time $t$. Anything else is cheating—and will produce spectacular backtests that fail miserably in live trading.
Part 2: The True Cost of Trading
Your Fill Price Is Not the Mid-Price
Let's model what actually happens when you trade. If you're buying, your actual fill price looks something like this:
$$P_t^{\text{buy}} = a_t \left(1 + \eta_t + \varepsilon_t + f_t\right)$$
And when selling:
$$P_t^{\text{sell}} = b_t \left(1 - \eta_t - \varepsilon_t - f_t\right)$$
Let's break down these terms:
| Component | What It Means | Typical Size |
|---|---|---|
| $a_t$ or $b_t$ | Ask or bid price | The starting point |
| $\eta_t$ | Market impact—your order moves the price against you | 1-10 basis points |
| $\varepsilon_t$ | Slippage from latency and execution delays | 1-5 basis points |
| $f_t$ | Brokerage fees and exchange costs | 0-10 basis points |
Note: A basis point (bp) is 0.01%, so 10 bp = 0.1%.
The Net Return: What You Actually Make
For a long trade (buying now, selling later), your net log return is:
$$r_t^{\text{net}} = \ln\!\left(\frac{P_{t+H}^{\text{sell}}}{P_t^{\text{buy}}}\right)$$
This is the number that matters. Not the mid-price return. Not the gross return. The net return after all costs.
Why this matters: Imagine you predict a 0.3% move with 60% accuracy. Sounds profitable, right? But if your total round-trip trading costs are 0.2%, two-thirds of that move is gone before you book any profit. This is why many "proven" strategies evaporate when you account for realistic execution.
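Here's a small numerical sketch of that erosion, using made-up prices and an assumed 10 bp of one-way costs per leg:

```python
import math

def net_log_return(buy_fill: float, sell_fill: float) -> float:
    """Net log return of a long round trip at executable fill prices."""
    return math.log(sell_fill / buy_fill)

# Hypothetical numbers: buy at an ask of 100.00, exit at a bid of 100.30
# (a 0.3% gross move), with 10 bp of one-way costs on each leg.
one_way_cost = 0.0010
buy_fill = 100.00 * (1 + one_way_cost)    # pay up on entry
sell_fill = 100.30 * (1 - one_way_cost)   # give back on exit

print(f"net log return: {net_log_return(buy_fill, sell_fill):.5f}")
# About 0.0010, i.e. only ~10 bp left of a 30 bp gross move.
```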
Part 3: From Prediction to Decision
Here's where most timing research stops: researchers build a model, measure its accuracy, and call it a day. But accuracy isn't action. How do you actually decide when to trade?
Two Ways to Think About It
Approach A: Full Distribution (Ideal)
If you can model the entire distribution of future returns—not just the average, but how spread out or skewed they might be—you can make optimal decisions using expected utility theory.
Approach B: Win/Loss Framework (Practical)
A simpler approach is to predict:
1. The probability of a profitable trade
2. The expected win size when you're right
3. The expected loss size when you're wrong
Let's define these precisely:
$$Y_t = \mathbf{1}\{\, r_t^{\text{net}} > 0 \,\}$$
This is a binary variable: 1 if the trade would be profitable, 0 otherwise. Our model estimates:
$$\hat{p}_t = \mathbb{P}\left(Y_t = 1 \mid \mathcal{F}_t\right)$$
This is the probability—given everything we know at time $t$—that the trade will be profitable.
We also estimate:
$$\hat{\mu}^{+}_t = \mathbb{E}\left[\, r_t^{\text{net}} \mid Y_t = 1,\ \mathcal{F}_t \,\right]$$
This is the expected gain when we win—a positive number.
$$\hat{\mu}^{-}_t = -\,\mathbb{E}\left[\, r_t^{\text{net}} \mid Y_t = 0,\ \mathcal{F}_t \,\right]$$
This is the expected loss when we lose—also expressed as a positive number for convenience.
The Million-Dollar Question: How Confident Is Confident Enough?
This is the heart of the framework. Given our probability estimate $\hat{p}_t$ and our win/loss magnitudes $\hat{\mu}^{+}_t$ and $\hat{\mu}^{-}_t$, when should we actually trade?
The expected value of trading is:
$$\mathbb{E}\left[\text{net return} \mid \mathcal{F}_t\right] = \hat{p}_t\,\hat{\mu}^{+}_t - (1 - \hat{p}_t)\,\hat{\mu}^{-}_t$$
In plain English: your expected profit equals (probability of winning × average win) minus (probability of losing × average loss).
But we shouldn't trade just because expected value is positive. We want a margin of safety—call it $\lambda_t$—to account for model uncertainty, risk limits, and operational constraints. So our rule becomes:
Trade only if:
$$\hat{p}_t\,\hat{\mu}^{+}_t - (1 - \hat{p}_t)\,\hat{\mu}^{-}_t > \lambda_t$$
The Magic Formula: Your Required Win Rate
Solving that inequality for $\hat{p}_t$, we get the minimum probability threshold:
$$\pi^{*}_t = \frac{\hat{\mu}^{-}_t + \lambda_t}{\hat{\mu}^{+}_t + \hat{\mu}^{-}_t}, \qquad \text{trade only if } \hat{p}_t > \pi^{*}_t$$
This is the key result. It tells you exactly how confident you need to be before acting.
Let's understand what this formula says:
| If this increases... | Then $\pi^{*}_t$... | Intuition |
|---|---|---|
| Expected loss $\hat{\mu}^{-}_t$ | Goes up | Bigger potential losses require more confidence |
| Expected win $\hat{\mu}^{+}_t$ | Goes down | Bigger potential wins justify acting with less certainty |
| Risk margin $\lambda_t$ | Goes up | More cautious stance raises the bar |
A Worked Example
Suppose your model gives you:
- $\hat{p}_t = 0.62$ (62% estimated probability of profit)
- $\hat{\mu}^{+} = 9$ bp (expected win of 9 basis points)
- $\hat{\mu}^{-} = 7$ bp (expected loss of 7 basis points)
- $\lambda = 1$ bp (risk margin of 1 basis point)
The required threshold is:
$$\pi^{*} = \frac{7 + 1}{9 + 7} = 0.50$$
Since $0.62 > 0.50$, you should trade.
But now imagine volatility spikes, and you increase your risk margin to, say, $\lambda = 4$ bp:
$$\pi^{*} = \frac{7 + 4}{9 + 7} \approx 0.69$$
Now $0.62 < 0.69$, so you should not trade.
This is why thresholding must depend on current market conditions—not be a fixed number.
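A minimal sketch of the decision rule, reproducing the worked example above (the function names are mine, not from any library):

```python
def required_threshold(mu_win: float, mu_loss: float, risk_margin: float) -> float:
    """Minimum win probability: pi* = (mu_loss + lambda) / (mu_win + mu_loss)."""
    return (mu_loss + risk_margin) / (mu_win + mu_loss)

def should_trade(p_hat: float, mu_win: float, mu_loss: float, risk_margin: float) -> bool:
    """Trade only if the calibrated probability clears the threshold."""
    return p_hat > required_threshold(mu_win, mu_loss, risk_margin)

# Worked example, magnitudes in basis points
print(required_threshold(9, 7, 1))      # 0.50
print(should_trade(0.62, 9, 7, 1))      # True  -> trade
print(required_threshold(9, 7, 4))      # ~0.69 after the risk margin rises
print(should_trade(0.62, 9, 7, 4))      # False -> stand aside
```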
Part 4: Finding Statistical Edges
Now that we know how to convert probabilities into decisions, where do those probabilities come from? The framework uses three main sources of information.
4.1 Temporal Patterns: Do Certain Times Work Better?
The Question: Are there certain days of the week or times of day when stocks tend to perform differently?
The Challenge: This seems simple, but it's a statistical minefield. Markets are not like coin flips—there's serial correlation (today's return affects tomorrow's) and heteroskedasticity (volatility clusters). Standard statistical tests assume away these features and give misleading results.
Weekday Effects
For each day $d$ (Monday through Friday), we want to estimate:
$$\mu_d = \mathbb{E}\left[\, r_{t,t+H} \mid \text{day}(t) = d \,\right]$$
and test whether it differs from other days.
The Right Way to Test This (a sketch of the HAC step follows the list):
- Calculate returns using a consistent definition
- Estimate differences using HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors—these account for the fact that returns are correlated and have changing volatility
- Use block permutation tests that preserve the time structure while breaking the day-of-week association
- Apply multiple testing corrections because you're testing many hypotheses (5 days × multiple stocks × multiple horizons)
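Here's a rough sketch of the HAC step only, using statsmodels' Newey-West covariance; the block permutation test and multiple-testing correction are omitted, and the column names are assumptions about your own data layout:

```python
import pandas as pd
import statsmodels.formula.api as smf

def weekday_effect_hac(df: pd.DataFrame, max_lags: int) -> pd.DataFrame:
    """Regress horizon returns on weekday dummies with HAC standard errors.

    Assumes `df` has a DatetimeIndex and a horizon-return column `ret`.
    """
    df = df.copy()
    df["weekday"] = df.index.dayofweek            # 0 = Monday ... 4 = Friday
    model = smf.ols("ret ~ C(weekday)", data=df)  # intercept absorbs the Monday mean
    # Newey-West (HAC) covariance accounts for autocorrelation and
    # volatility clustering in overlapping horizon returns.
    res = model.fit(cov_type="HAC", cov_kwds={"maxlags": max_lags})
    return pd.DataFrame({"coef": res.params, "p_value": res.pvalues})

# Usage sketch: max_lags should be at least the prediction horizon H
# print(weekday_effect_hac(returns_df, max_lags=90))
```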
Intraday Windows
Similarly, we can partition the trading day into windows $w$ (Open, Mid-morning, Midday, Afternoon, Close) and compute the return within each:
$$r_w(d) = \ln\!\left(\frac{m_{\text{end}(w,d)}}{m_{\text{start}(w,d)}}\right)$$
where $\text{start}(w,d)$ and $\text{end}(w,d)$ are the window boundaries on day $d$.
The same statistical discipline applies: HAC inference, permutation tests, multiple testing control.
Important Caveat: Temporal patterns are weak and regime-dependent. They should inform your model as features, not drive decisions on their own.
4.2 Price Levels: Support and Resistance
Technical analysts have long observed that prices seem to "bounce" off certain levels. Can we formalize this?
Rolling Quantiles as Levels
Instead of drawing arbitrary lines on a chart, we use statistical quantiles of recent prices:
$$S_t = \operatorname{Quantile}_{q}\left(m_{t-W+1}, \ldots, m_t\right), \qquad R_t = \operatorname{Quantile}_{1-q}\left(m_{t-W+1}, \ldots, m_t\right)$$
Here, $W$ is the lookback window (say, 5,000 to 20,000 bars) and $q$ is a small number like 0.1 or 0.15.
Translation: Support is roughly the 10th percentile of recent prices—a level the stock rarely trades below. Resistance is the 90th percentile—a level it rarely exceeds.
Generating Candidates
With a tolerance $\epsilon$ (to avoid exact-boundary whipsaws) and a momentum check $M_t$:
Long candidate: price near support ($m_t \le (1 + \epsilon)\, S_t$) AND momentum turning positive ($M_t > 0$)
Short candidate: price near resistance ($m_t \ge (1 - \epsilon)\, R_t$) AND momentum turning negative ($M_t < 0$)
The momentum check prevents you from trying to catch a falling knife—it waits for evidence that the bounce is actually happening.
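A compact sketch of the level and candidate logic, with illustrative parameter values (the window, quantile, tolerance, and momentum proxy are placeholders, not prescribed settings):

```python
import pandas as pd

def level_candidates(mid: pd.Series, W: int = 5000, q: float = 0.10,
                     tol: float = 0.0005, mom_window: int = 30) -> pd.DataFrame:
    """Rolling-quantile support/resistance with a simple momentum check."""
    support = mid.rolling(W).quantile(q)          # ~10th percentile of recent prices
    resistance = mid.rolling(W).quantile(1 - q)   # ~90th percentile
    momentum = mid.diff(mom_window)               # crude short-term momentum proxy

    long_candidate = (mid <= support * (1 + tol)) & (momentum > 0)
    short_candidate = (mid >= resistance * (1 - tol)) & (momentum < 0)

    return pd.DataFrame({
        "support": support,
        "resistance": resistance,
        "long_candidate": long_candidate,
        "short_candidate": short_candidate,
    })
```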
4.3 Multi-Horizon Momentum
Rather than relying on a single moving average crossover (which is noisy), we aggregate signals across multiple time horizons:
$$M_t = \sum_{k=1}^{K} w_k \,\operatorname{sign}\!\left(\mathrm{MA}^{\text{short}}_{k}(t) - \mathrm{MA}^{\text{long}}_{k}(t)\right)$$
This score is positive when short-term averages are above long-term averages across multiple horizons (bullish) and negative in the opposite case (bearish).
The weights $w_k$ should be estimated from past data and regularized to prevent overfitting.
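A minimal sketch of such an aggregated score, using equal placeholder weights and arbitrary horizon pairs rather than fitted values:

```python
import numpy as np
import pandas as pd

def momentum_score(mid: pd.Series, horizons=((5, 20), (20, 60), (60, 240)),
                   weights=None) -> pd.Series:
    """Aggregate momentum across several (short, long) moving-average pairs."""
    pairs = list(horizons)
    if weights is None:
        weights = np.full(len(pairs), 1.0 / len(pairs))   # placeholder equal weights
    score = pd.Series(0.0, index=mid.index)
    for w_k, (short, long_) in zip(weights, pairs):
        crossover = np.sign(mid.rolling(short).mean() - mid.rolling(long_).mean())
        score = score + w_k * crossover.fillna(0.0)
    return score
```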
Part 5: The Machine Learning Layer
What Are We Predicting?
The primary target is the binary outcome: will the trade be profitable after costs?
$$Y_t = \mathbf{1}\left\{ \ln\!\left(\frac{P_{t+H}^{\text{sell}}}{P_t^{\text{buy}}}\right) > 0 \right\}$$
Note that this uses executable fill prices, not mid-prices. This ensures we're predicting economic profitability, not just price direction.
Features (Inputs to the Model)
All features must be computed using only past information. Typical inputs include:
- Recent returns: $r_{t-\ell,\,t}$ for various lags $\ell$
- Volatility measures: Realized volatility, range-based estimators
- Temporal features: Day of week, time of day, rolling window estimates
- Level features: Distance to support/resistance
- Momentum: Aggregated score across horizons
- Microstructure: Spread, depth proxies, volume patterns
Model Choice
For tabular data like this, gradient boosting (XGBoost, LightGBM) is typically a strong baseline. More complex models like transformers or RNNs can be tried but must beat the simpler approach after costs—not just on accuracy metrics.
The model outputs:
$$\hat{p}_t = \mathbb{P}\left(Y_t = 1 \mid \mathcal{F}_t\right)$$
Optionally, it can also output magnitude estimates $\hat{\mu}^{+}_t$ and $\hat{\mu}^{-}_t$.
Why Calibration Matters More Than Accuracy
Here's a subtle but critical point: your model's probabilities must be trustworthy.
A model is well-calibrated if, among all the times it says "60% chance of profit," about 60% actually are profitable. Many models have good accuracy but terrible calibration—they might output "70%" when the true probability is only 55%.
Why does this matter? Because your threshold formula uses $\hat{p}_t$ as an actual probability. If your model's probabilities are wrong, you'll trade too much or too little.
Solution: After training, apply a calibration step (isotonic regression or Platt scaling) on held-out data to correct the probabilities.
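A sketch of that calibration step with scikit-learn's isotonic regression; the array names are placeholders for your own held-out scores and labels:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores_val: np.ndarray, y_val: np.ndarray) -> IsotonicRegression:
    """Fit a monotone map from raw model scores to calibrated probabilities
    on held-out (purged) validation data."""
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(raw_scores_val, y_val)
    return calibrator

# Usage sketch:
# calibrator = fit_calibrator(raw_scores_val, y_val)
# p_hat = calibrator.predict(raw_scores_live)   # calibrated probabilities for the threshold rule
```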
Part 6: Validation—Proving It Works
The Problem with Regular Train/Test Splits
In typical machine learning, you randomly shuffle data into training and test sets. This doesn't work for time series because:
- Future leakage: Random shuffling can put future observations in training
- Overlapping labels: If your prediction horizon is 90 minutes, observations 30 minutes apart share some of the same future returns
Walk-Forward Validation with Purging and Embargo
The solution is walk-forward validation:
- Train on historical data up to time $T$
- Validate on data from $T + E$ to $T + V$, where $E$ is the embargo length
- Roll forward and repeat
The embargo period ensures no information leakage. If your prediction uses returns over $H$ periods, then any training sample within $H$ periods of the validation start could leak information. A safe embargo is at least $H$ periods.
Purging removes training samples whose label periods overlap with validation/test periods.
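A simple sketch of purged, embargoed walk-forward splits, under the assumption that samples are evenly spaced in time and labels span `embargo` bars (so dropping the last `embargo` training samples acts as both purge and embargo):

```python
import numpy as np

def walk_forward_splits(n: int, train_size: int, val_size: int,
                        embargo: int, step: int):
    """Yield (train_idx, val_idx) index pairs for walk-forward validation."""
    start = train_size
    while start + val_size <= n:
        train_idx = np.arange(0, start - embargo)    # purged / embargoed training set
        val_idx = np.arange(start, start + val_size) # forward validation block
        yield train_idx, val_idx
        start += step

# Usage sketch:
# for tr, va in walk_forward_splits(n=len(X), train_size=50_000,
#                                   val_size=5_000, embargo=90, step=5_000):
#     model.fit(X[tr], y[tr]); evaluate(model, X[va], y[va])
```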
Metrics That Matter
Statistical Metrics:
- AUC (Area Under ROC Curve)
- Brier Score (measures probability accuracy)
- ECE (Expected Calibration Error)
Economic Metrics:
- Net P&L after all costs
- Sharpe Ratio (return per unit risk)
- Maximum Drawdown (worst peak-to-trough loss)
- Fill Rate (what percentage of intended trades actually execute)
Critical: All economic metrics must be computed net of execution costs. A strategy with great gross returns but poor execution is not a strategy—it's an illusion.
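For the calibration metrics, here's a small sketch of the Brier score and a simple equal-width-bin ECE (one of several ECE variants in use):

```python
import numpy as np

def expected_calibration_error(p_hat: np.ndarray, y: np.ndarray, n_bins: int = 10) -> float:
    """Bin-count-weighted average of |mean predicted prob - realized frequency| per bin."""
    bin_idx = np.minimum((p_hat * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(p_hat[mask].mean() - y[mask].mean())
    return float(ece)

def brier_score(p_hat: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((p_hat - y) ** 2))

# p_hat = calibrated probabilities on a held-out fold, y = realized 0/1 labels
# print(expected_calibration_error(p_hat, y), brier_score(p_hat, y))
```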
Part 7: Execution—Where Theory Meets Reality
The Execution Shortfall
You can have the best predictions in the world, but poor execution will destroy your edge. We measure execution quality using shortfall:
$$\text{Shortfall}_t = s_t \cdot \frac{P_t^{\text{fill}} - B_t}{B_t}$$
where $s_t$ is $+1$ for buys and $-1$ for sells, and $B_t$ is a benchmark price—typically VWAP (Volume-Weighted Average Price) over your execution window.
Positive shortfall means you did worse than the benchmark; negative means better.
Transaction Cost Analysis (TCA)
Total trading cost breaks down as:
$$\text{Total cost} = \text{spread} + \text{market impact} + \text{slippage} + \text{fees}$$
Market impact is often modeled with a square-root rule: impact scales with the square root of your participation rate (what fraction of volume you represent).
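A tiny sketch of both ideas: signed shortfall against a benchmark, and the square-root impact rule of thumb with an impact coefficient you would calibrate from your own fills:

```python
import math

def shortfall(fill_price: float, benchmark_price: float, side: int) -> float:
    """Signed shortfall vs. a benchmark (e.g. interval VWAP).
    side = +1 for buys, -1 for sells; positive means worse than the benchmark."""
    return side * (fill_price - benchmark_price) / benchmark_price

def sqrt_impact_bp(participation: float, daily_vol_bp: float, coeff: float = 1.0) -> float:
    """Square-root rule of thumb: impact ~ coeff * volatility * sqrt(participation rate).
    `coeff` is an assumed constant, not a universal value."""
    return coeff * daily_vol_bp * math.sqrt(participation)

# print(shortfall(100.05, 100.00, side=+1))                    # 0.0005 -> 5 bp worse than VWAP
# print(sqrt_impact_bp(participation=0.02, daily_vol_bp=150))  # ~21 bp estimated impact
```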
The Feedback Loop
Here's the key insight: your cost assumptions should be updated from realized execution data.
If your model assumed 5 bp of costs but you're consistently seeing 8 bp, your threshold is too low and you're overtrading. Update the conservative buffer based on actual shortfall statistics.
Part 8: Putting It All Together
The Complete Pipeline
Here's how everything fits together in real-time:
For each decision time t:
1. UPDATE FEATURES
- Compute all rolling statistics using only data up to t
- No future information allowed
2. GENERATE PREDICTION
- Feed features into calibrated model
- Output: probability p̂_t and magnitude estimates μ̂⁺, μ̂⁻
3. COMPUTE THRESHOLD
- Calculate π*_t = (μ̂⁻ + λ_t) / (μ̂⁺ + μ̂⁻)
- λ_t depends on current volatility and risk limits
4. DECIDE
- If p̂_t > π*_t: proceed to execution
- Otherwise: no action
5. EXECUTE (if trading)
- Select order type based on urgency and liquidity
- Use TWAP/VWAP/POV algorithm to minimize impact
- Respect participation limits
6. RECORD AND LEARN
- Log fill price, shortfall, latency
- Update cost buffer if systematic deviations
The Baseline Ladder: Complexity Must Earn Its Keep
Before deploying any sophisticated model, compare against simpler alternatives:
| Level | Strategy | Purpose |
|---|---|---|
| 1 | Random timing | Sanity check—anything should beat this |
| 2 | Simple momentum + fixed costs | Basic heuristic benchmark |
| 3 | Statistical timing only | Tests value of temporal patterns |
| 4 | Gradient boosting with features | Strong tabular baseline |
| 5 | Sequence model (Transformer/RNN) | Only if it beats Level 4 stably |
Rule: Never use a complex model unless it beats the simpler one in net-of-cost terms across multiple assets and time periods.
Part 9: What Can Go Wrong (And How to Avoid It)
Common Failure Modes
| Problem | Symptom | Solution |
|---|---|---|
| Leakage | Amazing backtest, terrible live performance | Audit all features; enforce purging and embargo |
| Multiple testing | "Discovered" patterns that don't replicate | Control false discovery rate; require effect sizes |
| Cost neglect | Profitable before costs, losing after | Use executable fills; maintain adaptive cost buffer |
| Miscalibration | Systematic overtrading or undertrading | Monitor ECE; recalibrate on fresh data |
| Regime change | Strategy works, then suddenly doesn't | Rolling re-estimation; regime detection |
What This Framework Does NOT Guarantee
Let's be clear: no methodology guarantees profits.
What this framework does guarantee:
- If you observe good performance, it's less likely to be an artifact of data snooping or cost-free fantasy
- Your decision rule is economically interpretable and auditable
- You have a systematic way to update beliefs and improve
Part 10: Practical Takeaways
For Individual Investors
Know your costs. Before evaluating any timing idea, understand your actual trading costs (spread + fees + slippage).
Demand calibration. If someone tells you their model has "70% accuracy," ask: "Are the predicted probabilities actually correct?" Accuracy without calibration is nearly useless for decision-making.
Use the threshold formula. Even without a fancy model, you can use:
$$\pi^{*} = \frac{\hat{\mu}^{-} + \lambda}{\hat{\mu}^{+} + \hat{\mu}^{-}}$$
Estimate average wins and losses from your trading history, add a risk margin, and you have a principled minimum confidence requirement.
Be skeptical of temporal patterns. "The market goes up on Mondays" might have been true historically but may not survive proper statistical scrutiny or persist in the future.
For Quantitative Researchers
Report everything net of costs. Gross returns are misleading. Always specify your execution model.
Use the baseline ladder. Force complex models to prove their worth against simpler alternatives.
Calibrate before thresholding. ECE matters more than AUC for actual trading decisions.
Document for reproducibility. Data sources, preprocessing, feature definitions, embargo rules, execution assumptions—all should be explicit.
For Portfolio Managers
The threshold is not a constant. $\pi^{*}_t$ should vary with volatility, liquidity, and risk budget. A fixed threshold is suboptimal.
Execution quality is alpha. Two identical prediction models can have vastly different P&L based on execution. Measure and optimize for shortfall.
Audit the loop. Regularly verify that predicted probabilities match realized frequencies and that cost assumptions match reality.
Conclusion: From Signals to Decisions
The core insight of this framework is simple but often overlooked: timing is not pattern recognition—it's decision-making under uncertainty with frictions.
This shift in perspective has profound implications:
- Statistical significance is necessary but not sufficient; you need economic significance after costs
- Probability estimates must be calibrated, not just accurate
- The threshold for action must be derived from costs and risk, not chosen arbitrarily
- Execution is not an afterthought; it's where alpha lives or dies
The formula $\pi^{*}_t = (\hat{\mu}^{-}_t + \lambda_t)/(\hat{\mu}^{+}_t + \hat{\mu}^{-}_t)$ encapsulates this philosophy: how confident you need to be depends on what's at stake.
By following the procedures outlined here—dependence-aware statistics, leakage-safe validation, execution-conscious evaluation, and explicit threshold derivation—you can build timing systems that are both scientifically credible and practically tradeable.
Or, equally valuable, you can quickly falsify timing ideas that don't survive scrutiny—before they cost you money.
Glossary
| Term | Definition |
|---|---|
| Basis point (bp) | 0.01%, or one-hundredth of a percent |
| Calibration | The property that predicted probabilities match actual frequencies |
| ECE | Expected Calibration Error—a measure of how well-calibrated a model is |
| Embargo | A gap between training and test data to prevent information leakage |
| Filtration ($\mathcal{F}_t$) | All information available at time $t$ |
| HAC | Heteroskedasticity and Autocorrelation Consistent (a type of robust standard error) |
| Log return | $\ln(P_t / P_{t-1})$; returns that add up over time |
| Market impact | The price move caused by your own trading |
| Purging | Removing training samples that overlap with test labels |
| Shortfall | The difference between your execution price and a benchmark |
| Slippage | Unfavorable price movement between decision and execution |
| VWAP | Volume-Weighted Average Price—a common execution benchmark |
Further Reading
1. Newey & West (1987) — The foundational paper on HAC standard errors
2. Benjamini & Hochberg (1995) — Controlling false discovery rate in multiple testing
3. Gneiting & Raftery (2007) — Proper scoring rules for probability forecasting
This article is for educational and research purposes only. It is not investment advice, and past performance of any strategy does not guarantee future results.