Teaching a Machine to Read Candlesticks

How a Convolutional Neural Network Learns to Predict Stock Movement from Raw Price Data

The cat came first. In 2012, a massive neural network trained by Google on millions of unlabeled YouTube frames developed a neuron that responded strongly to cat faces. It had not been given a handcrafted definition of a cat. No engineer had specified whiskers, ears, or fur as symbolic rules. The system had simply been exposed to enough visual regularity that one internal unit became selectively responsive to one of the internet’s most overrepresented objects. The moment mattered not because a machine had literally never recognized a cat before, but because it indicated that high-level visual structure could emerge from data, scale, and optimization rather than manual feature engineering. (Google Research)

A few months later, AlexNet did something different, and in practical terms, even more consequential. It was not an unsupervised cat discoverer. It was a supervised deep convolutional network trained on roughly 1.2 million labeled ImageNet images spanning 1,000 classes. In the 2012 ImageNet competition, it achieved a top-5 test error of 15.3%, compared with 26.2% for the runner-up. That was not an incremental improvement. It was a discontinuity large enough to reset the field’s expectations about what neural networks could do in vision. (NeurIPS Proceedings)

The core idea behind that victory was the convolutional layer. Rather than asking a model to understand an entire image as an undifferentiated block of pixels, convolution lets it learn local structure first: edges, corners, and textures, and then progressively higher-order compositions as depth increases. In one domain, those compositions become faces, dogs, and cats. In another, they can become breakouts, reversals, compressions, and cascades. This post is about what happens when that same machinery is turned away from photographs and aimed at a different kind of visual object: the recent price history of a stock.


I. The Idea

A candlestick chart is a picture. Not metaphorically. Literally. Each trading day emits a small packet of structure: Open, High, Low, Close, and Volume. When enough of those packets are lined up in sequence, human traders begin to perceive shapes. Some of those shapes are folklore. Some are overfit superstitions. But some are compressed descriptions of changing order flow, shifting participation, volatility contraction, failed continuation, or asymmetric response to information.

The question is whether a convolutional neural network can learn to read such patterns directly from raw market data.

The formulation is simple. Take the last (W) trading days of a stock’s history and arrange the five observable channels into a matrix

\[ \mathbf{X} \in \mathbb{R}^{C \times W}, \]

where (C = 5) corresponds to OHLCV and (W) is the lookback window in trading days.

The rows are not pixel intensities but market dimensions. The columns are not spatial coordinates but successive time steps. The model receives this matrix and answers a single question:

Will the stock price be higher or lower after (H) trading days?

That is the whole problem. Binary classification. Up or down. Everything else is architecture, estimation, normalization, and the long war against noise.
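The shaping of a single example can be sketched in a few lines of NumPy. The data here is a random placeholder, and the index (t = 400) is arbitrary; only the shapes matter.

```python
import numpy as np

# Hypothetical sketch: shape one training example from a daily OHLCV table.
# `ohlcv` has one row per trading day, columns [Open, High, Low, Close, Volume].
rng = np.random.default_rng(0)
ohlcv = rng.uniform(90, 110, size=(750, 5))  # placeholder data, 750 trading days

W, H = 20, 5            # lookback window and forecast horizon
t = 400                 # index of the window's last day

X = ohlcv[t - W + 1 : t + 1].T           # (C, W) = (5, 20): channels x time
y = int(ohlcv[t + H, 3] > ohlcv[t, 3])   # 1 if the close H days ahead is higher
```

Everything the model will ever see about this example is in `X`; everything it is asked to say is in `y`.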


II. Why Convolution?

A fully connected network would treat each input coordinate as structurally unrelated to the others unless it learned those relationships from scratch. Day 7 and day 8 would arrive merely as neighboring numbers, not as adjacent moments in a process. The model would have to rediscover that nearby observations are often related, that trends exist, that reversals are local phenomena before they become global ones, and that price and volume form patterns over short spans rather than in isolated points.

A convolutional layer imposes a more appropriate prior: nearby observations tend to matter together.

For a one-dimensional convolution with kernel size (k), the output at position (j) can be written as

\[ y_j = f\!\left(\sum_{i=0}^{k-1} w_i \, x_{j+i} + b\right), \]

where (w_i) are the learned kernel weights, (b) is a bias term, and (f) is a nonlinear activation function.

The filter sees only a local neighborhood of length (k), but it sees that same type of neighborhood everywhere along the time axis. This is the essence of weight sharing. A pattern learned near the left edge of the window is still recognizable near the right edge. That matters in markets because a short consolidation five days ago and a short consolidation yesterday may be instances of the same geometric relation, merely displaced in time.
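The formula above can be traced by hand for a single input channel. This is a minimal sketch, not an efficient implementation; the "rising edge" kernel is an illustrative assumption.

```python
import numpy as np

# One-channel 1-D convolution matching the formula above, with LeakyReLU as f.
def conv1d_valid(x, w, b, alpha=0.2):
    k = len(w)
    out = []
    for j in range(len(x) - k + 1):
        s = sum(w[i] * x[j + i] for i in range(k)) + b  # local weighted sum
        out.append(s if s > 0 else alpha * s)           # LeakyReLU activation
    return np.array(out)

x = np.array([1.0, 2.0, 4.0, 4.0, 2.0, 1.0])
w = np.array([-1.0, 0.0, 1.0])   # a crude "rising edge" detector
y = conv1d_valid(x, w, b=0.0)
# The same three weights slide along the whole series: weight sharing.
```

The kernel fires positively on the rising left side of the series and negatively on the falling right side, wherever those shapes occur.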

The first convolutional layer may learn motifs such as short impulses, narrow-range compression, expansion in volume, or abrupt rejection from a local high. Deeper layers can combine these into higher-order configurations: thrust followed by weak pullback, repeated failure near resistance, downward drift on declining participation, or acceleration after a volatility squeeze. Human chartists name such configurations. The network does not. It stores them as weights in a compositional hierarchy.


III. The Architecture

The implementation here uses a relatively lean one-dimensional CNN: deep enough to learn nontrivial temporal motifs, shallow enough to train quickly on daily data. The general design is inspired by Rahul Gupta’s S&P 500 Stock’s Movement Prediction using CNN, which appears publicly as an arXiv preprint associated with Stanford. (arXiv)

A practical architecture is:

| Layer | Operation | Output shape |
|---|---|---|
| Input | Raw OHLCV window | ((C, W)) |
| Conv1D_1 | 64 filters, kernel 3, BatchNorm, LeakyReLU | ((64, W)) |
| Conv1D_2 | 128 filters, kernel 3, BatchNorm, LeakyReLU | ((128, W)) |
| Conv1D_3 | 256 filters, kernel 3, BatchNorm, LeakyReLU | ((256, W)) |
| Conv1D_4 | 256 filters, kernel 3, BatchNorm, LeakyReLU | ((256, W)) |
| Pool | AdaptiveAvgPool1d(1) | ((256, 1)) |
| FC_1 | Linear((256, 128)), LeakyReLU, Dropout((0.4)) | ((128)) |
| FC_2 | Linear((128, 2)) | ((2)) |
| Output | Softmax | (P(\text{DOWN}), P(\text{UP})) |

All convolutional layers use padding (=1) with kernel size (3), so the time dimension is preserved through depth. Without padding, each convolution would shorten the sequence, and after several layers the model would be making decisions from a progressively eroded history. With padding, each layer keeps access to the full temporal extent of the window.
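Assuming a PyTorch implementation, the table above can be sketched as follows. Layer sizes follow the table; initialization and other training details are left at framework defaults.

```python
import torch
import torch.nn as nn

# A sketch of the architecture table above (assumed PyTorch implementation).
class CandleCNN(nn.Module):
    def __init__(self, channels: int = 5, n_classes: int = 2):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),  # preserves W
                nn.BatchNorm1d(c_out),
                nn.LeakyReLU(0.2),
            )
        self.features = nn.Sequential(
            block(channels, 64), block(64, 128), block(128, 256), block(256, 256),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.LeakyReLU(0.2), nn.Dropout(0.4),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):                       # x: (batch, C, W)
        z = self.pool(self.features(x))         # (batch, 256, 1)
        return self.classifier(z.squeeze(-1))   # logits, shape (batch, 2)

model = CandleCNN()
logits = model(torch.randn(8, 5, 20))   # a batch of 8 windows, W = 20
```

Note that the forward pass returns raw logits; the softmax is applied only where probabilities are needed.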

Batch normalization

After each convolution, activations are normalized:

\[ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \]

\[ y_i = \gamma \hat{x}_i + \beta, \]

where (\mu_B) and (\sigma_B^2) are the batch mean and variance, and (\gamma), (\beta) are learned affine parameters.

In modern practice, batch normalization is best understood less as a magical cure for a single named pathology and more as a stabilizing reparameterization that often improves optimization, keeps intermediate scales under control, and allows deeper stacks to train more reliably.

LeakyReLU

The activation function is LeakyReLU:

\[ f(x) = \begin{cases} x, & x > 0, \\ \alpha x, & x \le 0, \end{cases} \]

with (\alpha = 0.2).

Standard ReLU discards all negative activations. LeakyReLU retains a small gradient in the negative region, reducing the probability that units become inactive and stop adapting. In financial data, where adverse moves and downward structure are not mere absences of signal but signals in their own right, this asymmetry is useful.

Adaptive pooling

After the final convolution, adaptive average pooling compresses the time dimension to one number per channel:

\[ \mathrm{pool}(z_c) = \frac{1}{W}\sum_{t=1}^{W} z_{c,t}. \]

This is the transition from local motif detection to global summary. The convolutional layers ask what structures exist within the window. The pooling layer asks what kind of period this has been overall.
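Assuming PyTorch, the pooling step is nothing more exotic than a mean over the time axis, which a quick check confirms:

```python
import torch

# Sketch: AdaptiveAvgPool1d(1) reduces to a plain mean over the time axis.
z = torch.randn(4, 256, 20)                    # (batch, channels, W)
pooled = torch.nn.AdaptiveAvgPool1d(1)(z)      # (4, 256, 1)
assert torch.allclose(pooled.squeeze(-1), z.mean(dim=-1), atol=1e-6)
```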

The classifier

The pooled representation is fed through two fully connected layers, with dropout in between:

\[ \mathbf{h} = \phi(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1), \]

\[ \tilde{\mathbf{h}} = \mathbf{m} \odot \mathbf{h}, \qquad m_i \sim \mathrm{Bernoulli}(1-p), \]

\[ \mathbf{o} = \mathbf{W}_2 \tilde{\mathbf{h}} + \mathbf{b}_2. \]

A softmax then converts the logits into class probabilities:

\[ P(y = k \mid \mathbf{x}) = \frac{e^{o_k}}{\sum_j e^{o_j}}. \]

The result is not a prophecy. It is a probability distribution over two outcomes.


IV. Normalization: Making Every Stock Comparable

Raw price levels are a trap. A model that sees Apple at one scale and Berkshire Hathaway at another can waste capacity learning nominal magnitude rather than directional structure. But price level and price geometry are not the same thing. A stock at (500) moving in a (10)-point range and a stock at (50) moving in a (1)-point range may exhibit the same local pattern after rescaling.

A useful approach is min–max normalization performed per window and per channel:

\[ z_{c,t} = \frac{x_{c,t} - \min_{1 \le s \le W} x_{c,s}} {\max_{1 \le s \le W} x_{c,s} - \min_{1 \le s \le W} x_{c,s} + \varepsilon}, \]

where (\varepsilon > 0) is a small numerical constant preventing division by zero.

This achieves two things.

First, it makes the representation approximately scale-invariant. The model learns shape more than denomination.

Second, it preserves local context. Normalizing within each window means a given move is interpreted relative to the recent range rather than to some arbitrary global scale spanning years of data. That is appropriate for pattern recognition. A three-point move is different when it occurs inside a compressed regime than when it occurs inside a high-volatility expansion.

This local normalization has a tradeoff. It removes absolute price level information, which may itself encode regime information. That is often acceptable for a pattern-based directional classifier, but it should be recognized as a modeling choice, not a neutral preprocessing step.
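A minimal sketch of the per-window, per-channel normalization, using the hypothetical two-stock example from above:

```python
import numpy as np

# Per-window, per-channel min-max normalization, as in the formula above.
def normalize_window(X, eps=1e-8):
    """X: (C, W) raw window -> (C, W) scaled to roughly [0, 1] per channel."""
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo + eps)

X = np.array([[50.0, 50.5, 51.0],      # a low-priced stock...
              [500.0, 505.0, 510.0]])  # ...and a high-priced one, same geometry
Z = normalize_window(X)
# Both rows collapse to the same shape after rescaling: scale invariance.
assert np.allclose(Z[0], Z[1], atol=1e-4)
```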


V. The Sliding Window

Training examples are created by sliding a fixed-length window across the historical series one day at a time.

For each valid endpoint (t), define the input window

\[ \mathbf{X}^{(t)} = \bigl[x_{c,\tau}\bigr]_{c=1,\dots,C;\;\tau=t-W+1,\dots,t}, \]

and the label

\[ y^{(t)} = \mathbf{1}\!\left( \mathrm{Close}_{t+H} > \mathrm{Close}_t \right), \]

where (H) is the forecast horizon.

Each example therefore consists of a (5 \times W) matrix and a binary outcome indicating whether the close price after (H) days exceeded the close at the end of the window.

If a stock has roughly 750 daily observations, a window (W=20), and horizon (H=5), the usable sample count is approximately

\[ N \approx 750 - W - H + 1 = 726. \]

That is not a large dataset by deep-learning standards. It is one reason modest architectures are preferable here. Financial prediction problems are often data-poor relative to model flexibility, which means the main challenge is not expressiveness but restraint.

The train–validation split should be chronological, not shuffled. Time-series shuffling leaks future distributional information into the training process and creates a validation score that is statistically flattering and economically meaningless. The only honest test is always the same: train on earlier data, evaluate on later data.
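The windowing and the chronological split can be sketched together. The data here is a random placeholder; the 80/20 split fraction is an illustrative assumption.

```python
import numpy as np

# Sketch: build (X, y) pairs with a sliding window, then split chronologically.
def make_dataset(ohlcv, W=20, H=5):
    """ohlcv: (T, 5) array with Close in column 3."""
    X, y = [], []
    for t in range(W - 1, len(ohlcv) - H):          # every valid endpoint t
        X.append(ohlcv[t - W + 1 : t + 1].T)        # (5, W) input window
        y.append(int(ohlcv[t + H, 3] > ohlcv[t, 3]))  # UP = 1, DOWN = 0
    return np.stack(X), np.array(y)

rng = np.random.default_rng(1)
ohlcv = rng.uniform(90, 110, size=(750, 5))         # placeholder price history
X, y = make_dataset(ohlcv)
assert len(X) == 750 - 20 - 5 + 1                   # 726 samples, as computed above

split = int(0.8 * len(X))                           # chronological, never shuffled
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]
```

Note that the split is a single index, not a random permutation: everything in validation is strictly later than everything in training.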


VI. Training

With two output logits and softmax probabilities (\hat{p}_{n,k}), the natural objective is multiclass cross-entropy:

\[ \mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k \in \{0,1\}} y_{n,k}\log \hat{p}_{n,k}, \]

where (y_{n,k}) is the one-hot encoded target.

Cross-entropy has a useful asymmetry: confident mistakes are punished much more severely than cautious ones. That is desirable because a useful classifier is not merely one that predicts the right side slightly more often than chance; it is one whose confidence meaningfully tracks its actual reliability.

Class imbalance

Markets often have a directional drift over long samples. If the training set contains more UP than DOWN examples, an unconstrained model can learn a lazy strategy: predict UP frequently and collect mediocre headline accuracy.

A simple correction is inverse-frequency class weighting:

\[ w_k = \frac{N}{K \, N_k}, \]

where \(K = 2\) is the number of classes and \(N_k\) is the sample count in class \(k\).

The weighted loss becomes

\[ \mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k \in \{0,1\}} w_k \, y_{n,k}\log \hat{p}_{n,k}. \]

This does not create signal, but it prevents the optimizer from exploiting class frequency as a substitute for learning.
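A sketch of the weighting in PyTorch, with an illustrative imbalanced label vector (the 450/276 split is an assumption, not data from the text):

```python
import numpy as np
import torch
import torch.nn as nn

# Inverse-frequency weights w_k = N / (K * N_k), passed to the loss so the
# minority class counts proportionally more in the objective.
y = np.array([1] * 450 + [0] * 276)    # more UP than DOWN examples
N, K = len(y), 2
counts = np.bincount(y, minlength=K)   # [N_DOWN, N_UP]
weights = N / (K * counts)             # DOWN receives the larger weight

criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```

By construction, each class contributes (N / K) in total weight, so neither side can be ignored for free.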

Optimizer

The optimizer is Adam. With gradient (g_t) at step (t), it maintains bias-corrected first- and second-moment estimates:

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \]

\[ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \]

\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \]

\[ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon}. \]

Adam adapts step sizes per parameter and tends to behave well in small-to-medium neural architectures trained on noisy objectives.

Learning-rate scheduling and model selection

When validation loss plateaus, reducing the learning rate can help the optimizer refine rather than continue to jump. A common practical rule is to halve the learning rate after several stagnant epochs.

Likewise, the best model should be selected by out-of-sample performance, not by final training epoch. Saving the parameter set with the strongest validation score is a basic but essential regularization device.
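Both devices can be sketched in one loop. The toy linear model and synthetic data here are placeholders (the real loop would use the CNN and the windowed dataset), and the training step is omitted to keep the skeleton visible.

```python
import copy
import torch
import torch.nn as nn

# Sketch: LR halving on a validation plateau plus best-checkpoint selection.
torch.manual_seed(0)
model = nn.Linear(10, 2)                              # toy stand-in for the CNN
X_val, y_val = torch.randn(64, 10), torch.randint(0, 2, (64,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)
criterion = nn.CrossEntropyLoss()

best_val, best_state = float("inf"), None
for epoch in range(20):
    # (training step omitted in this sketch)
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()
    scheduler.step(val_loss)                          # halve LR after stagnation
    if val_loss < best_val:                           # keep the best checkpoint
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                     # restore, not final epoch
```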


VII. Multi-Horizon Prediction

A single horizon is rarely enough. The more interesting question is what a pattern looks like across horizons.

Train one model for each horizon (H \in \{1, 2, 5, 10, 30\}). Then interpret the vector of predictions jointly:

| Horizon | Interpretation |
|---|---|
| 1 day | Noise, impulse, or immediate continuation |
| 2 days | Very short-term direction |
| 5 days | Weekly swing tendency |
| 10 days | Intermediate move |
| 30 days | Position-level directional bias |

The joint configuration matters.

If all horizons point UP, the model sees unusually consistent continuation across scales.

If short horizons point UP while long horizons point DOWN, the model may be detecting a rally inside a broader decline.

If short horizons point DOWN while long horizons point UP, it may be seeing a pullback inside an underlying advance.

This cross-horizon disagreement is often more informative than any isolated prediction because it gives a crude local term structure of directional belief.

Each horizon should also be accompanied by its own validation statistics. A horizon with little out-of-sample skill is not useless if treated honestly; it indicates that the current representation does not extract stable predictive structure at that timescale. The failure itself is information.
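The cross-horizon readings above can be combined with a crude rule set. The probabilities and the 0.5 thresholds here are illustrative assumptions, not calibrated values:

```python
# Sketch: interpret a vector of per-horizon UP probabilities jointly.
preds = {1: 0.62, 2: 0.60, 5: 0.55, 10: 0.41, 30: 0.38}   # P(UP) per horizon H

short = [preds[h] for h in (1, 2, 5)]
long_ = [preds[h] for h in (10, 30)]

if min(preds.values()) > 0.5:
    view = "consistent continuation UP across scales"
elif all(p > 0.5 for p in short) and all(p < 0.5 for p in long_):
    view = "possible rally inside a broader decline"
elif all(p < 0.5 for p in short) and all(p > 0.5 for p in long_):
    view = "possible pullback inside an underlying advance"
else:
    view = "mixed term structure of directional belief"
```

For the hypothetical probabilities above, the short horizons lean UP while the long horizons lean DOWN, so the rule set reads the configuration as a rally inside a broader decline.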


VIII. What the Network Cannot Do

It sees only what is in the price tape and the associated volume series. Earnings surprises, regulation, litigation, geopolitics, guidance revisions, product failures, fraud, central-bank interventions, and war enter the model only after they have already altered observed market behavior.

It does not know why a pattern works. A CNN is a pattern extractor, not a causal theorist. It can learn that one local geometry is statistically followed by another. It cannot distinguish whether that relation reflects human anchoring, dealer inventory adjustment, reflexive momentum, calendar effects, or a temporary coincidence that will soon disappear.

It is vulnerable to regime change. A network trained on one distribution of volatility, liquidity, policy conditions, and participant behavior can perform poorly when those conditions shift. The model does not announce, in language, that the world has changed. It simply starts making worse predictions.

It is also vulnerable to overfitting. This is the chronic pathology of quantitative finance: a flexible learner discovers patterns that are real only inside the sample in which they were discovered. The practical defenses are familiar but incomplete—chronological validation, architectural restraint, dropout, normalization, and continuous out-of-sample monitoring. None is sufficient by itself. Together they merely reduce the odds of self-deception.


IX. The Deeper Lesson

The attraction of deep learning in markets is obvious. Financial time series are full of local motifs, nonlinear interactions, and repeated geometries embedded in noise. Convolutional networks are good at exactly that class of problem: finding useful local regularities without requiring that the analyst specify them in advance.

The danger is equally obvious. Markets are also full of accidental regularities that vanish on contact with the future.

So the real objective is not to produce theatrical certainty. It is to estimate probabilities whose confidence tracks actual skill.

A model that predicts UP with (0.90) confidence and is correct (0.52) of the time is worse than useless. It invites oversized bets based on miscalibrated conviction.

A model that predicts UP with (0.58) confidence and is correct (0.57) of the time is far more valuable. The edge is modest, but the confidence is tethered to reality.

That is the deeper criterion. Prediction is the visible output. Calibration is the hidden quality that determines whether the output can be used.
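Calibration can be checked directly by bucketing predictions by stated confidence and comparing each bucket's confidence with its realized hit rate. The data below is a synthetic, well-calibrated toy by construction; a real check would use the validation set.

```python
import numpy as np

# Sketch of a reliability check: does stated confidence track realized accuracy?
def reliability(conf, correct, n_bins=5):
    """conf: predicted probability of the chosen class; correct: 0/1 hits."""
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf < hi)
        if mask.any():
            rows.append((lo, hi, conf[mask].mean(), correct[mask].mean()))
    return rows  # (bin_lo, bin_hi, mean confidence, realized accuracy)

rng = np.random.default_rng(2)
conf = rng.uniform(0.5, 1.0, size=2000)                # toy confidences
correct = (rng.uniform(size=2000) < conf).astype(int)  # calibrated by design
rows = reliability(conf, correct)
for lo, hi, c, a in rows:
    assert abs(c - a) < 0.1   # confidence tracks accuracy in each bucket
```

A model like the (0.90)-confidence, (0.52)-accuracy example above would fail this check loudly in the top bucket.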

The lesson of the cat was not really about cats. It was that hierarchical structure can be learned from data. The lesson in markets is harsher: learned structure is valuable only when it survives the future.


X. Practical Notes for Deployment

A daily-horizon CNN of this type is feasible for on-demand analysis of single names or small watchlists. It is less suitable as a brute-force real-time scanner over very large universes unless predictions are precomputed and cached.

A robust production workflow should therefore include:

  1. daily retraining or rolling refresh on a scheduled basis,
  2. cached inference results per symbol and horizon,
  3. out-of-sample calibration checks,
  4. explicit transaction-cost overlays,
  5. a fallback rule when validation skill deteriorates below a minimum threshold.

Most importantly, the classifier should not be treated as an isolated trading engine. It is better understood as a structural signal generator that contributes one piece of evidence to a broader decision process.

Now available on macropulze.com!


Appendix: Notation Reference

| Symbol | Meaning |
|---|---|
| (\mathbf{X}) | Input matrix of shape ((C, W)) |
| (C) | Number of channels, here (5) for OHLCV |
| (W) | Lookback window in trading days |
| (H) | Prediction horizon in trading days |
| (k) | Convolutional kernel size |
| (w_i) | Convolutional filter weights |
| (\alpha) | LeakyReLU negative slope |
| (\gamma, \beta) | BatchNorm affine parameters |
| (\eta) | Learning rate |
| (\hat{p}_{n,k}) | Predicted probability of class (k) for sample (n) |
| (\mathcal{L}) | Cross-entropy loss |
| (N_k) | Number of samples in class (k) |

Sources

  • Quoc V. Le et al., Building High-Level Features Using Large Scale Unsupervised Learning (the “cat neuron” result). (Google Research)
  • Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, ImageNet Classification with Deep Convolutional Neural Networks (AlexNet). (NeurIPS Proceedings)
  • Rahul Gupta, S&P 500 Stock’s Movement Prediction using CNN (publicly surfaced as an arXiv preprint associated with Stanford). (arXiv)

