Every second fund manager now claims to use AI or machine learning. Most of them mean they run a gradient boosting model on financial ratios and call it "ML-driven alpha." Some of them genuinely do. The problem is that the fundamental constraints of equity return prediction make ML harder here than in almost any other domain.

This module is an honest accounting of where ML helps in systematic equity investing, where it doesn't, and why the distinction matters more than the hype.

A few numbers to calibrate expectations:
  • 55%: a directional accuracy that counts as excellent
  • 0.08: typical R² of an equity return model
  • ~250: effective independent observations per year on the NSE
  • 99%: the share of ML marketing claims you should ignore
The R² in equity return forecasting is an order of magnitude smaller than in almost any other ML domain. If a vendor shows you an R² of 0.6 on Indian returns, it is overfit, not alpha.
A sane ML workflow for equity data:
  1. Feature engineering: economic priors first, not raw price data.
  2. Train/test split: time-ordered only, never random.
  3. Cross-validation: walk-forward, 5-year train / 1-year test.
  4. Regularise hard: L1/L2 penalties; prefer shallow trees over deep nets.
  5. Out-of-sample check: a held-out period that was never touched.
Every ML project that fails in production fails at step 2. If you randomly shuffle financial data, you are letting the model see the future during training.
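Steps 2 and 3 can be sketched as a plain generator: time-ordered 5-year train / 1-year test windows that roll forward one year at a time. The function name and the year range are illustrative.

```python
def walk_forward_splits(years, train_len=5, test_len=1):
    """Yield (train_years, test_years) pairs, strictly time-ordered.

    Each test window starts immediately after its train window,
    so the model never sees the future.
    """
    for start in range(0, len(years) - train_len - test_len + 1, test_len):
        train = years[start:start + train_len]
        test = years[start + train_len:start + train_len + test_len]
        yield train, test

# Example: a 2005-2024 history. First split trains on 2005-2009
# and tests on 2010; the last trains on 2019-2023 and tests on 2024.
splits = list(walk_forward_splits(list(range(2005, 2025))))
print(splits[0])
print(splits[-1])
```

Note that the windows never overlap backwards: every training year strictly precedes every test year in its pair, which is exactly the property a random shuffle destroys.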
Where ML actually helps:
  • Execution & TCA: 42%
  • Feature ranking: 28%
  • Regime classification: 18%
  • Risk forecasting: 12%
Where ML hype fails:
  • Raw return prediction: 52%
  • Single-event prediction: 30%
  • Reinforcement learning: 18%
ML shines in execution and classification, where sample sizes are large. It struggles hardest in direct return prediction, where noise dwarfs signal.
Model class performance on Indian monthly returns (directional accuracy):
  • Linear regression with priors: 53.8%
  • Elastic net: 54.1%
  • Gradient boosting (shallow trees): 54.6%
  • Random forest (deep trees): 52.4%
  • Neural net (overfit): 49.8%
Shallow gradient boosting beats every deep neural net I have built on Indian monthly data. Simpler wins because the signal is weak and the history is short.
From my notebook
In 2021 I spent four months building an LSTM to predict next-day Nifty direction. Beautiful code, 52 features, dropout, an attention layer, the full works. Training accuracy 71%, validation 58%, out of sample 49.4%: essentially a coin flip with an extra electricity bill. I ripped it out and replaced it with a 12-month-minus-1 momentum score and a volatility regime filter. The simple version did 54.1% directional accuracy out of sample and is still live. Since then my rule: in Indian equity, start with the economic prior, keep the model linear until you can prove nonlinearity matters, and never ship a model trained on fewer than 1,000 independent observations to production.
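A rough sketch of the replacement signal described above, assuming a list of monthly closes. The 12-month window and the 0.25 annualised-volatility threshold are illustrative stand-ins, not the live parameters.

```python
import statistics

def momentum_12_1(monthly_prices):
    """12-minus-1 momentum: total return from month t-12 to month t-1,
    skipping the most recent month to sidestep short-term reversal."""
    if len(monthly_prices) < 13:
        raise ValueError("need at least 13 monthly closes")
    return monthly_prices[-2] / monthly_prices[-13] - 1.0

def low_vol_regime(monthly_returns, window=12, max_ann_vol=0.25):
    """Trade the momentum signal only when trailing annualised
    volatility is below a threshold (0.25 here is illustrative)."""
    recent = monthly_returns[-window:]
    return statistics.stdev(recent) * 12 ** 0.5 < max_ann_vol
```

The filter gates the signal: the momentum score only counts in months where `low_vol_regime(...)` is True, which is the whole "model" that replaced the LSTM.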

The core problem: signal-to-noise ratio

In image recognition, a well-trained ML model might achieve 99%+ accuracy. In equity return prediction, a model that predicts the direction of monthly returns correctly 55% of the time is genuinely excellent. The reason: financial returns are dominated by noise. The "signal" (the predictable component) is a small fraction of total return variance.

  • ~55%: monthly directional accuracy of a good momentum model in Indian markets. That 5-point edge above chance, compounded over years, is where the alpha comes from.
  • ~5%: typical R² of even strong factor models on individual stock returns. The other 95% of variance is unexplained, and ML models face the same noise ceiling as simpler models.

This has a critical implication: ML models that work in high-signal domains (image recognition, natural language processing) are dramatically over-parameterised for equity return prediction. A deep neural network with millions of parameters is fitting noise, not signal, when your signal-to-noise ratio is this low.
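A quick Monte Carlo makes the ceiling concrete: even a model with *perfect* knowledge of a signal that explains 5% of variance only reaches the mid-to-high 50s in directional accuracy, so real models with estimation error land around the 55% described above. All numbers here are simulated toy data under a Gaussian assumption, seeded for reproducibility.

```python
import random

random.seed(42)

R2 = 0.05                      # signal share of total return variance
sig_sd = R2 ** 0.5             # signal std (total variance normalised to 1)
noise_sd = (1 - R2) ** 0.5

n = 200_000
hits = 0
for _ in range(n):
    s = random.gauss(0, sig_sd)        # the predictable component
    r = s + random.gauss(0, noise_sd)  # realised return = signal + noise
    hits += (s > 0) == (r > 0)         # a perfect model predicts sign(s)

accuracy = hits / n
print(f"directional accuracy with a perfect view of the signal: {accuracy:.3f}")
```

The analytic answer for this setup is 1/2 + arcsin(√0.05)/π ≈ 0.57, so no amount of model complexity can push past the high 50s at this signal-to-noise ratio.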

The sample size problem

Indian market data from the NSE goes back to roughly 2000. The Nifty 500 universe gives 500 stocks, and monthly returns give 500 × 12 × 20 = 120,000 observations. That sounds like a lot, but the observations are not independent. The effective sample size, accounting for cross-sectional correlation (stocks move together), temporal autocorrelation, and regime non-stationarity, is far smaller. Lopez de Prado estimates that effective sample sizes for equity ML models are often 10 to 50x smaller than raw observation counts.

This means complex ML models are almost always overfitted on financial data. The right response is not "don't use ML"; it is "use regularised, parsimonious models and validate rigorously out-of-sample."
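A back-of-envelope version of the effective-sample-size correction uses the standard variance-of-the-mean formula for n equally correlated observations: they carry roughly n / (1 + (n-1)ρ̄) independent observations' worth of information. This is a crude stand-in, not Lopez de Prado's exact estimator, and the 0.1 average pairwise correlation is illustrative.

```python
def effective_n(n_obs, avg_corr):
    """Approximate number of independent observations among n_obs
    equally correlated observations (variance-of-mean argument)."""
    return n_obs / (1 + (n_obs - 1) * avg_corr)

# 500 stocks with an (illustrative) average pairwise correlation of 0.1:
per_month = effective_n(500, 0.1)   # under 10 "independent" stocks per month
total = per_month * 12 * 20         # over 20 years of monthly data
print(round(per_month, 1), round(total))
```

Under these assumptions the 120,000 raw observations shrink to a couple of thousand effective ones, roughly a 50x reduction, at the pessimistic end of the 10 to 50x range quoted above.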

What actually works in practice

Feature engineering
✓ Works well
ML's biggest contribution to factor investing is not in model complexity but in feature construction. Combining raw financial ratios into composite quality scores, detecting non-linear interactions between factors (e.g., quality matters more for high-momentum stocks), and processing text data from filings: these use ML appropriately, as a feature constructor rather than a return predictor.
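To make the "feature constructor" idea concrete, here is a minimal sketch of composite score construction via cross-sectional rank averaging. The function names and equal weights are illustrative; a production version would set the weights with something closer to regularised regression, as discussed later in this module.

```python
def cross_sectional_ranks(values, higher_is_better=True):
    """Map values to ranks in [0, 1] across the universe (1 = best)."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    ranks = [0.0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = pos / (len(values) - 1)
    return ranks

def composite_quality(roe, roce, de, fcf_yield):
    """Equal-weight rank composite of four quality inputs;
    debt/equity is ranked with lower-is-better."""
    parts = [cross_sectional_ranks(roe),
             cross_sectional_ranks(roce),
             cross_sectional_ranks(de, higher_is_better=False),
             cross_sectional_ranks(fcf_yield)]
    return [sum(p[i] for p in parts) / len(parts)
            for i in range(len(roe))]
```

Ranking before averaging makes the composite robust to outliers in any single ratio, which is most of why composites beat single raw ratios in practice.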
Ensemble methods
✓ Works well
Random forests and gradient boosting (XGBoost, LightGBM) are the most consistently useful ML tools in equity investing. They handle non-linearities, are relatively robust to overfitting with proper regularisation, and provide feature importances, helping you understand which inputs are driving predictions. Gu, Kelly & Xiu (2020) show gradient boosting materially outperforms linear models on US equity prediction.
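For intuition about why "shallow trees plus heavy damping" is the right shape here, this is a toy pure-Python gradient booster using depth-1 stumps on a single feature with squared loss. It is a teaching sketch of the boosting mechanic, not a substitute for XGBoost or LightGBM.

```python
def fit_stump(x, y):
    """Best single-split (depth-1) regression tree on one feature."""
    best = None
    order = sorted(range(len(x)), key=lambda i: x[i])
    for k in range(1, len(x)):
        thr = (x[order[k - 1]] + x[order[k]]) / 2
        left = [y[i] for i in range(len(x)) if x[i] <= thr]
        right = [y[i] for i in range(len(x)) if x[i] > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y[i] - (lm if x[i] <= thr else rm)) ** 2
                  for i in range(len(x)))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]  # (threshold, left_mean, right_mean)

def boost(x, y, n_rounds=20, lr=0.1):
    """Gradient boosting on squared loss: each round fits a stump to
    the current residuals and adds a damped (lr-scaled) correction."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [y[i] - pred[i] for i in range(len(y))]
        thr, lm, rm = fit_stump(x, resid)
        stumps.append((thr, lm, rm))
        for i in range(len(x)):
            pred[i] += lr * (lm if x[i] <= thr else rm)
    return stumps, pred
```

The learning rate is the regulariser: each tree only corrects a tenth of the remaining error, so no single weak pattern can dominate the fit, which is exactly the property that matters when signal is weak.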
NLP on filings
~ Conditional
Processing BSE/NSE corporate filings, earnings call transcripts, and management commentary to extract sentiment signals. Shows promise academically. In practice for India: limited data quality (many filings are PDFs of scanned documents), high implementation cost, and rapidly decaying edge as more funds adopt similar approaches. Worth exploring for large AUM; overkill for most retail systematic investors.
Regime classification
~ Conditional
Using clustering or hidden Markov models to classify market regimes (bull/bear/sideways) and switch strategy parameters accordingly. Theoretically attractive. Practically: regime transitions are only clear in hindsight, and real-time classification is noisy enough to generate costly false switches. Simple rules (trailing 200-day MA of index) often match or outperform fancy ML regime classifiers.
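The simple baseline mentioned above fits in a few lines, assuming a list of daily index closes. The 200-day window is the standard rule of thumb, not a tuned parameter, and the labels are illustrative.

```python
def ma_regime(closes, window=200):
    """Baseline regime rule: 'risk-on' when the latest close is above
    the trailing moving average, else 'risk-off'."""
    if len(closes) < window:
        raise ValueError("need at least `window` closes")
    ma = sum(closes[-window:]) / window
    return "risk-on" if closes[-1] > ma else "risk-off"
```

This is the benchmark any ML regime classifier has to beat net of false-switch costs, and in my experience it usually does not.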
Deep learning (LSTM, Transformers)
✗ Mostly hype
Deep neural networks require massive data and high signal to justify their complexity. Equity return prediction has neither. LSTMs on price sequences consistently fail out-of-sample in serious academic studies. Transformers applied to return prediction are even more prone to overfitting. The few cases where deep learning genuinely works involve very high-frequency data (millisecond order book data), not monthly factor strategies.
Reinforcement learning
✗ Not practical yet
RL for portfolio optimization is an active research area. Current practical limitations: the state space (market conditions, portfolio state) is enormous; training requires massive simulation that often doesn't match real market dynamics; and RL models are notoriously fragile when market conditions shift slightly. Academic results don't transfer to live trading reliably. Worth following but not deploying.

The honest bottom line: In equity investing, a well-constructed 5-factor model with careful implementation and disciplined rebalancing will outperform a complex ML model most of the time. ML adds genuine value at the margins: in feature engineering, composite score construction, and text processing. It doesn't replace the need for economic intuition about why a signal should work.

ML and RupeeCase

RupeeCase uses composite factor scoring (a form of feature engineering) to combine multiple signals within each factor into a single robust rank, rather than relying on a simple P/E or P/B ratio alone. The composite quality score combines ROE, ROCE, debt/equity, and FCF yield with empirically determined weights, using a method closer to regularised regression than to arbitrary equal weighting. No deep learning, no black boxes. Available at invest.rupeecase.com.

Glossary

Key terms from this module
Signal-to-noise ratio
The ratio of predictable (signal) to unpredictable (noise) variance in returns. Very low for equity monthly returns (~5%), limiting how much ML can help.
Effective sample size
The number of truly independent observations in a dataset, accounting for correlation. Far smaller than raw observation count for financial data.
Gradient boosting
An ensemble ML method (XGBoost, LightGBM) that builds trees sequentially, each correcting errors of the previous. Most consistently useful ML tool for equity factor research.
Regularisation
Techniques (L1/L2 penalty, dropout, early stopping) that constrain model complexity to prevent overfitting. Essential for all ML applied to financial data.
Feature engineering
The process of constructing new input variables from raw data: combining ratios, detecting interactions, processing text. Often where ML creates more value than in prediction itself.
A note from the author
Why this matters

Machine learning is the most over-hyped and least understood tool in Indian quant finance. I wrote this module because I have seen too many smart engineers overfit a gradient-boosted model to BSE data and call it alpha. Knowing when ML adds genuine edge, and when it is just expensive curve-fitting, separates survivors from casualties in live markets.

Tanmay Kurtkoti
Founder & CEO, RupeeCase · 17 years systematic trading · QC Alpha

