Research Note

In Multi-Horizon Forecasting, the Short Horizon Lies First

Tetracta AI Teams · 15 June 2026

A methodology note from our quantitative-research practice. It describes a general lesson about overfitting and honest backtesting. No strategy, instrument, universe, dataset, or performance figure is disclosed.

When you forecast the same target at several horizons (short, medium, long), there is a seductive instinct to chase the short one. It feels like the easy money: more observations per unit of calendar time, faster feedback, quicker iteration. You can run a hundred experiments in the time a long-horizon study lets you run one, and every fast feedback loop whispers that you are learning faster. Yet the short horizon is usually where a practitioner fools themselves first. The speed that feels like an advantage is also what lets a mistake compound before anyone notices it is a mistake.

We want to be precise about a thing that is easy to say loosely. The short horizon is not "harder" in some vague sense. It is harder in a specific, measurable way that makes it uniquely good at producing results that feel earned and are not. That is what this note is about — not a war story, but the anatomy of a particular self-deception, and the small set of practices that reliably catch it.

Why the short horizon overfits.

At short horizons the signal-to-noise ratio is brutal. Most of what moves a short-horizon outcome is noise — timing, microstructure, the granular accidents of when exactly a thing happened, things no model should be able to predict from the features it has. A flexible model handed a short horizon and limited data will happily memorize that noise and report a beautiful in-sample fit. The shortness is the trap: there is so little true signal that almost all of the apparent skill is the model reciting what it has already seen.

It helps to think about it as a budget. Any outcome you are trying to predict is some mixture of a slow, structural component and a fast, idiosyncratic one. As you shorten the horizon, the structural part — the part a model could legitimately learn — shrinks, while the idiosyncratic part stays loud or grows. So the fraction of the target that is genuinely predictable falls, sometimes precipitously, exactly as the number of observations rises. A naive practitioner reads the rising observation count as "more data, more power" and never notices that the thing those observations are mostly describing is noise. You have more samples of a signal that is mostly not there.

Now add a flexible model to that situation. Capacity is opportunity. A model with enough parameters, or enough feature interactions, or enough freedom in its hyperparameters, will find a way to fit whatever is in front of it. When the target is mostly noise, "whatever is in front of it" is the noise — and the model fits it with the same enthusiasm it would bring to real structure. The in-sample curve looks gorgeous. It is gorgeous because it is a photograph of the past, not a theory of the future.

A thought experiment makes this concrete without any data. Imagine you flip a fair coin many times and, alongside each flip, you record a long list of irrelevant facts about the moment: the temperature, the second-hand position, the last three flips. With enough irrelevant facts and a flexible enough fitter, you can "explain" the sequence of heads and tails almost perfectly in the data you already have. The fit is real arithmetic; the skill is imaginary. Short-horizon forecasting is the same game played with a target that is almost a coin but not quite, and the "almost" is just enough true signal to make the illusion convincing, because some of the fit really is earned. That sliver of reality is what makes the whole thing dangerous. A pure coin would fool no careful person. A coin with a faint bias fools careful people constantly.
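
The coin-flip thought experiment can be run directly. The sketch below is ours, not drawn from any real system: it assumes NumPy, uses ordinary least squares as a stand-in for "a flexible enough fitter," and the sample sizes and the function name `coin_flip_illusion` are arbitrary illustrative choices.

```python
import numpy as np

def coin_flip_illusion(n_train=200, n_test=200, n_features=150, seed=0):
    """Fit fair coin flips with many irrelevant features.

    Returns (in_sample_accuracy, fresh_data_accuracy).
    All sizes are illustrative assumptions, not recommendations.
    """
    rng = np.random.default_rng(seed)
    # Irrelevant "facts about the moment": random numbers unrelated to the flips.
    X_train = rng.normal(size=(n_train, n_features))
    X_test = rng.normal(size=(n_test, n_features))
    # Fair coin flips coded as -1 / +1, generated independently of the features.
    y_train = rng.choice([-1.0, 1.0], size=n_train)
    y_test = rng.choice([-1.0, 1.0], size=n_test)
    # Ordinary least squares plays the role of the flexible fitter.
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    acc_in = float(np.mean(np.sign(X_train @ coef) == y_train))
    acc_out = float(np.mean(np.sign(X_test @ coef) == y_test))
    return acc_in, acc_out

if __name__ == "__main__":
    acc_in, acc_out = coin_flip_illusion()
    print(f"in-sample accuracy:  {acc_in:.2f}")
    print(f"fresh-data accuracy: {acc_out:.2f}")
```

With this many irrelevant features relative to observations, the in-sample accuracy should come out high while the fresh-data accuracy stays close to a coin toss, which is the point: the arithmetic is real and the skill is not.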

The longer horizon is less glamorous and more honest. Slower-moving structure is harder to overfit and likelier to reflect something real, because it persists across more independent situations and therefore cannot be explained away by the accidents of any one of them. There are fewer effectively-independent long-horizon observations in the same calendar span — which feels like a disadvantage and is, in fact, a discipline: the model is forced to find something that recurs, rather than something that merely happened. As a general rule, the shortest horizons are where signal is least likely to survive honest testing — there is so little structure relative to noise that a flexible model can memorize a short-horizon target without learning anything that generalizes. None of which is to claim the short horizon contains literally nothing; the point is that whatever it contains is typically too small to separate cleanly from the noise it is buried in.

There is a second, quieter reason the short horizon misleads. When iteration is fast and cheap, you run more trials — and every additional trial is another lottery ticket in the search for an apparent edge. This is the multiple-comparisons problem wearing a stopwatch. If you try enough specifications, some will look good purely by chance, and the short horizon's fast feedback loop encourages you to try a lot of them. The very property that makes the short horizon feel productive — rapid experimentation — is the property that inflates your false-discovery rate. The honest correction is to remember that the number you should trust is not the best result you found, but something closer to the best result you would expect to find by chance given how hard you looked. Most people quote the former and live with the latter.
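
To get a feel for "the best result you would expect to find by chance," one can simulate a search over specifications that have no edge at all. This is a minimal sketch under our own assumptions: NumPy, independent Gaussian outcomes, and a hypothetical helper `best_of_n_null`. Real specifications are correlated with each other, so treat the output as intuition rather than a correction factor.

```python
import numpy as np

def best_of_n_null(n_trials, n_obs=250, n_repeats=200, seed=0):
    """Average best t-statistic among n_trials specifications with zero true edge."""
    rng = np.random.default_rng(seed)
    best = np.empty(n_repeats)
    for r in range(n_repeats):
        # Each column is one skill-less specification's per-period outcome.
        outcomes = rng.normal(size=(n_obs, n_trials))
        std_err = outcomes.std(axis=0, ddof=1) / np.sqrt(n_obs)
        t_stats = outcomes.mean(axis=0) / std_err
        best[r] = t_stats.max()  # the number a hopeful experimenter would quote
    return float(best.mean())

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "trials -> typical best t-stat under the null:", round(best_of_n_null(n), 2))
```

The typical "best" statistic climbs as the number of trials grows even though nothing is there to find, so a headline result is only meaningful relative to how hard you looked.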

A short taxonomy of how the lie gets in.

It is worth naming the channels, because "overfitting" is a single word for several distinct failures, and the fixes differ.

- Lookahead leakage. A feature carries information that, in production, would not yet be known at the moment of decision. This is the classic, and the short horizon is exquisitely sensitive to it: a sliver of accidental future information is enough to fabricate an edge where none exists.
- Alignment leakage. Subtler than lookahead: the feature and the label are each individually point-in-time correct, but they are joined or timestamped against slightly mismatched clocks, so a tiny amount of the future bleeds across the seam. This kind hides in the plumbing, not the logic, and survives a careful read of the model code because the model code is innocent; the join is guilty.
- Preprocessing leakage. Any step that "looks at the whole dataset" (normalizing, scaling, selecting features, imputing missing values, choosing a clipping range) computed once over all of time, including the part you are about to call out-of-sample. The fit then quietly knows the future's statistics. The cure is to fit every such transform only on data available as of the decision moment, and to apply it forward.
- Selection / survivorship leakage. The set of things you study was itself chosen using the future: keeping only the entities that lasted, or that were "interesting," is a choice made with hindsight, and hindsight is information.
- Target leakage. The label is constructed in a way that smuggles in something you would not have at decision time, or overlaps in time with the features it is supposed to be predicted from. Short horizons make overlapping windows easy to create by accident.
- Tuning leakage (the multiple-comparisons channel above). No single feature is from the future, but the choice of model was made by peeking at the test outcome many times. The future leaks not through a column but through the experimenter's decisions.
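
Of these channels, preprocessing leakage is the easiest to show in a few lines. The sketch below contrasts a whole-sample standardization with one that uses only strictly earlier values. It assumes NumPy; the function names and the `min_history` warm-up length are illustrative choices of ours, and an expanding window is only one of several reasonable point-in-time schemes.

```python
import numpy as np

def standardize_leaky(x):
    """Leaky: mean and std are computed over all of time, future included."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def standardize_as_of(x, min_history=20):
    """Point-in-time: each value is scaled using only strictly earlier values."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)  # no output until enough history exists
    for t in range(min_history, len(x)):
        past = x[:t]                # excludes x[t] and everything after it
        sd = past.std()
        if sd > 0:
            out[t] = (x[t] - past.mean()) / sd
    return out
```

A quick sanity check on any transform of this kind: perturb the values after some date and confirm that nothing before that date changes. The leaky version fails that check; the as-of version passes it.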

Naming them this way matters because the defenses are not interchangeable. A perfect point-in-time feature store does nothing about tuning leakage; a clean train/test split does nothing about a preprocessing step that straddles it. You have to close each door.

The discipline that catches it.

The danger is that a short-horizon overfit looks identical to a real edge until it meets reality. The fit metric, the curve, the confidence — all the things you would normally trust — are exactly the things the overfit reproduces. So you cannot trust the result; you have to trust the procedure. Four practices, in order of importance, separate the two.

Leak-free, point-in-time backtesting. The most common way short-horizon results get inflated is lookahead — using a feature that, in production, would not yet be known at decision time. Short horizons are exquisitely sensitive to this; a sliver of accidental future information is enough to fabricate an edge. Every feature must be stamped with the moment it was actually available, and tested as of that moment. The mental discipline we use is to ask of every single input: if I froze the world at the decision instant, could I physically compute this number? If the honest answer is "only after the fact," it does not belong in the model, no matter how much it helps. A good point-in-time system makes this question answerable by construction rather than by vigilance, because vigilance fails and construction does not.
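
One way to make the "freeze the world" question answerable by construction is to key every value by the moment it became knowable and to look it up only as of the decision instant. The helper below is a hypothetical sketch using Python's standard `bisect` module, not a description of any particular feature store.

```python
from bisect import bisect_right

def value_as_of(available_at, values, decision_time):
    """Return the latest value whose availability time is <= decision_time.

    available_at must be sorted ascending and must record when each value
    became knowable, not the period the value describes. Returns None if
    nothing was available yet.
    """
    i = bisect_right(available_at, decision_time)
    return values[i - 1] if i > 0 else None

if __name__ == "__main__":
    # Illustrative placeholders: three values that became available at times 10, 20, 30.
    available_at = [10, 20, 30]
    values = ["a", "b", "c"]
    print(value_as_of(available_at, values, 25))  # the value published at time 20
    print(value_as_of(available_at, values, 5))   # nothing knowable yet
```

Whether a value that becomes available exactly at the decision instant should count is a design choice; a stricter variant would use `bisect_left` so that only strictly earlier values qualify.
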
The temporal placebo. Shift or shuffle your labels in time and re-run the entire pipeline. If the "signal" survives a placebo where it cannot exist, you have not found an edge — you have found a leak or a bug. This single test catches more false positives than any other. The logic is almost embarrassingly simple and that is its strength: you deliberately break the relationship the model claims to exploit, by moving the labels to a place where they have no causal connection to the features, and then you demand that the apparent skill go away. If it does not go away, the skill was never tied to the relationship you thought; it was tied to something structural in your pipeline — an alignment seam, a preprocessing transform, a survivorship choice. Run the placebo several ways, not one: shift labels forward, shift them back, shuffle within blocks to preserve gross statistics while destroying the specific link. A real edge dies under all of them. A leak usually survives at least one, and the one it survives tells you where the body is buried. A clean battery of placebos is best treated as a precondition for believing a result at all, not as a victory lap after believing it.
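
A placebo battery along these lines can be sketched as follows. It assumes NumPy and a user-supplied `run_pipeline(X, y)` callable that re-runs everything end to end and returns a score; that callable, the function names, and the default shift and block sizes are all hypothetical. The circular shift via `np.roll` is a simplification, and the shift should exceed the label horizon and any feature lookback for the placebo to be meaningful.

```python
import numpy as np

def placebo_labels(y, kind, shift=50, block=20, seed=0):
    """Return labels whose specific link to the features is deliberately broken."""
    y = np.asarray(y)
    if kind == "shift_forward":
        return np.roll(y, shift)       # circular shift; wrap-around is a simplification
    if kind == "shift_back":
        return np.roll(y, -shift)
    if kind == "block_shuffle":
        rng = np.random.default_rng(seed)
        out = y.copy()
        # Shuffle within each block: block-level statistics are preserved,
        # the exact timing inside the block is destroyed.
        for start in range(0, len(y), block):
            out[start:start + block] = rng.permutation(y[start:start + block])
        return out
    raise ValueError(f"unknown placebo kind: {kind}")

def run_placebo_battery(run_pipeline, X, y,
                        kinds=("shift_forward", "shift_back", "block_shuffle")):
    """Re-run the whole pipeline on the real labels and on each placebo."""
    results = {"real": run_pipeline(X, y)}
    for kind in kinds:
        results[kind] = run_pipeline(X, placebo_labels(y, kind))
    return results
```

The important detail is that `run_pipeline` covers the entire pipeline, joins and preprocessing included, since those are the places a leak tends to live. If any placebo score stays close to the real one, that placebo is the one to investigate.
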
Out-of-sample across regimes. A result that only holds in the window you developed it on is a memory, not a model. Multi-year, walk-forward, across different conditions — or it does not count. The key word is regimes, not merely "more data." Twice as much data from the same conditions can flatter you twice as confidently; what you actually need is exposure to situations that do not resemble the one you built in, because generalization is precisely the claim that the thing holds where you did not look. Walk-forward — train on the past, test strictly on the future, roll the window, never let a later observation inform an earlier prediction — is the honest skeleton. And the result you should quote is not the best window but the distribution across windows, including the ugly ones. An edge that is wonderful on average because it is spectacular in one regime and broken in the rest is not an edge you can live on; it is a regime bet wearing a forecasting costume.
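
The walk-forward skeleton is short enough to write down. This is a minimal sketch in plain Python; the `gap` parameter is our own addition, meant to keep labels that span several periods from straddling the train/test boundary, and the summary helper simply reports the spread across windows rather than the best one.

```python
import statistics

def walk_forward_splits(n_obs, train_size, test_size, gap=0):
    """Yield (train_indices, test_indices); every test index follows every train index."""
    start = 0
    while start + train_size + gap + test_size <= n_obs:
        train_end = start + train_size
        test_start = train_end + gap
        yield range(start, train_end), range(test_start, test_start + test_size)
        start += test_size  # roll the window forward by one test block

def summarize_windows(scores):
    """Describe the distribution across windows, ugly ones included."""
    return {
        "n_windows": len(scores),
        "worst": min(scores),
        "median": statistics.median(scores),
        "best": max(scores),
    }
```

This version uses a rolling training window; an expanding window (always starting from the first observation) is an equally common choice. Neither guarantees regime coverage on its own: the windows still have to span conditions that differ from the ones the model was developed in.
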
Beware recency-overfit. Retraining constantly on the freshest data feels responsive; often it just chases the latest noise. An "improvement" can be nothing but a model overfitting the recent past and degrading the moment the present stops resembling it. There is a real tension here — the world does drift, and a model frozen forever will go stale — but "retrain more often" is not automatically the answer to drift; frequently it is a way to convert drift-anxiety into noise-chasing. The honest test is to ask whether a more-frequently-retrained variant actually generalizes better out of sample across regimes, or merely fits the recent window more snugly. The failure mode to watch for is the latter masquerading as the former: a retraining cadence that looks like adaptivity but is, on inspection, a slow-motion overfit to whatever the last little while happened to look like.
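
A toy version of that test, under heavy assumptions: the "model" here is just a trailing mean, refit on the most recent `window` observations every `retrain_every` steps, and scored only on points it has not seen. NumPy is assumed, the function name is hypothetical, and a trailing mean stands in for whatever model is actually being retrained.

```python
import numpy as np

def out_of_sample_mse(y, window, retrain_every):
    """Walk forward with a trailing-mean forecaster and return its out-of-sample MSE."""
    y = np.asarray(y, dtype=float)
    errors = []
    forecast = None
    for t in range(window, len(y)):
        if forecast is None or (t - window) % retrain_every == 0:
            forecast = y[t - window:t].mean()  # the refit uses only data before t
        errors.append((y[t] - forecast) ** 2)
    return float(np.mean(errors))

if __name__ == "__main__":
    # Synthetic, stationary noise: there is no drift for a fast refit to adapt to.
    y = np.random.default_rng(0).normal(size=2000)
    print("short window, refit every step:", out_of_sample_mse(y, window=5, retrain_every=1))
    print("long window, refit rarely:      ", out_of_sample_mse(y, window=200, retrain_every=50))
```

On a series with no drift, the frequently refit short-window variant tends to score worse out of sample because it is chasing the latest noise. On a series that genuinely drifts the ranking can go the other way, which is why the cadence should be tested this way rather than assumed.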

A practice we would add to the list as a quiet fifth, because it underwrites the other four: pre-register the analysis before you peek. Decide the horizons, the splits, the placebo battery, and the metric you will live on before you see how any of them turn out. The single most effective way to neutralize tuning leakage is to remove your future self's freedom to keep searching until something looks good. You are allowed to be surprised by the result; you are not allowed to redefine "good" after seeing it.
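
One lightweight way to pre-register, sketched with the Python standard library only: write the plan down as an immutable record and store its fingerprint somewhere timestamped before running anything. The class name, field names, and example values are hypothetical placeholders, not a description of our own process.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is created
class AnalysisPlan:
    horizons: tuple
    n_walk_forward_windows: int
    placebos: tuple
    primary_metric: str

    def fingerprint(self):
        """Stable SHA-256 hash of the plan, to be recorded before any result is seen."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    plan = AnalysisPlan(
        horizons=("short", "medium", "long"),
        n_walk_forward_windows=8,                      # placeholder value
        placebos=("shift_forward", "shift_back", "block_shuffle"),
        primary_metric="net_metric",                   # placeholder name
    )
    print(plan.fingerprint())
```

Any later change to the horizons, splits, placebos, or metric produces a different fingerprint, so a quiet redefinition of "good" becomes visible.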

And measure honestly.

A high t-statistic is not the same as a realized risk-adjusted return; an impressive gross number is not a net one. The horizon that overfits is also the one where costs bite hardest, because you act on it more often: a short-horizon result that looks wonderful gross can be flat or negative once you pay to trade it. If the figure you quote is not the one you would actually live on, it is decoration.

This deserves more than a sentence, because it is where the short horizon takes its second revenge. Two systems can show the same gross quality and behave completely differently once you account for the cost of acting on them, and the difference is almost entirely turnover. A short-horizon model, by its nature, wants to act often — and every action has a price: the spread you cross, the friction of execution, the slippage between the number you modeled and the number you got. Those costs scale with how frequently you move, so the short horizon pays them most. It is entirely possible for a result to look strongest precisely where it is least survivable, because the gross figure ignores the toll that the horizon's frequency imposes. The honest move is to fold a realistic cost model into the evaluation from the start, so that the number you compare across horizons is a net number, the one that survives contact with the world. When you do that, the ranking of horizons can invert: the glamorous short horizon, which looked dominant gross, can fall behind the slower one that acts rarely and keeps what it earns. We are not putting figures here, and we are deliberately not implying any particular ranking holds universally — only that you cannot know which horizon is actually better until you have charged each one honestly for the privilege of acting.
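
The arithmetic of the inversion is simple enough to sketch. The cost model below (a flat cost per unit of turnover) is a deliberate simplification, and every number in the usage example is an arbitrary placeholder in arbitrary units, chosen only to show the mechanism; none of them describes any real system or result.

```python
def net_per_period(gross_per_period, turnover_per_period, cost_per_unit_turnover):
    """Net result per period: gross result minus the toll paid for acting."""
    return gross_per_period - turnover_per_period * cost_per_unit_turnover

if __name__ == "__main__":
    cost = 0.05  # placeholder cost per unit of turnover
    # Placeholder horizons: (gross per period, turnover per period).
    horizons = {"short": (1.0, 10.0), "long": (0.6, 1.0)}
    for name, (gross, turnover) in horizons.items():
        net = net_per_period(gross, turnover, cost)
        print(f"{name}: gross={gross}, net={net:.2f}")
```

With these made-up inputs the short horizon leads on the gross figure and trails on the net one. Different inputs give a different ranking, which is why the comparison has to be made on net figures with a cost model you actually believe.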

There is also a humbler measurement point that the short horizon makes easy to forget: a single summary statistic, however rigorous, is one draw from a noisy process. Report the spread, not just the center. A result whose confidence interval comfortably includes "nothing" is not made real by a flattering point estimate, and the more trials you ran to find it, the more you should widen your skepticism rather than narrow it.
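
One common way to report the spread is a bootstrap interval around the point estimate. The sketch below assumes NumPy and resamples observations independently, which ignores serial dependence; for time-ordered data a block bootstrap would usually be the more defensible choice. The function name and defaults are ours.

```python
import numpy as np

def bootstrap_mean_interval(x, n_boot=2000, level=0.95, seed=0):
    """Return (sample mean, lower bound, upper bound) from a percentile bootstrap."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample of the observations, drawn with replacement.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    means = x[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(x.mean()), float(lo), float(hi)
```

If the interval straddles zero, the flattering point estimate has not yet earned belief, and this interval makes no adjustment for how many specifications were tried before this one was chosen.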

Why this generalizes.

This is not really a finance lesson; it is an overfitting lesson that finance happens to make vivid. Finance is simply a domain where the noise is loud, the incentives to fool yourself are large, and reality keeps unusually strict score — so the failure mode is hard to ignore. But the structure is everywhere. Anywhere you predict at multiple timescales with limited data — demand, load, equipment failure, churn, the next click, the next reading from a sensor — the shortest horizon is where leakage and overfit most convincingly masquerade as skill, and where a small team without an army of validators is most exposed.

The mapping is direct. "Costs of acting" becomes whatever the consequence of a false alarm is in your setting — a needless maintenance dispatch, a wasted intervention, a customer annoyed by a retention offer they did not need. "Regimes" becomes seasons, deployments, product changes, populations you did not train on. "Recency-overfit" becomes the model that looks sharp on last month and dissolves the moment the system it watches is reconfigured. The defense is the same everywhere: assume the short horizon is lying until a placebo, a leak-free point-in-time backtest, and an out-of-sample run across regimes — measured on the net figure you would actually live on — prove otherwise. The short horizon does not get the benefit of the doubt. It has to earn belief against a stacked deck, because a stacked deck is the only honest way to test something that is this good at flattering you.

We keep the specifics of our own systems closed — the targets, the features, the universe, the numbers. We share the discipline rather than the implementation on purpose: the discipline is the part that transfers, and the part that is true regardless of what you are forecasting. None of the practices above are proprietary, and that is exactly why they are worth writing down. The cheapest edge in the world is not overfitting in the first place — and the second cheapest is being willing to kill your own beautiful result before the market, or the machine, or the customer does it for you.

— Tetracta AI Teams · for humans, like humans.
