Does the Nikkei 225 Rise the Day After an Ohtani Home Run? Backtest Results and Three Lessons

Audience / Key Points

Audience: Investors who have some market experience but are not yet fluent in statistical hypothesis testing; anyone who has seen the claim "Ohtani hits a home run, Japanese stocks go up"; and anyone who wants to learn how to separate a real signal from coincidence in data analysis.

Key Points:

The Nikkei 225 opening gap on the day after an Ohtani HR is statistically significant in a naive comparison (t=2.07, p=0.038), but the significance disappears once US equities are controlled for
A day-game vs. night-game split reveals a pattern consistent with an "S&P 500 transmission channel," so we cannot dismiss it as pure spurious correlation
Unusable as a trading signal, but valuable as a hands-on lesson in spurious correlation, control variables, and multiple testing

Bottom Line: Two Questions, Two Answers

"The Nikkei 225 opens higher the morning after Shohei Ohtani hits a home run" — this claim circulates on social media. I backtested it on every game from his 2018 MLB debut through April 2026 (roughly 1,060 game days, of which about 260 are HR days).

Term: What is a backtest?

A backtest asks "if I had traded according to this rule in the past, would I have made money?" using historical data. Here, the rule is "buy the Nikkei 225 the morning after Ohtani hits a home run." Note that historical performance does not guarantee future results.

The answer splits along two axes.

Trading verdict: NO-GO. Cannot be used as an investment signal. The effect size is small, and transaction costs eliminate any profitability.
Phenomenon hypothesis: CONDITIONAL-NO-GO (held in reserve). The day/night split reveals a structure that does not look like pure spurious correlation, but the sample is too small to confirm.

What follows walks through the entire verification, illustrating three traps in statistical analysis with concrete examples.

Study Design: What to Measure, and How

Backtest credibility is determined before you run it. If you go hunting after the fact for "conditions that look like an effect," you dramatically inflate the risk of mistaking coincidence for a real signal. The design below was fixed in advance.

Dependent variable: Nikkei 225 opening gap (the percentage change from the previous close to the same day's open). The transmission channel "Ohtani HR → news → reflected in the next morning's opening price" is the most direct one¹.

Term: What is the opening gap?

The number that captures how much the first price of the day (the open) differs from the last price of the previous day (the close). For example, if yesterday's close was 39,000 yen and today's open is 39,100 yen, the opening gap is about +0.26%. Because it absorbs overnight news and overseas market moves, it works well as a proxy for "overnight sentiment."
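The arithmetic in the example above can be written as a small helper. This is a minimal sketch: the function name `opening_gap_pct` is an illustrative assumption, and it uses the simple percentage change described here rather than a log return.

```python
def opening_gap_pct(prev_close: float, today_open: float) -> float:
    """Percentage change from the previous close to today's open."""
    if prev_close <= 0:
        raise ValueError("prev_close must be positive")
    return (today_open / prev_close - 1.0) * 100.0


# Worked example from the text: close 39,000, next open 39,100.
gap = opening_gap_pct(39_000, 39_100)
print(f"{gap:+.2f}%")  # +0.26%
```

For gaps this small, a log return gives almost the same number, so the choice between the two barely matters here.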

Independent variable: A 1/0 dummy for "did Ohtani hit a home run?" I also designed a "buzz score" (HR=8 points, pitcher win=6 points, double-digit strikeouts bonus=+5 points, etc.) as a proxy for social-media attention, but I first wanted to see whether the simplest HR dummy alone produced a signal⁴.

Control variables: Previous-day S&P 500 return, USD/JPY change, Monday dummy. Without them I cannot rule out the confounding scenario "the US market just happened to be strong on the same day Ohtani performed well."

Period: From 2018 (Ohtani's first MLB year) through April 2026. I split the sample into the Angels era (2018–2023) and the Dodgers era (2024–) to look for structural change.

Step 1: Naive Comparison (At First Glance, the Effect "Exists")

I started with the most naive analysis: compare the next-day Nikkei opening gap on days when Ohtani hit a home run (HR days) versus days when he did not (non-HR days).

Period	HR-day opening gap	Non-HR-day	Difference
Full period (2018–2026)	+0.138%	+0.023%	+0.115%
Angels era (2018–23)	+0.084%	+0.032%	+0.052%
Dodgers era (2024–)	+0.222%	+0.001%	+0.221%

Across the full period, t=2.07 (p=0.038). By statistical convention this just barely clears the "5% significance level" (t≥1.96), so it is considered "unlikely to be coincidence." But the significance has very little headroom.

Term: t-statistic and p-value — measuring 'unlikely to be coincidence'

The t-statistic measures how large the observed difference is relative to the noise in the data. A larger |t| means the difference is harder to explain by chance alone. The rough rule for large samples is |t|≥1.96 → "5% significant": if there were truly no difference, a gap this large would show up less than 5% of the time.

The p-value is the probability that pure chance alone would produce a result at least this extreme. p=0.038 means "if there were truly no effect, a difference at least this large would appear in about 3.8 out of 100 repetitions." p<0.05 is the conventional pass mark in statistics, but the threshold has no absolute justification; it is only a convention.
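To make the link between t and p concrete, here is a minimal standard-library sketch of a Welch t-statistic and a two-sided p-value. It assumes the sample is large enough for the normal approximation to the t distribution, and the helper names and toy data are illustrative, not the code behind the table above.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def two_sided_p_normal(t):
    """Two-sided p-value under the normal approximation (large samples only)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Toy check: the means differ by 1 and each sample variance is 2.5, so SE = 1.
t_toy = welch_t([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])

# The full-period statistic quoted in the text.
p_article = two_sided_p_normal(2.07)
print(round(t_toy, 3), round(p_article, 3))  # 1.0 0.038
```

With around a thousand observations the normal approximation and the exact t distribution agree to the third decimal, which is why t=2.07 maps to p≈0.038.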

The Dodgers-era difference is more than four times the Angels-era difference, which is consistent with the much higher Japanese-media exposure that followed his move to the Dodgers.

It is tempting to declare "effect confirmed" here, but the next step changes the picture entirely.

Step 2: Adding Control Variables (the Effect "Vanishes")

Question this section answers

Is the difference seen in the naive comparison caused by Ohtani's performance, or is some third factor driving both?

One of the most important moves in statistical analysis is to add control variables. I put the main drivers of Nikkei moves (S&P 500, USD/JPY, day-of-week effect) into a single regression and asked whether the Ohtani HR effect survives once their influence is netted out.

Term: What is a control variable?

A variable used to "subtract out the influence of other causes." Suppose data shows "days when more people carry umbrellas have more traffic accidents." That is just because "rainy days mean more umbrellas AND slippery roads." Add "is it raining?" as a control variable and the apparent umbrella–accident link disappears. Here, the S&P 500 plays the role of "rain."

Variable	Coefficient	t	Verdict
Previous-day S&P 500 return	+0.634	14.67	Significant
Ohtani HR dummy	+0.077%	1.52	Not significant
Monday dummy	-0.214	-1.89	10% significant

The instant the S&P 500 enters the regression, the HR-dummy t-statistic drops from 2.07 to 1.52 — below the significance threshold.

What is going on? One candidate explanation: when the US market is strong, the S&P 500 rallies, and the Nikkei 225 opens higher the next morning as it prices in that rally. If the same environment also tends to favor Dodgers games (atmosphere, team momentum, and other compound factors), then "Ohtani HR → Nikkei rise" looks real, but the more likely story is spurious correlation: "strong US market → both Ohtani HR days AND a higher Nikkei open."

    Candidate common cause (strong US market)
            ↓                  ↓
    Ohtani HR (Outcome A)   Nikkei rise (Outcome B)
            ↓                  ↑
       Apparent correlation (spurious)

At this point the verdict is NO-GO. But the story did not end here.

Term: What is spurious correlation?

A phenomenon where two events appear to "go together," but in fact a third cause is driving both. The classic example: "ice cream sales rise on the same days as drowning incidents." Ice cream does not cause drownings — "hot weather" causes both. The Ohtani HR / Nikkei link could similarly be driven by a third factor, "strong US market."
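The mechanism can be reproduced on synthetic data. In the sketch below every number is made up: a common cause `us` drives both a toy HR dummy and a toy opening gap, and the HR dummy has no effect of its own. Controlling for `us` is done by regressing both variables on it and comparing the residuals, which gives the same coefficient as putting `us` into the regression (the Frisch-Waugh-Lovell result).

```python
import random
from statistics import mean


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, z):
    """Residuals of y after regressing it on z (with an intercept)."""
    b = slope(z, y)
    mz, my = mean(z), mean(y)
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]


rng = random.Random(0)
n = 20_000
us = [rng.gauss(0, 1) for _ in range(n)]                       # common cause
hr = [1.0 if u + rng.gauss(0, 1) > 0.7 else 0.0 for u in us]   # depends on us
gap = [0.6 * u + rng.gauss(0, 1) for u in us]                  # depends on us only

# Slope on a 0/1 dummy equals the HR-day minus non-HR-day mean difference.
naive = slope(hr, gap)
controlled = slope(residuals(hr, us), residuals(gap, us))
print(f"naive={naive:.3f} controlled={controlled:.3f}")
```

The naive difference comes out clearly positive while the controlled coefficient sits near zero, even though the toy HR dummy never influences the toy gap.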

Step 3: Day vs. Night Split (a Structure "Emerges")

Question this section answers

If everything were explained by spurious correlation with the S&P 500, why would the result depend on the time of day the game is played?

MLB games come in two flavors: day games (late-night to early-morning Japan time, overlapping S&P 500 trading hours) and night games (Japan daytime, after the S&P 500 has closed). I added the day/night split as a natural extension of the original verification plan (t+0/t+1/t+2 lag analysis). I had not pre-specified the split, but "overlap with S&P 500 trading hours" follows naturally from the causal-channel theory rather than from arbitrary data mining².

If the Ohtani HR effect were pure spurious correlation, the apparent effect should appear roughly equally in both day and night games. Conversely, if a transmission channel through the S&P 500 exists, the effect should appear only in day games — those that overlap with live S&P 500 trading.

Category	HR-day gap difference	t	n (HR / non-HR)	Verdict
Day games (overlap S&P 500)	+0.203%	2.34	107 / 338	Significant (5%)
Night games (S&P 500 closed)	+0.052%	0.72	149 / 435	Not significant

Significant only in day games; the night-game difference (+0.052%, t=0.72) is statistically indistinguishable from zero.

This result allows two interpretations.

Interpretation A: A real sentiment-transmission channel. Ohtani HR → spread on social media / US press → tiny sentiment effect on the S&P 500 during trading hours → propagates to the Nikkei the next day. In this case the S&P 500 is a mediator variable (a variable that sits between the cause and the effect), and putting it on the right-hand side of a regression as a control would eliminate the real effect along with the noise.

Term: Mediator variable — the real effect that disappears when you control for it

When the chain is "cause → intermediate step → effect," the intermediate step is a mediator. For example, in "exercise → muscle gain → higher resting metabolism," muscle mass is the mediator. If you put muscle mass into a regression as a control variable, you might wrongly conclude "exercise has no effect on metabolism." If the S&P 500 is a mediator in our case, controlling for it would erase the genuine Ohtani channel.
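The mediator problem can also be shown on synthetic data. In this sketch, which uses invented coefficients and a toy 50% HR rate, the only path from `hr` to `gap` runs through `spx`. The total effect is real (0.5 × 0.6 = 0.3), yet the coefficient on `hr` collapses once `spx` enters the regression.

```python
import random
from statistics import mean


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def coef_on_x(x, m, y):
    """OLS coefficient on x in y ~ x + m (two-regressor closed form)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    return (sxy * smm - sxm * smy) / (sxx * smm - sxm ** 2)


rng = random.Random(1)
n = 20_000
hr = [float(rng.random() < 0.5) for _ in range(n)]   # cause (toy 50% rate)
spx = [0.5 * h + rng.gauss(0, 1) for h in hr]        # mediator
gap = [0.6 * s + rng.gauss(0, 1) for s in spx]       # effect, via spx only

total = cov(hr, gap) / cov(hr, hr)   # true total effect is 0.3
direct = coef_on_x(hr, spx, gap)     # true direct effect is 0
print(f"total={total:.3f} after_controlling={direct:.3f}")
```

The regression output alone cannot tell this case apart from the confounding case; the difference lies entirely in which way the arrows point.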

Interpretation B: A more refined confounding story. Day games simply happen under different market conditions (weather, weekday distribution, season-stage bias, etc.). The overlap with S&P 500 trading hours is coincidental.

The data we have is not enough to settle which is correct. I also ran a reverse-causation test (does the previous day's market return predict HR days?), and t≈1.5 — not significant. The simple reverse-causation hypothesis "HRs come out on days when the market is strong" was at least not strongly supported, but cannot be entirely ruled out either.

Postseason Amplification

As a side finding, I analyzed the 2024 and 2025 postseasons (33 game days in total). The next-day gap during the Division Series period reached +1.03% (n=9), about five times the regular-season effect. The sample is too small to claim statistical significance, but the direction — "the higher the attention, the larger the effect" — is consistent with the sentiment-transmission hypothesis.

Even when the postseason is included, the post-S&P 500-control t-statistic is 1.37 — still not significant — so the verdict does not change. The 2024 World Series happened to feature a quiet stretch from Ohtani at the plate, so the series-wide effect is muted. If a sentiment effect really exists, the interpretation is "even when the attention 'container' is large, the effect does not appear unless the 'contents' (actual on-field performance) are there."

Three Traps in Statistical Verification

The traps I ran into during this analysis generalize beyond investing — they apply to data analysis broadly.

Trap 1: Without control variables, you miss spurious correlations

The naive t=2.07 is "statistically significant," but adding the S&P 500 drops it to 1.52. Two variables moving together does not imply a causal relationship between them. The same structure causes ice-cream sales and drowning incidents to correlate: both are driven by "summer heat."

In an investment context, whenever you see a social-media claim like "when X happens, the stock goes up," you should always ask "is some other driver of the stock moving at the same time?"

Trap 2: Post-hoc subgroup analysis overstates significance

"Marginal across the full period, but significant in the Dodgers era only," "significant in day games only": narrowing the conditions makes significant results more likely. If you try ten independent subgroups, there is roughly a 40% chance (1 − 0.95^10) that at least one hits 5% significance by pure chance; with twenty subgroups you should expect about one false hit on average (the multiple-testing problem).

In our case the day/night split has theoretical justification (overlap with S&P 500 trading hours), so it is not pure data mining. Even so, a result based on n=107 deserves to be discounted.
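A quick simulation shows how easily subgroup hunting produces a "hit." The sketch below is illustrative only: it draws ten subgroups of pure noise per simulated study, tests each with a z-test (known sigma of 1, an assumption that keeps the code short), and counts how often at least one subgroup clears the 5% bar.

```python
import math
import random

rng = random.Random(42)
Z_CRIT = 1.96        # two-sided 5% threshold
N_SUBGROUPS = 10
N_OBS = 40           # observations per subgroup
N_TRIALS = 1_000     # simulated studies


def any_false_positive():
    """One study: ten subgroups of pure noise, each tested at the 5% level."""
    for _ in range(N_SUBGROUPS):
        sample = [rng.gauss(0, 1) for _ in range(N_OBS)]
        z = (sum(sample) / N_OBS) * math.sqrt(N_OBS)  # z-test with sigma = 1
        if abs(z) > Z_CRIT:
            return True
    return False


rate = sum(any_false_positive() for _ in range(N_TRIALS)) / N_TRIALS
theory = 1 - 0.95 ** N_SUBGROUPS
print(f"simulated={rate:.2f} theory={theory:.2f}")
```

The simulated rate lands close to the theoretical value of about 0.40, which is the discount to keep in mind for any single subgroup result.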

Trap 3: Controlling for a mediator can erase the real effect

Control variables are not "the more you add, the better." Putting a variable that lies on the causal path (a mediator) on the control side will remove the genuine indirect effect along with the noise.

In our case, if "Ohtani HR → sentiment effect on the S&P 500 → propagation to the Nikkei" is the true channel, then controlling for the S&P 500 is the wrong move. Verifying that causal model itself, however, requires more refined methods (mediation analysis, structural equation models), and this dataset alone cannot answer the question.

Final Verdict and Future Verification

Item	Verdict
Naive HR effect	Significant (t=2.07, p=0.038)
After S&P 500 control	Not significant (t=1.52)
Day games only	Significant (t=2.34, HR days 107 / non-HR days 338)
Trading verdict	NO-GO
Phenomenon hypothesis	CONDITIONAL-NO-GO (held in reserve)

The trading verdict is NO-GO. Even if the effect were real, a strategy that consistently captures a +0.2% gap cannot survive once you account for transaction costs and slippage³.

At the same time, "pure noise" cannot be declared either. As 2026 season data accumulates, we will be able to test whether the Dodgers-era day-game effect reproduces out-of-sample. If it does, the sentiment-transmission hypothesis takes a step forward. If it does not, we will conclude that the prior result was overfitting to historical data.

This study ran roughly 45 statistical tests in total. Applying the Bonferroni correction (a strict significance level that accounts for multiple testing), every Ohtani-related result that was individually 5%-significant becomes non-significant. This too is a live example of "Trap 2."

Term: What is the Bonferroni correction?

Repeating a test many times is like buying many lottery tickets that each have a 5% chance of winning: the chance that at least one hits goes up. With 45 independent tests, the chance that at least one is 5%-significant by pure chance is about 90% (1 − 0.95^45). The Bonferroni correction "divides the pass mark by the number of tests": 0.05 ÷ 45 ≈ 0.001 becomes the new bar. Under that bar, our p=0.038 falls far short.
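The two numbers in this box follow from one line of arithmetic each. A minimal sketch, assuming the tests are independent (real tests on overlapping data are not, so treat the 90% as a rough guide); the function names are illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests passes at alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests


n_tests = 45
print(f"{familywise_error_rate(n_tests):.2f}")   # 0.90
print(f"{bonferroni_threshold(n_tests):.4f}")    # 0.0011
print(0.038 < bonferroni_threshold(n_tests))     # False
```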

Data does not lie, but misreading data lets people invent stories on their own. I hope this case study is a useful reference for that point.

Data Sources and Analysis Conditions

Item	Content
Performance data	MLB Stats API (official, no auth required), Player ID 660271
Market data	yfinance (Nikkei 225: `^N225`, S&P 500: `^GSPC`, USD/JPY: `USDJPY=X`, VIX: `^VIX`)
Analysis period	March 2018 (MLB debut) – April 2026
Game count	1,129 regular-season games + 37 postseason games (1,066 days after daily aggregation)
Day/night classification	Based on game start time (UTC); games overlapping S&P 500 trading hours are classified as day games
Next-trading-day lookup	Forward calendar search up to 6 days from the game date (handles holidays / long weekends)
Statistical methods	Welch t-test, bootstrap (standard + block / permutation), OLS regression, Granger causality test, binomial test
Number of tests	About 45 (Bonferroni-corrected significance level: 0.001)
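The next-trading-day lookup in the table can be sketched as a bounded forward search. This is an illustration, not the original code: the function name is hypothetical, the search is assumed to start on the day after the game date, and the small calendar is a hand-made example rather than real exchange data.

```python
from datetime import date, timedelta


def next_trading_day(game_date, trading_days, max_lookahead=6):
    """Return the first trading day strictly after game_date, or None."""
    for offset in range(1, max_lookahead + 1):
        candidate = game_date + timedelta(days=offset)
        if candidate in trading_days:
            return candidate
    return None  # nothing within the window: drop the observation


# Illustrative calendar: a long-weekend closure between 2 May and 7 May.
trading_days = {date(2025, 5, 1), date(2025, 5, 2), date(2025, 5, 7)}
print(next_trading_day(date(2025, 5, 2), trading_days))  # 2025-05-07
```

In practice `trading_days` could be built from the dates present in the downloaded price series, so holidays are handled without a separate holiday table.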

¹ I also checked daily returns (close-to-close) in parallel, but intraday noise is much larger, and the opening gap is a cleaner channel for capturing overnight sentiment shifts.
² Even so, this is post-hoc additional analysis, and the multiple-testing discount still applies. See "Trap 2."
³ Data-handling notes: performance data from the MLB Stats API (official, no auth), market data via yfinance. Tokyo Stock Exchange holidays (including national holidays) are mapped to the next trading day. When a single game produces multiple HRs (e.g., a 2-HR game), the HR dummy is still 1 (the multi-HR bonus is applied on the score side). The opening gap is computed as the log return between the previous trading day's close and the current open.
⁴ I also ran regressions using the buzz score as the explanatory variable, but the main conclusions did not change relative to the HR dummy alone. The correlation between the score and the opening gap was +0.029 for the regular season (extremely weak), making it unlikely that refining the weighting scheme would overturn the conclusions.