Does the Nikkei 225 Rise the Day After an Ohtani Home Run? — Backtest Results and Three Lessons

For / Key Points

For: Investors who have some market experience but are not yet fluent in statistical hypothesis testing. Anyone who has seen the claim "Ohtani hits a home run, Japanese stocks go up." Anyone who wants to learn how to separate a real signal from coincidence in data analysis.

Key Points:

  • The Nikkei 225 opening gap on the day after an Ohtani HR is statistically significant in a naive comparison (t=2.07, p=0.038), but the significance disappears once US equities are controlled for
  • A day-game vs. night-game split reveals a pattern consistent with an "S&P 500 transmission channel," so we cannot dismiss it as pure spurious correlation
  • Unusable as a trading signal, but valuable as a hands-on lesson in spurious correlation, control variables, and multiple testing

Bottom Line: Two Questions, Two Answers

"The Nikkei 225 opens higher the morning after Shohei Ohtani hits a home run" — this claim circulates on social media. I backtested it on every game from his 2018 MLB debut through April 2026 (roughly 1,060 game days, of which about 260 are HR days).

Term: What is a backtest?

A backtest asks "if I had traded according to this rule in the past, would I have made money?" using historical data. Here, the rule is "buy the Nikkei 225 the morning after Ohtani hits a home run." Note that historical performance does not guarantee future results.

The answer splits along two axes.

  • Trading verdict: NO-GO. Cannot be used as an investment signal. The effect size is small, and transaction costs eliminate any profitability.
  • Phenomenon hypothesis: CONDITIONAL-NO-GO (held in reserve). The day/night split reveals a structure that does not look like pure spurious correlation, but the sample is too small to confirm.

What follows walks through the entire verification, illustrating three traps in statistical analysis with concrete examples.


Study Design: What to Measure, and How

Backtest credibility is determined before you run it. If you go hunting after the fact for "conditions that look like an effect," you dramatically inflate the risk of mistaking coincidence for a real signal. The design below was fixed in advance.

Dependent variable: Nikkei 225 opening gap (the percentage change from the previous close to the same day's open). The transmission channel "Ohtani HR → news → reflected in the next morning's opening price" is the most direct one [1].

Term: What is the opening gap?

The number that captures how much the first price of the day (the open) differs from the last price of the previous day (the close). For example, if yesterday's close was 39,000 yen and today's open is 39,100 yen, the opening gap is about +0.26%. Because it absorbs overnight news and overseas market moves, it works well as a proxy for "overnight sentiment."
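As a minimal illustration of the definition above, the sketch below computes the gap both as a simple percentage change and as a log return. The function names are made up for this example and do not come from any library. For gaps this small the two conventions give nearly identical numbers.

```python
import math


def opening_gap_pct(prev_close: float, today_open: float) -> float:
    """Opening gap as a simple percentage change from the previous close."""
    return (today_open / prev_close - 1.0) * 100.0


def opening_gap_log_pct(prev_close: float, today_open: float) -> float:
    """Opening gap as a log return, expressed in percent."""
    return math.log(today_open / prev_close) * 100.0


# The example from the text: close 39,000 yen, next open 39,100 yen.
simple_gap = opening_gap_pct(39_000, 39_100)
log_gap = opening_gap_log_pct(39_000, 39_100)
print(f"simple: {simple_gap:+.3f}%  log: {log_gap:+.3f}%")
```

Both versions round to about +0.26% for this example, so the choice of convention does not change the picture at this scale.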

Independent variable: A 1/0 dummy for "did Ohtani hit a home run?" I also designed a "buzz score" (HR=8 points, pitcher win=6 points, double-digit strikeouts bonus=+5 points, etc.) as a proxy for social-media attention, but I first wanted to see whether the simplest HR dummy alone produced a signal [4].

Control variables: Previous-day S&P 500 return, USD/JPY change, Monday dummy. Without them I cannot rule out the confounding scenario "the US market just happened to be strong on the same day Ohtani performed well."

Period: From 2018 (Ohtani's first MLB year) through April 2026. I split the sample into the Angels era (2018–2023) and the Dodgers era (2024–) to look for structural change.


Step 1: Naive Comparison — At First Glance, the Effect "Exists"

I started with the most naive analysis: compare the next-day Nikkei opening gap on days when Ohtani hit a home run (HR days) versus days when he did not (non-HR days).

Period | HR-day opening gap | Non-HR-day opening gap | Difference
Full period (2018–2026) | +0.138% | +0.023% | +0.115%
Angels era (2018–23) | +0.084% | +0.032% | +0.052%
Dodgers era (2024–) | +0.222% | +0.001% | +0.221%

Across the full period, t=2.07 (p=0.038). By statistical convention this just barely clears the "5% significance level" (t≥1.96), so it is considered "unlikely to be coincidence." But the significance has very little headroom.

Term: t-statistic and p-value — measuring 'unlikely to be coincidence'

The t-statistic measures how large the observed difference is relative to the noise in the data. A larger |t| means the difference is harder to explain by chance. The rough rule for large samples is |t|≥1.96 → "5% significant": if there were no real effect, a difference at least this large would show up less than 5% of the time.

The p-value is the probability that pure chance alone would produce a result at least this extreme. p=0.038 means "out of 100 random draws, this would occur about 3.8 times." p<0.05 is the conventional pass mark in statistics, but the threshold itself has no absolute justification — it is only a convention.
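The two definitions above can be sketched in a few lines of standard-library Python. The names `welch_t` and `two_sided_p_normal` are illustrative, not a library API. The p-value here uses the large-sample normal approximation, an assumption that is reasonable with hundreds of observations per group but not with small samples.

```python
import math
from statistics import mean, variance


def welch_t(a: list, b: list) -> float:
    """Welch t-statistic: difference in means over its standard error,
    without assuming the two groups share the same variance."""
    se_squared = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(se_squared)


def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value under a normal approximation (large-sample assumption)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Converting the t-statistics quoted in this article into approximate p-values.
for t in (2.07, 1.52):
    print(f"t={t:.2f} -> p≈{two_sided_p_normal(t):.3f}")
```

Under this approximation t=2.07 maps to p≈0.038, matching the figure quoted above, while t=1.52 maps to roughly 0.13.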

The Dodgers-era difference is more than four times the Angels-era difference, which is consistent with the much higher Japanese-media exposure that followed his move to the Dodgers.

It is tempting to declare "effect confirmed" here, but the next step changes the picture entirely.


Step 2: Adding Control Variables — The Effect "Vanishes"

Question this section answers

Is the difference seen in the naive comparison caused by Ohtani's performance, or is some third factor driving both?

One of the most important moves in statistical analysis is to add control variables. I put the main drivers of Nikkei moves (S&P 500, USD/JPY, day-of-week effect) into a single regression and asked whether the Ohtani HR effect survives once their influence is netted out.

Term: What is a control variable?

A variable used to "subtract out the influence of other causes." Suppose data shows "days when more people carry umbrellas have more traffic accidents." That is just because "rainy days mean more umbrellas AND slippery roads." Add "is it raining?" as a control variable and the apparent umbrella–accident link disappears. Here, the S&P 500 plays the role of "rain."

Variable | Coefficient | t | Verdict
Previous-day S&P 500 return | +0.634 | 14.67 | Significant
Ohtani HR dummy | +0.077% | 1.52 | Not significant
Monday dummy | -0.214 | -1.89 | 10% significant

The instant the S&P 500 enters the regression, the HR-dummy t-statistic drops from 2.07 to 1.52 — below the significance threshold.

What is going on? On days when the US market is strong, the S&P 500 rises, and the same environment tends to be favorable for Dodgers games as well (atmosphere, team momentum, and other compound factors). The next day, the Nikkei 225 opens higher, pricing in the S&P 500 rally. In this structure "Ohtani HR → Nikkei rise" looks real, but the more likely story is spurious correlation: "strong US market → both S&P 500 rally AND Ohtani performance."

    Candidate common cause (strong US market)
            ↓                  ↓
    Ohtani HR (Outcome A)   Nikkei rise (Outcome B)
            ↓                  ↑
       Apparent correlation (spurious)

At this point the verdict is NO-GO. But the story did not end here.

Term: What is spurious correlation?

A phenomenon where two events appear to "go together," but in fact a third cause is driving both. The classic example: "ice cream sales rise on the same days as drowning incidents." Ice cream does not cause drownings — "hot weather" causes both. The Ohtani HR / Nikkei link could similarly be driven by a third factor, "strong US market."
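To make the spurious-correlation mechanism concrete, here is a small simulation on purely synthetic data (none of these numbers come from the actual study). A hypothetical "US market strength" variable drives both a home-run dummy and the opening gap, and the dummy has no effect of its own. The controlled coefficient is obtained by residualizing both variables on the common cause (the Frisch-Waugh-Lovell approach), which gives the same coefficient as a multiple regression.

```python
import random
from statistics import mean


def slope(x: list, y: list) -> float:
    """OLS slope of y on x (single regressor with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y: list, x: list) -> list:
    """Residuals of y after regressing it on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def simulate_confounding(n: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    us = [rng.gauss(0, 1) for _ in range(n)]                      # common cause
    hr = [1.0 if u + rng.gauss(0, 1) > 0.7 else 0.0 for u in us]  # driven by us
    gap = [0.6 * u + rng.gauss(0, 0.5) for u in us]               # driven by us only
    naive = slope(hr, gap)                                        # HR-day minus non-HR-day mean
    controlled = slope(residuals(hr, us), residuals(gap, us))     # after netting out us
    return naive, controlled


naive, controlled = simulate_confounding()
print(f"naive difference: {naive:+.3f}, after control: {controlled:+.3f}")
```

In this toy setup the naive difference is clearly positive even though the dummy does nothing, and it collapses toward zero once the common cause is controlled for.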


Step 3: Day vs. Night Split — A Structure "Emerges"

Question this section answers

If everything were explained by spurious correlation with the S&P 500, why would the result depend on the time of day the game is played?

MLB games come in two flavors: day games (late-night to early-morning Japan time, overlapping S&P 500 trading hours) and night games (Japan daytime, after the S&P 500 has closed). I added the day/night split as a natural extension of the original verification plan (t+0/t+1/t+2 lag analysis). I had not pre-specified the split, but "overlap with S&P 500 trading hours" follows naturally from the causal-channel theory rather than from arbitrary data mining [2].

If the Ohtani HR effect were pure spurious correlation, the apparent effect should appear roughly equally in both day and night games. Conversely, if a transmission channel through the S&P 500 exists, the effect should appear only in day games — those that overlap with live S&P 500 trading.
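The day/night classification can be sketched as an interval-overlap check. This is an illustration under stated assumptions, not the exact rule used in the study: the 3-hour game length is a hypothetical value, New York is treated as a fixed UTC-4 offset (the MLB season falls almost entirely within US daylight saving time), and weekends and US market holidays are ignored.

```python
from datetime import datetime, time, timedelta, timezone

# Assumptions for illustration only (not taken from the study):
EDT = timezone(timedelta(hours=-4))        # fixed offset, ignores DST transitions
MARKET_OPEN = time(9, 30)                  # regular US equity session, local time
MARKET_CLOSE = time(16, 0)
ASSUMED_GAME_LENGTH = timedelta(hours=3)   # hypothetical typical game duration


def is_day_game(start_utc: datetime) -> bool:
    """True if the game window overlaps the regular US session on the same local date."""
    start_local = start_utc.astimezone(EDT)
    end_local = start_local + ASSUMED_GAME_LENGTH
    session_open = datetime.combine(start_local.date(), MARKET_OPEN, tzinfo=EDT)
    session_close = datetime.combine(start_local.date(), MARKET_CLOSE, tzinfo=EDT)
    # Two intervals overlap when each one starts before the other ends.
    return start_local < session_close and end_local > session_open


afternoon = datetime(2024, 6, 2, 17, 10, tzinfo=timezone.utc)  # 13:10 in New York
evening = datetime(2024, 6, 3, 2, 10, tzinfo=timezone.utc)     # 22:10 in New York
print(is_day_game(afternoon), is_day_game(evening))
```

A production version would also need an exchange calendar so that weekend afternoon games are not counted as overlapping a session that never opened.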

Category | HR-day gap difference | t | n (HR / non-HR) | Verdict
Day games (overlap S&P 500) | +0.203% | 2.34 | 107 / 338 | Significant (5%)
Night games (S&P 500 closed) | +0.052% | 0.72 | 149 / 435 | Not significant

Significant only in day games; the night-game difference is small (+0.052%) and statistically indistinguishable from zero.

This result allows two interpretations.

Interpretation A: A real sentiment-transmission channel. Ohtani HR → spread on social media / US press → tiny sentiment effect on the S&P 500 during trading hours → propagates to the Nikkei the next day. In this case the S&P 500 is a mediator variable (a variable that sits between the cause and the effect), and putting it on the right-hand side of a regression as a control would eliminate the real effect along with the noise.

Term: Mediator variable — the real effect that disappears when you control for it

When the chain is "cause → intermediate step → effect," the intermediate step is a mediator. For example, in "exercise → muscle gain → higher resting metabolism," muscle mass is the mediator. If you put muscle mass into a regression as a control variable, you might wrongly conclude "exercise has no effect on metabolism." If the S&P 500 is a mediator in our case, controlling for it would erase the genuine Ohtani channel.
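The mediator problem can be reproduced with synthetic data as well. In the sketch below (all numbers hypothetical, nothing taken from the study), a home-run dummy moves a stand-in "S&P 500" variable, which in turn moves the opening gap. The dummy has a genuine total effect, yet its coefficient drops to roughly zero once the mediator is controlled for.

```python
import random
from statistics import mean


def slope(x: list, y: list) -> float:
    """OLS slope of y on x (single regressor with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y: list, x: list) -> list:
    """Residuals of y after regressing it on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def simulate_mediation(n: int = 10_000, seed: int = 1):
    rng = random.Random(seed)
    hr = [1.0 if rng.random() < 0.25 else 0.0 for _ in range(n)]  # exogenous cause
    spx = [0.5 * h + rng.gauss(0, 1) for h in hr]                 # mediator, moved by hr
    gap = [0.6 * s + rng.gauss(0, 0.5) for s in spx]              # outcome, moved only via spx
    total = slope(hr, gap)                                        # true total effect is 0.5 * 0.6 = 0.3
    direct = slope(residuals(hr, spx), residuals(gap, spx))       # after controlling the mediator
    return total, direct


total, direct = simulate_mediation()
print(f"total effect: {total:+.3f}, after controlling mediator: {direct:+.3f}")
```

The regression output alone looks the same as in a confounding scenario, which is why the choice between the two readings has to come from the causal model rather than from the coefficients.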

Interpretation B: A more refined confounding story. Day games simply happen under different market conditions (weather, weekday distribution, season-stage bias, etc.). The overlap with S&P 500 trading hours is coincidental.

The data we have is not enough to settle which is correct. I also ran a reverse-causation test (does the previous day's market return predict HR days?), and t≈1.5 — not significant. The simple reverse-causation hypothesis "HRs come out on days when the market is strong" was at least not strongly supported, but cannot be entirely ruled out either.


Postseason Amplification

As a side finding, I analyzed the 2024 and 2025 postseasons (33 game days in total). The next-day gap during the Division Series period reached +1.03% (n=9), about five times the regular-season effect. The sample is too small to claim statistical significance, but the direction — "the higher the attention, the larger the effect" — is consistent with the sentiment-transmission hypothesis.

Even when the postseason is included, the post-S&P 500-control t-statistic is 1.37 — still not significant — so the verdict does not change. The 2024 World Series happened to feature a quiet stretch from Ohtani at the plate, so the series-wide effect is muted. If a sentiment effect really exists, the interpretation is "even when the attention 'container' is large, the effect does not appear unless the 'contents' (actual on-field performance) are there."


Three Traps in Statistical Verification

The traps I ran into during this analysis generalize beyond investing — they apply to data analysis broadly.

Trap 1: Without control variables, you miss spurious correlations

The naive t=2.07 is "statistically significant," but adding the S&P 500 drops it to 1.52. Two variables moving together does not imply a causal relationship between them. The same structure causes ice-cream sales and drowning incidents to correlate: both are driven by "summer heat."

In an investment context, whenever you see a social-media claim like "when X happens, the stock goes up," you should always ask "is some other driver of the stock moving at the same time?"

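Trap 2: Post-hoc subgroup analysis overstates significance

"Marginal across the full period, but significant in the Dodgers era only," "significant in day games only": narrowing the conditions makes significant results more likely. Each test at the 5% level has a 5% chance of a false positive, so if you try ten independent subgroups, the chance that at least one hits 5% significance by pure chance is already about 40%, and with twenty subgroups you should expect about one false positive (the multiple-testing problem).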

In our case the day/night split has theoretical justification (overlap with S&P 500 trading hours), so it is not pure data mining. Even so, a result based on n=107 deserves to be discounted.

Trap 3: Controlling for a mediator can erase the real effect

Control variables are not "the more you add, the better." Putting a variable that lies on the causal path (a mediator) on the control side will remove the genuine indirect effect along with the noise.

In our case, if "Ohtani HR → sentiment effect on the S&P 500 → propagation to the Nikkei" is the true channel, then controlling for the S&P 500 is the wrong move. Verifying that causal model itself, however, requires more refined methods (mediation analysis, structural equation models), and this dataset alone cannot answer the question.


Final Verdict and Future Verification

Item | Verdict
Naive HR effect | Significant (t=2.07, p=0.038)
After S&P 500 control | Not significant (t=1.52)
Dodgers era × day games | Significant (t=2.34, HR days 107 / non-HR days 338)
Trading verdict | NO-GO
Phenomenon hypothesis | CONDITIONAL-NO-GO (held in reserve)

The trading verdict is NO-GO. Even if the effect were real, a strategy that consistently captures a +0.2% gap cannot survive once you account for transaction costs and slippage [3].

At the same time, "pure noise" cannot be declared either. As 2026 season data accumulates, we will be able to test whether the Dodgers-era day-game effect reproduces out-of-sample. If it does, the sentiment-transmission hypothesis takes a step forward. If it does not, we will conclude that the prior result was overfitting to historical data.

This study ran roughly 45 statistical tests in total. Applying the Bonferroni correction (a strict significance level that accounts for multiple testing), every Ohtani-related result that is individually 5%-significant becomes non-significant. This too is a live example of "Trap 2."

Term: What is the Bonferroni correction?

Repeating a test many times is like buying many lottery tickets that each have a 5% chance of winning — the chance that at least one hits goes up. With 45 tests, the chance that at least one is 5%-significant by pure chance is about 90%. The Bonferroni correction "divides the pass mark by the number of tests": 0.05 ÷ 45 ≈ 0.001 becomes the new bar. Under that bar, our p=0.038 falls far short.
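The arithmetic in this box is easy to check. The sketch below assumes the tests are independent, which is a simplification (tests run on overlapping subsets of the same data are correlated), and the function names are illustrative.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests


for k in (1, 10, 45):
    print(
        f"{k:>2} tests: P(at least one false positive) = "
        f"{family_wise_error_rate(k):.2f}, "
        f"Bonferroni bar = {bonferroni_threshold(k):.5f}"
    )

# The naive result from this article against the corrected bar.
naive_p = 0.038
print("survives Bonferroni:", naive_p < bonferroni_threshold(45))
```

With 45 tests the family-wise error rate comes out at about 0.90 and the corrected bar at about 0.0011, so p=0.038 does not survive.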

Data does not lie, but misreading data lets people invent stories on their own. I hope this case study is a useful reference for that point.


Data Sources and Analysis Conditions

Item | Content
Performance data | MLB Stats API (official, no auth required), Player ID 660271
Market data | yfinance (Nikkei 225: ^N225, S&P 500: ^GSPC, USD/JPY: USDJPY=X, VIX: ^VIX)
Analysis period | March 2018 (MLB debut) – April 2026
Game count | 1,129 regular-season games + 37 postseason games (1,066 days after daily aggregation)
Day/night classification | Based on game start time (UTC); games overlapping S&P 500 trading hours are classified as day games
Next-trading-day lookup | Forward calendar search up to 6 days from the game date (handles holidays / long weekends)
Statistical methods | Welch t-test, bootstrap (standard + block / permutation), OLS regression, Granger causality test, binomial test
Number of test families | About 45 (Bonferroni-corrected significance level: 0.001)
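The next-trading-day lookup in the table can be sketched as a bounded forward search. This is a minimal version under assumptions: the search starts on the day after the game date (the original may handle the US/Japan date line differently), and `trading_days` is a hypothetical set of dates that would in practice be built from the index's price history.

```python
from datetime import date, timedelta
from typing import Optional, Set


def next_trading_day(
    game_date: date, trading_days: Set[date], max_days: int = 6
) -> Optional[date]:
    """Return the first trading day within max_days after game_date, else None."""
    for offset in range(1, max_days + 1):
        candidate = game_date + timedelta(days=offset)
        if candidate in trading_days:
            return candidate
    return None  # no session inside the window; drop the observation


# Hypothetical calendar with a long holiday break between two sessions.
calendar = {date(2024, 5, 2), date(2024, 5, 7), date(2024, 5, 8)}
print(next_trading_day(date(2024, 5, 2), calendar))
```

Returning `None` rather than silently extending the window keeps games with no nearby session out of the sample instead of pairing them with a distant trading day.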


  1. I also checked daily returns (close-to-close) in parallel, but intraday noise is much larger, and the opening gap is a cleaner channel for capturing overnight sentiment shifts. 

  2. Even so, this is post-hoc additional analysis, and the multiple-testing discount still applies. See "Trap 2." 

  3. Data-handling notes: performance data from the MLB Stats API (official, no auth), market data via yfinance. Tokyo Stock Exchange holidays (including national holidays) are mapped to the next trading day. When a single game produces multiple HRs (e.g., a 2-HR game), the HR dummy is still 1 (the multi-HR bonus is applied on the score side). The opening gap is computed as the log return between the previous trading day's close and the current open. 

  4. I also ran regressions using the buzz score as the explanatory variable, but the main conclusions did not change relative to the HR dummy alone. The correlation between the score and the opening gap was +0.029 for the regular season — extremely weak — making it unlikely that refining the weighting scheme would overturn the conclusions.