Linear Regression for Panel With Unknown Number of Factors as Interactive Fixed Effects

Published 1 May 2026 in econ.EM | (2605.00614v1)

Abstract: In this paper we study the least squares (LS) estimator in a linear panel regression model with unknown number of factors appearing as interactive fixed effects. Assuming that the number of factors used in estimation is larger than the true number of factors in the data, we establish the limiting distribution of the LS estimator for the regression coefficients as the number of time periods and the number of cross-sectional units jointly go to infinity. The main result of the paper is that under certain assumptions the limiting distribution of the LS estimator is independent of the number of factors used in the estimation, as long as this number is not underestimated. The important practical implication of this result is that for inference on the regression coefficients one does not necessarily need to estimate the number of interactive fixed effects consistently.

Summary

  • The paper demonstrates that overestimating the number of factors does not affect the asymptotic distribution of the LS estimator, ensuring valid inference on β.
  • The methodology employs perturbation analysis and principles from random matrix theory to derive robust limiting distributions under factor misspecification.
  • Empirical simulations and an application to US state-level data confirm that coefficient estimates remain stable when using a conservative upper bound for factors.

Robust Inference in Interactive Fixed Effects Panel Regression with Uncertain Factor Counts

Overview

The paper "Linear Regression for Panel With Unknown Number of Factors as Interactive Fixed Effects" (2605.00614) analyzes the LS estimator in linear panel models with large $N$ and $T$ whose error structure contains an unknown number of interactive fixed effects, i.e., a factor structure. The authors address the practical setting in which estimation uses a number of factors $R$ that may exceed the unknown true number $R^0$. The key contribution is a characterization of the limiting distribution of the regression coefficient estimator in this setting, together with conditions under which overestimating the number of factors does not affect asymptotic inference on the regression parameters.

Model and Theoretical Framework

The baseline model is

$$Y_{it} = \sum_{k=1}^K \beta_k^0 X_{k,it} + \lambda_i^{0\prime} f_t^0 + e_{it}$$

where $Y$, $X_k$, and $e$ are $N \times T$ matrices, and the interactive fixed effects are represented by the loadings $\lambda^0$ ($N \times R^0$) and the factors $f^0$ ($T \times R^0$). Both factors and loadings are treated as fixed (non-random) unknown parameters in a fully deterministic incidental parameter framework. The paper targets LS estimation of $\beta^0$ by minimizing the sum of squared residuals with $R$ factors/loadings, where $R \geq R^0$ is permitted and $R$ may be strictly greater than $R^0$.

The key technical challenges stem from the unknown $R^0$: conventional numerical and theoretical methods for such panel factor models typically require $R^0$ to be fixed and known, especially when deriving distributional results for $\hat\beta$.
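A minimal sketch of how the LS estimator with $R$ factors can be computed, assuming the familiar alternating scheme (a rank-$R$ SVD fit of the residual panel, then pooled OLS on the defactored outcome). The function name `ls_interactive_fe`, the array layout, and the simulated design are illustrative assumptions, not taken from the paper, and the iteration may stop at a local minimum of the non-convex objective.

```python
import numpy as np

def ls_interactive_fe(Y, X, R, n_iter=300, tol=1e-9):
    """Alternating LS sketch. Y: (N, T); X: (K, N, T); R: number of factors used."""
    K, N, T = X.shape
    X_long = X.reshape(K, N * T).T                      # (N*T, K) design matrix
    # Pooled OLS as a starting value (an assumption; other starts are possible).
    beta = np.linalg.lstsq(X_long, Y.ravel(), rcond=None)[0]
    for _ in range(n_iter):
        W = Y - np.tensordot(beta, X, axes=1)           # residual panel, (N, T)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        low_rank = (U[:, :R] * s[:R]) @ Vt[:R]          # best rank-R fit of W
        beta_new = np.linalg.lstsq(
            X_long, (Y - low_rank).ravel(), rcond=None
        )[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical design: one regressor correlated with a single strong factor.
rng = np.random.default_rng(0)
N, T, R0 = 40, 40, 1
beta0 = np.array([1.0])
lam = rng.standard_normal((N, R0))
f = rng.standard_normal((T, R0))
common = lam @ f.T                                      # interactive fixed effects
X = (0.5 * common + rng.standard_normal((N, T)))[None, :, :]
Y = beta0[0] * X[0] + common + rng.standard_normal((N, T))

beta_true_R = ls_interactive_fe(Y, X, R=R0)             # correct number of factors
beta_over_R = ls_interactive_fe(Y, X, R=R0 + 2)         # two redundant factors
print(beta_true_R, beta_over_R)
```

Each pass weakly lowers the sum of squared residuals, since the SVD step is optimal for the factors and loadings given $\beta$ and the OLS step is optimal for $\beta$ given the low-rank fit.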

Main Results and Methodological Advances

Asymptotic Robustness to Factor Misspecification

Principal result: Under strong factor and regularity conditions, when $R \geq R^0$, the LS estimator $\hat\beta_R$ is asymptotically equivalent to the oracle estimator $\hat\beta_{R^0}$, which uses the true number of factors. Explicitly, for $N, T \to \infty$ at proportional rates,

$$\sqrt{NT}\,(\hat\beta_R - \beta^0) = \sqrt{NT}\,(\hat\beta_{R^0} - \beta^0) + o_P(1)$$

Asymptotically, therefore, overestimating the number of factors imposes no efficiency loss on $\hat\beta_R$, and inference procedures (including bias correction) remain valid without consistent estimation of $R^0$, provided $R \geq R^0$.

The asymptotic variance and bias of $\hat\beta_{R^0}$ (and hence of $\hat\beta_R$ for $R \geq R^0$) are characterized explicitly, extending previous results for a known number of factors [Bai 2009; Moon & Weidner 2013].

Supporting Theoretical Innovations

  • Identification under Overfitting: Sufficient conditions are established for identification of $\beta^0$, notably a generalized non-collinearity condition (the regressors must not be explained by low-rank perturbations), which ensures uniqueness of the coefficients while the factors and loadings remain determined only up to rotation.
  • Distributional Theory via Perturbation Analysis: The non-convexity of the least squares objective and its dependence on principal components preclude a naive Taylor expansion. The authors use perturbation theory for linear operators to obtain a quadratic expansion of the (profile) objective in neighborhoods of $\beta^0$, accommodating the non-explicit dependence of the eigenvalues on $\beta$ and the technicalities caused by the multiplicity of zero eigenvalues in the population case.
  • Explicit Rate Results for Weak/Strong Factors: Consistency at $\sqrt{\min(N,T)}$-rate is obtained for $\hat\beta_R$, with convergence holding under weaker conditions even when the factor structure is misspecified (e.g., degenerate or weak factors).
  • High-Level Assumptions Justified by Random Matrix Theory: The conditions needed on the separation and empirical distribution of the sample covariance eigenvalues are related to random matrix theory, invoking approximate Marchenko-Pastur laws and spectral norm bounds. With these tools, the main result is proven for (conditionally) independent, homoskedastic, Gaussian $e_{it}$.
  • Consistency without Precise $R^0$ Estimation: A crucial practical corollary is that consistent estimation of $R^0$ is not necessary for valid asymptotic inference. This circumvents difficulties noted in empirical practice and the simulation literature, where estimators of $R^0$ can be unstable or inconsistent; such errors have little practical impact unless they lead to $R < R^0$.
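The profile objective referred to in the perturbation-analysis item can be written out as follows. This is the standard concentrated form of the least squares problem; the symbols $\mathcal{L}^R_{NT}$, $\beta \cdot X = \sum_{k=1}^K \beta_k X_k$, and $\mu_r[\cdot]$ (the $r$-th largest eigenvalue) are notation assumed here for illustration, and $T \leq N$ is not required.

$$
\mathcal{L}^{R}_{NT}(\beta)
= \min_{\Lambda \in \mathbb{R}^{N \times R},\; F \in \mathbb{R}^{T \times R}}
\frac{1}{NT} \left\| Y - \beta \cdot X - \Lambda F' \right\|_F^2
= \frac{1}{NT} \sum_{r=R+1}^{T} \mu_r\!\left[ (Y - \beta \cdot X)' (Y - \beta \cdot X) \right]
$$

The second equality follows from the Eckart-Young theorem: the best rank-$R$ approximation removes the $R$ largest eigenvalues. Because the remaining eigenvalues are not explicit functions of $\beta$, expanding this objective around $\beta^0$ calls for perturbation theory rather than a direct Taylor expansion.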

Numerical Results and Empirical Implications

Monte Carlo experiments confirm that the asymptotic robustness of $\hat\beta_R$ to overestimation of $R^0$ holds in finite samples for static and dynamic panel designs. When $R < R^0$, the estimator is biased and inconsistent, as expected. For $R \geq R^0$, bias and variance remain stable; some small-sample efficiency losses occur under excessive overfitting when $N$ and $T$ are small or $R$ is large.
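The pattern described above can be reproduced in a small simulation. The sketch below is not the paper's Monte Carlo design: the data-generating process, sample sizes, grid, and function names are all assumptions chosen for illustration. It uses a single regressor and minimizes the eigenvalue form of the LS objective over a grid, so that one eigendecomposition per grid point serves every value of $R$.

```python
import numpy as np

def simulate_panel(N, T, R0, beta0, rng):
    """Hypothetical static design: regressor correlated with R0 strong factors."""
    lam = rng.standard_normal((N, R0))
    f = rng.standard_normal((T, R0))
    common = lam @ f.T
    X = 0.5 * common + rng.standard_normal((N, T))
    Y = beta0 * X + common + rng.standard_normal((N, T))
    return Y, X

def beta_hat_by_R(Y, X, R_values, grid):
    """Grid minimizer of the LS objective for each R (scalar beta only)."""
    N, T = Y.shape
    obj = np.empty((len(grid), len(R_values)))
    for i, b in enumerate(grid):
        W = Y - b * X
        ev = np.linalg.eigvalsh(W.T @ W)                # eigenvalues, ascending
        csum = np.concatenate(([0.0], np.cumsum(ev)))
        for j, R in enumerate(R_values):
            # Sum of the T - R smallest eigenvalues = SSR after removing R factors.
            obj[i, j] = csum[T - R] / (N * T)
    return grid[obj.argmin(axis=0)]

rng = np.random.default_rng(123)
N, T, R0, beta0 = 30, 30, 2, 1.0
R_values = [0, 1, 2, 3, 4, 5]
grid = np.linspace(0.0, 2.5, 251)

draws = np.array([
    beta_hat_by_R(*simulate_panel(N, T, R0, beta0, rng), R_values, grid)
    for _ in range(20)
])
mean_bias = draws.mean(axis=0) - beta0
std_dev = draws.std(axis=0)
for R, b, s in zip(R_values, mean_bias, std_dev):
    print(f"R={R}: mean bias {b:+.3f}, std {s:.3f}")
```

In this design the regressor loads on the omitted factors, so values of $R$ below $R^0 = 2$ should show a clear bias, while the rows for $R \geq 2$ should be close to one another. The grid search limits precision to the grid step and only works for a scalar coefficient.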

An empirical application to US state-level divorce rates and law reforms, extending Wolfers (2006) and Kim & Oka (2014), demonstrates that coefficient estimates stabilize rapidly once $R$ exceeds a minimum threshold: inference is essentially invariant to $R$ once all systematic cross-sectional and time-series variation is absorbed.

Claims Contrasting Prior Findings

  • No Efficiency Penalty for Redundant Factors: Contrary to common intuition and previous practices (e.g., penalized factor selection), there is no asymptotic penalty for including too many factors within the LS framework, provided $R \geq R^0$ and the specified conditions hold, even when the estimation error is solely in the nuisance parameter space.
  • Consistent Estimation without Knowing $R^0$: The paper claims consistent estimation of, and valid inference for, $\beta^0$ even when $R^0$ is unknown and possibly difficult to estimate consistently from the data.

Limitations and Restrictions

The main theorems impose technical assumptions that are stronger than those required when $R^0$ is known:

  • Conditional IID Normality of Errors: The requirement for rotational invariance enables the use of random matrix theory. While this assumption is relaxed in simulations, formal proof in general non-Gaussian settings is currently precluded by gaps in the literature on eigenvector delocalization and eigenvalue spacings for more general error structures.
  • Factor Strength and Non-collinearity: The strong factor assumption ensures that principal components consistently estimate the true space spanned by the factors, and that the “noise” does not contaminate the regressor space used for inference.
  • No Low-Rank (Time-Invariant, Cross-Invariant) Regressors: The theory excludes regressors that have their support entirely in the factor space, e.g., time-invariant or cross-sectional means, without special treatment.
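One way to see why low-rank regressors need special treatment is that a redundant factor can absorb them entirely. The sketch below, under an assumed simulated design with hypothetical names, compares the LS objective far from the true coefficient for a time-invariant (rank-one) regressor and for a full-rank regressor, using one factor more than the true number.

```python
import numpy as np

def profile_objective(beta, Y, X, R):
    """SSR / (N*T) after removing the R largest principal components of Y - beta*X."""
    W = Y - beta * X
    ev = np.linalg.eigvalsh(W.T @ W)                    # eigenvalues, ascending
    return ev[: Y.shape[1] - R].sum() / Y.size

rng = np.random.default_rng(0)
N, T, R0, beta0 = 50, 50, 1, 1.0
lam = rng.standard_normal((N, R0))
f = rng.standard_normal((T, R0))
e = rng.standard_normal((N, T))

x_i = rng.standard_normal((N, 1))
X_low = np.repeat(x_i, T, axis=1)                       # time-invariant regressor
X_full = rng.standard_normal((N, T))                    # regressor varying in i and t
Y_low = beta0 * X_low + lam @ f.T + e
Y_full = beta0 * X_full + lam @ f.T + e

R = R0 + 1                                              # one redundant factor
rank_low = np.linalg.matrix_rank(X_low)
obj_low_true = profile_objective(beta0, Y_low, X_low, R)
obj_low_far = profile_objective(beta0 + 10.0, Y_low, X_low, R)
obj_full_far = profile_objective(beta0 + 10.0, Y_full, X_full, R)
print(rank_low, obj_low_true, obj_low_far, obj_full_far)
```

With the rank-one regressor, moving the coefficient far from its true value barely changes the objective, because the extra factor picks up the misattributed regressor term. With the full-rank regressor the objective rises sharply, which is what identifies the coefficient.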

Implications and Future Directions

The findings have broad relevance for empirical research using high-dimensional panel data models. When the correct number of unobserved factors is unclear, researchers can use a conservative upper bound for $R^0$ without compromising asymptotic inference on the coefficients of interest. This flexibility improves the empirical reliability of panel factor methods, especially in applied econometric work.

The paper's methodological framework opens several avenues for further theoretical investigation:

  • Extension to Non-normal Errors: Relaxation of the independence and normality assumptions, incorporating more general (e.g., heteroskedastic or dependent) error structures, would enhance the robustness and applicability of results in economic panels.
  • Optimal Selection of $R$ in Finite Samples: Development of practical guidance or data-driven methods for “safe overfitting” that balance finite-sample variance inflation versus robustness would be valuable, given the documented finite-sample efficiency trade-offs.
  • Inference under Weak/Non-Identifiable Factors: The outlined framework for consistency accommodates weak and sparse factor structures; further characterization of rates and finite-sample inference in such designs could be important for large panels with many small latent sources of dependence.

Conclusion

This work provides a rigorous theoretical foundation for LS estimation and inference on regression coefficients in panel models with unknown interactive fixed effects, showing under explicit conditions that the estimator's asymptotic law is invariant to overestimation of the factor count. This robustness greatly simplifies practical modeling decisions and advances the understanding of factor-augmented panel regression, with implications across econometric and statistical applications employing interactive effects models.
