Linear Regression for Panel With Unknown Number of Factors as Interactive Fixed Effects

Published 1 May 2026 in econ.EM | (2605.00614v1)

Abstract: In this paper we study the least squares (LS) estimator in a linear panel regression model with unknown number of factors appearing as interactive fixed effects. Assuming that the number of factors used in estimation is larger than the true number of factors in the data, we establish the limiting distribution of the LS estimator for the regression coefficients as the number of time periods and the number of cross-sectional units jointly go to infinity. The main result of the paper is that under certain assumptions the limiting distribution of the LS estimator is independent of the number of factors used in the estimation, as long as this number is not underestimated. The important practical implication of this result is that for inference on the regression coefficients one does not necessarily need to estimate the number of interactive fixed effects consistently.


Authors (2)

Summary

The paper demonstrates that overestimating the number of factors does not affect the asymptotic distribution of the LS estimator, ensuring valid inference on β.
The methodology employs perturbation analysis and principles from random matrix theory to derive robust limiting distributions under factor misspecification.
Monte Carlo simulations and an application to US state-level data confirm that coefficient estimates remain stable when using a conservative upper bound for the number of factors.

Robust Inference in Interactive Fixed Effects Panel Regression with Uncertain Factor Counts

Overview

The paper "Linear Regression for Panel With Unknown Number of Factors as Interactive Fixed Effects" (2605.00614) rigorously analyzes the LS estimator in large-$N$, large-$T$ linear panel models where the error structure incorporates an unknown number of interactive fixed effects, i.e., a factor structure. The authors specifically address the practical setting where estimation employs a potentially misspecified (too large) number $R$ of factors relative to the unknown true count $R^0$. The key contribution is the characterization of the limiting distribution of regression coefficient estimators under this setting, as well as the establishment of conditions under which overestimation of $R$ does not affect asymptotic inference on the regression parameters.

Model and Theoretical Framework

The baseline model is

$Y_{it} = \sum_{k=1}^K \beta_k^0 X_{k,it} + \lambda_i^{0\prime} f_t^0 + e_{it}$

with $Y$, $X_k$, and $e$ of dimension $N \times T$, and the interactive fixed effects (“factors”) represented by $\lambda^0$ ($N \times R^0$) and $f^0$ ($T \times R^0$). Both factors and loadings are treated as fixed (non-random) unknown parameters in a fully deterministic incidental parameter framework. The paper targets LS estimation of $\beta^0$ by minimizing the sum of squared residuals with $R$ factors/loadings, where $R \geq R^0$ is permitted, so that $R$ may be strictly greater than $R^0$.

Key technical challenges derive from the unknown $R^0$ and the fact that conventional numerical and theoretical methods for such panel factor models typically require $R^0$ fixed and known, especially when deriving distributional results for $\hat{\beta}$.
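The LS estimator with $R$ factors described above can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it alternates between a rank-$R$ principal-components fit of the residual panel and a pooled regression on the defactored outcome, which is one common way to search for the LS minimizer. The function name `ls_interactive_fe`, the simulated design, and all tuning values are assumptions made for illustration, and because the objective is non-convex the scheme may stop at a local minimum.

```python
import numpy as np

def ls_interactive_fe(Y, X, R, n_iter=500, tol=1e-10):
    """Alternating least squares sketch for beta with R interactive fixed effects.

    Y: (N, T) outcomes; X: (K, N, T) regressors; R: number of factors used.
    Hypothetical helper for illustration only.
    """
    K, N, T = X.shape
    Xmat = X.reshape(K, N * T).T                      # (N*T, K) stacked design
    beta = np.linalg.lstsq(Xmat, Y.ravel(), rcond=None)[0]   # pooled OLS start
    for _ in range(n_iter):
        W = Y - np.tensordot(beta, X, axes=1)         # residual panel, (N, T)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        low_rank = (U[:, :R] * s[:R]) @ Vt[:R]        # best rank-R approximation
        beta_new = np.linalg.lstsq(Xmat, (Y - low_rank).ravel(), rcond=None)[0]
        done = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if done:
            break
    return beta

# Simulated panel: every design choice below is an illustrative assumption.
rng = np.random.default_rng(0)
N, T, R0, beta0 = 100, 100, 1, 1.0
lam = rng.standard_normal((N, R0))                    # loadings
f = rng.standard_normal((T, R0))                      # factors
common = lam @ f.T                                    # interactive fixed effects
X = (common + rng.standard_normal((N, T)))[None, :, :]   # K = 1, correlated with factors
Y = beta0 * X[0] + common + rng.standard_normal((N, T))

estimates = {R: ls_interactive_fe(Y, X, R)[0] for R in range(0, 4)}
for R, b in estimates.items():
    print(f"R = {R}: beta_hat = {b:.4f}")
```

Running the sketch with $R = 0, 1, 2, 3$ on the same data gives a quick sense of how the estimate reacts to the number of factors used; the regressor is deliberately correlated with the factor structure so that ignoring the factors matters.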

Main Results and Methodological Advances

Asymptotic Robustness to Factor Misspecification

Principal result: Under strong factor and regularity conditions, when $R \geq R^0$, the limiting distribution of the LS estimator $\hat{\beta}_R$ is asymptotically equivalent to that of the oracle estimator $\hat{\beta}_{R^0}$. Explicitly, for $N, T \to \infty$ at proportional rates,

$\sqrt{NT}\,(\hat{\beta}_R - \beta^0) = \sqrt{NT}\,(\hat{\beta}_{R^0} - \beta^0) + o_P(1)$

This asserts that, asymptotically, overestimating the number of factors imposes no efficiency loss on $\hat{\beta}_R$, and inference procedures (including bias correction) remain valid without consistent estimation of $R^0$, provided $R \geq R^0$.

The asymptotic variance and bias of the oracle estimator $\hat{\beta}_{R^0}$ (and hence of $\hat{\beta}_R$ for $R \geq R^0$) are characterized explicitly, extending previous results for known factor count [Bai 2009; Moon & Weidner 2013].

Supporting Theoretical Innovations

Identification under Overfitting: Sufficient conditions are established for identification of $\beta^0$, notably a generalized non-collinearity condition (orthogonality of regressors to low-rank perturbations) sufficient for uniqueness up to rotation/invariance.
Distributional Theory via Perturbation Analysis: The least squares objective’s non-convexity and dependence on principal components preclude a naive Taylor expansion. The authors leverage perturbation theory for linear operators to provide a quadratic expansion of the (profile) objective in neighborhoods around $\beta^0$, accommodating the non-explicit dependence of eigenvalues on $\beta$ and technicalities due to the multiplicity of zero eigenvalues in the population case.
Explicit Rate Results for Weak/Strong Factors: Consistency at $\sqrt{\min(N,T)}$-rate is obtained for $\hat{\beta}_R$, with convergence holding under weaker conditions even when the factor structure is misspecified (e.g., degenerate or weak factors).
High-Level Assumptions Justified by Random Matrix Theory: Necessary conditions on the separation and empirical distribution of the sample covariance eigenvalues are related to random matrix theory, invoking approximate Marchenko-Pastur laws and spectral norm bounds. The main result is proven rigorously for (conditionally) independent, homoskedastic, Gaussian $e_{it}$ under these tools.
Consistency without Precise $R^0$ Estimation: A crucial practical corollary is that consistent estimation of $R^0$ is not necessary for valid asymptotic inference. This circumvents difficulties noted in empirical practice and the simulation literature, where estimators of $R^0$ can be unstable or inconsistent, with little practical impact unless $R < R^0$.
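The dependence of the profile objective on eigenvalues, mentioned in the list above, can be made concrete. By the Eckart-Young theorem, minimizing the sum of squared residuals over $R$ factors and loadings for a fixed $\beta$ leaves the sum of the $T - R$ smallest eigenvalues of $W'W$, where $W = Y - \sum_k \beta_k X_k$. The sketch below checks this identity numerically; the function name `profile_objective` and the random test data are assumptions for illustration.

```python
import numpy as np

def profile_objective(beta, Y, X, R):
    """(1/NT) * sum of the T - R smallest eigenvalues of W'W, W = Y - sum_k beta_k X_k.

    Y: (N, T); X: (K, N, T); beta: scalar or length-K array. Hypothetical helper.
    """
    N, T = Y.shape
    W = Y - np.tensordot(np.atleast_1d(beta), X, axes=1)
    eigvals = np.linalg.eigvalsh(W.T @ W)        # ascending order, length T
    return eigvals[: T - R].sum() / (N * T)

# Arbitrary test data (illustrative assumption, not from the paper).
rng = np.random.default_rng(1)
N, T, R = 30, 20, 2
X = rng.standard_normal((1, N, T))
Y = rng.standard_normal((N, T))
beta = 0.5

# Direct route: residual sum of squares after the best rank-R fit of W.
W = Y - beta * X[0]
U, s, Vt = np.linalg.svd(W, full_matrices=False)
low_rank = (U[:, :R] * s[:R]) @ Vt[:R]
obj_svd = ((W - low_rank) ** 2).sum() / (N * T)

obj_eig = profile_objective(beta, Y, X, R)
objs_by_R = [profile_objective(beta, Y, X, r) for r in range(0, 5)]
print(obj_eig, obj_svd)
print(objs_by_R)
```

Because each eigenvalue moves with $\beta$ only implicitly, the objective has no closed-form derivatives in $\beta$, which is the difficulty the perturbation argument addresses. The last line also shows that the objective can only decrease as $R$ grows, since more factors absorb more of the residual variation.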

Numerical Results and Empirical Implications

Monte Carlo experiments confirm that the asymptotic robustness of $\hat{\beta}_R$ to overestimation of $R^0$ holds in finite samples for static and dynamic panel designs. When $R < R^0$, estimators become biased and inconsistent, as expected. For $R \geq R^0$, bias and variance remain stable; some small-sample efficiency losses occur for excessive overfitting when $N$, $T$ are small or $R$ is large.
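A small Monte Carlo of this kind can be sketched as follows. This is not the paper's simulation design: the static single-regressor model, the sample sizes, the number of replications, and the alternating estimator `ls_interactive_fe` are all assumptions chosen to keep the example short. It uses two true factors and compares $R = 0, 1$ (too few) with $R = 2, 3$ (enough or too many).

```python
import numpy as np

def ls_interactive_fe(Y, X, R, n_iter=300, tol=1e-8):
    """Alternating LS sketch for one regressor; X and Y are (N, T). Hypothetical helper."""
    beta = (X * Y).sum() / (X * X).sum()              # pooled OLS start
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y - beta * X, full_matrices=False)
        low_rank = (U[:, :R] * s[:R]) @ Vt[:R]        # rank-R fit of the residual panel
        beta_new = (X * (Y - low_rank)).sum() / (X * X).sum()
        done = abs(beta_new - beta) < tol
        beta = beta_new
        if done:
            break
    return beta

def simulate(rng, N, T, R0, beta0):
    """Static panel whose regressor is correlated with the factor structure."""
    lam = rng.standard_normal((N, R0))
    f = rng.standard_normal((T, R0))
    common = lam @ f.T
    X = common + rng.standard_normal((N, T))
    Y = beta0 * X + common + rng.standard_normal((N, T))
    return Y, X

rng = np.random.default_rng(123)
N, T, R0, beta0, n_rep = 50, 50, 2, 1.0, 20
draws = {R: [] for R in range(0, 4)}
for _ in range(n_rep):
    Y, X = simulate(rng, N, T, R0, beta0)
    for R in draws:
        draws[R].append(ls_interactive_fe(Y, X, R))

mean_bias = {R: float(np.mean(b) - beta0) for R, b in draws.items()}
std_dev = {R: float(np.std(b)) for R, b in draws.items()}
for R in draws:
    print(f"R = {R}: mean bias = {mean_bias[R]:+.3f}, std = {std_dev[R]:.3f}")
```

With so few replications the numbers are only indicative; a serious experiment would use more draws, several $(N, T)$ pairs, and dynamic designs as well.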

An empirical application to US state-level divorce rates and law reforms, extending Wolfers (2006) and Kim & Oka (2014), demonstrates that coefficient estimates stabilize rapidly for $R$ exceeding a minimum threshold: inference is essentially invariant to $R$ once all systematic cross-sectional and time-series variation is absorbed.

Claims Contrasting Prior Findings

No Efficiency Penalty for Redundant Factors: Contrary to common intuition and previous practices (e.g., penalized factor selection), there is no asymptotic penalty for including too many factors within the LS framework, provided $R \geq R^0$ and under the specified conditions, even when the estimation error is solely in the nuisance parameter space.
Consistent Estimation without Knowing $R^0$: The paper claims consistent estimation of, and valid inference for, $\beta^0$ even when $R^0$ is unknown and possibly difficult to estimate consistently from the data.

Limitations and Restrictions

The main theorems impose technical assumptions that are stronger than those required when $R^0$ is known:

Conditional IID Normality of Errors: The requirement for rotational invariance enables the use of random matrix theory. While this assumption is relaxed in simulations, formal proof in general non-Gaussian settings is currently precluded by gaps in the literature on eigenvector delocalization and eigenvalue spacings for more general error structures.
Factor Strength and Non-collinearity: The strong factor assumption ensures that principal components consistently estimate the true space spanned by the factors, and that the “noise” does not contaminate the regressor space used for inference.
No Low-Rank (Time-Invariant, Cross-Invariant) Regressors: The theory excludes regressors that have their support entirely in the factor space, e.g., time-invariant or cross-sectional means, without special treatment.
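To give a feel for the kind of random matrix fact that the Gaussian assumption makes available, the sketch below checks a standard result: the spectral norm of an $N \times T$ matrix of i.i.d. standard normal entries is close to $\sqrt{N} + \sqrt{T}$, far below its Frobenius norm of roughly $\sqrt{NT}$. This is a generic numerical illustration under assumed dimensions, not a reproduction of the paper's conditions or proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 150                       # illustrative dimensions (assumption)
e = rng.standard_normal((N, T))       # i.i.d. standard normal error matrix

spec_norm = np.linalg.norm(e, 2)      # largest singular value
frob_norm = np.linalg.norm(e, "fro")  # square root of the sum of squared entries
bound = np.sqrt(N) + np.sqrt(T)       # classical benchmark for Gaussian matrices
ratio = spec_norm / bound

print(f"spectral norm  = {spec_norm:.2f}")
print(f"sqrt(N)+sqrt(T) = {bound:.2f}")
print(f"Frobenius norm = {frob_norm:.2f}")
```

The gap between the two norms suggests why a fixed number of redundant principal components can only pick up a vanishing share of the total noise variation, although the formal argument in the paper requires considerably finer control of eigenvalues and eigenvectors.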

Implications and Future Directions

The findings have broad relevance for empirical research using high-dimensional panel data models. In environments where the correct number of unobserved factors is unclear, researchers can safely utilize a conservative upper bound for $R^0$ without compromising inference for coefficients of interest. This flexibility improves the empirical reliability of panel factor methods, especially in applied econometric work.

The paper's methodological framework opens several avenues for further theoretical investigation:

Extension to Non-normal Errors: Relaxation of the independence and normality assumptions, incorporating more general (e.g., heteroskedastic or dependent) error structures, would enhance the robustness and applicability of results in economic panels.
Optimal Selection of $R$ in Finite Samples: Development of practical guidance or data-driven methods for “safe overfitting” that balance finite-sample variance inflation versus robustness would be valuable, given the documented finite-sample efficiency trade-offs.
Inference under Weak/Non-Identifiable Factors: The outlined framework for consistency accommodates weak and sparse factor structures; further characterization of rates and finite-sample inference in such designs could be important for large panels with many small latent sources of dependence.

Conclusion

This work provides a rigorous theoretical foundation for LS estimation and inference on regression coefficients in panel models with unknown interactive fixed effects, showing under explicit conditions that the estimator's asymptotic law is invariant to overestimation of the factor count. This robustness greatly simplifies practical modeling decisions and advances the understanding of factor-augmented panel regression, with implications across econometric and statistical applications employing interactive effects models.
