Training Machine Learning Models on Encrypted Data: A Privacy-Preserving Framework using Homomorphic Encryption

Published 25 Apr 2026 in cs.CR and cs.AI | (2604.23245v1)

Abstract: The use of Machine Learning (ML) for data-driven decision-making often relies on access to sensitive datasets, which introduces privacy challenges. Traditional encryption methods protect data at rest or in transit but fail to secure it during processing, exposing it to unauthorized access. Homomorphic encryption emerges as a transformative solution, enabling computations on encrypted data without decryption, thus preserving confidentiality throughout the ML pipeline. This paper addresses the challenge of training ML models on encrypted data while maintaining accuracy and efficiency by proposing a proof-of-concept for a privacy-preserving framework that leverages Cheon-Kim-Kim-Song (CKKS) for approximate real-number arithmetic. Also, it demonstrates the feasibility of training K-Nearest Neighbors (KNN) and linear regression models on encrypted data, and evaluates encrypted inference for a basic Multilayer Perceptron (MLP) architecture. Experimental results show that models trained under homomorphic encryption achieve performance metrics comparable to plaintext-trained models, validating the approach. However, challenges such as computational overhead, noise management, and limited support for non-polynomial operations persist. This work lays the groundwork for broader adoption of privacy-preserving ML in real-world applications, balancing security with computational feasibility.


Authors (4)

Summary

The paper demonstrates a privacy-preserving ML framework using the CKKS scheme to securely perform training and inference on encrypted data.
It validates key models—linear regression, KNN, and MLP—with performance comparable to plaintext operations through polynomial arithmetic.
The framework exposes trade-offs in noise accumulation, scalability, and computation overhead while suggesting optimization avenues for future research.

Privacy-Preserving Machine Learning via Homomorphic Encryption: A Technical Overview

Motivation and Context

Machine learning workflows frequently process sensitive data, which raises significant privacy concerns, especially when computation is outsourced to third-party servers or cloud platforms. Conventional encryption protects data at rest and in transit but not during computation, leaving it exposed to potentially adversarial actors. Homomorphic encryption (HE) addresses this gap by enabling direct computation on ciphertexts, thereby maintaining confidentiality throughout the ML pipeline. The paper presents a practical framework that uses the CKKS scheme for approximate real-number arithmetic and validates the feasibility of training and inference for key ML models (KNN, linear regression, and MLP) on encrypted datasets (2604.23245).

Privacy-preserving ML paradigms include differential privacy, secure multiparty computation (SMC), and homomorphic encryption. Differential privacy obfuscates dataset-specific information via noise injection, which often results in a suboptimal privacy-accuracy trade-off in limited-data scenarios [wu2025]. SMC supports distributed computation without exposing the dataset, but it is communication- and protocol-intensive and is less suitable when the data is centralized.

HE schemes have evolved from additive-only (e.g., Paillier [paillier1999]) and multiplicative-only (e.g., unpadded RSA) constructions to fully homomorphic encryption (FHE), which enables arbitrary computations, albeit at substantial computational cost. The two primary families for ML are: (1) integer/modular schemes (BFV/BGV), and (2) approximate real-number schemes (CKKS). CKKS is particularly well-suited to ML because it efficiently supports real-valued vector operations built on polynomial arithmetic [ckks_original, huynh2024]. It trades exact precision for a manageable numeric error, which is enough to implement essential ML algorithms under encryption.
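The additive-only and multiplicative-only properties mentioned above can be illustrated with a minimal sketch. It is not part of the paper's framework: the primes, the exponent, and the randomizers are toy values chosen here for illustration, and the RSA variant is the unpadded textbook form, which is insecure in practice.

```python
from math import gcd

# Toy parameters only (assumed for illustration): far too small to be secure.
p, q = 61, 53
n = p * q

# --- Textbook (unpadded) RSA: multiplicative homomorphism ---
phi = (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)          # modular inverse, available in Python 3.8+

def rsa_enc(m):
    return pow(m, e, n)

def rsa_dec(c):
    return pow(c, d, n)

# Multiplying ciphertexts multiplies the underlying plaintexts (mod n).
rsa_product = rsa_dec(rsa_enc(6) * rsa_enc(7) % n)

# --- Paillier with g = n + 1: additive homomorphism ---
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)

def paillier_enc(m, r):
    # r randomizes the ciphertext and must be coprime to n.
    assert gcd(r, n) == 1
    return pow(g, m, n2) * pow(r, n, n2) % n2

def paillier_dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Multiplying ciphertexts adds the underlying plaintexts (mod n).
paillier_sum = paillier_dec(paillier_enc(20, 17) * paillier_enc(22, 23) % n2)

print(rsa_product, paillier_sum)   # 6 * 7 and 20 + 22
```

Each scheme supports only one operation on ciphertexts, which is why neither is sufficient on its own for models that need both sums and products.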

Recent research demonstrates encrypted inference and training in large-scale models, with frameworks capable of fine-tuning LLMs via LoRA adaptation while keeping gradients and data encrypted [frery2025], and encrypted neural network inference using polynomial approximations for non-linear activations [lee2021]. Federated settings utilize multi-key CKKS (MK-CKKS) for privacy-preserving aggregation [ma2022].

Framework Implementation Details

A custom CKKS implementation in Python was developed, covering key generation, encrypted polynomial arithmetic (addition, multiplication, relinearization), and noise management. For demonstration, the polynomial degree was set at $N=8$, much lower than production-level parameters, facilitating rapid experimentation and validation. Encryption transforms real numbers into complex polynomial coefficients, protecting data values under semantic security.
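To make the encoding step more concrete, here is a minimal sketch of the standard CKKS-style canonical-embedding encoder for $N=8$, which packs $N/2 = 4$ values into one integer polynomial. It is an illustration under assumptions, not the paper's code: the scaling factor `DELTA` and the function names are chosen here, and the sketch covers encoding only, so it provides no secrecy by itself.

```python
import numpy as np

N = 8              # polynomial degree used in the demonstration
DELTA = 2 ** 20    # scaling factor (assumed value for this sketch)

# The N roots of X^N + 1 are the odd powers of a primitive 2N-th root of unity.
ROOTS = np.exp(1j * np.pi * (2 * np.arange(N) + 1) / N)
# VANDER[k, j] = ROOTS[k] ** j, so VANDER @ coeffs evaluates the polynomial at every root.
VANDER = np.vander(ROOTS, N, increasing=True)

def encode(values):
    """Pack N/2 real or complex values into N integer coefficients."""
    z = np.asarray(values, dtype=complex)
    # ROOTS[N-1-k] is the conjugate of ROOTS[k]; mirroring the slots with
    # conjugates forces the interpolating polynomial to have real coefficients.
    full = np.concatenate([z, np.conj(z[::-1])])
    coeffs = np.linalg.solve(VANDER, full)
    return np.round(coeffs.real * DELTA).astype(np.int64)

def decode(coeffs):
    """Evaluate at the roots, undo the scaling, and keep the first N/2 slots."""
    return (VANDER @ (np.asarray(coeffs) / DELTA))[: N // 2]

values = np.array([1.5, -2.25, 3.0, 0.125])
recovered = decode(encode(values)).real
print(recovered)
```

Rounding the scaled coefficients to integers is where the approximation error enters; a larger `DELTA` reduces it at the cost of larger coefficients.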

Homomorphic Operations

Addition/Subtraction: Direct coefficient-wise addition or subtraction of ciphertext polynomials. Decrypted sums consistently matched plaintext operations within $10^{-6}$ relative error.
Multiplication (Plaintext/Encrypted): Polynomial multiplication, with degree management via truncation to control noise growth. Ciphertext-ciphertext multiplication was validated using homomorphic distance computation in KNN.
Unit Tests: All basic arithmetic properties were preserved under encryption, confirming correctness for ML use cases.
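The operations listed above can be sketched with a toy symmetric ring-LWE scheme over $\mathbb{Z}_Q[X]/(X^N+1)$. Everything here is an assumption made for illustration: the modulus `Q`, the scale `DELTA`, the error range, and the function names are not taken from the paper, the parameters are not secure, and relinearization and rescaling are skipped by decrypting the three-component product directly and dividing by the scale in plaintext.

```python
import random

N = 8            # ring degree, matching the demonstration setting
Q = 2 ** 60      # ciphertext modulus (illustrative choice)
DELTA = 2 ** 20  # message scaling factor (illustrative choice)

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_mul(a, b):
    """Multiply in Z_Q[X]/(X^N + 1): X^N wraps around to -1."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < N:
                out[i + j] += x * y
            else:
                out[i + j - N] -= x * y
    return [c % Q for c in out]

def centered(a):
    """Map residues to (-Q/2, Q/2] so small negative values decode correctly."""
    return [c - Q if c > Q // 2 else c for c in a]

def keygen(rng):
    return [rng.choice([-1, 0, 1]) for _ in range(N)]

def encrypt(m, s, rng):
    """Return (c0, c1) with c0 + c1*s = m + e (mod Q) for a small error e."""
    c1 = [rng.randrange(Q) for _ in range(N)]
    e = [rng.randint(-3, 3) for _ in range(N)]
    c1s = poly_mul(c1, s)
    c0 = [(mi + ei - x) % Q for mi, ei, x in zip(m, e, c1s)]
    return (c0, c1)

def decrypt(ct, s):
    """Evaluate c0 + c1*s (+ c2*s^2 for an unrelinearized product)."""
    acc, s_pow = ct[0], s
    for c in ct[1:]:
        acc = poly_add(acc, poly_mul(c, s_pow))
        s_pow = poly_mul(s_pow, s)
    return centered(acc)

def ct_add(x, y):
    # Coefficient-wise addition of the ciphertext polynomials.
    return tuple(poly_add(a, b) for a, b in zip(x, y))

def ct_mul(x, y):
    # (b1 + a1*s) * (b2 + a2*s) expanded as a polynomial in s.
    (b1, a1), (b2, a2) = x, y
    return (poly_mul(b1, b2),
            poly_add(poly_mul(b1, a2), poly_mul(a1, b2)),
            poly_mul(a1, a2))

rng = random.Random(0)
s = keygen(rng)
m1 = [3 * DELTA, 1 * DELTA] + [0] * (N - 2)   # scaled polynomial 3 + X
m2 = [2 * DELTA] + [0] * (N - 1)              # scaled constant 2
c1, c2 = encrypt(m1, s, rng), encrypt(m2, s, rng)

sum_coeffs = [round(c / DELTA) for c in decrypt(ct_add(c1, c2), s)]
prod_coeffs = [round(c / DELTA ** 2) for c in decrypt(ct_mul(c1, c2), s)]
print(sum_coeffs, prod_coeffs)
```

The product carries scale `DELTA ** 2` and a third ciphertext component, which is why full CKKS implementations follow multiplication with relinearization and rescaling.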

Model Adaptations

Linear Regression: Supports both encrypted and plaintext data for training and inference. Polynomial regression over encrypted vectors achieves solution equivalence to plaintext regression by decrypting only aggregate ciphertexts before solving the linear system.
KNN Regressor: Distance computation is performed homomorphically. Comparisons (argmin) are non-trivial on encrypted values, but squared distance aggregation is compatible. Prediction accuracy was equivalent to plaintext KNN.
MLP (Inference): Limitations in non-linear activation evaluation required simplification to linear (identity) activation for encrypted inference. Training remained in plaintext due to infeasible encrypted loss calculation. Polynomial approximations for activations are an explicit future research direction.
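The regression and KNN adaptations above rely only on sums and products, which can be sketched in plaintext NumPy. The choice of aggregates ($X^\top X$ and $X^\top y$), the averaging of neighbor targets, and all function names are assumptions made for illustration; the summary does not specify which aggregates the framework decrypts. In an encrypted run, the lines marked as such would operate on ciphertexts.

```python
import numpy as np

def regression_aggregates(X, y):
    """Sums of products only: Gram matrix X^T X and moment vector X^T y."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend an intercept column
    return Xb.T @ Xb, Xb.T @ y                      # would run under encryption

def solve_normal_equations(gram, moment):
    """Solve (X^T X) w = X^T y in plaintext, after the aggregates are decrypted."""
    return np.linalg.solve(gram, moment)

def squared_distances(X_train, query):
    """Squared Euclidean distance uses only subtraction, multiplication, addition."""
    diff = X_train - query
    return np.sum(diff * diff, axis=1)              # would run under encryption

def knn_predict(X_train, y_train, query, k=3):
    d2 = squared_distances(X_train, query)
    nearest = np.argsort(d2)[:k]    # comparison step, assumed to follow decryption
    return float(np.mean(y_train[nearest]))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)

gram, moment = regression_aggregates(X, y)
weights = solve_normal_equations(gram, moment)      # [intercept, w1, w2, w3]
print(weights)
```

Because the aggregates have fixed size (one matrix and one vector), the party holding the key decrypts far less than the full dataset, and the nonlinear steps (the solve and the argmin) stay in plaintext.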

Experimental Evaluation

Experiments were conducted on synthetic datasets and the Boston Housing dataset. Performance metrics (MAE, RMSE, $R^2$) revealed near equivalence between plaintext and encrypted modes:

Linear Regression: RMSE $\approx$ 5.14, $R^2 \approx 0.70$ in both encrypted and plaintext scenarios.
KNN: RMSE $\approx$ 4.79, $R^2 \approx 0.75$ with encrypted data.
MLP: RMSE $\approx$ 5.15, $R^2 \approx 0.70$ for encrypted inference post-plaintext training.

Decryption errors were consistently $<0.1\%$ , and synthetic data tests yielded $10^{-6}$ 0. Runtime penalties were observed for HE settings, most notably during inference for KNN due to increased ciphertext size and operation count. These results confirm functional correctness and preservation of predictive quality under encryption, provided proper parameter tuning.
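For reference, the reported metrics and a plaintext-versus-decrypted comparison can be computed as in the sketch below. The function names and the sample numbers are hypothetical and are not results from the paper; the relative-gap helper assumes the plaintext predictions are nonzero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": 1.0 - ss_res / ss_tot,
    }

def max_relative_gap(plain_pred, decrypted_pred):
    """Largest relative difference between plaintext and decrypted predictions."""
    plain = np.asarray(plain_pred, dtype=float)
    dec = np.asarray(decrypted_pred, dtype=float)
    return float(np.max(np.abs(dec - plain) / np.abs(plain)))

# Hypothetical values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
plain_pred = [2.0, 5.0, 8.0, 9.0]
decrypted_pred = [2.001, 4.999, 8.002, 9.0]

metrics = regression_metrics(y_true, plain_pred)
gap = max_relative_gap(plain_pred, decrypted_pred)
print(metrics, gap)
```

Reporting both the task metrics and the prediction-level gap separates model quality from the numerical error introduced by encryption.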

Model compatibility with encrypted domains varies: linear regression is agnostic to representation, KNN requires consistent data format, and MLP training requires plaintext due to loss and gradient computation constraints.

Limitations and Challenges

Several technical and architectural limitations restrict the framework’s scalability and applicability:

Noise Accumulation: CKKS parameterization must balance security (noise magnitude) and computation depth; excessive noise growth impedes decryption fidelity.
Scalability: Ciphertext size, computational overhead, and memory usage scale unfavorably for large workloads. Production-grade security parameters drastically increase runtime and storage requirements.
Non-Polynomial Operations: Polynomial arithmetic is well-supported; however, comparison operations, non-linear activations, and certain loss functions are infeasible without approximation. Polynomial approximations suffice for some cases, but model expressiveness is ultimately limited.
Trusted Third-Party Dependency: The framework assumes a trusted key manager for encryption and decryption, which creates a single point of failure and a trust bottleneck. Multi-key HE and distributed key management offer alternatives, but at additional complexity.
Preprocessing Constraints: Non-numeric preprocessing (categorical encoding) is not implemented under encryption, limiting the pipeline’s versatility.
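The polynomial-approximation workaround for non-polynomial operations can be sketched as follows. The sigmoid target, the degree, the fitting interval, and the least-squares method are all assumptions chosen for illustration; the paper leaves activation approximation to future work and does not prescribe a method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_activation(degree=3, lo=-4.0, hi=4.0, samples=401):
    """Least-squares polynomial fit of the sigmoid on [lo, hi].

    Returns coefficients ordered from the highest power to the constant term.
    """
    xs = np.linspace(lo, hi, samples)
    return np.polyfit(xs, sigmoid(xs), degree)

def horner(coeffs, x):
    """Evaluate the polynomial with additions and multiplications only,
    the operations an HE scheme such as CKKS can apply to ciphertexts."""
    acc = np.zeros_like(x, dtype=float)
    for c in coeffs:
        acc = acc * x + c
    return acc

coeffs = fit_activation()
xs = np.linspace(-4.0, 4.0, 201)
max_err = float(np.max(np.abs(horner(coeffs, xs) - sigmoid(xs))))
print(coeffs, max_err)
```

A higher degree lowers the error on the interval but consumes more multiplicative depth, and the fit is only meaningful for inputs that stay inside the chosen interval, which is one reason model expressiveness remains limited.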

Notably, key sizes for demonstration fall short of cryptographic standards (e.g., $10^{-6}$ 1 for 128-bit security), and trusted-party dependency is problematic for practical deployments.

Implications and Future Directions

The demonstration of privacy-preserving model training and inference with near-equivalent accuracy supports the practical viability of HE for sensitive data analytics. On the theoretical side, it validates polynomial-arithmetic pipelines for ML that maintain end-to-end data confidentiality, which supports compliance with privacy regulations and expands collaborative data-use scenarios.

Practically, this could stimulate wider adoption in sectors requiring strong privacy guarantees, such as healthcare, finance, and cross-institutional ML, by facilitating shared model development without privacy compromise. Real-world, scalable adoption requires further progress on computational efficiency, cryptographic security parameters, non-polynomial function evaluation, and distributed trust models (e.g., threshold HE, client-side encryption, MK-CKKS).

Future advancements may focus on:

Efficient polynomial approximation schemes for non-linear ML components
Hardware acceleration for CKKS operations
Distributed key management architectures
Optimization of runtime and memory for large-scale ML workflows
Integration with federated learning in high-security, multi-party contexts

Conclusion

The proposed framework demonstrates that homomorphic encryption, specifically CKKS, allows accurate ML model training and inference on encrypted data with negligible error and moderate computational overhead. By leveraging polynomial operations, both linear and instance-based models (regression, KNN) preserve predictive quality; MLP inference is feasible given activation function linearity or approximation. Limitations in scalability, noise management, function approximation, and trust assumptions underscore current boundaries and suggest directions for future technical development. The results validate HE as an effective solution for privacy-preserving ML, establishing groundwork for broader real-world adoption (2604.23245).
