Introduction: Privacy-Preserving Machine Learning
Data is valuable. But data is also sensitive.
Training models requires data, and lots of it: medical records, financial data, browsing history, location data, personal communications.
The dilemma: How to train powerful models without exposing sensitive data?
This is privacy-preserving machine learning’s core challenge.
Why it matters:
- Regulatory: GDPR, CCPA, HIPAA require data protection
- Ethical: Users deserve privacy, trust is essential
- Practical: Sensitive data (medical, financial) often cannot be centralized in the first place
- Competitive: Proprietary data is a competitive advantage that organizations are unwilling to share
This guide covers privacy-preserving ML, from understanding privacy threats, through techniques like federated learning and differential privacy, to practical implementation.
Privacy Challenges in ML
Data Sensitivity
Examples of Sensitive Data:
- Medical records (health conditions, treatments)
- Financial data (income, transactions, debt)
- Biometric data (fingerprints, facial recognition)
- Location data (where people go)
- Behavioral data (what people search, read, watch)
Risk: If exposed, this data can cause real harm to individuals.
Privacy Attacks on Models
Membership Inference: The attacker determines whether a specific person's data was in the training set.
Attacker: "Is patient X's data in the model?"
Method: Check whether the model has overfit to X's data
Result: Unusually high confidence on X's sample, lower confidence on random samples
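As a concrete illustration (not from the original text), here is a minimal sketch of a confidence-thresholding membership test. It assumes a scikit-learn-style classifier exposing predict_proba; the function names and the 0.95 threshold are hypothetical and would need calibration in practice.

```python
import numpy as np

def membership_score(model, x, y_true):
    """Toy membership-inference signal: overfit models tend to assign higher
    confidence to examples they were trained on, so the predicted probability
    of the true label can serve as a membership score."""
    probs = model.predict_proba(x.reshape(1, -1))[0]
    return probs[y_true]

def guess_membership(model, x, y_true, threshold=0.95):
    # Declare "member" when confidence on the true label exceeds a threshold
    # calibrated on data known to be outside the training set.
    return membership_score(model, x, y_true) >= threshold
```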
Model Inversion: The attacker reconstructs approximate training data from the model.
Given a model trained on images
Attacker: "What did the training images look like?"
Reconstruct approximate images from the model's parameters or outputs
Data Extraction: The attacker recovers training data indirectly by querying the model.
Attacker queries the model: "What records match criteria X?"
Enough queries → reconstruct large parts of the underlying database
The Privacy-Utility Trade-off
More Privacy → Less Utility (worse predictions)
More Utility → Less Privacy (more data exposed)
Challenge: Find sweet spot
Privacy Definitions
Formal Privacy
Differential Privacy: Model’s behavior essentially unchanged if we add/remove one person’s data.
Model trained on 1M records: Prediction P1
Model trained on 1M-1 records (remove one person): Prediction P2
P1 and P2 are nearly indistinguishable (ε-differential privacy)
Removing one person's data barely changes the model
Hard to tell whether that person was in the training set
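For reference, the standard formal statement of this guarantee: a randomized algorithm M is ε-differentially private if, for any two datasets D and D′ differing in a single person's record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. Smaller ε means the two output distributions are closer and the individual is better hidden.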
Privacy Budget
Epsilon (ε): Privacy loss parameter
ε = 0: Perfect privacy (no information about individuals)
ε = 1: High privacy (individual's data hard to identify)
ε = 5: Moderate privacy (some information leaks)
ε > 10: Low privacy (significant information leaks)
Lower ε = More privacy, less utility
Higher ε = Less privacy, more utility
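A practical consequence (the basic composition property of differential privacy): privacy loss adds up across releases, so two queries answered at ε = 1 each consume a combined budget of at most ε = 2. This is why deployments track a total privacy budget rather than tuning each query in isolation.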
Federated Learning
Core Idea
Don’t send raw data to central location. Train at data sources, send only model updates.
Process
1. Central server has model
2. Server sends the model to devices/hospitals/organizations
3. Each participant trains locally on its own data (the data is never shared)
4. Participants send only model updates back to the server
5. Server aggregates the updates
6. Repeat
Data never leaves device
Only model updates shared (much less sensitive)
Advantages
- Data stays local (more private)
- Regulatory compliance (don’t centralize sensitive data)
- Better scaling (train where data is)
- Less raw data transfer (only model updates leave the device)
Disadvantages
- Communication overhead (model updates can be large and must be sent repeatedly)
- Synchronization challenges (devices go offline)
- Non-IID data (distributions differ per device)
- Model complexity (limited by device resources)
Example: Keyboard Prediction
Google uses federated learning for keyboard:
Users type on their phones
A local model is trained on each user's typing patterns
Only model updates are sent to Google
Google aggregates updates from millions of users
The improved keyboard model is pushed back to devices
A user's typing data never leaves their phone
The model still learns from millions of users while preserving individual privacy
Aggregation Algorithms
FedAvg (Federated Averaging):
1. Each device trains locally
2. Send updated weights to server
3. Server averages the weights from all devices (typically weighted by each device's number of training examples)
4. Send averaged weights back
5. Repeat
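A minimal NumPy sketch of the averaging step, assuming each client reports a flat weight vector and its local example count (the function and variable names are illustrative, not a specific framework's API):

```python
import numpy as np

def fed_avg(client_weights, client_num_examples):
    """Federated averaging: combine client weight vectors into one global
    model, weighting each client by how much data it trained on."""
    coeffs = np.array(client_num_examples, dtype=float)
    coeffs /= coeffs.sum()                                  # per-client weight
    stacked = np.stack(client_weights)                      # (num_clients, num_params)
    return (coeffs[:, None] * stacked).sum(axis=0)          # weighted average

# Example: three clients with different amounts of local data
clients = [np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.2, 0.2])]
global_weights = fed_avg(clients, client_num_examples=[100, 400, 500])
```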
Secure Aggregation:
Devices encrypt or mask their updates before sending
The server aggregates the encrypted values
Only the aggregate is decrypted
Individual updates are never visible to the server
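A toy sketch of the masking idea behind secure aggregation, assuming every pair of clients can agree on a shared random mask (real protocols derive the masks via key exchange and handle client dropout; this is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_updates(updates):
    """Pairwise masking: every pair of clients shares a random mask that one
    adds and the other subtracts. The masks hide each individual update but
    cancel out in the sum."""
    masked = [u.astype(float).copy() for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.normal(size=updates[i].shape)
            masked[i] += mask
            masked[j] -= mask
    return masked

updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = mask_updates(updates)
# The server sees only masked updates, yet their sum equals the true sum.
assert np.allclose(sum(masked), sum(updates))
```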
Differential Privacy
Adding Noise
Protect individuals by adding noise to computations.
Query: "What's average salary in company?"
True answer: $100K
With differential privacy:
Add noise drawn from a Laplace distribution: $100K + Laplace(0, scale), where the scale is set by the query's sensitivity divided by ε
Result: an answer that is usually within a few thousand dollars of the truth (the noise is unbounded but concentrated near zero)
Gives an approximate answer while protecting any individual's salary
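A minimal sketch of the Laplace mechanism for a bounded-mean query. It assumes values are clipped to a known range so the sensitivity can be computed; the salary figures and bounds are made up for illustration.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng=None):
    """Laplace mechanism for a bounded mean: clip values to [lower, upper],
    then add Laplace noise scaled to the query's sensitivity divided by epsilon."""
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    true_mean = clipped.mean()
    # Changing one person's value can shift the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

salaries = np.array([90_000, 105_000, 98_000, 120_000, 87_000])
print(private_mean(salaries, lower=0, upper=200_000, epsilon=1.0))
```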
Implementation: DP-SGD
Differential Privacy applied to Stochastic Gradient Descent.
Standard SGD:
1. Compute gradient for batch
2. Update weights by gradient
DP-SGD:
1. Compute per-example gradients for the batch
2. Clip each example's gradient norm (bounds any one example's influence)
3. Add Gaussian noise to the summed gradients
4. Update weights with the noisy average gradient
5. Repeat
Noise protects individual gradients
Prevents model from memorizing specific data
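A minimal NumPy sketch of one DP-SGD step, assuming per-example gradients are already available. Real implementations (e.g., Opacus or TensorFlow Privacy) do this inside the training loop, and the noise_multiplier must be chosen with a privacy accountant; the values here are illustrative.

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One DP-SGD update: clip each example's gradient to bound its influence,
    sum, add Gaussian noise calibrated to the clip norm, then take a step."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    grad_sum = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad_sum.shape)
    noisy_mean = (grad_sum + noise) / len(per_example_grads)
    return weights - lr * noisy_mean
```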
Trade-off
More noise → More privacy, worse model
Less noise → Less privacy, better model
Requires careful tuning.
Encrypted Computation
Train models on encrypted data without decrypting.
Homomorphic Encryption
Allows computation on encrypted data.
Encrypt data: E(X)
Compute on ciphertexts: E(X) + E(Y) = E(X + Y)
Decrypt the result: D(E(Z)) = Z
Computation happens entirely on encrypted data
The server never sees the raw data
Advantage: Maximum privacy (data never decrypted)
Disadvantage: Computationally expensive (1000x slower)
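A small example of additively homomorphic encryption using the python-paillier package (assumed installed as `phe`). Paillier supports adding ciphertexts and multiplying them by plaintext constants, which is enough for encrypted aggregates but far from arbitrary model training.

```python
# pip install phe  (assumption: the python-paillier package is available)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Data owners encrypt their values before sending them anywhere.
salary_a = public_key.encrypt(100_000)
salary_b = public_key.encrypt(120_000)

# The server adds ciphertexts without ever seeing the plaintext salaries.
encrypted_total = salary_a + salary_b

# Only the private-key holder can decrypt the aggregate.
print(private_key.decrypt(encrypted_total))  # 220000
```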
Secure Multi-Party Computation (MPC)
Multiple parties compute together without revealing data.
Party A has data X
Party B has data Y
Jointly compute f(X, Y) without revealing X or Y
Example: Auction
Three people bidding, want highest bid but don’t reveal amounts.
Person A's bid (secret): $100
Person B's bid (secret): $150
Person C's bid (secret): $120
Protocol: Each person splits their bid into cryptographic shares and the parties compute jointly on the shares
Result: Person B wins ($150), but no one learns the other bids
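A toy sketch of the additive secret sharing that underlies such protocols. For simplicity it computes the sum of the bids rather than the maximum (comparisons like the auction's winner require more elaborate MPC machinery); the modulus and names are illustrative.

```python
import random

MODULUS = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to the secret mod MODULUS.
    Any subset of fewer than n shares reveals nothing about the secret."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

bids = [100, 150, 120]                          # each party keeps its own bid secret
all_shares = [share(b, n_parties=3) for b in bids]

# Party j locally sums the j-th share of every bid ...
partial_sums = [sum(col) % MODULUS for col in zip(*all_shares)]
# ... and only the combined result (the total of all bids) is revealed.
total = sum(partial_sums) % MODULUS
print(total)  # 370 -- no individual bid was disclosed
```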
Implementation
Complex cryptographic protocols. Active research area.
Challenges:
- Computationally expensive
- Communication overhead
- Complexity (hard to implement correctly)
Privacy-Preserving Inference
Using models while protecting privacy of both data and model.
Client-Side Inference
Model on user’s device, not server.
User data: Stays on device
Model: On user's device
Inference: Happens on device
Result: Sent to server
Data is never shared with the server
Privacy is maximized
Latency is minimal (no network round-trip for inference)
Encrypted Inference
Inference on encrypted data.
User encrypts input
Sends to server
Server computes on encrypted input (homomorphic encryption)
Sends encrypted output back
User decrypts result
Server never sees plaintext data
User privacy protected
Regulatory Landscape
GDPR (Europe)
Requirements:
- “Right to be forgotten” (delete data)
- Data minimization (collect minimal data)
- Purpose limitation (use only as stated)
- Lawful basis (need reason for collection)
Impact on ML:
- Can’t keep training data forever
- Model shouldn’t overfit (memorize individuals)
- Clear disclosure of data use
CCPA (California)
Requirements:
- Users know what’s collected
- Users can request data
- Users can opt-out of sale
- Transparency about AI
HIPAA (Healthcare, US)
Requirements:
- Protected health information safeguarded
- Breach notification required
- Patient consent for certain uses
Practical Considerations
Trade-offs
Privacy vs Utility:
Maximum privacy: Model barely works
Maximum utility: Individual privacy exposed
Practical: Balance based on risk tolerance
Privacy vs Efficiency:
Federated learning: More private, more communication
Centralized: Less private, more efficient
Privacy vs Interpretability:
More privacy mechanisms → Harder to explain why a decision was made
A balance is needed
Implementation Challenges
Federated Learning:
- Device dropout (devices go offline)
- Non-IID data (distributions differ)
- Communication costs
- Synchronization
Differential Privacy:
- Noise tuning (how much is enough?)
- Privacy budget management (total privacy across queries)
- Utility degradation
Best Practices
- Minimize data collection: Collect only what’s needed
- Federate when possible: Train locally, aggregate centrally
- Use differential privacy: Add noise, protect individuals
- Encrypt sensitive data: In transit and at rest
- Audit regularly: Check for privacy leaks
- Transparency: Tell users about privacy practices
Key Takeaways
✓ Privacy-utility trade-off real – Can’t have both maximized
✓ Federated learning powerful – Train locally, aggregate centrally
✓ Differential privacy practical – Add noise, protect individuals
✓ Encrypted computation possible – But expensive
✓ MPC works – But complex, expensive
✓ Regulations require privacy – GDPR, CCPA, HIPAA enforce it
✓ Multiple approaches – Use combination for best results
✓ Data minimization principle – Collect only what’s needed
✓ Privacy by design – Build privacy in from start
✓ Active research area – Techniques improving, costs decreasing
Frequently Asked Questions
Q: Is federated learning secure?
A: More private than centralized, but not perfect. Updates can leak information. Use with differential privacy for better protection.
Q: How much differential privacy is enough?
A: Depends on sensitivity. ε=1 high privacy, ε=10 low. Most use ε between 1-5.
Q: Is homomorphic encryption practical?
A: Not yet. 1000x slower than regular computation. Active research for improvement.
Q: Can I use privacy-preserving ML for my use case?
A: Depends. Cost-benefit analysis: privacy benefits vs. utility loss and computational cost.
Q: Do I need privacy-preserving ML?
A: If you handle sensitive data (health, finance), face privacy regulations (GDPR, HIPAA), or serve privacy-sensitive users → yes.

