Learning Content
Module Overview
Testing AI systems requires fundamentally different approaches than traditional software. You can't write unit tests asserting "output should be X" when the output is a probabilistic prediction. Yet quality assurance is critical—especially for Malta businesses in regulated industries (iGaming, FinTech) where AI errors can trigger regulatory violations or financial losses.
This module teaches comprehensive testing strategies for AI systems, from data validation through model performance testing to production monitoring. You'll learn how to detect problems before they impact users and maintain high-quality AI systems over time.
🔑 Key Concept: AI Testing is Probabilistic
Traditional software: test expects deterministic output ("function(5) returns 25"). AI: test expects statistical performance ("model achieves 85%+ accuracy on test set"). This fundamental difference requires new testing mindsets and tools.
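To make this concrete, here is a minimal sketch of what a "statistical" test can look like, written pytest-style with scikit-learn; the synthetic dataset, the random-forest model, and the 85% threshold are illustrative stand-ins, not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def test_model_meets_accuracy_threshold():
    # Stand-in data and model; in practice, load your frozen holdout set and candidate model
    X, y = make_classification(n_samples=2000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    # Traditional test: assert f(5) == 25. AI test: assert a statistical performance floor.
    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= 0.85, f"accuracy {accuracy:.3f} fell below the 0.85 floor"
```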
The Five Layers of AI Testing
Layer 1: Data Quality Testing
Why Critical: "Garbage in, garbage out." Bad data guarantees bad models, regardless of algorithm sophistication.
Data Tests to Implement:
- Schema Validation: Are expected columns present with correct data types? (Tools: Great Expectations, pandera)
- Completeness Tests: Are missing value rates within acceptable thresholds? (e.g., "customer_email" field <5% missing)
- Range/Value Tests: Are values within expected ranges? (e.g., age between 18-100, not negative or 999)
- Distribution Tests: Is data distribution consistent with historical patterns? (detect sudden shifts)
- Referential Integrity: Do foreign keys match? (e.g., every transaction links to valid customer_id)
- Business Logic Tests: Do values satisfy business rules? (e.g., deposit_amount ≥ withdrawal_amount for a given period)
Example Test Suite (Python with Great Expectations):
```python
import great_expectations as ge  # classic pandas-dataset API (pre-1.0)
import pandas as pd

dataset = ge.from_pandas(pd.read_csv("customers.csv"))  # illustrative input file
dataset.expect_column_values_to_be_between("age", min_value=18, max_value=100)
dataset.expect_column_values_to_be_in_set("country", ["Malta", "UK", "Germany", ...])
dataset.expect_column_values_to_not_be_null("email", mostly=0.95)  # allow 5% missing
dataset.expect_column_mean_to_be_between("transaction_amount", min_value=20, max_value=150)
```
Layer 2: Model Training Testing
Purpose: Ensure model training process is reproducible and performs as expected.
Tests to Implement:
- Reproducibility Test: Given same data and hyperparameters, does training produce same model? (Set random seeds)
- Overfitting Detection: Is training accuracy much higher than validation accuracy? Red flag for overfitting (see the sketch after this list).
- Minimum Performance Threshold: Does model beat baseline? (e.g., "accuracy > naive baseline by 10%")
- Training Convergence: Does loss decrease smoothly? Or erratic behavior indicating problems?
- Feature Importance Sanity Check: Are top features logically important? Or nonsensical correlations?
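As referenced in the list above, here is a minimal sketch of two such training tests, assuming scikit-learn; the gradient-boosting model, the synthetic data, and the 10-point overfitting gap are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)  # stand-in training data
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def test_training_is_reproducible():
    # Same data, same hyperparameters, fixed random_state => identical predictions
    m1 = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    m2 = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    assert np.array_equal(m1.predict(X_val), m2.predict(X_val))

def test_no_severe_overfitting():
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    gap = model.score(X_train, y_train) - model.score(X_val, y_val)
    # Illustrative rule: flag if training accuracy beats validation by more than 10 points
    assert gap < 0.10, f"train/validation accuracy gap of {gap:.2f} suggests overfitting"
```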
Layer 3: Model Performance Testing
Purpose: Validate model meets performance requirements before production deployment.
Accuracy Metrics (Choose Based on Problem Type):
- Classification: Accuracy, Precision, Recall, F1, AUC-ROC
- Regression: MAE, RMSE, R²
- Ranking: NDCG, MAP
- Business Metrics: Revenue impact, cost savings (ultimate success measure)
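As a quick reference, here is a sketch of computing the classification metrics above with scikit-learn; the labels and scores are toy values.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # toy hard predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]   # toy predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))  # uses scores, not hard labels
```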
Stratified Testing: Test performance across critical segments (a worked sketch follows this list)
- By User Segment: Does model work equally well for VIP vs. casual players? New vs. long-term customers?
- By Geography: Malta vs. UK vs. Germany performance (if operating multi-market)
- By Time Period: Recent data vs. old data (model may overfit recent patterns)
- By Edge Cases: High-value transactions, unusual player behavior, rare events
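A minimal sketch of the stratified check referenced above, assuming pandas and an evaluation DataFrame with per-row labels plus a segment column (column names, values, and the 0.60 floor are illustrative).

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Illustrative evaluation frame: one row per scored case, tagged with its segment
results = pd.DataFrame({
    "segment": ["VIP", "VIP", "VIP", "casual", "casual", "casual"],
    "y_true":  [1, 0, 1, 0, 1, 0],
    "y_pred":  [1, 0, 1, 0, 0, 0],
})

# Overall accuracy can hide a weak segment, so score each segment separately
for segment, group in results.groupby("segment"):
    accuracy = accuracy_score(group["y_true"], group["y_pred"])
    print(f"{segment}: {accuracy:.2f}")
    assert accuracy >= 0.60, f"{segment} fell below the illustrative 0.60 floor"
```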
Fairness Testing (GDPR/Ethical Requirement):
- Does model discriminate based on protected attributes (gender, age, nationality)?
- Are error rates similar across demographic groups?
- Tools: AI Fairness 360 (IBM), Fairlearn (Microsoft)
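One way to run such a check is with Fairlearn's MetricFrame; the sketch below uses toy labels, a toy gender attribute, and an illustrative tolerance on the recall gap between groups.

```python
from fairlearn.metrics import MetricFrame
from sklearn.metrics import recall_score

# Toy data: ground truth, predictions, and a protected attribute per record
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
gender = ["F", "F", "F", "F", "M", "M", "M", "M"]

frame = MetricFrame(metrics=recall_score, y_true=y_true, y_pred=y_pred,
                    sensitive_features=gender)
print(frame.by_group)      # recall per gender group
print(frame.difference())  # largest recall gap between groups
# Illustrative tolerance: error rates should be broadly similar across groups
assert frame.difference() <= 0.25
```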
Layer 4: Integration & System Testing
Purpose: Ensure AI system integrates correctly with business applications and handles production scenarios.
Tests to Implement:
- API Contract Testing: Does API return expected response format for various inputs? (see the sketch after this list)
- Load Testing: Can system handle expected prediction volume? (e.g., 1,000 predictions/sec for real-time fraud detection)
- Latency Testing: Are predictions fast enough for user experience? (e.g., <100ms for real-time recommendations)
- Error Handling: What happens with malformed inputs, missing features, or downstream system failures?
- Rollback Testing: Can we quickly revert to previous model version if issues arise?
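Here is a sketch of a combined contract and latency test using pytest-style assertions and the requests library; the /predict endpoint, payload fields, response fields, and the 100ms budget are hypothetical assumptions about your serving API.

```python
import time
import requests

PREDICT_URL = "http://localhost:8080/predict"  # hypothetical prediction endpoint

def test_predict_contract_and_latency():
    payload = {"transaction_amount": 120.0, "country": "Malta"}  # illustrative features
    start = time.perf_counter()
    response = requests.post(PREDICT_URL, json=payload, timeout=2)
    latency_ms = (time.perf_counter() - start) * 1000

    # Contract: well-formed response containing the fields downstream systems expect
    assert response.status_code == 200
    body = response.json()
    assert {"prediction", "score", "model_version"} <= set(body)
    # Latency: stays within the real-time budget
    assert latency_ms < 100, f"prediction took {latency_ms:.0f} ms"
```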
Layer 5: Production Monitoring & Testing
Purpose: Detect model degradation and issues in live production environment.
Continuous Monitoring:
- Model Performance Metrics: Track accuracy, precision, recall daily/weekly. Alert if any metric drops below its threshold.
- Data Drift Detection: Is input data distribution changing over time? (Concept: training data from 2023, now predicting on 2025 data—patterns may have shifted; see the drift-check sketch after this list)
- Prediction Drift: Is output distribution changing? (e.g., suddenly predicting 80% high-risk vs. historical 20%)
- Canary Testing: Deploy new model to 5-10% of traffic first. Compare performance to existing model before full rollout.
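For the data drift check referenced above, one common approach is a two-sample Kolmogorov-Smirnov test per numeric feature; the sketch below uses SciPy with synthetic "training" and "live" samples and an illustrative 0.01 p-value threshold.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_amounts = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)  # reference distribution
live_amounts = rng.lognormal(mean=3.4, sigma=0.5, size=10_000)      # this week's production data

statistic, p_value = ks_2samp(training_amounts, live_amounts)
# Illustrative rule: a very small p-value means the live distribution has shifted -> alert/retrain
if p_value < 0.01:
    print(f"Data drift detected for transaction_amount (KS={statistic:.3f}, p={p_value:.1e})")
```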
Testing Strategy for Malta Regulated Industries
iGaming (MGA Compliance)
Mandatory Tests:
- Responsible Gaming Detection: Test AI accurately flags at-risk gambling behavior (per MGA requirements). False negatives (missed problem gamblers) are a regulatory risk.
- Fairness Validation: Test AI personalization doesn't manipulate players or unfairly advantage the house
- Explainability Testing: Can you explain why a specific player was flagged? (Regulatory audit requirement)
- Audit Trail Validation: All AI decisions logged with timestamps, inputs, outputs, and model version (a minimal logging sketch follows this list)
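A minimal sketch of the kind of append-only audit record such a trail could store, using only Python's standard json and logging modules; the field names, model version, and log destination are illustrative, not an MGA-prescribed format.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("ai_audit")
audit_logger.addHandler(logging.FileHandler("ai_decisions.log"))  # illustrative sink
audit_logger.setLevel(logging.INFO)

def log_decision(model_version, features, prediction, explanation):
    # One record per AI decision: timestamp, inputs, output, model version, explanation
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "inputs": features,
        "prediction": prediction,
        "explanation": explanation,
    }))

log_decision("fraud-v1.3", {"amount": 950, "country": "MT"}, "flagged", "unusual amount for merchant")
```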
FinTech (MFSA Compliance)
Mandatory Tests:
- AML Detection Accuracy: Test AI catches suspicious transactions. False negatives (missing money laundering) trigger MFSA fines.
- Credit Fairness: Test credit scoring AI doesn't discriminate based on protected characteristics (age, gender, nationality)
- Model Explainability: Test that credit/loan decisions can be explained to customers (GDPR right to explanation)
- Regulatory Reporting: Test AI generates required regulatory reports accurately
Malta Case Study: FinTech Fraud Detection Testing
Company: Malta payment processor with AI fraud detection (from previous module)
Testing Approach (Production-Ready):
Phase 1: Pre-Deployment Testing
- Historical Data Testing: Model tested on 6 months of historical transactions (250K transactions, 380 confirmed fraud cases)
- Accuracy: 94.2%
- Precision: 89.1% (avoiding too many false alarms)
- Recall: 91.7% (catching most fraud)
- Performance met threshold: >90% recall (can't miss fraud)
- Stratified Testing: Performance checked across segments:
- High-value transactions (>€500): 92.3% accuracy ✓
- Low-value transactions (<€50): 94.8% accuracy ✓
- New merchants: 88.9% accuracy (acceptable, limited training data)
- Established merchants: 95.1% accuracy ✓
- Explainability Testing: Randomly sampled 100 fraud predictions. MAIA's neurosymbolic reasoning provided clear explanations:
- Example: "Flagged as fraud due to: (1) IP address from high-risk country, (2) email domain created 2 days ago, (3) unusual purchase pattern for merchant category"
- All 100 explanations logically sound—passed manual review
Phase 2: Shadow Mode Testing (2 Weeks)
- AI ran on live transactions BUT didn't block anything (predictions logged only)
- Compared AI predictions to actual fraud outcomes
- Results: 93.8% accuracy in production (close to test set performance—good sign)
- Detected 12 fraud cases that existing rules-based system missed (AI superior)
Phase 3: Canary Deployment (2 Weeks)
- AI activated for 10% of transactions (randomly selected)
- 90% still used old rules-based system
- Monitored false positive rate (legitimate transactions wrongly blocked)
- Result: 2.1% false positive rate (acceptable; old system had 4.3%)
- Zero customer complaints about wrongly blocked transactions
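The case study does not describe the routing mechanism, but one common way to implement a canary split like this is deterministic hash-based bucketing, sketched below with an illustrative 10% fraction.

```python
import hashlib

CANARY_FRACTION = 0.10  # illustrative 10% canary slice

def routed_to_new_model(transaction_id: str) -> bool:
    # Hash the transaction ID into 100 buckets; the same ID always lands in the same bucket
    bucket = int(hashlib.sha256(transaction_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_FRACTION * 100

print(routed_to_new_model("txn-000123"))
```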
Phase 4: Full Rollout + Continuous Monitoring
- AI activated for 100% of transactions
- Daily Monitoring:
- Fraud catch rate: 94% (target: >90%) ✓
- False positive rate: 2.1% (target: <3%) ✓
- Prediction latency: 45ms average (target: <100ms) ✓
- Weekly Drift Monitoring:
- Input data distribution checked (transaction amounts, merchant categories, geographies)
- Week 8: Detected distribution shift (sudden spike in cryptocurrency merchant transactions)
- Response: Retrained model with recent data including crypto transactions. Accuracy restored.
MFSA Audit (Month 6):
- Auditors reviewed AI system
- Requested explanation for 20 random fraud blocks
- MAIA's neurosymbolic reasoning provided clear, auditable explanations for all 20
- Audit passed with commendation for transparency and testing rigor
Testing Success Factors:
- Multi-phase rollout (pre-deployment → shadow → canary → full) de-risked production issues
- Continuous monitoring caught drift early (Week 8) before accuracy degraded significantly
- Explainability testing ensured regulatory compliance (critical for MFSA audit)
- Stratified testing revealed performance across important segments (high-value vs. low-value transactions)
Testing Checklist: Pre-Production
Before deploying AI to production, verify these tests pass:
- ☐ Data Quality: All data validation tests pass (schema, completeness, ranges, distributions)
- ☐ Model Performance: Accuracy meets target on holdout test set
- ☐ Stratified Performance: Acceptable performance across key user/business segments
- ☐ Fairness: No discrimination on protected attributes (GDPR requirement)
- ☐ Explainability: Can explain predictions (especially for regulated industries)
- ☐ API Integration: Correct response format, error handling, versioning
- ☐ Load/Latency: Meets performance requirements under expected traffic
- ☐ Shadow Mode: Tested on live data without impacting users (2+ weeks)
- ☐ Canary Deployment: Tested on 5-10% traffic without issues (1-2 weeks)
- ☐ Monitoring Infrastructure: Dashboards, alerts, logging in place
- ☐ Rollback Plan: Tested and documented process to revert to previous model
- ☐ Regulatory Compliance: Audit trails, explainability, compliance documentation ready
Key Takeaways
- AI testing is probabilistic (statistical performance) rather than deterministic (exact outputs)
- Five testing layers: Data Quality, Model Training, Model Performance, Integration, Production Monitoring
- Data quality testing is foundational—bad data guarantees bad models regardless of algorithm
- Stratified testing reveals performance across important segments (user types, geographies, edge cases)
- For Malta regulated industries (MGA, MFSA): Explainability testing and audit trails are mandatory
- Multi-phase rollout de-risks production: shadow mode (no impact) → canary (10% traffic) → full deployment
- Continuous monitoring detects drift and degradation—models don't stay accurate forever without maintenance
- Fairness testing prevents GDPR violations and ethical issues (test for discrimination across demographics)
- Implement rollback capability before deploying—you need an escape hatch if production issues arise