IoT Malware Detection with PySpark
Distributed big data analytics for cybersecurity: Apache Spark for large-scale network traffic analysis, anomaly detection, and identification of malicious IoT device behavior patterns in streaming environments.
Problem & Motivation
IoT devices generate continuous high-throughput network traffic that cannot be analysed in batch on a single machine. Malware in embedded systems often manifests through subtle traffic signatures — periodic beaconing, unusual protocol usage, or lateral movement patterns. This system applies distributed stream processing to detect these patterns at scale.
Pipeline Architecture
- Ingestion: Network packet captures parsed and streamed into Apache Spark Structured Streaming for real-time processing
- Feature engineering: Flow-level aggregations (packet rate, byte statistics, inter-arrival times, protocol distribution) computed over sliding windows with PySpark
- Anomaly detection: Isolation Forest and statistical control charts applied per device to flag behavioural deviations from learned baselines
- Classification: Supervised malware family classifier (Random Forest + XGBoost) trained on labelled IoT traffic datasets (N-BaIoT, UNSW-NB15)
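
The per-device control-chart step can be illustrated with a minimal sketch. It is written in plain Python rather than PySpark so it runs anywhere, and it assumes a Shewhart-style rule (flag a window when its packet rate falls outside mean ± k·sigma of that device's baseline). The function names, the device IDs, the sample rates, and the choice of k = 3 are all hypothetical and are not taken from the project's code.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_baselines(history):
    """Learn a (mean, stdev) packet-rate baseline per device.

    history: iterable of (device_id, packet_rate) pairs taken from a
    period assumed to be free of malicious traffic.
    """
    by_device = defaultdict(list)
    for device_id, rate in history:
        by_device[device_id].append(rate)
    # A sample standard deviation needs at least two observations.
    return {
        dev: (mean(rates), stdev(rates))
        for dev, rates in by_device.items()
        if len(rates) >= 2
    }


def flag_anomalies(baselines, observations, k=3.0):
    """Return the observations outside mean +/- k * sigma for their device."""
    flagged = []
    for device_id, rate in observations:
        if device_id not in baselines:
            continue  # no learned baseline for this device, so skip it
        mu, sigma = baselines[device_id]
        if abs(rate - mu) > k * sigma:
            flagged.append((device_id, rate))
    return flagged


# Made-up per-window packet rates for two hypothetical devices.
history = [("cam-1", r) for r in (10, 12, 11, 9, 10, 11)]
history += [("plug-7", r) for r in (2, 3, 2, 2, 3, 2)]

baselines = fit_baselines(history)
new_windows = [("cam-1", 11), ("cam-1", 40), ("plug-7", 3), ("plug-7", 15)]
flagged = flag_anomalies(baselines, new_windows)
print(flagged)
```

In a distributed setting the same rule would be applied to the windowed flow aggregates rather than to an in-memory list; keeping a separate baseline per device is what lets a normally quiet smart plug and a normally chatty camera each be judged against its own behaviour.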
Results
| Model | Accuracy | F1 (macro) | False positive rate (FPR) |
|---|---|---|---|
| Isolation Forest | — | — | < 2% |
| Random Forest | 96.4% | 0.95 | 1.8% |
| XGBoost | 97.8% | 0.97 | 1.2% |
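
For reference, the three reported metrics can be computed from predictions as in the sketch below. It assumes that FPR means the share of benign flows labelled as any malware family, which the table does not state explicitly. The labels and predictions are toy values chosen for illustration and are unrelated to the results above.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def false_positive_rate(y_true, y_pred, benign="benign"):
    """Share of truly benign samples predicted as any non-benign class."""
    fp = sum(t == benign and p != benign for t, p in zip(y_true, y_pred))
    tn = sum(t == benign and p == benign for t, p in zip(y_true, y_pred))
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels only; the class names are illustrative.
y_true = ["benign", "benign", "benign", "benign",
          "mirai", "mirai", "gafgyt", "gafgyt"]
y_pred = ["benign", "benign", "benign", "mirai",
          "mirai", "mirai", "gafgyt", "benign"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
fpr = false_positive_rate(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f} fpr={fpr:.3f}")
```

Macro averaging weights every class equally, so a rare malware family that is classified poorly pulls the score down even when overall accuracy stays high.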
