
IoT Malware Detection with PySpark

Distributed big data analytics for cybersecurity: Apache Spark is used for large-scale network traffic analysis, anomaly detection, and identification of malicious IoT device behavior patterns in streaming environments.

  • Cybersecurity
  • PySpark
  • Anomaly Detection
  • Big Data
  • Streaming

[Figure: IoT network traffic anomaly detection pipeline with PySpark]

Problem & Motivation

IoT devices generate continuous, high-throughput network traffic that cannot be analysed in batch on a single machine. Malware on embedded devices often manifests through subtle traffic signatures such as periodic beaconing, unusual protocol usage, or lateral movement. This system applies distributed stream processing to detect these patterns at scale.
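To make "periodic beaconing" concrete, here is a minimal sketch of one common heuristic: a device that calls home on a timer has very regular gaps between connections, so the coefficient of variation (standard deviation divided by mean) of its inter-arrival times is close to zero. The function names, the `max_cv` threshold, and the `min_events` cutoff are illustrative assumptions, not values taken from this project.

```python
from statistics import mean, pstdev


def beaconing_score(timestamps):
    """Coefficient of variation of inter-arrival gaps (lower = more periodic).

    Returns None when there are too few events to judge.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    mu = mean(gaps)
    if mu == 0:
        return None
    return pstdev(gaps) / mu


def looks_like_beacon(timestamps, max_cv=0.1, min_events=5):
    """Flag a connection series as beacon-like (thresholds are assumptions)."""
    score = beaconing_score(timestamps)
    return len(timestamps) >= min_events and score is not None and score <= max_cv
```

A series of connections exactly 60 seconds apart scores 0.0 and is flagged, while irregular human-driven traffic typically scores far above the threshold. Real beacons often add jitter, so a production threshold would need tuning against baseline traffic.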

Pipeline Architecture

  • Ingestion: Network packet captures parsed and streamed into Apache Spark Structured Streaming for real-time processing
  • Feature engineering: Flow-level aggregations (packet rate, byte statistics, inter-arrival times, protocol distribution) computed over sliding windows with PySpark
  • Anomaly detection: Isolation Forest and statistical control charts applied per device to flag behavioural deviations from learned baselines
  • Classification: Supervised malware family classifier (Random Forest + XGBoost) trained on labelled IoT traffic datasets (N-BaIoT, UNSW-NB15)
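
The feature engineering and control-chart steps above can be sketched in plain Python. This is an illustration of the logic only, not the project's Spark code: in PySpark the window assignment would normally be done with `pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration)` inside a `groupBy`. The packet tuple layout, the feature names, the default window sizes, and the `k` multiplier are all assumptions, and the Isolation Forest step is not shown.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sliding_windows(ts, width, slide):
    """Yield the start of every [start, start + width) window that contains ts."""
    start = (ts // slide) * slide
    while start > ts - width:
        yield start
        start -= slide


def window_features(packets, width=20, slide=10):
    """Aggregate per-device flow features over sliding windows.

    packets: iterable of (device_id, timestamp_s, size_bytes, protocol).
    Returns {(device_id, window_start): feature dict}.
    """
    buckets = defaultdict(list)
    for device, ts, size, proto in packets:
        for start in sliding_windows(ts, width, slide):
            buckets[(device, start)].append((ts, size, proto))

    features = {}
    for key, rows in buckets.items():
        rows.sort(key=lambda r: r[0])  # order by timestamp
        times = [r[0] for r in rows]
        sizes = [r[1] for r in rows]
        gaps = [b - a for a, b in zip(times, times[1:])]
        proto_counts = defaultdict(int)
        for _, _, proto in rows:
            proto_counts[proto] += 1
        features[key] = {
            "packet_rate": len(rows) / width,          # packets per second
            "mean_bytes": mean(sizes),
            "std_bytes": pstdev(sizes),
            "mean_iat": mean(gaps) if gaps else None,  # inter-arrival time
            "protocol_share": {p: c / len(rows) for p, c in proto_counts.items()},
        }
    return features


def control_chart_flags(baseline, observed, k=3.0):
    """Shewhart-style check: flag values outside mean +/- k * sigma of the baseline."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    lower, upper = mu - k * sigma, mu + k * sigma
    return [not (lower <= x <= upper) for x in observed]
```

A per-device baseline would be the history of one feature (for example `packet_rate`) for that device, and `control_chart_flags` would then be applied to each new window's value. Keeping the baseline per device matters because a camera and a thermostat have very different normal traffic levels.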

Results

Model             Accuracy   F1 (macro)   FPR
Isolation Forest  -          -            < 2%
Random Forest     96.4%      0.95         1.8%
XGBoost           97.8%      0.97         1.2%
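
For readers unfamiliar with the three reported metrics, the sketch below shows one way they could be computed from true and predicted labels. It assumes FPR means the fraction of benign samples predicted as any malware family; the source does not state the definition it used. The function name `evaluate`, the `benign` label, and the example family names are illustrative.

```python
def evaluate(y_true, y_pred, benign="benign"):
    """Return accuracy, macro-averaged F1, and false positive rate.

    FPR here = benign samples predicted as non-benign / all benign samples
    (an assumed definition; the source does not specify one).
    """
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))

    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)  # unweighted mean over classes

    benign_total = sum(t == benign for t, _ in pairs)
    false_alarms = sum(t == benign and p != benign for t, p in pairs)
    fpr = false_alarms / benign_total if benign_total else None

    return {"accuracy": accuracy, "macro_f1": macro_f1, "fpr": fpr}
```

Macro averaging weights every class equally, so a rare malware family that is classified poorly pulls the score down as much as a common one would. That makes it a stricter summary than accuracy on imbalanced traffic datasets.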