A transfer-aware, deployment-oriented evaluation framework for NetFlow-based intrusion detection systems (TAN-IDS).
Ha Thanh D
Abstract
Machine learning-based Intrusion Detection Systems (IDS) often report high detection accuracy under controlled, single-dataset evaluation, yet experience severe performance degradation when deployed in unseen network environments due to domain shift. To bridge this gap between laboratory benchmarking and real-world deployment, this paper presents TAN-IDS, a transfer-aware and deployment-oriented evaluation framework for NetFlow-based intrusion detection. Rather than proposing a new detection model, TAN-IDS contributes a methodological evaluation framework that unifies heterogeneous traffic datasets under a compact 8-dimensional NetFlow feature interface. This constrained representation supports interoperable and deployment-realistic evaluation across datasets collected in different network settings, enabling performance degradation to be more reliably attributed to domain shift rather than feature-space incompatibilities. Within this unified interface, TAN-IDS formalizes key deployment conditions as explicit evaluation scenarios, including in-dataset evaluation, direct cross-dataset transfer, mixed-domain training, and lightweight target-domain fine-tuning. Extensive experiments conducted within the proposed evaluation framework, using representative machine learning models and neural architectures, including a lightweight Transformer-based control model, show that strong in-dataset performance does not translate to cross-dataset robustness and that increased model complexity alone is insufficient to mitigate domain shift. In contrast, domain-aware training strategies are effective: mixed-domain training improves generalization, while fine-tuning with only 5% labeled target-domain data substantially recovers attack-class recall and F1-macro, exceeding 95% in several scenarios. Overall, TAN-IDS provides a reproducible, deployment-centric evaluation framework that reveals robustness limitations overlooked by benchmark-centric IDS evaluation.