ITC-Net-blend-60: a comprehensive dataset for robust network traffic classification in diverse environments

Bayat, Marziyeh; Garshasbi, Javad; Mehdizadeh, Mozhgan; Nozari, Neda; Rezaei Khesal, Abolghasem; Dokhaei, Maryam; Teimouri, Mehdi

doi:10.1186/s13104-024-06817-5

Data Note
Open access
Published: 15 June 2024

BMC Research Notes volume 17, Article number: 165 (2024)

Abstract

Objectives

Recognizing mobile applications within encrypted network traffic has considerable implications across multiple domains, including network administration, security, and digital marketing. Building network traffic classifiers that can adapt to dynamic and unpredictable real-world settings remains a major challenge. Currently available datasets only contain traffic captured in a single network environment, which restricts their utility in evaluating the robustness and compatibility of a given model.

Data description

This dataset was gathered from 60 popular Android applications in five different network scenarios, with the intention of overcoming the limitations of previous datasets. The scenarios covered the same set of applications but differed in Internet service provider (ISP), geographic location, device, application version, and individual user. The traffic was generated through real human interactions on physical devices for 3–15 min. The method used to capture the traffic did not require root privileges on mobile phones and filtered out any background traffic. In total, the collected dataset comprises over 48 million packets, 450K bidirectional flows, and 36 GB of data.

Objective

As the volume of mobile app traffic continues to soar, the importance of reliable app identification solutions cannot be overstated. Mobile app traffic identification aids network management and security and provides valuable profiling information for advertisers, insurance companies, and security agencies [12,13,14]. As a result, it has garnered significant interest from both academia and industry, leading to extensive research on the subject.

However, achieving robust application identification remains an open problem. Recent investigations [15, 16] have revealed that although existing classifiers achieve satisfactory performance when trained and tested in the conventional machine learning manner (dividing a single dataset into training and test parts), most of them suffer significant performance degradation when evaluated on different datasets. This indicates a lack of robustness and compatibility of models in practical networks. The challenge stems from the unpredictable nature of real-world network environments [15, 16] and the dynamic and evolving behavior of mobile apps [13, 17,18,19]. A key requirement for building robust classifiers is a dataset of traffic captured in various network scenarios. Nonetheless, as indicated in S Table 3, existing datasets were mainly captured in a single invariant network environment.

In this paper, we address this limitation by presenting a dataset that was captured across five different network scenarios, with various factors affecting network traffic behavior. This dataset allows for the evaluation of model performance under different network conditions. It is provided in raw format (PCAP files) to give researchers the flexibility to develop models based on any traffic object (e.g., packet, flow, or bag of flows), feature, method, or novel approach. Moreover, this dataset was generated through real human interactions on actual smartphones and captured using a non-rooted method. This makes the data more representative and suitable than synthetic data for mobile app traffic analysis.
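One way such a multi-scenario dataset could be used to measure robustness is a leave-one-scenario-out evaluation: train on four scenarios and test on the fifth. The following is a minimal sketch of that split, not a procedure taken from this Data note. The record layout (dictionaries with `app`, `scenario`, and `path` keys), the function name `cross_scenario_splits`, and the example file paths are all hypothetical.

```python
from collections import defaultdict

def cross_scenario_splits(traces):
    """Yield (held_out_scenario, train, test) for a leave-one-scenario-out evaluation.

    `traces` is an iterable of dicts with at least a "scenario" key
    (a hypothetical record layout chosen for this sketch).
    """
    by_scenario = defaultdict(list)
    for trace in traces:
        by_scenario[trace["scenario"]].append(trace)

    for held_out in sorted(by_scenario):
        # Train on every scenario except the held-out one.
        train = [t for s, ts in by_scenario.items() if s != held_out for t in ts]
        # Test only on traces from the held-out scenario.
        test = list(by_scenario[held_out])
        yield held_out, train, test

# Toy records; the paths are placeholders, not real file names from the dataset.
example_traces = [
    {"app": "Telegram", "scenario": "A", "path": "trace_1.pcap"},
    {"app": "Telegram", "scenario": "B", "path": "trace_2.pcap"},
    {"app": "Spotify", "scenario": "A", "path": "trace_3.pcap"},
    {"app": "Spotify", "scenario": "C", "path": "trace_4.pcap"},
]

for held_out, train, test in cross_scenario_splits(example_traces):
    print(held_out, len(train), len(test))
```

Splitting by scenario rather than at random keeps every test trace in a network environment the model has not seen, which is the condition under which the cited studies report performance degradation.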

Data description

The methodology employed for collecting the dataset comprised three main stages: Application Selection, Traffic Capture Setup, and Traffic Generation. More details about each stage are provided in the accompanying supplementary materials.

Application selection

To collect traffic data, an initial step is to determine which applications to monitor, given the vast number of applications available. We chose 60 Android applications from the top 300 free apps listed in the Google Play Store and two major Iranian Android app markets, Cafe Bazaar [1] and Myket [2]. Our selection was based on two criteria: the apps must require internet connectivity to fulfill their core functions, and they must generate traffic through user interactions. These 60 apps belong to 16 distinct categories, which are listed in S Table A1.

Traffic capture setup

We used a smartphone and a laptop to capture our traffic. The laptop ran Windows 10 and had an internal dual-band network card. We installed Wireshark [3] on it and configured it to capture traffic through the “Local Area Connection” interface. The laptop was connected to the internet and shared its connection with the smartphone via a hotspot. This allowed Wireshark to capture the smartphone’s network traffic. However, the traffic captured by Wireshark also contained significant background traffic. To isolate the target application’s network traffic, we installed PCAPdroid [4] on the smartphone in non-root mode. We used Wireshark and PCAPdroid simultaneously to record the target application’s traffic. After collecting traffic data, we separated the target application traffic from the background traffic by comparing IP addresses and ports captured by Wireshark and PCAPdroid. Any pairs in Wireshark that did not match a PCAPdroid pair were identified as background traffic and removed.

We implemented this method in Python 3 using the Scapy library [5]. The code for this implementation is available in the Supplementary material.
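The filtering step described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code from the Supplementary material. It assumes each packet is reduced to a `(src_ip, src_port, dst_ip, dst_port)` tuple and that a Wireshark packet is kept when either of its (IP, port) endpoints also appears in the PCAPdroid capture. The text does not specify which endpoint is compared, so that matching rule is an assumption, as are the function names. The helper `packet_keys_from_pcap` shows how the tuples might be extracted with Scapy; it ignores IPv6 and non-TCP/UDP packets.

```python
from typing import Iterable, List, Set, Tuple

# A packet reduced to (src_ip, src_port, dst_ip, dst_port).
PacketKey = Tuple[str, int, str, int]
Endpoint = Tuple[str, int]

def endpoints(packets: Iterable[PacketKey]) -> Set[Endpoint]:
    """Collect every (IP, port) pair seen as a source or destination."""
    pairs: Set[Endpoint] = set()
    for src_ip, sport, dst_ip, dport in packets:
        pairs.add((src_ip, sport))
        pairs.add((dst_ip, dport))
    return pairs

def drop_background(wireshark: Iterable[PacketKey],
                    pcapdroid: Iterable[PacketKey]) -> List[PacketKey]:
    """Keep Wireshark packets that share an (IP, port) pair with PCAPdroid.

    Assumption: matching on either endpoint is enough to attribute a
    packet to the target application.
    """
    app_pairs = endpoints(pcapdroid)
    return [
        p for p in wireshark
        if (p[0], p[1]) in app_pairs or (p[2], p[3]) in app_pairs
    ]

def packet_keys_from_pcap(path: str) -> List[PacketKey]:
    """Extract IPv4 TCP/UDP packet keys from a PCAP file using Scapy."""
    # Imported lazily so the pure-Python functions above work without Scapy.
    from scapy.all import IP, TCP, UDP, rdpcap

    keys: List[PacketKey] = []
    for pkt in rdpcap(path):
        if IP not in pkt:
            continue  # this sketch skips IPv6 and non-IP frames
        layer = TCP if TCP in pkt else UDP if UDP in pkt else None
        if layer is None:
            continue
        keys.append((pkt[IP].src, pkt[layer].sport, pkt[IP].dst, pkt[layer].dport))
    return keys
```

A limitation of this simple rule is that a server endpoint shared by the target application and a background service would be kept; a stricter variant could require both endpoints, or the full flow, to match.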

Traffic generation

The dataset was collected by five volunteers from among the ITC-LAB members over a period of six weeks, from October to December 2021. Each volunteer collected traffic in a different network scenario (see S Table 1). Before commencing the data collection process, they were informed about the objectives of the traffic capture and the public release of the data in PCAP format. They also received training on how to collect traffic. The volunteers were required to conduct at least three experiments for every application, with each experiment consisting of interacting with a single app on a specific smartphone for 3 to 15 min. They were instructed to use the application as they normally would.

The resulting dataset is organized into separate repositories for each scenario, with a dedicated compressed file for every application. Each compressed file contains the corresponding PCAP files, all of which have been named using a consistent naming convention. The format for naming the PCAP files is as follows: (Application Name)_(Scenario ID)_(#Trace)_Final.pcap.
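As an illustration of the naming convention above, the following sketch splits a file name into its three fields. It assumes that the trace number is a plain integer and that the scenario ID contains no underscore, neither of which the text states explicitly. The function name and the example file names are hypothetical and may not match files in the repositories.

```python
import re
from typing import NamedTuple

class TraceName(NamedTuple):
    app: str
    scenario: str
    trace: int

# (Application Name)_(Scenario ID)_(#Trace)_Final.pcap
# The greedy app group lets application names contain underscores themselves.
_PATTERN = re.compile(
    r"^(?P<app>.+)_(?P<scenario>[^_]+)_(?P<trace>\d+)_Final\.pcap$"
)

def parse_trace_name(filename: str) -> TraceName:
    """Split a PCAP file name into application, scenario ID, and trace number."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return TraceName(match["app"], match["scenario"], int(match["trace"]))

# Hypothetical file names used only to exercise the parser.
print(parse_trace_name("Telegram_A_1_Final.pcap"))
print(parse_trace_name("Facebook_Lite_B_2_Final.pcap"))
```

Parsing the labels from the file name keeps the application and scenario attached to each trace even after the per-application archives are extracted into a single directory.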

The entire dataset comprises 1,159 PCAP traces and 36 GB of network traffic data. S Table A2 provides further details about the dataset for each app.

Limitations

This dataset only includes traffic of a specific subset of Android applications. It does not include traffic from other operating systems or applications.
Each capture session, represented by a PCAP file, has a limited duration of 3 to 15 min.
Due to internet access restrictions in Iran, the traffic of ten applications—Coursera, Facebook Lite, Goodreads, Likee, Snapchat, Spotify, Telegram, Twitter, YouTube, and Waze—has been recorded differently compared to the other applications. (For more information, please refer to the detailed description provided in the Supplementary Materials.)

Availability of data and materials

The data described in this Data note can be freely and openly accessed on Mendeley Data under the name ITC-Net-Blend-60. Please see Table 1 and references [6,7,8,9,10,11] for details and links to the data.

Table 1 Overview of data files/data sets

Abbreviations

PCAP: Packet capture
PC: Personal computer
VPN: Virtual private network

References

1. Cafe Bazaar. https://cafebazaar.ir/. Accessed 24 Feb 2024.
2. Myket. https://myket.ir/. Accessed 21 May 2023.
3. Wireshark. https://www.wireshark.org/. Accessed 24 Feb 2024.
4. PCAPdroid. https://github.com/emanuele-f/PCAPdroid/. Accessed 24 Feb 2024.
5. Scapy. https://scapy.net/. Accessed 24 Feb 2024.
6. Bayat M, Garshasbi J, Mehdizadeh M, Nozari N, Rezaei Khesal A, Dokhaei M, Teimouri M. ITC-Net-Blend-60: a comprehensive dataset for robust network traffic classification in diverse environments-scenario A. Mendeley Data. 2024. https://doi.org/10.17632/ssv23kfcgs.3.
7. Bayat M, Garshasbi J, Mehdizadeh M, Nozari N, Rezaei Khesal A, Dokhaei M, Teimouri M. ITC-Net-Blend-60: a comprehensive dataset for robust network traffic classification in diverse environments-scenario B. Mendeley Data. 2024. https://doi.org/10.17632/3zggb53m4x.3.
8. Bayat M, Garshasbi J, Mehdizadeh M, Nozari N, Rezaei Khesal A, Dokhaei M, Teimouri M. ITC-Net-Blend-60: a comprehensive dataset for robust network traffic classification in diverse environments-scenario C. Mendeley Data. 2024. https://doi.org/10.17632/gp8r347j38.3.
9. Bayat M, Garshasbi J, Mehdizadeh M, Nozari N, Rezaei Khesal A, Dokhaei M, Teimouri M. ITC-Net-Blend-60: a comprehensive dataset for robust network traffic classification in diverse environments-scenario D. Mendeley Data. 2024. https://doi.org/10.17632/mcmf627yh5.3.
10. Bayat M, Garshasbi J, Mehdizadeh M, Nozari N, Rezaei Khesal A, Dokhaei M, Teimouri M. ITC-Net-Blend-60: a comprehensive dataset for robust network traffic classification in diverse environments-scenario E. Mendeley Data. 2024. https://doi.org/10.17632/gdtnnfyr7s.3.
11. Bayat M, Garshasbi J, Mehdizadeh M, Nozari N, Rezaei Khesal A, Dokhaei M, Teimouri M. ITC-Net-Blend-60: a comprehensive dataset for robust network traffic classification in diverse environments-supplementary materials. Mendeley Data. 2024. https://doi.org/10.17632/4sgt9tjs4w.7.
12. Taylor VF, Spolaor R, Conti M, Martinovic I. AppScanner: automatic fingerprinting of smartphone apps from encrypted network traffic. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE; 2016. pp. 439–454.
13. van Ede T et al. FlowPrint: semi-supervised mobile-app fingerprinting on encrypted network traffic. In: Network and Distributed System Security Symposium (NDSS), vol. 27. 2020.
14. Aceto G, Ciuonzo D, Montieri A, Pescapé A. Mobile encrypted traffic classification using deep learning: experimental evaluation, lessons learned, and challenges. IEEE Trans Netw Serv Manag. 2019;16(2):445–458.
15. Khesal AR, Teimouri M. The effect of network environment on traffic classification. In: 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE). IEEE; 2022. pp. 059–064.
16. Li W, Zhang X-Y, Bao H, Wang Q, Li Z. Robust network traffic identification with graph matching. Comput Netw. 2022;218:109368.
17. Li W, Zhang X-Y, Bao H, Shi H, Wang Q. ProGraph: robust network traffic identification with graph propagation. IEEE/ACM Trans Netw. 2022.
18. Alan HF, Kaur J. Can Android applications be identified using only TCP/IP headers of their launch time traffic? In: Proceedings of the 9th ACM Conference on Security & Privacy in Wireless and Mobile Networks. 2016. pp. 61–66.
19. Taylor VF, Spolaor R, Conti M, Martinovic I. Robust smartphone app identification via encrypted network traffic analysis. IEEE Trans Infor Forens Security. 2017;13(1):63–78.

Acknowledgements

We wish to express our deepest gratitude to Mohammad Reza Tajzad for his invaluable insights and expertise, which significantly contributed to the success of our research. We also thank Fatemeh Delroba for her assistance in the data collection process. Additionally, we would like to extend our heartfelt thanks to Parastoo Soori for her helpful comments and suggestions on an earlier version of this manuscript, which greatly improved it. Finally, we sincerely thank all the participants in this study for their time and willingness to share their experiences. Without their contributions, our study would not have been possible.

Funding

The authors declare no source of funding.

Author information

Authors and Affiliations

Information Theory and Coding (ITC) Laboratory, University of Tehran, Tehran, Iran
Marziyeh Bayat, Javad Garshasbi, Mozhgan Mehdizadeh, Neda Nozari, Abolghasem Rezaei Khesal, Maryam Dokhaei & Mehdi Teimouri

Contributions

Methodology, J.G. and M.B.; data collection, M.B., M.M., N.N., A.R., and M.D.; project administration, M.T.; writing (original draft preparation), M.B.; writing (review and editing), J.G. and M.T.; supervision, M.T. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Mehdi Teimouri.

Ethics declarations

Ethics approval and consent to participate

Based on both the presented design choices and the experimental setup, the capture process and the collected dataset do not imply any ethical concern.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file1 (DOCX 370 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Bayat, M., Garshasbi, J., Mehdizadeh, M. et al. ITC-Net-blend-60: a comprehensive dataset for robust network traffic classification in diverse environments. BMC Res Notes 17, 165 (2024). https://doi.org/10.1186/s13104-024-06817-5

Received: 06 February 2024
Accepted: 03 June 2024
Published: 15 June 2024
DOI: https://doi.org/10.1186/s13104-024-06817-5
