The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset was created to address the class imbalance in the 'CIRA-CIC-DoHBrw-2020' dataset. The original 'CIRA-CIC-DoHBrw-2020' dataset is skewed, with about 90% malicious and only 10% benign DNS over HTTPS (DoH) network traffic. The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset instead contains equal numbers of malicious and benign DoH network traffic instances, with 249,836 instances in each category. This balance was achieved using the Synthetic Minority Over-sampling Technique (SMOTE), which creates synthetic minority-class samples by interpolating between existing ones. The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset comprises three CSV files: one for malicious DoH traffic, one for benign DoH traffic, and a third that combines both types.
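To illustrate the balancing idea, the following is a minimal, standard-library-only sketch of SMOTE-style oversampling. It is not the code used to build the dataset: the function name `smote`, the parameter `k`, and the toy two-dimensional samples are assumptions made for illustration, and a real workflow would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative sketch only: each synthetic point lies on the line segment
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        # Indices of the k nearest minority neighbours of sample i (excluding i).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(minority[i], minority[j]),
        )[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


# Toy example: 4 benign (minority) samples versus 10 malicious samples.
benign = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n_malicious = 10
balanced_benign = benign + smote(benign, n_malicious - len(benign), k=2)
```

After oversampling, `balanced_benign` has as many entries as the majority class, and every synthetic point stays within the region spanned by the original minority samples.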
The CIRA-CIC-DoHBrw-2020 dataset was created at the CIC laboratory, UNB, in 2020 and is publicly available. It focuses on the implementation of the DoH protocol in an application. Five different browsers and tools, together with four servers, were used to capture various types of traffic, including benign-DoH, malicious-DoH, and non-DoH traffic. The dataset’s classification process used a two-layered approach: Layer 1 distinguished between DoH and non-DoH traffic, while Layer 2 differentiated between benign-DoH and malicious-DoH traffic.
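The two-layered approach can be pictured as a simple cascade, sketched below. The function `classify_flow`, the score fields, and the 0.5 thresholds are hypothetical stand-ins for illustration; in practice each layer would be a trained classifier, and nothing here reflects the models used by the dataset authors.

```python
def classify_flow(flow, is_doh, is_malicious):
    """Apply Layer 1 (DoH vs non-DoH), then Layer 2 (benign vs malicious)."""
    if not is_doh(flow):
        return "non-DoH"
    return "malicious-DoH" if is_malicious(flow) else "benign-DoH"


# Hypothetical flows carrying made-up scores from two imaginary models.
flows = [
    {"id": "a", "doh_score": 0.1, "malicious_score": 0.9},
    {"id": "b", "doh_score": 0.8, "malicious_score": 0.2},
    {"id": "c", "doh_score": 0.9, "malicious_score": 0.7},
]


def layer1(flow):
    # Placeholder decision rule for Layer 1, not a real detector.
    return flow["doh_score"] >= 0.5


def layer2(flow):
    # Placeholder decision rule for Layer 2, not a real detector.
    return flow["malicious_score"] >= 0.5


labels = [classify_flow(f, layer1, layer2) for f in flows]
```

Note that Layer 2 is only consulted for flows that Layer 1 accepts as DoH, so flow `a` is labelled non-DoH regardless of its second score.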
The network traffic was captured using Google Chrome, Mozilla Firefox, dns2tcp, DNSCat2, and Iodine, in conjunction with AdGuard, Cloudflare, Google DNS, and Quad9 servers, which responded to DoH requests. The DoHLyzer tool was used to extract relevant features from the captured PCAP files. DoHLyzer is a DoH traffic flow generator and analyzer for anomaly and attack detection and characterization. It generated a CSV file containing 28 statistical features, which can be categorized as rate-based, length-based, and time-based features. The feature extraction process was implemented in Python.
Table 1: Extracted Features
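As a rough illustration of what rate-, length-, and time-based flow statistics look like, the sketch below computes a handful of them from a list of packets. The function `flow_features` and the feature names are hypothetical and do not reproduce DoHLyzer's actual 28 features or its output column names.

```python
import statistics


def flow_features(packets):
    """Compute example statistics for one flow.

    packets: list of (timestamp_seconds, length_bytes), sorted by timestamp.
    """
    if not packets:
        raise ValueError("flow has no packets")
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    duration = times[-1] - times[0]
    # Inter-arrival times between consecutive packets.
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        # Rate-based
        "bytes_rate": sum(lengths) / duration if duration > 0 else 0.0,
        # Length-based
        "length_mean": statistics.mean(lengths),
        "length_stdev": statistics.pstdev(lengths),
        # Time-based
        "gap_mean": statistics.mean(gaps) if gaps else 0.0,
        "gap_stdev": statistics.pstdev(gaps) if gaps else 0.0,
    }


# Toy flow: four packets over two seconds.
packets = [(0.0, 100), (0.5, 300), (1.0, 200), (2.0, 400)]
features = flow_features(packets)
```

Each flow yields one row of such statistics, which is the general shape of the per-flow CSV records described above.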
You may redistribute, republish, and mirror the BCCC-CIRA-CIC-DoHBrw-2020 dataset in any form. However, any use or redistribution of the data must include a citation to the BCCC-CIRA-CIC-DoHBrw-2020 dataset and the following paper:
- “Unveiling DoH Tunnel: Toward Generating a Balanced DoH Encrypted Traffic Dataset and Profiling Malicious Behaviour using Inherently Interpretable Machine Learning”, Sepideh Niktabe, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Peer-to-Peer Networking and Applications, Vol. 17, 2023
You can download this dataset from here.