ARASH HABIBI LASHKARI
CYBERSECURITY RESEARCHER
Datasets (IDS/IPS: 2, Malware: 2, Encrypted Traffic: 2, Analyzer: 1)
Intrusion Detection Evaluation Dataset (CSE-CIC-IDS2018) 2018
The final dataset includes seven different attack scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines, and the victim organization has 5 departments comprising 420 machines and 30 servers. The dataset includes the captured network traffic and system logs of each machine, along with 80 features extracted from the captured traffic using CICFlowMeter-V3.0.
For more information or to download this dataset, contact AWS.
Intrusion Detection Evaluation Dataset (CICIDS2017) 2017
Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) are the most important defense tools against sophisticated and ever-growing network attacks. Because reliable test and validation datasets are lacking, anomaly-based intrusion detection approaches struggle to obtain consistent and accurate performance evaluations. Our evaluation of the eleven datasets available since 1998 shows that most are out of date and unreliable. Some of these datasets lack traffic diversity and volume, some do not cover the variety of known attacks, and others anonymize packet payload data and so cannot reflect current trends. Some also lack a feature set and metadata. The CICIDS2017 dataset contains benign traffic and the most up-to-date common attacks, and so resembles true real-world data (PCAPs). It also includes the results of the network traffic analysis using CICFlowMeter-V3.0, with flows labeled based on the time stamp, source and destination IPs, source and destination ports, protocols and attack (CSV files).
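As a rough illustration of how labeled flow CSV files of this kind might be inspected, the sketch below counts the flows per label using only the Python standard library. The column name "Label", the other header names, and the sample rows are assumptions made for the example; they are not taken from the dataset itself, so check the actual CSV header before reusing it.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="Label"):
    """Count how many flows carry each label in a flow-level CSV.

    `label_column` is an assumed header name, not one confirmed by the dataset.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Strip whitespace from header names so " Label" and "Label" both match.
        cleaned = {key.strip(): value for key, value in row.items() if key is not None}
        counts[cleaned[label_column].strip()] += 1
    return counts


# Tiny made-up sample; headers and values are hypothetical.
SAMPLE = """Source IP,Destination IP,Protocol, Label
10.0.0.1,10.0.0.2,6,BENIGN
10.0.0.3,10.0.0.2,6,DoS
10.0.0.1,10.0.0.4,17,BENIGN
"""

if __name__ == "__main__":
    for label, total in count_labels(io.StringIO(SAMPLE)).items():
        print(label, total)
```

To use it on a real file, pass an open file object, for example `count_labels(open(path, newline=""))`, where `path` points to one of the downloaded CSV files.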

The full research paper outlining the details of the dataset and its underlying principles:

Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018

For more information, visit this page.
Android Malware Dataset (CICAndMal2017) 2016
We propose a new Android malware dataset, named CICAndMal2017. In this approach, we run both our malware and benign applications on real smartphones to avoid the runtime behavior modification of advanced malware samples that are able to detect an emulator environment. We collected 10,854 samples (4,354 malware and 6,500 benign) from several sources. The benign apps, more than six thousand in total, were collected from the Google Play market and were published in 2015, 2016 and 2017. In this dataset, we installed 5,000 of the collected samples (426 malware and 5,065 benign) on real devices. The malware samples in the CICAndMal2017 dataset are classified into four categories: Adware, Ransomware, Scareware and SMS Malware. The samples come from 42 unique malware families.

The full research paper outlining the details of the dataset and its underlying principles:

Arash Habibi Lashkari, Andi Fitriah A. Kadir, Laya Taheri, and Ali A. Ghorbani, “Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification”, In the proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.

For more information, visit this page.
Android Adware and General Malware Dataset (AAGM) 2016
Sophisticated Android malware is able to identify the presence of the emulator used by the malware analyst and, in response, alter its behavior to evade detection. To overcome this issue, we installed the Android applications on real devices and captured their network traffic. The AAGM dataset was captured by installing the Android apps on real smartphones in a semi-automated way. The dataset is generated from 1900 applications in the following three categories:
Android Adware (250 apps): Airpush, Dowgin, Kemoge, Mobidash, Shuanet
General Android Malware (150 apps): AVpass, FakeAV, FakeFlash/FakePlayer, GGtracker, Penetho
Benign (1500 apps): 2015 and 2016 GooglePlay market (top free popular and top free new)

The full research paper outlining the details of the dataset and its underlying principles:

Arash Habibi Lashkari, Andi Fitriah A. Kadir, Hugo Gonzalez, Kenneth Fon Mbah and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, In the proceedings of the 15th International Conference on Privacy, Security and Trust, PST, Calgary, Canada, 2017.

For more information, visit this page.
Tor-nonTor Network Traffic dataset 2016
To ensure the quantity and diversity of this dataset at CIC, we defined a set of tasks to generate a representative dataset of real-world traffic. We created three users for the browser traffic collection and two users for the communication parts such as chat, mail, FTP, P2P, etc. For the non-Tor traffic we used previous benign traffic from the VPN project, and for the Tor traffic we used eight traffic categories: Browsing, Email, Chat, Audio-Streaming, Video-Streaming, FTP, VoIP, P2P. The traffic was captured using Wireshark and tcpdump, generating a total of 22GB of data. To facilitate the labeling process, as explained in the related published paper, we captured the outgoing traffic at the workstation and the gateway simultaneously, collecting a set of pairs of .pcap files: one regular traffic pcap (workstation) and one Tor traffic pcap (gateway). Later, we labelled the captured traffic in two steps. First, we processed the .pcap files captured at the workstation: we extracted the flows and confirmed that the majority of traffic flows were generated by application X (Skype, FTPS, etc.), the object of the traffic capture. Then, we labelled all flows from the Tor .pcap file as X.
ISCXFlowMeter is written in Java; it reads the pcap files and creates a CSV file based on selected features. The dataset consists of labeled network traffic, including full packets in pcap format and CSV files (flows generated by CICFlowMeter), and is publicly available to researchers.
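The two labeling steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the flow records are plain dictionaries, the "application" and "flow_id" fields are hypothetical, and how each workstation flow is attributed to an application is outside the sketch.

```python
from collections import Counter


def majority_application(workstation_flows):
    """Step one: find the application that generated most workstation flows."""
    counts = Counter(flow["application"] for flow in workstation_flows)
    application, _ = counts.most_common(1)[0]
    return application


def label_tor_flows(tor_flows, application):
    """Step two: give every flow from the paired Tor (gateway) capture that label."""
    return [dict(flow, label=application) for flow in tor_flows]


# Hypothetical flows from one workstation/gateway capture pair.
workstation = [
    {"application": "Skype"},
    {"application": "Skype"},
    {"application": "DNS"},
]
gateway = [{"flow_id": 1}, {"flow_id": 2}]

labelled = label_tor_flows(gateway, majority_application(workstation))
```

Labeling per capture pair rather than per flow matches the description: the Tor flows are encrypted, so the label is carried over from what was observed at the workstation.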

The full research paper outlining the details of the dataset and its underlying principles:

Arash Habibi Lashkari, Gerard Draper-Gil, Mohammad Saiful Islam Mamun and Ali A. Ghorbani, "Characterization of Tor Traffic Using Time Based Features", In the proceedings of the 3rd International Conference on Information Systems Security and Privacy, SCITEPRESS, Porto, Portugal, 2017.

For more information, visit this page.
VPN-nonVPN Network Traffic dataset 2015
To generate a representative dataset of real-world traffic at ISCX, we defined a set of tasks, ensuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.). We captured a regular session and a session over VPN; therefore we have a total of 14 traffic categories: VoIP, VPN-VoIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated: Browsing, Email, Chat, Audio-Streaming, Video-Streaming, FTP, VoIP, P2P.
The traffic was captured using Wireshark and tcpdump, generating a total of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider, with FileZilla as the client. To facilitate the labeling process, all unnecessary services and applications were closed while capturing the traffic. (The only application executed was the objective of the capture, e.g., Skype voice-call, SFTP file transfer, etc.) We used a filter to capture only the packets whose source or destination IP is the address of the local client (Alice or Bob).
CICFlowMeter (formerly known as ISCXFlowMeter) is written in Java; it reads the pcap files and creates a CSV file based on selected features. The dataset consists of labeled network traffic, including full packets in pcap format and CSV files (flows generated by CICFlowMeter), and is publicly available to researchers.
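The capture filter mentioned above can be illustrated with a short sketch. It builds a tcpdump-style filter expression for one host (the `host` primitive matches packets whose source or destination is that address) and applies the same rule to already-parsed packets. The IP addresses and the packet dictionary layout are hypothetical; the exact filter used for the dataset is not given in the text.

```python
def capture_filter(client_ip):
    """Build a tcpdump/BPF-style expression for traffic to or from one host."""
    return f"host {client_ip}"


def involves_client(packet, client_ip):
    """Apply the same rule to a parsed packet: source or destination is the client."""
    return packet["src"] == client_ip or packet["dst"] == client_ip


# Hypothetical client address (documentation range) and parsed packets.
ALICE_IP = "192.0.2.10"
packets = [
    {"src": "192.0.2.10", "dst": "198.51.100.7"},
    {"src": "198.51.100.7", "dst": "192.0.2.10"},
    {"src": "203.0.113.5", "dst": "198.51.100.7"},
]

kept = [pkt for pkt in packets if involves_client(pkt, ALICE_IP)]
```

The resulting expression could be passed to tcpdump as its filter argument, for example `tcpdump -w alice.pcap host 192.0.2.10`, where the output file name is again only an example.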

The full research paper outlining the details of the dataset and its underlying principles:

Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy, 2016.

For more information, visit this page.
Network Traffic Analyzer (CICFlowMeter formerly known as ISCXFlowMeter) 2015

CICFlowMeter-V3.0 (formerly known as ISCXFlowMeter) is written in Java; it reads the pcap files and creates a CSV file based on selected features. The dataset consists of labeled network traffic, including full packets in pcap format and CSV files (flows generated by CICFlowMeter), and is publicly available to researchers.
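To give a feel for what a flow meter does, the sketch below groups packets into bidirectional flows and derives a few time-based features per flow. It is written in Python for brevity and is not CICFlowMeter's implementation: the packet field names, the direction-independent flow key, and the three features shown are assumptions chosen for the example, and pcap parsing is left out.

```python
from collections import defaultdict
from statistics import mean


def flow_key(pkt):
    """Direction-independent key so both directions land in the same flow."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (min(a, b), max(a, b), pkt["proto"])


def flow_features(packets):
    """Return one feature row per flow: packet count, duration, mean inter-arrival time."""
    stamps_by_flow = defaultdict(list)
    for pkt in packets:
        stamps_by_flow[flow_key(pkt)].append(pkt["ts"])

    rows = []
    for key, stamps in stamps_by_flow.items():
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        rows.append({
            "flow": key,
            "packets": len(stamps),
            "duration": stamps[-1] - stamps[0],
            "mean_iat": mean(gaps) if gaps else 0.0,
        })
    return rows


# Hypothetical parsed packets: timestamps in seconds, protocol 6 = TCP, 17 = UDP.
packets = [
    {"ts": 0.0, "src_ip": "10.0.0.1", "src_port": 1234, "dst_ip": "10.0.0.2", "dst_port": 443, "proto": 6},
    {"ts": 0.5, "src_ip": "10.0.0.2", "src_port": 443, "dst_ip": "10.0.0.1", "dst_port": 1234, "proto": 6},
    {"ts": 2.0, "src_ip": "10.0.0.1", "src_port": 1234, "dst_ip": "10.0.0.2", "dst_port": 443, "proto": 6},
    {"ts": 1.0, "src_ip": "10.0.0.3", "src_port": 5353, "dst_ip": "10.0.0.2", "dst_port": 53, "proto": 17},
]

rows = flow_features(packets)
```

Each row could then be written out with `csv.DictWriter` to obtain one CSV line per flow, which is the general shape of the CSV files described above.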