Over the years, PDF has been the most widely used document format due to its portability and reliability. Unfortunately, the popularity of PDF and its advanced features have made it an attractive target, and attackers exploit it in numerous ways. Several critical PDF features can be misused to deliver a malicious payload.
In this research, we present a new evasive PDF dataset, Evasive-PDFMal2022, which consists of 10,025 records (5,557 malicious and 4,468 benign) that tend to evade the common significant features found in each class. This makes them harder for common learning algorithms to detect.
Data collection and analysis
We have collected 11,173 malicious files from Contagio, 20,000 malicious files from VirusTotal, and 9,109 benign files from Contagio.
Once collected, we extracted 32 features from each file. After deduplicating the records, we combined the records from the two sources into one final file, which resulted in a dataset more representative of the PDF distribution. We then applied K-means, an unsupervised machine learning algorithm, to cluster the data points into two groups by similarity. Malicious samples that fall into the wrong cluster are taken as the evasive set of malicious records, on the intuition that their features are not similar enough to the rest of the class for them to be clustered with the majority of samples sharing their label.
We applied the same logic to the benign records and combined the results into the new Evasive-PDFMal2022 dataset. The flowchart shows this process.
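The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `two_means` and `evasive_indices`, the deterministic initialisation, and the toy two-dimensional feature vectors are all assumptions made for the example. A sample is treated as evasive when it lands outside the cluster that holds the majority of samples with its label.

```python
from collections import Counter
from math import dist  # available in Python 3.8+


def two_means(points, iterations=50):
    """Tiny 2-cluster K-means (Lloyd's algorithm) over a list of numeric tuples.

    Initialisation is deterministic (an assumption of this sketch): the first
    point and the point farthest from it. Returns one cluster id per point.
    """
    centroids = [points[0], max(points, key=lambda p: dist(p, points[0]))]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def evasive_indices(points, labels):
    """Indices of samples that fall outside the majority cluster of their label."""
    assignment = two_means(points)
    counts = {0: Counter(), 1: Counter()}
    for cluster, label in zip(assignment, labels):
        counts[cluster][label] += 1
    # "Home" cluster of a label: the cluster holding most samples of that label.
    home = {lab: max((0, 1), key=lambda c: counts[c][lab]) for lab in set(labels)}
    return [i for i, (c, lab) in enumerate(zip(assignment, labels)) if c != home[lab]]


# Toy data: two features per sample (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (9.5, 10), (0.5, 0.5)]
labels = ["benign"] * 3 + ["malicious"] * 3 + ["benign", "malicious"]
print(evasive_indices(points, labels))  # [6, 7]
```

In practice the feature columns would need to be scaled before clustering, since K-means is distance based and raw counts and byte sizes live on very different ranges.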
The table lists the 37 static representative features extracted from each PDF file: 12 general features and 25 structural features.
Number of embedded files
Average size of all the embedded media
No. of keywords "stream"
No. of keywords "endstream"
Average stream size
No. of Xref entries
No. of name obfuscations
Total number of filters used
No. of objects with nested filters
No. of stream objects (ObjStm)
No. of keywords "/URI"
No. of keywords "/Action"
No. of keywords "/AA"
No. of keywords "/OpenAction"
No. of keywords "/Launch"
No. of keywords "/SubmitForm"
No. of keywords "/AcroForm"
No. of keywords "/XFA"
No. of keywords "/JBIG2Decode"
No. of keywords "/Colors"
No. of keywords "/RichMedia"
No. of keywords "/Trailer"
No. of keywords "/Xref"
No. of keywords "/Startxref"
1. General features:
These features generally describe the PDF file, such as its size, whether it contains text or images, number of pages, and the title. There are 12 features from this category.
Number of characters in the title: Legitimate PDF files usually have proper, more meaningful titles.
Metadata size: Metadata is the section where information about the PDF file is provided, which can be exploited for embedding hidden contents.
Document Encryption: This feature shows whether the PDF document is password protected or not.
Number of pages: Malicious PDF files tend to have fewer pages (most of them have one blank page) as they are not concerned about content presentation.
Presence of text inside the PDF: As content presentation is not the objective of malicious PDF files, they may include less text.
Size of the whole document: The size of a malicious PDF usually differs from that of a benign one because of differences in page size and content.
Number of embedded files inside the document: PDFs are capable of attaching/embedding different types of files within themselves that might be used for exploitation, including other PDF files, doc files, images, etc.
Average size of all the embedded media: Embedded files in the PDF may be of various sizes depending on what they contain. The average size might lead to an insight into the content of the embedded files.
Number of total objects inside the PDF: As PDFs are made of objects, the number of objects combined with the rest of the features can represent the PDF in general.
Number of font objects: Font objects indicate the types of fonts used for the PDF text.
Presence of a valid PDF Header: As PDF header obfuscation is common for evading anti-virus scans, malicious PDF files tend to modify the header format.
Number of images in the document: PDF files may contain one or any number of images.
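A few of the general features above can be approximated directly from the raw bytes of a file. The sketch below is illustrative only: the function name `general_features`, the dictionary keys, and the byte-level heuristics are assumptions, not the extraction tool used to build the dataset. In particular, it treats a header as valid only when the file begins with `%PDF-` followed by a version number, and it cannot see objects stored inside compressed object streams.

```python
import re

# Header of the form %PDF-1.7 at the very start of the file (strict assumption).
HEADER_RE = re.compile(rb"%PDF-\d\.\d")
# Page objects: "/Type /Page" not followed by a letter, so "/Pages" is excluded.
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def general_features(data: bytes) -> dict:
    """Approximate a handful of general features from raw PDF bytes."""
    return {
        "size_bytes": len(data),
        "valid_header": HEADER_RE.match(data) is not None,
        # An /Encrypt entry in the trailer marks an encrypted document;
        # a plain byte search is only a rough stand-in for parsing the trailer.
        "encrypted": b"/Encrypt" in data,
        "pages": len(PAGE_RE.findall(data)),
    }


sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Pages /Kids [2 0 R] /Count 1 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page /Parent 1 0 R >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF"
)
print(general_features(sample))
```

A dedicated PDF parser would give more reliable values; the point here is only to show how such features map onto concrete markers in the file.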
2. Structural features:
These features describe the PDF file in terms of its structure. They require deeper parsing and provide insight into the overall skeleton of the PDF. We propose a set of 25 features related to the PDF structure.
Number of indirect objects: This might be some indication of an obfuscation attempt.
Number of obfuscations: PDFs support many types of obfuscation, such as hex and octal string encodings, which are generally applied in evasion attempts.
Number of streams: This shows the number of sequences of binary data in the PDF.
Number of endstreams: Keywords that denote the end of the streams.
Average stream size: The average size of the streams, as malicious code may be hidden inside streams.
Number of stream objects (ObjStm): Streams that contain other objects.
Number of Launch keywords: Launch is a keyword that can be used to execute a command or program.
Number of URI keywords: Indicates the presence of a URL to which the PDF file attempts to connect.
Number of Action keywords: Specifies an action to perform upon an event.
Number of AA keywords: AA (additional actions) specifies actions triggered by particular events.
Number of SubmitForm tags: Indicates a PDF button that collects form information and sends it to specified destinations.
Number of AcroForm tags: Acrobat forms are PDF files containing form fields that support scripting technologies that can be misused by attackers.
Total number of filters used: There are various types of compression filters applied on some PDF objects, which also might be exploited by attackers.
Presence of JBIG2Decode filter: JBIG2Decode is a filter commonly used to encode malicious content.
Number of objects with nested filters: Nested filters can be an indication of evasion, as they make the decoding process more difficult.
XFA: XML Forms Architecture (XFA) forms are included in certain PDF files and support scripting technologies that can be misused by attackers.
Colors: Different colors used in the PDF.
Trailer: Number of trailers inside the PDF.
Xref: Number of Xref tables.
Startxref: Number of "startxref" keywords, which denote where the Xref table starts.
Xref entries: The number of entries in the PDF Xref tables, as malformed Xref tables are another common observation in malicious PDF files.
RichMedia: Number of RichMedia keywords, which denotes the number of embedded media and Flash files.
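Many of the structural features above are keyword counts, and name obfuscation matters because a PDF name may encode any of its characters as `#` followed by two hex digits (for example `/L#61unch` for `/Launch`). The sketch below shows one way to count keywords after decoding such escapes. It is a rough approximation under stated assumptions: the function names, the keyword list, and the definition of a name obfuscation (any name containing a hex escape) are chosen for illustration and are not taken from the dataset's extraction tool. It also scans raw bytes, so it cannot see content inside compressed streams.

```python
import re

# A PDF name: "/" followed by characters that are not whitespace or delimiters.
NAME_RE = re.compile(rb"/[^\s/<>\[\]\(\)\{\}%]+")
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")

KEYWORDS = ["/URI", "/Action", "/AA", "/OpenAction", "/Launch", "/SubmitForm",
            "/AcroForm", "/XFA", "/JBIG2Decode", "/Colors", "/RichMedia", "/ObjStm"]


def decode_name(name: bytes) -> bytes:
    """Replace each #xx escape in a PDF name with the byte it encodes."""
    return HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def structural_counts(data: bytes) -> dict:
    """Count selected keywords in raw PDF bytes, decoding obfuscated names first."""
    counts = {kw: 0 for kw in KEYWORDS}
    obfuscated = 0
    for match in NAME_RE.finditer(data):
        raw = match.group(0)
        name = decode_name(raw)
        if name != raw:  # the name contained at least one hex escape
            obfuscated += 1
        key = name.decode("latin-1")
        if key in counts:
            counts[key] += 1
    counts["name_obfuscations"] = obfuscated
    # Bare keywords (no leading slash); lookbehinds keep "endstream" and
    # "startxref" from also being counted as "stream" and "xref".
    counts["stream"] = len(re.findall(rb"(?<!end)stream\b", data))
    counts["endstream"] = data.count(b"endstream")
    counts["xref"] = len(re.findall(rb"(?<!start)xref\b", data))
    counts["startxref"] = data.count(b"startxref")
    return counts


sample = (
    b"1 0 obj\n<< /OpenAction 2 0 R /AA << >> >>\nendobj\n"
    b"2 0 obj\n<< /S /L#61unch /URI (http://example.com) >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"xref\n0 4\ntrailer\n<< >>\nstartxref\n0\n%%EOF"
)
result = structural_counts(sample)
print(result["/Launch"], result["name_obfuscations"])  # 1 1
```

Matching whole names rather than substrings keeps `/AA` from being counted inside longer names, and decoding before matching means an obfuscated `/Launch` is still counted as a Launch keyword.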
You may redistribute, republish, and mirror the Evasive-PDFMal2022 dataset in any form. However, any use or redistribution of data must include a citation to the Evasive-PDFMal2022 dataset and the following paper:
- Maryam Issakhani, Princy Victor, Ali Tekeoglu, and Arash Habibi Lashkari, “PDF Malware Detection Based on Stacking Learning”, The International Conference on Information Systems Security and Privacy, February 2022
We thank the Lockheed Martin Cybersecurity Research Fund (LMCRF) for supporting this project over the last two years.
You can download this dataset from here.