Signature-based detection has long been the cornerstone of traditional malware detection: byte sequences extracted from a binary are compared against signatures of known malware. However, this approach struggles to detect previously unseen polymorphic and metamorphic malware, which continually changes its appearance.
To overcome these limitations, machine learning techniques, especially artificial neural networks, have been adapted for dynamic and adaptive malware detection. Convolutional Neural Networks (CNNs) have shown promising results in classifying malware represented as grayscale images. In this approach, the bytes of a raw malware binary are read as 8-bit values, each of which becomes the intensity of one grayscale pixel (0 to 255), so that the resulting image captures structural characteristics of the malware.
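The byte-to-pixel mapping described above can be illustrated with a minimal sketch. It is one possible conversion, not the algorithm of any particular published work: the fixed image width, the zero padding of the last row, and the function names `bytes_to_grayscale` and `to_pgm` are assumptions made for illustration only.

```python
def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Interpret each byte as one grayscale pixel (0-255), laid out row by row.

    Assumption: a fixed row width; the last row is zero-padded if the
    file size is not a multiple of `width`.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width          # bytes needed to fill the last row
    padded = data + bytes(padding)          # bytes(n) yields n zero bytes
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def to_pgm(rows: list[list[int]]) -> bytes:
    """Serialize the pixel rows as a binary PGM (P5) image with max value 255."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(value for row in rows for value in row)


if __name__ == "__main__":
    sample = bytes([0, 255, 16, 32, 7])     # stand-in for a raw binary
    image = bytes_to_grayscale(sample, width=2)
    print(image)
```

Even in this small sketch, two choices (the row width and the padding value) change the resulting image, which is the kind of undocumented detail the conversion step can hide.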
Although this approach is successful in many cases, the underlying conversion algorithms (that is, how raw binaries are mapped to image representations) are not always well documented. This leaves a gap in understanding how specific instructions or parts of the malware binary may be omitted or abstracted during the transformation, potentially influencing classification accuracy.
The goal of this thesis is to experiment with different conversion algorithms and to assess their impact on common classification metrics (precision, recall, and F1-score). Specifically, the thesis will explore whether excluding the most common or the least common instructions improves the model's ability to generalize. The work will comprise a comprehensive analysis at the opcode, byte, and instruction level, followed by the application of CNNs as well as standard classifiers such as Random Forest to evaluate the resulting image datasets.
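The idea of excluding frequent or rare elements before conversion can be sketched as follows. This is a hedged illustration under simplifying assumptions: it ranks raw byte values rather than disassembled instructions, it computes frequencies within a single input rather than across a whole corpus, and the function name `exclude_by_frequency` and its parameters are hypothetical.

```python
from collections import Counter


def exclude_by_frequency(data: bytes, k: int, most_common: bool = True) -> bytes:
    """Remove every occurrence of the k most (or least) frequent byte values.

    Assumption: byte values stand in for opcodes or instructions, and
    frequencies are measured on this one input only.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    counts = Counter(data)
    sign = -1 if most_common else 1
    # Rank by frequency; ties are broken by byte value so the result is deterministic.
    ranked = sorted(counts, key=lambda value: (sign * counts[value], value))
    excluded = set(ranked[:k])
    return bytes(value for value in data if value not in excluded)


if __name__ == "__main__":
    sample = bytes([0, 0, 0, 1, 1, 2])      # stand-in for a raw binary
    print(list(exclude_by_frequency(sample, 1)))
    print(list(exclude_by_frequency(sample, 1, most_common=False)))
```

The filtered byte stream could then be passed to the image conversion, so that datasets built with different values of `k` can be compared on the same classifiers and metrics.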





