TAKING OVER MALWARE USING MACHINE LEARNING.

A MACHINE LEARNING APPROACH TO OVERCOMING MALWARE ATTACKS.

Data Science meets Cyber Security
InfoSec Write-ups

--

We all know that cyber attacks have grown enormously over the past few years. The number of cyberattacks per week on corporate networks increased by around 50 percent in 2021 compared to 2020, largely because COVID-19 pushed employees to work remotely, outside the office network. But the one question that bugs everyone in the tech world is: how do we stop them? Well, the thing is, we cannot! But at least we can try to protect our systems if something like a “MALWARE ATTACK” ever happens. AGAIN, HOW?

RANSOMWARE INDUSTRY!

Of course, let’s talk numbers, because that’s how people will get excited to read my blog further! 🤌🏻

RANSOMWARE cost the world $20 billion in 2021. Predictions for the next 10 years (i.e. until 2031) say this may rise to $265 billion. According to Google, 37% of all businesses and organisations have been affected by RANSOMWARE.

While hackers are making money from this, it costs organisations and businesses $1.85 million each time to recover from a RANSOMWARE attack.

Yes, way too much to deal with if you open your own company! :D But what if I said that today we have advanced so far in the world of technology that we actually have a way to get out of these situations? YAYA!

Before we get to the solution, let’s first analyse the problem, because analysing things typically makes them easier to grasp.

LET’S UNDERSTAND THE TYPE OF MALWARE:

1. RANSOMWARE:

Ransomware is a type of malware that employs encryption to prevent a target from accessing its data unless a ransom is paid. The target organisation is rendered partially or completely unable to operate until payment is made, but there is no assurance that payment will result in the requisite decryption key or that the decryption key given will work properly.

Speaking of ransomware, it reminds me of the ugliest ransomware attack of all, i.e. the WannaCry ransomware.

WannaCry was a crypto ransomware infection that targeted the Windows PCs. It’s a type of malware that may travel from PC to PC via networks (thus the “worm” component) and then encrypt key data once on a machine (the “crypto” part). The criminals then demand ransom money to release the files.

This disparity indicates that the hackers were primarily interested in causing confusion and terror. However, the monetary damages exceeded the ransom itself. Symantec predicted the WannaCry recovery cost at around $4 billion, which is extremely close to the almost $4.9 billion in total ransomware expenditures in 2020.

In case you want to read more about WannaCry ransomware: Click here!

Kaspersky Article on All you need to know about WannaCry ransomware.

2. FILELESS MALWARE:

File-less malware does not initially install anything; instead, it modifies files that are inherent to the operating system, such as PowerShell or WMI. A file-less assault is not detected by antivirus software because the operating system perceives the modified files as genuine — and because these attacks are covert, they are up to 10 times more successful than regular malware attacks.

Following a brief break, Astaroth resurfaced in early February with considerable alterations to its attack chain. Astaroth is data-stealing malware that leverages a variety of file-less tactics and abuses legitimate programs in order to remain unnoticed on affected PCs.

Microsoft blog on Astaroth file-less malware.

3. SPYWARE:

Spyware gathers data on users’ activity without their knowledge or consent. Passwords, pins, financial information, and unstructured communications are examples of this. Spyware can function in a key app or on a mobile phone, in addition to the desktop browser. Even if the stolen data is not important, the impacts of spyware can reverberate across the firm, degrading performance and eroding productivity.

Some of the most well-known examples of spyware are as follows:

AzorUlt — Capable of stealing banking information, such as passwords and credit card numbers, as well as bitcoin. The AzorUlt virus is primarily distributed through ransomware operations.

Malwarebytes has a very short and sweet blog on AzorUlt spyware; do check it out if interested.

TrickBot — Targets financial information theft.

Here are the most common ways people get manipulated into downloading spyware. BE AWARE!

4. ADWARE:

Adware can be considered a type of spyware; the difference is that adware displays unwanted or harmful advertisements. Although it is generally innocuous, it can be annoying, since “spammy” advertisements continue to crop up while you work, considerably slowing down your computer. Furthermore, these advertisements may unwittingly drive users to download more dangerous forms of malware. To guard against adware, keep your operating system, web browser, and email clients up to date so that they can prevent known adware from downloading and installing.

The most famous adware attack came in 2017, when Fireball adware affected 250 million PCs and devices, hijacking browsers to alter default search engines and track web behaviour. The infection, however, had the potential to be more than just a nuisance.

Checkpoint has an awesome post about FIREBALL ADWARE where they have told and discussed some out of the world stuff to make people more aware of the entire incident.

5. VIRUSES:

Viruses are a sort of malware that takes the shape of a piece of code that is put into an application, programme, or system and is distributed by the victims themselves. Viruses, one of the most frequent forms of malware, are similar to biological viruses in that they require a host, i.e. a device, to exist.

Many different forms of computer viruses are capable of stealing or destroying your data. Here are some of the most prevalent viruses and their characteristics.

This truly excellent post covers everything we’re searching for in terms of a complete grasp of viruses. Check out 7 Different Types of Computer Viruses:

QUICK QUESTION ?

What if someone unexpectedly told you, “I LOVE YOU?”

Sounds strange, right? Well, picture this: a person who in real life takes so long to answer your I LOVE YOU clicks on the I LOVE YOU pop-up on their computer screen without hesitating for a second :D. It’s dark and amusing, but it’s not a joke. IT’S A VIRUS😭😂

The ILOVEYOU virus was capable of destroying any type of content, including images, audio files, and documents. Affected users who did not have backup copies lost them permanently. In March 1999, the Melissa virus, like ILOVEYOU, propagated itself through Outlook address books.

Read more about this virus here: I LOVE YOU is scam

Do you know about the famous JOKER VIRUS ?

What not to do

Do you read terms and conditions ?

6. TROJANS

A trojan programme masquerades as a genuine one, but it is actually harmful. A trojan, unlike a virus or worm, cannot propagate by itself and must be executed by its target. A trojan is typically introduced into your network by email or as a link on a website. Trojans are more difficult to combat because they rely on social engineering to disseminate and download. The simplest approach to avoid trojans is to never download or install software from an unknown source. Instead, ensure that staff only download software from trusted developers and app marketplaces that you have pre-approved.

Reminds me of The Trojan Horse Storm Worm

Back in early January 2007, much of Europe was paralysed by Storm Kyrill. This Trojan horse was originally spread through emails on the theme of the storm, hence the name “Storm Worm.”

Want to know more about STORM WORM TROJAN : Click here!

7. BOTS

Bots are made up of algorithms that assist them in carrying out their tasks. The many sorts of bots are created differently to perform a wide range of tasks.

A bot is a computer programme that performs an automated task without human intervention. A computer infected with a bot can propagate the bot to other machines, forming a botnet. This network of bot-infected PCs may then be managed and utilised by hackers to conduct enormous assaults, frequently without the device owners being aware of their involvement. Bots are capable of huge attacks, such as the 2016 distributed denial of service (DDoS) attack on the DNS provider Dyn, which made many major websites unreachable across much of the eastern United States.

REVISE DDoS here quickly !!!!

On March 5, 2018, Arbor Networks reported that an unnamed customer of a US-based service provider had been the target of the largest DDoS attack recorded to that point, with a peak rate of around 1.7 terabits per second. The previous record had been set just a few days earlier, on February 28, 2018, when GitHub was struck with a 1.35 terabit per second attack.

UNDERSTANDING THE PROBLEM!

The problem is that we have 100 problems in the cyber world, and ransomware is just 10% of them, but that 10% is enough to make companies cry!

Keeping up with the trends, hackers nowadays are finding brand-new ways of creating profitable malware using artificial intelligence and machine learning. The problem with ransomware is that it is difficult, next to impossible, to stop, so what we do is play UNO REVERSE on them!

IN CASE YOU ARE WONDERING HOW? 😁

Here is where CYBERSECURITY actually meets the MACHINE LEARNING approach to solve our problem. If hackers are smart enough to make profitable malware using machine learning, then we can stay one step ahead of them by learning the exact patterns through which the malware was created, and eventually move forward and try to find a solution!

Now let’s see what exactly should come to mind when we say that machine learning can help us stop a malware attack.

DEALING with WHAT, WHY and HOW?

WHAT IS MACHINE LEARNING EXACTLY?

Machine learning refers to a collection of techniques that enable computers to “learn without being explicitly programmed.”

In other words, a machine learning algorithm identifies and formalises the principles underlying the data that it encounters. The algorithm can analyse the qualities of previously unseen samples using this information. A previously unknown sample might be a new file in malware detection. Its concealed characteristic might be malicious or harmless. The model is a mathematically structured collection of principles underlying data attributes.

Machine learning employs a wide range of techniques for problem solving rather than a single strategy. These techniques have different capabilities and tasks they are best suited for.

Machine learning is not the first method to be employed for malware detection. Initially, malware detection on computers was based on heuristic characteristics that detected specific malware files by:

  1. Code fragments
  2. Hashes of code fragments or the entire file
  3. File attributes
  4. And combinations of these elements.
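
To make the hash-based idea concrete, here is a minimal sketch using only Python's standard library. The signature set and the sample bytes are made up for illustration; a real scanner would load its hashes from a threat-intelligence feed.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files.
# Here it holds the digest of a harmless placeholder byte string.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"pretend this is a malicious payload").hexdigest(),
}

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def is_known_malware(data: bytes) -> bool:
    """Flag the file only if its whole-file hash is in the signature set."""
    return sha256_of(data) in KNOWN_BAD_SHA256

original = b"pretend this is a malicious payload"
variant = original + b"\x00"  # a single appended byte changes the hash

print(is_known_malware(original))  # True
print(is_known_malware(variant))   # False
```

The second result shows the weakness of pure signature matching: the smallest change to a file slips past the check, which is one reason learned models are layered on top of it.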

Given the aggressive growth of cyber attacks, and because today’s sophisticated attack vectors are so broad, cybersecurity solutions must combine with data science to provide protection at various levels. Machine learning-based detection works in tandem with other forms of malware detection in a multi-layered approach to current cybersecurity defence.

WHY MACHINE LEARNING?

Keeping it simple: rather than dealing directly with raw malware, standard machine learning algorithms preprocess the executable to extract a collection of characteristics that provide an abstract picture of the programme. The characteristics are then utilised to train a model to solve the issue at hand.
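
As a rough sketch of what "extracting characteristics" can look like, the snippet below reads a few header fields from a Windows PE file with the standard library and adds a byte-entropy value. The choice of features and the `extract_features` helper are assumptions for illustration, not the exact feature set of the dataset used later.

```python
import math
import struct
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_features(data: bytes) -> dict:
    """Read a few COFF header fields from a PE file and add simple statistics."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ magic")
    # Offset 0x3C of the DOS header stores the position of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    machine, n_sections, _, _, _, opt_size, characteristics = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    return {
        "Machine": machine,
        "NumberOfSections": n_sections,
        "SizeOfOptionalHeader": opt_size,
        "Characteristics": characteristics,
        "FileSize": len(data),
        "Entropy": byte_entropy(data),
    }

# Build a tiny synthetic header (not a working executable) to exercise the parser.
dos_header = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64)
coff_header = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 224, 0x0102)
sample = dos_header + b"PE\x00\x00" + coff_header

features = extract_features(sample)
print(features["Machine"], features["NumberOfSections"])  # 332 3
```

Each file becomes one row of numbers like this, and those rows are what the model is trained on. For real files, a dedicated parser such as pefile handles far more of the format than this sketch does.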

HOW ?

LET’S START WITH ALL THE CODE AND ALGORITHMS

PROBLEM STATEMENT:

In recent years, the malware market has evolved so fast that syndicates have invested substantially in technology to circumvent traditional defences, prompting anti-malware groups/communities to develop increasingly strong tools to identify and stop these attacks. The most important aspect of safeguarding a computer system from a malware attack is determining if a particular file/software is malware.

DEPENDENCIES:

1. PANDAS:

Pandas is one of the most commonly used tools for data science and machine learning; it is used for data cleaning and analysis.

pip install pandas

2. NUMPY:

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

pip install numpy

3. PICKLE:

To test the model on an unknown file, the features of the provided file must be extracted. The Python pefile library is used to build the feature vector, and the already trained ML model, saved and reloaded with pickle, predicts the class of the provided file.

pickle is part of the Python standard library and needs no installation; pefile is installed with:

pip install pefile

4. SCIPY:

SciPy, one of the most popular libraries, is used for more advanced computations.

pip install scipy

5. SCIKIT:

Scikit-learn is used for machine learning, data mining and data analysis.

pip install -U scikit-learn

STARTING WITH MODEL-BUILDING:

IMPORTING THE LIBRARIES:

import os
import sys
import pickle

import joblib
import numpy as np
import pandas as pd
import sklearn.ensemble as ek
from sklearn import preprocessing, svm
from sklearn import tree  # needed for tree.DecisionTreeClassifier below
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

print(sys.path)  # optional: shows where Python looks for installed packages

LOADING THE DATASET:

dataset = pd.read_csv('data.csv')
# Adjust the path if you are using Google Colab; if you are a Jupyter notebook person, go with me.
dataset.head()  # Checking column names and data, of course.
dataset.describe()  # Pandas describe() is used to view some basic statistical details like percentile, mean, std etc. of a data frame or a series of numeric values.

NUMBER OF MALICIOUS FILES VS LEGITIMATE FILES:

dataset.groupby(dataset['legitimate']).size()  # Pandas’ groupby() allows us to split data into separate groups to perform computations for better analysis.

DROPPING COLUMNS WHICH ARE NOT MUCH IN USE:

X = dataset.drop(['Name', 'md5', 'legitimate'], axis=1).values
y = dataset['legitimate'].values

EXTRATREE CLASSIFIER, WHY AND HOW?👇🏻

Extra Trees generates a huge number of unpruned decision trees from the
training dataset. In the case of regression, predictions are formed by
averaging the forecast of the decision trees, and in the case of
classification, predictions are made by employing majority voting.

ExtraTreesClassifier uses averaging to increase prediction accuracy and
control over-fitting by fitting a number of randomised decision trees
(a.k.a. extra-trees) on various sub-samples of the dataset.

ExtraTreesClassifier assists in identifying the necessary features for
categorising a file as malicious or legitimate. In our run it identified 14 attributes as necessary; the exact number can vary between runs because the trees are randomised.

extratrees = ek.ExtraTreesClassifier().fit(X, y)
model = SelectFromModel(extratrees, prefit=True)  # keep only the important features
X_new = model.transform(X)
nbfeatures = X_new.shape[1]
nbfeatures  # To identify the total number of features.
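
If you do not have data.csv at hand, the same selection step can be tried on synthetic data. The sketch below is an illustration under made-up assumptions (500 random samples, 10 features, only the first two carrying signal), not the blog's dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 500 samples, 10 features.
X_demo = rng.normal(size=(500, 10))
# Only features 0 and 1 decide the label; the other eight are noise.
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# prefit=True reuses the fitted forest; the default threshold keeps features
# whose importance is at or above the mean importance.
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X_demo)

kept = np.flatnonzero(selector.get_support())
print("importances:", np.round(forest.feature_importances_, 3))
print("kept feature indices:", kept)
print("shape before/after:", X_demo.shape, X_selected.shape)
```

The noise columns still receive a little importance, but they fall below the mean and are dropped, which is the same mechanism that trims the malware dataset down to its most useful columns.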

CROSS VALIDATION:

Cross-validation is a strategy for verifying model efficiency that
involves training the model on a portion of input data and testing it
on a previously unknown subset of input data. It is also a tool for
determining how well a statistical model generalises to an independent
dataset.

Here the dataset is divided into random train and test subsets with
train_test_split, i.e. a single hold-out split. The proportion of the dataset
to include in the test split is set by test_size=0.2.

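# train_test_split was imported from sklearn.model_selection above
# (the old sklearn.cross_validation module no longer exists).
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)  # Splitting and dividing the dataset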
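
A single 80/20 split and k-fold cross-validation are related but not the same thing. The sketch below contrasts the two on synthetic data; the data, the decision tree and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Hold-out split: one 80/20 partition, so a single score.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
holdout = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("hold-out accuracy:", holdout.score(X_te, y_te))

# 5-fold cross-validation: every sample is used for testing exactly once.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_demo, y_demo, cv=5
)
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", scores.mean())
```

The hold-out score depends on which rows happen to land in the test set, while the mean over folds is usually a steadier estimate, at the cost of training the model several times.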

FEATURE EXTRACTION- WHICH WAS IDENTIFIED BY EXTRATREE CLASSIFIER:

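features = []
# Indices of the selected features, most important first.
index = np.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]
for f in range(nbfeatures):
    # +2 skips the 'Name' and 'md5' columns that were dropped from X.
    print("%d. feature %s (%f)" % (f + 1, dataset.columns[2 + index[f]],
                                   extratrees.feature_importances_[index[f]]))
    features.append(dataset.columns[2 + index[f]])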

BUILDING THE MACHINE LEARNING MODEL:

model = {
    "DecisionTree": tree.DecisionTreeClassifier(max_depth=10),
    "RandomForest": ek.RandomForestClassifier(n_estimators=50),
    "Adaboost": ek.AdaBoostClassifier(n_estimators=50),
    "GradientBoosting": ek.GradientBoostingClassifier(n_estimators=50),
    "GNB": GaussianNB(),
    # A regressor, not a classifier: its score() is R^2 rather than accuracy.
    "LinearRegression": LinearRegression(),
}

NOW WE’LL TRAIN EACH MODEL WITH X_TRAIN AND TEST IT WITH X_TEST:

results = {}
for algo in model:
    clf = model[algo]
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("%s : %s " % (algo, score))
    results[algo] = score

# Pick the algorithm with the highest test score.
winner = max(results, key=results.get)

CALCULATING THE FALSE POSITIVE AND NEGATIVE OF DATASET:

Recommended blog for understanding the importance of false positive and negative, while working with the dataset:

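clf = model[winner]
res = clf.predict(X_new)  # predictions for the whole dataset, training rows included
mt = confusion_matrix(y, res)  # rows: true class, columns: predicted class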
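
To turn a confusion matrix into false positive and false negative rates, something like the sketch below can be used. It runs on toy labels and assumes that 1 means legitimate and 0 means malicious in the 'legitimate' column, and that "positive" means "flagged as malware"; swap the rows if your convention differs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels using the assumed convention: 1 = legitimate, 0 = malicious.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
caught, missed = cm[0]        # true malware: detected / predicted legitimate
false_alarm, cleared = cm[1]  # true legitimate: flagged as malware / passed

# "Positive" here means "flagged as malware" (an assumption of this sketch).
false_negative_rate = missed / (caught + missed)
false_positive_rate = false_alarm / (false_alarm + cleared)

print(cm)
print("false negative rate:", false_negative_rate)  # 0.25
print("false positive rate:", false_positive_rate)
```

In malware detection a false negative (malware waved through) is usually the costlier mistake, so it is worth watching that rate separately instead of relying on accuracy alone.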

Following that is the testing phase, in which we actually test a malware file and determine whether or not our model is capable of detecting it. You can try building your own testing model and let us know how it goes.
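
As a starting point for that testing phase, here is a minimal sketch of saving a trained model with pickle, loading it back and classifying one new feature vector. The synthetic training data, the decision tree and the sample vector are all assumptions for illustration; with the real pipeline, the vector would come from the features extracted from the file under test.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic training data standing in for the selected features.
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)  # 1 = legitimate, 0 = malicious (assumed)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Serialise the trained model; in practice the bytes would be written to a file.
blob = pickle.dumps(clf)

# Later, possibly in another process: restore the model and classify a new sample.
restored = pickle.loads(blob)
new_sample = np.array([[2.0, 0.0, 0.0, 0.0]])  # hypothetical feature vector
label = restored.predict(new_sample)[0]
print("legitimate" if label == 1 else "malicious")
```

Only unpickle model files you created or trust, because loading a pickle can execute arbitrary code, which would be an ironic way to get infected while building a malware detector.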

LISTING SOME AMAZING REFERENCE FROM AMAZING PEOPLE OUT THERE:

1. https://github.com/dchad/malware-detection

2. https://github.com/manish-vi/malware_detection

RESEARCH PAPER-1

RESEARCH PAPER-2

If you have any questions, please contact us and we will gladly assist you!❤️

- Team Data Science Meets Cyber Security ❤️💙
