# Description

This notebook demonstrates how to clean erroneous data from the `MachineLearningCVE` files of the CICIDS2017 dataset.

*Author*: **Mahendra Data** mahendra.data@dbms.cs.kumamoto-u.ac.jp

*License*: **BSD 3-Clause**

# Mounting Google Drive

We mount Google Drive to read the downloaded dataset archive and to save the cleaned files.

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


# Unzip the dataset

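Extract `MachineLearningCVE.zip` from Google Drive into the working directory. The `-n` flag keeps `unzip` from overwriting files that already exist.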

In [2]:
!unzip -n "/content/drive/My Drive/CICIDS2017/MachineLearningCVE.zip"

Archive:  /content/drive/My Drive/CICIDS2017/MachineLearningCVE.zip
   creating: MachineLearningCVE/
  inflating: MachineLearningCVE/Wednesday-workingHours.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Tuesday-WorkingHours.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Monday-WorkingHours.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Friday-WorkingHours-Morning.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv  
  inflating: MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv  


Eight CSV files are extracted from this archive:

1. `Monday-WorkingHours.pcap_ISCX.csv`
2. `Tuesday-WorkingHours.pcap_ISCX.csv`
3. `Wednesday-workingHours.pcap_ISCX.csv`
4. `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv`
5. `Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv`
6. `Friday-WorkingHours-Morning.pcap_ISCX.csv`
7. `Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv`
8. `Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv`
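Before reading any file, it can be useful to confirm that all eight files were actually extracted. The following is a minimal sketch, not part of the original notebook; the helper name `find_missing_files` is hypothetical, and it assumes the files sit directly inside the extraction directory.

```python
import os

# The eight file names listed above
EXPECTED_FILES = [
    "Monday-WorkingHours.pcap_ISCX.csv",
    "Tuesday-WorkingHours.pcap_ISCX.csv",
    "Wednesday-workingHours.pcap_ISCX.csv",
    "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
    "Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv",
    "Friday-WorkingHours-Morning.pcap_ISCX.csv",
    "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
    "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
]


def find_missing_files(dir_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dir_path (hypothetical helper)."""
    # A missing directory is treated as "no files present"
    present = set(os.listdir(dir_path)) if os.path.isdir(dir_path) else set()
    return [name for name in expected if name not in present]
```

In this notebook it would be called as `find_missing_files("MachineLearningCVE")`; an empty list means every expected file is present.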

In [3]:
import os
import pandas as pd

In [4]:
file_name = os.path.join("MachineLearningCVE", "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv")

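# Read the dataset, skipping malformed rows.
# `on_bad_lines="skip"` replaces `error_bad_lines=False`, which was removed in pandas 2.0.
df = pd.read_csv(file_name, skipinitialspace=True, on_bad_lines="skip")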

# Change the unrecognized character in the class label

The file `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` contains an unrecognized character (displayed as `�`) in three of its class labels.

We replace this character, together with the spaces around it, with `-`.
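To see which labels are affected, one option is to list every distinct label that contains a non-ASCII character. This is a sketch on a toy DataFrame rather than the real file; the helper `non_ascii_labels` is hypothetical, and it assumes the unrecognized character is the Unicode replacement character U+FFFD.

```python
import pandas as pd


def non_ascii_labels(labels: pd.Series) -> list:
    """Return the distinct label values containing a non-ASCII character (hypothetical helper)."""
    unique = labels.dropna().astype(str).unique()
    return sorted(label for label in unique if not label.isascii())


# Toy data standing in for the real `Label` column (assumption: U+FFFD is the bad character)
toy = pd.DataFrame({"Label": ["BENIGN",
                              "Web Attack \ufffd XSS",
                              "BENIGN",
                              "Web Attack \ufffd Brute Force"]})

print(non_ascii_labels(toy["Label"]))
```

Applied to the real data, the call would be `non_ascii_labels(df["Label"])`.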

In [5]:
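def _renaming_class_label(df: pd.DataFrame) -> None:
    """Replace the unrecognized character in the Web Attack labels with `-`."""
    labels = {"Web Attack � Brute Force": "Web Attack-Brute Force",
              "Web Attack � XSS": "Web Attack-XSS",
              "Web Attack � Sql Injection": "Web Attack-Sql Injection"}

    # Assign the result back to the column: a chained `inplace=True` call on
    # `df.Label` is not guaranteed to update `df` in recent pandas versions.
    df["Label"] = df["Label"].replace(labels)

# Renaming labels
_renaming_class_label(df)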

# Saving the dataset

In [6]:
# Overwrite the original CSV with the renamed labels
df.to_csv(file_name, index=False)

# Combine all datasets

In [7]:
DIR_PATH = "MachineLearningCVE"

FILE_NAMES = ["Monday-WorkingHours.pcap_ISCX.csv",
              "Tuesday-WorkingHours.pcap_ISCX.csv",
              "Wednesday-workingHours.pcap_ISCX.csv",
              "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
              "Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv",
              "Friday-WorkingHours-Morning.pcap_ISCX.csv",
              "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
              "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"]

In [8]:
# Read every file, then concatenate them into a single DataFrame
frames = [pd.read_csv(os.path.join(DIR_PATH, f), skipinitialspace=True) for f in FILE_NAMES]
df = pd.concat(frames, ignore_index=True)
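By default `pd.concat` aligns columns by name, so a file whose header differs even slightly (for example by a trailing space) would add an extra, mostly empty column instead of raising an error. The sketch below shows one way to check for this before concatenating. It is not part of the original notebook: the helper `columns_match` is hypothetical, and the toy column names are only illustrative.

```python
import pandas as pd


def columns_match(frames) -> bool:
    """Return True if every DataFrame has the same columns in the same order (hypothetical helper)."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames[1:])


# Toy frames standing in for the per-day CSV files
a = pd.DataFrame({"Flow Duration": [1, 2], "Label": ["BENIGN", "DDoS"]})
b = pd.DataFrame({"Flow Duration": [3], "Label": ["PortScan"]})
c = pd.DataFrame({"Flow Duration ": [4], "Label": ["BENIGN"]})  # trailing space in the header

print(columns_match([a, b]))  # headers agree
print(columns_match([a, c]))  # headers differ

# With matching headers, the rows are stacked and re-indexed from 0
combined = pd.concat([a, b], ignore_index=True)
```

Passing `skipinitialspace=True` to `pd.read_csv`, as the notebook does, already strips whitespace that follows a delimiter, which helps keep the headers consistent across files.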

In [9]:
df.Label.value_counts()

BENIGN                      2273097
DoS Hulk                     231073
PortScan                     158930
DDoS                         128027
DoS GoldenEye                 10293
FTP-Patator                    7938
SSH-Patator                    5897
DoS slowloris                  5796
DoS Slowhttptest               5499
Bot                            1966
Web Attack-Brute Force         1507
Web Attack-XSS                  652
Infiltration                     36
Web Attack-Sql Injection         21
Heartbleed                       11
Name: Label, dtype: int64
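The counts above add up to 2,830,743 rows, of which `BENIGN` accounts for roughly 80%. Relative frequencies make this imbalance easier to read. The sketch below uses toy labels rather than the real data; `value_counts(normalize=True)` is a standard pandas option, while the variable names are only illustrative.

```python
import pandas as pd

# Toy labels standing in for the real `Label` column
toy_labels = pd.Series(["BENIGN"] * 8 + ["DoS Hulk"] * 2, name="Label")

# normalize=True returns each class's share of the rows instead of raw counts
shares = toy_labels.value_counts(normalize=True)
print(shares)
```

On the combined dataset, the equivalent call would be `df.Label.value_counts(normalize=True)`.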

In [10]:
df.to_csv(os.path.join(DIR_PATH, "MachineLearningCVE.csv"), index=False)

Copy the cleaned files back to Google Drive.

In [11]:
!cp -r "MachineLearningCVE/" "/content/drive/My Drive/CICIDS2017/"

The dataset is now saved in the `CICIDS2017` folder of your Google Drive.