# Description

This notebook demonstrates how to clean erroneous data from the `GeneratedLabelledFlows` files of the CICIDS2017 dataset.

*Author*: **Mahendra Data** mahendra.data@dbms.cs.kumamoto-u.ac.jp

License: **BSD 3 clause**

# Mounting Google Drive

The dataset zip file is read from Google Drive, and the cleaned dataset is saved back to it.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Unzip the dataset

Unzip `GeneratedLabelledFlows.zip` and remove the extra space character at the end of the extracted folder name.

In [None]:
!unzip -n "/content/drive/My Drive/CICIDS2017/GeneratedLabelledFlows.zip"
!mv TrafficLabelling\ / TrafficLabelling

Archive:  /content/drive/My Drive/CICIDS2017/GeneratedLabelledFlows.zip
   creating: TrafficLabelling /
  inflating: TrafficLabelling /Wednesday-workingHours.pcap_ISCX.csv  
  inflating: TrafficLabelling /Tuesday-WorkingHours.pcap_ISCX.csv  
  inflating: TrafficLabelling /Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv  
  inflating: TrafficLabelling /Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv  
  inflating: TrafficLabelling /Monday-WorkingHours.pcap_ISCX.csv  
  inflating: TrafficLabelling /Friday-WorkingHours-Morning.pcap_ISCX.csv  
  inflating: TrafficLabelling /Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv  
  inflating: TrafficLabelling /Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv  
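
Running the `mv` command a second time, after the folder has already been renamed, will not behave the same way as the first run. As a minimal sketch, assuming a Linux environment such as Colab, the rename could be done in Python so that the step is safe to repeat. The helper name `strip_trailing_space` and its parameters are hypothetical and not part of the original notebook.

```python
import os


def strip_trailing_space(parent: str, name: str = "TrafficLabelling ") -> str:
    """Rename `name` inside `parent` to its stripped form, if still needed.

    Hypothetical helper: it does nothing when the folder with the trailing
    space is missing or when the stripped name already exists.
    """
    source = os.path.join(parent, name)
    target = os.path.join(parent, name.strip())
    if os.path.isdir(source) and not os.path.exists(target):
        os.rename(source, target)
    return target
```

Calling `strip_trailing_space(".")` after unzipping would then return the path of the cleaned folder name, whether or not the rename had already happened.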


There are eight files extracted from this zip file.

1. `Monday-WorkingHours.pcap_ISCX.csv`
2. `Tuesday-WorkingHours.pcap_ISCX.csv`
3. `Wednesday-workingHours.pcap_ISCX.csv`
4. `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv`
5. `Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv`
6. `Friday-WorkingHours-Morning.pcap_ISCX.csv`
7. `Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv`
8. `Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv`

# Change the encoding to utf-8

File `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` is encoded in Latin-1 (`latin1`). We should convert it to UTF-8 like the other files.

Now, import the libraries.

In [None]:
import os
import codecs
import pandas as pd

Change the encoding to utf-8.

In [None]:
def _to_utf8(filename: str, encoding="latin1", blocksize=1048576):
    tmpfilename = filename + ".tmp"
    with codecs.open(filename, "r", encoding) as source:
        with codecs.open(tmpfilename, "w", "utf-8") as target:
            while True:
                contents = source.read(blocksize)
                if not contents:
                    break
                target.write(contents)

    # Replace the original file (os.replace overwrites the target on every platform)
    os.replace(tmpfilename, filename)

In [None]:
file_name = os.path.join("TrafficLabelling", "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv")

_to_utf8(file_name)
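
Because `_to_utf8` decodes the file as Latin-1 unconditionally, running it twice on the same file would corrupt any non-ASCII characters. A minimal sketch of a check that could guard against this is shown below. The function name `is_valid_utf8` is hypothetical; it only tests whether the file already decodes as UTF-8.

```python
import codecs


def is_valid_utf8(path: str, blocksize: int = 1048576) -> bool:
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as source:
            while True:
                chunk = source.read(blocksize)
                if not chunk:
                    break
                # The incremental decoder copes with multi-byte characters
                # that are split across two blocks.
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

One possible use is to call `_to_utf8(file_name)` only when `is_valid_utf8(file_name)` returns `False`.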

# Removing rows with only NaN values

File `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` contains rows that hold only NaN values. We should remove them.

In [None]:
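# Read dataset, skipping malformed lines
# (on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv(file_name, skipinitialspace=True, on_bad_lines="skip")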

# Show number of NaN rows
print("Removing {} rows that contains only NaN values...".format(df[df.isna().all(axis=1)].shape[0]))

# Remove NaN rows
df = df[~ df.isna().all(axis=1)]

Removing 288602 rows that contains only NaN values...
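
To illustrate what the mask above selects, here is a small sketch on a made-up DataFrame (the values are invented for illustration and are not taken from the dataset). It also shows that `dropna(how="all")` keeps the same rows, so either form could be used.

```python
import pandas as pd

# Toy frame with invented values: the middle row is entirely missing,
# the last row is only partially missing.
toy = pd.DataFrame({"Flow ID": ["a", None, "c"],
                    "Label": ["BENIGN", None, None]})

mask = toy.isna().all(axis=1)            # True only where every column is NaN
kept_by_mask = toy[~mask]                # the approach used in the cell above
kept_by_dropna = toy.dropna(how="all")   # built-in equivalent

print(int(mask.sum()))                       # number of all-NaN rows
print(kept_by_mask.equals(kept_by_dropna))   # both approaches keep the same rows
```

Only the fully empty row is dropped; the partially filled row is kept by both approaches.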


# Change the unrecognized character in the class label

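File `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` contains a character in the class labels that is not displayed correctly: the byte `0x96`, which Latin-1 decodes as a non-printable control character.

Replace this character and the spaces around it with `-`.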

In [None]:
def _renaming_class_label(df: pd.DataFrame) -> pd.DataFrame:
    labels = {"Web Attack \x96 Brute Force": "Web Attack-Brute Force",
              "Web Attack \x96 XSS": "Web Attack-XSS",
              "Web Attack \x96 Sql Injection": "Web Attack-Sql Injection"}

    # Return a new DataFrame instead of calling replace(inplace=True) on
    # df.Label, which may not update the DataFrame in recent pandas versions
    return df.assign(Label=df["Label"].replace(labels))

# Renaming labels
df = _renaming_class_label(df)
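
The `\x96` in the labels above comes from how the raw byte is decoded. The sketch below uses only the standard library and a hand-written byte string (not read from the dataset) to show that the same byte is a control character under Latin-1 but an en dash under Windows-1252. Whether the original file was actually written as Windows-1252 is an assumption; the notebook itself treats it as Latin-1.

```python
import unicodedata

raw = b"Web Attack \x96 XSS"       # hand-written sample, not read from the file

as_latin1 = raw.decode("latin1")   # interpretation used by this notebook
as_cp1252 = raw.decode("cp1252")   # alternative interpretation (assumption)

# Position 11 is the character right after "Web Attack ".
print(unicodedata.category(as_latin1[11]))   # category of the Latin-1 result
print(unicodedata.name(as_cp1252[11]))       # name of the Windows-1252 result

# Same clean-up as the label mapping above, applied to a plain string.
cleaned = as_latin1.replace(" \x96 ", "-")
print(cleaned)
```

Either interpretation leads to the same cleaned label once the character and its surrounding spaces are replaced with `-`, provided the replacement targets the character that the chosen encoding produced.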

# Saving the dataset

In [None]:
# Save to csv
df.to_csv(file_name, index=False)

# Combine all datasets

In [None]:
DIR_PATH = "TrafficLabelling"

FILE_NAMES = ["Monday-WorkingHours.pcap_ISCX.csv",
              "Tuesday-WorkingHours.pcap_ISCX.csv",
              "Wednesday-workingHours.pcap_ISCX.csv",
              "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
              "Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv",
              "Friday-WorkingHours-Morning.pcap_ISCX.csv",
              "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
              "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"]

In [None]:
df = [pd.read_csv(os.path.join(DIR_PATH, f), skipinitialspace=True) for f in FILE_NAMES]
df = pd.concat(df, ignore_index=True)
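
`pd.concat` aligns columns by name and fills missing ones with NaN, so a file with a slightly different header would not raise an error. As a hedged sketch, the headers could be compared before concatenating. The helper names `read_header` and `find_header_mismatches` are hypothetical, and the code assumes every file is UTF-8 encoded and non-empty.

```python
import csv
import os
from typing import Dict, List


def read_header(path: str) -> List[str]:
    """Return the first CSV row of `path` with surrounding spaces stripped."""
    with open(path, newline="", encoding="utf-8") as source:
        return [name.strip() for name in next(csv.reader(source))]


def find_header_mismatches(dir_path: str,
                           file_names: List[str]) -> Dict[str, List[str]]:
    """Map each file whose header differs from the first file's to its header."""
    reference = read_header(os.path.join(dir_path, file_names[0]))
    mismatches = {}
    for name in file_names[1:]:
        header = read_header(os.path.join(dir_path, name))
        if header != reference:
            mismatches[name] = header
    return mismatches
```

With the variables defined above, `find_header_mismatches(DIR_PATH, FILE_NAMES)` would return an empty dictionary when all eight files share the same column names.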

In [None]:
df.Label.value_counts()

BENIGN                      2273097
DoS Hulk                     231073
PortScan                     158930
DDoS                         128027
DoS GoldenEye                 10293
FTP-Patator                    7938
SSH-Patator                    5897
DoS slowloris                  5796
DoS Slowhttptest               5499
Bot                            1966
Web Attack-Brute Force         1507
Web Attack-XSS                  652
Infiltration                     36
Web Attack-Sql Injection         21
Heartbleed                       11
Name: Label, dtype: int64

In [None]:
df.to_csv(os.path.join(DIR_PATH, "TrafficLabelling.csv"), index=False)

Copy to Google Drive.

In [None]:
!cp -r "TrafficLabelling/" "/content/drive/My Drive/CICIDS2017/"

The dataset is now saved in the `CICIDS2017` folder of your Google Drive.