{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.6"},"colab":{"name":"2.1 Preprocessing GeneratedLabelledFlows .ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"ovbd1i5W0yqa","colab_type":"text"},"source":["# Description\n","\n","This notebook demonstrates how to clean erroneous data from the GeneratedLabelledFlows files of the CICIDS2017 dataset.\n","\n","*Author*: **Mahendra Data** mahendra.data@dbms.cs.kumamoto-u.ac.jp\n","\n","License: **BSD 3 clause**"]},{"cell_type":"markdown","metadata":{"id":"QHdWG9Ol00PE","colab_type":"text"},"source":["# Mounting Google Drive\n","\n","We will save the downloaded dataset to Google Drive."]},{"cell_type":"code","metadata":{"id":"8Q4fip2H02Pd","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":34},"executionInfo":{"status":"ok","timestamp":1597048196551,"user_tz":-540,"elapsed":575,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"178fe5c9-8916-4e08-f34b-92bb312385ff"},"source":["from google.colab import drive\n","drive.mount(\"/content/drive\")"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"XFnBJw9Q0yqu","colab_type":"text"},"source":["# Unzip the dataset\n","\n","Unzip the `GeneratedLabelledFlows.zip` and remove the extra space character at the end of the extracted folder 
name."]},{"cell_type":"code","metadata":{"id":"exri52j_0yqu","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":187},"executionInfo":{"status":"ok","timestamp":1597048209972,"user_tz":-540,"elapsed":13984,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"88445fc8-e1b7-4c0f-aabf-69932a0200c3"},"source":["!unzip -n \"/content/drive/My Drive/CICIDS2017/GeneratedLabelledFlows.zip\"\n","!mv TrafficLabelling\\ / TrafficLabelling"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Archive: /content/drive/My Drive/CICIDS2017/GeneratedLabelledFlows.zip\n"," creating: TrafficLabelling /\n"," inflating: TrafficLabelling /Wednesday-workingHours.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Tuesday-WorkingHours.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Monday-WorkingHours.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Friday-WorkingHours-Morning.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv \n"," inflating: TrafficLabelling /Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv \n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"t93bkeYg0yqy","colab_type":"text"},"source":["There are eight files extracted from this zip file.\n","\n","1. `Monday-WorkingHours.pcap_ISCX.csv`\n","2. `Tuesday-WorkingHours.pcap_ISCX.csv`\n","3. `Wednesday-workingHours.pcap_ISCX.csv`\n","4. `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv`\n","5. `Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv`\n","6. `Friday-WorkingHours-Morning.pcap_ISCX.csv`\n","7. `Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv`\n","8. 
`Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv`"]},{"cell_type":"markdown","metadata":{"id":"YEDIqro30yq9","colab_type":"text"},"source":["# Change the encoding to utf-8\n","\n","File `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` is encoded in latin1. We should convert it to utf-8 like the other files.\n","\n","Now, import the libraries."]},{"cell_type":"code","metadata":{"id":"t13afI5a0yq0","colab_type":"code","colab":{}},"source":["import os\n","import codecs\n","import pandas as pd"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Che9Zr53e2bI","colab_type":"text"},"source":["Change the encoding to utf-8."]},{"cell_type":"code","metadata":{"id":"f2c-Ert-owgx","colab_type":"code","colab":{}},"source":["def _to_utf8(filename: str, encoding=\"latin1\", blocksize=1048576):\n","    tmpfilename = filename + \".tmp\"\n","    with codecs.open(filename, \"r\", encoding) as source:\n","        with codecs.open(tmpfilename, \"w\", \"utf-8\") as target:\n","            while True:\n","                contents = source.read(blocksize)\n","                if not contents:\n","                    break\n","                target.write(contents)\n","\n","    # replace the original file\n","    os.rename(tmpfilename, filename)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"vyVLGuSCkEqM","colab_type":"code","colab":{}},"source":["file_name = os.path.join(\"TrafficLabelling\", \"Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv\")\n","\n","_to_utf8(file_name)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"cZvknYlAWZgO","colab_type":"text"},"source":["# Removing rows with only NaN values\n","\n","File `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` contains rows that contain only NaN values. 
We should remove them."]},{"cell_type":"code","metadata":{"id":"VUsycvDFWYzU","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":68},"executionInfo":{"status":"ok","timestamp":1597048213266,"user_tz":-540,"elapsed":17251,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"361fb506-c5f3-479d-a440-8e701cadf973"},"source":["# Read dataset\n","df = pd.read_csv(file_name, skipinitialspace=True, error_bad_lines=False)\n","\n","# Show number of NaN rows\n","print(\"Removing {} rows that contains only NaN values...\".format(df[df.isna().all(axis=1)].shape[0]))\n","\n","# Remove NaN rows\n","df = df[~ df.isna().all(axis=1)]"],"execution_count":null,"outputs":[{"output_type":"stream","text":["/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (0,1,3,6,84) have mixed types.Specify dtype option on import or set low_memory=False.\n"," interactivity=interactivity, compiler=compiler, result=result)\n"],"name":"stderr"},{"output_type":"stream","text":["Removing 288602 rows that contains only NaN values...\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"s2IcBrKjUSZA","colab_type":"text"},"source":["# Change the unrecognized character in the class label\n","\n","File `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` contains an unrecognized character in its class labels.\n","\n","Change this unrecognized character to `-`."]},{"cell_type":"code","metadata":{"id":"dOAhC4rJUSrK","colab_type":"code","colab":{}},"source":["def _renaming_class_label(df: pd.DataFrame):\n","    labels = {\"Web Attack \\x96 Brute Force\": \"Web Attack-Brute Force\",\n","              \"Web Attack \\x96 XSS\": \"Web Attack-XSS\",\n","              \"Web Attack \\x96 Sql Injection\": \"Web Attack-Sql Injection\"}\n","\n","    for old_label, new_label in labels.items():\n","        df.Label.replace(old_label, new_label, inplace=True)\n","\n","# 
Renaming labels\n","_renaming_class_label(df)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"fueJb4vK3RYs","colab_type":"text"},"source":["# Saving the dataset"]},{"cell_type":"code","metadata":{"id":"AZE7Jk9_ZWaV","colab_type":"code","colab":{}},"source":["# Save to csv\n","df.to_csv(file_name, index=False)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"KHlTHV0hYfbX","colab_type":"text"},"source":["# Combine All Dataset"]},{"cell_type":"code","metadata":{"id":"kHmBhm_kYkyK","colab_type":"code","colab":{}},"source":["DIR_PATH = \"TrafficLabelling\"\n","\n","FILE_NAMES = [\"Monday-WorkingHours.pcap_ISCX.csv\",\n"," \"Tuesday-WorkingHours.pcap_ISCX.csv\",\n"," \"Wednesday-workingHours.pcap_ISCX.csv\",\n"," \"Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv\",\n"," \"Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv\",\n"," \"Friday-WorkingHours-Morning.pcap_ISCX.csv\",\n"," \"Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv\",\n"," \"Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv\"]"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"lErmt1mnaSUK","colab_type":"code","colab":{}},"source":["df = [pd.read_csv(os.path.join(DIR_PATH, f), skipinitialspace=True) for f in FILE_NAMES]\n","df = pd.concat(df, ignore_index=True)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1_27aWk_a8Ra","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":289},"executionInfo":{"status":"ok","timestamp":1597048252403,"user_tz":-540,"elapsed":56356,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"4b145680-27b4-4e8e-8ae1-d0124ccd0979"},"source":["df.Label.value_counts()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["BENIGN 2273097\n","DoS Hulk 
231073\n","PortScan 158930\n","DDoS 128027\n","DoS GoldenEye 10293\n","FTP-Patator 7938\n","SSH-Patator 5897\n","DoS slowloris 5796\n","DoS Slowhttptest 5499\n","Bot 1966\n","Web Attack-Brute Force 1507\n","Web Attack-XSS 652\n","Infiltration 36\n","Web Attack-Sql Injection 21\n","Heartbleed 11\n","Name: Label, dtype: int64"]},"metadata":{"tags":[]},"execution_count":11}]},{"cell_type":"code","metadata":{"id":"HUGbsQsEaYGg","colab_type":"code","colab":{}},"source":["df.to_csv(os.path.join(DIR_PATH, \"TrafficLabelling.csv\"), index=False)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"WxnVQ3zro5Cy","colab_type":"text"},"source":["Copy to Google Drive."]},{"cell_type":"code","metadata":{"id":"-Qbe7PBF1rIP","colab_type":"code","colab":{}},"source":["!cp -r \"TrafficLabelling/\" \"/content/drive/My Drive/CICIDS2017/\""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vWkym95UfOYu","colab_type":"text"},"source":["The dataset is now saved in the `CICIDS2017` folder of your Google Drive."]}]}