{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.6"},"colab":{"name":"2.2 Preprocessing MachineLearningCSV.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"ovbd1i5W0yqa","colab_type":"text"},"source":["# Description\n","\n","This notebook demonstrates how to clean erroneous data from the MachineLearningCSV files of the CICIDS2017 dataset.\n","\n","*Author*: **Mahendra Data** mahendra.data@dbms.cs.kumamoto-u.ac.jp\n","\n","License: **BSD 3 clause**"]},{"cell_type":"markdown","metadata":{"id":"QHdWG9Ol00PE","colab_type":"text"},"source":["# Mounting Google Drive\n","\n","We will save the downloaded dataset to Google Drive."]},{"cell_type":"code","metadata":{"id":"8Q4fip2H02Pd","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":122},"executionInfo":{"status":"ok","timestamp":1597048502633,"user_tz":-540,"elapsed":29344,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"d6a5b9a2-1873-4ce2-b5d5-7a6d79ef31d7"},"source":["from google.colab import drive\n","drive.mount(\"/content/drive\")"],"execution_count":1,"outputs":[{"output_type":"stream","text":["Go to this URL in a browser: 
https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n","\n","Enter your authorization code:\n","··········\n","Mounted at /content/drive\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"XFnBJw9Q0yqu","colab_type":"text"},"source":["# Unzip the dataset\n","\n","Unzip the `MachineLearningCVE.zip`."]},{"cell_type":"code","metadata":{"id":"exri52j_0yqu","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":187},"executionInfo":{"status":"ok","timestamp":1597048514487,"user_tz":-540,"elapsed":41188,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"462ba45c-4614-4c08-a61d-5f55314554db"},"source":["!unzip -n \"/content/drive/My Drive/CICIDS2017/MachineLearningCVE.zip\""],"execution_count":2,"outputs":[{"output_type":"stream","text":["Archive: /content/drive/My Drive/CICIDS2017/MachineLearningCVE.zip\n"," creating: MachineLearningCVE/\n"," inflating: MachineLearningCVE/Wednesday-workingHours.pcap_ISCX.csv \n"," inflating: MachineLearningCVE/Tuesday-WorkingHours.pcap_ISCX.csv \n"," inflating: MachineLearningCVE/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv \n"," inflating: MachineLearningCVE/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv \n"," inflating: MachineLearningCVE/Monday-WorkingHours.pcap_ISCX.csv \n"," inflating: MachineLearningCVE/Friday-WorkingHours-Morning.pcap_ISCX.csv \n"," inflating: MachineLearningCVE/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv \n"," inflating: 
MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv \n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"t93bkeYg0yqy","colab_type":"text"},"source":["There are eight files extracted from this zip file.\n","\n","1. `Monday-WorkingHours.pcap_ISCX.csv`\n","2. `Tuesday-WorkingHours.pcap_ISCX.csv`\n","3. `Wednesday-workingHours.pcap_ISCX.csv`\n","4. `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv`\n","5. `Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv`\n","6. `Friday-WorkingHours-Morning.pcap_ISCX.csv`\n","7. `Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv`\n","8. `Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv`"]},{"cell_type":"code","metadata":{"id":"AfgrkzH33QwO","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048514489,"user_tz":-540,"elapsed":41186,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["import os\n","import pandas as pd"],"execution_count":3,"outputs":[]},{"cell_type":"code","metadata":{"id":"8rWx8tw92M8H","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048515710,"user_tz":-540,"elapsed":42404,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["file_name = os.path.join(\"MachineLearningCVE\", \"Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv\")\n","\n","# Read dataset\n","df = pd.read_csv(file_name, skipinitialspace=True, error_bad_lines=False)"],"execution_count":4,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"s2IcBrKjUSZA","colab_type":"text"},"source":["# Change the unrecognized character in the class label\n","\n","The file `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` contains an unrecognized character in its class labels.\n","\n","Change this unrecognized character to 
`-`."]},{"cell_type":"code","metadata":{"id":"dOAhC4rJUSrK","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048515712,"user_tz":-540,"elapsed":42403,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["def _renaming_class_label(df: pd.DataFrame) -> None:\n","    \"\"\"Replace the unrecognized character in the web attack labels, in place.\"\"\"\n","    labels = {\"Web Attack <20> Brute Force\": \"Web Attack-Brute Force\",\n","              \"Web Attack <20> XSS\": \"Web Attack-XSS\",\n","              \"Web Attack <20> Sql Injection\": \"Web Attack-Sql Injection\"}\n","\n","    # Assign the result back instead of calling replace(inplace=True) on the\n","    # column, which may act on a temporary copy in recent pandas versions.\n","    df[\"Label\"] = df[\"Label\"].replace(labels)\n","\n","# Renaming labels\n","_renaming_class_label(df)"],"execution_count":5,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"fueJb4vK3RYs","colab_type":"text"},"source":["# Saving the dataset"]},{"cell_type":"code","metadata":{"id":"AZE7Jk9_ZWaV","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048521700,"user_tz":-540,"elapsed":48389,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["# Save to csv\n","df.to_csv(file_name, index=False)"],"execution_count":6,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"KHlTHV0hYfbX","colab_type":"text"},"source":["# Combine All Datasets"]},{"cell_type":"code","metadata":{"id":"kHmBhm_kYkyK","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048521702,"user_tz":-540,"elapsed":48388,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["DIR_PATH = \"MachineLearningCVE\"\n","\n","FILE_NAMES = [\"Monday-WorkingHours.pcap_ISCX.csv\",\n"," \"Tuesday-WorkingHours.pcap_ISCX.csv\",\n"," 
\"Wednesday-workingHours.pcap_ISCX.csv\",\n"," \"Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv\",\n"," \"Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv\",\n"," \"Friday-WorkingHours-Morning.pcap_ISCX.csv\",\n"," \"Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv\",\n"," \"Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv\"]"],"execution_count":7,"outputs":[]},{"cell_type":"code","metadata":{"id":"lErmt1mnaSUK","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048541186,"user_tz":-540,"elapsed":67870,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["df = [pd.read_csv(os.path.join(DIR_PATH, f), skipinitialspace=True) for f in FILE_NAMES]\n","df = pd.concat(df, ignore_index=True)"],"execution_count":8,"outputs":[]},{"cell_type":"code","metadata":{"id":"1_27aWk_a8Ra","colab_type":"code","colab":{"base_uri":"https://localhost:8080/","height":289},"executionInfo":{"status":"ok","timestamp":1597048541411,"user_tz":-540,"elapsed":68086,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}},"outputId":"eb325d09-39dc-4413-ad87-ad48d2117e1a"},"source":["df.Label.value_counts()"],"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/plain":["BENIGN 2273097\n","DoS Hulk 231073\n","PortScan 158930\n","DDoS 128027\n","DoS GoldenEye 10293\n","FTP-Patator 7938\n","SSH-Patator 5897\n","DoS slowloris 5796\n","DoS Slowhttptest 5499\n","Bot 1966\n","Web Attack-Brute Force 1507\n","Web Attack-XSS 652\n","Infiltration 36\n","Web Attack-Sql Injection 21\n","Heartbleed 11\n","Name: Label, dtype: 
int64"]},"metadata":{"tags":[]},"execution_count":9}]},{"cell_type":"code","metadata":{"id":"HUGbsQsEaYGg","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048638032,"user_tz":-540,"elapsed":164704,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["df.to_csv(os.path.join(DIR_PATH, \"MachineLearningCVE.csv\"), index=False)"],"execution_count":10,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"WxnVQ3zro5Cy","colab_type":"text"},"source":["Copy to Google Drive."]},{"cell_type":"code","metadata":{"id":"-Qbe7PBF1rIP","colab_type":"code","colab":{},"executionInfo":{"status":"ok","timestamp":1597048653540,"user_tz":-540,"elapsed":180208,"user":{"displayName":"Mahendra Data","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Ghn7DAlkRKEg-Y82BqktrBT0ABMFy8r5576xhbKDQ=s64","userId":"08049029618478467489"}}},"source":["!cp -r \"MachineLearningCVE/\" \"/content/drive/My Drive/CICIDS2017/\""],"execution_count":11,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vWkym95UfOYu","colab_type":"text"},"source":["The dataset is now saved in the `CICIDS2017` folder on your Google Drive."]}]}