initial commit reference

main
yulonger's Desktop 2 years ago
parent 9263ae064b
commit dd74e2eedb

.gitignore

@@ -3,7 +3,7 @@
__pycache__/
*.py[cod]
*$py.class
.idea/
# C extensions
*.so

LICENSE

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2021 talshapira
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

@@ -0,0 +1,67 @@
# FlowPic: A Generic Representation for Encrypted Traffic Classification and Applications Identification
Identifying the type of a network flow or a specific application has many advantages, but has become harder in recent years due to the use of encryption, e.g., by VPN and Tor.
Current solutions rely mostly on handcrafted features and then apply supervised learning techniques for the classification.
We introduce a novel approach for encrypted Internet traffic classification and application identification: transforming basic flow data into a picture, a FlowPic, and then using known image classification deep learning techniques, Convolutional Neural Networks (CNNs), to identify the flow category (browsing, chat, video, etc.) and the application in use. Our approach can classify traffic with high accuracy, both for a specific application and for a flow category, even for VPN and Tor traffic. Our classifier can even identify, with high success rates, new applications that were not part of the training phase for a category; thus, new versions or applications can be categorized without additional training.
A recent [work](https://arxiv.org/abs/2104.03182) by Yang et al. compared recent methods for Internet traffic classification and showed that our method achieves the best tradeoff between accuracy and model complexity, as shown below (FlowPic is marked [17]):
<p align="center">
<img src='http://talshapira.github.io/files/yang_2021_comaprison.png' width="400">
</p>
# Approach
1. Extract a record from each flow, comprising a list of {IP packet size, time of arrival} pairs, one per packet in the flow.
2. Split each unidirectional flow into equal-duration blocks (15/60 seconds).
3. Generate a 2D histogram per block. For simplicity, we set the 2D histogram to be a square image.
4. Feed the image into a Convolutional Neural Network (a minimal sketch of step 3 appears after the figure below).
<img src='http://talshapira.github.io/files/FlowPic_sys.png'>
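The core of step 3 can be sketched in a few lines of NumPy, mirroring `session_2d_histogram` from `sessions_plotter.py` below (the wrapper name `flowpic` and its 60-second default window are illustrative assumptions):

```python
import numpy as np

MTU = 1500  # both axes span [0, 1500]: packet size in bytes, normalized arrival time

def flowpic(ts, sizes, tps=60.0):
    """Build a 1500x1500 FlowPic from one unidirectional flow block.

    ts    -- packet arrival times in seconds
    sizes -- IP packet lengths in bytes
    tps   -- time-per-session window used to normalize arrival times
    """
    ts = np.asarray(ts, dtype=float)
    sizes = np.asarray(sizes, dtype=int)
    # Normalize arrival times to [0, MTU] so the 2D histogram is a square image.
    ts_norm = (ts - ts[0]) / tps * MTU
    H, _, _ = np.histogram2d(sizes, ts_norm,
                             bins=(range(0, MTU + 1), range(0, MTU + 1)))
    return H.astype(np.uint16)  # packet counts per (size, time) bin
```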
# FlowPics Exploration
<img src='http://talshapira.github.io/files/flowpic_categories.png'>
<p align="center">
<img src='http://talshapira.github.io/files/flowpic_apps.png' width="400">
</p>
# Dataset
We use labeled datasets of packet capture (pcap) files from the University of New Brunswick (UNB): ["ISCX VPN-nonVPN traffic dataset" (ISCX-VPN)](https://www.unb.ca/cic/datasets/vpn.html) and ["ISCX Tor-nonTor dataset" (ISCX-Tor)](https://www.unb.ca/cic/datasets/tor.html), as well as our own small packet capture (TAU), and conduct different types of experiments: (1) multiclass classification over non-VPN/VPN/Tor and merged datasets, (2) class vs. all classification, (3) application identification, and (4) classification of an unknown application.
Each pcap file corresponds to a specific application, a traffic category, and an encryption technique. However, all these captures also contain sessions of other traffic categories, since while performing one action in an application, many other sessions run simultaneously for different tasks. For example, while using VoIP over Facebook, another STUN session takes place at the same time to adjust and maintain the VoIP conversation, alongside an HTTPS session of the Facebook site.
We use a combined dataset only from the five categories that contain enough samples: VoIP, Video, Chat, Browsing, and File Transfer. For these categories we have three encryption techniques: non-VPN, VPN (for all classes except Browsing), and Tor.
Notice that our categories differ slightly from those suggested by UNB. All the applications that were captured to create the dataset, for each traffic category and encryption technique, are shown in the following table:
<p align="center">
<img src='http://talshapira.github.io/files/flowpic_dataset.png' width="600">
</p>
We parsed the pcap files and constructed, for each combination of traffic category and encryption technique, a CSV file in which each row corresponds to a specific unidirectional session, with the following structure:

|pcap_name|ip_src|port_src|ip_dst|port_dst|TCP/UDP|start_time|length|[timestamps_list]|[sizes_list]|
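A minimal sketch for reading these rows back, following the indexing used in `traffic_csv_converter.py` below (the helper name `read_sessions` is an assumption; note the single empty separator cell between the two lists):

```python
import csv
import numpy as np

def read_sessions(csv_path):
    """Yield (metadata, timestamps, sizes) for each unidirectional session row."""
    with open(csv_path, 'r') as f:
        for row in csv.reader(f):
            meta = tuple(row[:8])  # pcap_name ... start_time, length
            length = int(row[7])   # number of packets in the session
            ts = np.array(row[8:8 + length], dtype=float)
            sizes = np.array(row[9 + length:], dtype=int)  # 9 + length skips the separator cell
            yield meta, ts, sizes
```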
# TrafficParser
Contains the code used to generate the dataset (npy files) per experiment.
If you choose to use our processed dataset (i.e., the CSV files) directly, run the scripts in the following order (a minimal driver sketch follows the list):
1. Run traffic_csv_converter.py
2. Run datasets_generator.py
The other two scripts (generic_parser.py and traffic_csv_merger.py) were used to generate the processed dataset.
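A sketch of that order as a driver, assuming both scripts are run from the TrafficParser directory with their in-file default paths (an assumption; adjust the constants at the top of each script first):

```python
import subprocess

# Step 1: convert per-class session CSVs into per-class .npy arrays of FlowPics.
subprocess.run(["python", "traffic_csv_converter.py"], check=True)
# Step 2: assemble class-vs-all train/validation or test splits from the .npy arrays.
subprocess.run(["python", "datasets_generator.py"], check=True)
```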
# License
Our processed dataset (i.e., the CSV files) is [publicly available](https://drive.google.com/file/d/1gz61vnMANj-4hKNvZv1KFK9LajR91X-m/view?usp=sharing) upon request for researchers. If you use our dataset, please cite our related research papers, as well as UNB's related research papers:
* T. Shapira and Y. Shavitt, "FlowPic: A Generic Representation for Encrypted Traffic Classification and Applications Identification," in IEEE Transactions on Network and Service Management, doi: 10.1109/TNSM.2021.3071441.
* T. Shapira and Y. Shavitt, "FlowPic: Encrypted Internet Traffic Classification is as Easy as Image Recognition," IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Paris, France, 2019, pp. 680-687.
* Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, and Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features," in Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy, 2016.
* Arash Habibi Lashkari, Gerard Draper-Gil, Mohammad Saiful Islam Mamun, and Ali A. Ghorbani, "Characterization of Tor Traffic Using Time Based Features," in Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP 2017), SCITEPRESS, Porto, Portugal, 2017.

TrafficParser/datasets_generator.py

@@ -0,0 +1,87 @@
#!/usr/bin/env python
"""
datasets_generator.py creates the final class-vs-all dataset, ready to be fed to the learning model.
Its input is pre-created numpy arrays, produced by traffic_csv_converter.py, containing all classes' session 2D histograms.
"""
import glob
import numpy as np
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
CLASS = "browsing"
TEST_SIZE = 0.1
DATASET_DIR = "../datasets/"
VPN_TYPES = {
"reg": glob.glob("../raw_csvs/classes/**/reg/*.npy"),
"vpn": glob.glob("../raw_csvs/classes/**/vpn/*.npy"),
"tor": glob.glob("../raw_csvs/classes/**/tor/*.npy")
}
def import_array(input_array):
print("Import dataset " + input_array)
dataset = np.load(input_array)
print(dataset.shape)
return dataset
def export_dataset(dataset_dict, file_path):
# with open(file_path + ".pkl", 'wb') as outfile:
# pickle.dump(dataset_list, outfile, pickle.HIGHEST_PROTOCOL)
for name, array in dataset_dict.items():
np.save(file_path + "_" + name, array)
def create_class_vs_all_specific_vpn_type_dataset(class_name, vpn_type="reg", validation=False, ratio=1.2):
    """Build a binary class-vs-all dataset for one encryption type.

    The target class is labeled 0; samples from all other classes are labeled 1
    and randomly downsampled so the "all" side is roughly `ratio` times the class side.
    """
class_array_file = [fn for fn in VPN_TYPES[vpn_type] if class_name in fn and "overlap" not in fn][0]
print(class_array_file)
all_files = [fn for fn in VPN_TYPES[vpn_type] if class_name not in fn and "overlap" not in fn]
print(all_files)
class_array = import_array(class_array_file)
count = len(class_array)
print(count)
all_count = len(all_files)
count_per_class = ratio*count/all_count
print(count_per_class)
for fn in all_files:
print(fn)
fn_array = import_array(fn)
p = count_per_class*1.0/len(fn_array)
print(p)
if p < 1:
mask = np.random.choice([True, False], len(fn_array), p=[p, 1-p])
fn_array = fn_array[mask]
print(len(fn_array))
class_array = np.append(class_array, fn_array, axis=0)
print(len(class_array))
del fn_array
labels = np.append(np.zeros(count), np.ones(len(class_array) - count))
print(len(class_array), len(labels), labels[0], labels[count-1], labels[count], labels[-1])
dataset_dict = dict()
if validation:
x_train, x_val, y_train, y_val = train_test_split(class_array, labels, test_size=TEST_SIZE)
print(len(y_train), sum(y_train), 1.0*sum(y_train)/len(y_train))
print(len(y_val), sum(y_val), 1.0*sum(y_val)/len(y_val))
dataset_dict["x_train"] = x_train
dataset_dict["x_val"] = x_val
dataset_dict["y_train"] = y_train
dataset_dict["y_val"] = y_val
else:
dataset_dict["x_test"] = class_array
dataset_dict["y_test"] = labels
export_dataset(dataset_dict, DATASET_DIR + class_name + "_vs_all_" + vpn_type)
if __name__ == '__main__':
# create_class_vs_all_specific_vpn_type_dataset(CLASS, validation=True)
# create_class_vs_all_specific_vpn_type_dataset(CLASS, vpn_type="vpn", validation=False)
create_class_vs_all_specific_vpn_type_dataset(CLASS, vpn_type="tor", validation=False)

TrafficParser/generic_parser.py

@@ -0,0 +1,127 @@
#!/usr/bin/env python
"""
Use dpkt to read a pcap file and create one-directional sessions of packet sizes (total IP length) and timestamps.
"""
import dpkt
import os
import socket
import argparse
import csv
import time
FLAGS = None
# INPUT = "../dataset/iscxNTVPN2016/CompletePCAPs"#"../dataset/CICNTTor2017/Pcaps/tor" #"../dataset/iscxNTVPN2016/CompletePCAPs"#"./test_pacaps"#"../dataset/iscxNTVPN2016/CompletePCAPs" # ""
INPUT = './test_pcaps/my_chat'
FILTER_LIST = None # [(["audio", "voip"], True), (["vpn", "tor"], False)]
PROTO_DICT = {dpkt.tcp.TCP: "TCP", dpkt.udp.UDP: "UDP"}
def inet_to_str(inet):
"""Convert inet object to a string
Args:
inet (inet struct): inet network address
Returns:
str: Printable/readable IP address
"""
# First try ipv4 and then ipv6
try:
return socket.inet_ntop(socket.AF_INET, inet)
except ValueError:
return socket.inet_ntop(socket.AF_INET6, inet)
def get_pcaps_list(dir_path, filter_list=None):
def filter_list_func(fn):
if filter_list is not None:
for filter_str_list, type in filter_list:
result = any([filter_str in fn.lower() for filter_str in filter_str_list])
if result is not type:
return False
return True
return [(os.path.join(dir_path, fn), fn) for fn in next(os.walk(dir_path))[2] if (".pcap" in os.path.splitext(fn)[-1] and filter_list_func(fn))]
def parse_pcap(pcap, pcap_path, file_name):
"""Print out information about each packet in a pcap
Args:
pcap: dpkt pcap reader object (dpkt.pcap.Reader)
"""
counter = 0
pcap_dict = {}
# For each packet in the pcap process the contents
for ts, packet in pcap:
# Unpack the Ethernet frame
        try:
            eth = dpkt.ethernet.Ethernet(packet)
        except dpkt.dpkt.NeedData:
            print("dpkt.dpkt.NeedData")
            continue  # truncated frame: skip it rather than reuse a stale eth
# Make sure the Ethernet data contains an IP packet
if isinstance(eth.data, dpkt.ip.IP):
ip = eth.data
elif isinstance(eth.data, str):
try:
ip = dpkt.ip.IP(packet)
except dpkt.UnpackError:
continue
else:
continue
# Now unpack the data within the Ethernet frame (the IP packet)
# Pulling out src_ip, dst_ip, protocol (tcp/udp), dst/src port, length
proto = ip.data
        # Record the packet under its unidirectional 5-tuple session key
        if type(ip.data) in PROTO_DICT:
            session_tuple_key = (inet_to_str(ip.src), proto.sport, inet_to_str(ip.dst), proto.dport, PROTO_DICT[type(ip.data)])
            pcap_dict.setdefault(session_tuple_key, (ts, [], []))
            d = pcap_dict[session_tuple_key]
            size = len(ip)  # total IP length; alternatively ip.len
            d[1].append(round(ts - d[0], 6))  # arrival time relative to session start
            d[2].append(size)
            counter += 1
print("Total Number of Parsed Packets in " + pcap_path + ": " + str(counter))
csv_file_path = os.path.splitext(pcap_path)[0] + ".csv"
    with open(csv_file_path, 'w', newline='') as csv_file:  # text mode for Python 3's csv
writer = csv.writer(csv_file)
for key, value in pcap_dict.items():
writer.writerow([file_name.split(".")[0]] + list(key) + [value[0], len(value[1])] + value[1] + [None] + value[2])
    for k, v in pcap_dict.items():
if len(v[1]) > 2000:
print(k, v[0], len(v[1]))
def generic_parser(file_list):
"""Open up a pcap file and create a output file containing all one-directional parsed sessions"""
for pcap_path, file_name in file_list:
try:
with open(pcap_path, 'rb') as f:
pcap = dpkt.pcap.Reader(f)
parse_pcap(pcap, pcap_path, file_name)
except ValueError:
new_pcap_file = os.path.splitext(pcap_path)[0] + "_new.pcap"
os.system("editcap -F libpcap -T ether " + pcap_path + " " + new_pcap_file)
with open(new_pcap_file, 'rb') as f:
pcap = dpkt.pcap.Reader(f)
parse_pcap(pcap, pcap_path, file_name)
os.remove(new_pcap_file)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, default=INPUT, help='Path to pcap')
FLAGS = parser.parse_args()
file_list = get_pcaps_list(FLAGS.input, FILTER_LIST)
start_time = time.time()
generic_parser(file_list)
total_time = time.time() - start_time
print("--- %s seconds ---" % total_time)

TrafficParser/sessions_plotter.py

@@ -0,0 +1,75 @@
#!/usr/bin/env python
"""
sessions_plotter.py has three functions to create a spectrogram, a histogram, and a 2D histogram from a [(ts, size), ...] session.
"""
import matplotlib.pyplot as plt
import numpy as np
MTU = 1500
def session_spectogram(ts, sizes, name=None):
plt.scatter(ts, sizes, marker='.')
plt.ylim(0, MTU)
plt.xlim(ts[0], ts[-1])
# plt.yticks(np.arange(0, MTU, 10))
# plt.xticks(np.arange(int(ts[0]), int(ts[-1]), 10))
    plt.title(name + " Session Spectrogram")
plt.ylabel('Size [B]')
plt.xlabel('Time [sec]')
plt.grid(True)
plt.show()
def session_atricle_spectogram(ts, sizes, fpath=None, show=True, tps=None):
if tps is None:
max_delta_time = ts[-1] - ts[0]
else:
max_delta_time = tps
ts_norm = ((np.array(ts) - ts[0]) / max_delta_time) * MTU
plt.figure()
plt.scatter(ts_norm, sizes, marker=',', c='k', s=5)
plt.ylim(0, MTU)
plt.xlim(0, MTU)
plt.ylabel('Packet Size [B]')
plt.xlabel('Normalized Arrival Time')
plt.set_cmap('binary')
    plt.gca().set_aspect('equal')  # use the current axes; plt.axes() would create a new one
plt.grid(False)
if fpath is not None:
# plt.savefig(OUTPUT_DIR + fname, bbox_inches='tight', pad_inches=1)
plt.savefig(fpath, bbox_inches='tight')
if show:
plt.show()
plt.close()
def session_histogram(sizes, plot=False):
hist, bin_edges = np.histogram(sizes, bins=range(0, MTU + 1, 1))
if plot:
plt.bar(bin_edges[:-1], hist, width=1)
plt.xlim(min(bin_edges), max(bin_edges)+100)
plt.show()
return hist.astype(np.uint16)
def session_2d_histogram(ts, sizes, plot=False, tps=None):
if tps is None:
max_delta_time = ts[-1] - ts[0]
else:
max_delta_time = tps
# ts_norm = map(int, ((np.array(ts) - ts[0]) / max_delta_time) * MTU)
ts_norm = ((np.array(ts) - ts[0]) / max_delta_time) * MTU
H, xedges, yedges = np.histogram2d(sizes, ts_norm, bins=(range(0, MTU + 1, 1), range(0, MTU + 1, 1)))
if plot:
plt.pcolormesh(xedges, yedges, H)
plt.colorbar()
plt.xlim(0, MTU)
plt.ylim(0, MTU)
plt.set_cmap('binary')
plt.show()
return H.astype(np.uint16)
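# --- Usage sketch (illustrative addition, not part of the original module) ---
# Build and display a FlowPic for a synthetic 60-second session.
if __name__ == '__main__':
    rng = np.random.default_rng(0)
    ts = np.sort(rng.uniform(0, 60, 500))    # 500 arrival times within 60 s
    sizes = rng.integers(40, MTU, size=500)  # synthetic IP packet sizes [B]
    session_2d_histogram(ts, sizes, plot=True, tps=60)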

TrafficParser/traffic_csv_converter.py

@@ -0,0 +1,197 @@
#!/usr/bin/env python
"""
Read session CSVs and convert each session into FlowPic 2D histograms, exported as .npy arrays.
"""
import os
import argparse
import csv
from sessions_plotter import *
import glob
import re
FLAGS = None
INPUT = "../raw_csvs/classes/browsing/reg/CICNTTor_browsing.raw.csv"#"../dataset/iscxNTVPN2016/CompletePCAPs" # ""
INPUT_DIR = "../raw_csvs/classes/chat/vpn/"
CLASSES_DIR = "../raw_csvs/classes/**/**/"
# LABEL_IND = 1
TPS = 60 # TimePerSession in secs
DELTA_T = 60 # Delta T between split sessions
MIN_TPS = 50
# def insert_dataset(dataset, labels, session, label_ind=LABEL_IND):
# dataset.append(session)
# labels.append(label_ind)
# def export_dataset(dataset, labels):
# print "Start export dataset"
# np.savez(INPUT.split(".")[0] + ".npz", X=dataset, Y=labels)
# print dataset.shape, labels.shape
#
# def import_dataset():
# print "Import dataset"
# dataset = np.load(INPUT.split(".")[0] + ".npz")
# print dataset["X"].shape, dataset["Y"].shape
def export_dataset(dataset):
print("Start export dataset")
np.save(os.path.splitext(INPUT)[0], dataset)
print(dataset.shape)
def export_class_dataset(dataset, class_dir):
print("Start export dataset")
np.save(class_dir + "/" + "_".join(re.findall(r"[\w']+", class_dir)[-2:]), dataset)
print(dataset.shape)
def import_dataset():
print("Import dataset")
dataset = np.load(os.path.splitext(INPUT)[0] + ".npy")
print(dataset.shape)
return dataset
def traffic_csv_converter(file_path):
print("Running on " + file_path)
dataset = []
# labels = []
counter = 0
with open(file_path, 'r') as csv_file:
reader = csv.reader(csv_file)
for i, row in enumerate(reader):
# print row[0], row[7]
session_tuple_key = tuple(row[:8])
length = int(row[7])
ts = np.array(row[8:8+length], dtype=float)
sizes = np.array(row[9+length:], dtype=int)
# if (sizes > MTU).any():
# a = [(sizes[i], i) for i in range(len(sizes)) if (np.array(sizes) > MTU)[i]]
# print len(a), session_tuple_key
if length > 10:
# print ts[0], ts[-1]
# h = session_2d_histogram(ts, sizes)
# session_spectogram(ts, sizes, session_tuple_key[0])
# dataset.append([h])
# counter += 1
# if counter % 100 == 0:
# print counter
for t in range(int(ts[-1]/DELTA_T - TPS/DELTA_T) + 1):
mask = ((ts >= t * DELTA_T) & (ts <= (t * DELTA_T + TPS)))
# print t * DELTA_T, t * DELTA_T + TPS, ts[-1]
ts_mask = ts[mask]
sizes_mask = sizes[mask]
if len(ts_mask) > 10 and ts_mask[-1] - ts_mask[0] > MIN_TPS:
# if "facebook" in session_tuple_key[0]:
# session_spectogram(ts[mask], sizes[mask], session_tuple_key[0])
# # session_2d_histogram(ts[mask], sizes[mask], True)
# session_histogram(sizes[mask], True)
# exit()
# else:
# continue
h = session_2d_histogram(ts_mask, sizes_mask)
# session_spectogram(ts_mask, sizes_mask, session_tuple_key[0])
dataset.append([h])
counter += 1
if counter % 100 == 0:
print(counter)
return np.asarray(dataset) #, np.asarray(labels)
def traffic_csv_converter_splitted(file_path):
    def split_converter(ts, sizes, dataset, counter):
        # Note: `counter` is an int and is rebound locally; increments do not reach the caller.
if ts[-1] - ts[0] > MIN_TPS and len(ts) > 20:
# print ts[0], ts[-1]
h = session_2d_histogram(ts-ts[0], sizes)
# session_spectogram(ts, sizes, session_tuple_key[0])
dataset.append([h])
counter += 1
# if counter % 100 == 0:
# print counter
total_time = ts[-1] - ts[0]
if total_time > TPS:
            for ts_split, sizes_split in zip(np.split(ts, [len(ts) // 2]), np.split(sizes, [len(sizes) // 2])):
split_converter(ts_split, sizes_split, dataset, counter)
print("Running on " + file_path)
dataset = []
# labels = []
counter = 0
with open(file_path, 'r') as csv_file:
reader = csv.reader(csv_file)
for i, row in enumerate(reader):
# print row[0], row[7]
session_tuple_key = tuple(row[:8])
length = int(row[7])
ts = np.array(row[8:8+length], dtype=float)
sizes = np.array(row[9+length:], dtype=int)
# if (sizes > MTU).any():
# a = [(sizes[i], i) for i in range(len(sizes)) if (np.array(sizes) > MTU)[i]]
# print len(a), session_tuple_key
if length > 10:
split_converter(ts, sizes, dataset, counter)
return np.asarray(dataset)
def traffic_class_converter(dir_path):
dataset_tuple = ()
for file_path in [os.path.join(dir_path, fn) for fn in next(os.walk(dir_path))[2] if (".csv" in os.path.splitext(fn)[-1])]:
        dataset_tuple += (traffic_csv_converter(file_path),)
return np.concatenate(dataset_tuple, axis=0)
def iterate_all_classes():
for class_dir in glob.glob(CLASSES_DIR):
if "other" not in class_dir: #"browsing" not in class_dir and
print("working on " + class_dir)
dataset = traffic_class_converter(class_dir)
print(dataset.shape)
export_class_dataset(dataset, class_dir)
def random_sampling_dataset(input_array, size=2000):
print("Import dataset " + input_array)
dataset = np.load(input_array)
print(dataset.shape)
p = size*1.0/len(dataset)
print(p)
    if p >= 1:
        raise ValueError("requested sample size exceeds the dataset size")
mask = np.random.choice([True, False], len(dataset), p=[p, 1-p])
dataset = dataset[mask]
print("Start export dataset")
np.save(os.path.splitext(input_array)[0] + "_samp", dataset)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, default=INPUT, help='Path to csv file')
FLAGS = parser.parse_args()
##
# iterate_all_classes()
# dataset = traffic_class_converter(INPUT_DIR)
# dataset = traffic_csv_converter(INPUT)
input_array = "../raw_csvs/classes/browsing/reg/browsing_reg.npy"
random_sampling_dataset(input_array)
# export_class_dataset(dataset)
# import_dataset()

TrafficParser/traffic_csv_merger.py

@@ -0,0 +1,122 @@
#!/usr/bin/env python
"""
traffic_csv_merger.py merges all filtered traffic CSVs from the same original pcap dataset, class, and VPN type into one merged CSV file.
"""
import os
import argparse
import csv
from sessions_plotter import *
FLAGS = None
# INPUT = "../raw_csvs/CICNTTor2017/tor"
# INPUT = "../raw_csvs/iscxNTVPN2016" # ""
# INPUT = "D:/TS/Internet Traffic Classification/TrafficParser/test_pcaps/my_chat"
INPUT = "./test_pcaps/my_chat"
# OUTPUT1 = "CICNTTor_browsing_tor.raw.csv"
# OUTPUT2 = "CICNTTor_browsing_others_tor.raw.csv"
# OUTPUT1 = "iscx_email.raw.csv"
# OUTPUT2 = "iscx_email_others.raw.csv"
# OUTPUT3 = "iscx_video_voip.raw.csv"
OUTPUT1 = "my_chat.raw.csv"
OUTPUT2 = "my_chat_others.raw.csv"
# FILTER_LIST = [(["audio", "voip"], True), (["tor", "vpn"], False)] #-> voip, , "tor"
# FILTER_LIST = [(["video", "youtube", "vimeo", "netflix"], True), (["tor", "vpn"], False)] #-> video
# FILTER_LIST = [(["audio", "voip"], True), (["spotify"], False)] #-> voip, , "tor"
# FILTER_LIST = [(["ftps", "scp", "sftp", "file"], True), (["mail", "pop", "tor"], False), (["vpn"], True)]
FILTER_LIST = [(["chat"], True), (["vpn"], False)]
def get_csvs_list(dir_path, filter_list=None):
def filter_list_func(fn):
if filter_list is not None:
for filter_str_list, type in filter_list:
result = any([filter_str in fn.lower() for filter_str in filter_str_list])
if result is not type:
return False
return True
return [(os.path.join(dir_path, fn), fn) for fn in next(os.walk(dir_path))[2] if (".csv" in os.path.splitext(fn)[-1] and filter_list_func(fn))]
def traffic_csv_reader(file_list):
    output1 = open(OUTPUT1, 'w', newline='')
writer1 = csv.writer(output1)
counter1 = 0
    output2 = open(OUTPUT2, 'w', newline='')
writer2 = csv.writer(output2)
counter2 = 0
# output3 = open(OUTPUT3, 'wb')
# writer3 = csv.writer(output3)
# counter3 = 0
rate_list = []
for i, (file_path, file_name) in enumerate(file_list):
print("Running on " + str(i) + " file - " + file_path)
with open(file_path, 'r') as csv_file:
reader = csv.reader(csv_file)
for i, row in enumerate(reader):
session_tuple_key = tuple(row[:8])
length = int(row[7])
ts = np.array(row[8:8+length], dtype=float)
if length >= 20:
total_time = ts[-1] - ts[0]
sizes = np.array(row[9+length:], dtype=int)
# print row[0], length, total_time, length/total_time
rate = length/total_time
rate_list.append(rate)
# if (sizes > MTU).any():
# a = [(sizes[i], i) for i in range(len(sizes)) if (np.array(sizes) > MTU)[i]]
# print len(a), session_tuple_key, a
# if ("facebook" in row[0] and rate > 40) or ("facebook" not in row[0] and 20 <= rate and length >= 1000): # for iscx_voip
# if "facebook" not in row[0] and 40 <= rate and length >= 1000: # for iscx_voip_vpn
# if ("youtube" in row[0] and row[2] == '443' and total_time > 10 and rate > 15) or ("vimeo" in row[0] and rate > 30 and total_time > 15) or ("netflix" in row[0] and rate > 60) or ("facebook" in row[0] and rate > 60) or (40 <= rate and row[5] == "UDP" and "facebook" not in row[0]): # for iscx_video
# if length > 6000 and rate > 10 and total_time > 10 and (row[2] == '443' or row[2] == '80'): # for iscx_video_vpn
# if total_time > 30 and rate > 30 and (row[2] == '443' or row[2] == '80'): # for CICNTTor_video
# if total_time > 30 and (row[2] == '443' or row[2] == '80'): # for CICNTTor_video
# if total_time > 10 and rate > 10 and length > 1000 and (session_tuple_key[-1] != '3326' and session_tuple_key[-1] != '6367'): # for CICNTTor_voip
# if (total_time > 10 and ((rate > 100) or ("skype" in row[0] and rate > 10))) or ("rent" in row[0] and total_time > 20 and row[2] != '21943' and row[4] != '28904'): #for iscx_file
# if ("torrent" in row[0] and total_time > 30 and rate > 10 and (row[2] == '443' or row[2] == '80')) or ("torrent" not in row[0] and total_time > 30 and rate > 10 and int(row[2]) not in [22, 1781, 59886, 35968]): #for iscx_file_vpn
# if (total_time > 20) and (("POP" in row[0] and rate > 150) or (("IMAP" in row[0] and rate > 10 and row[7][0] == '8') or row[0] == 'FTP_filetransfer' and total_time > 20 and rate > 100) or ( rate > 10 and "SFTP" in row[0])): #for CICNTTor_file
# if total_time > 20 and rate > 5: #for CICNTTor_file_tor
# if ("skype" in row[0] and total_time > 20 and rate > 3 and row[1][:2] == "10") or ("ftps" in row[0] and total_time > 20 and rate > 300 and row[2] != '1781') or ("sftp" in row[0] and total_time > 20 and rate > 5 and (('A' in row[0] and row[4]=='22') or ('B' in row[0] and row[2]=='22'))):#for iscx_file_vpn
if (total_time > 50 and rate<5) and (("whats" in row[0] and ("185.60." in row[1] or "185.60." in row[3])) or ("hang" in row[0] and ("216.58." in row[1] or "216.58." in row[3])) or ("book" in row[0] and ("192.114." in row[1] or "192.114." in row[3]))): #my_chat
# if (row[1] in ["131.202.240.242","131.202.240.45"] and row[3] in ["131.202.240.242","131.202.240.45"]) or ("gmail" in row[0] and "131.202.240.87" in [row[1], row[3]] and total_time > 40 and row[5] == 'TCP' and rate < 0.3): #for scx_chat
# if ("skype" in row[0] and (row[1] in ["86.4.212.228", '157.56.52.13', '64.4.23.162'] or row[3] in ["86.4.212.228", '157.56.52.13', '64.4.23.162'])) or (total_time > 20 and rate < 2 and (("205.188." not in (row[1]+row[3]) and "hang" not in row[0]) or ("hang" in row[0] and "216.58" in (row[1]+row[3])))): #for iscx_chat_vpn
# if total_time > 20 and rate < 2: #for CICNTTor_chat
# if total_time > 20 and (row[2] in ['80', '443'] or row[4] in ['80', '443']): #for CICNTTor_browsing_tor
# if total_time > 20:
writer1.writerow(row)
counter1 += 1
print(session_tuple_key, total_time, rate)
# session_spectogram(ts, sizes, session_tuple_key[0])
# elif "facebook" in row[0] and rate > 50:# --> voip
else:
writer2.writerow(row)
counter2 += 1
# # #
# if "skype" in row[0] and (row[1] in ["86.4.212.228", '157.56.52.13', '64.4.23.162'] or row[3] in ["86.4.212.228", '157.56.52.13', '64.4.23.162']):# and row[2] == '443' and rate >10:
# print session_tuple_key, total_time, rate
# session_spectogram(ts, sizes, session_tuple_key[0])
print("Total sessions in " + OUTPUT1 + " : " + str(counter1))
print("Total sessions in " + OUTPUT2 + " : " + str(counter2))
output1.close()
output2.close()
print(rate_list)
plt.hist(rate_list)
plt.show()
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, default=INPUT, help='Path to csvs folder')
FLAGS = parser.parse_args()
file_list = get_csvs_list(FLAGS.input, FILTER_LIST)
print("Total number of files " + str(len(file_list)) + " in " + INPUT)
traffic_csv_reader(file_list)

File diff suppressed because one or more lines are too long

@@ -0,0 +1,42 @@
import matplotlib.pyplot as plt
import numpy as np
MTU = 1500
def session_spectogram(ts, sizes, name=None):
plt.scatter(ts, sizes, marker='.')
plt.ylim(0, MTU + 100)
plt.xlim(ts[0], ts[-1])
# plt.yticks(np.arange(0, MTU, 10))
# plt.xticks(np.arange(int(ts[0]), int(ts[-1]), 10))
    plt.title(name + " Session Spectrogram")
plt.ylabel('Size [B]')
plt.xlabel('Time [sec]')
plt.grid(True)
plt.show()
def session_histogram(sizes, plot=False):
hist, bin_edges = np.histogram(sizes, bins=range(0, MTU + 1, 1))
if plot:
plt.bar(bin_edges[:-1], hist, width=1)
plt.xlim(min(bin_edges), max(bin_edges)+100)
plt.show()
return hist.astype(np.uint16)
def session_2d_histogram(ts, sizes, plot=False):
# ts_norm = map(int, ((np.array(ts) - ts[0]) / (ts[-1] - ts[0])) * MTU)
ts_norm = ((np.array(ts) - ts[0]) / (ts[-1] - ts[0])) * MTU
H, xedges, yedges = np.histogram2d(sizes, ts_norm, bins=(range(0, MTU + 1, 1), range(0, MTU + 1, 1)))
if plot:
plt.pcolormesh(xedges, yedges, H)
plt.colorbar()
plt.xlim(0, MTU)
plt.ylim(0, MTU)
plt.set_cmap('binary')
plt.show()
return H.astype(np.uint16)