initial commit reference

main
yulonger's Desktop 2 years ago
parent 9263ae064b
commit dd74e2eedb

.gitignore

@@ -3,7 +3,7 @@
__pycache__/
*.py[cod]
*$py.class
.idea/
# C extensions
*.so

LICENSE

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2021 talshapira
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

@@ -0,0 +1,67 @@
# FlowPic: A Generic Representation for Encrypted Traffic Classification and Applications Identification
Identifying the type of a network flow or a specific application has many advantages, but has become harder in recent years due to the use of encryption, e.g., by VPN and Tor.
Current solutions rely mostly on handcrafted features and then apply supervised learning techniques for the classification.
We introduce a novel approach for encrypted Internet traffic classification and application identification: transforming basic flow data into a picture, a FlowPic, and then using known image classification deep learning techniques, Convolutional Neural Networks (CNNs), to identify the flow category (browsing, chat, video, etc.) and the application in use. Our approach can classify traffic with high accuracy, both for a specific application and for a flow category, even for VPN and Tor traffic. Our classifier can even identify, with high success rates, new applications that were not part of the training phase for a category; thus, new versions or applications can be categorized without additional training.
A recent [work](https://arxiv.org/abs/2104.03182) by Yang et al. compared recent methods for Internet traffic classification and showed that our method achieves the best tradeoff between accuracy and model complexity, as shown below (FlowPic is marked [17]):
<p align="center">
<img src='http://talshapira.github.io/files/yang_2021_comaprison.png' width="400">
</p>
# Approach
1. Extract a record from each flow, comprising a list of {IP packet size, time of arrival} pairs, one per packet in the flow.
2. Split each unidirectional flow into equal-duration blocks (15/60 seconds).
3. Generate a 2D histogram per block. For simplicity, we set the 2D histogram to be a square image.
4. Feed the image into a Convolutional Neural Network (a minimal sketch of step 3 appears after the figure below).
<img src='http://talshapira.github.io/files/FlowPic_sys.png'>
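The core of step 3 can be sketched in a few lines of NumPy, mirroring `session_2d_histogram` from `sessions_plotter.py` below (the wrapper name `flowpic` and its 60-second default window are illustrative assumptions):

```python
import numpy as np

MTU = 1500  # both axes span [0, 1500]: packet size in bytes, normalized arrival time

def flowpic(ts, sizes, tps=60.0):
    """Build a 1500x1500 FlowPic from one unidirectional flow block.

    ts    -- packet arrival times in seconds
    sizes -- IP packet lengths in bytes
    tps   -- time-per-session window used to normalize arrival times
    """
    ts = np.asarray(ts, dtype=float)
    sizes = np.asarray(sizes, dtype=int)
    # Normalize arrival times to [0, MTU] so the 2D histogram is a square image.
    ts_norm = (ts - ts[0]) / tps * MTU
    H, _, _ = np.histogram2d(sizes, ts_norm,
                             bins=(range(0, MTU + 1), range(0, MTU + 1)))
    return H.astype(np.uint16)  # packet counts per (size, time) bin
```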
# FlowPics Exploration
<img src='http://talshapira.github.io/files/flowpic_categories.png'>
<p align="center">
<img src='http://talshapira.github.io/files/flowpic_apps.png' width="400">
</p>
# Dataset
We use labeled datasets of packet capture (pcap) files from the University of New Brunswick (UNB): ["ISCX VPN-nonVPN traffic dataset" (ISCX-VPN)](https://www.unb.ca/cic/datasets/vpn.html) and ["ISCX Tor-nonTor dataset" (ISCX-Tor)](https://www.unb.ca/cic/datasets/tor.html), as well as our own small packet capture (TAU), and conduct different types of experiments: (1) multiclass classification over non-VPN/VPN/Tor and merged datasets, (2) class vs. all classification, (3) application identification, and (4) classification of an unknown application.
Each pcap file corresponds to a specific application, a traffic category, and an encryption technique. However, all these captures also contain sessions of other traffic categories, since while performing one action in an application, many other sessions run simultaneously for different tasks. For example, while using VoIP over Facebook, another STUN session takes place at the same time to adjust and maintain the VoIP conversation, alongside an HTTPS session of the Facebook site.
We use a combined dataset only from the five categories that contain enough samples: VoIP, Video, Chat, Browsing, and File Transfer. For these categories we have three encryption techniques: non-VPN, VPN (for all classes except Browsing), and Tor.
Notice that our categories differ slightly from those suggested by UNB. All the applications that were captured to create the dataset, for each traffic category and encryption technique, are shown in the following table:
<p align="center">
<img src='http://talshapira.github.io/files/flowpic_dataset.png' width="600">
</p>
We parsed the pcap files and constructed, for each combination of traffic category and encryption technique, a CSV file in which each row corresponds to a specific unidirectional session, with the following structure:

|pcap_name|ip_src|port_src|ip_dst|port_dst|TCP/UDP|start_time|length|[timestamps_list]|[sizes_list]|
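A minimal sketch for reading these rows back, following the indexing used in `traffic_csv_converter.py` below (the helper name `read_sessions` is an assumption; note the single empty separator cell between the two lists):

```python
import csv
import numpy as np

def read_sessions(csv_path):
    """Yield (metadata, timestamps, sizes) for each unidirectional session row."""
    with open(csv_path, 'r') as f:
        for row in csv.reader(f):
            meta = tuple(row[:8])  # pcap_name ... start_time, length
            length = int(row[7])   # number of packets in the session
            ts = np.array(row[8:8 + length], dtype=float)
            sizes = np.array(row[9 + length:], dtype=int)  # 9 + length skips the separator cell
            yield meta, ts, sizes
```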
# TrafficParser
Contains the code used to generate the dataset (npy files) per experiment.
If you choose to use our processed dataset (i.e., the CSV files) directly, run the scripts in the following order (a minimal driver sketch follows the list):
1. Run traffic_csv_converter.py
2. Run datasets_generator.py
The other two scripts (generic_parser.py and traffic_csv_merger.py) were used to generate the processed dataset.
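A sketch of that order as a driver, assuming both scripts are run from the TrafficParser directory with their in-file default paths (an assumption; adjust the constants at the top of each script first):

```python
import subprocess

# Step 1: convert per-class session CSVs into per-class .npy arrays of FlowPics.
subprocess.run(["python", "traffic_csv_converter.py"], check=True)
# Step 2: assemble class-vs-all train/validation or test splits from the .npy arrays.
subprocess.run(["python", "datasets_generator.py"], check=True)
```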
# License
Our processed dataset (i.e., the CSV files) is [publicly available](https://drive.google.com/file/d/1gz61vnMANj-4hKNvZv1KFK9LajR91X-m/view?usp=sharing) upon request for researchers. If you use our dataset, please cite our related research papers, as well as UNB's related research papers:
* T. Shapira and Y. Shavitt, "FlowPic: A Generic Representation for Encrypted Traffic Classification and Applications Identification," in IEEE Transactions on Network and Service Management, doi: 10.1109/TNSM.2021.3071441.
* T. Shapira and Y. Shavitt, "FlowPic: Encrypted Internet Traffic Classification is as Easy as Image Recognition," IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Paris, France, 2019, pp. 680-687.
* Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, and Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features," in Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy, 2016.
* Arash Habibi Lashkari, Gerard Draper-Gil, Mohammad Saiful Islam Mamun, and Ali A. Ghorbani, "Characterization of Tor Traffic Using Time Based Features," in Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP 2017), SCITEPRESS, Porto, Portugal, 2017.

TrafficParser/datasets_generator.py

@@ -0,0 +1,87 @@
#!/usr/bin/env python
"""
datasets_generator.py creates the final class-vs-all dataset, ready to be fed to the learning model.
Its input is pre-created numpy arrays, produced by traffic_csv_converter.py, containing all classes' session 2D histograms.
"""
import glob
import numpy as np
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
CLASS = "browsing"
TEST_SIZE = 0.1
DATASET_DIR = "../datasets/"
VPN_TYPES = {
"reg": glob.glob("../raw_csvs/classes/**/reg/*.npy"),
"vpn": glob.glob("../raw_csvs/classes/**/vpn/*.npy"),
"tor": glob.glob("../raw_csvs/classes/**/tor/*.npy")
}
def import_array(input_array):
print("Import dataset " + input_array)
dataset = np.load(input_array)
print(dataset.shape)
return dataset
def export_dataset(dataset_dict, file_path):
# with open(file_path + ".pkl", 'wb') as outfile:
# pickle.dump(dataset_list, outfile, pickle.HIGHEST_PROTOCOL)
for name, array in dataset_dict.items():
np.save(file_path + "_" + name, array)
def create_class_vs_all_specific_vpn_type_dataset(class_name, vpn_type="reg", validation=False, ratio=1.2):
    """Build a binary class-vs-all dataset for one encryption type.

    The target class is labeled 0; samples from all other classes are labeled 1
    and randomly downsampled so the "all" side is roughly `ratio` times the class side.
    """
class_array_file = [fn for fn in VPN_TYPES[vpn_type] if class_name in fn and "overlap" not in fn][0]
print(class_array_file)
all_files = [fn for fn in VPN_TYPES[vpn_type] if class_name not in fn and "overlap" not in fn]
print(all_files)
class_array = import_array(class_array_file)
count = len(class_array)
print(count)
all_count = len(all_files)
count_per_class = ratio*count/all_count
print(count_per_class)
for fn in all_files:
print(fn)
fn_array = import_array(fn)
p = count_per_class*1.0/len(fn_array)
print(p)
if p < 1:
mask = np.random.choice([True, False], len(fn_array), p=[p, 1-p])
fn_array = fn_array[mask]
print(len(fn_array))
class_array = np.append(class_array, fn_array, axis=0)
print(len(class_array))
del fn_array
labels = np.append(np.zeros(count), np.ones(len(class_array) - count))
print(len(class_array), len(labels), labels[0], labels[count-1], labels[count], labels[-1])
dataset_dict = dict()
if validation:
x_train, x_val, y_train, y_val = train_test_split(class_array, labels, test_size=TEST_SIZE)
print(len(y_train), sum(y_train), 1.0*sum(y_train)/len(y_train))
print(len(y_val), sum(y_val), 1.0*sum(y_val)/len(y_val))
dataset_dict["x_train"] = x_train
dataset_dict["x_val"] = x_val
dataset_dict["y_train"] = y_train
dataset_dict["y_val"] = y_val
else:
dataset_dict["x_test"] = class_array
dataset_dict["y_test"] = labels
export_dataset(dataset_dict, DATASET_DIR + class_name + "_vs_all_" + vpn_type)
if __name__ == '__main__':
# create_class_vs_all_specific_vpn_type_dataset(CLASS, validation=True)
# create_class_vs_all_specific_vpn_type_dataset(CLASS, vpn_type="vpn", validation=False)
create_class_vs_all_specific_vpn_type_dataset(CLASS, vpn_type="tor", validation=False)

TrafficParser/generic_parser.py

@@ -0,0 +1,127 @@
#!/usr/bin/env python
"""
Use dpkt to read a pcap file and create one-directional sessions of packet sizes (total IP length) and timestamps.
"""
import dpkt
import os
import socket
import argparse
import csv
import time
FLAGS = None
# INPUT = "../dataset/iscxNTVPN2016/CompletePCAPs"#"../dataset/CICNTTor2017/Pcaps/tor" #"../dataset/iscxNTVPN2016/CompletePCAPs"#"./test_pacaps"#"../dataset/iscxNTVPN2016/CompletePCAPs" # ""
INPUT = './test_pcaps/my_chat'
FILTER_LIST = None # [(["audio", "voip"], True), (["vpn", "tor"], False)]
PROTO_DICT = {dpkt.tcp.TCP: "TCP", dpkt.udp.UDP: "UDP"}
def inet_to_str(inet):
"""Convert inet object to a string
Args:
inet (inet struct): inet network address
Returns:
str: Printable/readable IP address
"""
# First try ipv4 and then ipv6
try:
return socket.inet_ntop(socket.AF_INET, inet)
except ValueError:
return socket.inet_ntop(socket.AF_INET6, inet)
def get_pcaps_list(dir_path, filter_list=None):
def filter_list_func(fn):
if filter_list is not None:
for filter_str_list, type in filter_list:
result = any([filter_str in fn.lower() for filter_str in filter_str_list])
if result is not type:
return False
return True
return [(os.path.join(dir_path, fn), fn) for fn in next(os.walk(dir_path))[2] if (".pcap" in os.path.splitext(fn)[-1] and filter_list_func(fn))]
def parse_pcap(pcap, pcap_path, file_name):
"""Print out information about each packet in a pcap
Args:
pcap: dpkt pcap reader object (dpkt.pcap.Reader)
"""
counter = 0
pcap_dict = {}
# For each packet in the pcap process the contents
for ts, packet in pcap:
# Unpack the Ethernet frame
        try:
            eth = dpkt.ethernet.Ethernet(packet)
        except dpkt.dpkt.NeedData:
            print("dpkt.dpkt.NeedData")
            continue  # truncated frame: skip it rather than reuse a stale eth
# Make sure the Ethernet data contains an IP packet
if isinstance(eth.data, dpkt.ip.IP):
ip = eth.data
elif isinstance(eth.data, str):
try:
ip = dpkt.ip.IP(packet)
except dpkt.UnpackError:
continue
else:
continue
# Now unpack the data within the Ethernet frame (the IP packet)
# Pulling out src_ip, dst_ip, protocol (tcp/udp), dst/src port, length
proto = ip.data
        # Record the packet under its unidirectional 5-tuple session key
        if type(ip.data) in PROTO_DICT:
            session_tuple_key = (inet_to_str(ip.src), proto.sport, inet_to_str(ip.dst), proto.dport, PROTO_DICT[type(ip.data)])
            pcap_dict.setdefault(session_tuple_key, (ts, [], []))
            d = pcap_dict[session_tuple_key]
            size = len(ip)  # total IP length; alternatively ip.len
            d[1].append(round(ts - d[0], 6))  # arrival time relative to session start
            d[2].append(size)
            counter += 1
print("Total Number of Parsed Packets in " + pcap_path + ": " + str(counter))
csv_file_path = os.path.splitext(pcap_path)[0] + ".csv"
    with open(csv_file_path, 'w', newline='') as csv_file:  # text mode for Python 3's csv
writer = csv.writer(csv_file)
for key, value in pcap_dict.items():
writer.writerow([file_name.split(".")[0]] + list(key) + [value[0], len(value[1])] + value[1] + [None] + value[2])
    for k, v in pcap_dict.items():
if len(v[1]) > 2000:
print(k, v[0], len(v[1]))
def generic_parser(file_list):
"""Open up a pcap file and create a output file containing all one-directional parsed sessions"""
for pcap_path, file_name in file_list:
try:
with open(pcap_path, 'rb') as f:
pcap = dpkt.pcap.Reader(f)
parse_pcap(pcap, pcap_path, file_name)
except ValueError:
new_pcap_file = os.path.splitext(pcap_path)[0] + "_new.pcap"
os.system("editcap -F libpcap -T ether " + pcap_path + " " + new_pcap_file)
with open(new_pcap_file, 'rb') as f:
pcap = dpkt.pcap.Reader(f)
parse_pcap(pcap, pcap_path, file_name)
os.remove(new_pcap_file)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, default=INPUT, help='Path to pcap')
FLAGS = parser.parse_args()
file_list = get_pcaps_list(FLAGS.input, FILTER_LIST)
start_time = time.time()
generic_parser(file_list)
total_time = time.time() - start_time
print("--- %s seconds ---" % total_time)

TrafficParser/sessions_plotter.py

@@ -0,0 +1,75 @@
#!/usr/bin/env python
"""
sessions_plotter.py has three functions to create a spectrogram, a histogram, and a 2D histogram from a [(ts, size), ...] session.
"""
import matplotlib.pyplot as plt
import numpy as np
MTU = 1500
def session_spectogram(ts, sizes, name=None):
plt.scatter(ts, sizes, marker='.')
plt.ylim(0, MTU)
plt.xlim(ts[0], ts[-1])
# plt.yticks(np.arange(0, MTU, 10))
# plt.xticks(np.arange(int(ts[0]), int(ts[-1]), 10))
    plt.title(name + " Session Spectrogram")
plt.ylabel('Size [B]')
plt.xlabel('Time [sec]')
plt.grid(True)
plt.show()
def session_atricle_spectogram(ts, sizes, fpath=None, show=True, tps=None):
if tps is None:
max_delta_time = ts[-1] - ts[0]
else:
max_delta_time = tps
ts_norm = ((np.array(ts) - ts[0]) / max_delta_time) * MTU
plt.figure()
plt.scatter(ts_norm, sizes, marker=',', c='k', s=5)
plt.ylim(0, MTU)
plt.xlim(0, MTU)
plt.ylabel('Packet Size [B]')
plt.xlabel('Normalized Arrival Time')
plt.set_cmap('binary')
    plt.gca().set_aspect('equal')  # use the current axes; plt.axes() would create a new one
plt.grid(False)
if fpath is not None:
# plt.savefig(OUTPUT_DIR + fname, bbox_inches='tight', pad_inches=1)
plt.savefig(fpath, bbox_inches='tight')
if show:
plt.show()
plt.close()
def session_histogram(sizes, plot=False):
hist, bin_edges = np.histogram(sizes, bins=range(0, MTU + 1, 1))
if plot:
plt.bar(bin_edges[:-1], hist, width=1)
plt.xlim(min(bin_edges), max(bin_edges)+100)
plt.show()
return hist.astype(np.uint16)
def session_2d_histogram(ts, sizes, plot=False, tps=None):
if tps is None:
max_delta_time = ts[-1] - ts[0]
else:
max_delta_time = tps
# ts_norm = map(int, ((np.array(ts) - ts[0]) / max_delta_time) * MTU)
ts_norm = ((np.array(ts) - ts[0]) / max_delta_time) * MTU
H, xedges, yedges = np.histogram2d(sizes, ts_norm, bins=(range(0, MTU + 1, 1), range(0, MTU + 1, 1)))
if plot:
plt.pcolormesh(xedges, yedges, H)
plt.colorbar()
plt.xlim(0, MTU)
plt.ylim(0, MTU)
plt.set_cmap('binary')
plt.show()
return H.astype(np.uint16)
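# --- Usage sketch (illustrative addition, not part of the original module) ---
# Build and display a FlowPic for a synthetic 60-second session.
if __name__ == '__main__':
    rng = np.random.default_rng(0)
    ts = np.sort(rng.uniform(0, 60, 500))    # 500 arrival times within 60 s
    sizes = rng.integers(40, MTU, size=500)  # synthetic IP packet sizes [B]
    session_2d_histogram(ts, sizes, plot=True, tps=60)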

TrafficParser/traffic_csv_converter.py

@@ -0,0 +1,197 @@
#!/usr/bin/env python
"""
Read session CSVs and convert each session into FlowPic 2D histograms, exported as .npy arrays.
"""
import os
import argparse
import csv
from sessions_plotter import *
import glob
import re
FLAGS = None
INPUT = "../raw_csvs/classes/browsing/reg/CICNTTor_browsing.raw.csv"#"../dataset/iscxNTVPN2016/CompletePCAPs" # ""
INPUT_DIR = "../raw_csvs/classes/chat/vpn/"
CLASSES_DIR = "../raw_csvs/classes/**/**/"
# LABEL_IND = 1
TPS = 60 # TimePerSession in secs
DELTA_T = 60 # Delta T between split sessions
MIN_TPS = 50
# def insert_dataset(dataset, labels, session, label_ind=LABEL_IND):
# dataset.append(session)
# labels.append(label_ind)
# def export_dataset(dataset, labels):
# print "Start export dataset"
# np.savez(INPUT.split(".")[0] + ".npz", X=dataset, Y=labels)
# print dataset.shape, labels.shape
#
# def import_dataset():
# print "Import dataset"
# dataset = np.load(INPUT.split(".")[0] + ".npz")
# print dataset["X"].shape, dataset["Y"].shape
def export_dataset(dataset):
print("Start export dataset")
np.save(os.path.splitext(INPUT)[0], dataset)
print(dataset.shape)
def export_class_dataset(dataset, class_dir):
print("Start export dataset")
np.save(class_dir + "/" + "_".join(re.findall(r"[\w']+", class_dir)[-2:]), dataset)
print(dataset.shape)
def import_dataset():
print("Import dataset")
dataset = np.load(os.path.splitext(INPUT)[0] + ".npy")
print(dataset.shape)
return dataset
def traffic_csv_converter(file_path):
print("Running on " + file_path)
dataset = []
# labels = []
counter = 0
with open(file_path, 'r') as csv_file:
reader = csv.reader(csv_file)
for i, row in enumerate(reader):
# print row[0], row[7]
session_tuple_key = tuple(row[:8])
length = int(row[7])
ts = np.array(row[8:8+length], dtype=float)
sizes = np.array(row[9+length:], dtype=int)
# if (sizes > MTU).any():
# a = [(sizes[i], i) for i in range(len(sizes)) if (np.array(sizes) > MTU)[i]]
# print len(a), session_tuple_key
if length > 10:
# print ts[0], ts[-1]
# h = session_2d_histogram(ts, sizes)
# session_spectogram(ts, sizes, session_tuple_key[0])
# dataset.append([h])
# counter += 1
# if counter % 100 == 0:
# print counter
for t in range(int(ts[-1]/DELTA_T - TPS/DELTA_T) + 1):
mask = ((ts >= t * DELTA_T) & (ts <= (t * DELTA_T + TPS)))
# print t * DELTA_T, t * DELTA_T + TPS, ts[-1]
ts_mask = ts[mask]
sizes_mask = sizes[mask]
if len(ts_mask) > 10 and ts_mask[-1] - ts_mask[0] > MIN_TPS:
# if "facebook" in session_tuple_key[0]:
# session_spectogram(ts[mask], sizes[mask], session_tuple_key[0])
# # session_2d_histogram(ts[mask], sizes[mask], True)
# session_histogram(sizes[mask], True)
# exit()
# else:
# continue
h = session_2d_histogram(ts_mask, sizes_mask)
# session_spectogram(ts_mask, sizes_mask, session_tuple_key[0])
dataset.append([h])
counter += 1
if counter % 100 == 0:
print(counter)
return np.asarray(dataset) #, np.asarray(labels)
def traffic_csv_converter_splitted(file_path):
    def split_converter(ts, sizes, dataset, counter):
        # Note: `counter` is an int and is rebound locally; increments do not reach the caller.
if ts[-1] - ts[0] > MIN_TPS and len(ts) > 20:
# print ts[0], ts[-1]
h = session_2d_histogram(ts-ts[0], sizes)
# session_spectogram(ts, sizes, session_tuple_key[0])
dataset.append([h])
counter += 1
# if counter % 100 == 0:
# print counter
total_time = ts[-1] - ts[0]
if total_time > TPS:
            for ts_split, sizes_split in zip(np.split(ts, [len(ts) // 2]), np.split(sizes, [len(sizes) // 2])):
split_converter(ts_split, sizes_split, dataset, counter)
print("Running on " + file_path)
dataset = []
# labels = []
counter = 0
with open(file_path, 'r') as csv_file:
reader = csv.reader(csv_file)
for i, row in enumerate(reader):
# print row[0], row[7]
session_tuple_key = tuple(row[:8])
length = int(row[7])
ts = np.array(row[8:8+length], dtype=float)
sizes = np.array(row[9+length:], dtype=int)
# if (sizes > MTU).any():
# a = [(sizes[i], i) for i in range(len(sizes)) if (np.array(sizes) > MTU)[i]]
# print len(a), session_tuple_key
if length > 10:
split_converter(ts, sizes, dataset, counter)
return np.asarray(dataset)
def traffic_class_converter(dir_path):
dataset_tuple = ()
for file_path in [os.path.join(dir_path, fn) for fn in next(os.walk(dir_path))[2] if (".csv" in os.path.splitext(fn)[-1])]:
        dataset_tuple += (traffic_csv_converter(file_path),)
return np.concatenate(dataset_tuple, axis=0)
def iterate_all_classes():
for class_dir in glob.glob(CLASSES_DIR):
if "other" not in class_dir: #"browsing" not in class_dir and
print("working on " + class_dir)
dataset = traffic_class_converter(class_dir)
print(dataset.shape)
export_class_dataset(dataset, class_dir)
def random_sampling_dataset(input_array, size=2000):
print("Import dataset " + input_array)
dataset = np.load(input_array)
print(dataset.shape)
p = size*1.0/len(dataset)
print(p)
    if p >= 1:
        raise ValueError("requested sample size exceeds the dataset size")
mask = np.random.choice([True, False], len(dataset), p=[p, 1-p])
dataset = dataset[mask]
print("Start export dataset")
np.save(os.path.splitext(input_array)[0] + "_samp", dataset)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, default=INPUT, help='Path to csv file')
FLAGS = parser.parse_args()
##
# iterate_all_classes()
# dataset = traffic_class_converter(INPUT_DIR)
# dataset = traffic_csv_converter(INPUT)
input_array = "../raw_csvs/classes/browsing/reg/browsing_reg.npy"
random_sampling_dataset(input_array)
# export_class_dataset(dataset)
# import_dataset()

TrafficParser/traffic_csv_merger.py

@@ -0,0 +1,122 @@
#!/usr/bin/env python
"""
traffic_csv_merger.py merges all filtered traffic CSVs from the same original pcap dataset, class, and VPN type into one merged CSV file.
"""
import os
import argparse
import csv
from sessions_plotter import *
FLAGS = None
# INPUT = "../raw_csvs/CICNTTor2017/tor"
# INPUT = "../raw_csvs/iscxNTVPN2016" # ""
# INPUT = "D:/TS/Internet Traffic Classification/TrafficParser/test_pcaps/my_chat"
INPUT = "./test_pcaps/my_chat"
# OUTPUT1 = "CICNTTor_browsing_tor.raw.csv"
# OUTPUT2 = "CICNTTor_browsing_others_tor.raw.csv"
# OUTPUT1 = "iscx_email.raw.csv"
# OUTPUT2 = "iscx_email_others.raw.csv"
# OUTPUT3 = "iscx_video_voip.raw.csv"
OUTPUT1 = "my_chat.raw.csv"
OUTPUT2 = "my_chat_others.raw.csv"
# FILTER_LIST = [(["audio", "voip"], True), (["tor", "vpn"], False)] #-> voip, , "tor"
# FILTER_LIST = [(["video", "youtube", "vimeo", "netflix"], True), (["tor", "vpn"], False)] #-> video
# FILTER_LIST = [(["audio", "voip"], True), (["spotify"], False)] #-> voip, , "tor"
# FILTER_LIST = [(["ftps", "scp", "sftp", "file"], True), (["mail", "pop", "tor"], False), (["vpn"], True)]
FILTER_LIST = [(["chat"], True), (["vpn"], False)]
def get_csvs_list(dir_path, filter_list=None):
def filter_list_func(fn):
if filter_list is not None:
for filter_str_list, type in filter_list:
result = any([filter_str in fn.lower() for filter_str in filter_str_list])
if result is not type:
return False
return True
return [(os.path.join(dir_path, fn), fn) for fn in next(os.walk(dir_path))[2] if (".csv" in os.path.splitext(fn)[-1] and filter_list_func(fn))]
def traffic_csv_reader(file_list):
    output1 = open(OUTPUT1, 'w', newline='')
writer1 = csv.writer(output1)
counter1 = 0
    output2 = open(OUTPUT2, 'w', newline='')
writer2 = csv.writer(output2)
counter2 = 0
# output3 = open(OUTPUT3, 'wb')
# writer3 = csv.writer(output3)
# counter3 = 0
rate_list = []
for i, (file_path, file_name) in enumerate(file_list):
print("Running on " + str(i) + " file - " + file_path)
with open(file_path, 'r') as csv_file:
reader = csv.reader(csv_file)
for i, row in enumerate(reader):
session_tuple_key = tuple(row[:8])
length = int(row[7])
ts = np.array(row[8:8+length], dtype=float)
if length >= 20:
total_time = ts[-1] - ts[0]
sizes = np.array(row[9+length:], dtype=int)
# print row[0], length, total_time, length/total_time
rate = length/total_time
rate_list.append(rate)
# if (sizes > MTU).any():
# a = [(sizes[i], i) for i in range(len(sizes)) if (np.array(sizes) > MTU)[i]]
# print len(a), session_tuple_key, a
# if ("facebook" in row[0] and rate > 40) or ("facebook" not in row[0] and 20 <= rate and length >= 1000): # for iscx_voip
# if "facebook" not in row[0] and 40 <= rate and length >= 1000: # for iscx_voip_vpn
# if ("youtube" in row[0] and row[2] == '443' and total_time > 10 and rate > 15) or ("vimeo" in row[0] and rate > 30 and total_time > 15) or ("netflix" in row[0] and rate > 60) or ("facebook" in row[0] and rate > 60) or (40 <= rate and row[5] == "UDP" and "facebook" not in row[0]): # for iscx_video
# if length > 6000 and rate > 10 and total_time > 10 and (row[2] == '443' or row[2] == '80'): # for iscx_video_vpn
# if total_time > 30 and rate > 30 and (row[2] == '443' or row[2] == '80'): # for CICNTTor_video
# if total_time > 30 and (row[2] == '443' or row[2] == '80'): # for CICNTTor_video
# if total_time > 10 and rate > 10 and length > 1000 and (session_tuple_key[-1] != '3326' and session_tuple_key[-1] != '6367'): # for CICNTTor_voip
# if (total_time > 10 and ((rate > 100) or ("skype" in row[0] and rate > 10))) or ("rent" in row[0] and total_time > 20 and row[2] != '21943' and row[4] != '28904'): #for iscx_file
# if ("torrent" in row[0] and total_time > 30 and rate > 10 and (row[2] == '443' or row[2] == '80')) or ("torrent" not in row[0] and total_time > 30 and rate > 10 and int(row[2]) not in [22, 1781, 59886, 35968]): #for iscx_file_vpn
# if (total_time > 20) and (("POP" in row[0] and rate > 150) or (("IMAP" in row[0] and rate > 10 and row[7][0] == '8') or row[0] == 'FTP_filetransfer' and total_time > 20 and rate > 100) or ( rate > 10 and "SFTP" in row[0])): #for CICNTTor_file
# if total_time > 20 and rate > 5: #for CICNTTor_file_tor
# if ("skype" in row[0] and total_time > 20 and rate > 3 and row[1][:2] == "10") or ("ftps" in row[0] and total_time > 20 and rate > 300 and row[2] != '1781') or ("sftp" in row[0] and total_time > 20 and rate > 5 and (('A' in row[0] and row[4]=='22') or ('B' in row[0] and row[2]=='22'))):#for iscx_file_vpn
if (total_time > 50 and rate<5) and (("whats" in row[0] and ("185.60." in row[1] or "185.60." in row[3])) or ("hang" in row[0] and ("216.58." in row[1] or "216.58." in row[3])) or ("book" in row[0] and ("192.114." in row[1] or "192.114." in row[3]))): #my_chat
# if (row[1] in ["131.202.240.242","131.202.240.45"] and row[3] in ["131.202.240.242","131.202.240.45"]) or ("gmail" in row[0] and "131.202.240.87" in [row[1], row[3]] and total_time > 40 and row[5] == 'TCP' and rate < 0.3): #for scx_chat
# if ("skype" in row[0] and (row[1] in ["86.4.212.228", '157.56.52.13', '64.4.23.162'] or row[3] in ["86.4.212.228", '157.56.52.13', '64.4.23.162'])) or (total_time > 20 and rate < 2 and (("205.188." not in (row[1]+row[3]) and "hang" not in row[0]) or ("hang" in row[0] and "216.58" in (row[1]+row[3])))): #for iscx_chat_vpn
# if total_time > 20 and rate < 2: #for CICNTTor_chat
# if total_time > 20 and (row[2] in ['80', '443'] or row[4] in ['80', '443']): #for CICNTTor_browsing_tor
# if total_time > 20:
writer1.writerow(row)
counter1 += 1
print(session_tuple_key, total_time, rate)
# session_spectogram(ts, sizes, session_tuple_key[0])
# elif "facebook" in row[0] and rate > 50:# --> voip
else:
writer2.writerow(row)
counter2 += 1
# # #
# if "skype" in row[0] and (row[1] in ["86.4.212.228", '157.56.52.13', '64.4.23.162'] or row[3] in ["86.4.212.228", '157.56.52.13', '64.4.23.162']):# and row[2] == '443' and rate >10:
# print session_tuple_key, total_time, rate
# session_spectogram(ts, sizes, session_tuple_key[0])
print("Total sessions in " + OUTPUT1 + " : " + str(counter1))
print("Total sessions in " + OUTPUT2 + " : " + str(counter2))
output1.close()
output2.close()
print(rate_list)
plt.hist(rate_list)
plt.show()
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, default=INPUT, help='Path to csvs folder')
FLAGS = parser.parse_args()
file_list = get_csvs_list(FLAGS.input, FILTER_LIST)
print("Total number of files " + str(len(file_list)) + " in " + INPUT)
traffic_csv_reader(file_list)

File diff suppressed because one or more lines are too long

@@ -0,0 +1,42 @@
import matplotlib.pyplot as plt
import numpy as np
MTU = 1500
def session_spectogram(ts, sizes, name=None):
plt.scatter(ts, sizes, marker='.')
plt.ylim(0, MTU + 100)
plt.xlim(ts[0], ts[-1])
# plt.yticks(np.arange(0, MTU, 10))
# plt.xticks(np.arange(int(ts[0]), int(ts[-1]), 10))
    plt.title(name + " Session Spectrogram")
plt.ylabel('Size [B]')
plt.xlabel('Time [sec]')
plt.grid(True)
plt.show()
def session_histogram(sizes, plot=False):
hist, bin_edges = np.histogram(sizes, bins=range(0, MTU + 1, 1))
if plot:
plt.bar(bin_edges[:-1], hist, width=1)
plt.xlim(min(bin_edges), max(bin_edges)+100)
plt.show()
return hist.astype(np.uint16)
def session_2d_histogram(ts, sizes, plot=False):
# ts_norm = map(int, ((np.array(ts) - ts[0]) / (ts[-1] - ts[0])) * MTU)
ts_norm = ((np.array(ts) - ts[0]) / (ts[-1] - ts[0])) * MTU
H, xedges, yedges = np.histogram2d(sizes, ts_norm, bins=(range(0, MTU + 1, 1), range(0, MTU + 1, 1)))
if plot:
plt.pcolormesh(xedges, yedges, H)
plt.colorbar()
plt.xlim(0, MTU)
plt.ylim(0, MTU)
plt.set_cmap('binary')
plt.show()
return H.astype(np.uint16)