# Anomaly Detection: File Create, Update & Delete Deltas

In this notebook we explore the time that elapses between a file's Create/Update event and its accompanying Delete event.

<div class="alert alert-info">
    Before running this notebook, run <a href="https://2ims40.pages.dev/file_create_delete.html">file_create_delete.ipynb</a> to generate the <code>file_times.csv</code> file.
</div>

In [12]:
from datetime import datetime
import pandas as pd
from tqdm.auto import tqdm
from sklearn.cluster import KMeans
from IPython.display import display

First we import the `file_times.csv` file to get the dataframe with the Deletion, Creation & Update timestamps per `TargetFilename`. <br>
We filter out the rows where `UpdateTime` is 0, because these rows consist only of Deletion events.

In [15]:
df = pd.read_csv('../file-create-delete/file_times.csv', index_col=0)
usable = df[df['UpdateTime'] != '0'].reset_index(drop=True)
usable

Unnamed: 0,TargetFilename,DeletionTime,CreateTime,UpdateTime
0,C:\Windows\ServiceProfiles\NetworkService\AppD...,2022-12-09 09:51:40.809,2022-11-09 03:01:12.045,2022-12-23 15:27:40.678
1,C:\ProgramData\regid.1991-06.com.microsoft\reg...,2022-12-09 09:51:40.984,2022-11-09 10:53:53.605,2022-12-23 17:42:06.559
2,C:\ProgramData\Microsoft\Diagnosis\DownloadedS...,2022-12-09 09:51:41.265,2022-12-01 14:34:16.181,2022-12-23 15:27:39.590
3,C:\ProgramData\Microsoft\Diagnosis\parse.dat,2022-12-09 09:51:41.307,2022-12-01 14:34:16.775,2022-12-23 15:27:39.900
4,C:\Windows\System32\LogFiles\WMI\Diagtrack-Lis...,2022-12-09 09:51:41.704,2022-12-09 09:51:29.902,2022-12-09 09:51:29.926
...,...,...,...,...
34167,C:\Users\User\AppData\Roaming\Microsoft\Window...,2022-12-23 18:39:03.490,2022-12-23 18:36:09.885,2022-12-23 18:39:39.656
34168,C:\Windows\SERVIC~1\LOCALS~1\AppData\Local\Tem...,2022-12-23 18:39:18.876,2022-12-23 18:39:17.709,2022-12-23 18:39:17.709
34169,C:\Windows\SERVIC~1\LOCALS~1\AppData\Local\Tem...,2022-12-23 18:39:18.981,2022-12-23 18:39:17.709,2022-12-23 18:39:17.709
34170,C:\Windows\SERVIC~1\LOCALS~1\AppData\Local\Tem...,2022-12-23 18:39:18.988,2022-12-23 18:39:17.709,2022-12-23 18:39:17.709


To use this dataset we need to add 2 columns, namely `CreateDelta` & `UpdateDelta`. <br>
These columns hold the results of `DeletionTime - CreateTime` and `DeletionTime - UpdateTime` respectively, expressed in seconds.

In [16]:
# Timestamp layout used in file_times.csv, e.g. "2022-12-09 09:51:40.809"
datetime_format = "%Y-%m-%d %H:%M:%S.%f"

# Initialise the delta columns as floats, since the deltas are fractional seconds
usable['CreateDelta'] = 0.0
usable['UpdateDelta'] = 0.0

for index, row in tqdm(usable.iterrows(), total=usable.shape[0]):
    deletion_time = datetime.strptime(row['DeletionTime'], datetime_format)
    # Seconds from creation (resp. update) to deletion; negative if that event is logged after the deletion
    usable.loc[index, 'CreateDelta'] = (deletion_time - datetime.strptime(row['CreateTime'], datetime_format)).total_seconds()
    usable.loc[index, 'UpdateDelta'] = (deletion_time - datetime.strptime(row['UpdateTime'], datetime_format)).total_seconds()

usable

  0%|          | 0/34172 [00:00<?, ?it/s]

Unnamed: 0,TargetFilename,DeletionTime,CreateTime,UpdateTime,CreateDelta,UpdateDelta
0,C:\Windows\ServiceProfiles\NetworkService\AppD...,2022-12-09 09:51:40.809,2022-11-09 03:01:12.045,2022-12-23 15:27:40.678,2616628.764,-1229759.869
1,C:\ProgramData\regid.1991-06.com.microsoft\reg...,2022-12-09 09:51:40.984,2022-11-09 10:53:53.605,2022-12-23 17:42:06.559,2588267.379,-1237825.575
2,C:\ProgramData\Microsoft\Diagnosis\DownloadedS...,2022-12-09 09:51:41.265,2022-12-01 14:34:16.181,2022-12-23 15:27:39.590,674245.084,-1229758.325
3,C:\ProgramData\Microsoft\Diagnosis\parse.dat,2022-12-09 09:51:41.307,2022-12-01 14:34:16.775,2022-12-23 15:27:39.900,674244.532,-1229758.593
4,C:\Windows\System32\LogFiles\WMI\Diagtrack-Lis...,2022-12-09 09:51:41.704,2022-12-09 09:51:29.902,2022-12-09 09:51:29.926,11.802,11.778
...,...,...,...,...,...,...
34167,C:\Users\User\AppData\Roaming\Microsoft\Window...,2022-12-23 18:39:03.490,2022-12-23 18:36:09.885,2022-12-23 18:39:39.656,173.605,-36.166
34168,C:\Windows\SERVIC~1\LOCALS~1\AppData\Local\Tem...,2022-12-23 18:39:18.876,2022-12-23 18:39:17.709,2022-12-23 18:39:17.709,1.167,1.167
34169,C:\Windows\SERVIC~1\LOCALS~1\AppData\Local\Tem...,2022-12-23 18:39:18.981,2022-12-23 18:39:17.709,2022-12-23 18:39:17.709,1.272,1.272
34170,C:\Windows\SERVIC~1\LOCALS~1\AppData\Local\Tem...,2022-12-23 18:39:18.988,2022-12-23 18:39:17.709,2022-12-23 18:39:17.709,1.279,1.279
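
The same deltas can also be computed column-wise, without a Python-level loop. The snippet below is a minimal sketch of that idea, not the code used to produce the output above: `add_deltas` and `sample` are names introduced here for illustration, and the two sample rows reuse timestamps that appear in the table.

```python
import pandas as pd

# Tiny stand-in for `usable`; both rows reuse timestamps shown in the table above.
sample = pd.DataFrame({
    "DeletionTime": ["2022-12-09 09:51:41.704", "2022-12-23 18:39:03.490"],
    "CreateTime":   ["2022-12-09 09:51:29.902", "2022-12-23 18:36:09.885"],
    "UpdateTime":   ["2022-12-09 09:51:29.926", "2022-12-23 18:39:39.656"],
})

def add_deltas(frame, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Return a copy of `frame` with CreateDelta and UpdateDelta in seconds."""
    out = frame.copy()
    deleted = pd.to_datetime(out["DeletionTime"], format=fmt)
    for source, target in (("CreateTime", "CreateDelta"), ("UpdateTime", "UpdateDelta")):
        # Subtracting two datetime columns gives timedeltas; convert them to float seconds.
        out[target] = (deleted - pd.to_datetime(out[source], format=fmt)).dt.total_seconds()
    return out

with_deltas = add_deltas(sample)
print(with_deltas[["CreateDelta", "UpdateDelta"]])
```

A negative `UpdateDelta`, as in the second sample row, means that the recorded update happened after the deletion event.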


We only look at `.ps1` and `.psm1` files, because the malicious process creates files of these types and they are otherwise rather unusual.

In [40]:
only_powershell = usable[(usable['TargetFilename'].str.endswith('.ps1')) | usable['TargetFilename'].str.endswith('.psm1')].reset_index()
only_powershell

Unnamed: 0,index,TargetFilename,DeletionTime,CreateTime,UpdateTime,CreateDelta,UpdateDelta
0,261,C:\Windows\Temp\__PSScriptPolicyTest_mj1liz3f....,2022-12-09 10:09:48.710,2022-12-09 10:09:48.655,2022-12-09 10:09:48.655,0.055,0.055
1,262,C:\Windows\Temp\__PSScriptPolicyTest_jiygjqrb....,2022-12-09 10:09:48.711,2022-12-09 10:09:48.656,2022-12-09 10:09:48.656,0.055,0.055
2,382,C:\Windows\System32\WindowsPowerShell\v1.0\Mod...,2022-12-09 10:18:41.302,2022-12-01 14:48:02.052,2022-12-09 10:18:41.303,675039.25,-0.001
3,412,C:\Windows\SystemTemp\448EA551-9FF6-4E24-9C07-...,2022-12-09 10:18:51.069,2022-12-09 10:18:32.711,2022-12-09 10:18:32.711,18.358,18.358
4,824,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-09 10:41:59.492,2022-12-09 10:41:59.457,2022-12-09 10:41:59.457,0.035,0.035
5,825,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-09 10:41:59.493,2022-12-09 10:41:59.457,2022-12-09 10:41:59.457,0.036,0.036
6,834,C:\Windows\Temp\SDIAG_db5d8708-c972-4fdb-bbd5-...,2022-12-09 10:42:03.587,2022-12-09 10:41:58.243,2022-12-09 10:41:58.262,5.344,5.325
7,840,C:\Windows\Temp\SDIAG_db5d8708-c972-4fdb-bbd5-...,2022-12-09 10:42:03.624,2022-12-09 10:41:58.321,2022-12-09 10:41:58.323,5.303,5.301
8,841,C:\Windows\Temp\SDIAG_db5d8708-c972-4fdb-bbd5-...,2022-12-09 10:42:03.624,2022-12-09 10:41:58.323,2022-12-09 10:41:58.323,5.301,5.301
9,842,C:\Windows\Temp\SDIAG_db5d8708-c972-4fdb-bbd5-...,2022-12-09 10:42:03.624,2022-12-09 10:41:58.323,2022-12-09 10:41:58.323,5.301,5.301
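
Because Windows file names are case-insensitive, a case-insensitive suffix match may be worth considering for the extension filter. The following is a small sketch under that assumption; the paths in `paths` are hypothetical examples (apart from `parse.dat`, which appears in the data above), and the regular expression is one possible way to express "ends in `.ps1` or `.psm1`".

```python
import pandas as pd

paths = pd.Series([
    r"C:\Windows\Temp\example_a.ps1",       # hypothetical path
    r"C:\Windows\Temp\EXAMPLE_B.PSM1",      # hypothetical path, upper-case extension
    r"C:\Windows\Temp\example_c.ps1.bak",   # hypothetical path, extension not at the end
    r"C:\ProgramData\Microsoft\Diagnosis\parse.dat",
])

# `\.psm?1$` matches a trailing ".ps1" or ".psm1"; case=False ignores letter case,
# and na=False treats missing file names as non-matches.
is_powershell = paths.str.contains(r"\.psm?1$", case=False, regex=True, na=False)
print(paths[is_powershell].tolist())
```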


In [37]:
display(only_powershell[only_powershell['TargetFilename'].str.contains('__PSScriptPolicyTest_b35xidpj')])

alerts = only_powershell[(only_powershell['CreateDelta'] < .5) & (only_powershell['UpdateDelta'] < .5)].reset_index(drop=True)

display(alerts)

Unnamed: 0,index,TargetFilename,DeletionTime,CreateTime,UpdateTime,CreateDelta,UpdateDelta
53,32960,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-23 14:18:14.867,2022-12-23 14:18:14.469,2022-12-23 14:18:14.469,0.398,0.398


Unnamed: 0,index,TargetFilename,DeletionTime,CreateTime,UpdateTime,CreateDelta,UpdateDelta
0,261,C:\Windows\Temp\__PSScriptPolicyTest_mj1liz3f....,2022-12-09 10:09:48.710,2022-12-09 10:09:48.655,2022-12-09 10:09:48.655,0.055,0.055
1,262,C:\Windows\Temp\__PSScriptPolicyTest_jiygjqrb....,2022-12-09 10:09:48.711,2022-12-09 10:09:48.656,2022-12-09 10:09:48.656,0.055,0.055
2,824,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-09 10:41:59.492,2022-12-09 10:41:59.457,2022-12-09 10:41:59.457,0.035,0.035
3,825,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-09 10:41:59.493,2022-12-09 10:41:59.457,2022-12-09 10:41:59.457,0.036,0.036
4,3273,C:\Windows\Temp\__PSScriptPolicyTest_oudiyrti....,2022-12-16 09:58:35.467,2022-12-16 09:58:35.333,2022-12-16 09:58:35.333,0.134,0.134
5,3274,C:\Windows\Temp\__PSScriptPolicyTest_0njfuqwh....,2022-12-16 09:58:35.469,2022-12-16 09:58:35.336,2022-12-16 09:58:35.336,0.133,0.133
6,3753,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-16 10:37:54.150,2022-12-16 10:37:54.065,2022-12-16 10:37:54.065,0.085,0.085
7,3754,C:\Users\User\AppData\Local\Temp\__PSScriptPol...,2022-12-16 10:37:54.157,2022-12-16 10:37:54.069,2022-12-16 10:37:54.069,0.088,0.088
8,21283,C:\Windows\Temp\__PSScriptPolicyTest_zaksyazf....,2022-12-19 08:23:25.208,2022-12-19 08:23:25.107,2022-12-19 08:23:25.107,0.101,0.101
9,21284,C:\Windows\Temp\__PSScriptPolicyTest_xn2yn2xz....,2022-12-19 08:23:25.208,2022-12-19 08:23:25.107,2022-12-19 08:23:25.107,0.101,0.101
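
The alert rule used above flags a script when both `CreateDelta` and `UpdateDelta` are below 0.5 seconds. As a sketch, the rule can be wrapped in a function so that the threshold is easy to vary; `short_lived` and `threshold_s` are names introduced here, the file names are hypothetical, and the delta values are taken from rows of the table above.

```python
import pandas as pd

def short_lived(frame, threshold_s=0.5):
    """Keep rows whose CreateDelta and UpdateDelta are both below `threshold_s` seconds."""
    both_fast = (frame["CreateDelta"] < threshold_s) & (frame["UpdateDelta"] < threshold_s)
    return frame[both_fast].reset_index(drop=True)

sample = pd.DataFrame({
    "TargetFilename": ["a.ps1", "b.psm1", "c.ps1"],  # hypothetical file names
    "CreateDelta": [0.055, 675039.25, 18.358],
    "UpdateDelta": [0.055, -0.001, 18.358],
})

flagged = short_lived(sample)
print(flagged["TargetFilename"].tolist())
```

Requiring both conditions matters for the second row: its `UpdateDelta` of -0.001 is below the threshold, but the file existed for days before it was deleted, so it is not flagged.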


In [None]:
list(only_powershell[only_powershell['Cluster'] == 4].reset_index(drop=True)['TargetFilename'])[-5:-1]

['C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_jrydxbmj.a5t.psm1',
 'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_b35xidpj.e2q.ps1',
 'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_cprwb2zm.jcw.psm1',
 'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_giuslhqj.bbb.ps1']

## Computing true/false positives/negatives

In this section, we compute the number of true positives, true negatives, false positives and false negatives, as well as some metrics derived from these quantities. Before we continue, it is useful to define all of these quantities. Here, a file is flagged when it is a `.ps1` or `.psm1` file whose `CreateDelta` and `UpdateDelta` are both below 0.5 seconds:
- The number of true positives is the number of files which are flagged, and which are known malware files;
- The number of true negatives is the number of files which are not flagged, and which are not known malware files;
- The number of false positives is the number of files which are flagged, and which are not known malware files;
- The number of false negatives is the number of files which are not flagged, and which are known malware files.

For the purposes of the above definition, the known malware processes are those associated with one of the following images:
- `C:\Users\User\Downloads\2ecbf5a27adc238af0b125b985ae2a8b1bc14526faea3c9e40e6c3437245d830.exe`
- `C:\Users\User\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup\Systdeeem.exe`

<div class="alert alert-info">
    Note: the following 4 files are in the process tree of one of the known malware processes:
    <ul>
        <li><code>C:\Users\User\AppData\Local\Temp\__PSScriptPolicyTest_jrydxbmj.a5t.psm1</code></li>
        <li><code>C:\Users\User\AppData\Local\Temp\__PSScriptPolicyTest_b35xidpj.e2q.ps1</code></li>
        <li><code>C:\Users\User\AppData\Local\Temp\__PSScriptPolicyTest_cprwb2zm.jcw.psm1</code></li>
        <li><code>C:\Users\User\AppData\Local\Temp\__PSScriptPolicyTest_giuslhqj.bbb.ps1</code></li>
    </ul>
</div>

In [23]:
# The malware filenames given above, written as Python string literals with escaped backslashes:
malware_filenames = set([
    'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_jrydxbmj.a5t.psm1',
    'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_b35xidpj.e2q.ps1',
    'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_cprwb2zm.jcw.psm1',
    'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_giuslhqj.bbb.ps1'
])
malware_filenames

{'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_b35xidpj.e2q.ps1',
 'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_cprwb2zm.jcw.psm1',
 'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_giuslhqj.bbb.ps1',
 'C:\\Users\\User\\AppData\\Local\\Temp\\__PSScriptPolicyTest_jrydxbmj.a5t.psm1'}

Using the set of malware filenames given above, the number of true positives is the number of filenames in `only_powershell` that occur in this set, computed below:

In [45]:
true_positives = len(malware_filenames.intersection(set(only_powershell['TargetFilename'])))
true_positives

4

To compute the number of false negatives, we simply take the set difference instead:

In [30]:
false_negatives = len(malware_filenames.difference(set(only_powershell['TargetFilename'])))
false_negatives

0

Next, the number of false positives is the number of filenames which were raised as alerts, but which are not in the set of malware filenames:

In [38]:
false_positives = len(set(alerts['TargetFilename']).difference(malware_filenames))
false_positives

20

Finally, the number of true negatives is the number of files which have a deletion event and an accompanying creation event, but which have not been detected as a true positive, false negative or false positive. This gives:

In [43]:
true_negatives = len(usable['TargetFilename']) - true_positives - false_negatives - false_positives
true_negatives

34148
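
The four counts can also be derived in one place with set operations. The sketch below is one possible formulation rather than the code used above: `confusion_counts` is a name introduced here, it works on unique filenames (whereas `len(usable['TargetFilename'])` counts rows), and it compares against the flagged set throughout. The file names in the example are hypothetical.

```python
def confusion_counts(all_files, flagged, malicious):
    """Count TP, FP, FN and TN over unique filenames using set operations."""
    universe = set(all_files)
    flagged = set(flagged) & universe      # only consider files that are in the dataset
    malicious = set(malicious) & universe
    tp = len(flagged & malicious)          # flagged and malicious
    fp = len(flagged - malicious)          # flagged but benign
    fn = len(malicious - flagged)          # malicious but missed
    tn = len(universe - flagged - malicious)  # neither flagged nor malicious
    return tp, fp, fn, tn

# Hypothetical file names, for illustration only.
files = ["a.ps1", "b.ps1", "c.psm1", "d.ps1", "e.dat", "f.dat"]
tp, fp, fn, tn = confusion_counts(
    files,
    flagged={"a.ps1", "b.ps1", "c.psm1"},
    malicious={"a.ps1", "d.ps1"},
)
print(tp, fp, fn, tn)
```

By construction the four counts always add up to the number of unique files, which gives a quick sanity check.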

With these four quantities, we compute some metrics:

In [46]:
accuracy = (true_positives + true_negatives) / (true_positives + false_positives + true_negatives + false_negatives)
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
FPR = false_positives / (false_positives + true_negatives) # false positive rate
TNR = true_negatives / (false_positives + true_negatives)
F1_score = 2 * precision * recall / (precision + recall)

print("Accuracy            = " + "{0:.3f}".format(accuracy))
print("Precision           = " + "{0:.3f}".format(precision))
print("Recall              = " + "{0:.3f}".format(recall))
print("False Positive Rate = " + "{0:.3f}".format(FPR))
print("True  Negative Rate = " + "{0:.3f}".format(TNR))
print("F1-score            = " + "{0:.3f}".format(F1_score))

Accuracy            = 0.999
Precision           = 0.167
Recall              = 1.000
False Positive Rate = 0.001
True  Negative Rate = 0.999
F1-score            = 0.286
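
For reuse with other thresholds, the metric formulas above can be collected in a helper that also guards against empty denominators (for example, when nothing is flagged). This is a minimal sketch; `safe_ratio` and `classification_metrics` are names introduced here, and the call at the end uses the counts computed in this notebook.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics used above from the four confusion counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": safe_ratio(fp, fp + tn),
        "true_negative_rate": safe_ratio(tn, fp + tn),
        "f1": safe_ratio(2 * precision * recall, precision + recall),
    }

# Counts from this notebook: 4 true positives, 20 false positives,
# 0 false negatives and 34148 true negatives.
metrics = classification_metrics(tp=4, fp=20, fn=0, tn=34148)
for name, value in metrics.items():
    print(f"{name:<20} = {value:.3f}")
```

Written out, the precision is 4 / (4 + 20) ≈ 0.167, so 20 of the 24 alerts are false positives, while the accuracy of 0.999 is dominated by the 34148 true negatives.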
