The forensic team provided us with over 19 TB of data, which included 1000+ email PST/OST data files and documents in PDF, Word, Excel, text, and other formats.
We first developed a conceptual risk matrix comprising sentiment analysis, keyword identification, and K-means clustering. Each output was assigned a rank, and the ranks allowed us to surface the highest-risk emails and text documents.
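As a rough illustration of how ranked outputs like these could be combined, the Python sketch below adds the three ranks into one score per document and sorts by it. The rank scale (1 = low risk, 3 = high risk), the equal default weights, and all names (DocumentSignals, risk_score, rank_by_risk) are assumptions made for this example, not the matrix actually used on the project.

```python
from dataclasses import dataclass


@dataclass
class DocumentSignals:
    """Ranks for one email or document (assumed scale: 1 = low risk, 3 = high risk)."""
    doc_id: str
    sentiment_rank: int
    keyword_rank: int
    cluster_rank: int


def risk_score(signals, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three ranks; equal weights are an assumption."""
    w_sentiment, w_keyword, w_cluster = weights
    return (
        w_sentiment * signals.sentiment_rank
        + w_keyword * signals.keyword_rank
        + w_cluster * signals.cluster_rank
    )


def rank_by_risk(documents, weights=(1.0, 1.0, 1.0)):
    """Return documents ordered from highest to lowest combined score."""
    return sorted(documents, key=lambda d: risk_score(d, weights), reverse=True)


# Hypothetical sample data, for illustration only.
sample_documents = [
    DocumentSignals("mail-001", sentiment_rank=1, keyword_rank=1, cluster_rank=2),
    DocumentSignals("mail-002", sentiment_rank=3, keyword_rank=3, cluster_rank=2),
    DocumentSignals("mail-003", sentiment_rank=2, keyword_rank=1, cluster_rank=1),
]
ordered_documents = rank_by_risk(sample_documents)
```

Keeping the weights as a parameter makes it easy to emphasize one signal, for example keyword hits, without changing how the individual ranks are produced.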
In order to build the training data set, we selected a sample set of emails. These emails were jointly evaluated by the forensic team, who assigned each a risk rating. This labeled sample was used to train a machine learning model to identify risky emails in the remaining email data. We used a Multinomial Naive Bayes classifier, and its output was also factored into the risk score.
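To make the classification step concrete, here is a minimal, dependency-free sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is not the project's code: the tokenizer, the two labels ("high", "low"), the smoothing value, and the four training sentences are all invented for illustration, and a production pipeline would more likely rely on an established library implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class MultinomialNaiveBayes:
    """Bag-of-words multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; 1.0 is an assumed default

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_words = {
            label: sum(self.word_counts[label].values())
            for label in self.class_counts
        }
        self.n_docs = len(labels)
        return self

    def log_scores(self, text):
        """Unnormalized log posterior per class; unseen tokens are ignored."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        vocab_size = len(self.vocab)
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / self.n_docs)  # log prior
            denominator = self.total_words[label] + self.alpha * vocab_size
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training sample standing in for the emails rated by reviewers.
train_texts = [
    "please delete the invoice before the audit",
    "wire the payment to the offshore account",
    "team lunch is on friday",
    "meeting notes attached for review",
]
train_labels = ["high", "high", "low", "low"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("delete the offshore invoice")
```

The per-class log scores from log_scores could be converted into a rank and fed into a combined risk score alongside the other signals, which is one way to read "considered in the risk score".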
The emails and texts identified as high risk were passed to the forensic team for manual evaluation. Based on the team's feedback, the model was retrained with the additional labeled data.
In addition to the above, we used Microsoft Cognitive Services along with Alteryx to clean and transform the data.
The final results were presented in Tableau, and the operational data was presented in Power BI.
After the ML model was deployed, the forensic team was able to identify more critical transactions.