Bot or Not: Encrypted Traffic Analysis to Drive Security Operations
Encryption of network traffic has enabled us to transact business online, communicate securely with our colleagues, friends and family and, especially in a COVID world, stay productive while sheltering in place at home. However, the pervasive use of encryption hampers traditional network security approaches that rely on analyzing the raw cleartext data stream. Encrypted traffic analysis can leverage the power of artificial intelligence to eliminate some of these blind spots. In this post, we discuss an approach we have found to be very effective.
Google popularized reCAPTCHA, which detects whether a web request originated from a human or a bot. The same distinction is useful for security analysts looking at a network session. In a network context, the “bot” behind a session is typically an application, and possibly a malicious one. In either case, having this context is useful to an analyst looking to make a risk management decision. For instance, a human connecting to Twitter might be completely acceptable use, whereas an application repeatedly connecting to Twitter might be an indicator of command and control.
However, making this determination is getting harder as more and more network traffic is encrypted. One way around this issue is to decrypt the traffic with key sharing or man-in-the-middle (MITM) techniques and analyze the decrypted data. These kinds of solutions are expensive to manage, interfere with many applications, and add complexity to deployments, especially in a TLS 1.3 world. Importantly, they might also violate the expectations of privacy that modern users have on enterprise networks and put the organization at odds with privacy laws and regulations.
At Awake, we have invested heavily in extracting and analyzing the metadata of encrypted traffic without actually decrypting it. As an example, in this blog we describe how we analyze SSH and RDP traffic to surface security-relevant activities, such as keystrokes from a human, multiple login failures, or large file transfers.
As with our earlier work on TLS, the key features that allowed us to detect those activities are the session initialization parameters and the encrypted packet metadata. We wrote custom parsers to extract the session initialization parameters and to collect metadata about the encrypted packets, such as their size and the timing between consecutive packets.
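To make the packet metadata concrete, here is a minimal sketch of the kind of per-packet features described above: size and the gap since the previous packet. It is an illustration under stated assumptions, not our production parser. The `Packet` type, the `packet_metadata` function and the sample values are all hypothetical, and the sketch assumes packets have already been captured and reduced to a timestamp, an encrypted payload length and a direction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    """Hypothetical record for one encrypted packet (no payload is inspected)."""
    timestamp: float  # seconds since the start of the capture
    size: int         # encrypted payload length in bytes
    outbound: bool    # True if the client sent the packet


def packet_metadata(packets: List[Packet]) -> List[Dict[str, object]]:
    """Return size, direction and inter-arrival gap for each packet."""
    rows = []
    prev_ts = None
    # Sort by time so the gap is always measured against the preceding packet.
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        gap = 0.0 if prev_ts is None else pkt.timestamp - prev_ts
        rows.append({"size": pkt.size, "outbound": pkt.outbound, "gap": gap})
        prev_ts = pkt.timestamp
    return rows


# Made-up session: three small packets, loosely resembling interactive typing.
session = [
    Packet(0.00, 36, True),
    Packet(0.25, 36, False),
    Packet(0.75, 36, True),
]
features = packet_metadata(session)
```

Each row carries only what is visible on the wire without decryption, which is what keeps this kind of analysis passive.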
We collected training data for these protocols and trained models using various ML techniques. One interesting observation was that deep learning methods did well at detecting the activities, but they were too slow to run at wire speed. Instead, an ensemble of classical ML techniques such as support vector machines (SVM) and random forests gave the same accuracy while running an order of magnitude faster.
To match the accuracy of deep learning using only classical ML techniques, we also had to perform some feature engineering. For example, the models performed much better if we used a moving window to combine nearby packets instead of considering them separately.
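The moving-window idea can be sketched as follows. This is only an illustration: the window width, the choice of aggregates (mean size, maximum size, mean gap) and the `window_features` name are assumptions made for the example, not the features our models actually use.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def window_features(
    sizes: Sequence[int],
    gaps: Sequence[float],
    width: int = 3,
) -> List[Tuple[float, int, float]]:
    """Slide a window over consecutive packets and summarize each window.

    Returns one (mean size, max size, mean gap) tuple per window position.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    if len(sizes) != len(gaps):
        raise ValueError("sizes and gaps must have the same length")
    out = []
    # A session shorter than the window yields no rows.
    for start in range(len(sizes) - width + 1):
        s = sizes[start:start + width]
        g = gaps[start:start + width]
        out.append((mean(s), max(s), mean(g)))
    return out


# Made-up session: two small packets followed by two near-MTU packets.
sizes = [36, 36, 1400, 1400]
gaps = [0.0, 0.25, 0.5, 0.75]
windows = window_features(sizes, gaps, width=2)
```

With these inputs, the second window straddles the change from small to large packets. A transition like that is visible once nearby packets are combined, but not when each packet is considered on its own.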
Let us take a look at an example. The screenshot below shows an RDP session. The platform has assessed that this session was most likely generated by a human, as there are keystrokes and mouse movements controlling the remote system. This context is tremendously valuable since the analyst can now use it to understand risk: Is this human access normal and expected? Is the user authorized to do it? On the other hand, if an application were performing the access, the analyst could quickly determine whether that behavior was expected. In other words, this approach delivers reCAPTCHA for the network, but is entirely passive.
The analysis above is just a case study of what is achievable through encrypted traffic analysis. We don’t have to be limited to detecting bots behind encrypted network activity. The approach is general enough to distinguish large file downloads from video streaming, detect data exfiltration, and detect other remote administration tools being used maliciously on the network. And all of this without ever decrypting the data. We will discuss these results in future blogs.
Awake Data Science Team