As privacy concerns rise, it comes as no surprise that over half of the 1 million most frequently visited websites worldwide now use HTTPS. You’d be hard-pressed to find a security professional who doesn’t applaud this result, and for good reason: the widespread use of session-layer encryption on the internet has made an indelible impact on user privacy and security for the better.

From the perspective of a security analyst or threat hunter, however, the trend toward encryption defeats existing network traffic analysis techniques and leads to more opacity. During incident response, homing in on traffic sent through an SSH tunnel essentially boils down to hitting a dead end. Attackers know that, so they are encrypting more of their traffic! Defenders are left unable to see inside that tunnel and thus unable to assess whether its contents are innocuous or a threat. Suddenly, a tool that champions security is a nuisance to an analyst who wants transparent answers about the traffic in question.

The state of the art

When confronted with encrypted or unknown traffic, whether that be a TLS session, an SSH tunnel, or simply an unrecognized communication protocol, conventional traffic analysis tools don’t suffice. Deep packet inspection, for all that it can do, won’t cut it when the payload is unreadable. We need to go beyond reading what’s in the clear.

Yet innovation in this space is lacking across the industry. The “best” solutions have focused on decrypting the entire flow. This is a valid approach, but one that is losing favor due to its privacy and policy implications, not to mention TLS 1.3, which mandates forward secrecy and so rules out passive decryption with the server’s private key. The fundamental issue is that network analysis tools continue to lean heavily on the ability to fingerprint traffic using cut-and-dried metadata (think port numbers) that we can read in the clear when a packet comes across the wire. We won’t have this luxury forever, and one might argue we don’t even have it today! So we need to adapt to succeed in our perpetual effort to see into our networks.

Could machine learning help?

The good news…

The security industry has a mountain of wealth in the form of everyone’s favorite 21st-century currency: data. Every frame that goes across the wire is a piece of data with a signal hidden inside it, obscured by noise and encryption. Networks across the world process more of this data every second than one could hope to fathom, producing a nearly infinite stockpile of semi-structured, content-rich data, and there’s information to be gleaned from all of it. Extracting valuable insights from all this data is a problem that should have signal processing experts and machine learning enthusiasts alike chomping at the bit – and at Awake, we have been thrilled to sink our teeth into it.

To extract any real value from all this data, it needs to be coalesced into a palatable form. Awake’s platform provides all the infrastructure we need to do exactly that. An organized, indexed knowledge graph of everything about a network—from topology to metadata to behavior patterns to raw packets, all associated with real-world entities—is a goldmine to a machine learning engineer.

We’re utilizing that infrastructure to take pivotal steps toward AI-driven network insights, using advanced deep learning applications that leverage the combination of institutional knowledge and our signal-rich network data.

But… how?

Wait, skeptics may say, you can’t possibly learn enough from encrypted or unknown traffic to actually leverage it during an investigation… network analysis is dead! Au contraire, my skeptical friend: the information is indeed rich, just not in the way we’re accustomed to. Unknown traffic is unpalatable for payload signature-based detection. I won’t contest that. But what, really, are we doing with signature-based detection? At best, we’re detecting specific activities that may each represent a single piece of the puzzle: an interactive C2 server communication here, a data exfiltration there.

What happens, then, when we want to look for an abstract concept like, say, lateral movement? Is there a payload signature for that? Can we rely on payload signatures to identify something like lateral movement? I’m going to go out on a limb here and say “no”: that’s a hopeless endeavor. We have to take a different approach to this.

To detect something as nebulous as lateral movement, we need to look at more than just individual packets and sessions. We need to expand the scope of our analyses from packets, to sessions, to behaviors across multiple devices and services over time. As we do that, however, what we’re looking for becomes increasingly buried in noise (often in the form of legitimate traffic). Conventional network traffic analysis tools can’t handle this: the problem is too noisy and too non-linear for today’s plug-and-play monitoring applications to prove effective.

So we’ve taken another approach: using state-of-the-art machine learning techniques to tackle these issues. We’re challenging the status quo and saying: what if abstract behaviors like lateral movement can be detected by combining the powers of traditional detection techniques and deep learning? What if we could examine an encrypted tunnel, identify distinct behaviors within the session, and analyze those behaviors to detect malicious use? What if we could tell you that, while we can’t peek into the encrypted data, the session sure looks and smells like an interactive shell? Oh, and, by the way, it’s being controlled by a suspect domain sitting in the cloud.

“What if” no more.
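
To make the interactive-shell idea a little more concrete, here is a minimal Python sketch of the kind of signal that survives encryption: packet sizes, directions, and timing. This is not Awake’s model. The Packet record, the feature names, and every threshold are hypothetical, chosen only for illustration, and the hand-written rule stands in for what would realistically be a learned classifier trained on far richer features.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Packet:
    """Metadata still visible on an encrypted flow (hypothetical record type)."""
    timestamp: float  # seconds since the start of the capture
    size: int         # length of the encrypted payload in bytes
    outbound: bool    # True if the client sent it


def flow_features(packets):
    """Summarize the 'shape' of one session without reading any payload."""
    ordered = sorted(packets, key=lambda p: p.timestamp)
    sent = [p for p in ordered if p.outbound]
    # Gaps between consecutive client packets approximate the client's cadence.
    gaps = [b.timestamp - a.timestamp for a, b in zip(sent, sent[1:])]
    return {
        "median_outbound_size": median(p.size for p in sent) if sent else 0.0,
        "median_outbound_gap_s": median(gaps) if gaps else None,
        "direction_changes": sum(
            a.outbound != b.outbound for a, b in zip(ordered, ordered[1:])
        ),
    }


def looks_interactive(features, max_size=128, min_gap_s=0.05, max_gap_s=5.0):
    """Toy rule: small client packets sent at roughly human typing pace.

    All three thresholds are illustrative guesses, not tuned values.
    """
    gap = features["median_outbound_gap_s"]
    if gap is None:
        return False  # fewer than two client packets: nothing to judge
    return (
        0 < features["median_outbound_size"] <= max_size
        and min_gap_s <= gap <= max_gap_s
    )


if __name__ == "__main__":
    # Synthetic sessions: a keystroke-and-echo exchange versus a bulk upload.
    typing = []
    for i in range(20):
        typing.append(Packet(i * 0.3, 48, True))           # keystroke
        typing.append(Packet(i * 0.3 + 0.02, 48, False))   # server echo
    bulk = [Packet(i * 0.001, 1400, True) for i in range(200)]

    print("typing session:", looks_interactive(flow_features(typing)))
    print("bulk session:  ", looks_interactive(flow_features(bulk)))
```

Neither synthetic session exposes a single byte of plaintext, yet the two look nothing alike once you reduce them to size and timing. A fixed rule like this one is easy to fool and will drown in legitimate traffic, which is exactly the noisy, non-linear territory where a learned model earns its keep.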

By Ian Johnson
Data Science Intern