By David Pearson
For this post, I wanted to introduce readers to some of the challenges and the plight of the SOC analyst as she examines just 10 minutes of regular web browsing traffic. Although this may not sound very sexy, this traffic is a significant portion of activity in an average enterprise network. In this case, 10 minutes from one system equated to approximately 57,000 packets, 1,100 sessions, and 26 protocols. The protocol breakdown, as determined by Wireshark, is shown below:
Interestingly, the largest percentage of traffic is Unlabeled TCP. Given the sheer amount of traffic categorized as such, I performed a secondary, port-based analysis to determine which packets were likely related to HTTP or TLS. Doing so produces the following revised chart:
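A secondary, port-based pass like the one described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the exact procedure used for the chart: the `PORT_LABELS` mapping, the function names, and the sample port pairs are all assumptions, and a port match is a guess about the protocol, not proof of it.

```python
from collections import Counter

# Assumed mapping of well-known ports to a best-guess label.
PORT_LABELS = {80: "HTTP (by port)", 443: "TLS (by port)"}


def label_by_port(src_port, dst_port):
    """Return a best-guess label for a TCP packet that no dissector identified."""
    for port in (dst_port, src_port):
        if port in PORT_LABELS:
            return PORT_LABELS[port]
    return "Unlabeled TCP"


def tally(packets):
    """Count labels for an iterable of (src_port, dst_port) pairs."""
    return Counter(label_by_port(src, dst) for src, dst in packets)


# Hypothetical port pairs, for illustration only.
sample = [(51514, 443), (443, 51514), (51520, 80), (51530, 8443)]
print(tally(sample))
```

Anything that still comes back as "Unlabeled TCP" after this pass is what remains for manual inspection.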
The breakdown above highlights the fact that while TLS is gaining ground, a significant portion of common web browsing is still unencrypted. Additionally, the data points out that it is exceptionally tedious and time-consuming to get a completely accurate picture of what is actually occurring on common TCP ports. Although Wireshark is a great tool, the lion’s share of traffic found communicating with TCP ports 80 and 443 was not directly identified as HTTP or TLS; the amount of manual analysis required to confirm
Stepping back to a higher-level view, I was interested in seeing what kinds of services were active in the background while I perused. Various social media platforms appeared, as did numerous browser plugins that had been installed and mostly forgotten. Asset tracking software kept a steady pulse on my laptop’s whereabouts. I may have merely been reading the news, but so much more was happening.
If only there were a “malicious” flag
Whether you’re new to the profession, or you’ve been at it for years and your first book was the Intel Assembly Language Reference Manual, it is commonplace to encounter “bad-looking” network traffic. Successful attacks, however, often look similar to known good traffic. Given this, most analysts will tell you that a significant part of their job is inspecting supposedly trusted data.
Fortunately, there are no signs of compromise within this ten minutes of traffic, but there are numerous examples of streams that are “bad-looking.” I’m sure analysts remember the first time they encountered a large number of DNS PTR requests for IPs of 1e100.net, as shown in the following screenshot. Such a domain name looks quite suspicious, given its strange mixture of numbers and letters, as well as its lack of readability. However, it is a domain owned by Google to uniquely identify its public-facing servers. (As a side note, 1e100 is scientific notation for 1 googol, or 10^100.)
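When triaging PTR answers like these, one small trap is matching the domain as a plain substring. The sketch below shows a label-boundary check instead. The function name and the sample hostnames are hypothetical, and matching a suffix only tells you what the name claims to be, since PTR records are controlled by whoever owns the IP range.

```python
def is_under_domain(name, domain):
    """True if name equals domain or is a subdomain of it (case-insensitive)."""
    name = name.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return name == domain or name.endswith("." + domain)


# Hypothetical PTR answers, for illustration only.
print(is_under_domain("host-1234.1e100.net.", "1e100.net"))    # subdomain
print(is_under_domain("evil1e100.net", "1e100.net"))           # lookalike
print(is_under_domain("1e100.net.evil.example", "1e100.net"))  # prefix trick
```

The first call returns True and the other two return False, because the comparison requires a dot before the trusted suffix.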
While searching through the numerous HTTP sessions present, I uncovered plenty of traffic to content distribution networks (CDNs). These can be really tricky for several reasons. First, one might expect traffic for a news site to be coming from the actual site that is being contacted, which is often not the case. Second, there are a few large and well-known CDNs, as well as a number of smaller companies. (Whether we should implicitly trust even the well-known networks is a question for a different day.) When encountering a more obscure CDN, how can we tell that it’s legitimate and not simply an attacker taking advantage of our implicit trust? After all, registering myreallycoolcdn.com is no harder than creating any other site.
More troubling to me than active content is the abundance of octet stream (including executable) data. This data is harder to quickly understand and make a judgment call about. Not only is it prevalent in legitimate network traffic everywhere, but by nature it can be so many different kinds of data and be used for so many different functions. Sometimes information about the stream will be available, but oftentimes this is not the case. Examining it is at best tedious and at worst incredibly time-consuming.
One of the most common “bad-looking” situations to uncover in data sets is the use of IP addresses rather than host names as endpoint identifiers in upper-level protocols such as HTTP (for example, an IP address in the Host field of an HTTP request). An IP address immediately requires more effort to analyze than a hostname, since IP address ranges are inherently harder to memorize than a simple name. While this is generally regarded as a bad practice to be avoided, there are plenty of legitimate services that use direct IP address communication. Still, because many malware samples use IP addresses to communicate, analysts pay close attention to this traffic. Fortunately, there is only one unique example in my 10 minutes of data (and an easily identifiable one, since it is a multicast address); however, this pattern occurs frequently in every enterprise we’ve examined, and the analysis is often not trivial.
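A first-pass filter for this pattern can be sketched with Python's standard ipaddress module. The function name and the sample Host values below are assumptions for illustration. The multicast address shown is a well-known one and is not necessarily the address seen in this capture.

```python
import ipaddress


def classify_host_header(host):
    """Return 'hostname', 'ip', or 'multicast-ip' for an HTTP Host value."""
    host = host.strip()
    if host.startswith("["):
        # Bracketed IPv6 literal, optionally followed by :port
        end = host.find("]")
        candidate = host[1:end] if end != -1 else host[1:]
    elif host.count(":") == 1:
        candidate = host.split(":")[0]  # drop the :port suffix
    else:
        candidate = host
    try:
        addr = ipaddress.ip_address(candidate)
    except ValueError:
        return "hostname"
    return "multicast-ip" if addr.is_multicast else "ip"


# Hypothetical Host values, for illustration only.
for value in ["example.com", "192.0.2.10:8080", "239.255.255.250:1900"]:
    print(value, "->", classify_host_header(value))
```

Separating multicast from unicast addresses up front removes the easy cases, so the analyst's time goes to the unicast IPs that actually need a closer look.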
Stepping away from traffic that looks bad, there are several other interesting characteristics of the data set that I’d like to share.
TLSv1.0 Usage Exists
One domain in particular (by a very large multinational company, no less) utilizes TLSv1.0, despite other domains owned by the same company using TLSv1.2. Fortunately, all of the negotiations utilize a cipher suite that is currently known to be secure. However, keeping up with what is or isn’t secure—and whether the risk of an insecure cipher suite is acceptable in your environment—requires cross-referencing multiple sites, keeping up with the latest research in the field, and communicating across various stakeholders. This is time that many analysts simply don’t have.
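For readers who want to spot this without clicking through each handshake, the version a server selects is visible in the first bytes of its ServerHello. The sketch below assumes the record starts at the beginning of the buffer and does not handle fragmentation or validate lengths. The function name and the synthetic bytes are illustrative only.

```python
# (major, minor) version bytes as they appear on the wire.
TLS_VERSIONS = {
    (3, 0): "SSLv3",
    (3, 1): "TLSv1.0",
    (3, 2): "TLSv1.1",
    (3, 3): "TLSv1.2",
}

HANDSHAKE_RECORD = 22  # TLS record content type for handshake messages
SERVER_HELLO = 2       # handshake message type


def server_hello_version(record):
    """Return the version named in a ServerHello, or None if this is not one."""
    if len(record) < 11:
        return None
    if record[0] != HANDSHAKE_RECORD or record[5] != SERVER_HELLO:
        return None
    # Bytes 9-10 hold server_version. TLS 1.3 also writes 3.3 here and
    # signals the real version in an extension, which this sketch ignores.
    major, minor = record[9], record[10]
    return TLS_VERSIONS.get((major, minor), f"unknown ({major}.{minor})")


# Synthetic, truncated ServerHello header (not taken from the capture).
fake = bytes([22, 3, 1, 0, 6, 2, 0, 0, 2, 3, 1])
print(server_hello_version(fake))
```

Knowing the version is only the first step. Judging whether the negotiated cipher suite is acceptable still requires the cross-referencing described above.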
QUIC provides interesting information
Another interesting protocol you might run into is QUIC – Quick UDP Internet Connections. This was created by Google to provide the benefits of TLS without the overhead associated with TCP. We first started seeing traffic utilizing this protocol in early 2016, and have since monitored its adoption and behavior. One interesting aspect of QUIC is that its handshake indicates the client’s OS type, OS version, browser type and version, and the server/service being accessed. Of course, while this is not necessarily highly sensitive information, it is more revealing than what can be learned from a traditional TCP-based TLS negotiation, thereby providing additional knowledge about a user and the device they are using. And since this is not a “traditional” Internet protocol, it’s quite likely that many analysts have not come across it in depth, nor would they know what could be exposed by its use.
Hmm, Spotify had a P2P Network?
While searching across traffic communicating with Spotify’s music streaming service, I discovered a high-numbered UDP port sending several packets each minute to the same UDP port on two broadcast addresses, as well as responses from local devices. Looking more closely, it turns out that each packet had a “SpotUdp” plaintext string in its payload, which piqued my interest. After a brief search, I discovered that until mid-2014, Spotify had a P2P network that a lot of people didn’t seem to know about. While the network was phased out over two years ago, this particular connection still exists, and still clearly has some local subnet P2P communication. If legacy things like this exist but often go undetected, imagine how hard it is for junior analysts trying to hunt and discern what is or isn’t legitimate!
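A plaintext marker like this one is easy to hunt for once you know it exists. As a rough sketch, assuming the UDP payloads have already been extracted from the capture into tuples, the search could look like the following. The tuple layout, function name, addresses, port number, and surrounding payload bytes are all made up for illustration.

```python
from collections import Counter

MARKER = b"SpotUdp"  # plaintext string observed in the payloads


def flows_with_marker(datagrams, marker=MARKER):
    """Count datagrams per (src, dst, dst_port) whose payload contains marker."""
    hits = Counter()
    for src, dst, dst_port, payload in datagrams:
        if marker in payload:
            hits[(src, dst, dst_port)] += 1
    return hits


# Hypothetical datagrams: (source, destination, destination port, payload).
sample = [
    ("192.0.2.15", "192.0.2.255", 50000, b"SpotUdp\x00\x01"),
    ("192.0.2.15", "255.255.255.255", 50000, b"SpotUdp\x00\x01"),
    ("192.0.2.15", "192.0.2.1", 53, b"\x12\x34\x01\x00"),
]
for flow, count in flows_with_marker(sample).items():
    print(flow, count)
```

The hard part, of course, is not the search itself but knowing that the string is worth searching for.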
TCP Errors Always Exist
Of the roughly 53,000 TCP packets within the data set, there are a few hundred error packets. This is less than 1%, but that small number of packets contained quite a few distinct errors: TCP ACK for an unseen segment, Duplicate ACK, Out-Of-Order packet arrival, Previous segment not captured, Spurious Retransmission, and Zero Window size. Error packets are often ignored when doing routine analysis; how many times might an attack have been discovered earlier if this source of information could be more coherently analyzed? As a simple example, seeing duplicate ACK packets could mean network congestion and loss, or it could mean that an attacker is attempting to intercept communication between two hosts.
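One modest step toward analyzing these packets more coherently is simply to tally them by type instead of skipping past them. The sketch below assumes the error labels have already been exported as a list of strings. The function name and the sample counts are hypothetical and are not the figures from this capture.

```python
from collections import Counter


def summarize_flags(flags, total_packets):
    """Return (per-label counts, flagged share of all packets)."""
    counts = Counter(flags)
    flagged = sum(counts.values())
    share = flagged / total_packets if total_packets else 0.0
    return counts, share


# Hypothetical labels and totals, for illustration only.
labels = ["Duplicate ACK"] * 3 + ["Zero Window"]
counts, share = summarize_flags(labels, total_packets=1000)
for label, n in counts.most_common():
    print(f"{label}: {n}")
print(f"flagged share: {share:.2%}")
```

A per-type breakdown like this makes it easier to notice when one kind of error is suddenly overrepresented for a given host or session.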
Multicast is Noisy
During these ten minutes of traffic on a private company wireless network, there are a lot of devices “informing” the network of their presence. Since it’s all multicast traffic, I was able to see my coworkers’ computers, phones, and some applications, as well as smart TVs, printers, and scanners. Simply by passively listening on a secured wireless network, I was able to do quite a bit of reconnaissance that otherwise would be difficult and time-consuming.
Why Does it all Matter?
My reason for performing this exercise was first and foremost to show how much one can learn by simply analyzing benign activity on a network link. I figured it would take me a few hours to deeply analyze the traffic and write this post, but I was sorely mistaken: it actually took me more than a full day. If just ten minutes of traffic from one device (at a throughput rate of ~600 kbps) in a relatively common scenario required that much effort, imagine if the network instead had thousands of users going about their daily routines. How hard would it be to discover what the bad things are? How would one identify hygiene issues (that can easily become bad things)? And how can one identify and exclude what is not actually a problem? How does one begin or make progress with proactive threat hunting? Clearly, even an effort as simple as trying to string together network events that occur across the span of a few seconds—when found within a network of any appreciable size—becomes daunting and difficult to correlate.