By Keith Amidon
As described in more detail previously in this blog, at Awake Security™ our vision for the Security Knowledge Graph™ involves integrating data from many sources to extract, infer, and present in the most usable form the knowledge a security analyst needs to successfully protect their organization. We recognize that fully realizing that vision will take time, and so in version 1.0 of our solution we’ve focused on raw packets captured from the network. “Why?” is the natural question and this post provides the answer.
Fundamentally, there are four unique properties of network data that together show why it can provide the solid foundation on which a reliable Security Knowledge Graph can be built:
- (Nearly) everything operates on the network
- Network data is easily accessible
- Network data is detailed and reliably labeled
- Network data implicitly documents relationships
Let’s examine each of these in more detail.
(Nearly) Everything Operates on the Network
Almost every task for which computing devices are deployed in support of an organization is accomplished more quickly and efficiently if those devices are networked with each other and, through the Internet, with the rest of the world. This is true of both devices that are actively controlled by people (laptops and desktops in the campus network) and devices that operate mostly autonomously (servers, control systems, etc). From the perspective of information security, it is quite safe to assume that (nearly) all devices that are important to the work of an organization operate on the network, and therefore the Security Knowledge Graph built from that data will contain comprehensive information specific to the organization’s work and threats to that work.
Network Data is Easily Accessible and Comprehensible
The three major sources of data relevant to security investigations are device data reported by agents, device data reported by remote logging, and network data itself. The device data includes not only that from endpoints and servers but also network element devices like firewalls and infrastructure services like Active Directory. Today, of these three sources of data, network data requires the least deployment and maintenance effort and provides the most information.
From a deployment and maintenance perspective, typical network architectures aggregate traffic from many devices at a few common transit points such as the public Internet boundary and the campus/datacenter edge. These transit points are relatively static, changing infrequently. Network monitoring deployed at these transit points can provide a wealth of information about the activity of devices in the environment, as communication across these transit points is almost universally required to complete the tasks for which the devices are deployed. Such an architecture requires correctly deploying and maintaining software and/or configuration at several orders of magnitude fewer locations than device-based solutions do, with a corresponding reduction in effort.
Furthermore, as BYOD and IoT adoption has accelerated, it has become less and less possible to manage all the devices in the network even if the resources to do so are available. In most networks today, there are many connected devices on which it is impossible for the organization to deploy agents and/or remote logging configuration.
History has also shown that attackers actively attempt to circumvent host agents and logging, so another problem with host-reported data is that it may be lying. Even if host data is accurate, and security policy mandates that only approved devices may be connected, and this is well-enforced, it is still necessary to ensure unmanaged devices have not been inserted into the network by an attacker. The only direct source of information about unmanaged devices, regardless of their policy compliance, is the network, and therefore a graph built from any other data source provides an incomplete view.
The most common objections to the accessibility of network data are:
- Cloud applications that use non-standard and application-specific protocols over HTTP;
- Increasing adoption of cryptography in network protocols.
There is no denying that these are real trends that will, if they continue on their current trajectory, reduce the accessibility of data from captured network traffic. However, based on practical, real-world experience, we believe that at the current time the network is still the highest-leverage point from which to build the Security Knowledge Graph. There are two primary reasons for this.
First, while these trends have clearly been growing for many years, they have grown slowly, and the power of this data set is obvious when we examine data from early deployments of our product. As of the time this was posted, standardized protocols such as unencrypted HTTP, SMB, and DNS still account for approximately 62 percent of all traffic. Conversely, HTTPS and related encrypted protocols account for only approximately 19 percent. And even with this traffic, we have found that the metadata accessible in the majority of these protocols is still useful for answering the most critical analyst questions, such as the identity and notability of the devices and services involved.
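To make the arithmetic concrete, here is a toy sketch of how such a protocol mix might be computed from flow metadata. The byte counts are purely illustrative placeholders, not our measured data:

```python
# Hypothetical per-protocol traffic totals (GB) from flow metadata.
# These numbers are illustrative only, chosen to echo the rough split above.
traffic = {"http": 40, "smb": 12, "dns": 10, "https": 19, "other": 19}

total = sum(traffic.values())
cleartext = traffic["http"] + traffic["smb"] + traffic["dns"]

print(f"cleartext share: {100 * cleartext / total:.0f}%")
print(f"encrypted share: {100 * traffic['https'] / total:.0f}%")
```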
Second, assuming these trends continue and eventually obscure from packet-capture-based systems the detailed data needed in the knowledge graph, that data will still be accessible from other infrastructure systems or the end hosts themselves. Even in that case, we believe the analysis of communication patterns on the network will provide the best foundation, because it is the least falsifiable record of activity and relationships.
At Awake, we have always known that a network packet capture and analytics system would never, on its own, provide all the data an analyst would like in the knowledge graph. We therefore already integrate data from other sources. Starting with capture data gives us easy access to the best data available today and the framework we’ll need to address the challenges of tomorrow.
For all these reasons, we believe that the statement “Network data is easily accessible” is true today and will remain true for a long time to come.
Network Data is Detailed & Reliably Labeled
Modern networks are heterogeneous environments in which devices from many suppliers, deployed and upgraded independently and at widely varying times, must cooperate to accomplish their tasks. To do so, they must exchange data in mutually understandable formats that evolve slowly and with attention paid to backward compatibility. In the language of a network engineer, these are the “protocols” that devices use to exchange data. To ensure devices interoperate successfully, these protocols are defined in gory detail, with every bit of information included in them fully described, from the meaning of each item to the order in which the bits of its binary representation are transmitted. This means that the data on the network is very explicitly defined and labeled (by the protocol) and changes slowly (due to the effort required to update or define new protocols).
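As a concrete illustration of this bit-level precision, consider the DNS header, which RFC 1035 fixes at exactly 12 bytes with every field's width and position specified. A minimal sketch of parsing it (the sample bytes here are fabricated for the example):

```python
import struct

def parse_dns_header(data: bytes) -> dict:
    # RFC 1035 defines the DNS header as six 16-bit fields in
    # network (big-endian) byte order.
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", data[:12])
    return {
        "id": ident,
        "qr": (flags >> 15) & 0x1,     # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,
        "rcode": flags & 0xF,
        "questions": qd,
        "answers": an,
    }

# A fabricated response header: id=0x1234, QR set, RD+RA set,
# 1 question, 2 answer records.
header = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 2, 0, 0)
parsed = parse_dns_header(header)
print(parsed)
```

Because the protocol, not any one vendor, dictates this layout, the same parser works on DNS traffic from any device on the network, and will keep working across application upgrades.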
This is in sharp contrast to the two other sources of security data, host agents and logs. Host agents typically extract data from files or the memory of the device. File formats change much more frequently and are typically more poorly documented (if at all) than protocols, since they often only have to satisfy the proprietary needs of a few versions of a single application from a single supplier. In-memory information takes this volatility to the extreme, being tied to implementation decisions in applications that frequently change from version to version.
Log data falls somewhere in between. Structured logging is not widely deployed, so log data is typically not labeled at all. It is also unique to individual applications and tied to specific versions, as developers modify logging to reflect changing implementation choices and/or their evolving understanding of the problems the logs are written to help troubleshoot. These characteristics make it extremely challenging to build a broad-based solution that can look at different systems and perform analytics across all of them.
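The fragility is easy to demonstrate. In the sketch below, two hypothetical log lines record the same event as emitted by two releases of the same (imaginary) application; a parser written against the first layout silently stops matching after the upgrade:

```python
import re

# Two hypothetical releases of the same application logging the same event:
v1 = "2017-06-01 12:00:00 user=alice action=login status=ok"
v2 = "2017-09-01 12:00:00 login succeeded for alice (session 42)"

# A parser written against v1's free-form layout...
pattern = re.compile(r"user=(\w+) action=(\w+) status=(\w+)")

match_v1 = pattern.search(v1)  # matches as intended
match_v2 = pattern.search(v2)  # returns None: the parser breaks silently
```

No analogous breakage occurs with protocol data, because a format change there requires revising a specification that many independent implementations depend on.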
Since, as discussed earlier, (nearly) all devices operate on the network, much of the information required in the security investigation process is highly likely to be available on the network at some point. And because that information is well labeled by the protocol with which it is exchanged, it is easy to find reliably. The Security Knowledge Graph built on this data will be much more maintainable. All it takes is a product engineered to mine that information fully, but that is a topic for a future post.
Network Data Implicitly Documents Relationships
At least as important as the entity information in the Security Knowledge Graph are the relationships between entities. These relationships are often clearly visible in data on the network. As a trivial example, since network communication always involves two (or more) entities, it implicitly documents “communicates with” relationships. Many other protocol exchanges document additional relationships. For example, successful domain authentication requests document “is a user of device” relationships, file transfers document “has this data” relationships, and protocol activity can document “provides this service” relationships. Thus, a knowledge graph built on network data will contain these rich relationships and the timelines by which they were established. This is a deep topic to which we’ll return in more detail in future posts.
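A minimal sketch of how such relationship tuples, once extracted from protocol exchanges, might accumulate into a queryable graph. The entity names and relationship labels below are illustrative inventions, not our product's actual schema:

```python
from collections import defaultdict

# Hypothetical (source entity, relationship, target entity) triples,
# each derived from an observed protocol exchange.
observations = [
    ("10.0.0.5", "communicates_with", "10.0.0.9"),
    ("alice",    "is_user_of",        "10.0.0.5"),
    ("10.0.0.9", "provides_service",  "smb"),
    ("10.0.0.5", "communicates_with", "10.0.0.7"),
]

# A minimal adjacency-list graph keyed by (entity, relationship).
graph = defaultdict(set)
for src, rel, dst in observations:
    graph[(src, rel)].add(dst)

# Query: everything 10.0.0.5 has been seen communicating with.
peers = sorted(graph[("10.0.0.5", "communicates_with")])
print(peers)
```

In practice each edge would also carry timestamps, so the graph records not just that a relationship exists but when it was established.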
Properly leveraged, these unique properties of network data allow the rapid assembly of an extensive Security Knowledge Graph, covering all the most important entities the security analyst encounters during investigation. Thus, even in isolation, the network-based Security Knowledge Graph provides tremendous value in the investigative process by providing access to information that was previously difficult or impossible to obtain.
As exciting as it is to have those capabilities now, we’re equally excited about the potential to use this network-built knowledge graph as the foundation (ground truth) on which less reliable and extensive data sources (such as agents and logs) can be reconciled and integrated. We’ve only begun to explore the possibilities, but early indications are that there is a “network effect” in the reliable integration of additional data sources that will rapidly increase investigative power. We hope you’ll join us on the journey.