
Improving analyst productivity with the Security Knowledge Graph

In the blog post introducing Awake, Michael mentioned how observing analysts in the real world influenced our product design. An important example is the Security Knowledge Graph™, which addresses a vexing problem we saw again and again in security operations centers (SOCs).

The Problem

The investigative process in the SOC today is manual, repetitive and error-prone, and much of this work is driven by the analyst’s need to develop context around an alert to determine if it is real or not.

Consider a simple example: investigating an alert that indicates suspicious behavior from a device in the enterprise network often starts with the IP address of the system in question. From here, the analyst might use the SIEM to find all events associated with this IP address before the alert was generated. But, given the dynamic use of addressing in modern networks, this may miss events from times when the device in question was using another IP, and—perhaps worse—may pick up irrelevant data from unrelated systems that also used that address.

So, to get accurate, complete results, the analyst actually must first identify the device and track where it has been on the network, using DHCP logs or a device management system to extract the hostname and a history of its IP addresses. Then the analyst can query the SIEM or other data sources, using the hostname for data sources that capture that, and IP addresses and time ranges for ones that don’t. Typically the data that then emerges is enormous, so the analyst needs to find a way to summarize the events to a manageable level.
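To make the manual step above concrete, here is a minimal Python sketch of resolving an alert's IP address to a device and then recovering that device's address history. The `Lease` schema, the function names and the sample hostnames are assumptions made for illustration; real DHCP logs and device management systems will look different.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Lease:
    """One DHCP lease: `hostname` held `ip` during [start, end). Hypothetical schema."""
    hostname: str
    ip: str
    start: datetime
    end: datetime


def device_for(leases: list[Lease], ip: str, when: datetime) -> Optional[str]:
    """Return the hostname that held `ip` at time `when`, or None if no lease covers it."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when < lease.end:
            return lease.hostname
    return None


def address_history(leases: list[Lease], hostname: str) -> list[tuple[str, datetime, datetime]]:
    """Return the (ip, start, end) windows for one device, oldest first."""
    windows = [(l.ip, l.start, l.end) for l in leases if l.hostname == hostname]
    return sorted(windows, key=lambda w: w[1])


# Made-up sample data: the same IP belongs to two different devices on two days.
leases = [
    Lease("laptop-17", "10.0.0.5", datetime(2017, 7, 1, 8), datetime(2017, 7, 1, 18)),
    Lease("printer-2", "10.0.0.5", datetime(2017, 7, 2, 8), datetime(2017, 7, 2, 18)),
    Lease("laptop-17", "10.0.0.9", datetime(2017, 7, 2, 8), datetime(2017, 7, 2, 18)),
]

# The alert names 10.0.0.5 at noon on July 1; find the device, then its full history.
alert_host = device_for(leases, "10.0.0.5", datetime(2017, 7, 1, 12))
history = address_history(leases, alert_host)
```

Each window in `history` then becomes an IP-plus-time-range filter for data sources that do not record hostnames, which is exactly the bookkeeping the analyst otherwise does by hand.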

But hold on: that’s often not the end of it. Next, the analyst may identify the person using the device to determine their role and department, whether they were part of a targeted attack, whether they were off duty at the time, and whether other devices used by that person were also affected even though those devices generated no alert. If the alert concerns an anomalous connection to external IPs, the analyst uses third-party tools or intelligence to assess the reputation of the external IP address. And so on.

Over the last two years, as we worked with more than a dozen SOCs, we saw this process frequently involve 30+ tools and consume many hours, only to end inconclusively. To make matters worse, if another alert is generated for the same device after a couple of days (usually on another IP address by then), everything starts over from the beginning.

Needless to say, this is tiring and unproductive and takes a toll on the analyst.

This is the fundamental challenge analysts face: while the analyst operates with a mental model that includes devices, people and organizations, the tools she is saddled with operate on IP addresses, network sessions and protocols. Matching those two models is left to the analyst and the time this consumes means that many threats go uninvestigated.

Existing solutions fail because they neglect to capture these real-world entities explicitly in their data model, instead requiring analysts to piece together information about entities at query time from low-level data. The problems with this approach are many:

  • it is hard to formulate the right queries
  • the process has to be repeated again and again
  • the queries themselves can be very slow to run, impairing productivity, since they often require self-joins on huge tables

The Security Knowledge Graph™

From the start, we were convinced that the right approach is to support the analyst by having the system itself identify and track the entities that match the analyst’s mental model, even before a query is made. Then, the analyst can query the system directly about the entity of interest and get results instantaneously that aggregate information gathered from days, weeks or months of observation, with no need to piece data together manually. Think of this as similar to instantly looking up the balance on your bank account, versus having to compute it each time by tallying the transactions that have been performed on it since you opened the account.

We call this capability the Awake Security Knowledge Graph™ data model.

To build the Security Knowledge Graph, we had to decide what kinds of entities the system should model to make the analyst’s life easiest. As the above discussion makes clear, information on an internal IP address as such is typically not helpful, since it mixes events and attributes of multiple devices using the address at different times. Therefore, we chose a “device” (that is, a communicating endpoint, which might be a server, client, IoT or BYO device) as a foundational entity type. A device may have different IP addresses over time.

To determine the full set of entities to architect Awake’s advanced security analytics around, we combined feedback from our early security team partners with our own deep in-house investigative expertise. In the end, we produced a model that can be summarized in the following diagram:

![security knowledge graph](/wp-content/uploads/2017/07/SKG.png)

In addition to devices, analysts need to reason about the *person* who uses a device (who may have multiple usernames and credentials), *internal organizational entities* that person is a member of, and *external organizational entities* they interact with. People work with *data*, which may appear as a file transfer or attachment, and a person may interact with a given piece of data. As should be clear, our model also captures the relationships among these real-world entities.
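One way to picture these entity types and their relationships is as a handful of linked records. The sketch below is purely illustrative: the class names, fields and sample values are assumptions, not Awake's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    name: str
    internal: bool  # True for an internal department, False for an external entity


@dataclass
class Person:
    name: str
    usernames: set[str] = field(default_factory=set)  # one person, many credentials
    member_of: list[Organization] = field(default_factory=list)


@dataclass
class Device:
    device_id: str
    kind: str  # e.g. "server", "client", "iot", "byo"
    ip_history: list[str] = field(default_factory=list)  # addresses change over time
    users: list[Person] = field(default_factory=list)
    contacted: list[Organization] = field(default_factory=list)


@dataclass
class DataItem:
    name: str  # e.g. a transferred file or an email attachment
    touched_by: list[Person] = field(default_factory=list)


def devices_of(person: Person, devices: list[Device]) -> list[Device]:
    """Follow the person-to-device relationship: every device this person uses."""
    return [d for d in devices if person in d.users]


# Made-up sample graph.
finance = Organization("Finance", internal=True)
alice = Person("Alice", {"alice", "a.smith"}, [finance])
laptop = Device("dev-1", "client", ["10.0.0.5", "10.0.0.9"], [alice])
phone = Device("dev-2", "byo", ["10.0.1.20"], [alice])
kiosk = Device("dev-3", "client", ["10.0.2.7"])

related = devices_of(alice, [laptop, phone, kiosk])
```

With relationships stored explicitly, a question such as "which other devices does this person use?" becomes a single traversal rather than a fresh round of log searches.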

Using the Security Knowledge Graph

The results of the Security Knowledge Graph approach exceeded even our own expectations. In our early deployments, analysts have been able to investigate alerts ten times faster than they could with previous tools, and get more conclusive results. Importantly, the benefits are not restricted to just investigations either.

The impact of this entity data model on the hunting process is even more dramatic, both in terms of time and the quality of the hunt. For instance, finding all devices that are running Windows 7 with a particular patch version is trivial, because the data model has already summarized the OS running on each device from the large number of indicators present in network data. This works purely through passive network observation, without requiring agents or log data. You can also take the complexity of the query up a notch: it is just as easy to find, in seconds, all such devices that have also connected to a given external domain.
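As a rough sketch of why such a hunt becomes simple once attributes are summarized per device, consider the following Python. It is not Awake's query language; the `DeviceSummary` fields, the `hunt` function, the patch labels and the domains are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeviceSummary:
    """Per-device attributes already summarized by the data model (hypothetical fields)."""
    device_id: str
    os: str
    patch: str
    domains: set[str] = field(default_factory=set)  # external domains contacted


def hunt(devices: list[DeviceSummary], os: str, patch: str,
         domain: Optional[str] = None) -> list[str]:
    """Return ids of devices matching the OS and patch, optionally narrowed by domain."""
    return [
        d.device_id
        for d in devices
        if d.os == os and d.patch == patch and (domain is None or domain in d.domains)
    ]


# Made-up sample summaries.
devices = [
    DeviceSummary("dev-1", "Windows 7", "SP1", {"updates.example", "bad.example"}),
    DeviceSummary("dev-2", "Windows 7", "SP1", {"updates.example"}),
    DeviceSummary("dev-3", "Windows 10", "1703", {"bad.example"}),
    DeviceSummary("dev-4", "Windows 7", "RTM", {"bad.example"}),
]

win7_sp1 = hunt(devices, "Windows 7", "SP1")
win7_sp1_suspicious = hunt(devices, "Windows 7", "SP1", domain="bad.example")
```

Both questions are plain filters over entity attributes; neither needs to revisit the underlying events.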

Having this information is one thing, but we also recognized that the utility of the Security Knowledge Graph would depend on the speed and ease with which analysts could extract answers from it. So, to enable our security analytics to execute queries like this, and even more complex ones, in seconds, we introduced the notion of pre-correlation: correlating events with their associated entities at ingestion time. Most solutions on the market cannot do this effectively because of the sheer volume of data (more on this later), and hence they lack the interactivity needed for an effective hunting process.
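The idea of pre-correlation can be illustrated with a toy Python example. This is a simplified sketch under stated assumptions, not a description of Awake's implementation: it assumes events arrive in time order, that DHCP events reveal which device currently holds an IP address, and that event dictionaries use the hypothetical keys shown.

```python
from collections import defaultdict


class PreCorrelator:
    """Attach each event to its device as it arrives, so queries never rescan raw events."""

    def __init__(self) -> None:
        self.ip_owner: dict[str, str] = {}  # ip -> device currently holding it
        self.domains: dict[str, set[str]] = defaultdict(set)  # device -> domains seen
        self.event_count: dict[str, int] = defaultdict(int)  # device -> flows seen

    def ingest(self, event: dict) -> None:
        if event["type"] == "dhcp":
            # Address assignment: later traffic from this IP belongs to this device.
            self.ip_owner[event["ip"]] = event["device"]
        elif event["type"] == "flow":
            device = self.ip_owner.get(event["src_ip"])
            if device is None:
                return  # owner unknown; this toy simply drops the event
            self.domains[device].add(event["domain"])
            self.event_count[device] += 1

    def devices_contacting(self, domain: str) -> list[str]:
        """Answer from the per-device summaries only."""
        return sorted(d for d, seen in self.domains.items() if domain in seen)


# Made-up event stream: 10.0.0.5 moves from laptop-17 to printer-2 partway through.
events = [
    {"type": "dhcp", "ip": "10.0.0.5", "device": "laptop-17"},
    {"type": "flow", "src_ip": "10.0.0.5", "domain": "bad.example"},
    {"type": "dhcp", "ip": "10.0.0.5", "device": "printer-2"},
    {"type": "dhcp", "ip": "10.0.0.9", "device": "laptop-17"},
    {"type": "flow", "src_ip": "10.0.0.5", "domain": "vendor.example"},
    {"type": "flow", "src_ip": "10.0.0.9", "domain": "bad.example"},
]

graph = PreCorrelator()
for e in events:
    graph.ingest(e)

hits = graph.devices_contacting("bad.example")
```

Because attribution happens once at ingestion, the reassigned IP address never causes the printer's traffic to be mistaken for the laptop's, and the query itself touches only the small per-device summaries.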

These examples show the power of focusing on the meaningful entities before events, tying those entities to the events, and allowing that data to be queried interactively. We have more examples on the Awake Solutions and Why Awake pages on our website.

What we describe above might sound easy but, as the saying goes, if it were easy everyone would do it. In fact, this involved some serious computer science challenges that we will describe in a follow-on blog post. Warning: not for the faint-hearted. Some algorithms were harmed in the making of part II.

Debabrata Dash

Co-Founder & Chief Data Scientist