
Solving the Hard Problems to build the Security Knowledge Graph

In an earlier blog post, we described our Security Knowledge Graph™ data model and the benefits it provides to security teams. In this post we turn our attention to some of the “really hard” problems we had to solve to turn that vision into software reality.

In creating Awake Security™, we wanted to solve a fundamental problem: enable analysts to answer questions without worrying about where the data that forms the answer comes from or its format. We looked at the approaches of existing security vendors and found that no one was actually solving this particular problem. Most security technologies out there either chose to produce more data (raise more alerts) or deliver an aggregation of data sources into a single console—the proverbial larger haystack. Even those technologies that focused on analytics were only trying to make their underlying voluminous data source more consumable. Rather than starting at this end of the spectrum—i.e., the data—we chose to start with the analyst questions. That in turn led us to conceive of the Security Knowledge Graph as the foundation of our advanced security analytics solution.

Knowledge graphs have been used for a few years, primarily for search. For instance, both Google and LinkedIn use knowledge graphs to significantly improve the quality of their search results. However, applying their approaches as-is to security does not work. First, you have far less data at your disposal to look for inconsistencies, and, second, doing this in real-time and passively so as not to impact network uptime is non-trivial.

Suffice it to say, we had to cope with some interesting technical challenges, and what follows is a description of a few we ran into along the way.

Feature Extraction Challenges

As mentioned above, Google and LinkedIn have a lot more data than an enterprise system can ever access or operate on in real-time. However, the data in the enterprise is typically more structured than data on the web as a whole. To take advantage of this structure, we built a proprietary set of high performance parsers to mine the network traffic to its fullest extent, extracting information that has so far been untapped by existing tools. These signals enable us to identify and track entities as well as create rich security profiles for each of them.
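To make the idea of "extracting signals from traffic" concrete, here is a deliberately minimal sketch of parsing one HTTP request into a structured signal record. Awake's production parsers are proprietary and far richer than this; the function name and the fields chosen are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch: turn a raw HTTP request payload into a flat
# "signal" record that entity-tracking analytics can consume.
def extract_http_signals(payload: bytes) -> dict:
    """Parse an HTTP request head into a dictionary of signals."""
    head, _, _ = payload.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if ": " in line:
            key, value = line.split(": ", 1)
            headers[key.lower()] = value
    return {
        "method": method,
        "path": path,
        "http_version": version,
        "host": headers.get("host"),
        # User-Agent is one of many fields useful for device profiling.
        "user_agent": headers.get("user-agent"),
    }

raw = (b"GET /index.html HTTP/1.1\r\n"
       b"Host: example.com\r\n"
       b"User-Agent: Mozilla/5.0\r\n\r\n")
print(extract_http_signals(raw))
```

A real parser set would cover many protocols beyond HTTP and handle fragmentation, pipelining, and malformed traffic, but the output shape—structured records keyed by entity-identifying fields—is the essential idea.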

Data Challenges

To build this system we also had to grapple with the “four V’s of big data” (for those interested, we highly recommend watching the original talk by Michael Stonebraker as well):

  1. Velocity: Network traffic arrives at a rate roughly three orders of magnitude higher than that of the logs in a SIEM. Even after parsing, the signals extracted from the packets are still about 10x the volume of the logs.
  2. Variety: The system must handle a variety of data formats: unstructured packet data and files present on the network, structured signals extracted from the parsers, the Security Knowledge Graph in a graph format, workflow information from the analysts in document format, and so on.
  3. Volume: We store and query 30+TB of packets and 20+TB of extracted signals within a single server. That volume of data overwhelms many well-known “big data” systems available today.
  4. Veracity: While the network is as close to ground truth as a data source can get, it is not perfect. That means the system needs to continuously verify the results of the analytics and correct the modeled entities as new data is ingested.

We set out to find an off-the-shelf or commercial system that could help us solve these problems, and unfortunately we couldn't find one. After a lot of prototyping and debating, we therefore decided to build a custom system using several excellent open source components such as Greenplum, Kafka, and Samza as a foundation.

We built, from the ground up, a multi-model database capable of both storing all these kinds of data at a high ingestion rate and querying them with a simple query language that delivers interactive performance. While that is a mouthful, offering all of that capability in a single system also means we are NOT adding more tools for the analyst to work with.
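The value of a multi-model store is that one query surfaces a unified view across data shapes. The toy sketch below illustrates that idea only: the class, its three in-memory "models," and the record fields are all assumptions made for illustration, not Awake's actual architecture or query language.

```python
# Hypothetical sketch of a multi-model facade: structured signals,
# a graph of entity relationships, and analyst documents, all behind
# one query entry point.
class MultiModelStore:
    def __init__(self):
        self.signals = []      # structured signal records
        self.graph = {}        # entity -> set of related entities
        self.documents = {}    # analyst workflow notes, keyed by entity

    def ingest_signal(self, record: dict):
        """Store the raw record and update the relationship graph."""
        self.signals.append(record)
        src, dst = record["src"], record["dst"]
        self.graph.setdefault(src, set()).add(dst)

    def query(self, entity: str) -> dict:
        """One call returns a unified view across all models."""
        return {
            "signals": [r for r in self.signals
                        if entity in (r["src"], r["dst"])],
            "neighbors": sorted(self.graph.get(entity, ())),
            "notes": self.documents.get(entity, []),
        }

store = MultiModelStore()
store.ingest_signal({"src": "10.0.0.5", "dst": "example.com", "proto": "http"})
print(store.query("10.0.0.5")["neighbors"])
```

A production system would of course persist each model in a purpose-built engine rather than Python data structures; the point is that the analyst sees one interface, not three.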

Analytics Challenges

Of course, getting data into the system is only the first step. Once the data is in the system, analytics still pose two major challenges:

  • constructing the Security Knowledge Graph from the signals extracted out of the network, and
  • analyzing the knowledge graph to provide security-relevant entity characteristics, as starting points for investigations.

It would be hard to do justice to both these challenges here, so we provide a glimpse of the first challenge and will discuss the knowledge graph analytics in a future post.

Let’s consider, for instance, what might appear to be the relatively simple task of tracking a device over multiple IPs. Every time we see activity on the network, we try to infer the device that could be generating the traffic. As we see more traffic our inference becomes stronger. Sometimes, however, we may have to go back and revise our inference as we get evidence of traffic that contradicts our model. We also have to account for devices with multiple OSes (VMs), multiple users (shared systems), multiple IPs (multi-homed systems), proxies/NAT devices, etc.
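The evidence-accumulation-and-revision loop described above can be sketched in a few lines. This is an illustrative toy, not Awake's algorithm: the class, the notion of a "fingerprint," and the integer weights are all assumptions chosen to show how an inference strengthens with corroborating traffic and flips when contradicting evidence outweighs it.

```python
from collections import defaultdict

# Hypothetical sketch: track which device is likely behind each IP,
# and revise the inference as contradicting evidence accumulates.
class DeviceTracker:
    def __init__(self):
        # ip -> device fingerprint -> accumulated evidence
        self.evidence = defaultdict(lambda: defaultdict(int))

    def observe(self, ip: str, fingerprint: str, weight: int = 1):
        """Record one traffic observation attributing `ip` to a device."""
        self.evidence[ip][fingerprint] += weight

    def current_device(self, ip: str):
        """Best current inference: the fingerprint with the most evidence."""
        candidates = self.evidence.get(ip)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

tracker = DeviceTracker()
tracker.observe("10.0.0.7", "windows-laptop-A")
tracker.observe("10.0.0.7", "windows-laptop-A")
# DHCP reassigns the IP; contradicting evidence accumulates and
# eventually flips the inference.
tracker.observe("10.0.0.7", "linux-server-B", weight=3)
print(tracker.current_device("10.0.0.7"))  # linux-server-B
```

The corner cases the post mentions—VMs, shared systems, multi-homed hosts, NAT—are precisely where a naive "one IP, one device" model like this breaks down and richer evidence handling is required.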

Over the last couple of years we have extensively tuned our system both to recognize all these corner cases and to do so efficiently given the volume and velocity of the data. (For more details on the process we use, please consult this whitepaper.)

Other Challenges

There are additional challenges in summarizing the thousands of domains a device can access over its lifetime, or in tracking devices over multiple IPs without the immense operational overhead of collecting logs from every DHCP server. (And how often do you find a network with pristine DHCP anyway?) The graph also needs to incorporate human-generated insights, like hypotheses about suspicious behavior or labels denoting job function, beyond what automated systems can provide. Those insights then need to be factored in as queries are run and analysis continues. In a nutshell, as time progresses, the complexity and volume of data that needs to be analyzed continuously increases.
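Incorporating human insight can be as simple in principle as attaching analyst-supplied labels to graph entities so that later queries see both machine-derived and human-derived context. The sketch below is a hedged illustration of that idea; the entity names, label strings, and schema are all hypothetical, not Awake's actual graph model.

```python
# Hypothetical sketch: analyst annotations layered onto graph entities,
# queryable alongside machine-derived attributes.
graph = {
    "device-42": {"domains": ["a.example", "b.example"], "labels": set()},
    "device-43": {"domains": ["c.example"], "labels": set()},
}

def annotate(entity: str, label: str):
    """Attach a human-generated label (hypothesis, job function, etc.)."""
    graph[entity]["labels"].add(label)

def find_by_label(label: str):
    """Queries can now filter on analyst insight, not just traffic."""
    return [e for e, attrs in graph.items() if label in attrs["labels"]]

annotate("device-42", "finance-team")
annotate("device-42", "suspected-beaconing")
print(find_by_label("suspected-beaconing"))  # ['device-42']
```

The hard part in practice is not storage but keeping such annotations consistent as the automated analytics revise the underlying entities beneath them.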

The Journey Continues

Needless to say, we could keep going on about the challenges, but this seems enough for one blog post. Solving these tough problems, so that organizations and their security analysts don't have to, is what brought us together to start Awake Security and is what continues to motivate us every day. This is an ongoing journey with lots of tricky computer science problems ahead of us. We look forward to sharing more details of our solutions over time on this blog.



Debabrata Dash

Co-Founder & Chief Data Scientist