By Manasa Chalasani
Designing the End-to-End Investigation Experience
Over the past two years, Awake partnered with dozens of security teams across Fortune 500 / Global 2000 organizations for in-depth research and feedback. We are grateful to have these design partners who opened their SOC for interviews, answered questions and gave continuous product feedback as we built our Advanced Security Analytics Solution.
This research phase entailed going into SOCs with an unbiased view and interacting with everyone on the security team, including analysts across tiers and functions, SOC managers and CISOs. We watched them go about their day-to-day jobs to understand their responsibilities, motivations and aspirations, what they like and dislike about their jobs and the tools they use, how they collaborate with the broader team, and what keeps them awake at night.
Right away we saw that the underlying problems each of these organizations talked about were consistent across the board. Everyone was actively rethinking their SOC’s org structure and/or processes, but they were all doing so in their own unique way.
As we dug deeper and spent time observing and interviewing the teams, we could not help but conclude that the SOC, the last line of defense, was broken in at least four different ways:
- Too many alerts
- Time to investigate each alert
- Inability to hunt
- Analyst burnout
Too many alerts
Existing tools were generating hundreds of thousands of events daily. Different organizations handled this barrage of alerts differently – ranging from building teams that worked on tweaking algorithms that prioritized these alert streams, to security analysts eyeballing the alert stream to pick off certain alerts that “caught their eye”. It was a never-ending stream of alerts and the SOC could only hope that the critical alerts were at the top of the stream and got investigated.
One thing that these teams did not need was more alerts. Or, as someone in one of these SOCs put it, “Do not trust vendors whose product goes ‘bing’.”
Over 45 minutes to investigate each alert
This number was consistent across verticals, sizes of teams and organizations. The 45 minutes was not surprising when you consider that the analyst needed to pivot between 30+ tools to manually generate context about an alert, just to answer the question “Should I care?”
They would need to identify the device associated with the IP address at the time of the alert from DHCP logs, pivot again to find out who owned that device, and then turn to Outlook or an HR database to learn that person’s role. They would need to manually stitch together a query to understand the activity of this device around the time of the alert, as the device may have moved across IP addresses. All of this came before they even got to external sources of information like WHOIS, VirusTotal, etc.
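To make the DHCP pivot concrete, here is a minimal Python sketch of the two lookups described above: resolving an IP address to a device at a given time, and collecting a device's activity across every address it held. The Lease record, the function names and the event tuple layout are all hypothetical, chosen for illustration; this is not how Awake or any particular DHCP server represents the data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Sequence, Tuple

# Hypothetical event shape for this sketch: (timestamp, source IP, description).
Event = Tuple[datetime, str, str]


@dataclass(frozen=True)
class Lease:
    """One DHCP lease: a device held an IP address from start up to (not including) end."""
    device: str
    ip: str
    start: datetime
    end: datetime


def device_at(leases: Sequence[Lease], ip: str, when: datetime) -> Optional[str]:
    """Return the device that held `ip` at time `when`, or None if no lease matches."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when < lease.end:
            return lease.device
    return None


def activity_for_device(
    leases: Sequence[Lease], events: Sequence[Event], device: str
) -> List[Event]:
    """Return the events attributable to `device`, following it across IP addresses.

    An event is kept only if its source IP was leased to the device at the
    moment the event occurred, so traffic from a later holder of the same
    address is not misattributed.
    """
    windows = [lease for lease in leases if lease.device == device]
    return [
        event
        for event in events
        if any(w.ip == event[1] and w.start <= event[0] < w.end for w in windows)
    ]
```

Even in this toy form, the same IP address can map to different devices at different times, which is why filtering events by IP address alone gives misleading results.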
Needless to say, we found that this process was manual, repetitive and error prone. And even worse, perhaps unsurprisingly, most investigations ended inconclusively.
It is critical to understand the questions that analysts have, the information needed to answer them, and when those answers are needed in the investigative workflow. Furthermore, it is important to provide the answers in a format that fits the analyst’s mental model, for instance device names and activities instead of IP addresses and network protocols, rather than requiring them to manually interpret information. Getting this right would shrink the 45-minute investigative gap and allow the team to get through many more alerts in the alert queue, accurately and conclusively.
Want to hunt, but lack time / resources
All teams acknowledged that they did not want to rely solely on alerts from their tools; they wanted to hunt for threats proactively without a priori information. This is especially useful for finding non-malware-based activity, which by definition rarely results in an alert.
Unfortunately, despite the best intentions of the organizations we worked with, hunting was mostly aspirational rather than operational due to the challenges in doing it successfully. The process typically boiled down to hiring deeply experienced analysts with the skillsets to operate today’s complex hunting tools, and ensuring that they didn’t get pulled into aiding alert investigation work and enabling other analysts.
This is tough to achieve on both counts. One such expert analyst told us that they probably get a day per week at best to spend on hunting. Even then, it takes expert analysts a lot of manual and time-consuming work to get to conclusive results. As one analyst told us, “investigating a list of results produced by a simple hunting query like ‘Show me activity to uncategorized domains in the past 24 hours’ would take me my entire 8-hour work day, and still not be complete.”
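For a sense of what the analyst’s example query involves, here is a minimal Python sketch of “activity to uncategorized domains in the past 24 hours” over a list of connection records. The Connection record, its fields and the function name are assumptions made for illustration, with a category of None standing in for “uncategorized”; real hunting tools expose their own query languages and data models.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Connection:
    """Hypothetical record of one outbound connection."""
    timestamp: datetime
    device: str
    domain: str
    category: Optional[str]  # None means the domain has no known category


def uncategorized_activity(
    connections: Iterable[Connection],
    now: datetime,
    window: timedelta = timedelta(hours=24),
) -> Dict[str, Set[str]]:
    """Map each uncategorized domain seen within `window` of `now` to the devices that contacted it."""
    cutoff = now - window
    hits: Dict[str, Set[str]] = {}
    for conn in connections:
        if conn.category is None and cutoff <= conn.timestamp <= now:
            hits.setdefault(conn.domain, set()).add(conn.device)
    return hits
```

The query itself is short. The day of work the analyst describes goes into investigating each domain and device in the result.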
For hunting to be effective, it is important to understand the workflow and mental model of expert analysts in order to make hunting accessible to analysts across levels, and to provide more starting points for a hunt and assistive components along the way. At the same time, it is vital to make the process less manual and time-consuming for expert analysts. The process must encourage natural follow-on questions rather than discouraging them with slow responses.
Analyst burnout; tough to hire and train
We consistently heard analyst hiring and retention cited as a top concern for security managers. The statistic of 12-18 months as the burnout time for a security analyst popped up across our research. One analyst noticed this statistic and said, “I’m glad it is not only me.”
When asked what comes to mind about being an analyst, they mentioned terms like “outmatched,” “defeated and deflated,” “impossible,” and “grinding.” Walking through their day-to-day investigative process made it clear why these terms were used. Some organizations even went to the extent of rotating analysts between the SOC and other IT functions to avoid burnout. On the other end of the spectrum, though, analysts who had success stories to share about uncovering bad behavior in their networks knew how productive, satisfying, and dare we say fun this job could be.
The challenge here is to make this investigative work creative and engaging by taking the mundane parts out of the task. We also need to capture tribal knowledge from the more senior analysts and use it to enrich the security program as a whole and to help train less experienced analysts. It is also important to identify tasks that are repeatable and automate them as much as possible, so that the creative flow doesn’t get interrupted by tedious, mind-numbing data analysis, e.g., tracking devices as they hop across multiple IP addresses.
Designing the product
The point of all this research was to design and implement our product architecture based on these learnings. Awake has custom parsers that extract data needed to enable entity analytics, i.e. to infer entities, their attributes, relationships, and notable behaviors as envisioned in the analyst’s mental model and record that in the Security Knowledge Graph™. Perhaps surprisingly, much of this data was previously unmined. Activity analytics then fuse this information with the detailed records of transactions and thus provide a seamless investigative experience. Finally, an intelligent workbench exposes the raw data as well as the results of the analytics in a workflow-driven interface built specifically for analysts.
The learnings from our research also provided a few key guiding principles for the Awake user experience:
- Analyst first. Build a product that is centered around all analyst workflows. The product should be usable by all analysts irrespective of what they do – investigate alerts or hunt.
- Map to the human mental model. Analysts should be able to ask simple or complex questions of their environment using their mental model, rather than having to adapt their model to the underlying data formats. For example, an analyst should be able to operate on named devices from the real world rather than IP addresses.
- Reduce manual and repetitive tasks. Anticipate analyst questions and provide all the necessary information in a human-readable format, while leaving the decision of “Should I care?” to the analyst.
- Make it interactive. Deliver results back quickly to enable follow-on questions and allow the analyst to focus on fast and complete investigations.
- Easy to learn, reduced complexity. All analysts should be able to learn and use the product to efficiently investigate. The product should provide aids to make complex tasks like hunting more intuitive, accurate and fast while not needing advanced skills to operate.
This is a journey
The fun thing about the journey so far is that as we implement the learnings from our research, we expose new workflows that were previously not possible. Those in turn deliver their own set of analyst experience learnings. This is why we believe that user experience is a key foundational element of what we do at Awake.
Stay tuned for more on this blog on how we put the guiding principles above into practice while designing a product that addresses the challenges in the SOC. In the meantime, let us know your thoughts on our findings – if they resonate with you, if you have built systems to overcome some of these burdens, or if you want to work at Awake and help us solve some of these problems – especially in our user experience team :).
BTW, this is not an afterthought: we must again express our gratitude to the hundreds of security professionals who continue to give us feedback and insights, and who have let us observe their investigative process as we evolve our Advanced Security Analytics Solution. Without you none of this would be possible. #AnalystFirst