Elimination of false positives reduces event volume by 40%
Our client, a large telecommunications provider, had teams working across multiple time zones and operated a sprawling and complex IT system. The scale and density of this system made it extremely difficult to gain visibility into IT services, leaving the security team blind to what was happening in its IT environment.
With few, if any, operations monitoring tools in place to proactively monitor these systems, the team had no visibility of the availability of its IT services or of KPIs such as failure rates, send request times, and response times. This lack of oversight made it difficult for the team to prioritise issues, and nearly impossible to find the root cause of any problem. Unable to investigate the source of issues, the team could not take proactive measures to prevent them from recurring, which severely impacted service performance.
This was not only a problem for IT teams, but also for executives, who lacked the insight into IT business operations that would help them make decisions.
The company needed a solution that would map KPIs to critical service components, enabling the operations team to effectively drill down into issues in real time and conduct in-depth investigations to find resolutions.
RiverSafe had previously implemented Splunk Enterprise Security for the customer. Given the success of that implementation and the positive client feedback on the platform, RiverSafe was engaged again, this time to deploy Splunk’s IT Service Intelligence (ITSI) tool to help the customer tackle its visibility problem.
Splunk ITSI uses machine learning to analyse existing data and predict future issues. As well as forecasting potential service bottlenecks, it can troubleshoot problems and help users resolve issues fast. By delivering comprehensive monitoring across the entire IT environment, Splunk ITSI would also give the team full observability of their IT infrastructure.
In particular, the engineering team wanted to gather metric data relating to the Kubernetes platform. RiverSafe took data collection metrics already used elsewhere in the IT environment, reconfigured them, and implemented them in Splunk, allowing engineers to collate and access this information from different data sources in one place.
The outcome: Actionable insights in days, not weeks
In less than a week, the RiverSafe team implemented Splunk ITSI and began running monitoring services.
With data from existing KPIs already indexed by the Splunk platform, the team are now able to access service insights even faster. These KPIs allow the operations team to identify trends, detect patterns, and proactively address anomalies before they develop into issues.
Instant visibility with glass table visualisations
To enable rapid and proactive issue resolution, RiverSafe implemented custom glass table visualisations in Splunk ITSI. These enable the team to navigate large volumes of data and reduce the time needed to identify and resolve problems.
This simple and accessible dashboard gives the team an instant, digestible overview of its web portal performance metrics.
The KPIs on display include the number of open tickets and failed login attempts, memory usage, API call success rates, average response times, and the overall health of container services.
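To illustrate how KPIs like these can be rolled up into a single service status, here is a minimal Python sketch. It is not Splunk ITSI's own health score calculation: the `Kpi` class, the "worst KPI wins" rule, and every value and threshold below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    value: float
    warning: float    # hypothetical threshold, not from the case study
    critical: float   # hypothetical threshold, not from the case study
    higher_is_worse: bool = True


def kpi_status(kpi: Kpi) -> str:
    """Classify one KPI as 'normal', 'warning', or 'critical'."""
    if kpi.higher_is_worse:
        if kpi.value >= kpi.critical:
            return "critical"
        if kpi.value >= kpi.warning:
            return "warning"
    else:
        if kpi.value <= kpi.critical:
            return "critical"
        if kpi.value <= kpi.warning:
            return "warning"
    return "normal"


def service_status(kpis: list[Kpi]) -> str:
    """Simplified roll-up: a service is as unhealthy as its worst KPI."""
    order = ["normal", "warning", "critical"]
    return max((kpi_status(k) for k in kpis), key=order.index, default="normal")


# Illustrative readings for a web portal service (all numbers are made up).
web_portal = [
    Kpi("failed_login_attempts", value=12, warning=50, critical=200),
    Kpi("memory_usage_pct", value=83, warning=80, critical=95),
    Kpi("api_call_success_pct", value=99.2, warning=98, critical=95,
        higher_is_worse=False),
    Kpi("avg_response_time_ms", value=240, warning=500, critical=1500),
]

print(service_status(web_portal))
```

With these sample readings, only memory usage crosses its warning threshold, so the service as a whole is reported as "warning". A weighted roll-up would be a natural refinement of this sketch.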
Event response time cut by 8 minutes
As a result of RiverSafe’s work, the team have been able to centralise events from all their previously siloed solutions into a single interface with Splunk ITSI.
The event analytics in Splunk ITSI help operators prioritise responses and react more quickly to customers’ infrastructure events, empowering them to provide a better service. This is thanks in part to Splunk ITSI’s ability to identify and filter out false positives from the event management process. Excluding these invalid events reduced the total event volume by 40%, helping operators focus on the events that really matter.
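The arithmetic behind a volume reduction like this can be sketched in a few lines of Python. The filtering rule, the field names, and the ten sample events below are all hypothetical; the case study does not describe how Splunk ITSI was configured to recognise false positives, so this shows only the general idea of dropping invalid events and measuring the effect.

```python
def is_false_positive(event: dict) -> bool:
    """Hypothetical rule: ignore alerts raised during planned maintenance
    and alerts that cleared themselves within five seconds."""
    return event["in_maintenance_window"] or event["duration_s"] < 5


def filter_events(events: list[dict]) -> list[dict]:
    """Keep only the events an operator actually needs to look at."""
    return [e for e in events if not is_false_positive(e)]


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop in event volume after filtering."""
    return 100 * (before - after) / before


# Ten made-up events as (in_maintenance_window, duration_s) pairs.
raw = [
    (False, 120), (True, 300), (False, 2), (False, 45), (False, 600),
    (True, 10), (False, 90), (False, 3), (False, 30), (False, 75),
]
events = [
    {"id": i, "in_maintenance_window": m, "duration_s": d}
    for i, (m, d) in enumerate(raw, start=1)
]

kept = filter_events(events)
print(len(events), len(kept), reduction_pct(len(events), len(kept)))
```

In this made-up sample, four of the ten events match the rule, so six remain and the volume falls by 40%.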
With fewer events to process, a single interface to work from, and a streamlined event analytics framework in Splunk ITSI, operators now process events eight minutes faster on average. This boost in efficiency has led to a major improvement in the company’s SLA performance.
Overall, Splunk ITSI has delivered enhanced operational visibility, meaning the team can locate bottlenecks in workflows quickly and deliver fast recovery and troubleshooting solutions.
Along the way, RiverSafe also provided best practices for using Splunk ITSI and recommended the most effective ways to collect data, including an alternative metric type that would save storage space when collecting logs.