Splunk, a platform for searching, monitoring, and examining machine-generated big data, has launched a new release of application monitoring tool SignalFx Microservices APM. The new release combines NoSample tracing, open standards-based instrumentation and artificial intelligence (AI)-driven directed troubleshooting from SignalFx and Omnition. SignalFx Microservices APM supports open source and open standards-based instrumentation with the goal of flexible data collection designed for modern cloud environments.
Splunk also further expanded its observability offerings with a major feature release in SignalFx Infrastructure Monitoring for containerised data: Kubernetes Navigator. Kubernetes Navigator uses AI-driven analytics to surface recommendations intended to expedite triaging and troubleshooting. Workflow integration between Kubernetes Navigator and Splunk Enterprise or Splunk Cloud aims to reduce context switching and provide insights with the goal of accelerated root-cause analysis.
InfoQ asked Karthik Rau, area general manager for application management, Splunk, to answer some questions relating to the new release:
InfoQ: How do teams use Splunk solutions, including SignalFx, to obtain a complete picture in hybrid environments that include legacy, heritage or cherished applications and platforms, such as SAP or mainframe alongside public/private cloud and microservices-based products?
Karthik Rau: When we look at the market, we see a trend to move workloads from private to hybrid to public clouds; a journey to become cloud-native where applications are no longer constructed as monoliths. Because of the ephemeral nature of cloud infrastructure, the complex interdependencies of hundreds, sometimes thousands, of microservices, and DevOps teams releasing code multiple times per day, problems occur much more frequently and are much harder to troubleshoot and resolve. This new complexity frequently results in customer-impacting service outages, slowdowns and errors. SignalFx Microservices APM aims to collect and analyse data across hybrid environments and uses a combination of AI and ML to drive relevant information to the surface, with the goal of allowing developers to spend less time searching for the source of problems and more time resolving them.
InfoQ: How can Splunk help teams gain visibility into the four key DevOps metrics identified by Dr. Nicole Forsgren et al. in Accelerate, i.e. deployment frequency, lead time (code commit to deploy in production), MTTR and change failure rate?
Rau: There are two important aspects to this: firstly, the application delivery pipeline and secondly, as part of that lifecycle, production monitoring. Splunk Enterprise and Splunk Cloud provide application lifecycle analytics, which connect tools across the development toolchain and give visibility into the development process, code quality and DevOps metrics. SignalFx Microservices APM adds a production monitoring and troubleshooting solution for on-premise, hybrid, or cloud applications. SignalFx Microservices APM aims to collect all traces, providing DevOps teams with levels of granularity that help them understand the behaviour of their software and accelerate deployment frequency. Combined with our streaming analytics engine, this lets our customers see the impact of such releases and act accordingly, with the hope of minimising Mean Time to Detect (MTTD). Our AI-Driven Directed Troubleshooting, which combs through the trace data and surfaces recommendations, means DevOps teams can pinpoint and resolve the root cause of an issue, with the intention of reducing MTTR and helping developers. Our monitoring-as-code approach can enable DevOps teams to deploy multiple versions of code or canary releases, track the impact of each release, and roll back if there’s a problem, with the intention of reducing change failure rate and fixing problems before they impact end users.
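To make the four metrics named in the question concrete, the following is a minimal sketch of how they could be computed from a list of deployment records. It is not Splunk's or SignalFx's implementation: the `Deployment` record, its field names and the `four_key_metrics` function are hypothetical, and MTTR is assumed here to be measured from the failing deployment to the moment service is restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments, period_days):
    """Compute the four key metrics over a reporting period of `period_days`."""
    if not deployments:
        raise ValueError("need at least one deployment")
    n = len(deployments)
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: time to restore is counted from the failing deployment.
    restores = [d.restored_at - d.deployed_at
                for d in failures if d.restored_at is not None]
    return {
        "deployment_frequency_per_day": n / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / n,
        "mttr": sum(restores, timedelta()) / len(restores) if restores else None,
        "change_failure_rate": len(failures) / n,
    }


# Illustrative data only: four deployments over two days, one of which failed.
t0 = datetime(2020, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(hours=5), t0 + timedelta(hours=9),
               failed=True, restored_at=t0 + timedelta(hours=10)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=1)),
    Deployment(t0 + timedelta(days=1, hours=3), t0 + timedelta(days=1, hours=4)),
]
metrics = four_key_metrics(sample, period_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In practice the commit and deployment timestamps would come from the delivery toolchain and the failure and restore timestamps from production monitoring, which is why the answer above treats the two as parts of one picture.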
InfoQ: Can Splunk help teams calculate the value realised from a new feature and if so, how?
Rau: SignalFx supports custom business metrics that tie back to the production application, so DevOps teams and business stakeholders can see how code changes positively (or negatively) impact application uptime and user experience, and correlate that in real time to, for example in an e-commerce application, units of goods sold.
InfoQ: What are the observability challenges that microservices architectures cause and how does Splunk solve them?
Rau: Microservices have a lot of advantages in terms of scaling and time to market, but they also introduce their own challenges and high degrees of complexity. The infrastructure on which they run is typically ephemeral, spinning up and spinning down very quickly. Services and individual instances of services scale fast and, as their numbers multiply, the interactions between them multiply even faster, causing the amount of data to skyrocket and creating very complex interdependencies. You often have multiple versions of the same microservice running at the same time, and new versions are sometimes released several times a day. Finally, DevOps teams try to find the optimal tools and frameworks for each microservice, and as a result rely heavily on open source and open standards. SignalFx Microservices APM was designed specifically for microservices: it ingests and analyses the data using AI and streaming analytics to get insights quickly, and it leverages and contributes to open standards such as OpenTelemetry, which Splunk co-founded.
InfoQ: What are some examples of insights that Kubernetes Navigator provides?
Rau: One such example is a noisy neighbour problem. Application workloads run on containers that are dynamically managed by Kubernetes across shared infrastructure resources. A noisy neighbour, which could be caused by a simple misconfiguration on a memory limit, could increase the memory consumption on a particular node, impacting the rest of the containers, and application workloads, on that node. This might result in end users experiencing slow performance or errors as they interact with the application. Kubernetes Navigator makes suggestions on what specific pod or workload might be causing the anomalies, with the goal of reducing triaging and troubleshooting time.
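The noisy neighbour scenario described above can be illustrated with a simple heuristic: find nodes under memory pressure and name the pod consuming the most memory on each. This is only a sketch under stated assumptions, not how Kubernetes Navigator works; the input format, the `noisy_neighbour_candidates` function and the 90% pressure threshold are all hypothetical.

```python
from collections import defaultdict


def noisy_neighbour_candidates(pods, node_capacity_mib, pressure_threshold=0.9):
    """Return {node: pod name} for the top memory consumer on each pressured node.

    pods: list of dicts with keys "name", "node", "memory_mib" and
          "limit_mib" (None when no memory limit is configured).
    node_capacity_mib: dict mapping node name to its memory capacity in MiB.
    """
    by_node = defaultdict(list)
    for pod in pods:
        by_node[pod["node"]].append(pod)

    suspects = {}
    for node, node_pods in by_node.items():
        used = sum(p["memory_mib"] for p in node_pods)
        if used / node_capacity_mib[node] < pressure_threshold:
            continue  # node is not under memory pressure, nothing to flag
        # Heuristic: the largest consumer on a pressured node is the first suspect.
        top = max(node_pods, key=lambda p: p["memory_mib"])
        suspects[node] = top["name"]
    return suspects


# Illustrative data only: "report-job" has no memory limit and crowds node-a.
pods = [
    {"name": "checkout", "node": "node-a", "memory_mib": 1024, "limit_mib": 2048},
    {"name": "cart", "node": "node-a", "memory_mib": 1024, "limit_mib": 2048},
    {"name": "report-job", "node": "node-a", "memory_mib": 6000, "limit_mib": None},
    {"name": "search", "node": "node-b", "memory_mib": 1000, "limit_mib": 2048},
    {"name": "auth", "node": "node-b", "memory_mib": 1000, "limit_mib": 2048},
]
capacity = {"node-a": 8192, "node-b": 8192}
suspects = noisy_neighbour_candidates(pods, capacity)
print(suspects)
```

A real system would look at trends over time rather than a single snapshot; the point here is only that the suspect is identified per node, because the other workloads on that node are the ones affected.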
InfoQ: Where is the line between infrastructure and application in a product-centric, cloud and microservices world?
Rau: In order to survive and thrive in today’s increasingly product-centric world, an equal focus should be put on infrastructure and application. End user interactions are at the core of every business today, and their experiences are fragile. End users that have to wait too long for an application to load do not care whether the root cause is in the infrastructure or in the application. Having a unified, full stack view of both applications and infrastructure, and being able to correlate the two is recommended, and can have a direct impact on revenue, and ultimately, overall brand loyalty. Another consideration is the evolution of cloud infrastructure in the sense that it is becoming much more software-defined and ephemeral. Developers no longer need to rely on IT teams to rack and stack servers in a data centre. They can simply go to any cloud provider and, with a few simple clicks of the mouse button, provision any amount of infrastructure resources they need in a matter of minutes. They can also use serverless functions, which abstract away infrastructure altogether. This evolution of infrastructure has been critical to accelerating innovation and the delivery of software.
InfoQ: How does Splunk integrate with ChatOps and service desk or incident management solutions such as ServiceNow, Jira Service Desk or Cherwell?
Rau: Splunk’s VictorOps incident response system integrates with service desks like ServiceNow as well as chat-oriented tools like Slack. Incident tickets in ServiceNow are correlated with incidents in VictorOps, and updates and closures of tickets are synchronised between ServiceNow and VictorOps. Similarly, VictorOps integrates with Slack. When an incident is opened, a Slack channel is opened, and chat that occurs in that channel is synchronised between Slack and VictorOps. You can use Slack commands to escalate, snooze and close events. Combined, VictorOps can synchronise across ServiceNow and Slack so operations teams and developers can chat in their preferred tool. Teams can also curate interactions between people for post incident review reporting.
InfoQ: What does being a gold member of Cloud Native Computing Foundation (CNCF) mean?
Rau: We became a gold member to demonstrate our commitment to open source and deepen our relationship with the DevOps community. While Splunk has been actively involved in open source for many years with offerings and contributions to numerous projects, this commitment has accelerated with the acquisitions of SignalFx, Omnition (a founding contributor to the OpenTelemetry project) and others.
Our own CNCF contributions have included projects like Cortex and Prometheus, Envoy, Fluentd and others, both as maintainers and contributors. More recently, our team is focused on bringing the OpenTelemetry project to fruition to provide developers with the most flexibility in collecting data from their applications while avoiding proprietary, heavy-weight and performance-impacting agents.
To learn more about the CNCF’s projects, review the CNCF Cloud Native Interactive Landscape.