This article is part of a larger piece originally published on the Dynatrace blog.
Leveraging cloud-native technologies like Kubernetes or Red Hat OpenShift in multicloud ecosystems across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) for faster digital transformation introduces a whole host of challenges. The sheer number of technologies, components, and services in use contributes to the growing complexity of modern cloud environments.
These modern, cloud-native architectures also produce an immense volume, velocity, and variety of data. Every service and component exposes observability data (metrics, logs, and traces) that contains crucial information to drive digital businesses. But in web-scale environments, putting all the data points into context to gain actionable insights can't be achieved without automation. And without additional context, such as user experience data or the interdependencies between components, critical information is missing.
Logs and events play an essential role in this mix; they include critical information which can’t be found anywhere else, like details on transactions, processes, users, and environment changes. They are required to understand the full story of what happened in a system.
Putting logs into context with metrics, traces, and the broader application topology enables companies to manage their cloud architectures, platforms, and infrastructure more effectively, optimize applications, and remediate incidents with high efficiency.
Unfortunately, organizations struggle to effectively use logs for monitoring business-critical data and troubleshooting. Legacy monitoring, observability-only, and do-it-yourself approaches leave it up to digital teams to make sense of this data. The huge effort required to put everything into context and interpret the information manually does not scale. Here are six reasons why:
- Collecting data requires massive and ongoing configuration efforts
Unless you follow a highly automated approach, using log data together with traces and metrics entails a massive effort to set up, configure, and maintain log monitoring for the many technologies, services, and platforms involved in cloud-native stacks.
- Connecting data silos requires daunting integration endeavors
Some companies are still using different tools for application performance monitoring, infrastructure monitoring, and log monitoring. Connecting these silos and making sense of the data requires massive manual effort, including code changes and maintenance, heavy integrations, or juggling multiple analytics tools.
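As a minimal illustration of that integration burden, correlating records from separate tools typically means joining them by hand on a shared identifier such as a trace ID, and it only works when every tool emits one. The field names below (`trace_id`, `latency_ms`, `message`) are hypothetical examples, not any specific tool's schema:

```python
# Sketch: manually joining siloed monitoring data on a shared trace ID.

def correlate(traces, logs):
    """Attach log lines to the trace they belong to, keyed by trace_id."""
    logs_by_trace = {}
    for line in logs:
        logs_by_trace.setdefault(line["trace_id"], []).append(line["message"])
    return [
        {**trace, "logs": logs_by_trace.get(trace["trace_id"], [])}
        for trace in traces
    ]

traces = [{"trace_id": "abc123", "latency_ms": 950}]
logs = [
    {"trace_id": "abc123", "message": "payment service timeout"},
    {"trace_id": "zzz999", "message": "unrelated line"},
]

enriched = correlate(traces, logs)
```

Even this toy join has to be written, deployed, and maintained for every pair of tools and every schema change, which is exactly the manual effort that doesn't scale.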
- Manually maintaining dependencies among components doesn’t scale
Observability-only solutions often require manual tagging to define relationships between different components and their data points. While feasible for smaller teams or isolated initiatives, this requires vast standardization efforts in global enterprises and web-scale environments.
- Business context is missing without user sessions and experiences
The “three pillars of observability,” metrics, logs, and traces, still don’t tell the whole story. Without data on user transactions and experiences, tied to the underlying components and events, you miss critical context.
- Manual alerting on log data is not feasible in large environments
Tracking business-critical information from logs is another area that requires automation. Solutions that require setting and adjusting manual thresholds do not scale in bigger application environments.
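To make the contrast concrete, here is a sketch of a static threshold versus a baseline derived automatically from recent behavior. The numbers and the simple mean-plus-deviation baseline are illustrative only, not any vendor's algorithm:

```python
import statistics

# Hypothetical error counts per minute from one service's logs.
error_counts = [3, 4, 2, 5, 3, 4, 2, 3, 40]  # last value is a spike

# Static threshold: someone must pick (and keep adjusting) this number
# for every service in the environment -- the part that doesn't scale.
STATIC_THRESHOLD = 50
static_alerts = [c for c in error_counts if c > STATIC_THRESHOLD]

# Automated baseline: derive the threshold from recent history instead,
# so it adapts per service without manual tuning.
history, latest = error_counts[:-1], error_counts[-1]
baseline = statistics.mean(history) + 3 * statistics.stdev(history)
auto_alert = latest > baseline
```

The miscalibrated static threshold misses the spike entirely, while the derived baseline flags it, and the same code works unchanged for a service whose normal error rate is 300 per minute.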
- Manual troubleshooting is painful, hurts the business, and slows down innovation
Some solutions that use logs for troubleshooting provide only manual analytics for searching out the root causes of issues. This manual approach may work for simple use cases but breaks down quickly when a failure occurs at the service level in a complex Kubernetes environment. Identifying which log lines are relevant when an individual pod caused the problem requires expertise, manual analysis, and too much time, keeping your most valuable people from innovating.
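A rough sketch of the kind of analysis that has to happen, whether automated or by hand, to narrow a service-level failure down to one pod. The pod names and log lines are invented for illustration:

```python
from collections import Counter

# Hypothetical log lines from all pods behind a single Kubernetes service.
log_lines = [
    ("checkout-7d9f-a", "INFO request served"),
    ("checkout-7d9f-b", "ERROR connection refused"),
    ("checkout-7d9f-a", "INFO request served"),
    ("checkout-7d9f-b", "ERROR connection refused"),
    ("checkout-7d9f-c", "INFO request served"),
    ("checkout-7d9f-b", "ERROR OOMKilled container restarting"),
]

# Count errors per pod to surface the likely culprit.
errors_per_pod = Counter(
    pod for pod, message in log_lines if message.startswith("ERROR")
)
suspect_pod, error_count = errors_per_pod.most_common(1)[0]

# Keep only the log lines that matter for the investigation.
relevant_lines = [msg for pod, msg in log_lines if pod == suspect_pod]
```

With a handful of lines this is trivial; with thousands of pods and millions of lines, doing it manually for every incident is where the time and expertise drain comes from.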