An interview with Tom Mack, technologist at Visual One Intelligence.
In today’s increasingly complex IT environments, infrastructure teams face a common yet often misunderstood challenge: the accumulated errors problem. This phenomenon—where multiple small issues combine to cause significant outages—is one of the most persistent challenges in modern infrastructure management.
We talked to our own Tom Mack to better understand this problem, the other factors that contribute to costly unplanned outages, and what to do about them. What follows is an edited account of our conversation:
The Accumulated Errors Problem: A Chain Waiting to Be Broken
Consider this scenario familiar to many IT professionals: during a resource constraint, you temporarily disable resilient storage mode in your VMware vSAN environment. Weeks later, after acquiring the necessary equipment to restore normal operations, the temporary configuration change gets overlooked in the daily rush of priorities. This single oversight creates no immediate problems, but when another minor issue arises months later, the combination leads to a complete outage.
This pattern repeats across enterprise infrastructures worldwide. The first event alone wouldn’t have caused an outage. The second event alone wouldn’t have caused an outage. It’s the accumulation of these small misconfigurations—the technical debt that builds up over time—that eventually leads to system failure, often at the worst possible moment.
And that’s the accumulated errors problem. In most cases, outages happen due to a series of accumulated errors:
1. none of which causes an outage in and of itself, and
2. any one of which, if found and fixed in advance, would have prevented the outage!
Beyond DevOps: Configuration Drift and Preventative Monitoring
Some organizations attempt to address this challenge through DevOps practices. While this approach has merit, it’s not a complete solution. DevOps implementations suffer from configuration drift as settings defined at one point gradually shift away from their intended state over time.
Plenty of tools can forcibly revert configurations to their defined state, but such draconian approaches aren’t practical for many organizations. The ideal solution combines flexible DevOps practices with continuous monitoring to ensure best practices are maintained—creating “the best of both worlds” where configurations remain consistent without excessive rigidity.
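As a rough illustration of what that continuous checking might look like, here is a minimal sketch of a drift scan. The setting names, baseline values, and the source of the live configuration are all hypothetical placeholders, not any particular vendor’s API:

```python
# Minimal configuration-drift check (illustrative only; setting names, baseline
# values, and the live-config source are hypothetical placeholders).

DESIRED_STATE = {
    "vsan.resilient_storage_mode": "enabled",   # the "temporary" change that was never reverted
    "vsan.failures_to_tolerate": 1,
    "dedupe.enabled": True,
}

def find_drift(live_config: dict, desired: dict = DESIRED_STATE) -> dict:
    """Return {setting: (desired, actual)} for every setting that has drifted."""
    drift = {}
    for key, want in desired.items():
        have = live_config.get(key)
        if have != want:
            drift[key] = (want, have)
    return drift

if __name__ == "__main__":
    # In practice this snapshot would come from a vendor API or a config export.
    live_config = {
        "vsan.resilient_storage_mode": "disabled",  # forgotten temporary change
        "vsan.failures_to_tolerate": 1,
        "dedupe.enabled": True,
    }
    for setting, (want, have) in find_drift(live_config).items():
        print(f"DRIFT: {setting}: expected {want!r}, found {have!r}")
```

A scan like this doesn’t force anything back to its defined state; it simply surfaces the forgotten change so a person can decide what to do about it, which is the “best of both worlds” balance described above.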
Understanding the Anatomy of Outages
Modern infrastructure includes numerous built-in redundancies and self-healing capabilities. Storage systems, virtualization platforms, networking equipment—all come with features designed to maintain operations through various failure scenarios. This means most outages aren’t caused by equipment failures but by preventable errors that accumulate over time.
When analyzing infrastructure failures, it’s helpful to categorize outages into three groups:
- External attacks (a separate security concern)
- Equipment failures (a small percentage of incidents)
- Preventable configuration errors (the majority of outages)
It’s this third category where the most significant improvements can be made. By breaking the chain of accumulated errors before they cascade into system failures, organizations can dramatically reduce their unplanned downtime.
The REST API Revolution and Infrastructure Visibility
Infrastructure visibility has improved dramatically in recent years. Just fifteen years ago, monitoring storage systems required proprietary connections or clunky script-based solutions unique to each vendor. Today’s widespread adoption of REST APIs has transformed these capabilities, enabling deeper inspection of system configurations across multiple platforms.
This evolution allows monitoring tools to move beyond simple status indicators (the traditional “sea of green, yellow, and red lights”) to meaningful analysis of configuration standards and best practices. However, REST API maturity varies significantly across vendors and platforms.
Many REST APIs developed just a few years ago were single-threaded, causing timeouts when systems were busy handling other tasks. This limitation has prompted vendors to develop more robust multi-threaded APIs that better support comprehensive monitoring and analytics.
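To make the timeout point concrete, here is a hedged sketch of how a collector might poll an array’s REST interface defensively. The base URL, token, and endpoint path are placeholders rather than any specific vendor’s API:

```python
# Defensive REST polling sketch (hypothetical endpoint and credentials).
# A single-threaded API that is busy with other work may simply not answer in
# time, so the collector needs explicit timeouts and retries with backoff.
import time
import requests

BASE_URL = "https://array.example.com/api/v1"    # placeholder
HEADERS = {"Authorization": "Bearer <token>"}    # placeholder

def get_with_retries(path: str, retries: int = 3, timeout: float = 10.0):
    """GET a REST resource, backing off when the array is too busy to respond."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(f"{BASE_URL}{path}", headers=HEADERS, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, requests.ConnectionError):
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying

if __name__ == "__main__":
    volumes = get_with_retries("/volumes")
```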
The Real-Time Monitoring Myth
One of the most persistent myths in infrastructure management is the value of real-time monitoring from third-party tools. This capability is frequently requested in RFPs and product evaluations, but rarely delivers the practical value organizations expect.
Consider what happens during an actual outage: if a storage array experiences problems, and you engage that vendor’s support, they won’t rely on your third-party monitoring platform. They’ll use their own native tools, which have deeper visibility into their systems than any third-party solution ever could.
This reality means real-time monitoring often delivers little practical value beyond populating status dashboards. What’s far more valuable is predictive capability—the ability to anticipate potential issues hours or days before they occur, rather than merely reporting on what’s happening right now.
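As a toy illustration of that predictive idea, the sketch below fits a straight line to recent utilization samples and estimates how many days remain before a threshold is crossed. It assumes made-up daily percentages and a simple linear trend; real tooling would use richer models and real telemetry:

```python
# Toy predictive sketch: fit a straight line to recent utilization samples and
# estimate how many days remain before a threshold is crossed (illustrative only).
def days_until_threshold(daily_utilization_pct: list[float], threshold_pct: float = 90.0):
    """Estimate days until utilization crosses threshold, or None if not trending up."""
    n = len(daily_utilization_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_utilization_pct) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_utilization_pct))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    slope = slope_num / slope_den            # percentage points per day
    if slope <= 0:
        return None                          # flat or shrinking: no crossing predicted
    latest = daily_utilization_pct[-1]
    return max(0.0, (threshold_pct - latest) / slope)

# Example: utilization creeping up by roughly half a point per day from 80%.
print(days_until_threshold([80.0, 80.5, 81.1, 81.4, 82.0]))  # roughly 16 days of headroom
```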
It’s a strange paradox in the industry: real-time monitoring is something everybody requests but few actually use effectively. The resources spent developing and maintaining real-time monitoring capabilities often yield minimal return compared to investments in predictive analytics.
Balancing Health and Efficiency: The Storage Optimization Challenge
Every IT organization faces a fundamental challenge: how to maximize the return on infrastructure investments without compromising operational stability. This balance is particularly evident in storage optimization, where aggressive utilization can improve cost-efficiency but increase risk.
Data reduction technologies illustrate this challenge perfectly. A given array’s data reduction ratio might historically fluctuate between 3.2:1 and 4.2:1 depending on workload characteristics. When operating at the higher ratio (4.2:1), the system appears to have ample capacity—but this creates vulnerability if the ratio decreases due to workload changes.
In such scenarios, organizations operating near the high end of their data reduction range should maintain lower utilization thresholds (perhaps 75% rather than 90%) to accommodate potential ratio decreases. This vendor-specific approach to optimization highlights why generic capacity planning often falls short. Each storage platform implements features differently, requiring tailored risk-versus-return analysis.
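As a back-of-the-envelope illustration of that risk, using the 3.2:1–4.2:1 range above and an assumed round number for physical capacity, the sketch below shows how a healthy-looking 75% utilization turns into near-exhaustion if the ratio slips:

```python
# What happens to physical utilization if the data reduction ratio slips from
# 4.2:1 back to 3.2:1? The capacity figure is a made-up round number for illustration.
physical_capacity_tb = 100.0
utilization_now = 0.75          # 75% of physical capacity in use
ratio_now, ratio_worst = 4.2, 3.2

logical_tb = physical_capacity_tb * utilization_now * ratio_now   # data actually stored
physical_needed_worst = logical_tb / ratio_worst                  # space needed at the lower ratio
print(f"Utilization if ratio drops: {physical_needed_worst / physical_capacity_tb:.0%}")
# -> about 98%: a 75% threshold leaves just enough headroom; starting at 90% would not.
```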
Cross-Fleet Performance Optimization
Beyond single-system optimization, true infrastructure efficiency requires performance optimization across an organization’s entire technology fleet. This means simultaneously analyzing:
- CPU and memory utilization within VMs
- Traffic patterns between VMs
- Host-level resource consumption
- Cluster-wide performance metrics
- Storage backend performance
- Application-level metrics
This comprehensive visibility enables more sophisticated optimization strategies, such as determining whether workloads should move between clusters rather than just between hosts within a cluster. The same analytics can inform cloud migration decisions by comparing on-premises and cloud costs for specific workloads.
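A highly simplified sketch of what correlating those layers might look like is below; the metric fields, thresholds, and cost figures are all hypothetical and exist only to show the shape of the decision:

```python
# Toy cross-fleet view: combine per-layer metrics for a workload and make a
# coarse placement suggestion. Fields, thresholds, and costs are hypothetical.
from dataclasses import dataclass

@dataclass
class WorkloadView:
    name: str
    vm_cpu_pct: float          # CPU utilization inside the VM
    host_cpu_pct: float        # utilization of the host it runs on
    cluster_cpu_pct: float     # average utilization of the whole cluster
    storage_latency_ms: float  # backend storage latency seen by the VM
    onprem_monthly_cost: float
    cloud_monthly_cost: float

def recommend(w: WorkloadView) -> str:
    if w.storage_latency_ms > 10:
        return "investigate the storage backend first: it is the likely bottleneck"
    if w.host_cpu_pct > 85 and w.cluster_cpu_pct < 60:
        return "rebalance to another host in the same cluster"
    if w.cluster_cpu_pct > 80:
        return "consider moving to a less loaded cluster"
    if w.cloud_monthly_cost < w.onprem_monthly_cost:
        return "candidate for cloud migration on cost grounds"
    return "leave in place"

print(recommend(WorkloadView("erp-db", 70, 90, 55, 4.0, 1200.0, 1500.0)))
# -> rebalance to another host in the same cluster
```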
Storage: The Ultimate Infrastructure Bottleneck
At the heart of infrastructure performance lies a fundamental hierarchy of speed:
- Processors: Lightning fast
- Memory: 1,000 times slower than processors, but still extremely fast
- All-flash storage: 100 times slower than memory
- Spinning disk: 1,000 times slower than memory
This performance cascade means that storage is the bottleneck for almost all workloads. No matter how much you optimize compute resources, performance limitations almost always trace back to storage constraints in the end. This reality makes storage optimization the cornerstone of effective infrastructure management.
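To put those rough ratios into relative numbers (normalizing a processor operation to one time unit, and treating the ratios above as illustrative rather than measured figures):

```python
# Relative access times using the rough ratios quoted above (illustrative, not measured).
processor = 1                      # normalize a processor operation to 1 unit
memory = processor * 1_000         # "1,000 times slower than processors"
all_flash = memory * 100           # "100 times slower than memory"
spinning_disk = memory * 1_000     # "1,000 times slower than memory"

for tier, cost in {"processor": processor, "memory": memory,
                   "all-flash": all_flash, "spinning disk": spinning_disk}.items():
    print(f"{tier:>13}: {cost:>9,} units")
# Under these ratios, anything served from flash or disk is five to six orders of
# magnitude slower than the processor, which is why storage ends up as the bottleneck.
```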
Breaking the Chain Before It Breaks You
The key to preventing infrastructure outages lies in breaking the chain of accumulated errors before they cascade into system failures. This requires:
- Comprehensive visibility across all infrastructure components
- Continuous best practice scanning to identify configuration drift
- Predictive analytics that anticipate potential problems before they occur
- Vendor-specific optimization models that account for unique platform characteristics
By implementing these capabilities, organizations can transform infrastructure management from a reactive firefighting exercise into a strategic optimization practice that directly impacts both operational stability and bottom-line costs.
As infrastructure environments continue to grow more complex, the ability to spot and address accumulated errors will only become more critical. The most successful IT organizations will be those that implement proactive systems to identify and resolve these issues before they combine to create preventable outages.
About Visual One Intelligence®
Visual One Intelligence® is an infrastructure tool with a unique approach—guaranteeing better & faster monitoring, observability, and FinOps insights by leveraging resource-level metrics across your hybrid infrastructure.
By consolidating independent data elements into unified metrics, Visual One’s platform correlates and interprets hybrid infrastructure data to illuminate cost-saving and operations-sustaining details that otherwise stay hidden.
These insights lead to less downtime, lower costs, better planning, and more efficient architectures.