IT Infrastructure Insights

Storage is Guilty Until Proven Innocent

Feb 26, 2020

Problem-solving in an Information Technology department can frequently look more like building a legal defense case than actually resolving issues. When a problem arises, storage is one of the first targets of blame. Storage is always guilty until proven innocent. Part of the problem is what IT has to go through to prove its innocence.

From a storage administrator’s perspective, proving innocence can be particularly frustrating, since problem-solving often starts from vague complaints such as “it’s running slow” or “the storage isn’t performing well.”  Such descriptions are so broad that it is hard to tell where the pain point actually is, and administrators end up hunting for problems in many different places without visibility into the nature of the problem itself.

Storage problem solving is often a series of educated guesses, hoping that one works.  Isn’t it ironic that our problem-solving methodology is part of the problem?

Utilizing Visual One Intelligence® (Visual One), storage administrators can examine the full gamut of connected devices affecting a user’s experience to help identify why “it’s running slow.”  It eliminates the guesswork, and get this, enables storage administrators to solve the problem BEFORE users even notice a problem.

For example, a large healthcare company and Visual One customer recently experienced just such an occurrence.  An application team complained of poor storage performance on one of their critical workloads, but the storage supporting that workload was operating at full speed.

The storage administrator used the performance reporting that Visual One provides for the VMware environment, Brocade SAN switches, and IBM Storwize storage system, together with a process of elimination, to identify the culprit.  Visual One’s VMware reports showed that CPU and memory utilization were within reasonable thresholds.  The Storwize array also showed no abnormal latency.  The IT team then opened the SAN switch report, which showed that a specific SAN port the application was using was generating unusually high Class 3 discard errors.  After a few zoning changes, the application team stopped blaming the storage.  Instead of trial and error (and lost productivity), IT quickly identified the problem, and everyone returned to work.

Visual One provides a unique ability to see the performance of a given host (physical or virtual), the zones and ports associated with the host, and the backend storage that the host is using.  Being able to view the fabric connectivity, the host performance, storage performance, and the connections between all of those devices and correlate the information is uniquely helpful for problem-solving.

Where to Start

One advantage that our healthcare customer had was that they were already running our software and knew exactly where to start, a process we guide customers through in their early days of using our solution.

For troubleshooting, a typical starting point for performance issues is the set of “performance by host” screens within Visual One.  Performance trends can be viewed at a host level to validate storage performance metrics at the same time that Visual One is showing compute and memory performance numbers.  The problem can then be worked backward from host to fabric and from the fabric to the backend storage.
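The host-to-fabric-to-storage walk described above can be sketched in a few lines of Python. This is only an illustration of the process of elimination, not Visual One's interface: the `LayerCheck` type, the metric names, the sample values, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerCheck:
    layer: str        # "host", "fabric", or "storage"
    metric: str       # hypothetical metric name
    value: float      # observed value (made-up sample data)
    threshold: float  # hypothetical acceptable upper limit

    @property
    def healthy(self) -> bool:
        return self.value <= self.threshold


def first_unhealthy(checks):
    """Walk the checks in order and return the first one over its threshold, or None."""
    for check in checks:
        if not check.healthy:
            return check
    return None


# Ordered the way the problem is worked: host first, then fabric, then storage.
checks = [
    LayerCheck("host", "cpu_utilization_pct", 41.0, 85.0),
    LayerCheck("host", "memory_utilization_pct", 63.0, 90.0),
    LayerCheck("fabric", "class3_discards_per_min", 1200.0, 10.0),
    LayerCheck("storage", "volume_latency_ms", 1.8, 5.0),
]

culprit = first_unhealthy(checks)
if culprit is not None:
    print(f"Investigate {culprit.layer}: {culprit.metric} = {culprit.value}")
```

With this sample data the host checks pass and the walk stops at the fabric layer, which mirrors the healthcare example: the host and the array looked fine, and the SAN port was the outlier.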

Another advantage of Visual One is the ability to observe potential problems on an array and put the array under the Visual One microscope to note whether the problem is device-specific or not.

Another Visual One customer, an eCommerce company, complained that one of their expensive arrays was showing slow performance metrics. The array’s latency numbers were closer to full seconds than the milliseconds they expected.  Visual One was able to present the array and its latency numbers visually.

At first glance, the latency problems appeared to be an issue with the storage system. The Visual One solution enabled IT to examine the storage pools further. It identified that not all of the storage pools were experiencing latency. Only one was.  Also, within that pool, only one host was experiencing the latency problem. Not only was the problem isolated to a single host, it was on a single node on that host.  The conclusion?  When a user had provisioned the storage from the storage system, they had assigned all of the LUNs to the same node (and used a pool with only hard disk drives).  In short, the array wasn’t experiencing a problem at all. It was doing what the user was telling it to do.
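The pool-to-host-to-node narrowing described above amounts to grouping latency samples one dimension at a time and following the worst group. Here is a minimal sketch, assuming latency samples have been exported as simple records; the field names and numbers are invented for illustration and are not Visual One's data format.

```python
from collections import defaultdict
from statistics import mean

# Made-up latency samples: one record per measurement.
samples = [
    {"pool": "pool_ssd", "host": "host_a", "node": "node1", "latency_ms": 1.2},
    {"pool": "pool_ssd", "host": "host_b", "node": "node2", "latency_ms": 0.9},
    {"pool": "pool_hdd", "host": "host_c", "node": "node1", "latency_ms": 850.0},
    {"pool": "pool_hdd", "host": "host_c", "node": "node1", "latency_ms": 920.0},
    {"pool": "pool_hdd", "host": "host_d", "node": "node2", "latency_ms": 4.0},
]


def mean_latency_by(records, field, **filters):
    """Average latency per value of `field`, using only records that match `filters`."""
    groups = defaultdict(list)
    for rec in records:
        if all(rec[key] == wanted for key, wanted in filters.items()):
            groups[rec[field]].append(rec["latency_ms"])
    return {name: mean(values) for name, values in groups.items()}


def worst(latencies):
    """Return the group name with the highest average latency."""
    return max(latencies, key=latencies.get)


# Drill down one level at a time, carrying the previous answer as a filter.
pool = worst(mean_latency_by(samples, "pool"))
host = worst(mean_latency_by(samples, "host", pool=pool))
node = worst(mean_latency_by(samples, "node", pool=pool, host=host))

print(pool, host, node)
```

Each step reuses the previous answer as a filter, so the search space shrinks from the whole array to one pool, then one host, then one node.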

The user, by violating storage best practices, had single-handedly caused what looked like array-wide latency.  Since Visual One also shows historical data, the IT team was able to identify precisely when the configuration change had occurred and absolve the vendor, and most of IT, of the predetermined guilt previously ascribed to them.
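One simple way to pin down when a change like this took effect is to scan the latency history for the first sustained jump above a baseline. The sketch below assumes the history is available as (date, latency) pairs; the dates, values, and the `first_sustained_jump` helper are hypothetical and do not describe how Visual One itself detects changes.

```python
from statistics import median


def first_sustained_jump(series, baseline_points=3, factor=10.0, sustain=2):
    """Return the first timestamp where latency stays above factor x baseline.

    The baseline is the median of the first `baseline_points` samples, and a
    jump only counts if it holds for `sustain` consecutive samples.
    """
    baseline = median(value for _, value in series[:baseline_points])
    limit = baseline * factor
    for i in range(baseline_points, len(series) - sustain + 1):
        window = series[i:i + sustain]
        if all(value > limit for _, value in window):
            return series[i][0]
    return None


# Invented daily latency history in milliseconds.
history = [
    ("2020-01-10", 1.1),
    ("2020-01-11", 1.3),
    ("2020-01-12", 1.2),
    ("2020-01-13", 1.4),
    ("2020-01-14", 780.0),
    ("2020-01-15", 910.0),
]

change_date = first_sustained_jump(history)
print(change_date)
```

Requiring the jump to persist for more than one sample keeps a single noisy reading from being mistaken for a configuration change.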

With Visual One, the IT team was able to resolve the issue within a few hours without wasting time and sweating through trial and error. Before Visual One, the customer’s standard process for diagnosing performance-related issues involved gathering large performance files from an array, then manually looking through history logs and implementing specific performance tracking tools to debug the potential problems.  With Visual One recording all of the environment’s telemetry data to the cloud, our customers can resolve issues, with historical information to prove how they happened, without the heavy lifting.