Leveraging Automated Testing for Complex IT Systems

By Tim Schauer

Innovative Defense Technologies (IDT) routinely uses automated testing, supported by our specialized testing tools, to reduce costs and increase product quality for our customers.  Currently, our main users are Department of Defense clients with extremely complex embedded systems that must meet thousands of unique and varied requirements.  Each time a new feature is introduced into the product, a large round of regression testing is typically needed to ensure that the feature has not adversely affected already fielded systems.

As an experienced IT professional, I maintain that one of the most critical areas where automated testing and re-testing must occur is in the IT infrastructure itself. If we are to consider the challenges of testing a software baseline, the very first thing to ensure is that we have a well-defined and consistent infrastructure upon which to test.

Reliability, Maintainability, Availability

All automated software testing should, as a minimum, ensure that the infrastructure itself meets rigorous requirements that industry best practices have already defined. What are these best practices?  All industries have a basic set of guidelines that each IT infrastructure should meet, based on business objectives.  Normally, the more stringent the objectives, the higher the costs associated with meeting these objectives.  However, these objectives can usually be summed up in the acronym RMA, which stands for:

  • Reliability
  • Maintainability
  • Availability

Briefly, reliability is the probability that a system will perform its intended function over the lifetime of the product.  Normally, there are environmental constraints put upon the reliability of a product.  Reliability often declines with age due to the wear and tear of components, and parts can fall out of tolerance with the original specification.

Maintainability is the capacity of a system to be kept within specified conditions over a period of time with a reasonable amount of effort. This measurement has a direct bearing on manpower costs; the harder a system is to maintain, the more man-hours are expended to bring the system back into specification.  Additionally, the more widely systems vary in their complexity, the more a business must invest in diverse technical skills to maintain these different systems.

Availability is the percentage of time that the system is available for productive work (e.g. the ‘five 9s’, or 99.999%, often quoted for network availability). Availability often gets lumped in with performance metrics, and for good reason: even though a system may technically be available, if its performance is such that it can no longer produce useful work, the system, for all intents and purposes, has become unavailable.
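To make the ‘five 9s’ figure concrete, the short Python sketch below converts an availability target into a yearly downtime budget. The function names are illustrative only, and the calculation assumes a 365-day year; under that assumption, 99.999% availability leaves roughly 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored for simplicity


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


def measured_availability(uptime_minutes: float, downtime_minutes: float) -> float:
    """Availability, as a percentage, observed over a measurement window."""
    total = uptime_minutes + downtime_minutes
    if total <= 0:
        raise ValueError("the measurement window must be longer than zero minutes")
    return 100.0 * uptime_minutes / total


if __name__ == "__main__":
    # Print the downtime budget for two, three, four and five 9s.
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% available -> {budget:.2f} minutes of downtime per year")
```

Note that this measures only whether the system is up. As described above, a system that is up but too slow to do useful work would still need to be counted as downtime for the figure to be meaningful.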

Without a solid RMA foundation, the IT infrastructure would quickly become fragmented, unavailable and very often useless to the business at hand. Therefore, testing and monitoring this infrastructure has become a major factor in maintaining IT systems.

Monitoring IT Systems: Self-healing and Expert Systems

The ability to measure and monitor IT systems is a major industry.  There are both commercial and open source solutions that measure everything from network response time to disk I/O.  Almost all modern operating systems have these functions already built into the base OS itself, but the need exists to collate and aggregate these metrics over diverse systems, widely dispersed geography and also long time periods.

Many monitoring systems today are basic tests repeated continuously (on the order of many times per minute), to ensure that IT ‘catches’ the problem before anyone else does.  Network routers ‘test’ the paths between systems and if a test fails, automation kicks in to ‘reroute’ packets around the failed path.  This is sometimes called self-healing.
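The test-then-reroute idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular router works: the names `choose_path` and `tcp_probe` are invented for this example, and a path is reduced to a "host:port" string that either accepts a TCP connection or does not.

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_probe(target: str, timeout: float = 1.0) -> bool:
    """Basic test: can a TCP connection be opened to 'host:port' within the timeout?"""
    host, port = target.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def choose_path(paths: Sequence[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first path (in order of preference) whose test passes.

    Returning a later entry when the preferred one fails is the 'reroute';
    None means every path failed and a human needs to be alerted.
    """
    for path in paths:
        if probe(path):
            return path
    return None
```

Calling `choose_path(["10.0.0.1:443", "10.0.1.1:443"], tcp_probe)` several times per minute (the addresses are placeholders) would give the continuously repeated test described above. Passing the probe in as a parameter keeps the selection logic testable without a live network.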

Unfortunately, many of the monitoring systems in use today are not self-healing and are very basic in their ability to respond to problems. Normally, a system will ‘alert’ an operator to a problem, and the operator must then take some action to correct it.  This is good in that it notifies personnel of an issue fairly quickly. The alert is of no use, however, if it is ignored, if the operator’s corrective action is ineffective or, worse, if that action exacerbates the problem.

Some systems have used ‘expert systems’ to counteract lags or inefficiency in operator response.  These expert systems ensure that many specific conditions are met in a problem sequence before a definitive action is taken, resulting in a high probability of successful corrective maintenance.  The more expert the system, the more self-healing occurs in the system.
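A minimal sketch of this "all conditions must hold before acting" behaviour is shown below in Python. It is a toy rule evaluator rather than a real expert system, and everything in it is an assumption made for illustration: the `Rule` structure, the observation names (`consecutive_failures`, `host_reachable`, `disk_used_percent`) and the thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Observations = Dict[str, float]
Condition = Callable[[Observations], bool]


@dataclass
class Rule:
    name: str
    conditions: List[Condition]  # every condition must hold before the action is taken
    action: str


def decide(rules: List[Rule], observed: Observations) -> Optional[str]:
    """Return the action of the first rule whose conditions are all met, else None."""
    for rule in rules:
        if all(condition(observed) for condition in rule.conditions):
            return rule.action
    return None


# Hypothetical rule: restart the service only when the failure is persistent,
# the host itself is still reachable, and a full disk is not the likely cause.
RESTART_WEB = Rule(
    name="restart-web-service",
    conditions=[
        lambda o: o.get("consecutive_failures", 0) >= 3,
        lambda o: o.get("host_reachable", 0) == 1,
        lambda o: o.get("disk_used_percent", 100) < 90,
    ],
    action="restart web service",
)
```

With this rule, a single failed check returns `None` (no automatic action), while three consecutive failures on a reachable host with free disk space return the restart action. Missing observations default to the cautious value, so incomplete data never triggers a corrective action.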

In effect, these expert systems are very complex tests of a known system, with the expected results enumerated in minute detail for many different inputs.  The more complex the IT infrastructure, the more complex the test needed to ensure overall integrity of the system.  In fact, due to the complexity of the system, the tool set needed to test the system may be very large in scope.

Higher System Availability through Continuous Monitoring

By leveraging current automated testing tool sets and re-using developer tools (such as continuous test and integration tools), a robust IT infrastructure can be realized more quickly. The ability to re-run tests automatically, via a Unix ‘cron’ job or some other ‘batch’ mechanism, is the key to continuous monitoring and, through it, higher availability.
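As a rough sketch of what such a batch-driven re-test might look like, the Python script below runs a set of named checks and prints only the failures. The check shown, the one-gigabyte threshold and the paths in the sample crontab line are all assumptions made for this example; they are not part of any IDT product.

```python
# Hypothetical crontab entry that would re-run this script every five minutes:
#   */5 * * * * /usr/bin/python3 /opt/it-checks/run_checks.py
import shutil
from typing import Callable, Dict, List

Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every check and record pass/fail; one broken check never stops the batch."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that crashes is recorded as a failure.
            results[name] = False
    return results


def failures(results: Dict[str, bool]) -> List[str]:
    """Names of the checks that did not pass, in alphabetical order."""
    return sorted(name for name, passed in results.items() if not passed)


def root_disk_has_space(min_free_bytes: int = 1024 ** 3) -> bool:
    """Example check: at least 1 GiB free on the root filesystem (assumed threshold)."""
    return shutil.disk_usage("/").free >= min_free_bytes


if __name__ == "__main__":
    for name in failures(run_checks({"root-disk-space": root_disk_has_space})):
        print(f"FAILED: {name}")
```

Printing only on failure is a deliberate choice: cron is commonly configured to mail any output a job produces, so a quiet run signals a healthy system.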

IDT has developed a technology suite, Automated Test and ReTest (ATRT), that enables the IT infrastructure itself to benefit from the concept of automated testing and re-testing. To fully realize this benefit, we must start from the specific requirements to be met, so that the resulting set of tests is viable.  Let’s assume that the IT department is responsible for network security.  This includes ensuring that all current security-relevant patches are applied to all network devices, that all security configurations on network devices are continually verified to be correct, that all critical devices are backed up, that a daily inventory of new devices is reported and that a weekly penetration test of all network devices is performed.
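Of the requirements listed above, the daily inventory of new devices is the simplest to express as a repeatable test. The Python sketch below compares two inventory snapshots; how the snapshots are collected is left out, and the function name and the use of plain device identifiers are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def inventory_changes(previous: Iterable[str], current: Iterable[str]) -> Dict[str, List[str]]:
    """Compare two device inventories (for example, yesterday's and today's).

    'new' lists devices seen now but not before; 'missing' lists devices that
    were present before and have since disappeared.
    """
    before, after = set(previous), set(current)
    return {
        "new": sorted(after - before),
        "missing": sorted(before - after),
    }
```

For example, `inventory_changes(["router-1", "switch-1"], ["router-1", "switch-1", "laptop-7"])` reports `laptop-7` as new and nothing as missing. Reporting missing devices as well goes slightly beyond the stated requirement, but it comes from the same comparison.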

In a homogeneous OS environment, the above task is fairly straightforward. However, as more operating systems come into the mix, either tools that are compatible across all the varied OS types must be purchased (normally sending the price rocketing), or the IT department must maintain separate infrastructures to test each different type of device (a normal but usually frustrating aspect of IT).  The real power, then, is the ability to aggregate all of these heterogeneous systems under one umbrella, create reports in a common fashion and solve outstanding issues with expert system knowledge embedded in the system test itself.
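One way to picture reporting "in a common fashion" is to normalize each platform's results into a single record type before any report is built. The Python sketch below does this for two input shapes. Both shapes (the `hostname`/`test`/`rc` dictionary and the `ComputerName`/`Check`/`Status` dictionary) are invented for this example and do not correspond to the output of any specific tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CheckResult:
    """Common record that every platform-specific result is converted into."""
    host: str
    os_family: str
    check: str
    passed: bool


def from_linux(record: dict) -> CheckResult:
    # Hypothetical input shape: {"hostname": ..., "test": ..., "rc": 0}
    return CheckResult(record["hostname"], "linux", record["test"], record["rc"] == 0)


def from_windows(record: dict) -> CheckResult:
    # Hypothetical input shape: {"ComputerName": ..., "Check": ..., "Status": "OK"}
    return CheckResult(record["ComputerName"], "windows", record["Check"], record["Status"] == "OK")


def failing_hosts_by_check(results: Iterable[CheckResult]) -> Dict[str, List[str]]:
    """One report for all platforms: each failing check mapped to the hosts failing it."""
    report: Dict[str, List[str]] = {}
    for result in results:
        if not result.passed:
            report.setdefault(result.check, []).append(result.host)
    return {check: sorted(hosts) for check, hosts in sorted(report.items())}
```

Once every result is a `CheckResult`, supporting another operating system means writing one more small adapter function; the reporting code stays the same.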

In my next blog post, I will describe an in-house solution that IDT is using, built from both open-source tools and IDT’s own products, to continuously ensure a viable IT infrastructure.  By leveraging the automated testing and re-testing of specific IT ‘touch points’, or known areas of the system where testing should return consistent sets of data, the IT infrastructure can run with a high assurance of continued reliability and availability. Adding expert systems further increases maintainability and availability.

Tim Schauer is a Senior Systems Analyst at Innovative Defense Technologies (IDT).