The Red X® System for Product Reliability
John Abrahamian, Richard Hell, Craig Hysong
Shainin Problem Solving and Prevention
Globalization has Triggered a New Way of Thinking in the Engineering Domain
A New Paradigm of Reliability Must Follow
A System for Product Reliability
Why a new product development testing approach is more critical than ever in today’s competitive environment.
The engineering domain has changed significantly during the last decade. Today, collaborative engineering, digital mock-up, and computer-aided engineering are common tools and practices across the manufacturing industries. Even companies that continue to build their strategies and value systems around Clients, Costs, and Competition have experienced significant changes in the way they develop products.
One driver of this change is that the customer has access to a wealth of comparative information about companies, products, competitors, and costs. This has resulted in an increase in fact-based buying, a decrease in brand loyalty, and consequently ever-changing customer requirements. New products are released to the marketplace at a pace unimaginable ten years ago. Product life-cycles are shortened, as are the budgets and time for development. Adding further complexity for product development and production, warranties, which once dealt only with replacement or repair of the product within a specific period of time, have now been extended to parameters such as reliability, down-time, and maintenance costs. What’s more, warranty costs have exploded: in 2007, two of the top global automakers spent roughly 3% of their auto sales revenue fixing vehicles under warranty.¹ Also, the shift in value creation, with a clear transformation of the supplier ecosystem to a global basis that sources products from all across the planet, requires a rethink of the reliability system currently implemented by many manufacturing clients in order to provide assurance of the quality and reliability of components and finished products.
All of these criteria are directly impacted by product quality and by its sustainability and predictability; hence the real need for a reliability process covering the entire product lifecycle.
The competitor who once gained market dominance by being “first to market” has been replaced by the competitor who succeeds in satisfying the customer’s needs and is “best to market”. However, the focus today is still mainly on the traditional development model and on increasing supplier proliferation along the entire product value stream. If the predicted shift in value creation in the automotive industry, as shown in Exhibit 1, is indicative of other manufacturing industries as well, the need for a new “Paradigm of Product Reliability” system is obvious. Shorter development cycles, shifts in value creation, and the exponential increase in product functionality and complexity are the driving forces behind the need for an improved and efficient Product Reliability system. To achieve increased functionality, today’s products must be a seamless blend of mechanical and electrical systems controlled flawlessly by software logic of ever-increasing magnitude.
¹ Warranty Week, from SEC data
Shortcomings of reliability development
Many engineers and managers view product failure during development testing as a negative that reflects poorly on their abilities. They would prefer that the product under development not fail. While it might seem that “successful” test results show that they’ve done their job, that approach misses failure modes and masks poor reliability.
- How often has a product built in the model shop to tolerance norms performed superbly in development testing, yet the same product, built under normal production conditions, performed poorly in the customer’s hands?
- How often has a product that tested successfully to the perceived customer requirements for cycles, time, or miles failed in the field well short of the specified requirements?
- How often has the customer discovered new failure modes not previously discovered in development testing?
Chances are, you’ve experienced all of these unpleasant surprises. “Success Testing” may make the designers feel good, but it delivers very little reliability information to your product development team. When a product passes a “Success Test,” all the team knows is that the product did not fail under the loads applied for the time they were applied. But how strong is the product? Was the test sample at the strong end of the product distribution, or was it just about to fail? Were the loads applied and their manner of application really the same as those your customer will apply? Or were they modified for the convenience of the testing lab? Testing to failure provides much more useful information. What was the actual failure mode? How long after the expected end of life did the failure occur?
Driving product to failure conceptually is critical in the early development stages and sets the foundation for how reliable a product can be. Effective conceptualized testing starts with risk analysis to identify high-risk areas of the design and to focus limited modeling and testing resources in those areas. The output of conceptualized testing is the foundation for a product’s reliability performance and mitigates the chances of product failure during customer use. Probing a product’s performance limits to time, environments and use profiles is critical in the later development stages to validate whether reliability targets will be achieved.
Testing product to failure enables one to determine whether it has been under- or over-designed. Because testing a product to failure can be a lengthy task, and because there is a need to reduce product development times, environments, cycles, and loads must be realistically “stepped up” so that the design is overstressed to failure and timely, useful information is obtained. However, this increase in stress must be carefully planned and executed to avoid “stepping over” failure modes or creating unrealistic failure modes. Otherwise, the results obtained may not be as useful and meaningful as they should be.
In any reliability development system the possibility of missing a critical environment or stress level exists. To help assure that this isn’t missed, field validation—checking out product in the hands of customers—must be executed. Field validation analyzes good product from the field and compares field degradation or decay to that observed during test to failure. If the rate of field decay is greater than expected, that is an indication of a missing environment or stress level and allows for action to be taken before the customer experiences a problem.
I. Risk Analysis
Our approach begins with a risk analysis that highlights potential problems with a specific design through a deeper understanding of how the product delivers its various functions. By focusing attention on the areas of the product that represent the greatest risk, the resulting development activities and testing can have the greatest impact on improving reliability. This phase also produces a focused reliability test plan at the component and system level, accounting for the unique risks posed by the product and its intended applications. Because this phase can begin before any hardware has become available, product reliability can be improved earlier than is possible in other approaches. High-risk functions can be determined by examining the reliability performance of previous, similar product designs as well as highlighting the change points between those similar designs and the proposed one.
Each high-risk function is diagrammed with a function model. This diagram captures how the product is intended to achieve the high-risk functions and is created using a small, cross-functional team including design, product, analytical, test, and manufacturing engineers. This has proven to be a disciplined and efficient means to highlight critical cause/effect relationships and identify potential weak links early in the development cycle that can later be modeled, simulated, or tested. Other methodologies often begin with a block diagram of the various components and an FMEA of how each one can fail. The function model approach provides a greater insight into the physics of how the product is supposed to function and deliver customer enthusiasm. Weaknesses such as coupling and other forms of design conflicts are exposed and can be addressed. It is common and easy for teams to make improvements in the design during this phase, as their combined understanding of the product increases.
The conceptualized failure modes revealed during the risk analysis are later explored through focused, design-specific testing to further improve the team’s knowledge of the product in the areas of uncertainty. The resulting reliability test plan is broad enough to address the three patterns of failure:
- Infant mortality due to manufacturing variation
- Random failures due to inadequate design margin
- Wear-out failures due to a product’s sensitivity to specific environments and cyclic loading
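These three patterns are commonly associated with the shape parameter of a Weibull distribution: the failure rate falls over time when the shape is below 1, stays constant at 1, and rises above 1. The sketch below illustrates that standard relationship only; the function names and the tolerance band around 1 are assumptions made for illustration and are not part of the Red X® System.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure (hazard) rate: (shape/scale) * (t/scale)**(shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def failure_pattern(shape, tol=0.05):
    """Map a Weibull shape value to one of the three failure patterns.

    tol is an illustrative band around 1.0 that is treated as a
    constant failure rate; it is an assumption, not a standard value.
    """
    if shape <= 0:
        raise ValueError("shape must be positive")
    if shape < 1.0 - tol:
        return "infant mortality"  # failure rate decreases with time
    if shape > 1.0 + tol:
        return "wear-out"          # failure rate increases with time
    return "random"                # failure rate roughly constant


if __name__ == "__main__":
    for beta in (0.5, 1.0, 3.0):
        print(beta, failure_pattern(beta))
```

A test plan that only produces shape estimates in one of these regions has likely not exercised the other two patterns.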
There is no separate step for improvement or optimization. Those activities occur throughout development and within each of the phases, as knowledge is gained about how the design functions and how failures occur.
II. Conceptualized Failure Mode Testing
Testing to confirm the cause/effect relationships between a specific design and the resulting functions, as well as failure modes, occurs in this phase. Each product design has its own area of specific risks, and consequently specific tests to explore those risks. Infant mortality-type reliability failures exhibit a decreasing failure rate over time; they are due to poor manufacturing quality and variation in the product introduced to the field. Units on the weak end of the distribution fail early and are removed from service, leaving only those fit to survive in place. The testing to expose this sensitivity must focus on manufacturing variation and its effect on the product performance. Special test build samples are created in order to understand this relationship.
The Risk Analysis phase identifies the features and properties of specific components that tend to vary. Testing to failure is the norm, as opposed to pass/fail or testing to success, to learn how close the product is to a particular failure mode, thereby allowing safety margins to be calculated. It is common for new measurement systems to be developed here that yield a variable reading of a product’s propensity to fail, replacing traditional ones that can only reveal that a product has failed. Often, these measurement systems can remain in place and serve as a quality audit or end-of-the-line test within a production process, to prevent poor-quality product from being shipped. As unexpected failures or unexplained performance results occur, a disciplined root-cause approach using detailed failure analysis and convergent strategies allows the team to quickly and efficiently identify the critical component differences driving the variation in expected output. This minimizes the risk of the design team making a series of changes without a demonstrated understanding of the physics of the failures. If high-risk manufacturing operations need to be addressed, the same risk analysis and conceptualized failure mode testing steps can be applied in parallel with the product design activities.
Using the conceptualized failure modes established during the risk analysis, the product is exposed to a variety of anticipated environments and is tested to failure to determine its robustness to wear-out type failure modes. The same principles applied in earlier tests that consider the effect of product variation are included, but in this case examining the potential interaction between specific design features with harsh stresses and environments. As weak links are exposed, the design can be improved to withstand the specific environments. All of the testing we’ve discussed so far has focused on high-risk functions and their resulting failure modes. The next round of testing is a discovery phase, intended to reveal unknown, unanticipated risks. Overstress Probe Testing™ identifies the weak links in the design by combining multiple stress environments and cycling the product to its breaking point.
III. Overstress Probe Testing™
Overstress Probe Testing™ overlays the customer usage stress-time distribution with the weakest failure mode distribution for the product being developed. Comparing the tails of these two distributions allows us to calculate the true margin of safety, as well as the expected product reliability in the field with statistical confidence. This is accomplished by testing a minimum number of samples to failure while simulating the actual customer’s usage. Overstress Probe Testing™ accomplishes these objectives in the most economical fashion through a clever modification of Weibull analysis.
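The idea of overlaying a usage distribution on a strength distribution can be illustrated with the textbook stress-strength interference calculation. The sketch below assumes both distributions are independent and normal, which is a simplification chosen for illustration; it is not the modified Weibull analysis used in Overstress Probe Testing™, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_reliability(stress_mean, stress_sd, strength_mean, strength_sd):
    """Return (margin_z, reliability) for independent normal stress and strength.

    margin_z is the gap between the means in units of the combined
    standard deviation; reliability is P(strength > stress).
    """
    spread = math.hypot(stress_sd, strength_sd)
    if spread == 0:
        raise ValueError("at least one standard deviation must be non-zero")
    margin_z = (strength_mean - stress_mean) / spread
    return margin_z, normal_cdf(margin_z)


if __name__ == "__main__":
    # Hypothetical values in arbitrary stress units.
    z, r = interference_reliability(100.0, 10.0, 160.0, 15.0)
    print(f"margin = {z:.2f} sigma, reliability = {r:.5f}")
```

The calculation makes the point of the paragraph concrete: reliability depends on how the tails of the two distributions overlap, not on whether a single sample survived.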
The testing of a product to failure accomplishes two very important goals:
- Identifies the weak link in the product design by revealing the dominant, earliest failure mode which the customer is likely to experience. Consequently, the resources of the organization can be targeted to improving this area, which is efficient and cost-effective.
- Allows the engineer to compare the time and stress values at which the product fails and the time and stress values at which the customer uses the product. This allows management to make business decisions with full knowledge of the margin of safety.
The need to shorten product test time has resulted in an increase in the use of accelerated testing techniques. These approaches increase the environmental stresses until a failure is achieved. But it is not as simple as just turning up the stress dials on different environments. Choosing the right environmental stress levels and their ratios to each other is critical in creating a failure mode that is actually representative of what the customer might experience. Turning up one stress environment, such as vibration, may result in a test to failure in an accelerated fashion. But a real problem with this approach is that it isn’t clear whether the failure is an artifact of the way in which the test was run or if it is realistic. Disagreement over how to interpret the test results is common, and places the organization in the difficult situation of either adding cost or risking premature failures. In point of fact, the way the various environmental stresses are increased determines which failure modes will occur and ultimately the validity of the test. Since we don’t know which failure mode represents the weakest link, we can’t arbitrarily turn up certain stress environments over others. Just as important, we must be thorough with the environments we include in the test, and not use the limitations of the current equipment as an excuse to omit critical environments. Field failures are often the result of an interaction between two or more stress environments. Our experience is that persistence and creativity are required to develop tests that include the various environments, even if they’re not readily anticipated.
The following graphic illustrates how the Overstress Probe Test™ deals with these problems. The vertical axis combines the stress levels from each environment. The way to determine the correct environments to include in the test is to measure a sample of product in the customer application. An estimate of the worst stress level for each environment can be determined statistically from the field measurements. Guessing at these levels or merely setting them to specification requirements often results in an incorrect ratio of stress levels, and thus an inaccurate assessment from testing.
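One simple way to turn field measurements into a worst-case level per environment is a mean-plus-k-sigma estimate. The sketch below assumes roughly normal measurements and uses k = 3; both choices, along with the environment names and readings, are assumptions for illustration rather than the statistical method the authors use.

```python
import statistics


def worst_case_levels(field_data, k=3.0):
    """Estimate a worst-case stress level for each environment.

    field_data maps an environment name to a list of levels measured
    in the customer application. The estimate is mean + k * sample
    standard deviation; k = 3 is an illustrative choice.
    """
    levels = {}
    for env, readings in field_data.items():
        if len(readings) < 2:
            raise ValueError(f"need at least two readings for {env}")
        levels[env] = statistics.mean(readings) + k * statistics.stdev(readings)
    return levels


if __name__ == "__main__":
    # Hypothetical field measurements.
    data = {
        "vibration_g": [2.0, 2.4, 2.2, 2.6, 2.8],
        "temperature_c": [60.0, 65.0, 70.0, 75.0, 80.0],
    }
    for env, level in worst_case_levels(data).items():
        print(f"{env}: {level:.2f}")
```

Because every level comes from the same measured population, the ratios between environments follow the field data instead of equipment limits or specification values.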
Many failure modes need environmental stresses applied over a number of cycles to manifest themselves, such as fatigue, wear, corrosion, or other conditions where material properties degrade. Notice that in the first portion of the probe test, called the “Operating Rectangle,” the product is cycled to the maximum combined stress levels along the operating profile. Only after the product has had an exposure to these combined environments and a chance to degrade in strength should we move to the overstress portion of the test. If we were to jump to higher stress levels at the start of the test, it would certainly accelerate the test to failure and shorten the test time, but we would miss an entire category of failure types, specifically wear-out.
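The shape of such a test can be sketched as a list of stages: an operating stage at 100% of the worst-case field levels, followed by small overstress steps that raise every environment together. In the sketch below, the step size, cycle counts, and the choice to scale each environment by the same percentage are assumptions for illustration; how a given environment should actually be scaled is an engineering decision.

```python
def probe_test_profile(worst_case, operating_cycles,
                       step_pct=10.0, cycles_per_step=100, max_pct=200.0):
    """Build a list of (stress_pct, cycles, levels) test stages.

    Stage 0 cycles the product at 100% of the combined worst-case
    levels. Later stages raise all environments together in equal
    percentage steps, so their ratios to each other stay fixed.
    """
    if step_pct <= 0 or max_pct < 100.0:
        raise ValueError("step_pct must be > 0 and max_pct >= 100")
    stages = [(100.0, operating_cycles, dict(worst_case))]
    n_steps = int((max_pct - 100.0) / step_pct + 1e-9)
    for i in range(1, n_steps + 1):
        pct = 100.0 + i * step_pct
        levels = {env: lvl * pct / 100.0 for env, lvl in worst_case.items()}
        stages.append((pct, cycles_per_step, levels))
    return stages


if __name__ == "__main__":
    # Hypothetical worst-case levels.
    worst = {"vibration_g": 3.0, "temperature_c": 90.0}
    for pct, cycles, levels in probe_test_profile(worst, 1000, max_pct=150.0):
        print(pct, cycles, levels)
```

Keeping the first stage long relative to the steps is what gives cumulative damage mechanisms time to develop before the overstress begins.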
Once we begin overstressing, the right approach is to continue along the combined stress vector and increase the many stresses simultaneously, in small steps to reduce the risk of jumping over a failure mode. By keeping the stress increases in the same ratios along the vector, we increase the likelihood that our test failures approximate the field failures. We can then apply the Weibull model to the failures to estimate the product reliability in the field. Using a stress-time percent as the horizontal axis of the Weibull plot to replace the traditional time or cycles allows us to plot the failures, estimate the population failure distribution, and determine the design margin relative to the combined stresses. Reliability can be estimated with appropriate confidence bands. This modification to Weibull analysis was conceived and applied during the development of the lunar module for the Apollo space program in the early 1960s. This test method has been applied to a variety of products since, each resulting in a product that became a dominant force in its market, distinguished by its superior reliability.
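For readers unfamiliar with Weibull fitting, the sketch below shows a standard two-parameter fit by median rank regression, using Bernard's approximation for the plotting positions. It is a generic textbook fit, not Shainin's modification; treating the inputs as stress-time percent values at failure is an assumption, the sample values are hypothetical, and confidence bands are omitted.

```python
import math


def fit_weibull(failure_points):
    """Fit a 2-parameter Weibull by median rank regression.

    failure_points are positive values on the chosen horizontal axis
    (here assumed to be stress-time percent at failure).
    Returns (shape, scale).
    """
    xs = sorted(failure_points)
    n = len(xs)
    if n < 2 or xs[0] <= 0:
        raise ValueError("need at least two positive failure points")
    # Bernard's approximation to the median rank of the i-th failure.
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    lx = [math.log(x) for x in xs]
    ly = [math.log(-math.log(1.0 - f)) for f in ranks]
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    if sxx == 0:
        raise ValueError("failure points must not all be identical")
    # Linearized Weibull: ln(-ln(1 - F)) = shape * ln(x) - shape * ln(scale)
    shape = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sxx
    scale = math.exp(mx - my / shape)
    return shape, scale


def reliability_at(x, shape, scale):
    """Weibull reliability (survival probability) at x."""
    return math.exp(-((x / scale) ** shape))


if __name__ == "__main__":
    # Hypothetical failures, in percent of the worst-case field level.
    shape, scale = fit_weibull([130.0, 145.0, 150.0, 160.0, 175.0])
    print(f"shape = {shape:.2f}, scale = {scale:.1f}")
    print(f"estimated reliability at 100%: {reliability_at(100.0, shape, scale):.4f}")
```

Evaluating the fitted curve at 100% gives a point estimate of reliability at the worst-case field level, and the distance between 100% and the early failures indicates the design margin.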
The following table highlights the key differences between overstress probe testing and conventional accelerated testing.
| Conventional Accelerated Testing | Overstress Probe Testing™ | Why This Matters |
|---|---|---|
| Single environment at a time | Multiple environments tested simultaneously | Some failures will only occur when two or more specific environments are present |
| Large jumps in stress levels | Small step increases in stress levels | Some failure modes (primarily wear-out) won’t be missed |
| Holding at steady state conditions | Cycling the various combined environments along a profile which approximates customer usage | Testing close to customer usage will result in realistic test failures |
| Stress levels based on equipment capability, material limits, past test conditions, gut feel | Stress levels based on measured customer usage | Allows the stress-time margin of safety to be calculated |
IV. Field Validation
The last step in the Red X® System is Field Validation. Whenever a new product is launched, it is common to return or recall all failed units for analysis, but how often is perfectly good product recalled from the field? In Conceptualized Failure Mode and Overstress Probe Testing™, the use profiles, loads, and environments are developed from previous product applications or focused customer groups. If any of these are inaccurate, then product reliability could be at risk even with successful development testing.
Field Validation uses a Shainin® tool called Service Monitoring to systematically repurchase good product from the field over time. Some units are torn down and inspected for precursors to failure, while others are put through the same tests described earlier. Their resistance to failure can then be compared to new product performance. A rate of decay greater than expected indicates an unknown field environment or stress level, and means that reliability is at risk. Service Monitoring permits the discovery of accelerated field decay before the customer experiences any failures. This enables intelligent decisions to be made about current and future product while protecting the customer at all times.
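The comparison of field decay against expected decay can be sketched as a slope check. The example below assumes strength declines roughly linearly with time in service and flags the product when the fitted field slope is more than a chosen fraction steeper than expected; the linear model, the 10% tolerance, the function names, and the data are all assumptions for illustration, not the Service Monitoring procedure itself.

```python
def decay_rate(months, strengths):
    """Least-squares slope of measured strength versus months in service."""
    n = len(months)
    if n < 2 or n != len(strengths):
        raise ValueError("need matching lists with at least two points")
    mx = sum(months) / n
    my = sum(strengths) / n
    sxx = sum((m - mx) ** 2 for m in months)
    if sxx == 0:
        raise ValueError("months must not all be identical")
    return sum((m - mx) * (s - my) for m, s in zip(months, strengths)) / sxx


def field_decay_alert(months, strengths, expected_rate, tolerance=0.10):
    """Return True when field decay is faster than expected.

    Rates are negative when strength is being lost. expected_rate is
    the decay rate (<= 0) observed in test to failure; tolerance is an
    illustrative allowance before raising the alert.
    """
    return decay_rate(months, strengths) < expected_rate * (1.0 + tolerance)


if __name__ == "__main__":
    # Hypothetical strength readings from repurchased field units.
    months = [0, 6, 12, 18, 24]
    strengths = [100.0, 95.0, 90.0, 85.0, 80.0]
    print(decay_rate(months, strengths))
    print(field_decay_alert(months, strengths, expected_rate=-0.5))
```

An alert here would prompt a search for a field environment or stress level that the test profile does not yet include.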
Conclusion
Consumer behavior continues to evolve, with an ever-increasing demand for product functionality combined with the ability to acquire up-to-date information on product quality and reliability prior to making a buying decision. Given that the ongoing trend of collaborative engineering allows different suppliers to participate in product development, an enhanced reliability system is essential to offset the disadvantages of a fragmented engineering value chain. Shainin’s reliability system, along with a company’s existing tools and methods, enables a seamless integration of the heterogeneous virtual engineering value chain in today’s product development environment, while simultaneously reflecting possible risks along the entire product life-cycle. Through more efficient utilization of resources, the Shainin® reliability approach can increase product stability and performance significantly without any major disadvantages to development budgets or time to market.
Product reliability is often the distinguishing characteristic among competing products. The system we have outlined offers a balance between early, focused activities to address known high-risk areas and discovery-type testing and actions to expose unknown problems. Combining these proactive steps with Shainin’s powerful investigative strategies to quickly address problems as they occur results in a system for achieving superior reliability.
About the Authors
John Abrahamian, Vice President, Engineering Practice
John Abrahamian worked for several years as a design and development engineer of jet engines and space propulsion systems. Since joining Shainin in 1998, John has been instrumental in maximizing gains for clients using Shainin® strategies for problem solving and problem prevention on a wide range of product lines. Over the last ten years John has been one of the key figures in the development and implementation of Shainin’s Risk ReduXion® and Product Reliability Methodologies.
Richard Hell, Executive Vice President, Sales & Marketing Worldwide
Richard served as a Director for one of the world’s most distinguished information technology suppliers, with a focus on industrial solutions. He held many executive positions for a major German auto manufacturer, where he was a member of their corporate strategy team, focusing on IT strategy, architecture and standards worldwide. Based in Munich, Germany, Richard is responsible for the strategic business aspects of the Shainin portfolio, enabling End-to-End service delivery for People Enablement, Problem Solving Services, and Process Optimization & Problem Prevention.
Craig Hysong, Executive Vice President
Craig has held a wide range of senior positions within the automotive industry. His management experience includes manufacturing engineering, new product launch, cost improvement and product reliability. Craig’s outstanding contribution towards enhancing problem solving and quality management methods is reflected in the number of awards he has received from various industry associations. As an Executive Vice President of the company, Craig continues to work with clients to improve the delivery of technology and the development of standardized work practices. Craig is currently responsible for consulting services and certification worldwide.
The authors welcome your questions and comments through email by clicking on their names above, or on our Forum where you can create a topic thread and discuss this paper with others in the broader Shainin community.