Stop Putting Out Fires: Enhancing System Reliability
What You’ll Learn:
- How can systems be designed to guard against unexpected downtime?
- How does system reliability impact product development?
- In what ways can artificial intelligence and machine learning improve system reliability?
- What are the best practices for measuring system reliability?
A critical dilemma organizations face is how to keep their operations running smoothly without unforeseen hiccups. To save upfront costs, many operations tend to run their assets until a failure occurs. Yet, when equipment is allowed to run to failure, it damages more than just the system.
In addition to the loss of potential revenue, downtime can result in unused materials, urgent repair expenses, overtime staff costs, damaged customer loyalty and tarnished brand reputation. A 2023 report revealed that the world’s largest manufacturers lose $1.5 trillion yearly to production outages and that unplanned downtime costs 50% more now than in 2019-2020.
The solution is to create an interconnected system of manufacturing steps that work more reliably for extended periods without failures. This is the concept of system reliability: a measure of how well a system, such as IT or manufacturing equipment, functions correctly and consistently for a set period. Artificial intelligence (AI) and machine learning (ML) provide additional tools to increase system reliability by spotting trends and problems earlier. This helps organizations avoid problems, enhance the bottom line and reduce the person-hours needed when unexpected failures appear.
Organizations that improve system reliability, enhanced by incorporating AI and ML, take a giant step toward higher production outputs, improved plant and employee safety, and reduced costs. This is accomplished when systems are designed, maintained and operated at the highest and most reliable level possible.
It Starts at the Design Phase
In athletics, ability starts with availability: being on the field or court daily and not sidelined by injury. The same concept applies to manufacturing equipment and IT systems. In today’s automated, robotics-driven manufacturing operations, availability means a resource is ready to perform the tasks requested, and maintainability means the design is built so that finding and fixing bugs does not require a total rebuild.
This leads to reliability, where a resource carries out the transactions or steps required without error in all but the worst-case, unexpected scenarios. Designing systems to guard against the unexpected, mapping out all possible issues in advance, accounting for scalability and forecasting high demand are all crucial.
A failure mode and effects analysis (FMEA) can be part of this process to identify potential weak spots in a system. FMEA uses a step-by-step examination to determine possible design, manufacture or assembly failures.
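One common way to rank the weak spots an FMEA uncovers is the risk priority number (RPN), the product of severity, occurrence and detection ratings. The sketch below is a minimal illustration of that ranking, not a method described in this article: the `FailureMode` class, the 1-10 rating scale and every rating shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA worksheet (ratings assumed on a 1-10 scale)."""
    name: str
    severity: int    # 10 = most severe effect
    occurrence: int  # 10 = most frequent
    detection: int   # 10 = hardest to detect before it causes harm

    @property
    def rpn(self) -> int:
        # Risk priority number: the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative, made-up ratings for a conveyor line.
modes = [
    FailureMode("Conveyor belt tear", severity=7, occurrence=4, detection=3),
    FailureMode("Motor bearing seizure", severity=8, occurrence=3, detection=6),
    FailureMode("Sensor cable fault", severity=5, occurrence=5, detection=2),
]

for mode in rank_failure_modes(modes):
    print(f"{mode.name}: RPN = {mode.rpn}")
```

With these made-up ratings, the bearing seizure ranks first because it is both severe and hard to detect, even though it is the least frequent of the three.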
It’s critical to break down the silos often found in organizations—especially larger ones—to take full advantage of input from programmers, hardware engineers, sales and marketing teams that forecast future demand for products and services, and executives that espouse a vision of where and when growth is headed.
There are several areas to examine when designing system reliability:
- Computing power. Determine how much computing power will be needed now, in five years and in 20 years; what services end-users will demand; and the queries they will make.
- Associated IT systems design. Ascertain how to design and incorporate associated IT systems to allow for expansion with relative ease without going back to the drawing board and starting again.
- Required tools. In manufacturing operations, where high productivity and quality and a minimum of downtime are vital, it’s essential to determine if the tools (such as a software program) and asset availability are in place in case a quick fix is necessary.
- Capacity. In manufacturing industries, there is always a demand to increase production capacity. It’s important to consider this during the design stage.
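The computing power and capacity questions in the list above come down to projecting demand over several horizons. A minimal sketch follows, assuming simple compound growth; the growth rate, the starting figure and the `projected_demand` helper are all hypothetical, and real demand rarely grows this smoothly.

```python
def projected_demand(current, annual_growth, years):
    """Compound-growth projection of demand after `years` years.

    `annual_growth` is a fraction (0.15 means 15% per year). This is an
    assumed model, not a forecast method taken from the article.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return current * (1 + annual_growth) ** years


current_cores = 200   # hypothetical compute in use today
growth = 0.15         # assumed annual growth in demand

# The horizons named in the list: now, in five years and in 20 years.
for horizon in (0, 5, 20):
    needed = projected_demand(current_cores, growth, horizon)
    print(f"Year {horizon}: about {needed:.0f} cores")
```

Running the same projection with a pessimistic and an optimistic growth rate gives a range to design for, which is usually more useful than a single number.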
Another critical design component is a lifecycle cost analysis, which assesses the total cost of building, maintaining, upgrading or disposing of an asset. The lifecycle cost analysis of each component in a system can also be identified separately, allowing design engineers to propose mix-and-match scenarios that may bring additional vendors into the fold.
The essential elements of this analysis are to determine the weak points of the design and align this examination with current and future business needs. This is where removing those silos to receive all the input needed from various departments upfront can save on maintenance or replacement costs in the future.
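The per-component lifecycle cost analysis and the mix-and-match scenarios described above can be sketched as follows. This is a simplified illustration under stated assumptions: the vendors, the cost figures and the field names are made up, maintenance is a flat annual amount, and future costs are not discounted.

```python
from itertools import product


def lifecycle_cost(option):
    """Total cost of building, maintaining, upgrading and disposing of an asset."""
    return (
        option["build"]
        + option["annual_maintenance"] * option["years"]
        + option["upgrade"]
        + option["disposal"]
    )


# Hypothetical vendor options per component; all figures are invented.
options = {
    "conveyor": [
        {"vendor": "A", "build": 100_000, "annual_maintenance": 8_000,
         "upgrade": 15_000, "disposal": 5_000, "years": 10},
        {"vendor": "B", "build": 130_000, "annual_maintenance": 4_000,
         "upgrade": 10_000, "disposal": 5_000, "years": 10},
    ],
    "controller": [
        {"vendor": "A", "build": 40_000, "annual_maintenance": 3_000,
         "upgrade": 12_000, "disposal": 1_000, "years": 10},
        {"vendor": "C", "build": 35_000, "annual_maintenance": 5_000,
         "upgrade": 8_000, "disposal": 1_000, "years": 10},
    ],
}


def cheapest_mix(options):
    """Evaluate every mix-and-match scenario and return the lowest-cost one."""
    components = list(options)
    best = None
    for combo in product(*(options[c] for c in components)):
        total = sum(lifecycle_cost(o) for o in combo)
        if best is None or total < best[1]:
            best = ({c: o["vendor"] for c, o in zip(components, combo)}, total)
    return best


mix, total = cheapest_mix(options)
print(mix, total)
```

In this example the conveyor with the higher purchase price wins over 10 years because its maintenance cost is lower, which is the kind of trade-off a purchase-price comparison alone would miss.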
Incorporating AI and ML
AI, ML and the algorithms written to power these still-emerging technologies can be employed to detect processing and quality errors quickly, producing data that highlights a component or program that needs to be updated or replaced. Those reports can lead to ease of maintenance and enhance the reliability and availability of a system. An AI overlay incorporated in the design phase allows IT and other engineering personnel to shift their attention elsewhere to tasks requiring a more intuitive human touch to keep operations running more smoothly.
Software launches plagued by unforeseen bugs, systems that crash due to unanticipated high demand (and the inability to scale) and products with inherent design flaws not caught at the blueprint stage can dominate the news. The solution is to select components that align with the organization’s objectives.
An oil services company experiencing drilling field equipment failures overcame inefficient planning around drill breakage points and realized $30 million in savings. Integrating AI led to more accurate forecasts and more planned replacement and maintenance schedules, with less unexpected downtime.
How to Measure System Reliability
System reliability is measured using metrics such as mean time to repair (MTTR) and mean time between failures (MTBF). The time it takes to detect failures and the mean time until any failure occurs should be mapped ahead of time and compared against each other to formulate the best mix of such scenarios. MTTR, the average time to repair a service or system after a failure occurs, can be represented as an equation for the accounting and financial departments: total downtime divided by the total number of failures.
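The ratios described above can be sketched in a few lines. MTTR follows the downtime-to-failures ratio from the text. The MTBF formula (operating time divided by failures) and the steady-state availability formula, MTBF / (MTBF + MTTR), are standard definitions added here for illustration, and the sample figures are made up.

```python
def mttr(total_downtime_hours, failures):
    """Mean time to repair: total downtime divided by number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTTR")
    return total_downtime_hours / failures


def mtbf(total_uptime_hours, failures):
    """Mean time between failures: operating time divided by number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    return total_uptime_hours / failures


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up example: a 1,000-hour period with 8 failures and 40 hours of downtime.
period_hours = 1_000
downtime_hours = 40
failures = 8

repair = mttr(downtime_hours, failures)
between = mtbf(period_hours - downtime_hours, failures)
print(f"MTTR: {repair} h, MTBF: {between} h, "
      f"availability: {availability(between, repair):.2%}")
```

In this example the line fails on average every 120 operating hours and takes 5 hours to restore, so it is available 96% of the time. Cutting MTTR in half raises availability without changing how often failures occur.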
Every team member involved should adhere to the mantra of “get it right the first time.” When a problem arises, the ease of maintenance and ability to pinpoint a problem area, aided by the incorporation of AI and ML into the initial launch, can reduce the mean time needed to detect and repair a problem. A reliability engineer appointed to oversee this team can help tie it together and set an example for the design’s availability, reliability and maintainability.
In the design phase, it’s critical to anticipate how long it will take for failures to occur in hardware components (conveyors, routers, switches and other equipment), software applications, and physical connectivity and cabling, where wear and tear is a reality. Once a defect arises, the average time needed to find it is just as important to assess. Building adequate monitoring systems, aided by AI and algorithms, can shorten that cycle.
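As a rough illustration of how monitoring can shorten the time to detect a defect, the sketch below flags the first sensor reading that departs sharply from its recent baseline. It is a simple rolling z-score check standing in for the AI-aided monitoring the article mentions, not an ML model. The window size, the threshold and the vibration readings are all assumptions.

```python
from statistics import mean, stdev


def first_anomaly(readings, window=5, threshold=3.0):
    """Return the index of the first reading that deviates from the trailing
    window's mean by more than `threshold` standard deviations, or None."""
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: z-score is undefined, so skip this point
        if abs(readings[i] - mu) / sigma > threshold:
            return i
    return None


# Made-up vibration readings (mm/s) from a hypothetical motor sensor.
vibration = [2.1, 2.0, 2.2, 2.1, 2.0, 2.1, 2.2, 2.0, 3.5, 3.8]
print(first_anomaly(vibration))  # index of the first reading flagged
```

A check like this runs on every new reading, so a drift is flagged within one sampling interval instead of waiting for an operator to notice it. A production system would need tuning to keep false alarms manageable.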
Even the best components and systems are expected to fail. Determining the mean time to failure (MTTF) during the design phase is an exercise that can save significant headaches and reduce downtime. In the growing field of site reliability engineering, software tools that monitor tasks such as system management and application performance become more reliable if these other factors are forecasted upfront.
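A minimal sketch of an MTTF estimate follows, assuming lifetime data from testing or field history is available. The reliability function additionally assumes a constant failure rate (an exponential model), which is a common simplification and not something the article specifies. The lifetimes shown are made up.

```python
import math


def mttf(lifetimes_hours):
    """Mean time to failure: the average observed lifetime of failed units."""
    if not lifetimes_hours:
        raise ValueError("at least one observed lifetime is required")
    return sum(lifetimes_hours) / len(lifetimes_hours)


def reliability(t_hours, mttf_hours):
    """Probability a unit survives to time t, assuming a constant failure
    rate of 1 / MTTF (exponential model)."""
    return math.exp(-t_hours / mttf_hours)


# Made-up lifetimes (hours) for four units of a hypothetical component.
lifetimes = [9_000, 11_000, 10_500, 9_500]

estimate = mttf(lifetimes)
print(f"MTTF: {estimate:.0f} h")
print(f"Chance of surviving 1,000 h: {reliability(1_000, estimate):.1%}")
```

Components subject to wear and tear often fail more frequently as they age, so the constant-rate assumption tends to be optimistic late in life. A Weibull model is a common alternative in that case.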
Mapping out the mean time needed to detect and repair failures fosters more collaboration at the front end of the design process. Over time, this camaraderie will become part of an organization’s DNA. Some businesses have a head start in establishing that culture: highly regulated industries like the energy sector already have an established baseline of safety and maintenance standards.
A reliability engineer (RE) is essential to reducing system downtime. Upper management’s support for this position is imperative. A seasoned RE provides input and asks questions about component lifespan, balances upfront costs against the price tag for the repair and replacement of components, and establishes key performance indicators (KPIs) to track data points such as average parts wear, maintenance time, repair costs and unplanned stoppages. The RE can also provide overall vision and direction in the initial design phase and serve as the leader team members can turn to for guidance after a failure occurs.
Proactive Maintenance is Foundational to a Solid Bottom Line
Anyone who has been in an industry, particularly some long-established sectors such as parts manufacturing, assembly line plants or the petroleum sector, is familiar with unexpected downtime and the teams in place that are “putting out fires” regularly. In many circumstances, there are also safety factors to consider.
Over the past few decades, data-driven businesses that rely heavily on digital services have come to face this downtime reality as well. It is costly in terms of productivity and its impact on the company’s financial health. Organizations that improve system reliability, perhaps by incorporating AI and ML, can achieve higher production outputs, improved plant and employee safety, reduced costs, and increased profitability and customer satisfaction.
About the Author

Sanjib Das
Sanjib Das, CMRP, is a professional engineer with more than 22 years of experience in developing and implementing reliability tools such as asset criticality assessment, reliability-centered maintenance, spare part assessment, risk assessment and root cause analysis across the oil and gas, chemical, automotive and biotech industries. He has a proven track record of improving reliability and reducing breakdown maintenance by 10-20%. Das holds a master’s degree in mechanical engineering from the National University of Singapore. Connect with him on LinkedIn.
