What to do about Fragile Systems

September 30, 2023

When we look at legacy systems built on out-of-support platforms in deprecated languages using dated data repositories, we can easily relate to their fragility. We’ll discuss this in Part 1.

However, there are also new products built on less solid foundational practices that introduce a whole new level of fragility into the solutions ecosystem. These systems are plagued by best intentions and poor execution. We’ll discuss these in Part 2.

Finally, there exist fragile systems serving purposes that are well accepted under certain circumstances. Let’s learn when it’s OK to have fragility in Part 3.

Ignoring difficult situations and decisions is almost a form of art. We hope the bad we aren’t addressing will remain dormant so we never have to tackle its ramifications head-on. To be clear:

Hope is not a strategy!

The more we ignore, the deeper the hole of despair when bad happens. Let’s examine more practical approaches to deal with fragile systems to, perhaps, avoid bad altogether.

We need to address technical debt before it leads to bankruptcy!

Part 1 – Legacy Systems

I was recently on a support call to renew my membership with an organization when a glitch prevented them from taking my credit card; their standard operating procedure was to reboot their legacy system. OK. That happens. They took my phone number and said they would call me right back after the reboot completed. That was six weeks ago and there has been no call back. This was actually my third attempt to renew my membership, as the other two also ended abruptly due to software “glitches”. Lost revenue and lost customer. Worse, this will never be noticed because these old, fragile systems have no telemetry that tracks their losses! It’s all absorbed into acceptance that these things happen.

Every organization has some sort of legacy system like this. It’s worse when they are a key dependency for the operation of other, modern systems. The original authors have moved on, retired, or were consultants now long gone. These legacy systems generally require heavy manual intervention, secretly cost organizations dearly in lost revenue, lost customers, and frustrated employees, and are moderate to high security risks.

A thorough analysis of the actual cost and loss from legacy systems is rarely undertaken. While maintenance and operational costs are more easily attainable, actual impact and hidden costs are much more difficult to obtain.

Discover hidden costs & fallout with suggested takeaways

Cost & impact of human manual interaction to correct errors:

You actually have to pay people to manually fix things when legacy failure issues arise. While fixing these issues, other work is not being done (it’s a double-hit).
Employees fixing issues manually often become more frustrated and dissatisfied which leads to decreased morale & productivity.
Customer experience is negatively impacted when manual interaction is necessary. Customers are often compensated for their troubles with a discount or free service, which adds to the cost of correcting the mistake.
Takeaway: Calculate all compounded costs, compensations, and delays associated with every manual interaction type including impact to your customer NPS.

Risk of unsupported platforms (hardware, OS, software, and such) subject to failure with long or no recovery options:

Because many legacy systems cannot be redundant, organizations often must pay for cold stand-by configurations to rehydrate when bad happens. What they fail to do, however, is regularly test recovery.
Backups never tested for restoration are essentially no backup at all.
Takeaway: Tally the cost of keeping cold stand-bys on hand and run regular restoration exercises of these systems and databases to determine cost, time, and impact should rehydration become necessary. How many of your systems are actually unrecoverable?

Loss of revenue due to your inability to properly service customers where & how they expect to be serviced:

You are losing customers more often due to the many disparate and disconnected systems you force your customer-facing resources to use. When you receive a poor review, you blame it on your support personnel, not your software deficiencies, and spend money training your resources to do better instead of fixing your systems, which serves to further diminish morale.
If your support staff churn is high, consider looking into the systems they are using to interact with customers. Do these systems serve their needs or do they undermine their effectiveness?
If your legacy systems are difficult for your own resources, how much more difficult are they for your upstream customers? Self-service might actually drive self-destruct.
Takeaway: Revise customer and employee satisfaction surveys to ascertain system vs. human deficiencies to understand where to focus. Monitor self-service and equate early departures to lost customers. Count & calculate this impact cost.

Fines & penalties where legacy systems are no longer compliant with the latest regulations:

There exist a number of organizations that consciously pay rather high fines as they are perceived to be cheaper than replacing the legacy software responsible for them!
Systems out of compliance are rarely in isolation. They often negatively affect both internal and external partner integrations.
Takeaway: Tally existing and potential compliance penalty costs for each legacy system and quantify all integrations affected should a legacy system fall out of compliance.

Fear of breaking production when touching & releasing updates to fragile legacy systems:

There exist production systems where access to, or understanding of, source code is greatly compromised. Minor configuration changes are scary and anything involving code is outright frightening. Many carry upstream and downstream dependencies that require hundreds of teams to work together to release.
Legacy deployments are often void of any rollback process. If a rollout breaks production, it may remain broken for days or weeks, forcing manual process playbooks to be activated.
Many legacy systems suffer from hard-coded IP addresses deeply embedded inside their integrations. Simply moving servers to a different or dynamic network address space will break production instantly.
Takeaway: Build a dependency tree for each legacy system and calculate the compounded costs for weeks of down time for each ecosystem affected.
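
As a rough illustration of that last takeaway, the sketch below walks a dependency map to find every system affected when one legacy system goes down, then multiplies their combined daily downtime cost by the length of the outage. The system names, the cost figures, and the helper functions are all hypothetical placeholders; substitute your own inventory and numbers.

```python
from collections import deque

# Hypothetical map: system -> systems that depend on it directly.
DEPENDENTS = {
    "legacy_billing": ["order_portal", "finance_reports"],
    "order_portal": ["mobile_app"],
    "finance_reports": [],
    "mobile_app": [],
}

# Hypothetical cost (in dollars) of one day of downtime per system.
DAILY_DOWNTIME_COST = {
    "legacy_billing": 10_000,
    "order_portal": 4_000,
    "finance_reports": 1_000,
    "mobile_app": 2_500,
}


def affected_systems(root, dependents):
    """Return root plus every system that depends on it, directly or transitively."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # guards against cycles and double counting
                seen.add(child)
                queue.append(child)
    return seen


def downtime_cost(root, dependents, daily_cost, days):
    """Compounded cost of `days` of downtime across the whole affected ecosystem."""
    systems = affected_systems(root, dependents)
    return sum(daily_cost.get(name, 0) for name in systems) * days


if __name__ == "__main__":
    two_weeks = downtime_cost("legacy_billing", DEPENDENTS, DAILY_DOWNTIME_COST, days=14)
    print(f"Two weeks of legacy_billing downtime: ${two_weeks:,}")
```

Even a crude model like this makes the point that the cost of a legacy outage is the cost of its entire downstream ecosystem, not just of the legacy system itself.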

Lost innovation opportunities and competitive advantage because of legacy dependency anchors preventing modernization:

Every organization must leap onto some emerging technology-based fad (from RPA bots to AI). When legacy integrations are present, you either bolt on the new tech or run the new tech as shadow-IT.
Some innovations like those around Zero Trust require new ways of writing code that simply are not possible with legacy systems. Catering to their least common denominator, organizations either entirely miss out or must greatly compromise advancements in what is new and emerging.
My car’s oil change still requires paper forms. My carwash scans my vehicle and keeps meticulous electronic records of all my visits and tells me when I get my “free” wash! Over time, noticeable advancements in customer experiences through technology will negatively impact those who are too slow to change because of legacy anchors.
Takeaway: Transparently create a list of new technologies currently having industry impact that you are avoiding. Calculate the business & financial impact & benefits you’re missing by missing out.

Brand reputation damage/lawsuits as many legacy systems exist under security exceptions:

Security has one major vulnerability: legacy systems. Legacy systems can literally break any security policy and still get an “exception”, which is some mysterious get-out-of-accountability card for everyone in the org, under the premise there is no other option.
Reality check: exceptions never lessen the vulnerability threats they are created for. In fact, organizations are generally at greater risk for each exception because the legacy systems are most likely more
Where there are exceptions, there also are necessary overly restrictive security perimeters that inhibit other more modern & secure systems from interacting in more optimal ways.
Takeaway: Review all security exceptions and calculate the depth of impact and related costs, including necessary modern system workarounds, those which might arise from lawsuits, and impact from damage to your corporate reputation.

Whew!

Now, calculate & total these costs & potential losses. According to the Consortium for Information & Software Quality (CISQ) studies in 2018, 2020, and 2022, the cost of poor software quality in the US has grown to $2.41 Trillion, with Legacy Systems accounting for $520 Billion.

Change is difficult because we’re presented with too many “fix frameworks”, tried a few, and learned our culture, talent, maturity, sponsorship, and commitment fall short of what’s required to execute to completion. In a traditional manner, it truly is a race where only Unicorns win.

Let’s Go Non-Traditional

Are you communicating the costs and risks associated with your fragile legacy systems to all proper audiences?

Name your most at-risk legacy systems in the “Risk Factors” section of your SEC Form 10-K/Form 40-F or equivalent annual report, identifying the financial & business impact should they fail.
This heightens awareness and provides fair warning should/when they fail and negatively impact your revenue stream.
One insurance company I admire already does this and has funds in reserve to accommodate future failure.

Obsolete legacy systems by modernizing connected systems instead:

The strangler fig and the abstraction layer patterns both attempt to modernize legacy systems by gradually replacing parts of the legacy system with new, more modern components. However, upstream dependencies and downstream tight couplings often derail any progress.
Focus instead on replacing/modernizing your connected systems, first. This will obsolete your legacy systems so they can be sunset.

Apply Gall’s Law over big bang modernization efforts:

Funded projects naturally suffer scope creep: bolt-on features get added as the “new necessary” without additional funds, resources, or time.
Avoid this by attaining alignment with executive leadership to focus on one single Business Outcome for a legacy modernization initiative. Deliver only to that outcome. When bolt-ons are presented, weigh them against that one single business outcome and reject if it’s not critical path.

Force vendors to share the risk:

The flowery case studies vendors present are with environments and organizational maturities that are an impedance mismatch to yours. Vendors are often just as surprised when their product falls short!
Tie your purchase contracts to production delivery, operation, and performance contingencies.
Vendor platforms often seek to take control over your data as that’s easier than integrating with your sources of truth.
Architect your use of their platform to either sync only necessary metadata (in milliseconds) and link back to your source of truth, or provide push updates to your source systems.

Pull the plug (a.k.a. rip off the Band-Aid):

I am reminded regularly of systems that continue to run and produce output for which there is no audience. Heighten audits to rigorously seek these out and perform a [warm] shut-down/pause of identified systems (if an audience does exist, they will surface and you can reactivate).
High risk systems need to reduce their vulnerability footprint before harm ensues. For gradually increasing periods, [warm] shut-down/pause your most vulnerable legacy apps/systems. Give affected consumer apps/systems two options — (1) find a new way of operating without the targeted legacy system or (2) transfer budget & resources to help provide for its replacement.

Better understand legacy systems with Generative AI tools:

Many legacy apps/systems have nobody that fully understands them — especially when the platform, language, and data environment are no longer supported.
Generative AI companion tools (a.k.a. co-pilots) are your new best friend. Not only can they explain legacy code, they can often find flaws in it.

Fragile Legacy Systems are the Achilles’ Heel in the pursuit of the modern enterprise. The longer you wait to aggressively address them, the more they will cost you in the long run, and in ways you can neither imagine nor measure until it’s too late.

Part 2 – Modern System Fragility

Many organizations are being forced into providing quick software-based digital customer experiences, from IoT-immersed products such as the jewelry we wear, to flashy audio/visual entertainment options, to the cars we drive. Not all these organizations are prepared to produce robust software at the scale they must now deliver.

My wife’s new car has an infotainment system that adjusts everything from dashboard style and music to seat position based on cues ranging from the key fob to the phone’s Bluetooth signature. It gets (guesses?) who’s driving right about 60% of the time. When it fails, you must manually log in with your user id. Yes, that’s right, user id. That takes 90 seconds, on average, to load your settings. If you’re impatient (because your primo Costco parking space is being hounded by 5 other patrons) and drive off before it’s done loading, an error message appears and everything turns off. I mean everything: no navigation, no cell phone integration, no radio. Black screen.

Our home audio/visual entertainment system is no different. Just sneeze in the family room and I’m on the phone to my system’s licensed integrator scheduling a service call (at my expense) to reconfigure their software to regain volume control. This is almost a monthly event.

I consulted my favorite Generative AI duo (Bing Chat w/ChatGPT-4) for a less anecdotal example and it answered in the most unexpected way — it actually failed!

At first I thought it was being funny with me by maybe simulating a failure, but rebooting and further testing (with recorded videos) found it to be completely unresponsive and broken. While Bing Chat acknowledged no outage, ChatGPT did.

Fragile. Modern. Systems.

No matter what your business, it will rely to some extent on software systems for its livelihood. Ford CEO Jim Farley said it best, “We’re not a car company anymore. We’re a software company that happens to make cars.”

Now, replace his “car” with your business’ products and read that aloud.

Are you? Are you a software company that happens to do what you do? If not, then heed the warning of Wednesday Addams (from The Addams Family): “Be afraid. Be very afraid.” Why? Because there are real software companies coming that will learn to do what you do, faster than you can learn to be a software company! Garrett Camp & Travis Kalanick of Uber and Logan Green & John Zimmer of Lyft had no prior experience in transportation and look at their impact.

In a June 8, 2023 Fully Charged podcast, Farley points toward several factors that promote fragility:

Legacy Business Practices — We farmed out all the modules that control the vehicles to our suppliers, because we can bid them against each other [for price]…We have about 150 modules with semiconductors all through the car. The problem is that the software is written by 150 different companies. And they don’t talk to each other.
Masters of Legacy Domains — I kept watching our [Internal Combustion Engine] ICE engineers try to figure out how to do over-the-air updates, or change the software for the vehicle [but] they’re not software people.
Legacy Architectures — It’s shocking to me how many [automakers] are sticking with very old electrical architectures and software from a confederacy [of vendors]. That will never work. No matter how many software engineers they hire, the code’s not going to work.
Talent Struggles — It’s difficult for legacy car companies to get software right. (Yes, he used the term “legacy car companies.”)

Let’s sum this up as a propensity to persist Legacy (a leadership call) and a lack of necessary Experience (a talent call).

Legacy thinking forced onto new platforms generally results in what Farley expresses. New platforms are generally meant to be leveraged with new ways of working for good reason. Yet, we insist on forcing legacy practices on them and somehow expect better outcomes.

One justification for this poor behavior is when new platforms are built and ride on the shoulders of legacy designs that are now “hidden”. Those reluctant to change will hold to those legacy underpinnings as a justification to prevent the actual intent for the modernization. Yes, serverless functions ultimately execute on a server somewhere, but the paradigm shift/intent is to engineer and leverage them as if this new realm is made of unbounded compute that scales infinitely. A different way of thinking, designing and architecting that many legacy leaders oppose and prevent.

Talent ignorance (you don’t know what you don’t know) may be the most influential and tactical cause of modern fragility. It’s so easy to spin up an IDE, or low-code environment, and pull together unrelated & disparate open-source snippets to quickly mash together some application that can be placed into production in short order as a minimum viable product (MVP), void of any Non-Functional Requirements.

How is Farley addressing this? “That’s why, at Ford, we decided for our second-generation [EVs] to completely insource electrical architecture. To do that you need to write all the software yourself. But car companies haven’t written software like this, ever.” They are becoming a software company, taking accountability and responsibility for all components and integrations. To do this properly, Farley had to “split the company into three pieces” separating legacy teams & practices from modern software engineering teams & practices (Bimodal, anyone?). This also meant attracting new talent.

Let’s Promote the Ability of a System to Thrive in Adversity and Adapt to Change

Set the stage for New Platforms to succeed:

As Farley shared, legacy players may not be the best choice for new platforms and they may likely introduce fragility.
Consider choosing leaders & teams who will reap the greatest benefit & empowerment to champion a new platform.
One financial organization I admire built their Cloud platform this way, with their software engineering discipline bi-modally isolated from their legacy infrastructure discipline. It worked.
Focus on learning the new ways of working that a new platform promotes over forcing your legacy ways of working upon the new platform.

Optimize highly connected/integrated/distributed system components

Hub systems that connect to everything offer countless more vectors for failure that diminish your overall reliability.
Your availability is not the sum of each dependency; it’s the product. If you have three system dependencies, each with an up-time of 95%, your up-time will never be better than 85.7% (0.95 × 0.95 × 0.95)! If those three dependency systems increased their up-time to 99% (4 percentage points better each), your up-time ceiling would rise to 97% (11.3 percentage points better!).
Work to increase/tighten the robustness of each individual connection, even just a little, as it will greatly improve overall resilience.
Eliminate unnecessary connections & dependencies where they offer little business value or the risk of failure is far greater than the benefit.
Increase active telemetry to alert on failures sooner and leverage the Circuit Breaker pattern to provide more constructive responses to dependency failures.
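
Two ideas from the list above lend themselves to a short sketch: the compounding of dependency up-times, and the Circuit Breaker pattern. What follows is a minimal illustration, not a production implementation. The names `compound_availability`, `CircuitBreaker`, and `CircuitOpenError`, along with the threshold and cooldown defaults, are assumptions made for this example. It also assumes every dependency must be up for your system to be up.

```python
import time
from math import prod


def compound_availability(uptimes):
    """Best-case availability when every dependency must be up at once."""
    return prod(uptimes)


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency unavailable, try again later")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    print(f"{compound_availability([0.95] * 3):.1%}")  # prints 85.7%
    print(f"{compound_availability([0.99] * 3):.1%}")  # prints 97.0%
```

A caller would wrap each outbound dependency call in `breaker.call(...)` and catch `CircuitOpenError` to return a cached value or a friendly "try again shortly" message, rather than letting one failing dependency stall everything connected to it.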

Understand patchwork designs & implementations

There are many benefits to leveraging and contributing to open-source code, when used and managed responsibly. However, in the interest of quickly getting to production/market, some teams will cobble together/copy snippets from mash-ups, open-source, and friends & family without proper rigor and understanding, which introduces some of the hardest-to-debug fragilities.
When individual instances work but fail to scale, it’s often attributed to a lack of maturity, experience, and rigor in design & process. These need to be readily addressed through better peer reviews, testing, and best practices that promote a culture of excellence.
Generative AI companion/co-pilot tools can be leveraged to examine and rate “borrowed” code before it is introduced into your solutions. Thoroughly understand, evaluate, and scrutinize all harvested code as if it were your own.

Missed Customer Expectations

Fragility presents itself in forms that range from poor performance to unfortunate behaviors. Never assume your system’s fragility is in any way acceptable to your customers.
Somebody had to tell the emperor he had no clothes. If everyone is telling you something is just fine, seek those who see it differently and understand why.
One of the best ways to truly get a window into what’s fragile is to mine public support forums for your products and examine unanswered questions. It will identify confusion in using your product and help you discover missing features & capabilities.

Fragile Modern Systems are an embarrassment and threat to the modern enterprise. They amplify deficiencies in leadership, talent, and process that diminish confidence and lose customers.

Part 3 – When it’s OK to Build Fragile Systems

There is actually a case for Fragile Systems. In times of crisis, there is a place for temporary and disposable systems that may lack rigor, security, NFRs, etc. These systems can be spun up in days and are frequently critical for the survival of human lives, environment, and/or business continuity.

These are One use and Done applications that serve a single purpose for a very limited time.
They are active and accessible only for small windows of use and are then shut off.
This code dies. It’s never shared and not used as a seed/starter for subsequent work.
The risk imposed must be outweighed by the good the system serves for the short period it is being used.

In such situations, proactive crisis management teams can prepare systems with greater rigor and resilience by following guidance such as “A Crisis Situations Decision-Making Systems Software Development Process With Rescue Experiences.”

Temporary Fragile Systems can serve a most valuable service to the modern enterprise when they are leveraged to resolve one specific critical emergency or temporal task and are then forever abandoned.

Image by Ivan Vranic on Unsplash
