I am a pilot. I am not a software engineer or software writer. That said, I use software like everyone else in just about everything I use every day. As pilots, we are all too familiar with the problems on the Boeing 737 MAX. We are being told that faulty software is the cause. Yes, there were or could have been problems with the pilot training, but Boeing is re-writing the software and when complete, the problem will go away and the aircraft will be safe. Or will it?
Airbus is having an issue as we speak with the A350. In a mandatory airworthiness directive (AD) reissued in July 2019, EASA urged operators to turn their A350s off and on again to prevent “partial or total loss of some avionics systems or functions.”
This must be done before the aircraft reaches 149 hours of continuous power-up.
In 2015, the Boeing 787 suffered a similar yet different problem. A software counter overflow bug was discovered that could cause the generator control units to shut the generators down after 248 days of continuous power-on operation.
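The 248-day figure is consistent with a very ordinary kind of counter overflow. The sketch below is only an illustration under stated assumptions: it assumes a signed 32-bit counter that ticks 100 times per second, which is not something this article establishes, and the function names are made up for the example.

```python
# Hypothetical illustration: how long a signed 32-bit counter lasts
# if it is incremented 100 times per second (assumed tick rate).
MAX_SIGNED_32 = 2**31 - 1      # largest value a signed 32-bit integer can hold
TICKS_PER_SECOND = 100         # assumption: one tick per hundredth of a second
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(max_value: int = MAX_SIGNED_32,
                        ticks_per_second: int = TICKS_PER_SECOND) -> float:
    """Days of continuous power before the counter reaches max_value."""
    return max_value / ticks_per_second / SECONDS_PER_DAY


def wrap_signed_32(value: int) -> int:
    """Mimic fixed-width hardware: wrap a Python int to signed 32 bits."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(f"{days_until_overflow():.2f} days")   # about 248.55 days
    # One tick past the maximum, the counter suddenly reads as a large
    # negative number, which downstream logic may not expect.
    print(wrap_signed_32(MAX_SIGNED_32 + 1))
```

Turning the power off and on resets such a counter to zero, which is why a periodic power cycle works as a stopgap until the software itself is changed.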
These are just a couple of the latest incidents that are occurring on our newest generation of aircraft. Why are we having these computer-related problems?
I have been doing some research and believe me, it is hard—very hard—to sift through the BS on this subject.
In the “old days,” testing was straightforward. As an example, many of us have seen the video of the wing bending on the original Boeing 747. Straightforward. You bend something until it breaks, do it again and again, and you then have a pretty good idea of when it will break. If it is well within the limits you set, you are good to go.
You cannot bend or break software. So, what do you do? You put it through testing that some people consider industry standard and others don’t. Here is what I found for a description on testing.
The definition of software testing, according to the ANSI/IEEE 1059 standard is, “A process of analyzing a software item to detect the differences between existing and required conditions (i.e., defects) and to evaluate the features of the software item.”
Makes sense to me. Let me give you an example. I have a cell phone. It is full of software. How many times must I turn it on and off before it will fail? Will it always fail the same way? Will my model of phone fail for me and you at the same time? We cannot answer these questions. With the bending wing, we can, and we have a very good idea that, at a point in the bending process, the wing will fail.
Software is not a wing. It is code written for the unit it will help operate. Specifically, source code is made up of the numerous lines of instructions that software programmers write to create all software applications. Once the source code is written, it is compiled into a machine-readable program which is installed on a computer as an application.
So how is it tested so that we are as sure as possible that it will not fail, or we know exactly when it will fail and we can replace it before that time?
There is manual testing, automation testing, static testing and dynamic testing. Then there are approaches to testing like white box, black box and gray box. Finally, there are testing levels. They are unit testing, integration testing, system testing, and acceptance testing. This is all very impressive, but it still doesn’t tell me how long that unit will run on the software and why exactly it will fail, like the wing will fail.
So why does software fail? Here is some of what I found on that subject.
- Lack of user participation
- Changing requirements
- Unrealistic or unarticulated project goals
- Inaccurate estimates of needed resources
- Badly defined system requirements
- Poor reporting of the project’s status
- Lack of resources
- Unmanaged risks
This is all very reassuring. I mentioned that I am a pilot and not a programmer. Given this, how do I know that the software testing that goes into an aircraft is more complete than that which goes into a cell phone? I have no way of knowing that. If a hydraulic pump on an aircraft engine fails, it is sent out and bench tested. A fault is found, and a directive is issued so that all other pumps can be inspected and fixed. All the operators who use those pumps are notified. It is a simple and time-tested procedure.
Does a software failure on one aircraft necessarily mean that item will fail on all aircraft? With the hydraulic pump we know that things such as temperature, lubrication, vibration, and other factors can cause the failure. How so with software? We are in a highly regulated business. Software and the people who write it are not. They are, for the most part, self-regulated. Once you are certified as a software engineer, you can write for anyone who will hire you.
Just look at how often the “check engine” light illuminates on a car or truck. That is a computer program. From what I can find out, there is not much more that goes into the software for an aircraft than there is for that vehicle.
How often are we faced with software failures in aviation? I suggest that we do not have a clue. Unlike the pump, a software problem can go unreported. With a software failure, maintenance usually just does a reset and the problem goes away and then may or may not reappear. I for one do not believe that we keep a complete record of these small failures. I have experienced it firsthand and I saw the reaction of both my company and the manufacturer.
In one case I was flying an Airbus. On descent, I was about to level at 10,000 feet. I was hand flying the aircraft with the auto-throttles off. I moved the throttles forward and got no response. At that point, I was cleared to 9000 ft. I told the first officer to check for a problem quickly. Everything was in the right place. Everything.
Nothing I could do restored my control of engine power. I was cleared to 7000 feet and I knew that I would be staying there for some time. It was night and the weather was 400 overcast and I was nowhere near an airport I could glide to. At 7000 ft. I purposely let the speed decay to what is known on the Airbus as Alpha Floor. This is a computer program that Airbus installed so that the aircraft could not be stalled (not at all like Boeing’s MCAS). When Alpha Floor is reached, the aircraft is programmed to go into TOGA thrust (take off and go-around). My hope was that if a computer glitch got me into this predicament, then the computer might just get me out of it. It did. The aircraft responded as it was supposed to, and everything was restored, and we landed without incident.
On landing I pulled the cockpit voice recorder and flight data recorder tapes and maintenance removed them. I did all that was required for such an incident and went to the hotel.
I was a commuter and when I returned home the next day, my wife was on the phone and told me it was for me. It was my company and Airbus in Toulouse, France. There was a great deal of concern over the incident—as there should have been. In the end, my company sent all the tapes to Airbus to investigate, they did not report it to the regulatory authorities and Airbus did a software change to all their aircraft using that system.
In 1984, I was flying between Dubai and Male, Maldives. I was flying a DC-8 and we had an Omega navigation system on board. It had just been installed and I had never used one before. At exactly the second that the sun broke the horizon that morning, the aircraft started to turn off track. The Omega was driving the autopilot at the time. If I had allowed it to continue, it would have kept turning. I found out later that there was shielding missing in the computer unit that caused this anomaly.
On the Boeing 747-200, some of the autopilots would suddenly and dramatically go into roll mode. I had this happen while flying to Paris one night. It happened to a Taiwanese flight flying from New York to Taipei. The aircraft rolled over onto its back at FL 390 and the crew did not regain control until around FL 170.
And one more. On my second to last flight before retirement, I was flying the polar route to Hong Kong. At about 50 miles past the North Pole, we began to get master caution warnings. “Smoke in the Lavatory,” “Cargo Compartment Smoke,” and the very worst one on an Airbus: “Electrical Smoke or Fire.” These warnings came every seven minutes and before we could react, the warning disappeared from the screen. This went on for the rest of the flight. I contacted my company via sat phone, and they got Airbus on the line. Airbus told me, yes, there is a problem with the warning computer, and they are aware of it. There is a fix coming out in two weeks. Software fix.
These are just the problems I have had. Multiply this by the number of “electric” aircraft we have in the air and the number of problems will be staggering.
Who can you trust, what can you trust?
Ask yourself this. Why do we need all this fly-by-wire and many other computer systems? Let’s face it, manual flight control systems for a pilot are much more intuitive and user-friendly. You can feel an aircraft when flying manual controls but there is no feel in fly-by-wire. We have all this computer equipment because it is lighter and saves money. Weight is money and airlines love it. Aircraft are designed by engineers and technicians and not by pilots. Yes, they do ask our opinion, but do they really incorporate it into the final design? Very little. The bottom line drives all of this and nothing more.
As pilots we should know what kind of testing is being done on software, who does it and what the expectations of it are. Again, it is up to the regulators to do their job, but I fear as I see what is happening with the MAX, they will allow economics to be the big winner.
I am about to leave flying for good. Many of you on the other hand are just starting. It behooves you to look deeply into this problem as it will probably affect you for the rest of your career.
- Is software enough to keep pilots safe? - March 2, 2020
A great article with a great subject. I fly one of the most (if not the most, by a relevant number of measures) airliners in activity, and couldn’t agree more. The day software can fly as safely as we do, there will be no need for pilots. Back in 1988, when the A320 was released, it might have looked like this future was just around the corner. Now we know for sure it wasn’t, and it is not going to be. Designs being launched now are going to be flying for decades. The balance between the good and the bad of software as airplanes get even more “electrical” is crucial for us to keep flying as safe as it can and must be. The automation of systems and its redundancy nowadays are mind blowing – and welcome – but so are the glitches hidden in this new environment.
Humans make mistakes, whether they make a logic error when coding, forget to tighten a hydraulic fitting during mx, or stomp on a rudder pedal when they shouldn’t. Certainly the author isn’t suggesting that we regress to 1950’s technologies? What we should do is realistic testing of software, plus develop real time reporting and processing of error logs that can identify unexpected behavior trends before they cause accidents. And continue to learn, develop, and refine.
I am a retired software engineer. I think you left out one very relevant factor in why software is so hard to make bug-free: complexity. A computer operating system is considered to be the most complex human creation in our history. And the reason software is so much more complex than hardware is simple: it’s much easier to design and build complex software than complex hardware. For complex hardware you have to design and manufacture each of the myriad components, then assemble them and get them to work together. That’s a very labor-intensive and time-consuming process. For complex software, you don’t have to manufacture anything. You just write it. This ease of “manufacture” allows you to build far more complexity into a software system. The benefit of that is you can design systems that do many, many things. But the more complex the system, the more difficult it is to test for every possible way the pieces of that system can interact with each other and with other systems.
Phil, as an engineer who has written industrial control system software for decades, I agree that the software used to control ANYTHING in the physical world, especially when safety of personnel is a consideration, must be especially concise and well documented so as to permit sufficient testing to ensure that errors and anomalies are handled in a “fail-safe” manner.
Fly-by-wire has been used for several decades, beginning with military a/c, many of which were inherently unstable, often to allow the pilot to successfully fly the plane.
Extensive use of fly-by-wire in lieu of force augmentation of mechanical linkages on commercial a/c flight controls should only be implemented with “best 2 of 3” control computer “voting” systems [common in industrial process control systems]; where the 2 of 3 allows execution errors to be ignored. That, however, means $$$; and don’t get me started with the common practice in the US and EU of “off-shoring” software creation. If it’s for my cell phone, not so bad, but when my safety is in question; NO THANKS.
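As a rough sketch of the “best 2 of 3” idea, the voter below takes the outputs of three independent channels and passes on whatever at least two of them agree on. It is a simplified illustration, not how any particular flight control or process control system is built: the function name is hypothetical, and it compares discrete commands for exact equality, whereas real systems typically have to compare numeric signals within a tolerance.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote_2_of_3(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value at least two of three channels agree on.

    Returns None when all three disagree, so the caller can fall back
    to a safe state instead of trusting any single channel.
    """
    if len(outputs) != 3:
        raise ValueError("expected exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


if __name__ == "__main__":
    # One channel has an execution error; the other two outvote it.
    print(vote_2_of_3(["UP", "UP", "DOWN"]))      # UP
    # No majority at all: report that rather than guess.
    print(vote_2_of_3(["UP", "DOWN", "HOLD"]))    # None
```

Note what the voter cannot do: if all three channels run the same software with the same design flaw, they will agree with each other and the vote will pass the flawed answer through.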
I am a long time (punched cards start, PC and WWW finish) computer operator, programmer, systems designer and CIO person. At one time a Private Pilot flying mostly simple to fly Cessna type aircraft. Now retired I do not code or fly any more.
My opinion, there is no way to fully test complex systems like the software flying these monster sized commercial aircraft.
My opinion is these major computer systems are built on individual little programs meshed together into one complex full system.
The problem is most of these systems are designed and programmed by people who are not pilot trained. Most are built under a time crunch. Testing is often not done nearly as much or in as much detail as it should be. Some of the humans involved may be fully trained, some not trained to the point they should be. The recent Boeing computer-based incidents are an excellent example of the pressures to get systems into production without the testing they should have.
Systems are built around programs designed to fit the conditions the programmers anticipate. They put their concept of what will happen in real life into the code.
Then pilots get so used to the automation that they forget the stick and rudder skills that the old-time pilots mastered on cable-based control surfaces.
The result is a combined set of what-ifs that is going to fail if some combination of events happens that any one of these many designers, programmers and pilots did not consider, so that no solution for it was put into the finished product.
What amazes me is the fact that many of these complex interdependent lines of code work as well as they do, until something as simple as pitot system heat fails.
Great article for a very hot topic. Engineers and the system are pushing to replace pilots with machines. That day I will stay on the ground, and we should teach our kids that computers should be used to help us, not replace us. If we continue on this path it will be the destruction of humanity, but people do not understand the danger, and new pilots are nowhere near the average old pilots. Thank you.
Hi Kent,
You ask why fly-by-wire (FBW) is necessary. I can tell you one reason–the certification requirement for flight control system separation.
Under transport category certification standards it’s essentially impossible to route cables and push rods, or hydraulic lines, between the cockpit controls and the control surfaces far enough apart to survive the worst possible failures.
Engine rotor burst zone is the really difficult area to manage because the rules say a bursting rotor has infinite energy. In other words, the flying shrapnel from the exploding engine will pierce whatever structure–or control system–is in its path. So control circuits must be spaced far enough apart that at least one system survives the worst.
Obviously, routing wires far apart from an FBW computer to the control surface offers many more options than mechanical links or hydraulic lines. Several years ago the head of engineering for one of the major airframe makers told me he didn’t believe a new design could be certified under the transport rules except by using FBW to meet the flight control system separation requirements.
Airplanes with older certification basis, such as the 737, and others, can still be built using mechanical/hydraulic flight controls, but they are grandfathered.
Well, you may not think wide separation of multiple control operating linkages is important, but ask the guys who survived the DC-10 center engine explosion that took out all three closely spaced hydraulic control lines to the elevator. And there have been others where a floor collapse took out all control lines. It can happen, and it has.
So, even though FBW can improve the flying qualities of an airplane by enhancing stability, preventing flight outside the envelope, and saving weight, the really big reason it’s here to stay is control circuit routing and separation.
Mac McClellan
Mr. McClellan – Having read your articles for several decades, I value your opinion. I, however, believe that the use of modern, lightweight, armor materials to shield the multiple control systems’ paths [hydraulics] would be a better solution to the problem. As you noted, even the triple redundant systems on the DC-10 were rendered inoperative by the uncontained failure of the center engine. How much fuel would have to be deleted from the design to armor the critical flight controls? I contend that it wouldn’t be as significant as the bean counters claim.
Hi Leonard,
First of all, it’s not me who has determined that control cable or hydraulic line runs can’t be shielded from an uncontained engine failure. It’s the international certification authorities.
The high pressure components of an engine can be rotating at 30,000 rpm or more. It takes somebody with much greater math skills than me to calculate the potential energy of parts flying off a wheel rotating at that speed. But the experts have concluded the energy is “infinite” in terms of creating an effective shield against worst case failure.
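For a sense of scale, the energy a fragment carries away is the kinetic energy it had while spinning with the rotor, one half of its mass times its speed squared, with speed equal to the rotation rate times the radius. The sketch below works that out; the fragment mass and radius are illustrative assumptions chosen for the example, not figures for any real engine, and the function names are hypothetical.

```python
import math


def tip_speed_m_s(rpm: float, radius_m: float) -> float:
    """Linear speed of a point on the rotor at the given radius."""
    omega_rad_s = rpm * 2 * math.pi / 60   # revolutions per minute -> radians per second
    return omega_rad_s * radius_m


def fragment_energy_j(mass_kg: float, rpm: float, radius_m: float) -> float:
    """Kinetic energy (joules) of a fragment released at that radius."""
    speed = tip_speed_m_s(rpm, radius_m)
    return 0.5 * mass_kg * speed ** 2


if __name__ == "__main__":
    # Assumed values for illustration only: a 0.1 kg fragment released
    # at a 0.2 m radius from a rotor turning at 30,000 rpm.
    rpm, radius_m, mass_kg = 30_000, 0.2, 0.1
    print(f"speed:  {tip_speed_m_s(rpm, radius_m):.0f} m/s")
    print(f"energy: {fragment_energy_j(mass_kg, rpm, radius_m):.0f} J")
```

Because the speed is squared, doubling the rotation rate quadruples the energy of the same fragment, which is part of why shielding against the fastest-turning parts is treated as impractical.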
The engine fan, however, must be contained if it throws a blade. The fan is turning much more slowly, a few thousand rpm, so even though fan blades are larger than the compressor and turbine blades of the high pressure section, the potential energy is much less.
Engine makers use kevlar or other high strength materials to form a belt around the engine fan at the front of the engine. In certification testing a blade must be intentionally fractured at operating speed and the belt must contain the failed blade.
As we saw in the engine failure on the Boeing 737 operated by Southwest, even fan blade containment may not be enough. In that case the fractured blade was successfully prevented from departing radially and damaging the airframe, but the violence of the fan blade failure broke parts of the engine cowling. It was those broken cowling parts that went aft over the wing, knocking out a cabin window, which caused a passenger to be sucked out and killed.
The massive potential energy of something spinning as fast as the high pressure section of a jet engine is difficult to comprehend. The risk is so great that engine makers often use a “spin pit” during early engine development. The spin pit is a hole in the ground so if the engine comes apart the flying shrapnel is sent into the dirt where it can do no harm.
Mac Mc
Articles like this and the responses, Mac’s was particularly interesting, are why this is my favorite aviation magazine.
As a now retired software engineer, as well as a pilot, I agree. Remember, Dijkstra said “Testing only reveals the presence of bugs, it does not prove their absence.” We already have hardware with a Mean Time Between Failure (MTBF) of more than 50,000 hours in quite ordinary hardware, but what is the MTBF of software? Consider your Windows computer, how often do you have to re-boot it to ‘fix’ a glitch, problem, or mis-operation? And this is with a system that supposedly has been ‘developed’ over more than 30 years, and still it is not bug-free!
I agree with the reasons for FBW – it is in order to meet the separation requirements. Nevertheless, I am reminded of a statement by one of my colleagues that ‘I love Fly By Wire, as long as the wires in question are about a quarter of an inch in diameter and positively connected to the control surfaces!’
Finally, remember that for all the positive spin put out by the Artificial Intelligence and Expert Systems community none of these systems are ‘intelligent’ in the accepted sense of being sentient. They rely on algorithms that use a vast data base of previous examples and if the present condition is not there then they fail – ask Tesla. I do hope we never develop a truly sentient machine, as it might well decide that we are irrelevant. I also hope that we never develop fully autonomous vehicles, I truly believe that there should be human sentience in the cockpit of an aircraft, the bridge of a ship, and the driver’s seat of an automobile. Fortunately, at least at present, no computer has even the ‘intelligence’ of a cockroach, and I hope it remains that way. The point was made by another correspondent that computers were supposed to aid not replace us.
It is thought provoking articles like this that keep me reading this magazine.
I am an old-school retired airline pilot. I like the advanced automation of the newer airplanes. I just hope the engineers will still integrate a cutout switch to allow the airplane to revert to basic stick and rudder flying. Automation is here to stay. However, I am not sure today’s young pilots are acquiring the stick and rudder skills and knowledge needed to revert to manual flying in an emergency. Captain Sully successfully demonstrated he still had good stick and rudder skills!
“The advent of agriculture was the beginning of the decline of mankind.” Technology is hurrying it along. In 80 years I have flown commercially about 10 times. Never liked it.
Where to begin?
Yes, I am a pilot and have been a software guy for over 30 years.
“As pilots, we are all too familiar with the problems on the Boeing 737 MAX. We are being told that faulty software is the cause.”
I beg to differ. In software, we have the term “bug.” It has a very interesting origin. You might want to look it up sometime. A bug means the software is not doing what it was designed to do. I write a program to add 2 + 2 and it spits out 5 … that’s a bug.
On the other hand, if the software is doing exactly what it was designed to do, and we don’t like it … that is faulty design, not faulty software. I believe the MCAS software was working the way it was intended, right to the point of the crash.
Don’t blame the computer, blame the people. Boeing’s sales and marketing demanded this kludge. The engineers were forced to set aside what they knew was right and implement it. What happened here was not a computer problem.
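The distinction between a bug and a faulty design can be sketched in a few lines. Everything below is hypothetical and made up for illustration: the function names, the 15-degree limit, and the one-sensor rule are assumptions, and none of it is a description of the actual MCAS code. The first function breaks its own specification, so a test catches it. The last function does exactly what its specification says and passes its test, yet the specification itself trusts a single input.

```python
def add_buggy(a: int, b: int) -> int:
    """Spec: return a + b. This version contains a deliberate bug."""
    return a + b + 1


def add_fixed(a: int, b: int) -> int:
    """Spec: return a + b. This version meets the spec."""
    return a + b


def meets_add_spec(add) -> bool:
    """A tiny test: the spec says 2 + 2 must be 4."""
    return add(2, 2) == 4


LIMIT_DEG = 15.0  # hypothetical threshold, for illustration only


def trim_command(sensor_angle_deg: float) -> str:
    """Hypothetical spec: command nose-down whenever the one sensor reads above the limit."""
    return "NOSE_DOWN" if sensor_angle_deg > LIMIT_DEG else "NONE"


def meets_trim_spec() -> bool:
    """The code does what the spec says, so this test passes."""
    return trim_command(20.0) == "NOSE_DOWN" and trim_command(5.0) == "NONE"


if __name__ == "__main__":
    print(meets_add_spec(add_buggy))   # False: a bug, and testing finds it
    print(meets_add_spec(add_fixed))   # True
    print(meets_trim_spec())           # True: no bug by the spec's own terms
    # A failed sensor reporting a wild value still gets obeyed, because
    # nothing in the spec asked for a cross-check against a second source.
    print(trim_command(74.5))          # NOSE_DOWN
```

Testing against the specification can only find the first kind of problem. The second kind has to be caught by questioning the specification itself.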
The overarching theme of the article seems to be “This technology is causing problems”, and “Why do we need it?”
Wow, that is a lot to unpack.
Let’s start with this: 80% of all accidents are pilot error. Along these lines, the same article could be, and has been, written about getting rid of these fallible pilots. The tech is being put there to monitor things a pilot just cannot. It is put there to help us stop killing people with stupid mistakes.
TransAsia pilot shuts down the wrong engine and most of the people on board die.
Air France 447. If the pilots had just let go of the controls, everyone would still be alive.
The list of accidents caused by the pilot is very, very long.
The author describes the throttles not responding in an Airbus. This was on a descent from 10,000 to 9,000. I don’t know about you, but why didn’t you immediately declare an emergency? 1) He relied on the same technology he was criticizing to save the aircraft (which it did). 2) If the technology didn’t save him, how many more options would he have after waiting for the energy to bleed off and the fail-safe to take over?
What happened with the throttles was dangerous. Could have gotten everyone killed. But, the company followed up. Does this problem still happen?
The professional pilots I know are wonderful. They take their job very seriously. They also understand the role of today’s pilot in the cockpit.
We are on the verge of self-driving cars. From a tech standpoint, that is several orders of magnitude more complex than flying an airplane. Years ago we had aircraft that could go from point A to point B without a pilot.
The role of the pilot today isn’t really to fly the plane. I know, it will take a moment to sink in. It is to monitor and manage the systems and intervene if something goes wrong. That includes knowing the aircraft and systems well enough to troubleshoot something the software was never programmed to handle. As the systems get better, the job is less and less exciting. Great if you are a passenger, right?
I am not saying these systems are perfect. They are not. However, as problems are solved it gets better.
I’m not saying the public is ready for an autonomous aircraft. I’m also pretty sure the public doesn’t know what causes the majority of aviation accidents.
I used to fly a 1980 Piper Warrior. It had a six pack and two needles. I got my instrument ticket with it. No traffic, no weather, no GPS, no autopilot. It was all on me. After the end of a long IFR flight, I was exhausted.
I fly a Velocity now. It has dual EFIS, with dual AHRS and multiple redundant electrical systems. I have a WAAS GPS navigator coupled to the autopilot. I fly with an iPad using ForeFlight linked to an ADS-B in receiver. I have onboard weather, traffic and a backup attitude instrument on the iPad. I can go on, but you get the point. I have a lot of software and technology in my cockpit. Stuff that wasn’t there in the Warrior. It isn’t perfect and I have had problems. However, I would feel naked in the air without all that wonderful, useful information.
I had to spend significant time learning how to use these systems. I have to manage them in flight. I get way more out of it than I have to put in.
Lament all you want about the wind, scarf, goggles, stick and rudder days. The technology, even with its warts, has been making things better and safer.
The answer is no if pilots do not have confidence in the flying software. I’m a retired software engineer with electrical engineering and software engineering degrees. In the engineering world, everything that happens has a probability of occurrence. The crashes occurred because the probability of occurrence was 1, or 100%. The MCAS software was phantom, dormant, or dead code that no pilots knew existed. Safety was compromised, and there was no basis for confidence, from the start. I worked for the government for 30 years and inspected software on military aircraft, and I can say I did not have confidence because I did not know what was in the software. An aircraft has many electronic boxes, each box has its own embedded software, and each box could be manufactured by a different vendor. My inspection was just to check that the current software version was loaded in each box before the contractor sold off the aircraft to the government. I did not have permission to do anything else. I suspect the FAA was doing something similar for commercial aircraft or, worse, letting the manufacturer inspect its own software and software versions for aircraft system airworthiness certification. There is no such thing as aircraft software airworthiness certification, and I believe this is a big mistake. If no one knows dead code exists, then no one knows to demand that it be removed or tested thoroughly, or that training be provided to counter it in case it becomes ghost code that takes over control. To be highly confident is to know exactly what is in the software. This is the job of the software independent verification and validation (IV&V) people. Verification means checking that the software works per the contractor’s specification, and validation means checking that the software meets the end users’ (pilots’) requirements. This is the job of the FAA, but I guess the FAA doesn’t really know the software itself and got bullied by the manufacturer.
The next question is how you do IV&V on possibly millions of lines of code in an aircraft to ensure every function the software performs is accountable to an end user/pilot. That means no extra/phantom/dormant/dead/ghost code is allowed, or it is strictly controlled. This is where priority comes into play. For example, you would cover the critical functions like MCAS first, and spend less effort on others such as software that turns on lights. Testing and IV&V are expensive, and they are a matter of economics. You don’t want it to be so expensive that consumers pay too high a price for safety. It’s a balance between safety confidence and cost. MHO