Troubleshooting

Goals

Solutions don’t usually work first time, and this How To covers the core checks you should make to identify corrective actions. More detailed discussions are in the Further Reading section. The checklists below are based on material written by Tony Kuphaldt in his Troubleshooting chapter from Lessons in Electric Circuits, Reference, Volume 5.

Questions to Ask Before Proceeding

When facing a problem, first ask the following questions. The answers will guide you to the appropriate checklist of further questions and actions.

Complete the _Triage Checklist_ to isolate the failure.

If the system, sub-system or component is still not functioning correctly, go through the following questions. Here ‘system’ refers to the isolated section identified through the Triage Checklist.

Has the system ever worked before? If no, go to the Likely Failures in Unproven Systems checklist.
Has this system proven itself to be prone to certain types of failure? If yes, go to the Prior Occurrence checklist.
Is the failure sudden or unexpected? Is data available for the previous behaviour of the system? If yes to either question, go to the Likely Failures in Proven Systems checklist.

If the above does not isolate the problem, complete the following and restart with the Triage Checklist:

Final Checks checklist
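
The questions above amount to a small decision procedure. As a minimal sketch only (the function and parameter names are illustrative assumptions, not part of the original checklists), the routing could be written as:

```python
def next_checklists(ever_worked, known_failure_prone, sudden_or_has_history):
    """Return the checklists to work through, in order, once triage is done.

    All three arguments are booleans answering the questions in this section.
    The names are hypothetical and chosen only for this sketch.
    """
    steps = []
    if not ever_worked:
        steps.append("Likely Failures in Unproven Systems")
    if known_failure_prone:
        steps.append("Prior Occurrence")
    if sudden_or_has_history:
        steps.append("Likely Failures in Proven Systems")
    # If none of the above isolates the problem, finish with the
    # Final Checks and then restart from the Triage Checklist.
    steps.append("Final Checks")
    steps.append("Triage Checklist")
    return steps


# A system that used to work, has no known weak points, and failed suddenly.
for step in next_checklists(ever_worked=True,
                            known_failure_prone=False,
                            sudden_or_has_history=True):
    print(step)
```

More than one question can be answered "yes", so the sketch returns a list rather than a single checklist.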

Triage Checklist

Isolate the system or sub-system causing the failure.

Swap identical components. In a system with identical or parallel subsystems, swap components between those subsystems and see if your problem moves with the swapped component. If it does, you’ve just swapped the faulty component. Replace the component and check the items in the Prior Occurrence checklist.
Remove parallel components. Where a system contains parallel or redundant components, start removing these components (one at a time) and see if things start to work again.
Divide system into sections and test those sections. Check the inputs and outputs of each sub-system, measuring the signals going into or out of each sub-system. Verify your measurements against simulation, theory, or prior tests to identify the faulty sub-system.
Simplify and rebuild. Strip down to a small, working section or system. Then rebuild gradually to isolate the part of the system where the failure is evident.
Measure and trap a signal. Set up instrumentation (such as an oscilloscope, data-logger, or multimeter set on “record” mode) to monitor a signal over a period of time. This is especially helpful when tracking down intermittent problems, which have a way of showing up the moment you’ve turned your back and walked away.
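
For a system built as a linear chain of stages, the "divide system into sections" step can be done with a half-split (binary) search, so that only a few stage outputs need to be measured. The sketch below is one possible way to organise this, not a prescribed method. It assumes a single fault, so that every stage before the fault measures good and every stage from the fault onward measures bad. The callback `output_is_good` is hypothetical: it stands for "measure the output of stage i and compare it with simulation, theory, or prior tests".

```python
def find_first_bad_stage(n_stages, output_is_good):
    """Half-split search for the first stage whose output is wrong.

    n_stages: number of stages in the chain, in signal order.
    output_is_good: hypothetical callback taking a stage index and
        returning True if that stage's measured output is as expected.
    Returns the index of the first bad stage, or None if the final
    output is good (no fault visible in this chain).
    """
    if n_stages == 0 or output_is_good(n_stages - 1):
        return None
    lo, hi = 0, n_stages - 1  # the first bad stage lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if output_is_good(mid):
            lo = mid + 1      # fault is downstream of mid
        else:
            hi = mid          # fault is at mid or upstream of it
    return lo


# Illustrative readings only: stages 0 to 2 look fine, stage 3 onward do not.
readings_ok = [True, True, True, False, False]
print(find_first_bad_stage(len(readings_ok), lambda i: readings_ok[i]))  # prints 3
```

If more than one fault is present, or the system has feedback paths, the single-fault assumption no longer holds and the sections should be checked one at a time instead.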

Prior Occurrence

Check historic failure modes first.

Check for recent alterations. Check that your changes have not altered the historic behaviour of the system. Re-run historic tests to verify the expected outputs have not changed.
Identify correct functionality. Look for what the system is doing correctly; in other words, identify where the problem is not, and focus your efforts elsewhere. Components or subsystems necessary for the parts that are giving a correct result are probably okay. The degree of fault can often tell you what part of it is to blame.
Hypothesize. Check hypotheses, based on your knowledge of how a system works. Check for failures, starting with the most likely based on circumstances, history, or knowledge of component weaknesses.
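
Re-running historic tests is easier if earlier readings were recorded. As a hedged sketch (the test-point names, voltages, and the 5% tolerance are invented for illustration), the comparison against a stored baseline might look like this:

```python
def changed_outputs(baseline, current, rel_tol=0.05):
    """Return the names of test points that no longer match the baseline.

    baseline, current: dicts mapping a test-point name to a reading.
    rel_tol: allowed relative deviation from the baseline (assumed 5%).
    A test point missing from `current` is reported as changed.
    """
    changed = []
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual is None or abs(actual - expected) > rel_tol * abs(expected):
            changed.append(name)
    return changed


# Hypothetical readings in volts, not taken from any real system.
baseline = {"TP1": 5.0, "TP2": 3.3, "TP3": 1.65}
current = {"TP1": 5.02, "TP2": 2.9, "TP3": 1.64}
print(changed_outputs(baseline, current))  # prints ['TP2']
```

The test points that are not reported also carry information: they show where the problem is not, which narrows the set of hypotheses worth checking.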

Likely Failures in Proven Systems

Check the following, in the order listed. Make sure you have completed the Triage Checklist first.

Operator error. Check procedures, instructions, and other directions fully to ensure you are following them correctly.
Bad wire connections. Check connection points in plug-and-socket connectors, terminal strips, and splices using a multimeter in continuity mode. Make sure all wires and components are fully inserted into breadboards and other temporary circuits. Check mechanical switch contacts, ensuring they work correctly. Check the termination of wires to ensure a proper electrical connection, especially for stranded wires.
Check for ground faults. Visually, and with a multimeter, check that all wires, outputs and conductors are correctly grounded. With a multimeter, check metal casings, paths and connections for faults which are (temporarily) giving you a short circuit.
Power supply problems. Use a multimeter or oscilloscope to check that the input voltage and current are correct. In the case of AC power, check that the frequency and phase match the system requirements. Use a multimeter to check that fuses and other disconnecting components are working correctly.
Active components. Check that your static handling procedures are correct, and that you have not damaged components through improper handling. Check datasheets for diagrams, tables and graphs of expected behaviour, and then verify your components against these to check that they have not aged or failed.
Passive components. Use an LCR bridge or multimeter to check the component value and functionality. Check the following, in order of likely failure:
- Capacitors (shorted), especially electrolytic capacitors. The paste electrolyte tends to lose moisture with age, leading to failure. Over-voltage transients puncture thin dielectric layers.
- Diodes open (rectifying diodes) or shorted (Zener diodes).
- Inductor and transformer windings open or shorted to the conductive core. You can often detect failures caused by overheating (insulation breakdown) by smell.
- Resistors open, seldom shorted. Usually this is due to over-current heating; less frequently, it is caused by an over-voltage transient (arc-over) or physical damage (vibration or impact). Resistors may also change resistance value if overheated!
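
Checking a component value means comparing the measured value with the nominal value and its tolerance band. A minimal sketch of that arithmetic follows; the 10 kΩ, 5% resistor and the readings are illustrative assumptions only.

```python
def within_tolerance(measured, nominal, tolerance_pct):
    """Return True if measured lies within nominal +/- tolerance_pct percent."""
    return abs(measured - nominal) <= nominal * tolerance_pct / 100.0


# Hypothetical 10 kilohm, 5% resistor: the acceptable band is 9.5k to 10.5k.
print(within_tolerance(10_300, 10_000, 5))        # True: inside the band
print(within_tolerance(11_200, 10_000, 5))        # False: drifted high
print(within_tolerance(float("inf"), 10_000, 5))  # False: reads as open
```

An out-of-band reading taken in-circuit may be caused by parallel paths through other components, so it is worth repeating the measurement with at least one lead of the component disconnected before replacing it.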

Likely Failures in Unproven Systems

Check the following, in the order listed. Make sure you have completed the Triage Checklist first.

Wiring problems. Check for assembly errors, such as connection to the wrong point or poor connector fabrication. Double-check breadboard connections against the circuit diagram.
Power supply problems. Use a multimeter or oscilloscope to check that the input voltage and current are correct. In the case of AC power, check that the frequency and phase match the system requirements. Use a multimeter to check that fuses and other disconnecting components are working correctly. Check that the circuit load is not larger than expected, resulting in overloading and subsequent failure of power supplies.
Defective components. Check all components — active or passive — against expected values and behaviour. Check components against datasheets, especially that your pin connections are correct for the component you are testing (not all ‘identical’ components have the same pin layout). Check datasheets for diagrams, tables and graphs of expected behaviour, and then verify your components against these.
Improper system configuration. Check the inputs and outputs of microcontrollers and microprocessors, using a multimeter or oscilloscope. Look for voltage and timing mismatches, signal propagation delays, incorrect PWM outputs and other improper behaviour from the program code. Check the tolerance of components for power ratings, impedance mismatches, and other limits. Check that components with configuration “jumpers” or switches are “programmed” to give the expected behaviour. Check that you have calibrated sensors, instruments, and controlling mechanisms, and that the calibration procedures are correct.
Design error. Check the fundamental theory of operation for your circuit, and that this solution is appropriate. Ideally, cross-check this design against theory, using simulations to identify possible issues and problems. Check that the outputs of the simulation and theory match the output of the system you are testing.
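
The load check mentioned under power supply problems is a simple current budget: add up the expected load currents and compare the total with the supply rating. The sketch below assumes hypothetical loads and a hypothetical 500 mA supply; substitute figures from your own datasheets and measurements.

```python
def supply_headroom_ma(supply_rating_ma, load_currents_ma):
    """Return the spare supply current in mA (negative means overloaded).

    supply_rating_ma: maximum continuous current the supply can deliver.
    load_currents_ma: dict mapping each load's name to its current draw.
    """
    return supply_rating_ma - sum(load_currents_ma.values())


# Illustrative figures only, not taken from any real design.
loads = {"microcontroller": 50, "sensor": 15, "LED strip": 480}
headroom = supply_headroom_ma(500, loads)
print(headroom)  # prints -45, so this supply would be overloaded
```

A small positive headroom is not necessarily safe either, since start-up and transient currents can exceed the steady-state figures used here.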

Final Checks

When all else fails, make sure you have discussed and discounted the following items. Then return to the Triage Checklist when you have finished, and repeat the troubleshooting checklists.

Don’t assume brand-new components will always be good. While it is often true that a new component will be in good condition, it is not always true. It is also possible that a component has been mislabelled and may have the wrong value (usually this mislabelling is a mistake made at the point of distribution or warehousing and not at the manufacturer, but again, not always!).
Not periodically checking your test equipment. This is especially true with battery-powered instruments, as weak batteries may give spurious readings. When using instruments to safety-check for dangerous voltage, remember to test the meter on a known source of voltage both before and after checking the circuit, to make sure the meter is in proper operating condition.
Assuming there is only one failure to account for the problem. Single-failure system problems are ideal for troubleshooting, but sometimes failures come in multiples. In some instances, the failure of one component may lead to a system condition that damages other components. Sometimes a component in marginal condition goes undetected for a long time, then when another component fails the system shows problems with both components.
Mistaking coincidence for causality. Just because two events occurred at almost the same time does not necessarily mean one event caused the other! They may both be consequences of a common cause, or they may be totally unrelated! If possible, try to duplicate the condition suspected to be the cause and see if the event suspected to be the effect happens again. If not, then there is no causal relationship as assumed. This may mean there is no causal relationship between the two events whatsoever, or that there is a causal relationship, but just not the one you expected.
Self-induced blindness. After a long effort at troubleshooting a difficult problem, you may become tired and begin to overlook crucial clues to the problem. Take a break and let someone else look at it for a while. Having time to (unconsciously) think through a problem can make an amazing difference. On the other hand, it is usually a bad idea to solicit help at the start of the troubleshooting process. Effective troubleshooting involves complex, multi-level thinking, which is hard to communicate with others. More often than not, “team troubleshooting” takes more time and causes more frustration than doing it yourself. An exception to this rule is when the knowledge of the troubleshooters is complementary: for example, a technician who knows electronics but not machine operation, teamed with an operator who knows machine function but not electronics.
Failing to question the troubleshooting work of others on the same job. This may sound rather cynical and misanthropic, but it is sound scientific practice. Because it is common to overlook ‘insignificant’ details, troubleshooting data received from another troubleshooter should be personally verified before proceeding. This is a common situation when troubleshooters “change shifts” and a technician takes over for another technician who is leaving before troubleshooting completes. It is important to exchange information, but do not assume the prior technician checked everything they said they did, or checked it against specifications.