Compiling an application for use in highly radioactive environments

#1
We are compiling an embedded C++ application that is deployed in a shielded device in an environment bombarded with [ionizing radiation][1]. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and crashes more often than we would like. The hardware is designed for this environment, and our application has run on this platform for several years.

Are there changes we can make to our code, or compile-time improvements that can be made to identify/correct [soft errors][2] and memory-corruption caused by [single event upsets][3]? Have any other developers had success in reducing the harmful effects of soft errors on a long-running application?


#2
Here are some thoughts and ideas:

**Use ROM more creatively.**

Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.
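
For example, here is a minimal sketch of what I mean (the table name and values are made up): a `constexpr` table that GCC will place in `.rodata` (which your linker script should map to ROM/flash), plus the runtime address check.

```
#include <cstdint>
#include <cstdio>

// A small constant table; constexpr (or const at namespace scope) lets GCC put
// it in .rodata, which the linker script should map to ROM/flash, not RAM.
constexpr uint16_t kSineTable[4] = { 0, 16384, 23170, 28378 };  // illustrative values

int main()
{
    // Print the address at runtime and compare it with the ROM address range in
    // your linker map file to confirm the table really ended up in ROM.
    std::printf("kSineTable at %p\n", static_cast<const void*>(kSineTable));
    return 0;
}
```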

**Use your best RAM for the stack.**

SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.
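
Here is a hedged sketch of one way to arrange that with GCC on a bare-metal ARM build; the section name `.fast_ram_stack` and the size are made up, and your linker script and startup code still have to map the section and load the stack pointer.

```
#include <cstdint>

// Reserve the stack in a dedicated, named section so the linker script can map
// it to the most SEU-tolerant RAM on the board.
__attribute__((section(".fast_ram_stack"), aligned(8)))
static uint8_t g_main_stack[4096];

// Top-of-stack symbol for the startup code / vector table to load into SP.
extern "C" uint8_t* const g_main_stack_top = g_main_stack + sizeof(g_main_stack);
```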

**Implement timer-tick and watchdog timer routines.**

You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.

**Implement [error-correcting-codes][1] in software.**

You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.
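
As one concrete (hypothetical) illustration, a Hamming(7,4) codec that protects a 4-bit value against any single flipped bit; a real design would protect whole words or blocks, but the principle is the same.

```
#include <cstdint>

// Encode a 4-bit value into a 7-bit Hamming(7,4) codeword.
// Codeword layout (bit index = position - 1): p1 p2 d1 p3 d2 d3 d4
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   // covers codeword positions 1,3,5,7
    uint8_t p2 = d1 ^ d3 ^ d4;   // covers codeword positions 2,3,6,7
    uint8_t p3 = d2 ^ d3 ^ d4;   // covers codeword positions 4,5,6,7
    return static_cast<uint8_t>(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                                (d2 << 4) | (d3 << 5) | (d4 << 6));
}

// Decode a codeword, correcting a single flipped bit if the syndrome is non-zero.
uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t syndrome = static_cast<uint8_t>(s1 | (s2 << 1) | (s3 << 2));
    if (syndrome != 0)
        cw ^= static_cast<uint8_t>(1u << (syndrome - 1));   // flip the bad bit back
    return static_cast<uint8_t>(((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
                                (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3));
}
```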

**Remember the caches.**

Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM then you could regularly read and re-write critical data to make sure it stays in cache and bring RAM back into line.
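
A rough sketch of the read-and-re-write idea; whether the write actually reaches RAM depends on the cache's write policy, so a cache clean/flush may be needed as well.

```
#include <cstddef>
#include <cstdint>

// Walk a critical region, reading every word and writing it back, to keep it
// warm in cache and to push a (hopefully still correct) copy back toward RAM.
void scrub_region(volatile uint32_t* start, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        uint32_t value = start[i];   // read (possibly served from cache)
        start[i] = value;            // write back
    }
}
```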

**Use page-fault handlers cleverly.**

If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)
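
Here is a POSIX-flavoured sketch of that idea, assuming your application actually runs under an OS that provides `mprotect()` and SIGSEGV delivery; on a bare-metal ARM target the equivalent logic would live in the MMU/MPU fault handler instead. The function names are made up.

```
#include <csignal>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t* g_guarded;      // page holding critical data, kept PROT_NONE
static uint8_t* g_golden;       // known-good copy used for verification/repair
static long     g_page_size;

static void fault_handler(int, siginfo_t* info, void*)
{
    uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
    uintptr_t page = addr & ~static_cast<uintptr_t>(g_page_size - 1);
    // A fuller version would compare the page against g_golden here and repair
    // any corrupted bytes before allowing the access to proceed.
    mprotect(reinterpret_cast<void*>(page), g_page_size, PROT_READ | PROT_WRITE);
}

void install_guard()
{
    g_page_size = sysconf(_SC_PAGESIZE);
    g_guarded = static_cast<uint8_t*>(mmap(nullptr, g_page_size,
                                           PROT_READ | PROT_WRITE,
                                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    g_golden = new uint8_t[g_page_size];
    std::memcpy(g_golden, g_guarded, g_page_size);   // snapshot the good state
    mprotect(g_guarded, g_page_size, PROT_NONE);     // accesses now fault first

    struct sigaction sa{};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, nullptr);
}
```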

**Use assembly language for critical things (which could be everything).**

With assembly language, you _know_ what is in registers and what is in RAM; you _know_ what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.

Use `objdump` to actually look at the generated assembly language, and work out how much code each of your routines takes up.

If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.

**Remember it is a game of probabilities.**

A commenter said
> Every routine you write to catch errors will be subject to failing itself from the same cause.

While this is true, the chances of errors in the (say) 100 bytes of code and data required for a check routine to function correctly are much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM, then your odds are even better.

**Use redundant hardware.**

Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.
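
A tiny sketch of the 2-out-of-3 vote; `read_result_from_board()` and `request_reset_of_board()` are placeholders for whatever link the boards actually share.

```
#include <cstdint>

extern uint32_t read_result_from_board(int board_index);   // placeholder
extern void request_reset_of_board(int board_index);       // placeholder

uint32_t voted_result(bool& valid)
{
    uint32_t a = read_result_from_board(0);
    uint32_t b = read_result_from_board(1);
    uint32_t c = read_result_from_board(2);
    valid = true;

    if (a == b || a == c) return a;                          // a agrees with a peer
    if (b == c) { request_reset_of_board(0); return b; }     // a is the odd one out
    for (int i = 0; i < 3; ++i) request_reset_of_board(i);   // no two boards agree
    valid = false;
    return 0;                                                // caller must ignore this cycle
}
```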


#3
How about running many instances of your application? If crashes are due to random memory bit changes, chances are some of your app instances will make it through and produce accurate results. It's probably quite easy (for someone with a statistical background) to calculate how many instances you need, given the bit-flip probability, to achieve as small an overall error rate as you wish.
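
A back-of-the-envelope sketch of that calculation, assuming each instance is corrupted independently with probability p per run and the outputs are compared by majority vote (the numbers used here are made up):

```
#include <cmath>
#include <cstdio>

// Probability that a majority vote over n instances fails, i.e. that more than
// half of the instances are corrupted, given per-instance corruption probability p.
double majority_failure_probability(int n, double p)
{
    double prob = 0.0;
    for (int k = n / 2 + 1; k <= n; ++k) {
        double comb = std::exp(std::lgamma(n + 1) - std::lgamma(k + 1) -
                               std::lgamma(n - k + 1));   // n choose k
        prob += comb * std::pow(p, k) * std::pow(1.0 - p, n - k);
    }
    return prob;
}

int main()
{
    const double p = 1e-3;        // assumed per-run corruption probability
    const double target = 1e-12;  // desired overall failure probability
    for (int n = 1; n <= 99; n += 2) {
        if (majority_failure_probability(n, p) < target) {
            std::printf("use %d instances\n", n);
            break;
        }
    }
    return 0;
}
```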

#4
This is an extremely broad subject. Basically, you can't really recover from memory corruption, but you can at least try to **fail promptly**. Here are a few techniques you could use:

- **checksum constant data**. If you have any configuration data which stays constant for a long time (including hardware registers you have configured), compute its checksum on initialization and verify it periodically. When you see a mismatch, it's time to re-initialize or reset.

- **store variables with redundancy**. If you have an important variable `x`, write its value in `x1`, `x2` and `x3` and read it as `(x1 == x2) ? x2 : x3` (see the sketch after this list).

- implement **program flow monitoring**. XOR a global flag with a unique value in important functions/branches called from the main loop. Running the program in a radiation-free environment with near-100% test coverage should give you the list of acceptable values of the flag at the end of the cycle. Reset if you see deviations.

- **monitor the stack pointer**. In the beginning of the main loop, compare the stack pointer with its expected value. Reset on deviation.
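
As a small sketch of the redundant-variable bullet above (the class name is made up, and it is only meant for plain scalar types), the 2-of-3 read mirrors the `(x1 == x2) ? x2 : x3` expression:

```
#include <cstdint>

template <typename T>
class Redundant
{
public:
    void write(const T& value) { x1 = value; x2 = value; x3 = value; }

    T read() const
    {
        // If x1 and x2 still agree, trust them; otherwise fall back to x3.
        return (x1 == x2) ? x2 : x3;
    }

private:
    volatile T x1, x2, x3;   // three copies of the value
};

// Usage:
//   Redundant<uint32_t> motor_speed;
//   motor_speed.write(1200);
//   uint32_t v = motor_speed.read();
```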


#5
What you are asking about is quite a complex topic and not easily answerable. The other answers are fine, but they cover only a small part of all the things you need to do.

[As seen in comments][1], it is not possible to fix hardware problems 100%; however, it is possible with high probability to reduce or catch them using various techniques.

If I were you, I would create the software to the highest [Safety integrity level][2] (SIL-4). Get the IEC 61513 document (for the nuclear industry) and follow it.



#6
One point no-one seems to have mentioned. You say you're developing in GCC and cross-compiling onto ARM. How do you know that you don't have code which makes assumptions about free RAM, integer size, pointer size, how long it takes to do a certain operation, how long the system will run for continuously, or various stuff like that? This is a very common problem.

The answer is usually automated unit testing. Write test harnesses which exercise the code on the development system, then run the same test harnesses on the target system. Look for differences!
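
For example, a tiny harness like this (hypothetical, but the kind of thing I mean) can be compiled and run on both the development host and the ARM target; diffing the two outputs exposes hidden assumptions about sizes and implementation-defined behaviour.

```
#include <cstdio>

int main()
{
    std::printf("sizeof(int)=%zu sizeof(long)=%zu sizeof(void*)=%zu\n",
                sizeof(int), sizeof(long), sizeof(void*));
    std::printf("char is %s by default\n", (char)-1 < 0 ? "signed" : "unsigned");
    std::printf("-7 >> 1 = %d\n", -7 >> 1);   // implementation-defined on paper
    return 0;
}
```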

Also check for errata on your embedded device. You may find there's something about "don't do this because it'll crash, so enable that compiler option and the compiler will work around it".

In short, your most likely source of crashes is bugs in your code. Until you've made pretty damn sure this isn't the case, don't worry (yet) about more esoteric failure modes.