Radiation Hardness Assurance

Databases of component tests, standards etc

European Space Components Information Exchange System ESCIES

CERN Radiation Working Group RadWG


Radiation Facilities

New for 2014: CHARM facility at CERN

Exhaustive List of facilities (from RADECS 2011)

Commonly used facilities:

Neutrons: Prospero, CEA-Valduc, France: IEEE paper

Neutrons: CERN East Hall irradiation facility

Heavy Ions/Protons/Neutrons/Gammas at Louvain la Neuve, Belgium: homepage

Heavy ions at Brookhaven, USA: homepage

Protons: PSI, Switzerland

Protons: CERN East Hall irradiation facility

X-rays: CERN PH-ESE X-ray facility


FPGAs in radiation

TWIKI compiled by Jorgen Christiansen

ASICs in radiation

Radiation effects on deep sub-micron CMOS (compiled by Kostas Kloukinas)

Radiation tests on AMS 0.35um CMOS

More useful links for radiation effects are available here.


Radiation levels (at nominal luminosity of 2 x 10^32 cm^-2 s^-1)

The results of the Monte Carlo simulations of the radiation levels in LHCb can be found on the LHCb background web pages.

Radiation levels for selected regions known to contain significant amounts of electronics have been extracted from these simulations and are shown in tabular form. For each defined region maximum and average values are given.
These numbers include the uncertainties on the simulations (factor 2) but NOT the safety factors required for qualification (2) and component variation (2 - 100).

 

Ensuring reliable operation of electronics in a radiation environment is in most cases a complicated, time consuming and expensive task. Normal commercial electronic components are not qualified to work in environments with radiation, and specially designed and/or specially qualified components are required above a radiation dose of a few hundred rad.

Radiation effects

Radiation effects on electronics are normally divided into 3 different categories according to their effect on the electronic components:

Total ionizing dose (TID): Total ionizing dose effects on modern integrated circuits cause the threshold voltage of MOS transistors to change because of trapped charges in the silicon dioxide gate insulator. For sub-micron devices these trapped charges can potentially "escape" by tunneling effects. Leakage currents are also generated at the edge of (N)MOS transistors and potentially between neighboring N-type diffusions. Commercial digital CMOS processes can normally stand a few krad without a significant increase in power consumption. Modern sub-micron technologies tend to be more resistant to total dose effects than older technologies (in some cases up to several hundred krad). High performance analog devices (e.g. amplifiers, ADCs, DACs) may however be affected at quite low doses. Total dose is measured in rad or gray (1 Gy = 100 rad).

Displacement damage: Hadrons may displace atoms in the silicon lattice of active devices (hence the name displacement damage) and thereby affect their function. Bipolar devices and especially optical devices (e.g. lasers, LEDs, optical receivers, opto-couplers) may be very sensitive to this effect. CMOS integrated circuits are normally not considered to suffer degradation by displacement damage. The total effect of different types of hadrons at different energies is normalized to 1 MeV neutrons using the NIEL (Non Ionizing Energy Loss) equivalent.

Single event effects (SEE): Single event effects are not cumulative; they are caused by single individual interactions in the silicon. Highly ionizing particles can directly deposit enough charge locally in the silicon to disturb the function of electronic circuits. Energetic hadrons (> ~20 MeV) can, through nuclear interactions within the component itself, generate recoils that also deposit sufficient charge locally to disturb the correct function. The different single event effects are normally characterized by an energy threshold and a sensitivity cross-section at energies well above the threshold.
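The way a cross-section translates into an expected event rate can be sketched as follows. This is a minimal sketch, assuming a constant hadron flux above the energy threshold and devices that fail independently; the function name and all numbers in the example are hypothetical and are not LHCb values.

```python
def see_rate_per_second(cross_section_cm2, flux_per_cm2_s, n_devices=1):
    """Expected single event rate (events/s).

    rate = cross-section (cm2 per device) x flux (particles/cm2/s) x devices
    """
    return cross_section_cm2 * flux_per_cm2_s * n_devices


# Hypothetical example: 1e-13 cm2 per device, 1e4 hadrons/cm2/s, 1000 devices.
rate = see_rate_per_second(1e-13, 1e4, 1000)
hours_between_events = 1.0 / rate / 3600.0
print(f"rate = {rate:.2e} events/s")
print(f"mean time between events = {hours_between_events:.0f} h")
```

Because the rate scales with the number of devices, a cross-section that looks negligible for one chip can still dominate the failure rate of a large system.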

Single event upset (SEU): The deposited charge is sufficient to flip the value of a digital signal. Single event upsets normally refer to bit flips in memory circuits (RAM, latches, flip-flops) but may in some rare cases also directly affect digital signals in logic circuits.

Single event latchup (SEL): Bulk CMOS technologies (not Silicon On Insulator) have parasitic bipolar transistors that can be triggered by a locally deposited charge to generate a kind of short circuit between the power supply and ground. CMOS processes are designed to prevent this from occurring under normal operating conditions, but a local charge deposition from a traversing particle may potentially trigger this effect. Single event latchup may be limited to a small local region or may propagate to affect large parts of the chip. The large currents caused by this short circuit effect can permanently damage components if they are not externally protected against the large short circuit current and the related power dissipation.

Single event burnout (SEB): Single event burnout refers to destructive failures of power MOSFET transistors in high power applications. For HEP applications this destructive failure mechanism is normally associated with failures in the main switching transistors of switching mode power supplies.

Radiation qualification

The radiation hardness qualification of electronic components is a complicated task, made difficult by limited access to radiation testing facilities and the observed variability of the radiation hardness of normal commercial components. A whole set of radiation tests normally needs to be performed to ensure sufficient immunity to the different effects. A full radiation hardness qualification normally consists of the following tests:

A: Total dose using an X-ray source (X-ray tests must be performed on unpackaged chips) or a Cobalt 60 gamma source.
B: Displacement damage using neutrons from a nuclear reactor or special neutron sources (e.g. Prospero).
C: Single event tests using high energy proton beams (60 MeV and above) and/or ion beams (e.g. at Louvain, PSI, Uppsala, etc.).

Total dose effects will normally show significant annealing after irradiation. It is therefore important to monitor the behavior of irradiated components over a sufficiently long time period after irradiation, until their behavior has stabilized. The required annealing time can be shortened significantly by increasing the component temperature (e.g. 100 degrees centigrade).

It has in some cases been found with bipolar technologies that the induced radiation damage gets worse at low dose rates. This effect, called the Low Dose Rate (LDR) effect, will in some cases require tests to be performed at relatively low dose rates, "comparable" to the dose rates seen in the final application, and therefore enforces long test periods.

For Commercial Off The Shelf (COTS) components the radiation qualification is often seriously compromised by the fact that components with the same component identification may come from alternative fabrication lines with significantly different radiation hardness characteristics.

As previously mentioned, single event effects are statistical effects. The fact that no single event effects have been observed in a component during radiation testing is no guarantee that other components (or for that matter the same one again) will not experience a single event effect after a low flux of particles. If a large number of components have been tested, and a statistically sufficient number of single event occurrences (e.g. 10-100) have been seen, then one has quite good knowledge of the sensitivity of a given component. If one single component has only been tested with a limited total flux, without any single event effects observed, one cannot exclude that single event effects will be observed in the final system (which contains many chips). It is therefore important to evaluate whether the tests performed give sufficient statistical assurance that a large number of failures will not occur in the final system. Single event latchup, which may be destructive, and single event burnout are major concerns in this context. To ensure a high resistance to single event latchup it can in some cases be required to perform tests with ion beams, as these have a much higher "cross section" for generating single event effects (they do not depend on the statistics of nuclear reactions of a proton with an atom within the component itself).
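The statistical argument above can be made concrete with a small calculation. This is a minimal sketch, assuming Poisson statistics and zero observed events; in that case the upper limit on the expected number of events at a confidence level CL is -ln(1 - CL), about 3.0 at 95%. The function names and the example numbers are hypothetical.

```python
import math


def cross_section_upper_limit(fluence_per_cm2, n_tested=1, confidence=0.95):
    """Upper limit on the per-device cross-section (cm2) after a test
    in which no single event effect was observed."""
    n_up = -math.log(1.0 - confidence)  # Poisson limit for zero observed events
    return n_up / (fluence_per_cm2 * n_tested)


def expected_events(cross_section_cm2, fluence_per_cm2, n_devices):
    """Expected number of events in a system of identical devices."""
    return cross_section_cm2 * fluence_per_cm2 * n_devices


# Hypothetical test: one device, 1e11 protons/cm2, no events observed.
limit = cross_section_upper_limit(1e11)

# Hypothetical system: 1000 devices, each receiving 1e10 hadrons/cm2 per year.
not_excluded = expected_events(limit, 1e10, 1000)
print(f"cross-section upper limit: {limit:.2e} cm2")
print(f"events per year not excluded by the test: {not_excluded:.0f}")
```

With these assumed numbers a clean test of a single device still leaves room for roughly 300 events per year in the full system, which is why the tested fluence and the number of tested devices must be matched to the size of the final system.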

The radiation qualification for relatively low radiation levels can in some cases be performed by a single test using high energy protons. High energy protons will give a combined effect of displacement damage, total dose damage and will finally be the potential cause of single event effects via nuclear reactions within the silicon. A quick estimate of total dose effects and displacement damage from protons can be obtained from the table below.

Proton energy    TID from 10^11 protons/cm2    NIEL from 10^11 protons/cm2
50 MeV           ~14 krad                      1.8 x 10^11 n/cm2
200 MeV          ~6 krad                       1.0 x 10^11 n/cm2
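The table above can be used to estimate the damage from an arbitrary proton fluence. The following is a minimal sketch, assuming that both total dose and NIEL scale linearly with fluence and using the approximate table values as they stand; the dictionary and function names are hypothetical.

```python
# Approximate values per 1e11 protons/cm2, copied from the table above.
PROTON_TABLE = {
    50: {"tid_krad": 14.0, "niel_n_cm2": 1.8e11},
    200: {"tid_krad": 6.0, "niel_n_cm2": 1.0e11},
}


def proton_damage(energy_mev, fluence_p_cm2):
    """Return (TID in krad, 1 MeV neutron equivalent fluence in n/cm2)
    for one of the tabulated proton energies, scaled linearly with fluence."""
    row = PROTON_TABLE[energy_mev]
    scale = fluence_p_cm2 / 1e11
    return row["tid_krad"] * scale, row["niel_n_cm2"] * scale


# Hypothetical test: 5e11 protons/cm2 at 200 MeV.
tid, niel = proton_damage(200, 5e11)
print(f"TID ~ {tid:.0f} krad, NIEL ~ {niel:.1e} n/cm2")
```

Only the two tabulated energies are supported; interpolating between them would be a further assumption that the table does not justify.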

When performing radiation hardness qualification tests it is very important to conform to well defined test procedures. Well defined testing procedures allow radiation test results to be compared with tests of other similar circuits. If well defined procedures have not been followed a whole series of doubts will quickly surface (dose rate used, annealing procedure and time, etc.). ATLAS has already invested significant effort in defining such testing procedures.

Safety factors

The radiation level estimates for LHCb are generated from Monte Carlo simulations with FLUKA (previous results from MARS are also available). Uncertainties related to the Monte Carlo simulations and their assumptions on interaction models are normally estimated to be of the order of a factor two. The radiation hardness qualification of components will also have uncertainties associated with it (e.g. dosimeter uncertainties). The components themselves may have significant uncertainties depending on their origin. ASICs from a well defined processing batch will only have a relatively small uncertainty on their measured radiation hardness. Commercial components purchased as a single lot from a well defined production batch will normally also have only limited radiation hardness variations. Commercial components purchased in different lots from independent distribution sources can be expected to have significant variations in radiation hardness.

The safety factors to apply to the qualification of components for LHCb strongly depend on the type of radiation effect, the type of component and the specific use of the component in LHCb. The final choice of safety factors boils down to general risk management. The total risk of failing components compromising the correct function of LHCb must be minimized within an acceptable budget. Components used in locations where they cannot easily be exchanged must be qualified with significant safety factors. Components (e.g. modules) that can be exchanged within a few hours can potentially be qualified with lower safety factors.

A clear distinction must be made between accumulated effects and single event effects. Single event effects are of a statistical nature and may therefore occur at any time and at any place (with a rate proportional to the flux of particles and the sensitivity of the components). For single event effects it is important to ensure that the time between failures is sufficiently long to guarantee effective running of the whole experiment over extended periods.

Single event upsets can be recovered by a re-initialization of the electronics, which can be done at several levels. State-machines or pipeline registers can normally be recovered by a "simple" reset. Single event upsets in configuration registers will require a reloading of parameters via ECS. In both cases it will be necessary to restart active data taking with the DAQ system. It is important to ensure that this kind of soft failure does not occur so often that the system spends a significant part of its time resolving random single event upsets. Single event upsets that prevent single detector channels from working correctly can in many cases be accepted during limited time periods, if and only if this does not significantly affect the physics and the triggers of the experiment. Bit flips in the event data itself can normally be tolerated if they do not have any effect on the correct handling of following events.

Single event latchup (and single event burnout) will in many cases be a fatal failure requiring repair, unless special latchup protection circuits have been used. Single event latchup must therefore be proven to happen sufficiently seldom that the whole LHCb experiment can work for several weeks without repair. Even hard failures can in some cases be accepted during extended periods if it can be guaranteed that the failure does not seriously affect the physics performance of the experiment.
In many cases a few local "dead" detector channels will not have a significant effect on the physics of the experiment. It must however be ensured that local failures are prevented from disturbing higher levels of the system and thereby affecting data collected from other parts of the detector.

Cumulative effects risk making large parts of an electronic system unusable after a given radiation threshold has been reached. Such a situation may occur after several years of operation, at a time when the components used have become obsolete and cannot be purchased commercially. For systems with large variations in radiation levels between different parts of the system (e.g. a small part of the front-end electronics very close to the beam line) it can be envisaged to exchange limited parts of the electronics system after a certain number of years. For systems with a more uniform radiation exposure it is unrealistic to start exchanging components when they start to fail one by one. In this case it must be proven that the system can stand the radiation levels over many years of operation (10 years).

Safety factors for cumulative effects (total dose and displacement damage)

    Simulation uncertainty:                             Factor 2
    Radiation qualification uncertainty:                Factor 2
    Component to component variation:
        Same fabrication line, no technology changes:   Factor 2
        Same manufacturer, unknown fabrication line:    Factor 100
        Different manufacturer:                         Re-qualification required

For cumulative effects these factors multiply up to total safety factors between 8 and 400. Large safety factors are clearly required if the components used do not come from the same fabrication line (or a similar line with the same process) as the components initially qualified for radiation resistance. It is in fact quite difficult in practice to guarantee that commercial components come from production lines with the same process characteristics. The safety factor related to this (100) can however be lowered significantly (to 2) if a new production lot, all coming from the same production line (but not necessarily the same line as the qualified components), is re-qualified by testing a new set of samples from the final lot.
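The multiplication of the individual factors can be written out as a short calculation. This is a minimal sketch using the factors listed above; the names are hypothetical, and the input dose is assumed to be a raw simulated value to which no safety factor has yet been applied.

```python
SIMULATION_FACTOR = 2
QUALIFICATION_FACTOR = 2
COMPONENT_VARIATION_FACTOR = {
    "same_fabrication_line": 2,
    "unknown_fabrication_line": 100,
}


def total_safety_factor(origin):
    """Product of the three safety factors for cumulative effects."""
    return (SIMULATION_FACTOR
            * QUALIFICATION_FACTOR
            * COMPONENT_VARIATION_FACTOR[origin])


def required_qualification_dose(simulated_dose_krad, origin):
    """Dose a component must be shown to survive, in krad."""
    return simulated_dose_krad * total_safety_factor(origin)


# Hypothetical location with a raw simulated dose of 1 krad over 10 years.
for origin in COMPONENT_VARIATION_FACTOR:
    dose = required_qualification_dose(1.0, origin)
    print(f"{origin}: factor {total_safety_factor(origin)}, qualify to {dose:.0f} krad")
```

A different manufacturer is deliberately absent from the dictionary, since that case calls for re-qualification rather than a larger factor.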

The safety factor of 100 for the case of an unknown fabrication line does not, strictly speaking, make sense, since certain technology changes may have very large effects on the radiation tolerance and are in principle unpredictable. The factor has been chosen to give some confidence that in practice things will not get worse than this. In most cases such a large safety factor will enforce a re-qualification, except for locations with very low radiation levels (e.g. the concrete tunnel).

The minimum total safety factor of 8 can only be decreased in special, justified cases. If it can be justified that the radiation qualification has been made with very precise monitoring of the radiation levels, the safety factor related to this can potentially be decreased. If a thorough radiation qualification of a statistically significant part of the final production batch has been made (more than 10 units), the safety factor related to component to component variations can potentially be decreased. The acceptance of such exceptional cases can only be made after a special review organized by the electronics coordinator and a final acceptance by the LHCb technical board.

Safety factors for single event effects

    Simulation uncertainty:                             Factor 2
    Radiation qualification uncertainty:                Factor 2

As single event effects are a question of statistics, the number of possible failures must be estimated and the effect of the failures on the system must be evaluated. No strictly defined radiation hardness criteria can therefore be given. From a system perspective a few guidelines can nevertheless be defined. Acceptable failure rates for different failure types can be defined at the system level. To define acceptable failure rates for individual sub-systems, a simple model assuming ~10 independent sub-systems (individual sub-detectors, L0 trigger, L1 trigger, DAQ, etc.) in LHCb is used. Failure rates must be handled differently according to the following classification:

Single bit flips in event data with no effect on following events. Single bit errors in event data can in general be accepted when it is assured that it will not have any negative effect on following events. Any bit flips in event headers and trailers will in many cases have effects on the system synchronization and can therefore not be considered to belong to this failure class.

SEU requiring reset of front-end electronics and re-synchronization of the DAQ system. Bit flips in state-machines, pipeline registers and other synchronization logic in the front-end electronics, that only need a front-end reset sequence (L0 + L1 front-end reset) to recover correct function, can potentially be handled at rates up to several times per minute if really required. Such high reset rates will however have considerable effects on the running of the trigger and DAQ systems and will need many careful considerations and optimizations. The initial goal for the whole experiment is that this should not be performed more than once per hour, giving a maximum failure rate of the order of once per day per sub-system.

SEU requiring instant re-initialization of front-end electronics via ECS. Bit flips in setup and configuration registers in the front-end electronics will need to be corrected by the ECS system downloading new configuration data. The ECS system will initially be implemented to be capable of downloading the complete configuration of the whole LHCb experiment within tens of minutes. As an initial goal it will not be accepted to perform an ECS reconfiguration of the LHCb system more often than once per day. This translates into a maximum failure rate per sub-system of once per week. At later stages a faster and more intelligent ECS system, performing only a partial re-configuration of systems with detected failures, can be envisaged, but this will define a whole set of special requirements on the ECS system. As the ECS system and its interfaces to the front-end electronics are required to work correctly to perform such a re-initialization, all ECS interfaces must be basically immune to all single event effects. It is important to have built-in verification schemes that allow bit flips in important configuration registers to be detected. Otherwise the experiment may continue to run for extended periods without detecting that collected data is corrupted. Simple schemes like parity checking can immediately detect such cases. A less efficient approach is to let the ECS constantly read back configuration data and check its content. If a sub-system relies on such a continuous read-back of configuration data, this must have been clearly agreed upon with the ECS system coordinator, as it can pose a significant load on the system. Online verification software in the DAQ system will finally perform histogramming and cross-correlations on a limited event sample. Such high level verification routines will obviously need quite some time to determine that certain detector channels are malfunctioning.

SEU in configuration data that can wait for next planned re-configuration. For bit flips in configuration data that do not need immediate correction for LHCb to continue to work efficiently, one can wait for the next reconfiguration to be made (~once per day). Even in this case it is important to have schemes to detect that such a condition has occurred.

Hardware failures requiring instant repair. Fatal system failures requiring instant repair for the experiment to work are obviously the most serious failure mechanism. Such failures can be caused by single event latchup in integrated circuits or single event burnout in a power supply (or other possible failure mechanisms not related to radiation). The time required to repair such failures strongly depends on the location of the electronics:

Counting room: Radiation levels in the counting room are so low that electronics will not be affected and immediate access can be given while LHC is running. Electronics can be repaired with a few hours notice (assuming spare parts are available).

Cavern with insignificant residual radiation: Can in principle be accessed with a few days notice. Access periods will strongly depend on the running conditions of the LHC machine itself. At startup of LHC it is planned to have short weekly shut-downs of the LHC machine. When reliable operation of LHC is achieved, access to the cavern will depend on agreed shut-down periods between all the LHC experiments and the LHC machine. 
Maximum one failure per month for the whole of LHCb; maximum one failure per year per sub-system.

Cavern with residual radiation: Residual radiation will in some cases limit access to long shutdown periods (~once per year) where things have time to cool down. The regions around the interaction point (vertex tank) and the beam pipe will in this respect pose potential problems. Electronics where a single point failure can prevent LHCb from collecting useful physics data should never be placed in zones with significant residual radiation.
Maximum once per 10 years per subsystem (in principle never)

Inside detectors: Electronics located inside detectors can only be repaired when detectors are open which can only be done during long shut-down periods (once per year). 
Electronics modules vital for the global LHCb experiment must never be placed inside detectors. 

Hardware failures not requiring instant repair: Electronics dealing with a limited number of isolated detector channels can normally be accepted to have hardware failures for limited time periods without seriously affecting the physics of LHCb. For electronics located in the cavern without residual radiation, these failures will be repaired at the first possible occasion (~once per month). For electronics located in zones with significant residual radiation, or within the detector itself, repairs can only be performed once per year and it must be ensured that only an insignificant number of detector channels will be lost over a period of one year of running.
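The per-sub-system limits quoted in this classification follow from sharing each system-level limit between the ~10 independent sub-systems of the simple model. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and a month is taken as 30 days for illustration.

```python
N_SUBSYSTEMS = 10  # simple model: ~10 independent sub-systems in LHCb


def per_subsystem_interval_hours(system_interval_hours, n_subsystems=N_SUBSYSTEMS):
    """Minimum mean time between failures for one sub-system, in hours,
    so that the whole system stays within its own limit."""
    return system_interval_hours * n_subsystems


# System-level limits quoted in the text, expressed as hours between failures.
system_limits_hours = {
    "front-end reset (once per hour)": 1.0,
    "ECS re-configuration (once per day)": 24.0,
    "repair in cavern (once per month)": 30 * 24.0,
}

for name, hours in system_limits_hours.items():
    days = per_subsystem_interval_hours(hours) / 24.0
    print(f"{name}: one failure per {days:.1f} days per sub-system")
```

The exact results (10 hours, 10 days and 300 days) are rounded in the text to order-of-magnitude goals of once per day, once per week and once per year.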

Radiation levels

The results of the Monte Carlo simulations of the radiation levels in LHCb can be found on the LHCb background web pages.

Radiation levels for selected regions known to contain significant amounts of electronics have been extracted from these simulations and are shown in tabular form (LHCb light configuration). For each defined region maximum and average values are given.
These numbers include the uncertainties on the simulations (factor 2) but NOT the safety factors required for qualification (2) and component variation (2 - 100).

"Safe" radiation levels

In principle one would like to define a set of radiation levels for the different effects below which one can assume that normal commercial electronics can be used without passing a radiation qualification procedure. In practice it is unfortunately very difficult (impossible) to define such realistic safe radiation levels. The radiation hardness of normal electronic components has been seen to vary over many orders of magnitude depending on the technology used (bipolar, old CMOS, modern sub-micron CMOS, SOI, GaAs, etc.) and the type of circuit (digital, analog, optical, etc.). For a given technology one can get some indication of radiation hardness by looking at radiation tests of similar circuits in the same technology. Most integrated circuits have been found to work correctly up to 1 krad total dose, but several exceptions exist. For single event effects no safe levels can be defined, as it is a question of statistics (sensitivity, flux, number of circuits in the system, etc.) and of the use of the component in the system.

Based on radiation tests performed in the space and the HEP community certain trends can though clearly be seen:

Old CMOS technologies (> 0.35 um):
Insensitive to displacement damage.
Digital circuits have in general been seen to work up to a few (10) krad total dose with an acceptable increase in power consumption.
Large variations in sensitivity to TID have been seen for analog circuits.
Large variations in sensitivity to SEL have been observed.

Modern CMOS technologies (< 0.35 um):
Modern CMOS technologies have in general been seen to be more resistant to radiation as trapped charges in the gate-oxide can escape by tunneling.
Insensitive to displacement damage.
Digital circuits have in general been seen to work up to several tens of krad total dose with slightly increased power consumption.
Large variations in sensitivity to TID have been seen for analog circuits.
Most modern technologies have been found to have a good immunity to SEL but a few problematic cases have been reported.

Bipolar technologies:
Normally insensitive to SEU and SEL.
Certain technologies have shown large sensitivity to displacement damage.
Some circuits have been found to have a significant low dose rate effect.
High speed bipolar technologies have been seen to be less sensitive than slower standard technologies.

Optical components:
Have in several cases been found to be very sensitive to displacement damage.

Handling of LHCb radiation hardness policy

This web page defines the basic LHCb policy on radiation hardness assurance as accepted by the Technical Board (September 2002). Radiation levels in locations with electronics are given and a set of safety factors to apply to these levels has been defined. All electronic circuits to be used in the experimental cavern must be qualified to stand the defined radiation levels including safety factors. For locations with exceptionally low radiation levels (like the concrete tunnel) it can in certain cases be accepted to skip the qualification procedure for certain types of circuits, but only after official acceptance by the LHCb management (via the electronics coordinator). Failures from single event effects must be carefully estimated, classified according to the classification given in this document and verified against the acceptable failure rates. It is the responsibility of the sub-detector coordinator, together with the electronics contact person of the sub-detector, to define a front-end and readout electronics implementation that complies with the defined LHCb radiation hardness policy. In cases where the defined radiation hardness policy cannot be fulfilled within budget, manpower or technical constraints, the LHCb management must be informed (via the electronics coordinator) at the earliest possible time. The LHCb management can in certain cases, after a careful investigation, decide to accept exceptions to the defined policy.

Each sub-system with electronics in the experimental cavern is expected to document (as an LHCb note) the approach chosen to handle the radiation tolerance problem in its system. Such a document must contain the following information:

Definition of system and requirements to each sub-system
Definition of individual components: technology, supplier, function, etc.
Identification of critical components for radiation hardness
Estimates of failure rates for single components and the total sub-system according to the defined failure classification
Test and qualification requirements
Test and qualification procedures
Radiation hardness assurance in final production
Repair and maintenance scenario

Conformity to the defined rules on radiation hardness will be verified during reviews of the electronic systems. It is however clear that a production readiness review is far too late to discover a significant radiation hardness problem. Problems with conformity to the defined policy must be signaled to the LHCb management as early as possible to allow time to react to a given problem.

Radiation testing

As previously mentioned, it is vital to use well defined testing methods to obtain reliable test results (use defined ATLAS procedures). Circuits should if possible be tested to radiation levels where they fail to know the radiation hardness margins of the individual components.

The LHC community has to a large extent converged to use a limited set of radiation testing facilities. The most common are listed below and more details can be obtained by consulting the ATLAS radiation facilities web page or by contacting Federico Faccio.

Louvain-la-Neuve, Belgium: Cyclotron, Protons and heavy ions

PSI, Switzerland: Cyclotron, Protons and Pions

Prospero, France: Neutron source

CERN micro electronics group: X-ray for naked chips

Multiple Cobalt 60 gamma sources.

Radiation hard/tolerant design

For environments with high levels of radiation, special technologies made to be immune to radiation must often be used (e.g. DMILL). Modern sub-micron CMOS technologies can often also be used in high radiation environments if special precautions are taken in their design (e.g. enclosed transistors with guard rings).

Basically all CMOS technologies will be sensitive to single event upsets in their memory elements unless special schemes have been used. The general principles used to become insensitive to single event upsets are triple redundant logic and memories with error correcting codes (e.g. Hamming coding). Circuits with large memories and SRAM based FPGAs should only be used in radiation environments after a careful analysis of single event upset problems.
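The principle of triple redundancy can be illustrated with a bitwise majority voter. This is a minimal software sketch of the idea, not a description of any specific LHCb circuit: three copies of a register are kept, and each output bit takes the value held by at least two of the copies. The function name and the register values are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three copies of a register value.

    Each output bit is 1 if at least two of the three inputs have that bit set,
    so a bit flip in any single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


stored = 0b10110010                 # value written to all three copies
upset = stored ^ 0b00001000         # one copy suffers a single bit flip
copies = (stored, upset, stored)

recovered = majority_vote(*copies)
print(f"recovered value matches stored value: {recovered == stored}")
```

The voter only masks an upset in one copy per bit position; if upsets are allowed to accumulate in two copies of the same bit, the vote returns the wrong value, which is why the copies are normally refreshed with the voted result.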

The problem of single event burnout in power MOSFETs can in many cases be resolved by using a de-rating factor of ~2 of the main voltage and current limitations of the power transistor (implies redesign of power supply).

This page was last modified by KW on July 12, 2017.