SEU problems in LHCb front-end electronics (lhcb-elec)

From: Jorgen.Christiansen@cern.ch
Date: 7/13/00
Time: 11:07:14 AM
Remote Name: 137.138.142.33
Remote User:

Comments

Dear LHCb electronics designers,

As you should all know by now, we have to take the problem of Single Event Upsets (SEU) in the front-end electronics quite seriously. If we suspect that a certain front-end implementation may have a significant SEU problem, then alternative solutions will have to be found.

Estimating the SEU rate of a given implementation in the LHCb radiation environment is far from a simple task. For now, though, we do not need very detailed information on the SEU problem. As a first step we only need to understand its order of magnitude, so that we can analyze alternative implementation scenarios.

For SEU problems in different functional parts of the front-end, it is vital to estimate the effect on the performance of the whole LHCb experiment. It is clear that the loss of data from a single (or a few) detector channels will not have a serious effect on the total LHCb system. For this to be true, a localized error must be prevented from propagating to higher levels of the system. A problem in a module handling a whole sector of a detector, or in a module of the trigger system, will obviously be fatal to the correct function of the whole experiment and must be corrected immediately.

It is also important to understand which actions have to occur at the system level to correct such an error condition. If the error condition can be resolved with a front-end reset (L0 reset or, more seriously, L1 reset), the time to recover will be very short (once again: only if the error has not propagated to higher levels). If the front-end needs to be reconfigured by the Experimental Control System (ECS), then a significant time (seconds) will be required. The worst that can happen is that something is wrong but we do not detect that the system is not working properly, and significant amounts of the data written to tape are corrupted.
It is important that extensive checking features are built into all the electronics. We have to assume, to some extent, that we have an unreliable system which must be used to collect reliable data!

To make some first estimates of the order of magnitude of this SEU problem, one must work with a set of simplified assumptions. Such assumptions can be:

WHICH ELEMENTS:
We do not consider SEU in the event data itself (except in event headers). We only consider problems which may cause multiple events to be affected. This immediately points to a certain set of critical points:
- Setup parameters in the front-end (gain, thresholds, LUT, etc.)
- State machines
- Bunch and Event ID counters
- Programming data of FPGAs
- Instruction (and to some extent data) memories and caches of DSPs and CPUs

RADIATION ENVIRONMENT:
The radiation environment in the LHCb experiment, as far as the SEU problem is concerned, is currently not very well known. Using the following assumptions, one can nevertheless get some first estimates:
- We assume that there are no heavily ionizing particles.
- We can assume that hadrons (protons, neutrons, pions, ...) are the major cause of SEU.
- We can assume that hadrons with an energy below 10 MeV cannot provoke SEU.
- For electronics located in electronics racks (on the periphery of the major detectors or on the electronics balconies), one can use, as a starting point, a conservatively estimated flux of 1e11 hadrons/cm2/year with an energy above ~10 MeV (taken at the edge of the Ecal from LHCb note 2000-15).
- For electronics inside detectors, specific estimates must be made.

SENSITIVITY OF COMPONENTS:
- The SEU thresholds for hadrons can be assumed to be higher than ~10 MeV.
- The SEU cross section is in the range of 1e-15 to 500e-15 cm2 per bit for SRAM or flip-flops in most common IC technologies.
More detailed information on SEU cross sections of standard components can be found at:
http://radnet.jpl.nasa.gov/Compendia/P/ProtonSeeCompendium.htm

Many useful links to papers and databases on radiation effects can be found on the RD49 links web page:
http://rd49.web.cern.ch/RD49/Links.html
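
With the assumptions listed above, a first-order estimate is simply the number of sensitive bits times the cross section per bit times the hadron flux. The following is a minimal Python sketch of that product, not a validated tool. The flux and the cross-section range are the values quoted above; the function name and the 100 kbit device size are hypothetical, chosen only for illustration.

```python
# First-order estimate: upsets/year = bits x cross section per bit x flux.
# Flux and cross-section range are the assumptions quoted in the text;
# the 100 kbit device size below is a hypothetical illustration value.

FLUX_HADRONS_PER_CM2_YEAR = 1e11   # hadrons above ~10 MeV, rack locations
SIGMA_MIN_CM2_PER_BIT = 1e-15      # lower end of the quoted range
SIGMA_MAX_CM2_PER_BIT = 500e-15    # upper end of the quoted range


def upsets_per_year(n_bits, sigma_cm2_per_bit, flux=FLUX_HADRONS_PER_CM2_YEAR):
    """Expected number of bit upsets per year in a single device."""
    return n_bits * sigma_cm2_per_bit * flux


n_bits = 100_000  # hypothetical device with 100 kbit of SRAM or flip-flops
low = upsets_per_year(n_bits, SIGMA_MIN_CM2_PER_BIT)
high = upsets_per_year(n_bits, SIGMA_MAX_CM2_PER_BIT)
print(f"{n_bits} bits: between {low:.0f} and {high:.0f} upsets per year")
```

The spread between the two ends of the cross-section range is a factor of 500, which is why a measured cross section for the actual component matters whenever the estimate comes out close to what a sub-system can tolerate.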

EXAMPLE:
As an example, take a Xilinx XC4010XL-4 (10-20K gates, 245 Kbits of programming data), which has been measured to have a proton cross section of 4.4e-15 cm2 per bit. This gives an upset rate per chip per year of:

245k x 4.4e-15 x 1e11 = about 108 upsets per year.

This may not seem too serious at first sight, but when multiplied by the number of chips in a sub-system (let's assume 1000 FPGAs in one sub-system) it results in a fault every few minutes! Such a level of upsets in components which have to be reconfigured from the ECS is unacceptable, or must be handled in very particular ways.

CONCLUSIONS:
Be very careful with the use of devices which have significant amounts of memory:
- SRAM-based FPGAs
- Look-Up Tables (LUT)
- Processors (e.g. DSPs)

The SEU rate of simple registers in front-end chips inside detectors may also be problematic if the hadron flux is very high.

Just to let you know: when the different electronic systems are reviewed, significant emphasis will be put on the handling of SEUs and other error conditions. I hope this email can be taken as a combined warning and help, giving you a better feeling for the SEU problems you have to handle in your designs by the time our experiment finally has to run, from 2005.

Easy solutions to this problem are hard to find, but a few options can be mentioned:
- Reduce to a minimum the use of devices with large memories.
- Ensure quick error detection and recovery.
- Implement redundancy on critical functions.
- Improve local shielding.
- Move behind the radiation shielding wall.

Best regards
Jorgen
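
The arithmetic of the FPGA example can be checked with a few lines of Python. This is only a sketch under stated assumptions: "245 Kbits" is read as 245,000 bits, the yearly flux is treated as spread evenly over a 365-day year, and all function and variable names are illustrative.

```python
# Reproduces the order-of-magnitude arithmetic of the XC4010XL example.
# Assumptions: 245 Kbits = 245e3 bits; flux spread evenly over 365 days.
# All names are illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60


def chip_upsets_per_year(config_bits, sigma_cm2_per_bit, flux_per_cm2_year):
    """Expected configuration upsets per year for one chip."""
    return config_bits * sigma_cm2_per_bit * flux_per_cm2_year


def minutes_between_upsets(upsets_per_chip_year, n_chips):
    """Mean time in minutes between upsets anywhere in a group of chips."""
    return MINUTES_PER_YEAR / (upsets_per_chip_year * n_chips)


per_chip = chip_upsets_per_year(245e3, 4.4e-15, 1e11)
interval = minutes_between_upsets(per_chip, 1000)
print(f"per chip: {per_chip:.1f} upsets per year")
print(f"1000 chips: one upset roughly every {interval:.1f} minutes")
```

Because the rate scales linearly with the number of chips, a figure that looks tolerable for one device can dominate the dead time of a sub-system once every upset requires a reconfiguration from the ECS.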