Error checking and recovery

During data taking with the complete LHCb experiment, large amounts of data must be collected from many channels (~1 million) at high speed (40 MHz). Correct processing of all data in the trigger systems and in the DAQ system relies on the correct synchronization and correct function of a large number of front-end chips (~50 k), modules (~1500), links (~10 k) and processors (thousands). Malfunctions (hard and soft) of components in such a large system must be expected during data taking, making it vital to include extensive error checking and recovery functions in the system.

A large part of the electronics in the front-end system has to work in an environment with radiation. This radiation will not only limit the lifetime of the components used (total dose effect) but will also provoke intermittent malfunctions (single event upsets). Single event upsets will in some cases corrupt data being collected for an event. This will in most cases not be a serious problem, as corrupted data can be identified and corrected during detailed event reconstruction. Single event upsets in the control part of the front-end will, on the other hand, generate serious problems. If a part of the front-end loses correct clock synchronization or loses an entire event fragment without this being identified, large amounts of event data from the experiment will be corrupted.

Several levels of error checking/recovery functions must be built into the front-end electronics and DAQ system:

Generation of special error checking events.
Running self-tests at regular intervals.
Re-initializing the system at regular intervals.
Continuous self-check of front-end parameters.
Monitoring of buffer overflow conditions.
Error detecting/correcting codes on critical event data (e.g. parity, Hamming, CRC).
Extensive crosschecks between event data from different sources.
Addition of special error checking data to events (e.g. bunch identification, event identification).
Self-checking state machines and logic (e.g. one-hot encoded with continuous state verification).
Single event upset immune architectures (e.g. 3-to-1 majority logic).
Watch dog timers.
etc.

One of the major problems in deciding which level of error checking and error recovery to build into the system is the difficulty of determining the likely error rate of the final system. Taking into account the size and complexity of the system to be built and the hostile environment in which it has to work, it is clear that an extensive level of error checking and error recovery functions is desirable. The first priority is to implement checks for errors that may influence a large set of events. Errors in single isolated events are, as previously explained, not a major concern. The prime concern for the front-end is the correct synchronization between the different sub-systems and the correct merging of event fragments into consistent blocks of event data to send to the DAQ system.

A specific problem relates to the extensive use of high speed optical links. Such links need to obtain and maintain continuous bit, word and frame synchronization to work correctly in the system. In many cases a simple single bit error (e.g. radiation induced) will translate into the loss of word and frame synchronization because of the special encoding scheme used on such links (e.g. 8B/10B encoding). To resynchronize a link it will in most cases be necessary to send special synchronization patterns. This can be done on demand from the ECS system, but this will be relatively slow and will therefore only work in systems where the problem occurs at a very low rate. Alternatively the links can send such synchronization patterns at regular intervals (e.g. once per machine cycle or between each event fragment) or when no active data are transported on the link (e.g. for readout links).

Local recovery from detected error conditions would be ideal, but will in practice most likely be quite difficult to implement. The exact cause of an error condition can be hard to identify precisely and is therefore difficult to correct (e.g. the apparent loss of an event fragment from a part of the front-end could have been caused by a buffer overflow, or by a faulty event ID attached to a fragment that was in fact delivered). The most important point is that an error condition is detected and that all event data which could be affected by it are marked as error prone. When an error is detected the experiment control system (ECS) must be informed and the ECS must decide how to handle it.

An important part of the system monitoring performed by the ECS during running is to observe the error status of all front-end modules. It is therefore important that the front-end electronics makes extensive error status information available via its ECS interface. Error conditions that have only affected individual events should be counted with error counters. Fatal errors can be signaled with simple error flags.

Resetting front-end

In many cases, once an error has been detected, the system can only be brought back into correct working mode by resetting a large part (or the whole) of the front-end system. If the error rate is high it becomes vital that the system can be reset quickly with a minimum event loss. When the error rate becomes higher than a certain level it even becomes advantageous to reset the system at regular intervals. The reset rate in this case depends strongly on the error rate and the loss of events associated with the reset procedure. As it is impossible to give a qualified guess of the error rate, it becomes important to have a well defined reset procedure which can be performed frequently with a minimum event loss.
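The trade-off between reset frequency and event loss can be illustrated with a rough model. The sketch below rests on assumptions that are not part of this specification: errors arrive randomly at a constant rate `error_rate`, an error spoils all data until the next periodic reset, and each reset costs a fixed dead time `reset_time`. The numbers are purely illustrative.

```python
import math

def loss_fraction(interval, error_rate, reset_time):
    """Fraction of running time lost when resetting every `interval` seconds.

    Assumed model: Poisson errors at `error_rate` (1/s); data taken after the
    first error in an interval is lost until the next reset; each reset adds
    `reset_time` seconds of dead time.
    """
    # Expected good time per interval: integral of exp(-rate * t) from 0 to T.
    good = -math.expm1(-error_rate * interval) / error_rate
    return 1.0 - good / (interval + reset_time)

def approx_best_interval(error_rate, reset_time):
    """Approximate loss-minimizing interval, valid for error_rate * interval << 1.

    Follows from minimizing reset_time / T + error_rate * T / 2 over T.
    """
    return math.sqrt(2.0 * reset_time / error_rate)

# Hypothetical numbers: one error every 10 s, 10 microseconds lost per reset.
rate, t_reset = 0.1, 10e-6
best = approx_best_interval(rate, t_reset)
```

In this model the best interval scales with the square root of the reset dead time, so a faster reset procedure allows more frequent resets and reduces the total loss.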

Several different kinds of reset are required to correctly initialize the system and start data taking. After power-on the complete front-end must be initialized via the ECS system (loading of front-end parameters, clearing buffers, resetting state machines and processors, configuring FPGA's, etc.). Such a reset will be too slow to be used as an online reset when an error condition has been identified in a small part of the front-end. Additional resets are also required to synchronize the complete front-end system to the bunch collisions of the LHC machine and to synchronize all front-end chips and modules to each other. Resets of this kind must be distributed by the TTC system to obtain the time precision required for synchronization.

To ensure a properly working front-end system after a major reset, it is advantageous to reset all state machines and buffers in the L0 front-end. The L0 front-ends located in a radiation environment in particular may suffer from random errors. The L1 front-end located in the counting house can be considered reliable.

The reset of the complete front-end can be generated by the readout supervisor in such a way that all event buffers in the front-end are known to be empty when it is issued (the readout supervisor does not generate any L0 trigger accepts in a specified time window before issuing the reset). The L0 front-end reset can be issued in the large bunch gap of the LHC machine, allowing up to ~100 clock cycles to reinitialize the L0 pipeline buffer without any loss of events. In addition, 160 clock cycles will be available before the first L0 trigger accepts can be generated.

The reset of the front-end electronics is partitioned into a TTC-driven L0 front-end reset and an ECS-driven L1 front-end reset (previously the L1 reset was also mapped into a TTC broadcast). L0 resets will be generated during the large LHC bunch gap, on the condition that the L0 derandomizer in a correctly working front-end is empty. This prevents aborting the transfer of events from the L0 derandomizer to the L1 front-end, which could "confuse" the L1 front-end. When an L0 reset is issued the L1 front-end electronics must prepare itself for the reception of a new event stream.

Requirements to front-end

Tagging accepted L0 data with Bunch ID and L0-Event ID

All event data in the L0 pipeline could in principle be tagged with a bunch count ID. This would, however, imply a large hardware overhead and would not significantly improve the error checking capabilities. It is more reasonable to tag data accepted by the L0 trigger with the bunch ID when it is written into the L0 derandomizer buffer. In some cases it may be advantageous to use an L0 buffer pipeline address (or equivalent) as an alternative to the bunch ID. The bunch ID can be calculated from such a pipeline address, and the pipeline address also makes it possible to detect faults in the L0 buffer control logic.
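How a bunch ID might be recovered from a pipeline address can be sketched as follows. This is an illustration only: it assumes that the pipeline write pointer advances by one cell per bunch crossing in step with the bunch counter, and the pipeline depth of 256 and all function names are hypothetical. The value 3564 is the number of bunch slots per LHC orbit.

```python
BUNCHES_PER_ORBIT = 3564   # LHC bunch slots per orbit
PIPELINE_DEPTH = 256       # hypothetical L0 pipeline depth

def bunch_id_from_address(event_addr, write_addr, current_bunch_id):
    """Recover the bunch ID of the sample stored at `event_addr`.

    Assumes the write pointer moves one cell per bunch crossing, so the
    distance between the two addresses is the age of the sample.
    """
    age = (write_addr - event_addr) % PIPELINE_DEPTH  # crossings since sampling
    return (current_bunch_id - age) % BUNCHES_PER_ORBIT

def address_consistent(event_addr, tagged_bunch_id, write_addr, current_bunch_id):
    """True if the pipeline address agrees with an independently tagged bunch ID."""
    recovered = bunch_id_from_address(event_addr, write_addr, current_bunch_id)
    return recovered == tagged_bunch_id
```

A disagreement between the recovered and the tagged bunch ID would point to a fault in the buffer control logic rather than in the event data itself.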

Check equivalent bunch ID and L0/L1 Event ID when merging event data.

One of the main tasks of the front-end system is to reduce and concentrate the collected event data from the different detectors. Where data are concentrated they must be assembled into consistent blocks of data belonging to the same event (local event building). It is important here to ensure that data from different events are not mixed. The bunch ID and the L0 event ID of the different fragments are unique identifiers of events. The correct bunch/event IDs must be verified when data are merged, and in case of a mismatch the event must be tagged as corrupted. When such an error occurs it is likely that the following events will have similar problems. The different event fragments could in principle be realigned locally after such an error condition. It is, however, likely that data will still be corrupted after such a realignment because of incorrect tagging of event fragments in the data sources.
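The ID check during local event building can be sketched as below. The `Fragment` layout and the choice of the first fragment as the reference are assumptions made for illustration; a real module might instead compare against a locally generated ID or take a majority.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    source: str       # hypothetical name of the data source
    bunch_id: int
    event_id: int
    payload: bytes

def merge_fragments(fragments):
    """Merge fragments into one event block, flagging any ID mismatch."""
    first = fragments[0]
    reference = (first.bunch_id, first.event_id)
    mismatched = [f.source for f in fragments
                  if (f.bunch_id, f.event_id) != reference]
    return {
        "bunch_id": first.bunch_id,
        "event_id": first.event_id,
        "payload": b"".join(f.payload for f in fragments),
        "corrupted": bool(mismatched),       # the event is tagged, not dropped
        "mismatched_sources": mismatched,
    }
```

The merged block is always produced; a mismatch only sets the `corrupted` tag, in line with the requirement that suspect events are marked rather than silently discarded.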

Monitor buffer occupancies

The readout supervisor is responsible for limiting the L0 trigger rate such that no buffer in the front-end overflows. It may nevertheless occur that some buffers overflow because of a malfunction in the system or an exceptionally high occupancy in a part of a detector (the occupancy of buffers after zero suppression depends on the detector channel occupancy). The L0 derandomizers are prevented from overflowing centrally by the Readout Supervisor, based on a set of strict requirements on these buffers. Buffers in the L1 front-end are prevented from overflowing by a hardwired throttle signal to the Readout Supervisor. If any of these buffers in a front-end module overflows, the system synchronization will be lost and the whole front-end system must be reset. It is therefore important that all buffer overflows are detected immediately and that the ECS is informed of them (via status registers).
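A buffer with occupancy monitoring, a throttle threshold and a sticky overflow flag could behave as in the following sketch. The class, its field names and the status layout are hypothetical and only illustrate the behaviour described above.

```python
from collections import deque

class MonitoredBuffer:
    """FIFO with a throttle threshold and a sticky overflow flag."""

    def __init__(self, depth, throttle_level):
        self.depth = depth
        self.throttle_level = throttle_level
        self.fifo = deque()
        self.overflow = False      # sticky: stays set until an explicit reset
        self.max_occupancy = 0     # high-water mark for monitoring

    def write(self, item):
        if len(self.fifo) >= self.depth:
            self.overflow = True   # data lost, synchronization is suspect
            return False
        self.fifo.append(item)
        self.max_occupancy = max(self.max_occupancy, len(self.fifo))
        return True

    def read(self):
        return self.fifo.popleft() if self.fifo else None

    @property
    def throttle(self):
        """Request that triggers are paused before the buffer actually fills."""
        return len(self.fifo) >= self.throttle_level

    def status(self):
        """Status information as it might be exposed to the control system."""
        return {"occupancy": len(self.fifo), "max_occupancy": self.max_occupancy,
                "throttle": self.throttle, "overflow": self.overflow}
```

The overflow flag is deliberately not cleared by reading data out again, so that a transient overflow remains visible until the status is read and the system is reset.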

Verify event ID of ROT and destination broadcasts.

The TTC distribution of event Read-Out Types (ROT) and Multi Event Packet (MEP) addresses carries the 2 least significant bits (LSB) of the event identifier, which must be cross-checked with the L0 event IDs of the corresponding events in the L1 front-end electronics.
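The cross-check itself is a two-bit comparison, sketched here with a hypothetical function name:

```python
def broadcast_matches(l0_event_id, broadcast_lsb):
    """Compare the 2 LSBs carried by a broadcast with the local L0 event ID."""
    return (l0_event_id & 0b11) == (broadcast_lsb & 0b11)
```

With only two bits, a slip of one, two or three events is detected, whereas a slip by a multiple of four events would pass this particular check.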

Synchronization of high speed optical/serial links:

Synchronization patterns should be sent at regular intervals and/or when no active data is being transferred.

Recommendations to front-end

Use parity/hamming/CRC checks on data transmitted over links

It is strongly recommended to use some kind of data checking when transmitting data over links between modules. A minimum data check is the use of parity at the word level or a CRC check on a complete event fragment. The event header in particular, which contains the event and bunch identification, should be well checked and protected.
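A CRC over a complete event fragment might be applied as in this sketch. The choice of CRC-32 (as provided by Python's `zlib` module) and the 4-byte trailer format are assumptions; the text does not prescribe a particular polynomial or framing.

```python
import zlib

def add_crc(fragment: bytes) -> bytes:
    """Append a CRC-32 of the fragment as a 4-byte big-endian trailer."""
    return fragment + zlib.crc32(fragment).to_bytes(4, "big")

def check_crc(framed: bytes) -> bool:
    """Verify the trailer written by add_crc()."""
    body, trailer = framed[:-4], framed[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == trailer

# Sender side: frame a (made-up) header plus data block.
framed = add_crc(b"\x00\x2a\x0d\xb0" + b"detector data")

# Receiver side: a single flipped bit in the header is detected.
damaged = bytes([framed[0] ^ 0x01]) + framed[1:]
```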

Use parity/hamming/CRC check on data stored in memories.

Memories are the part of the electronics most sensitive to single event upsets. The LHCb front-end consists to a large part of buffer memories, so a significant number of bit failures can be expected while the system is running. The event data contained in these buffers will normally consist of digitized detector signals and an event header. The digitized detector signals are not important to protect, but a bit failure in the headers may cause the front-end to fail. It can be noted here that a single particle may in some cases "flip" two neighbouring bits in a memory (a double bit error, not detected by parity).
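The limitation of a single parity bit can be seen in a small example. The header value and function names below are made up for illustration.

```python
def parity(word):
    """Parity bit of `word`: 1 if it has an odd number of set bits."""
    return bin(word).count("1") & 1

def store(word):
    """Store a word together with its parity bit."""
    return word, parity(word)

def check(word, stored_parity):
    """True if the word still agrees with its stored parity bit."""
    return parity(word) == stored_parity

header, p = store(0b1010_0110)
single_flip = header ^ 0b0000_0100   # one upset bit: detected
double_flip = header ^ 0b0000_1100   # two neighbouring bits upset: not detected
```

Here `check(single_flip, p)` fails as intended, while `check(double_flip, p)` passes although the header is wrong. A code with double error detection, such as an extended Hamming code, would flag the second case.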

Use one-hot encoded state machines with continuous check.

Registers used to implement state machines controlling the different functions of the front-end electronics are also sensitive to single event upsets. When state machines are implemented with normal binary state encoding it is very hard (if not impossible) to determine whether the state machine has gone wild. If one-hot encoded state machines are used (a state encoding where one and only one state flip-flop is set at any time) it is possible to detect any single bit upset of the state register. One-hot encoded state machines need more flip-flops to implement the state register (e.g. 8 instead of 3), but the overhead is in general quite small as the state encoding/decoding logic is reduced (and in general the state machine can also run faster). The state check consists simply of verifying that one and only one bit is set in the state register in every clock cycle. One-hot encoded state machines are supported by most logic synthesis tools, but the state checking must be "hand" coded in the input (VHDL or Verilog) to the synthesis tool. A better but more expensive solution is to use single bit error immune state machines based on Hamming coding or triple redundant logic.
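The one-hot state check can be modelled in a few lines. This is a behavioural sketch in Python rather than the VHDL or Verilog that would be used in practice, and the 8-state register is only the example size mentioned above.

```python
N_STATES = 8  # one flip-flop per state

def is_one_hot(state):
    """True if exactly one bit of the state register is set."""
    return state != 0 and (state & (state - 1)) == 0

def single_upsets(state):
    """All register values reachable from `state` by flipping one bit."""
    return [state ^ (1 << bit) for bit in range(N_STATES)]

valid_states = [1 << i for i in range(N_STATES)]
```

Flipping the set bit leaves no bit set, and flipping any other bit leaves two bits set, so every single upset of a valid state fails the check. A double upset that clears one bit and sets another would still look valid, which is one reason the Hamming or triple redundant alternatives are stronger.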

Use continuous parity check of setup parameters

The front-end electronics will contain a large set of front-end parameter registers which need to be loaded with predefined values before the experiment can start data taking. These registers will also be sensitive to single event upsets, and a change of any single bit may have serious consequences for the correct function of the electronics. Issuing a front-end reset to reinitialize the state of the front-end will in this case not correct the problem: the front-end parameters must be reloaded via the ECS system. The use of a parity bit per parameter register, or a common parity for a whole set of parameters, can detect this kind of failure (the parity check must be performed continuously!). A better but more expensive solution is to use single bit error immune configuration registers based on Hamming coding or triple redundant logic.
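The triple redundant variant can be sketched as a register stored three times and read through a bitwise majority voter. The class and method names are hypothetical, and the `scrub` step (rewriting all copies with the voted value) is an assumption about how such a register might be kept clean, not something required by the text.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote over three copies of a register."""
    return (a & b) | (a & c) | (b & c)

class TripleRegister:
    """Parameter register stored three times and read through a voter."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self):
        return majority(*self.copies)

    def upset_detected(self):
        """True if the copies disagree, i.e. an upset is being masked."""
        return len(set(self.copies)) > 1

    def scrub(self):
        """Rewrite all copies with the voted value so upsets do not accumulate."""
        self.copies = [self.read()] * 3
```

A single upset in one copy is masked by the voter and can additionally be reported through `upset_detected()`, so that the event is still counted in the error status.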

A simple, but much less efficient, approach to this problem is to continuously download correct front-end parameters while the system is running. If the system used to load the parameter registers does not have a very high level of security (checking), the continuous writing may in fact provoke more errors than single event upsets in the registers themselves.

Use of FPGA's

FPGA's that store the configuration of their gates and internal connections in on-chip static memory pose a special and rather delicate problem for use in radiation areas, as there is in general no built-in function to continuously verify their content. Some FPGA's allow their internal configuration to be read out while the circuit is actively being used. In such cases the programming can be continuously read out and compared to its correct value, or alternatively a checksum of the configuration data can be calculated and compared against a locally stored reference. If SRAM based FPGA's are needed in the implementation of a front-end module, the only way to correct a failure is to force the reloading of the programming data when a specific module has been found to fail repeatedly. The process of reloading FPGA's under control of the ECS system will be quite slow, and this scheme will only work if the error rate is very low. The use of alternative FPGA technologies that do not use static memory for their configuration should be seriously considered for electronics located in an environment with radiation (in the detector, in the cavern). It has also been seen in rare cases that the change of a programming bit has damaged the FPGA, when several internal bus drivers became enabled at the same time.
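The two verification options, a full comparison and a checksum comparison, can be sketched as follows. How configuration data is actually read back is device specific and is not modelled here; the frame size of 4 bytes, the use of CRC-32 and the made-up reference data are assumptions for illustration.

```python
import zlib

def differing_frames(read_back, reference, frame_size=4):
    """Indices of fixed-size frames where read-back and reference differ."""
    n = max(len(read_back), len(reference))
    return [i // frame_size for i in range(0, n, frame_size)
            if read_back[i:i + frame_size] != reference[i:i + frame_size]]

def checksum_ok(read_back, reference_crc):
    """Compare a checksum of the read-back data with a stored reference."""
    return zlib.crc32(read_back) == reference_crc

# Made-up 16-byte "configuration" with one upset bit in byte 9.
reference = bytes(range(16))
reference_crc = zlib.crc32(reference)
upset = bytearray(reference)
upset[9] ^= 0x01
upset = bytes(upset)
```

The full comparison locates the corrupted frame but needs the complete reference data locally, whereas the checksum needs only a few bytes of storage and reports just that something has changed.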

Use of microprocessors

The use of normal microprocessors or special purpose DSP's in an environment with radiation poses problems similar to those of static memory based FPGA's. In this case the processor can probably in most cases be reset together with the rest of the front-end electronics. If the program memory has been corrupted it must be reloaded from the ECS system (or a local PROM). The inclusion of watchdog timers in the program of the processors may in some cases improve the error detection capability. The use of parity checks on the memory of the processor would be ideal, but will in many cases be difficult to implement on micro-controllers and special purpose DSP's.
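The behaviour of a watchdog timer can be modelled as below. This is a minimal sketch with hypothetical names: `tick()` stands for a timer that runs independently of the program, and `kick()` for the call the program makes at a known-good point of its main loop.

```python
class Watchdog:
    """Counts ticks and expires unless kicked within `timeout` ticks."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.count = 0
        self.expired = False   # sticky until the processor is reset

    def kick(self):
        """Called by the program to show that it is still running correctly."""
        self.count = 0

    def tick(self):
        """Called once per timer period, independently of the program."""
        self.count += 1
        if self.count >= self.timeout:
            self.expired = True
        return self.expired
```

A program that has stopped or is stuck in a loop no longer reaches the `kick()` call, so the watchdog expires and the condition can be reported or used to trigger a reset.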

This page was last modified by JC on 11 May, 2006.