Abstract:
This workshop explores the opportunities that many/multi-core technologies for space missions offer for achieving the desired goal: fail-operate, i.e., operating through failures. The workshop will have four main sessions (16 hours total, spread over 2 days). Beyond the new fault tolerance techniques required for multicore computing and the new fault management capabilities enabled by its high throughput, the workshop organizers believe that there is a large body of fault-tolerance research that may have been shelved because deploying it on older technologies was impractical. With the advent of many/multi-core processing in space, much of this research should perhaps be revisited, as it might provide the insights needed to achieve operate-through using the newer technologies.
Theme and Goals:
Session 1: Problem Definition
Because most of the work in dealing with radiation effects has been at the component level (RHBP, RHBD), this session will attempt to define the fault set that needs to be addressed from that level up through the areas that affect the ability to fail-operate or operate-through. Examples include the hardware faults that we may see as we continue to shrink feature sizes, independent of those anticipated from radiation effects, including early component failure and degraded performance. The goal of this session will be to establish a common vocabulary for faults and multi/many-core processors, to identify a fault set likely to be seen in future multicore computers, and to identify areas of opportunity and challenges unique to deploying many/multi-core computers in space missions.
Session 2: Fault Tolerance (A Local Perspective)
Just as most RHBP and RHBD work has focused on the component level, so has much of fault tolerance focused on a more local, hard-fault viewpoint. Although this is important work and essential to understanding fault behavior, the advent of multi- and many-core processors creates both an opportunity for additional fault tolerance and challenges in fault containment and localization. In this area, it is possible that prior research work on fault identification and mitigation (work that was previously not practical to use in systems due to SWaP and/or raw processing power constraints) could be revisited and might provide some unique new approaches to leveraging multi/many-core space processing. Fault propagation is another area to be considered in this session: just as multi/many-core processing brings opportunities for localized fault tolerance, the technology also brings opportunities for fault propagation and the challenge of assessing single points of failure, fault effects, and fault detection, localization, and mitigation, especially in real-time, mission-critical applications.
Session 2 will close with a panel session summarizing the different perspectives and encouraging (or so the organizers hope) a lively discussion among the participants in preparation for Session 3.
Session 3: Fault Management (A System Perspective)
When a component fails, it does not necessarily follow that the system fails. Further, a component may not fail in the classic sense (completely stop operating or responding) but may still behave in a way that endangers the mission (reduced performance, slightly incorrect numerical results, excessive retransmits, ...). As in Session 2, the organizers believe that it is possible that prior research work on fault identification and mitigation (work that was previously not practical to use in systems due to SWaP and/or raw processing power constraints) could be revisited and might provide some unique new approaches to leveraging multi/many-core space processing. Software is a key component in space processing systems: what approaches to managing software faults are appropriate and effective in a multi/many-core environment? This session will address architectures (both hardware and software) that aid operate-through, and will also touch on the challenge of dealing with operate-through when the system has heterogeneous components (different processor types such as RAD750, Maestro, ARM, and Freescale; different accelerator types such as DSP, FPGA, and GPGPU; as well as different sensor types, ...). Verification/validation of the effectiveness of operate-through approaches (beyond the traditional radiation testing environments) is part of this discussion.
Session 4: Commercial Practices – And The Gaps To Be Addressed By Space Processing
As feature sizes shrink, some faults that were formerly associated only with harsh environments are now seen at sea level in commercial processors/workstations. These, plus others that are related to the newer technologies as discussed in Session 1, are a design consideration for exascale terrestrial computer systems. Our quest for techniques to ensure that missions can operate through failures may therefore get increased assistance from commercial vendors. This can already be seen in the widespread availability of such techniques as hardware error correction for memories, registers, and caches, but there is likely more on the horizon in both hardware and software (systems software such as operating systems, and application techniques). Some of the commercial practices come at a cost: for example, some error correction techniques can involve timeouts that erode critical timelines before reporting an uncorrectable error. The organizers will be encouraging participation by leading commercial hardware and software vendors and will close the session with a panel that will debate the gaps that space processing research will need to fill, based on a ten-year projection of capabilities and areas of “opportunity”.
Relevance:
Incorporation of new technologies is often resisted due to a perception of increased risk. At the same time, failure to advance the performance of space missions can also increase risk. The emerging many-core options give us an opportunity to revisit the approaches to achieving mission success despite SWaP constraints. Just as some older applications research has been shown to provide useful insights into application performance on many-core processors, so some prior research work on fault tolerance and fault management may now be worth revisiting, as it might at last be deployable to aid mission success.
Organization:
Session Chair(s): Marti Bancroft, MBC, USA, marti@dragonsden.com
Rafi Some, NASA JPL, USA, raphael.r.some@jpl.nasa.gov