Monday, November 11, 2013

How NASA fixed a computer on another planet

I just love it when something that sounds like science fiction is actually true. In 1997 the Mars Pathfinder landed on Mars, but after a few days it started having trouble: the computer kept "crashing" and resetting, which stopped the Pathfinder from finishing its daily tasks. It didn't resume them until the next day.

If you are interested in the technical details, you may find them HERE, in the words of someone who was actually involved in the mission. If you are curious but want a quick and simple explanation, I'll try to provide it here the way I (as a noob) understood it.

Super quick overview of the system:
The Pathfinder was controlled by a Real Time Operating System (RTOS), a kind of software whose main characteristics include the ability to schedule tasks in order of priority and even "pause" a running task if something with higher priority comes along.
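To make the "pause" idea concrete, here is a minimal sketch of priority scheduling in Python. It is a toy model, not the Pathfinder's actual scheduler: the task names, the priorities, and the idea of time moving in "ticks" are all assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int   # larger number = more important (a convention of this sketch)
    arrival: int    # tick at which the task becomes ready to run
    remaining: int  # ticks of work the task still has to do

def run(tasks, ticks):
    """Return the name of the task that ran on each tick ('idle' if none)."""
    timeline = []
    for now in range(ticks):
        ready = [t for t in tasks if t.arrival <= now and t.remaining > 0]
        if not ready:
            timeline.append("idle")
            continue
        # The scheduler always picks the most important ready task, so a
        # newly arrived high priority task "pauses" whatever was running.
        current = max(ready, key=lambda t: t.priority)
        current.remaining -= 1
        timeline.append(current.name)
    return timeline

# Hypothetical tasks: "housekeeping" starts first, "bus" arrives at tick 2.
timeline = run([Task("housekeeping", 1, 0, 4), Task("bus", 9, 2, 2)], ticks=7)
print(timeline)
```

In the printed timeline, "housekeeping" runs for two ticks, gets paused while "bus" does its two ticks of work, and only then continues.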

The CPU was connected to a Bus, which is a sort of "pathway" shared by different devices. One of those devices, called ASI/MET, was used for gathering meteorological data. Access to the Bus was managed by two RTOS tasks that had the highest priority in the entire system.

What went wrong:
One of the fun things about RTOSs is that sometimes a variable is used by two different tasks, and you have to make sure only one of them is changing it at a time. You can protect access to these variables using yet another variable called a "semaphore", which a task can "take" when it's going to access the variable and "release" when it's done with it. If a task needs access to a variable but the semaphore is "in the hands" of another task, the task is put on hold until it can take it.
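Here is a small sketch of the take/release idea, using ordinary Python threads in place of RTOS tasks. It assumes that `threading.Lock` is a fair stand-in for a binary semaphore; the `counter` variable and the `worker` function are hypothetical names invented for the example.

```python
import threading

counter = 0                    # the shared variable both "tasks" change
semaphore = threading.Lock()   # stand-in for the semaphore that guards it

def worker(increments):
    global counter
    for _ in range(increments):
        # "take" the semaphore; the with-block "releases" it on exit.
        # If the other thread holds it, this thread waits here.
        with semaphore:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000: no increment is lost while the semaphore is held
```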

Now, imagine that a low priority task takes a semaphore but then it gets "paused" before it can release it because the RTOS needs to run a higher priority task. And then other higher priority tasks keep getting in the way so that the poor little task that has the semaphore never gets a chance to continue. Finally, imagine that a super important task has to run but it needs the semaphore... yeah, that's a problem.

In a nutshell, that's what was causing the problem with the Pathfinder. The low priority task that controlled the ASI/MET took (and didn't get the chance to return) a semaphore that was also needed by one of the high priority tasks that controlled the Bus, preventing it from completing. When the other task that controlled the Bus tried to run, it saw that the first one hadn't completed, figured something was terribly wrong, and declared the error that triggered the reset of the entire system.

How it was fixed:
Once NASA figured out what the problem was, and after carefully making sure the change wouldn't break something else, they enabled a feature in the RTOS called "Priority Inheritance". This feature pretty much allows the priority of a task to be "bumped up" while it's holding a semaphore that a higher priority task needs. To apply the change they didn't transmit an entire new copy of the software to the Pathfinder; they just sent the pieces that needed to change, and then the system "patched" itself.
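To see why "bumping up" the priority helps, here is a toy simulation of the whole story, with and without Priority Inheritance. It is only a sketch: the three tasks, their priorities, and their timings are made up, and it does not reproduce the real RTOS or the Pathfinder's code.

```python
# Toy tick-based simulation; task names, priorities, and timings are made up.
TASKS = {
    # name: (base priority, arrival tick, plan)
    # plan: 'L' = one tick of work holding the semaphore, 'w' = one tick without it
    "low":  (1, 0, "LLL"),
    "high": (3, 1, "L"),
    "med":  (2, 2, "wwwww"),
}

def simulate(inherit, ticks=12):
    """Return which task ran on each tick ('idle' if none could run)."""
    pos = {name: 0 for name in TASKS}   # next step in each task's plan
    owner = None                        # task currently holding the semaphore
    timeline = []
    for now in range(ticks):
        active = [n for n, (_, arrival, plan) in TASKS.items()
                  if arrival <= now and pos[n] < len(plan)]
        # Blocked: the next step needs the semaphore and another task holds it.
        blocked = [n for n in active
                   if TASKS[n][2][pos[n]] == "L" and owner not in (None, n)]
        runnable = [n for n in active if n not in blocked]
        if not runnable:
            timeline.append("idle")
            continue

        def effective_priority(n):
            prio = TASKS[n][0]
            if inherit and n == owner:
                # Priority inheritance: the holder borrows its waiters' priority.
                prio = max([prio] + [TASKS[b][0] for b in blocked])
            return prio

        current = max(runnable, key=effective_priority)
        plan = TASKS[current][2]
        if plan[pos[current]] == "L":
            owner = current             # take (or keep) the semaphore
        pos[current] += 1
        finished = pos[current] == len(plan)
        if owner == current and (finished or plan[pos[current]] != "L"):
            owner = None                # release after the last 'L' step
        timeline.append(current)
    return timeline

without_fix = simulate(inherit=False)
with_fix = simulate(inherit=True)
print("high runs at tick", without_fix.index("high"), "without inheritance")
print("high runs at tick", with_fix.index("high"), "with inheritance")
```

Without inheritance, "med" keeps "low" from finishing, so "high" has to wait until tick 8. With inheritance, "low" borrows the priority of "high" while it holds the semaphore, finishes quickly, and "high" gets to run at tick 3.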

And that was it... easy... right?