Common FAH Errors

mmonnin Centreville, VA
edited February 2004 in Folding@Home
I found this thread at the community and thought it might be helpful when something goes wrong or someone gets an error of some sort.

http://forum.folding-community.org/viewtopic.php?t=4824&highlight=
I have compiled here a list of the most common errors that are found in Folding@Home. There is an explanation of the cause and resolution (if known) for each error listed. I have also listed a few guidelines for posting log files and how to ID a unit/core. If you find another fairly common error (i.e. more than 1 person gets it), PM me and I'll add it to the list. Any feedback, positive and negative, is very welcome.

Moderators and Administrators: please do not edit this post. I work on a local copy that I keep for safekeeping and ease of editing, so any changes made will be lost during the next update. Simply contact me regarding any changes and I'll be more than happy to make them.

Since first posting this, I've noticed that there are far fewer questions about these errors (specifically EARLY_UNIT_END). Good to see that it's helped! 8-)
  • EARLY_UNIT_END
  • FILE_IO_ERROR
  • CLIENT_DIED
  • UNKNOWN_ERROR
  • Client-Core Communications Error
  • BAD_FRAME_CHECKSUM
  • SPECIAL_EXIT
EARLY_UNIT_END:
Quite possibly the most common error found today. EARLY_UNIT_END is usually caused by one of two things: a bad WU or an unstable system.

If you get one isolated EARLY_UNIT_END, it's most likely just a WU that is bad. It's not a problem, and you shouldn't worry about it. It's usually caused when atoms in the WU reach impossible positions and Gromacs can't continue.

Multiple EARLY_UNIT_END errors are a sign of a severe problem with your machine. Machines that are clocked too high, have heat problems, or possibly have SSE forced on (AMD only) will generate this error. You should stop F@H if you get more than one EARLY_UNIT_END per week per machine, and certainly if you get two in a row. Make sure your machine is up to spec, with reasonable temperatures, reasonable clocking (CPU, FSB, and memory must all be stable), and a good, powerful PSU. EARLY_UNIT_END is most often caused by problems with a user's machine, and an abnormal number of them certainly merits examining your system.
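The rule of thumb above (stop on more than one EARLY_UNIT_END per week per machine, or on two in a row) can be sketched roughly as follows. This is only an illustration: the `history` list of (date, result) pairs and the "OK" label are assumptions made up for this example, since FAHlog.txt timestamps shown here carry no date and you would have to record that yourself.

```python
from datetime import date, timedelta

def should_stop_folding(history):
    """history: hypothetical list of (date, result) pairs for one machine, oldest first.

    result is "EARLY_UNIT_END" for a failed unit; any other string
    (for example the made-up label "OK") counts as a good unit.
    """
    # Two EARLY_UNIT_END results in a row: stop.
    for (_, first), (_, second) in zip(history, history[1:]):
        if first == "EARLY_UNIT_END" and second == "EARLY_UNIT_END":
            return True
    # More than one EARLY_UNIT_END within a week: stop.
    early_dates = [day for day, result in history if result == "EARLY_UNIT_END"]
    for earlier, later in zip(early_dates, early_dates[1:]):
        if later - earlier < timedelta(days=7):
            return True
    return False
```

A single isolated EARLY_UNIT_END returns False here, which matches the advice that one bad WU is nothing to worry about.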

This error may be accompanied by a LINCS WARNING message that gives more specific technical details on exactly what happened.

NOTE: See the description about "-forceasm" (3.x) or "-forceSSE" (4.x) causing SPECIAL_EXIT on certain AMD based systems. If you are running an AMD Athlon XP with the Thoroughbred or Barton cores, you should remove the "-forceasm" or "-forceSSE" switch, most likely fixing your problems.

FILE_IO_ERROR:
An error that occurs when disk operations go bad. This is a fairly general error, having many sub-types. It has plummeted in frequency since the release of Gromacs Core 1.46. Now, this error usually happens when a hardware error occurs: something like "Write 0010, read back 0011". If you experience this error, make sure your hard drives are OK: run ScanDisk, CHKDSK, or fsck, make sure the IDE bus is in spec, make sure you're using good IDE cables, and make sure the drive isn't dying.

FILE_IO_ERROR has also been reported to occur if two Console clients working on the same unit are started. This can happen on a dual-CPU machine if you accidentally start the same client twice, instead of starting two separate clients once each.
Thanks to sortofageek for contributing the part about two clients causing this error.

CLIENT_DIED:
This happens when, simply enough, the client dies. The core is still running, and can't find the client, so it shuts down. This is usually related to overclocking and/or overly aggressive memory timings. Back down on these and this error should vanish.

UNKNOWN_ERROR:
A now rare Gromacs error that usually occurs if there's a corrupt WU being processed. It is no longer common and any instances should probably be reported (post a log, etc.). You may also want to check your hardware if you've had past errors.

Client-Core Communications Error:
There are several different kinds of this error.

ERROR 0xX is basically another form of an unknown error. It can be found on Linux if you're having Glibc version problems. See the Linux forum for more info. Overclocking is another possible cause. ERROR 0x1 has occurred with both Gromacs and Genome units. Its cause is still unknown. This error has not been replicated by the Pande group. There are known solutions to 0xX if it's caused by overclocking (stop!) or Glibc (see Linux forum). Otherwise, there's no known fix. Post the relevant sections of FAHlog.txt (including version and type of client) along with your OS version, and continue folding/genoming.

ERROR 0x1 has been reported to occur if the core is killed while the client is processing, though this is a fairly rare occurrence if you are not using scripts that kill the core.
[15:07:06] CoreStatus = 1 (1) 
[15:07:06] Client-core communications error: ERROR 0x1 
[15:07:06] Deleting current work unit & continuing... 
[15:07:26] Trying to send all finished work units 
[15:07:26] + No unsent completed units remaining. 
[15:07:26] - Preparing to get new work unit...
Thanks to gnewbury for information on this form of ERROR 0x1.

ERROR 0xC0000005 means there was a memory access violation. This is a standard Windows error code for any program trying to access memory it does not control. This can be a rare hardware error and is not cause for concern. Old versions of clients/cores can also cause this problem.

ERROR 0x________, where the blank is an eight-digit hexadecimal code, is usually a general Windows error. Look up the specific Windows error code (if you need help, just post a thread) and you will most likely find the cause.
Thanks to Bruce and Guha for clarifying 0xX errors.
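As a rough aid for sorting these codes, here is a small sketch that pulls the code out of a log line and attaches the matching note from this section. The function name is hypothetical, and it assumes every variant is logged in the same "Client-core communications error: ERROR 0x..." form as the ERROR 0x1 excerpt above, which I have only confirmed for that one case.

```python
import re

# Matches the line format seen in the ERROR 0x1 log excerpt.
ERROR_RE = re.compile(r"Client-core communications error: ERROR (0x[0-9A-Fa-f]+)")

def classify_core_error(line):
    """Return (code, note) for a client-core error line, or None if the line is something else."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code = match.group(1)
    digits = code[2:]
    if int(code, 16) == 0xC0000005:
        note = "memory access violation (standard Windows error code)"
    elif len(digits) == 8:
        note = "probably a general Windows error; look up the code"
    elif int(code, 16) == 0x1:
        note = "cause unknown; reported when the core is killed while the client is processing"
    else:
        note = "unknown error; check overclocking, or Glibc versions on Linux"
    return code, note
```

For example, feeding it the second line of the excerpt above returns the code "0x1" with the note about the core being killed.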

BAD_FRAME_CHECKSUM:
You'll see a block in your log that looks something like this:
[hh:mm:ss] Header on frame 220 differs from expected header 
[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220 
[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219
Note that the two lines of hexadecimal numerals are the same. This strange error only occurs with Tinker units. The only known cause is when two or more clients are started at once and are working in the same directory, but there may be other causes. This error often, bizarrely, occurs on an early frame but is not detected until the unit's end.

BAD_FRAME_CHECKSUM, similar to one type of Gromacs FILE_IO_ERROR, can also mean that a hardware error occurred where there was a slight discrepancy between what was read and what was expected: something like writing 101010 and reading back 110110. Again, this is commonly not detected until the unit finishes.
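If you want to check a BAD_FRAME_CHECKSUM block quickly, the following sketch compares the "Got" and "Expected" lines from the log excerpt above. The function name and the returned dictionary keys are made up for this example, and it assumes the lines always follow the layout shown in that excerpt.

```python
import re

# Header is the hexadecimal string; the trailing number is the frame.
HEADER_RE = re.compile(r"(Got|Expected):\s*([0-9A-Fa-f-]+):\s*(\d+)")

def compare_frame_headers(log_lines):
    """Return a small summary of the Got/Expected pair, or None if either line is missing."""
    found = {}
    for line in log_lines:
        match = HEADER_RE.search(line)
        if match:
            found[match.group(1)] = (match.group(2), int(match.group(3)))
    if "Got" not in found or "Expected" not in found:
        return None
    got, expected = found["Got"], found["Expected"]
    return {
        "same_header": got[0] == expected[0],
        "got_frame": got[1],
        "expected_frame": expected[1],
    }
```

On the excerpt above this reports identical headers with frame 220 received where 219 was expected.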

SPECIAL_EXIT:
This severe error means that something unknown happened inside the Gromacs core. The only known cause is when "-forceasm" (3.x) or "-forceSSE" (4.x) is applied to an AMD system that is not 100% stable with SSE. CPUs that have had problems include processors with the Thoroughbred B, Barton, and Opteron cores. In this case it should be dealt with as an EARLY_UNIT_END error (see above). Removing "-forceasm" or "-forceSSE" will almost certainly fix the problem. SSE-related errors are now fairly rare, compared to a few months ago.

If you are not forcing use of SSE and this error occurs, a log should be posted as this is a serious problem.

Posting Log Files: A Guide
When you post any log straight onto the board, please edit out any insignificant details. Examples of this would be completion of frames, core download ("1024 bytes downloaded... 512000"), and Getwork errors (leave the first and last ones please). An example:
[hh:mm:ss] Writing local files 
[hh:mm:ss] Completed 0 out of 250000 steps (0) 
<snip> 
[hh:mm:ss] Writing local files 
[hh:mm:ss] Completed 250000 out of 250000 steps (100)
This makes logs far easier to read and problems are easier to spot. If you're unsure if something should be cut, please just leave it in.
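For long logs, the trimming can be roughed out with a short script. This is a sketch only: it assumes progress lines look like the "Writing local files" and "Completed N out of M steps" lines in the example above, and the function name is made up. It keeps the first two and last two lines of each run of progress lines and replaces the middle with <snip>. It does not handle core download or Getwork lines, so cut those by hand.

```python
import re

# Lines treated as routine progress output.
PROGRESS_RE = re.compile(r"Writing local files|Completed \d+ out of \d+ steps")

def trim_log(lines):
    """Collapse each long run of progress lines to its first two and last two lines."""
    trimmed = []
    run = []

    def flush():
        # Short runs are kept whole; long runs get a <snip> in the middle.
        if len(run) <= 4:
            trimmed.extend(run)
        else:
            trimmed.extend(run[:2])
            trimmed.append("<snip>")
            trimmed.extend(run[-2:])
        run.clear()

    for line in lines:
        if PROGRESS_RE.search(line):
            run.append(line)
        else:
            flush()
            trimmed.append(line)
    flush()
    return trimmed
```

Anything that does not match the progress pattern is left in, in line with the advice to leave in whatever you are unsure about.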

Folding@Home Unit Types and Cores
To tell which type of unit you are running, you can simply look at your log file. When the client first starts the core, you'll find one of the following strings somewhere. "Gromacs Core" means it's a Gromacs unit. "Protein Design Core" is a Genome unit. Tinker simply says "Folding@Home Client Core" and then, farther down, "TINKER: Software Tools for Molecular Design".
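The same check can be done with a few lines of script. This is a minimal sketch that searches the log text for the strings quoted above; the function name is hypothetical, and reading FAHlog.txt into a string is left to you.

```python
def identify_unit_type(log_text):
    """Guess the unit type from the core banner strings found in the log text."""
    if "Gromacs Core" in log_text:
        return "Gromacs"
    if "Protein Design Core" in log_text:
        return "Genome"
    # Tinker prints a generic banner first, so look for the TINKER line instead.
    if "TINKER: Software Tools for Molecular Design" in log_text:
        return "Tinker"
    return None
```

If a log covers several units, this only reports the first type it recognizes in the order checked above, so run it on the section for the unit you care about.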

If you look at the currently running processes under WinNT/2K/XP (in Task Manager) or Linux, you'll find one of these cores running (the most current version number is also listed):
  • FAHCore_65.exe - Tinker - Version 2.50
  • FAHCore_78.exe - Gromacs - Version 1.56
  • FAHCore_ca.exe - Genome - Version 2.06
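The list above also works as a simple lookup table. The sketch below maps process names to unit types; how you obtain the list of running process names is up to you (Task Manager, ps, etc.), and the function name is made up for this example.

```python
# Core executable names taken from the list above, lowercased for matching.
CORE_TYPES = {
    "fahcore_65.exe": "Tinker",
    "fahcore_78.exe": "Gromacs",
    "fahcore_ca.exe": "Genome",
}

def running_unit_types(process_names):
    """Return the unit type for every known core found in a list of process names."""
    return [
        CORE_TYPES[name.lower()]
        for name in process_names
        if name.lower() in CORE_TYPES
    ]
```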
Special thanks to the Pande group for their excellent support of this project (in no particular order): Vijay, Guha, Youngmin, Vishal, Eric, Chris, Adam, Stefan, and the rest of you who never post. You've got to like people who get up in the morning and go over to kick the servers when they're not working. Thanks to all of you.

This document may be freely linked to or reproduced, so long as a link to this page is provided.

Last updated February 18, 2003. Added to the Client-Core Communications Error section and updated most current core revision numbers.

Comments

  • a2jfreak Houston, TX Member
    edited February 2004
    I experienced the CLIENT_DIED error multiple times on the P4 today when switching it from the v3 client to the v4 client. Seems my P4 does not like the -local flag. Turning other flags off/on had no effect. -local was the problem. Odd I tell ya. Odd. The Athlon system had no problem with -local. I still have a couple Athlon systems to change to the v4 client.
  • gtghm New
    edited February 2004
    I would only like to add that a "FILE_IO_ERROR" is not always your fault.

    I have personally experienced this error from time to time since updating to the latest client, and I KNOW that my rig is sound. When this has happened I have noticed that the units were being downloaded from the same server. Also, I have had it happen to my client #3 and client #2, but not always at the same time or every time. Most of the time my 4 clients (dual hyper-threaded Xeons) run perfectly, but every once in a while I get that error... I have seen other posts discussing this and the general consensus is that it was/is possibly a server issue. I have not had it happen for the last several WUs now...

    "g"