SMP Error

Garg Purveyor of Lincoln Nightmares Icrontian
edited August 2007 in Folding@Home
My SMP client has been returning "UNKNOWN ERROR" messages since the last WU finished. I'm leaving for work now, so I guess I'll reinstall the client when I get home. Has anyone seen this before, or does anyone know the cause?
[08:09:41] Completed 500000 out of 500000 steps  (100 percent)
[08:09:42] Writing final coordinates.
[08:09:46] Past main M.D. loop
[08:09:58] Will end MPI now
[08:10:58] 
[08:10:58] Finished Work Unit:
[08:10:59] - Reading up to 3714144 from "work/wudata_07.arc": Read 3714144
[08:10:59] - Reading up to 1768952 from "work/wudata_07.xtc": Read 1768952
[08:10:59] goefile size: 0
[08:10:59] logfile size: 17316
[08:10:59] Leaving Run
[08:10:59] - Writing 5504812 bytes of core data to disk...
[08:11:01]   ... Done.
[08:11:01] - Failed to delete work/wudata_07.sas
[08:11:01] - Failed to delete work/wudata_07.goe
[08:11:01] Warning:  check for stray files
[08:11:01] - Shutting down core
[08:13:01] 
[08:13:01] Folding@home Core Shutdown: FINISHED_UNIT
[08:13:01] 
[08:13:01] Folding@home Core Shutdown: FINISHED_UNIT
[08:13:10] CoreStatus = 64 (100)
[08:13:10] Sending work to server


[08:13:10] + Attempting to send results
[08:13:39] + Results successfully sent
[08:13:39] Thank you for your contribution to Folding@Home.
[08:13:39] + Number of Units Completed: 56

[08:15:43] - Preparing to get new work unit...
[08:15:43] + Attempting to get work packet
[08:15:43] - Connecting to assignment server
[08:15:44] - Successful: assigned to (171.64.65.64).
[08:15:44] + News From Folding@Home: Welcome to Folding@Home
[08:15:44] Loaded queue successfully.
[08:15:50] + Closed connections
[08:15:50] 
[08:15:50] + Processing work unit
[08:15:50] Core required: FahCore_a1.exe
[08:15:50] Core found.
[08:15:50] Working on Unit 08 [August 15 08:15:50]
[08:15:50] + Working ...
[08:15:50] 
[08:15:50] *------------------------------*
[08:15:50] Folding@Home Gromacs SMP Core
[08:15:50] Version 1.74 (March 10, 2007)
[08:15:50] 
[08:15:50] Preparing to commence simulation
[08:15:50] - Ensuring status. Please wait.
[08:16:07] - Assembly optimizations manually forced on.
[08:16:07] - Not checking prior termination.
[08:16:14] - Expanded 929114 -> 11968368 (decompressed 1288.1 percent)
[08:16:15] - Failed to delete work/wudata_08.ar
[08:16:15] Project: 2610 (Run 1, Clone 84, Gen 0)
[08:16:15] 
[08:16:15] ing from initial work packet
[08:16:15] 
[08:16:15] Project: 2610 (Run 1, Clone 84, Gen 0)
[08:16:15] 
[08:16:16] Assembly optimizations on if available.
[08:16:16] Entering M.D.
[08:16:22] Rejecting checkpoint
[08:16:23] Gromacs error.
[08:16:23] 
[08:16:23] Folding@home Core Shutdown: UNKNOWN_ERROR
[08:16:23] 
[08:16:23] Folding@home Core Shutdown: UNKNOWN_ERROR

Comments

  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    Could be many things. Some of the SMP units are just delicate, simply put. It could be a brief loss of connectivity from a wireless adapter or card. It could also be a 'bad' work unit that was downloaded.

    2610s have been among the tougher SMP WUs to fold.
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited August 2007
    What's the proper way of getting rid of the work unit and making it download another?
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    So it won't progress beyond UNKNOWN_ERROR? Did you try shutting it down with Ctrl+C and restarting?

    If you didn't know: after Ctrl+C, check Task Manager. If any FahCore_a1.exe processes are still running, you must wait. The shutdown process must synchronize all four FahCore processes, or it can destroy the work unit.
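For anyone who would rather script the Task Manager check described above, here is a minimal Python sketch. It assumes a Windows machine, that the core shows up as FahCore_a1.exe (the name in the log at the top of this thread), and that polling `tasklist` is good enough. The function names, the 5-second poll, and the 10-minute timeout are made up for illustration and are not part of the Folding@Home client.

```python
import csv
import io
import subprocess
import time

# Core image name as it appears in the log in this thread.
CORE_IMAGE = "FahCore_a1.exe"


def count_core_processes(tasklist_csv, image=CORE_IMAGE):
    """Count rows of `tasklist /FO CSV /NH` output whose image name matches."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return sum(1 for row in rows if row and row[0].lower() == image.lower())


def wait_for_cores_to_exit(poll_seconds=5.0, timeout=600.0):
    """Poll tasklist (Windows only) until no core process is listed.

    Returns True if the cores disappeared, False if the timeout expired.
    The poll interval and timeout are arbitrary illustrative values.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=True,
        ).stdout
        if count_core_processes(out) == 0:
            return True
        time.sleep(poll_seconds)
    return False
```

Calling `wait_for_cores_to_exit()` after Ctrl+C returns True once no core process is listed. It only tells you the processes are gone, not whether the work unit survived.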
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    If it still won't run, delete the "Work" and "queue" folders from your Folding@Home installation folder and then restart the client.
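The cleanup suggested above can be sketched in a few lines of Python. This is only an illustration under assumptions: the client must already be fully stopped, `client_dir` is whatever folder you installed into, and the names `work` and `queue.dat` are taken from this thread (the queue is referred to as queue.dat further down). The function name is hypothetical.

```python
import shutil
from pathlib import Path


def reset_work_unit(client_dir):
    """Delete the work folder and queue.dat under client_dir.

    Returns the names of the items that were actually removed.
    Run this only after the client and all cores have exited.
    """
    root = Path(client_dir)
    removed = []

    work = root / "work"
    if work.is_dir():
        shutil.rmtree(work)  # removes the folder and everything in it
        removed.append(work.name)

    queue = root / "queue.dat"
    if queue.is_file():
        queue.unlink()
        removed.append(queue.name)

    return removed
```

Anything else in the folder (the executable, the configuration) is left alone, so the client should fetch a fresh assignment on the next start.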
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited August 2007
    Leonardo wrote:
    So it won't progress beyond UNKNOWN_ERROR? Did you try shutting it down with Ctrl+C and restarting?

    If you didn't know: after Ctrl+C, check Task Manager. If any FahCore_a1.exe processes are still running, you must wait. The shutdown process must synchronize all four FahCore processes, or it can destroy the work unit.

    After the error that I posted, I Ctrl+C closed it, restarted the computer, and tried running it again. Same UNKNOWN ERROR.
    Leonardo wrote:
    If it still won't run, delete the "Work" and "queue" folders from your Folding@Home installation folder and then restart the client.

    Will do. That's what I did last time I needed to get rid of a WU, but I wasn't sure if that was the safest way. Thanks :thumbsup:
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited August 2007
    Hmm, weird. I deleted the WU and downloaded another, but it still gave me the UNKNOWN ERROR message (but only on one line, this time). After the error, I noticed that all four cores seemed to be processing as normal in Task Manager, though. I left it running all day, but it never did update the console or log with any time steps. Never updated the unitinfo.txt, either. I can't be sure it's not just spinning its wheels.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    It would appear that the cores are indeed "spinning [their] wheels." This could be caused by a wireless adapter or card losing connectivity to the network or Internet. SMP units, some more than others, are very touchy about network connectivity, even if the client is not attempting to send or receive data.

    If the console and/or log is not showing frame progression with time stamps, the work unit is stalled. There's almost no chance it can be recovered, at least in my experience.

    Have you had any events lately that might have corrupted Windows system files and communications apps, such as .net Framework? How long has it been since you've done a Check Disk operation? That's probably not the problem, but it wouldn't hurt to run it. If that doesn't fix it, just wipe out the contents of the Folding client folder and re-install. I had the exact same problem a week ago.

    The first diagnostics I performed were Check Disk, overclock stability testing, temperature monitoring, and hard drive diagnostics (Hitachi's Drive Fitness Test). Everything passed 100%. The system was rock stable, with the hard drive passing with flying colors. I thought everything was OK and that the stalled WUs were just anomalies, but subsequent downloaded WUs continued to hang and be ruined. After reading the known bugs thread at Folding Community, I started suspecting the network connection (home network). One time I was observing Task Manager and all four FahCore_a1 processes disappeared at the same time that a "lost internet connection" window popped up on the desktop. After this incident there was no question. I uninstalled the D-Link wireless USB adapter and installed a Netgear wireless G card (just say no to both!). The Netgear card was even worse than the D-Link adapter! OK, enough is enough. I reconnected the computer via Ethernet cable nearly a week ago, and the subject computer has successfully completed every SMP work unit downloaded since then.

    BTW, I've had zero problems like this with the computers that are networked with Linksys PCI cards. Note: none of the Linksys cards have the dubious "speedboost" technology, which has not been getting good reviews. These are the conventional Linksys B-G cards. I returned that sorry Netgear for a refund.

    BTW, I've ordered two of these MSI wireless G/B cards from Newegg. They get excellent user reviews and cost less than half of what Linksys and other competitors cost. Cross fingers - we'll see.

    Netgear WG311 - 'just say no!' I purchased this at CompUSA during a lunch break from work. I should have checked out reviews first.

  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    After the error, I noticed that all four cores seemed to be processing as normal in Task Manager, though.
    Yes, that can occur. The FahCore_a1 processes can show as active in Task Manager, but that can be a false positive. The key indicators of a stall are failure to advance to new frames and CPU core temperatures dropping to near idle.
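The "failure to advance to new frames" check can be roughed out against the log itself. The sketch below is an assumption-laden illustration, not a client feature: it looks for lines shaped like the `Completed N out of M steps` line in the log at the top of this thread, and it needs you to pass in the current time on the same clock the log uses. The one-hour threshold is arbitrary, and both function names are hypothetical.

```python
import re

# Matches e.g. "[08:09:41] Completed 500000 out of 500000 steps  (100 percent)"
STEP_RE = re.compile(r"^\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps")


def last_progress(log_lines):
    """Return (seconds_since_midnight, steps_done, steps_total) for the
    newest 'Completed ... steps' line, or None if there is no such line."""
    found = None
    for line in log_lines:
        m = STEP_RE.match(line.strip())
        if m:
            h, mi, s, done, total = map(int, m.groups())
            found = (h * 3600 + mi * 60 + s, done, total)
    return found


def looks_stalled(log_lines, now_seconds, max_gap_seconds=3600):
    """True if no step line exists, or the newest one is older than the gap.

    now_seconds is seconds since midnight on the log's own clock.
    The log has no dates, so the gap is computed modulo one day.
    """
    progress = last_progress(log_lines)
    if progress is None:
        return True
    gap = (now_seconds - progress[0]) % 86400
    return gap > max_gap_seconds
```

A sensible threshold depends on how long a frame normally takes on your machine, so treat the default as a placeholder.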
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited August 2007
    Leonardo wrote:
    Yes, that can occur. The FahCore_a1 processes can show as active in Task Manager, but that can be a false positive. The key indicators of a stall are failure to advance to new frames and CPU core temperatures dropping to near idle.

    Ah, I should have checked the core temps. Normally I have SpeedFan running, but I didn't last time. I'll wipe out my install and try again tonight :)

    I'm connected with an ethernet cable, so hopefully there aren't any connection issues (rather not get a new nic or router). Icrontic_11 is on a Gigabyte wireless PCI card, but it's folding regular WUs.
  • QCH Ancient Guru Chicago Area - USA Icrontian
    edited August 2007
    And issues like this keep me from running SMP on any of my systems. I hope they can work these issues out before getting out of Beta.

    Also... they need to work out password issues. I CANNOT enter my domain password into a program to use my credentials. Not allowed at work. The three systems that I have tried all failed to complete.
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited August 2007
    QCH2002 wrote:
    And issues like this keep me from running SMP on any of my systems. I hope they can work these issues out before getting out of Beta.

    Also... they need to work out password issues. I CANNOT enter my domain password into a program to use my credentials. Not allowed at work. The three systems that I have tried all failed to complete.

    No kidding - there's got to be a better way to do it than it needing access to credentials. It's been in beta a long time. I know their resources are limited, but I hope they get everything worked out soon and get it ready for general release.
  • TBonZ Ottawa, ON Icrontian
    edited August 2007
    Gargoyle, uninstall the client. Re-download the exe and install it. Be sure to run install.bat in the folding directory and enter your password twice to re-initiate the services. Then start the client again.

    That should get you back up.
  • Garg Purveyor of Lincoln Nightmares Icrontian
    edited August 2007
    Thanks for the tips, guys! I reinstalled on Saturday, and it turned in a WU this morning :thumbsup:
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    Good deal!

    Tips:

    Be very careful about shutdowns. If you don't use Ctrl+C to shut the client down, you risk destroying the work unit. After using Ctrl+C, open Task Manager and watch for the four FahCore_a1 processes running in the background. These four cores must synchronize before ending. After they have disappeared from Task Manager, it is safe to shut down your computer.

    Networking. Some of the WinSMP units are very, very sensitive to network connections. If you are on a wireless network connection, ensure that all power saving settings for your wireless card/adapter are turned OFF, that the device is at full power all the time. Just a one or two-second network disconnect when on wireless can destroy a work unit. (I don't know why, but it's a fact.)
  • SPIKE09 Scatland
    edited August 2007
    Leonardo wrote:
    Good deal!

    Tips:

    Networking. Some of the WinSMP units are very, very sensitive to network connections. If you are on a wireless network connection, ensure that all power saving settings for your wireless card/adapter are turned OFF, that the device is at full power all the time. Just a one or two-second network disconnect when on wireless can destroy a work unit. (I don't know why, but it's a fact.)
    MPI.exe uses the IP address for the cross-core communication, so when any network setting changes, MPI.exe gets confuzzled.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    Hey, thanks for the explanation. Is that why .net Framework is required for the Windows SMP client? But I still don't quite understand. The cores must communicate within the client to coordinate the simulation calculations for the work unit. What does that have to do with outside network - 'outside' meaning the connection to the LAN and/or Internet?
  • SPIKE09 Scatland
    edited August 2007
    It uses the external IP address as a point of reference. Some folks have cured it on a network by using static IP addresses.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2007
    (A light bulb blinks on above Leo's head.) Ahh, now I really do understand why the units would fail.
  • Thrax 🐌 Austin, TX Icrontian
    edited August 2007
    As a point of reference to what?
  • SPIKE09 Scatland
    edited August 2007
    Thrax wrote:
    As a point of reference to what?
    The bucket thraxie pooh, I aren't a coder just a folder.:eek:
  • edited August 2007
    I have a Folding@Home (Windows SMP) problem, and at the same time, for some reason, I cannot get to the user community forum (Bruce). Maybe I can get some help here. I found this forum by googling "2610 (1,84,0)"!

    It is work unit 2610 (1,84,0), a Gromacs error (get_symtab_handle 54650952 not found) ...src\gmxlib\symtab.c, line 108. It is a hard stop, with no recovery.

    I downloaded and re-installed the SMP client (after deleting the 'folding' folder) several times, and got the same work unit resulting in the same error. I do not know how to delete a work unit. Deleting the 'queue file' and the 'work folder' does not help.

    By the way, I have two computers with a Q6600 (4 cores), stock, not overclocked, each running a single instance of Windows SMP on the same cable connection. One is running fine (both have been running fine for about 2-3 months).

    Anyway I have been stuck for a couple of hours and I would like to get going again. Any help is greatly appreciated.

    thank you

    Otto1939
  • SPIKE09 Scatland
    edited August 2007
    Hi Otto, did you perchance clone the install from one machine to the other? That can cause problems. You did the correct thing in this instance by deleting the work folder and the queue.dat file. It's the only time I have ever seen Bruce/7im advocating it. Next, I would rerun install.bat and change the machine ID; this should remove the problem.

    Edit: and welcome to Icrontic. Maybe posting the first 30 lines of the FAHlog with the -verbosity 9 flag in place would help.
  • edited August 2007
    thank you.

    I went thru separate installs. I don't know how to do anything else.

    I also don't know how to run 'verbosity 9'.

    Here is the log, showing the end of the download until the error:

    [02:03:32] + 696320 bytes downloaded
    [02:03:32] + 706560 bytes downloaded
    [02:03:32] + 716800 bytes downloaded
    [02:03:32] + 727040 bytes downloaded
    [02:03:32] + 737280 bytes downloaded
    [02:03:32] + 747520 bytes downloaded
    [02:03:32] + 757760 bytes downloaded
    [02:03:32] + 768000 bytes downloaded
    [02:03:32] + 778240 bytes downloaded
    [02:03:32] + 788480 bytes downloaded
    [02:03:32] + 789667 bytes downloaded
    [02:03:32] Verifying core Core_a1.fah...
    [02:03:32] Signature is VALID
    [02:03:32]
    [02:03:32] Trying to unzip core FahCore_a1.exe
    [02:03:32] Decompressed FahCore_a1.exe (2035712 bytes) successfully
    [02:03:32] + Core successfully engaged
    [02:03:37]
    [02:03:37] + Processing work unit
    [02:03:37] Core required: FahCore_a1.exe
    [02:03:37] Core found.
    [02:03:37] Working on Unit 01 [August 27 02:03:37]
    [02:03:37] + Working ...
    [02:03:37]
    [02:03:37] *------------------------------*
    [02:03:37] Folding@Home Gromacs SMP Core
    [02:03:37] Version 1.74 (March 10, 2007)
    [02:03:37]
    [02:03:37] Preparing to commence simulation
    [02:03:37] - Ensuring status. Please wait.
    [02:03:39] - Starting from initial work packet
    [02:03:39]
    [02:03:39] Project: 2610 (Run 1, Clone 84, Gen 0)
    [02:03:39]
    [02:03:39] Assembly optimizations on if available.
    [02:03:39] Entering M.D.
    [02:03:58] tial work pa- Starting from initial work packet
    [02:03:58]
    [02:03:58] Project: 2610 (Run 1, Clone 84, Gen 0)
    [02:03:58]
    [02:03:58] Entering M.D.
    [02:04:04] Rejecting checkpoint
    [02:04:05] Gromacs error.
    [02:04:05]
    [02:04:05] Folding@home Core Shutdown: UNKNOWN_ERROR
    [02:04:05]
    [02:04:05] Folding@home Core Shutdown: UNKNOWN_ERROR
  • edited August 2007
    Here is the end of the data from the file wudata_010:

    ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
    H. J. C. Berendsen, D. van der Spoel and R. van Drunen
    GROMACS: A message-passing parallel molecular dynamics implementation
    Comp. Phys. Comm. 91 (1995) pp. 43-56

    --- Thank You ---


    Program Core_A1.exe, VERSION 3.3
    Source code file: ..\..\..\src\gmxlib\symtab.c, line: 108
    Fatal error:
    symtab get_symtab_handle 54650952 not found
    Thanx for Using GROMACS - Have a Nice Day
  • SPIKE09 Scatland
    edited August 2007
    To add the -verbosity 9 flag, create a shortcut to the SMP client and drag it to your desktop. Right-click the shortcut, go to Properties, and at the end of the path to the exe add a space and then -verbosity 9.
    Related wiki entries:

    http://fahwiki.net/index.php/How_do_I_reconfigure_the_console_client_options%3F

    http://fahwiki.net/index.php/How_do_I_add_flags_using_a_shortcut_to_the_console_client%3F
  • edited August 2007
    Thank you. I really think I don't know enough to get this going. My intent is just to get to the point where it will download a different work unit.

    I tried the verbosity 9 option and got this: the program is downloaded repeatedly. This shows one iteration. I had to modify the links because I am not allowed to post links here yet.


    [17:55:12] Initial: 316E; + 727040 bytes downloaded
    [17:55:12] Initial: D89D; + 737280 bytes downloaded
    [17:55:12] Initial: E6A3; + 747520 bytes downloaded
    [17:55:12] Initial: B488; + 757760 bytes downloaded
    [17:55:12] Initial: BAFD; + 768000 bytes downloaded
    [17:55:12] Initial: 34A0; + 778240 bytes downloaded
    [17:55:12] Initial: DD6C; + 788480 bytes downloaded
    [17:55:12] Initial: D2E9; + 789667 bytes downloaded
    [17:55:12] Verifying core Core_a1.fah...
    [17:55:12] Signature is VALID
    [17:55:12]
    [17:55:12] Trying to unzip core FahCore_a1.exe
    [17:55:13] Decompressed FahCore_a1.exe (2035712 bytes) successfully
    [17:55:13] + Core successfully engaged
    [17:55:18]
    [17:55:18] + Processing work unit
    [17:55:18] Core required: FahCore_a1.exe
    [17:55:18] Core found.
    [17:55:18] Working on Unit 01 [August 27 17:55:18]
    [17:55:18] + Working ...
    [17:55:18] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work -suffix 01 -checkpoint 15 -verbose -lifeline 2212 -version 591'
    [17:55:26] CoreStatus = 63 (99)
    [17:55:26] + Error starting Folding Home core.
    [17:55:31]
    [17:55:31] + Processing work unit
    [17:55:31] Core required: FahCore_a1.exe
    [17:55:31] Core found.
    [17:55:31] Working on Unit 01 [August 27 17:55:31]
    [17:55:31] + Working ...
    [17:55:31] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work -suffix 01 -checkpoint 15 -verbose -lifeline 2212 -version 591'
    [17:55:39] CoreStatus = 63 (99)
    [17:55:39] + Error starting Folding Home core.
    [17:55:44]
    [17:55:44] + Processing work unit
    [17:55:44] Core required: FahCore_a1.exe
    [17:55:44] Core found.
    [17:55:44] Working on Unit 01 [August 27 17:55:44]
    [17:55:44] + Working ...
    [17:55:44] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work -suffix 01 -checkpoint 15 -verbose -lifeline 2212 -version 591'
    [17:55:52] CoreStatus = 63 (99)
    [17:55:52] + Error starting Folding Home core.
    [17:55:52] - Attempting to download new core...
    [17:55:52] + Downloading new core: FahCore_a1.exe
    [17:55:52] Downloading core (~pande Win32 x86 Core_a1 fah from stanford)
    [17:55:53] Initial: AFDE; + 10240 bytes downloaded
    [17:55:53] Initial: AD21; + 20480 bytes downloaded
    [17:55:53] Initial: CC38; + 30720 bytes downloaded
    [17:55:53] Initial: 8501; + 40960 bytes downloaded
    [17:55:53] Initial: F56A; + 51200 bytes downloaded
    [17:55:53] Initial: ABAE; + 61440 bytes downloaded
    [17:55:53] Initial: B6B0; + 71680 bytes downloaded
    [17:55:53] Initial: 783A; + 81920 bytes downloaded
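A retry loop like the one in the log above is easier to spot with a quick tally of how the core exited each time. The following Python sketch just counts `Core Shutdown:` names and `CoreStatus =` codes in a pasted log excerpt. The function name is hypothetical, the meaning of the codes is not something this sketch knows, and the observation that the first number looks like the hexadecimal form of the second rests only on the two samples in this thread (64 with 100, 63 with 99).

```python
import re
from collections import Counter

SHUTDOWN_RE = re.compile(r"Core Shutdown: (\w+)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_core_exits(log_lines):
    """Count shutdown names and CoreStatus codes seen in a log excerpt.

    Returns two Counters: one keyed by shutdown name (e.g. UNKNOWN_ERROR)
    and one keyed by the first CoreStatus field as written in the log.
    """
    shutdowns = Counter()
    statuses = Counter()
    for line in log_lines:
        m = SHUTDOWN_RE.search(line)
        if m:
            shutdowns[m.group(1)] += 1
        m = STATUS_RE.search(line)
        if m:
            statuses[m.group(1)] += 1
    return shutdowns, statuses
```

Seeing the same status several times in a row, followed by a core download, matches the "one iteration" described in the post above.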
  • edited August 2007
    Eureka!

    Changing the machine id from 1 to 2 did the trick.

    It downloaded a different 2610 wu and started folding.

    Many Thanks

    Otto
  • SPIKE09 Scatland
    edited August 2007
    Thought that would work, glad to help. I think 5 posts is enough to be allowed to post links here. :p