Dual Windows SMP clients no dice
Ultra-Nexus
Buenos Aires, ARG
Hi!
The problem occurs when I try to shut down the clients.
I am running the affinity optimizer (the Russian software), but every time I stop either client I get:
[17:27:50] Writing local files
[17:27:50] Completed 275000 out of 500000 steps (55 percent)
[17:42:53] Timered checkpoint triggered.
[17:44:27] Writing local files
[17:44:29] Completed 280000 out of 500000 steps (56 percent)
[17:59:30] Timered checkpoint triggered.
[18:11:03] Killing all core threads
[18:11:03] Killing SMP core threads
[18:11:03] Could not get process id information. Please kill core process manually
Folding@Home Client Shutdown at user request.
[18:11:03] ***** Got a SIGTERM signal (2)
[18:11:03] Killing all core threads
[18:11:03] Killing SMP core threads
[18:11:03] Could not get process id information. Please kill core process manually
Folding@Home Client Shutdown.
But I still see the FahCore_a1.exe processes taking up CPU resources, even after waiting half an hour, so I kill them manually as the log says.
When I start the clients again, this is the result:
--- Opening Log file [February 18 21:03:11]
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 5.91beta6
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: U:\SMP1
Executable: U:\SMP1\fah.exe
Arguments: -local -forceasm -verbosity 9
Warning:
By using the -forceasm flag, you are overriding
safeguards in the program. If you did not intend to
do this, please restart the program without -forceasm.
If work units are not completing fully (and particularly
if your machine is overclocked), then please discontinue
use of the flag.
[21:03:11] - Ask before connecting: No
[21:03:11] - User name: _-_ThaNexus_-_ (Team 93)
[21:03:11] - User ID: 598F5D623175336
[21:03:11] - Machine ID: 1
[21:03:11]
[21:03:12] Loaded queue successfully.
[21:03:12]
[21:03:12] - Autosending finished units...
[21:03:12] + Processing work unit
[21:03:12] Trying to send all finished work units
[21:03:12] Core required: FahCore_a1.exe
[21:03:12] + No unsent completed units remaining.
[21:03:12] - Autosend completed
[21:03:12] Core found.
[21:03:12] Working on Unit 05 [February 18 21:03:12]
[21:03:12] + Working ...
[21:03:12] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 05 -checkpoint 15 -forceasm -verbose -lifeline 3780 -version 591'
[21:03:12]
[21:03:12] *
*
[21:03:12] Folding@Home Gromacs SMP Core
[21:03:12] Version 1.74 (March 10, 2007)
[21:03:12]
[21:03:12] Preparing to commence simulation
[21:03:12] - Assembly optimizations manually forced on.
[21:03:12] - Not checking prior termination.
[21:03:12]
[21:03:12] Folding@home Core Shutdown: MISSING_WORK_FILES
[21:03:12] Finalizing output
So, am I doing something wrong here? Dang, I have already lost 4 SMP units because of this...
Comments
I just ran a new test: I reinstalled the client with the affinity changer disabled, and I am still getting this "Could not get process id information" message. I believe it has something to do with the problem.
(Err, sorry, yes you did.) Is your client up to date?
Can you run each SMP client by itself?
This "Could not get process id information" message is a first for me. I have never seen it before.
If even one client won't run right, it sounds like remove-and-reinstall time to me.
Yeah (correct me if I'm wrong, UN), he's using a quad-core with two installations of the SMP client, using two cores each. He's using a core affinity assignment tool to do so.
I also agree that he should do a reinstall. I think that each install should have a separate download so that the UIDs are unique.
-- Completely uninstall all F@H clients, folders, everything
-- Ensure the Windows account is set to log in with a password (it can be set to log in automatically at boot)
-- Download a fresh Microsoft .NET Framework and install it
-- Download a fresh Affinity Changer
-- Download the latest Win SMP client
-- Reinstall both clients
Also, would you please tell us your config file settings? I'm wondering if you've got a bad setting in there.
BTW, in my experience running several computers with Win SMP, I've found the current Win SMP F@H client to be more stable than the previous client. Still, I always manually back up the entire contents of the client folders before I shut down the clients, always. That has saved countless work units.
I'm getting ~1000 PPD MORE on each of my Q6600s by using the affinity changer and running 2 clients. My 2 home boxes are putting out a combined 7500 PPD.
I understand that Stanford wants work units back as quickly as possible, but there's room for moderation.
EDIT: here is my config:
[settings]
username=_-_ThaNexus_-_
team=93
asknet=no
bigpackets=yes
machineid=1
local=3
[http]
active=no
host=localhost
port=8080
usereg=no
[clienttype]
type=3
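Since the question was whether a bad setting is hiding in there, here is a minimal Python sketch for comparing the config files of two client folders. It assumes the file is plain INI text like the one posted above. The helper names are made up for illustration, and the idea that two clients on one machine should not share a machineid is an assumption worth checking against the client documentation.

```python
import configparser


def read_client_settings(cfg_text):
    """Parse client config text into {section: {key: value}}."""
    # interpolation=None so a literal '%' in a value cannot raise an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {section: dict(parser[section]) for section in parser.sections()}


def same_machine_id(cfg_a, cfg_b):
    """True if both configs set the same machineid under [settings].

    Assumption (not confirmed in this thread): two clients on the same
    machine are expected to use different machineid values.
    """
    a = read_client_settings(cfg_a).get("settings", {}).get("machineid")
    b = read_client_settings(cfg_b).get("settings", {}).get("machineid")
    return a is not None and a == b
```

To use it, read the config file from each client folder into a string and pass both strings to `same_machine_id`. If it returns True, that might be one thing to change before trying the two clients together again.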
Also, when was the last time you downloaded a new client version? While this doesn't look like a two-month expiration, I've had some weird results when it's time to update versions.
lol, sorry, misspelled. Meant "MPICH2 Process Manager, Argonne National Lab" service.
Should I have two?
So, I followed your advice to uninstall everything and install again, reset my user account password (I was using another administrator user before), and it seems to be running this time.
I downloaded the client yesterday. I'll try copying it to the 2nd folder and see if both work fine together!
Thanks to all!
Do you have your clients installed under a 'user' with administrative rights? If not, it won't work. Also, the clients have to be installed under a user that logs into Windows with a password. I've set my machines to log in automatically when Windows boots.
Has anyone else here just copied the contents of one client folder to another?
If I remember right, the installations should be done with separate downloads so that each instance of the client has a unique UID for the project to use.
Using the same download shouldn't cause a problem. It's the same binary that's being executed whether you download it once or twice. I'd just install and configure one at a time.
This leaves 1 thread open each time I close either of the clients... dang!
EDIT: still no luck. Both clients start fine, but when I Ctrl+C one of them, it shuts down only 3 of its 4 processes and also makes the other running client fail with a "Client-core communications error: ERROR 0x7b".
I don't know why this is happening... Does anyone running 2 Win SMP clients have the same problem, or do your clients shut down all 4 processes correctly?
Thanks!
1) Before shutting down the clients, I copy the entire contents of each client folder to a backup folder (one backup folder per operational folder)
2) Open both operating clients to the desktop
3) In rapid succession, shut down each client via Ctrl+C. Don't just stop one client - stop both of them. It's all or none, in my experience.
4) When restarting the clients later, sometimes it is necessary to delete the contents of the operational client folders and copy over the 'clean' files from the respective backup folders.
Usually I can restart the clients without copying over backed-up files, but I had so many problems before with corrupted units at manual shutdowns that I've made it a habit to always back up the folders' contents.
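For anyone who would rather script the backup and restore from steps 1 and 4, a rough Python sketch follows. It is not part of the F@H client. The function names and the timestamped folder naming are my own choices, and it only copies folders, so the clients still have to be stopped by hand.

```python
import shutil
import time
from pathlib import Path


def backup_client_folder(client_dir, backup_root):
    """Copy an entire client folder into a new timestamped backup folder."""
    client_dir = Path(client_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{client_dir.name}-{stamp}"
    shutil.copytree(client_dir, target)  # fails if target already exists
    return target


def restore_client_folder(backup_dir, client_dir):
    """Delete the operational folder and copy the saved backup over it."""
    client_dir = Path(client_dir)
    if client_dir.exists():
        shutil.rmtree(client_dir)
    shutil.copytree(backup_dir, client_dir)
```

For example, `backup_client_folder(r"U:\SMP1", r"U:\FAH-backups")` would copy the first client folder from the log above. The backup location is hypothetical. Run it once per client folder before shutting down, and call `restore_client_folder` only when a restart comes up with a corrupted unit.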
Yes, sometimes it takes a long, long time for all the Folding processes to stop after a manual client shutdown. It's ridiculous.
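If you want to see whether any cores are still hanging around after a shutdown, something like the Python sketch below might help. It assumes the standard Windows `tasklist` and `taskkill` commands are available and that the leftover processes are named FahCore_a1.exe as in the log above. The function names are made up for illustration.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(text, image_name="FahCore_a1.exe"):
    """Return the PIDs of rows in `tasklist /FO CSV /NH` output that match."""
    pids = []
    for row in csv.reader(io.StringIO(text)):
        # Each row is: image name, PID, session name, session number, memory.
        if len(row) >= 2 and row[0].lower() == image_name.lower():
            pids.append(int(row[1]))
    return pids


def find_leftover_cores(image_name="FahCore_a1.exe"):
    """Ask Windows for the running processes and return the matching PIDs."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_csv(result.stdout, image_name)


def kill_pids(pids):
    """Force-terminate each PID, i.e. the manual kill the log asks for."""
    for pid in pids:
        subprocess.run(["taskkill", "/PID", str(pid), "/F"], check=False)
```

Calling `find_leftover_cores()` a minute or so after Ctrl+C shows what is still running, and `kill_pids(...)` does the same thing as ending the processes in Task Manager. Killing cores by force may still cost the work unit, so back up the folders first.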