Conflict of some sort?

DogSoldierDogSoldier The heart of radical Amish country..
edited October 2003 in Folding@Home
Hi, you seem like nice people...

I might be having a problem. I have 3 machines folding.. This one (Home comp) is fine.. it's the 2 at work that are doing strange things. I've been folding with my office comp for a week, no problems. But yesterday I installed the GUI on the company server and now both machines seem unable to finish a wu. Take a look at http://folding.extremeoverclocking.com/member_overview.php?UserID=60927 and note the "Last 24hr Production" table.
You'll notice some small point gains at 10/27/03 09:00 PM and 10/28/03 12:00 AM. Those are the office machines. Now look at:
http://folding.stanford.edu/cgi-bin/userpage?name=DogSoldier
It lists 2 active processors when it should show 3.
Now, I'm not a network guy, but since they are behind a firewall, could the Folding server be getting confused and sending bad instructions to what it thinks is one machine? My first impulse is to disable -forceasm on the server a la mmonnin's post, but I'm certain there's something deeper to this.

edit// Oh yeah, running all GUI

Comments

  • LincLinc Owner Detroit Icrontian
    edited October 2003
    So, the only problem is you see the wrong number of processors on Stanford? I wouldn't be too concerned about that. I've heard that their tally is frequently incorrect. I'm pretty certain my number of CPUs is off as well.
  • a2jfreaka2jfreak Houston, TX Member
    edited October 2003
    You can run the two machines that you know work under the username "a2jfreak" to see if the machine you're unsure about turns in any points. Even if the CPU doesn't show up, the points will. :D
  • DogSoldierDogSoldier The heart of radical Amish country..
    edited October 2003
    No General, the CPU not showing up is probably because the new machine hasn't completed a WU. That's one issue, but I think it's related to the crashing WUs, which is the primary problem. I believe they are crashing at the same time, because the stats list 2 WUs but with minimal points.

    a2jfreak, great idea.. actually, I'll run all machines using a2jfreak!
  • edited October 2003
    DogSoldier, if you installed the client yesterday on the server, it might take a day or 2 for it to show up in the stats. It needs to turn in its first WU, then get that info to the stats server. If it doesn't show up in a few more days, then we'll see if there is a problem. :)
  • a2jfreaka2jfreak Houston, TX Member
    edited October 2003
    crosses his fingers. ;D
    DogSoldier had this to say
    a2jfreak, great idea.. actually, I'll run all machines using a2jfreak!
  • csimoncsimon Acadiana Icrontian
    edited October 2003
    well if it's any comfort ...I've been running 26 or better cpus and stanford only shows like 9 most of the time ...sometimes 12 ...and I don't think I'm losing work but I can check logs when I get a chance if you like
  • Straight_ManStraight_Man Geeky, in my own way Naples, FL Icrontian
    edited October 2003
    Basically, the GUI tends to always use Machine ID 1. The problem is that the servers use the Machine ID, not pure CPU presence, for a bunch of things:

    A different Machine ID gives each box its own unique machine hash, alongside a UID hash that is keyed to you as the F@H UID;

    The WUs are keyed to machine ID hash for tracking purposes, though the UID hash is normally sent to get credit to that UID;

    So, ideally, you need only one machine on the GUI per UID. The installer always sets Machine ID 1, so if a second machine uses the GUI with the same Machine ID and UID (your user ID), the server setting up the account thinks a machine is being REINSTALLED.

    Note that the Machine ID hash might also be used to tune what kinds of WU are sent, so WUs that the first machine is good at might go to another box that cannot grok or handle them well. If that box is under load (like a server IS, typically), the client and core might get swapped out and pended, and abend at random as load peaks.

    This results in the following: the second box with the same Machine ID would be likely to abend WUs more than the first box with that ID. The server box, with F@H at low priority, might be pending things in a way the client or core is not set up to handle. The F@H servers per se do not fold; they are dedicated to handling results, giving out new WUs, and passing data to the stats servers, and the web server gets its data from them. Each single box might talk to about 8-9 servers, yes, but there are both more servers doing just admin work for data and work servers that process and collate folding returns, AND each box is unique in what WUs it handles.

    So, I would do this: look in the work folder inside the Folding@Home directory on your server and see if multiple work units are present. If so, one of three things is true: either the server is so busy that it is pending the upload of WUs and they are stacked up waiting to be sent, or the client and core are abending and then starting a new WU when they are unpended (told to come back to life), or the wrong WUs for your box are being sent. Note also that partial credit is possible with F@H: if your box interrupts F@H too much, the client will abandon the WU, and when it is turned in you might get partial credit for the partially worked WU.

    IF this is so, changing the priority in the GUI on the server might help, if your server has excess capacity to let the process run at a higher priority. There IS, in the GUI config, an optional priority setting meant for when other distributed computing processes are interfering with F@H. It will take away from the server's main work to a degree, though: it makes the F@H process more important to the O/S, so other things get pended more than they did with F@H at lowest priority (the default) if a resource crunch occurs.

    One of my boxes here is a Windows workstation that does heavy floating-point work, and its GUI runs at higher priority simply because it limped at lower priority. The other box does mostly surfing, workstation things and virus protection for email, and it is a Linux box. The Linux box is also a P4 box. It runs at default priority.

    In this case, not only could the F@H distribution servers be confused by machines with the same Machine ID that hash differently, but the second box on the GUI might also be getting WUs that take too many resources for the server to cope with given its other work. If you HAVE to run many GUI'd boxes and have problems with one of them, either set each problem box to its own UID and let it use Machine ID 1, or manually edit its config files to a Machine ID that no other GUI'd box on the same UID uses. Machine ID does matter a lot.

    So does randomly accessible comm that does not get pended a lot or preempted by other higher-priority processes. Also, the non-GUI client lets you choose a Machine ID (1-4) at install time and makes it easier to get a custom install for each additional machine. If all boxes HAVE to have the GUI, the easiest route is to hook them to multiple UIDs so they all get completely unique hash sets. It is also easiest to have one download for each box, done from the box you want it on: some of the machine ID data in the hash set is precalculated based on communications rate and effectiveness and sent with the download, and a base key for the downloading machine is calculated just before download time and stuck into the package, AND that key is supposed to be machine-unique to begin with. When I downloaded a Windows GUI set from the Linux box and stuck it into the Windows box, I got some strange things happening. I finally reloaded the GUI from a dialup download done by the target box itself, and things smoothed out a lot. I can SEE the difference in minutes/frame on most of the WUs between downloading directly from the box that will be doing the work and a TRANSFERRED BY FEET (burn to CD, walk over there, install) client download.

    If there is a proxy server inline for both the work boxes, that might also account for the CPU count being off, but this looks like it might be something more than just that.

    John.
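For anyone who wants to script John's "look in the work folder" check, here is a rough Python sketch. It assumes queued work units show up as files named like wudata_01.dat, wudata_02.dat inside the client's work folder; that naming pattern and the helper name find_work_slots are assumptions for illustration, so check what your client version actually writes before trusting the count.

```python
import re
from pathlib import Path

# Assumption: each queued work unit has a data file named like
# "wudata_01.dat" in the work folder. Adjust the pattern if your
# client uses different file names.
SLOT_PATTERN = re.compile(r"^wudata_(\d+)\.dat$", re.IGNORECASE)


def find_work_slots(work_dir):
    """Return the sorted slot numbers that have a data file in work_dir."""
    slots = set()
    for entry in Path(work_dir).iterdir():
        match = SLOT_PATTERN.match(entry.name)
        if match and entry.is_file():
            slots.add(int(match.group(1)))
    return sorted(slots)
```

Calling find_work_slots on the server's work folder and getting back more than one slot would suggest work units are stacking up, which is the situation described above. It cannot tell you which of the three causes is responsible.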
  • mmonninmmonnin Centreville, VA
    edited October 2003
    Can you get the log file from those machines? 7 points for 2 WUs is too low.

    And what type of machines are these? Some specs please.
  • CammanCamman NEW! England Icrontian
    edited October 2003
    yeah, post up some log files so we can check them out.

  • DogSoldierDogSoldier The heart of radical Amish country..
    edited October 2003
    My workstation is a P2.4@533 FSB, 1gig PC 2700 RAM running WinXP Pro. The server is a P4 1.8 with 512 RAM running Windows 2000 Server.

    These logs are interesting. If I'm reading them right.. http://www.planetfortress.com/tfa/stuff/Workstation_FAHlog.txt
    My workstation stopped work on the unit at 8:53 pm (Is that EST or CST? And does this mean the machine rebooted or just the GUI?) then resumed the same Unit 8 minutes later. It then finished the unit at 7:00am but this doesn't jibe with my stats:
    http://folding.extremeoverclocking.com/member_overview.php?UserID=60927

    The server meanwhile started work on its first unit minus a User ID. At 1:39am it stopped and restarted over 40 minutes later on the same unit. It also gained a new User ID.
    http://www.planetfortress.com/tfa/stuff/Server_FAHlog.txt

    I don't know where the WU points came from at 9pm and 12am last night, but the points that the workstation submitted at 7:00 am are nowhere to be found. (Excuse me, I seem to have lost my WUs!?!) And the server, my new addition, is still working on its first unit.

    edit// It was 9pm and 12am, sorry.. The 3am points came from my home machine.
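Spotting stops and restarts like these by eye gets tedious on a long log, so here is a small Python sketch that flags long silences. It assumes every log line of interest starts with a bracketed clock time such as [20:53:12] and carries no date, so a jump backwards is treated as crossing midnight. The function name find_gaps and the 5-minute threshold are made up for illustration.

```python
import re
from datetime import timedelta

# Assumption: log lines begin with a clock time in brackets, e.g. "[20:53:12] ...".
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")
ONE_DAY = timedelta(days=1)


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Yield (previous_line, next_line, gap) for silences longer than threshold."""
    prev_time = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp
        hours, minutes, seconds = map(int, match.groups())
        now = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev_time is not None:
            # No date in the stamp, so wrap negative differences past midnight.
            gap = (now - prev_time) % ONE_DAY
            if gap > threshold:
                yield prev_line.rstrip(), line.rstrip(), gap
        prev_time, prev_line = now, line
```

Feeding it the lines of a log file (for example via open(path) on a local copy) lists each pair of lines with a long pause between them. A long gap is only a hint: a slow frame on a heavy WU can also go minutes without a log line, so the threshold needs tuning per machine.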
  • TemplarTemplar You first.
    edited October 2003
    General Keebler had this to say
    So, the only problem is you see the wrong number of processor on Stanford? I wouldn't be too concerned about that. I've heard that their tally is frequently incorrect. I'm pretty certain my number of CPUs is off as well.

    Yeah I've had 5 CPUs at one point, and only have 4. It's highly unlikely someone's going to jump on your name with your team number Dog, and fold for your name, so yeah, it's probably wrong :)
  • mmonninmmonnin Centreville, VA
    edited October 2003
    It's not 7 AM your time. It's 7 Stanford or GMT time. The hours come from the servers but the minutes are pulled off of your machine. That explains why the times the WUs were sent in don't match.

    Either the computer rebooted for some reason or the client was stopped. Either way FAH stopped for a brief period on both machines.

    Other than the clients being stopped it all looks ok to me.
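To make the time-zone point concrete, here is a quick Python sketch that converts a log time to local time. It assumes the log hours are GMT/UTC (the post leaves open whether it is Stanford time or GMT), uses fixed standard-time offsets for Eastern and Central, and picks 2003-10-28 as an illustrative date; none of that comes from the logs themselves.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard-time offsets, valid for late October 2003
# after US daylight saving time had ended.
EST = timezone(timedelta(hours=-5), "EST")
CST = timezone(timedelta(hours=-6), "CST")


def utc_to_local(utc_naive, tz):
    """Treat a naive datetime as UTC and convert it to the given zone."""
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(tz)


# Illustrative date; the log itself only shows clock times.
finished = datetime(2003, 10, 28, 7, 0)
print(utc_to_local(finished, EST).strftime("%Y-%m-%d %H:%M %Z"))  # 2003-10-28 02:00 EST
print(utc_to_local(finished, CST).strftime("%Y-%m-%d %H:%M %Z"))  # 2003-10-28 01:00 CST
```

Under those assumptions, a unit logged as finished at 7:00 would have been turned in around 2 AM Eastern, not 7 AM, which is why it lands in a different slot on the stats page than expected.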
  • edited October 2003
    I have to ask... what are the WUs, are they for 400 or 500?
    Mine's been folding 400s the last few times, and they take over 24hrs to fold even with my priority turned up and my load at 100%.
    The first ones I'd folded were 500s and folded really easily and quickly, but the last 3 have all been real number crunchers in terms of difficulty.
  • DogSoldierDogSoldier The heart of radical Amish country..
    edited October 2003
    I'm currently working on a 500 and 2 400s, all Gromacs. The way I understand it, there are 2 types of units: Gromacs and Tinkers. Gromacs are easier to digest. Do you use the SIMD flag? If not, open up your shortcut to FAH and add these flags: -advmethods -forceasm. Make sure they come after the closing quotation mark, separated by one space, i.e.

    winFAH.exe" -advmethods -forceasm

    More info on these flags here:
    http://forum.folding-community.org/viewtopic.php?t=6057

    edit// Actually, the 2 office machines are running tinkers. A 2500 and a 400... so.. the plot thickens!
  • edited October 2003
    Ok, done. I hope it speeds things up.
    It's taken all day to get 80 out of 400 done.....
  • edited October 2003
    I was just checking your stats DogSoldier and it looks like you've gotten credit for a few WU's turned in today, so you are probably OK.:)
  • DogSoldierDogSoldier The heart of radical Amish country..
    edited October 2003
    mudddoctor, yes but they are all partial WUs from one machine and of no use to F@H. Anyways... I came home, looked at the log (thanks for that tip BTW) and realized the problem lay here and not at work. I OC my 2.4c to 3.06GHz and this causes too many errors in the F@H GUI. And the thing reboots so fast that when I check on it.. it looks kosher. So I removed the -forceasm flag (I don't want to underclock) and will look at the log tomorrow to see if it's stabilized. I think maybe I'll have to look into some faster RAM, my Crucial (DON'T CALL ME OEM!) 3200 just isn't cutting it.
  • edited October 2003
    I bet that the ram could very well be the problem. You could try relaxing your timings a little and see if that helps too, until you get some faster ram or maybe trying to run at a different ram/fsb divisor to slow the ram speed down a little. I've heard that running the ram slower often doesn't affect things much anyways, especially if you are pushing it so hard you start getting errors.