Recovering Lost F@H Data
comfortable
Sugarland, TX
Hi guys,
Recently, I decided to push my CPU overclock further. I finally got it running stable at 220x11.5, 41°C under load (a 2600-M), but the journey was a sad one :bawling:
I used F@H as the stability checker, running it 24/7, and I didn't use the computer as much as I usually do. During one of my FSB increases, I noticed that the F@H service had crashed many times. To compound the problem, my ISP was going through some sort of failure, and I lost internet connectivity for an extended period.
Here are my F@H specs:
FAH502-Console.exe -svcstart -advmethods -forcesse:
comfortable's client.cfg wrote:
[settings]
username=comfortable
team=93
asknet=no
bigpackets=yes
machineid=1
local=6
[http]
active=no
host=localhost
port=8080
usereg=no
[core]
checkpoint=30
cpuusage=100
ignoredeadlines=yes
Overnight, it seems my computer discarded the large WU that it was 67% through (worth 320 points, incidentally) and hit the error you'll see below. Once I had a stable o/c and a dependable internet link, my computer was assigned a different WU. My F@H folder contains the FahCore 65, 78, and 82 .exe files.
Here's my F@H log:
Quit 101 - Fatal error:
[8:47:10] Step 658, time 1.316 (ps) LINCS WARNING
[8:47:10] relative constraint deviation after LINCS:
[8:47:10] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[8:47:10]
[8:47:10] Simulation instability has been encountered. The run has entered a
[8:47:10] state from which no further progress can be made.
[8:47:10] If you often see other project units terminating early like this
[8:47:10] too, you may wish to check the stability of your computer (issues
[8:47:10] such as high temperature, overclocking, etc.).
[8:47:10] Going to send back what have done.
[8:47:10] logfile size: 8131
[8:47:10] - Writing 8804 bytes of core data to disk...
[8:47:10] ... Done.
[8:47:10]
[8:47:10] Folding@home Core Shutdown: EARLY_UNIT_END
[8:47:13] CoreStatus = 72 (114)
[8:47:13] Sending work to server
[8:47:13] + Attempting to send results
[8:47:14] + Results successfully sent
[8:47:14] Thank you for your contribution to Folding@Home.
ad infinitum.
[10:33:02] + Attempting to get work packet
[10:33:02] - Connecting to assignment server
[10:33:02] + Could not connect to Assignment Server
[10:33:02] + Could not connect to Assignment Server 2
[10:33:02] + Couldn't get work instructions.
[10:33:02] - Error: Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
Ultimately, my question is: how do I prevent this from happening in the future? Is there any way to restore jobs that were lost to internet/stability issues?
Comments
No, you can't resume work that has been corrupted from instability, because...
When the internet goes down, F@H keeps completed WU data in queue so that when the connection is restored it can send the data back as normal. If the connection stays out, it will continue to try and get a new WU until it finally does get one. It will repeat this error...
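To picture the queueing behaviour described above, here's a rough Python sketch. It is not the actual F@H client code: `flush_results` and the `send` callable are made-up names for illustration, and the real client's retry timing isn't modelled at all. The only point is that a finished result stays in the queue until a send actually succeeds.

```python
from collections import deque


def flush_results(pending, send):
    """Try to send every queued result; keep whatever fails for a later retry.

    `pending` is a deque of finished results (any objects).
    `send` is a hypothetical callable returning True on success; it may also
    raise OSError when the network is down.
    """
    still_pending = deque()
    while pending:
        result = pending.popleft()
        try:
            ok = send(result)
        except OSError:
            ok = False
        if not ok:
            # Nothing is thrown away: the result waits for the next attempt.
            still_pending.append(result)
    return still_pending


# Connection down: both results stay queued.
queue = flush_results(deque(["wu-1", "wu-2"]), lambda r: False)

# Connection restored: the queue drains in the original order.
delivered = []
queue = flush_results(queue, lambda r: delivered.append(r) or True)
```

Note this only covers results that finished cleanly. A WU that ended early because of instability has nothing left to resume.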
If it's 41C, that's not a bad temp at all. What are your vcore and vdimm? Raising those might add some stability to your computer. You have headroom with the temps.
2.8 - vdimm
My computer is stable now. I've run prime95 and the usual stress tests. I had a whole bunch of problems during o/c testing, but it seems to be doing fine right now.
In retrospect, I should've disabled the F@H service during my o/c tests. There were plenty of other tests available, so it was kind of foolish of me to experiment with o/c values while folding. I'm burning in the CPU right now with 24/7 F@H.
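For anyone who wants to catch this sooner next time, here's a minimal Python sketch that counts `EARLY_UNIT_END` shutdowns in a log. It assumes the log lines look like the ones quoted in the first post. The function names and the threshold of 2 are my own choices for illustration, not anything from the F@H client.

```python
import re
from collections import Counter

# Matches lines such as:
# "[8:47:10] Folding@home Core Shutdown: EARLY_UNIT_END"
SHUTDOWN_RE = re.compile(
    r"^\[\d{1,2}:\d{2}:\d{2}\]\s+Folding@home Core Shutdown:\s+(\w+)"
)


def count_core_shutdowns(log_text):
    """Return a Counter of core shutdown reasons found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def looks_unstable(log_text, threshold=2):
    """True if EARLY_UNIT_END appears at least `threshold` times (assumed cutoff)."""
    return count_core_shutdowns(log_text)["EARLY_UNIT_END"] >= threshold


# Sample built from the lines quoted in this thread, repeated twice.
SAMPLE_LOG = """\
[8:47:10] Folding@home Core Shutdown: EARLY_UNIT_END
[8:47:13] CoreStatus = 72 (114)
[8:47:14] + Results successfully sent
[8:49:02] Folding@home Core Shutdown: EARLY_UNIT_END
[8:49:05] CoreStatus = 72 (114)
"""

if looks_unstable(SAMPLE_LOG):
    print("Repeated early unit ends: back off the overclock before folding.")
```

To run it against a real log, read the file into a string first and pass that to `looks_unstable`.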