Hey all! I was hoping the hardware gurus here might tell me if they have seen any similar problems, or have further ideas for troubleshooting that I haven't yet pursued. I'm more of a software guru, so I'll seriously pursue any suggestions.
I apologize for the length of this in advance, but I usually don't bother helping people that don't try to investigate the problem to the best of their abilities and provide thorough information when asking for help, so why should you? If you actually read through all of this, you're a saint - and I appreciate the help.
Here are the specs for the system encountering the problems:
I built this in March of 2008, and it's been working great for the last 21 months. I'm even typing this from the same system now.
On X-mas eve, I was testing some patches for Wine in order to get Champions Online to work (after I noticed the demo was free to play until you hit level 15). Well... the patches worked, and I played... until I ran into this:
Every so often, when I would be in a battle... the computer would just completely reboot. Now, I've seen bad drivers crash the X server before, but in my mind a complete reboot of Linux is almost always a hardware problem. I've had this system up for months straight, playing all kinds of Steam games and never had it just reboot on me.
Lucky observation #1:
I noticed that now I can consistently reboot the system at will just by firing up CounterStrike: Source, choosing to create a server, selecting 23 computer players at easy level, and hitting the tab key... no need to actually start the game: the system reboots in what should actually be a low-stress menu.
I used to play CS:S all the time, so this is a totally new problem. Also, I have my CS:S install set up to use a whole different installation of Wine - not the one that I was applying hacks to in order to get Champions Online running.
I turned the system off for a few hours. Restarted and found the CounterStrike menu was working fine. I let the system sit there for an hour or so and started flipping the number of computer bots back and forth... it instantly rebooted. :(
Attempt to fix #1:
My first thought was that the system had sucked up enough dust that something must be overheating. So, I pulled out the patented can-o-air and blew everything out. I even removed the video card and blew it out thoroughly.
No luck - still reboots.
Attempt to fix #2:
Just to make sure this wasn't some horrible bug in the beta Nvidia 195.22 drivers that I had installed some weeks back, I backleveled to 190.42 (the latest official 'stable' Linux drivers).
I thought perhaps the CPU fan had slowed and was overheating (though it's still actually running at 2200+ RPM, according to the BIOS). So, I used the Linux stress command to spawn four threads and give the CPU a workout... I left the CPU running at 100% for over an hour... no reboot. :\
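For anyone who wants to repeat that kind of CPU soak test without the stress tool, here is a rough Python sketch of the same idea. It is not what was run above; the names `cpu_soak` and `BURN_CODE` are made up for illustration, and it assumes a plain busy loop per worker process is a fair stand-in for a CPU workout.

```python
import subprocess
import sys

# Child program: busy-loop until the deadline given as the first argument.
BURN_CODE = (
    "import sys, time\n"
    "end = time.monotonic() + float(sys.argv[1])\n"
    "while time.monotonic() < end:\n"
    "    pass\n"
)


def cpu_soak(workers=4, seconds=3600.0):
    """Start `workers` busy-loop processes and return their exit codes."""
    procs = [
        subprocess.Popen([sys.executable, "-c", BURN_CODE, str(seconds)])
        for _ in range(workers)
    ]
    # Wait for every worker; an exit code of 0 means it ran to its deadline.
    return [p.wait() for p in procs]
```

Calling `cpu_soak(workers=4, seconds=3600)` should keep four cores busy for about an hour. Separate processes are used because threads in one Python process would mostly take turns on a single core.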
I fired up 3DMark06, cranked up the settings and let it beat on the video card in an endless loop for an hour.... no reboot.
(in between all of these tests, I would fire up CounterStrike and force a reboot to make sure the hardware-fairies had not magically eradicated the problem).
Lucky Observation #2:
Right after one of these reboots, I started streaming a video from YouTube... halfway through it, the PC rebooted. So it does not seem to be a problem specific to Wine... and is most likely unrelated to the video drivers since I watch YouTube videos all the time with no problem.
Since I am always using a bleeding edge desktop, I decided to eliminate other software possibilities: #1) I replaced my Compiz desktop with KWin.... tested again: still rebooted. #2) I booted into Gnome instead of KDE 4.4 SC2... tested again: still rebooted.
By this time, I was beginning to wonder about the stability of the system memory. So, I rebooted into memtest86 and let it go... 3 passes... no errors found.
I even pulled the memory, reversed the slots, ran memtest86 again... no errors found.
Okay... screw this noise. The last time I saw a problem so bad that it could actually crash the Linux kernel, it was a defective video card. So, I yanked the video card and switched it with the EVGA 9800GTX+ from the PC used by my kids.
Fitting the EVGA in was not simple: Its heat shroud is huge compared to the Asus card - I had to put it in a different slot because the shroud wouldn't clear a hard drive. I even accidentally bent some of the pins on the motherboard fitting it in (luckily I saw and fixed them).
SAME EXACT REBOOT ISSUE... so, obviously not a video (hardware) problem.
Time for surgery: I took a BFG 800w PS that I had sitting around (Yes, I'm a hardware packrat), and replaced my current PS with it. The thought was that the power supply might be flaky and failing under load (though I don't know what kind of load the CS:S menu would possibly provide).
I also plugged this power supply into a different outlet, bypassing my UPS in case this was some kind of power draw issue there.
No dice: It still reboots in exactly the same manner.
So here I am:
At this point, I've eliminated most software, the CPU, memory, and the PSU.
I'm thinking the MB chipset is going bad... and only showing the problems after I do things that cause enough I/O for it to really warm up. I even (troubleshooting #8) reset the CMOS config and tested again: same issue.
What's weird is that the system seems so stable the rest of the time. I'm typing this on it right now... I could use it for days... it won't be unstable unless I use one of those two games - and then it may potentially reboot again (outside of Wine) when I start back up.
The only thing I haven't tried is re-installing Kubuntu 9.10 to see if the problem goes away. I suppose this could be some weird problem with the 2.6.32 kernel, but I've done many Google and Launchpad searches and can't find anyone reporting similar issues.
I'll add more info below, if I can think of anything else to try before buying a new MB. I'm halfway tempted to remove my RAID1 mirrored drives, connect them to the kids' PC... and boot them to see if it will crash. Hmmmm... if it did, it's definitely a software issue... but if it doesn't I can't say for sure it's not software because of the differing chipsets...
What part of "Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn" don't you understand?
You've done most of the heavy lifting. Have you checked the motherboard capacitors? I've had 2 or 3 systems die premature deaths due to slightly bulging capacitors, and I'm talking barely bulging: slightly rounded on top, not completely blown or leaking. The symptoms you mention seem to fit the bill.
I may have figured this out today (just toggled the CS:S menu about 50 times with no reboots), and it wasn't what I was originally thinking.
I'll post back with details of how I figured it out, but first I need to do some further testing to be more specific about the cause.
Well... that turned out to be a goose chase: I had installed Kubuntu 9.10 to another hard drive, copied in my hacked version of Wine... and had no problem...
I rebuilt my main install... installed the 195 drivers, updated xorg to 1.7.5, installed the 2.6.32 kernel... testing at each stage - thinking I was on to some bad interaction between them.
I got every last package installed, and there was no problem. Even did the complete upgrade to Lucid... tested... no problem.
Then, I copied in 100+ GB of data from an external drive I had connected via firewire.... and the problem came back. It seems that the system was fine until I did a lot of IO with my external drives...
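If heavy I/O really is the trigger, a repeatable load is handier than re-copying 100+ GB by hand. Below is a minimal sketch of one way to generate it: write a big temporary file, read it back, and compare checksums. The function name `io_soak` and its parameters are hypothetical, and it assumes the directory you pass sits on the drive you want to exercise.

```python
import hashlib
import os
import tempfile


def io_soak(directory, total_mb=1024, chunk_mb=4):
    """Write about total_mb of random data, read it back, compare hashes."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(total_mb // chunk_mb):
                f.write(chunk)
                written.update(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            # Note: reads may be served from the page cache, not the disk.
            for block in iter(lambda: f.read(len(chunk)), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the temporary file
```

A `True` result only says the data round-tripped intact. For this kind of bug, the more interesting result is whether the machine is still up afterwards.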
I turned the system off for 30 minutes, and I couldn't immediately recreate the problem (but probably could have if I had tried four or five more times)
So, I'm back to my original conclusion: The MB's got some kind of issue that's only showing itself after some significant throughput has warmed it up (though the temps still appear acceptable). I'd try the other drive to make sure it didn't get stable again... but just my luck the motor on it died last night.
I'll live with it for now, then hit NewEgg sometime after I've paid off XMas to see if I can find a replacement that can use the same CPU/memory. I think I'll unplug the external drive to see if it doesn't get completely stable again, as a workaround.
I always find myself learning more about computers from people who diagnose their own computer problem.
I thank you for posting your process because I'm sure it will help me in the future when someone has a similar issue.
I can pretty much always figure out software issues in any OS, but faulty hardware is not my forte. I generally have to make myself rule absolutely everything else out before I can believe hardware is failing and only failing under specific scenarios. It helps to write everything down and try to predict what others will ask/suggest. It saves them the time of asking - and forces me to do the tests that I'm normally too lazy to do. :)
Time, and the subsequent reviews for this MB on NewEgg, have not been kind. Many people report problems - some similar to mine - and have had to RMA boards multiple times. Oh well... @ 21 months it looks like I fared better than most of them. 77 people gave it 2 eggs or less.
I expected more from Asus, but I'm still willing to give them another chance - considering how well all of their other products have worked for me.
My northbridge temperatures were never as awful as some people claim in the reviews, though maybe it had something to do with the abbreviated lifespan. My northbridge ran at maybe 60-65 degrees Celsius - but that's in an Antec 900 case with the fans cranked. All other temps from the board are very low.
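For keeping an eye on board temperatures while a test runs, here is a small sketch that reads the kernel's hwmon sysfs files, which report temperatures in thousandths of a degree Celsius. It assumes a kernel that exposes `temp*_input` files directly under `/sys/class/hwmon/hwmon*`, and that the board's sensor driver is loaded. Whether the northbridge sensor shows up at all depends on the driver. The function names and the 65-degree default are illustrative only.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sensor label: degrees Celsius} for each readable hwmon input."""
    temps = {}
    for f in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            milli = int(f.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor missing or unreadable; skip it
        name_file = f.parent / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        temps[f"{f.parent.name}:{chip}/{f.name}"] = milli / 1000.0
    return temps


def over_limit(temps, limit_c=65.0):
    """Return only the sensors reading above limit_c."""
    return {label: t for label, t in temps.items() if t > limit_c}
```

Logging `read_hwmon_temps()` every few seconds during a soak test would show whether anything climbs right before a reboot.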
Anyone have any suggestions for a replacement MB? The requirements are: ATX, socket 775 (I'm using a Q6600 Core 2 Quad), DDR2 1066, and SATA RAID 1 (FakeRAID/dmraid). I was tempted to use this as an excuse to upgrade, but I'm putting that off until after tax time next year.
I was thinking about getting a GIGABYTE GA-EP45-UD3P LGA 775 Intel P45 ATX Intel Motherboard, GIGABYTE GA-EP45-UD3R LGA 775 Intel P45 ATX Intel Motherboard (The EX58-UD3R Gigabyte board in the CyberPower PC I bought for the kids a while back seems pretty decent), or maybe even an ASUS P5E3 PRO LGA 775 Intel X48 ATX Intel Motherboard. I'm far from finished looking, though.
hmmm... Still sounds like it COULD be a software problem. I'm not too familiar with linux tho, so I don't understand your Jargon lol
If it's Hardware... it's most likely your south bridge. Seems that the problem didn't show up til you started using the firewire bus.
Did you have the external firewire drive hooked up during your initial problems? Or any external devices. Did you run the system with the bare essentials? MB, CPU, RAM, Vid, Power, and 1 hard drive. No CD drives, no accessories.
Could be a bad hard drive, if you're running more than one. If Linux uses some kind of page file system like Windows, it's possible that the game is using it on a bad drive. Again, I'm not a Linux pro so I am shooting in the dark here lol. Troubleshooting hardware is easier when you understand how it works with the software. :-P
Try reseating the heat sinks on the MB. maybe put some new paste on them. I'd imagine the stuff it came with is pretty much useless now.
Again, try running the system with as few things hooked up as you possibly can.
Core i7 920|EVGA X58|GTX 660 TI & 460se for PHYSX|2x30GB Vertex RAID0|5x1.5TB RAID5
-- Certifications --
CompTIA A+; CompTIA Network+ ; CompTIA Security+; Microsoft Certified Professional(MCP); Microsoft Certified Systems Administrator(MCSA); Microsoft Certified Systems Engineer(MCSE); Certified Wireless Network Administrator (CWNA); Certified Wireless Security Professional (CWSP); Aruba Certified Mobility Associate (ACMA);
hmmm... Still sounds like it COULD be a software problem
I almost would have bet money that you were incorrect, but I'm big enough to admit when I was wrong: It was not the motherboard.
I know, because I replaced it, booted up (Linux is great in that the OS provides all the drivers, so you can replace any hardware and it will just boot without flinching), and was able to immediately recreate the problem.
Of course... I kind of suspected on some level that the most likely cause was a brand new kernel vs. video driver vs. xOrg problem - otherwise I wouldn't have done all that re-install testing previously.
I'm putting 9.10, nVidia-195, my data, and usual packages on it right now. I'm going to test for several days without updating Xorg or the kernel (or other lucid packages). When I'm absolutely sure that the problem isn't presenting itself, I'll put each one on and test for a week. I have plenty of time to figure this out before 10.04 goes to release in April. I must be a masochist to enjoy this troubleshooting process.
Oh well - I guess sometime soon I can buy a CPU, mem, and a vid card for my old MB and put this extra Antec 900 case and power supply I have to use - my kids need a second system anyway. :)
9.10 and the nVidia 195 drivers have run stable for several days now. I put Xorg 1.7.5 on it this morning, so we'll see how that goes for a bit.
Heh, yea I still thought it could be software because you mentioned a lot of software changes and it seemed like a sudden thing :-)
Glad you got it fixed!
Xorg 1.7.5 seems stable so far. I'm going to give it until this weekend, then I may bite the bullet and put the 2.6.32 kernel on it. I'm still suspicious of the KMS changes.
I have scripts that can put everything back from tar files, but I think I'll use the opportunity to also play with Clonezilla and the dmraid recovery process.
Put the 2.6.32 kernel on it... no problems as of yet. :\
Put the KDE4.4SC2 on it... no problems....
Now I'm adding back 127 lucid packages that I think are harmless.... (testing will tell)
After that, I'll test updating HAL and the other things (one at a time) that I think have changed significantly enough to cause this bug.
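With well over a hundred packages in play, adding them one at a time gets slow. A bisection over the ordered list of changes would cut the number of test rounds to about eight. The sketch below only shows the idea: it assumes a single culprit, a failure that reproduces reliably, and that any prefix of the list can be installed on its own (dependencies may not allow that). `first_bad_change` and `is_stable` are hypothetical names; `is_stable` stands for "install exactly these changes on the known-good base, run the reboot test, report the result".

```python
from typing import Callable, Optional, Sequence


def first_bad_change(
    changes: Sequence[str],
    is_stable: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first change whose addition breaks stability, else None."""
    if is_stable(changes):
        return None  # everything applied and still stable: nothing to find
    # Invariant: changes[:lo] is stable, changes[:hi] is unstable.
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_stable(changes[:mid]):
            lo = mid
        else:
            hi = mid
    return changes[hi - 1]
```

If the crash is intermittent, each `is_stable` round has to run long enough to be trusted, otherwise the search can home in on the wrong package.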
I hope the new motherboard is working! Let us know if you are having any other issues with it.
As I mentioned up there, I replaced the motherboard and had the same issue. The problem is now gone though, and I'm on the most up-to-date Lucid build.
The problem seems to be caused by a conflict between the proprietary non-open nVidia drivers I was using and Xorg 1.7.5. I can't be entirely sure though, because so many packages have been upgraded in the repos while I was doing the testing. But since I switched to a different, newer, glx package, I haven't been able to recreate the issue. That's what happens when you have to be on the bleeding edge and put a closed-source binary blob in your system for performance reasons. :p
Heh, from the initial descriptions I would have sworn it was your NB chip or part of the FSB, but I am glad you were able to get it to work.
I had originally posted something to that effect (in my previous post), but edited it down to that after reading down further. Granted I am pretty much the opposite of you, I can diagnose hardware problems pretty easily but have trouble with software with driver problems and the like. Plus I am pretty much a novice when it comes to linux issues.
This site is intended for informational and entertainment purposes only. The contents are the views and opinion of the author and/or his associates. All products and trademarks are the property of their respective owners. All content and graphical elements are Copyright © 1999 - 2013 David Altavilla and HotHardware.com, LLC. All rights reserved. Privacy and Terms