All services have been moved off of failing hardware. There should be no further unscheduled interruptions.
Over the next few days, each service will be interrupted for about 20 to 30 minutes between 10pm and 6am so that I can back up the newly configured machines. That way, if there is a failure, a restore returns to the current configuration.
I think the motherboard is damaged in one of our hosts, the one that holds the mail spool, mail.eskimo.com, mx2.eskimo.com, debian.eskimo.com, and scientific7.eskimo.com.
In addition to random reboots, the machine sometimes logs disk errors, but the SMART status shows no issue with the drives and no errors recorded. That leaves the disk controllers, which are on the motherboard.
So tonight various services mentioned above will be down for a period of time as I move them off of this failing machine so I can take it out of service and replace the motherboard.
When the BIOS lost its fan settings, which shut down a chassis fan, the machine got quite warm, but it’s hard to say whether the heat damaged it or existing damage caused the BIOS to lose its settings.
At any rate, I am moving things off so I can take it out of service for several days to replace the motherboard and then burn it in properly (extensive testing with mprime while monitoring temperatures, etc.). This is both to make sure it is stable and to find the minimum voltage at which the CPU will operate stably. The lower the voltage, the less the heat and the longer the life.
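The voltage search described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual burn-in procedure: `is_stable` is a hypothetical stand-in for "set this voltage in the BIOS, run mprime for a while, and see whether it finishes without errors", and the step size and safety margin are assumed values. The second function uses the common rule of thumb that dynamic CPU power at a fixed clock scales roughly with the square of the voltage, which is why a small voltage reduction gives a noticeable drop in heat.

```python
from typing import Callable


def find_min_stable_mv(
    is_stable: Callable[[int], bool],  # hypothetical: True if the stress test passes
    start_mv: int,                     # known-good starting voltage, in millivolts
    floor_mv: int,                     # never test below this voltage
    step_mv: int = 10,                 # assumed step size
    margin_mv: int = 20,               # assumed safety margin added back at the end
) -> int:
    """Step the voltage down until the stress test fails, then back off."""
    if not is_stable(start_mv):
        raise ValueError("starting voltage fails the stress test")
    last_good = start_mv
    candidate = start_mv - step_mv
    while candidate >= floor_mv and is_stable(candidate):
        last_good = candidate
        candidate -= step_mv
    # Run slightly above the lowest passing voltage, but never above the start.
    return min(start_mv, last_good + margin_mv)


def relative_dynamic_power(new_mv: int, old_mv: int) -> float:
    """Rule-of-thumb estimate: dynamic power scales with voltage squared."""
    return (new_mv / old_mv) ** 2


if __name__ == "__main__":
    # Simulated CPU that is stable at 1185 mV and above (made-up number).
    chosen = find_min_stable_mv(lambda mv: mv >= 1185, start_mv=1300, floor_mv=1100)
    print(chosen, round(relative_dynamic_power(chosen, 1300), 3))
```

Integer millivolts are used instead of floating-point volts so that repeated stepping does not accumulate rounding error.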
Fedora.eskimo.com is now upgraded to Fedora 27. Not all of the previously installed applications are there, because I did a clean install rather than an upgrade. The previous two upgrades had not gone cleanly and left the rpm database less than pristine.
If applications that are important to you are missing, please e-mail firstname.lastname@example.org or open a ticket and I will prioritize getting those installed for you.
I am taking fedora.eskimo.com down to install Fedora 27. I am going to do a clean install because the last two upgrades did not go cleanly and the system has been somewhat flaky since.
Tonight I am going to take down the machine that houses the mail spool and the debian and scientific7 virtual machines to run some diagnostics.
The machine spontaneously rebooted again today. What makes this so difficult to troubleshoot is that it does not generate any errors between reboots.
I am concerned that the problem with the BIOS the other day either caused damage by overheating components, or that the motherboard was already damaged, which may have been the root cause of the BIOS losing its fan settings.
I am going to run some diagnostic software to see whether there is some marginal hardware, and I am also going to try removing a Linux option that overwrites the processor's firmware. This module caused problems on another of my machines, so it may just be bad software. However, there are two machines with identical hardware and only one is having problems.
I have made substantial reductions in the rates for virtual private servers.
I am without power and phones at my home office, but if you call and leave a message it will be automatically transcribed and I can pick it up on my tablet. Alternatively, you can create a ticket.
It would be the understatement of the year, or at least the week, to say that the work did not go as anticipated. The machine I was going to use as a basis for comparison also had an old BIOS, and since it was the same as the other machine that kept losing its fan profile, I decided it was best to upgrade it as well.
When I upgraded the BIOS it returned to the default settings which meant that I had to determine good settings from scratch. While I was doing this another machine crashed.
I don’t really know what caused the crashes, but I found a number of problems. For one thing, VT-d was not enabled but should have been, since we have kvm/qemu virtual machines on these boxes. Not having it enabled isn’t a show stopper, but it causes Linux to do more work to perform I/O in a virtual machine.
Then I found an issue with a setting that allows full 64-bit address decoding for I/O devices. For some reason this breaks the built-in graphics of the i7-6700K, causing instability.
I now have all the machines on the current BIOS and ran some torture tests on the CPUs, which ran clean. So we’ll see how it goes. I had to slow down the CPUs slightly on two machines to get them to complete the torture tests without error.
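For anyone who wants to check the same thing on their own Linux host, here is a minimal sketch, not the procedure used on these machines. It assumes the usual Linux convention that an active IOMMU populates `/sys/kernel/iommu_groups` with numbered directories, and that CPU virtualization support shows up as the `vmx` (Intel) or `svm` (AMD) flag in `/proc/cpuinfo`. On many kernels the IOMMU also has to be turned on with a boot parameter such as `intel_iommu=on`, so an empty directory does not by itself prove that the BIOS setting is off.

```python
from pathlib import Path


def iommu_active(groups_dir: str = "/sys/kernel/iommu_groups") -> bool:
    """Return True if the kernel has created at least one IOMMU group."""
    path = Path(groups_dir)
    return path.is_dir() and any(path.iterdir())


def cpu_has_virt_flag(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line lists vmx (Intel) or svm (AMD)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            return "vmx" in flags or "svm" in flags
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print("CPU virtualization flag present:", cpu_has_virt_flag(text))
    print("IOMMU groups present:", iommu_active())
```

Taking the directory and the cpuinfo text as parameters keeps both checks easy to try against saved copies from another machine.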
Virtual private servers will go down briefly so I can compare BIOS settings between a machine that is stable and the mail server, which presently is not. When I updated the BIOS to resolve a fan issue, it reset all settings to default. I tried to set them back as best I could from memory, but I suspect I didn’t do the best job, so I will be attempting to correct that tonight.
The virtual private servers should only be down a few minutes. The mail server will be down somewhat longer, as I will be running memory diagnostics to eliminate the possibility of a bad DIMM. It may take several hours of intermittent downtime to resolve these issues.
The host machine that houses the mail spool, the mail server virtual machine, and the debian virtual machine spontaneously rebooted today even though temperatures are normal. So I am going to take it down around midnight tonight to run some diagnostics and make some adjustments.