One of my customers has a corrupt historic file alarm. I'll look into what's going on tomorrow, but I thought I'd see if anyone here had any advice on dealing with this. They have a hot standby pair and a permanent standby.
The first question is really what the corruption is:
a. NTFS file system corruption
b. File write corruption
c. Application level corruption
For a., the solution can often be chkdsk /R, but the content of the file may also be corrupt. The standby may still have all the data, because data is streamed to the standby separately from whatever goes to disk; however, a resync may have caused problems there. The thing about chkdsk /R is that it needs a reboot, so when the server comes back up the other server should be the Main and it should recover that way too.
For b., it is possible with anti-virus, backup and similar tools, although very rare. Check that the size of the HRD file (not the size on disk) is a multiple of 32 bytes... it might not be.
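The size check for b. can be scripted rather than done by eye. A minimal sketch in Python, assuming only the 32-byte record size mentioned above; the function name is made up for illustration:

```python
import os

RECORD_SIZE = 32  # bytes per record, as stated in the post above


def trailing_bytes(path, record_size=RECORD_SIZE):
    """Return how many bytes are left over after the last whole record.

    0 means the logical file size is an exact multiple of record_size.
    """
    # os.path.getsize reports the logical file size, not the size on disk.
    return os.path.getsize(path) % record_size
```

Call it with the path to the suspect HRD file. A non-zero result means the file ends in a partial record, which points at a truncated or interrupted write.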
For c., I've seen zeros written as the 32 bytes, a long, long time ago. scx_cmd hisdump might show something, or you may need to break out a hex editor and check each 32 bytes for anything odd.
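Before reaching for a hex editor, a quick scan can point at which records to look at. This is a rough sketch that only knows the 32-byte record size from the post above; it does not understand the actual record layout, and the function name is hypothetical:

```python
RECORD_SIZE = 32  # bytes per record, as stated in the post above


def find_zero_records(path, record_size=RECORD_SIZE):
    """Return the indexes of records that consist entirely of zero bytes."""
    zero = bytes(record_size)  # a block of record_size zero bytes
    hits = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(record_size)
            if len(chunk) < record_size:
                break  # end of file, or a trailing partial record
            if chunk == zero:
                hits.append(index)
            index += 1
    return hits
```

Multiply a returned index by 32 to get the byte offset to jump to in the hex editor. Treat hits as candidates to inspect, not as proof of corruption.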
For b. and c., support can also take a look, although they will probably need DB logs of the record being written too, which may be asking a lot depending on how far back the problem happened.
@AdamWoodland your comment about DB logs made me go ahead and give it a quick look for when they got the alarm. It was about 12 hours ago, so I checked the Primary's DB log. The DB log only had the same description as the alarm in it. The good news is that just one file is corrupt, and that file is on the one running standby.
I've never seen this to be due to an actual NTFS disk failure.
The issues that I've seen cause this have been:
ChkDsk said everything was fine and the file size was the same as on the running Primary, so it might have already resynced it.
I still deleted it though and restarted the Standby so I know it got replaced syncing from the Primary again.
They are running Sophos so that is likely the cause of the problem. First time in like 7 years to have a corrupt historic file alarm though.
Thanks for the tips, guys. It was likely Sophos interfering, but I did not see any settings that would let me tell Sophos not to scan the history files. I let the server admin know, since I cannot play in his software anyway. I don't think he will change anything unless it happens again, since it has not happened in the past 7 or so years.
I feel that anti-virus software is largely just a random number generator to determine which files it's going to mess with.
I consistently have issues with non-Microsoft anti-virus and ClearSCADA/GeoSCADA.
If it's not actual historic file corruption warnings (which are typically not at all corrupt files, just files locked for a longer time than ClearSCADA likes), then it's horrible performance because of scanning of ViewX cache files, or log files..
On different software I've even had anti-virus cause crashes of PLC programming software. That's one of those annoying 'tech support' calls.. where they say "Please sir to be turning off your anti-virus" and me saying "Fine... but I really don't think it's the antivirus, it's clearly a bug in your software"... and of course for the crashes to immediately stop when I disabled the anti-virus.
I still say it was a bug in their software... but also.. bloody anti-virus.
The best way I've found to say with 99% confidence that "it is anti-virus", without running active things like procmon, is to look in the config or data folder. If anti-virus is interfering with file writes and file locks then you'll see temp files persist there.
When the database saves, it first writes to the temp files (the slow bit of the process), then deletes the old file version and renames the temp file. This causes problems with AV when exclusions aren't set, and given how many times this happens each minute, the AV usually catches something eventually. Normally you may see a temp file, but it goes within a couple of seconds, depending on how Windows Explorer is updating the file list.
The other 1% is when the database crashes mid database save, but then you'll see lots of files with the same timestamp rather than random files with random timestamps.
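The folder check described above can be roughed out in a script. This is only a sketch: the `*.tmp` pattern and the 60-second threshold are assumptions (the posts don't say what the temp files are called), so adjust both to what you actually see in the config or data folder.

```python
import stat
import time
from collections import Counter
from pathlib import Path


def lingering_temp_files(folder, pattern="*.tmp", min_age_s=60, now=None):
    """Return (path, mtime) pairs for temp files older than min_age_s seconds.

    pattern is a guess at the temp file naming, not the product's real one.
    """
    now = time.time() if now is None else now
    found = []
    for p in Path(folder).rglob(pattern):
        try:
            st = p.stat()
        except OSError:
            continue  # file vanished between listing and stat (a normal rename)
        if stat.S_ISREG(st.st_mode) and now - st.st_mtime >= min_age_s:
            found.append((p, st.st_mtime))
    return sorted(found, key=lambda item: item[1])  # oldest first


def largest_timestamp_cluster(found):
    """Return how many files share the most common modification second."""
    counts = Counter(int(mtime) for _, mtime in found)
    return max(counts.values(), default=0)
```

A handful of old temp files with scattered timestamps fits the anti-virus pattern; many files sharing one timestamp looks more like the crash-during-save case.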
Modern AV solutions like Cylance and Crowdstrike bring a whole new way of working, and the nuances of those on the database are yet to be seen.
First time I've heard of Cylance and Crowdstrike. I'm not sure the IT world needs to have another software set with non-deterministic behaviour 😉 We already have Windows...
I can see it being a good replacement for 'the dog ate my homework'.. 'the anti-virus AI must have gotten confused and thought my thesis was a virus... crazy AI you know...'
In theory it should be better. You can still set exceptions but if an application has written similar data to the same folder 5 billion times it should have hopefully learnt pretty quickly that is 'normal' behaviour and let it continue.
That's the problem I see though... the AI learning part if not controlled.
For example alarms... for 10 years there might be a value of '0'... and then one day the alarm occurs, and it's a '1'.
Is the 'smart' AI going to consider that abnormal and prevent the writing?
It should be behaviour based, so in theory it should be irrelevant what data you write; it is the why rather than the what. Not saying there might not be a problem, as it is still relatively new technology, but as I understand it that scenario shouldn't be one of the problems.
COM in scripting (i.e. Scripting.FileSystemObject and WScript.Shell) and changes to the DLLs used and referenced due to upgrades are areas with a higher likelihood of something being triggered, as that would be a deviation from what could be normal/learnt behaviour.
Cylance, for example, is usually set to block PowerShell scripts, so anything like SYSTEM() calls might also be affected.