There's a mix of stuff here: DB, OS, storage, client applications. I'll make an attempt at organizing it.
OS:
More errors in OS system logs than I can shake a stick at
Interesting, but we can't help without details. Unless you're just venting. Which is fine by me; I do it too.
DB:
2.3 million latch timeouts in a 10-day period
What to do about this depends on which latch or latches show contention (a way to check is sketched below).
Perhaps not ideal. But by itself I don't think this is causing severe problems.
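The oft-cited way to look, hedged because the exact menu numbers can vary by version (and "mydb" is a placeholder), is promon's hidden debghb menus:

    promon mydb
      R&D
      debghb       (unlocks the extended menu options)
      6
      11           Activity: Latch Counts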
750m buffer cache whose blocks never reach the -spinskip value, at a time when -spinlocktimeout was set to 800,000. Correct me if I'm wrong, but that would indicate that the buffer is flushing faster than blocks can be moved to the MRU with the given spinskip. OR the code is so bad that we are just churning our buffer cache.
I think you're confusing parameter names and meanings. We have mentioned -spin (you had 800,000, then adjusted to 100,000). That is the number of times a process will "spin on" (i.e. repeatedly attempt to acquire) a latch before giving up and "napping" for a set amount of time, and then repeating the cycle. It is not a timer. We have also mentioned -lruskips, which you are not currently using. There are no -spinskip or -spinlocktimeout parameters. The -spin parameter is not directly related to buffer pool LRU chain maintenance.
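For reference, both are broker startup parameters, e.g. in a parameter (.pf) file. The values below are placeholders to show the syntax, not recommendations:

    # mydb.pf -- hypothetical startup parameters
    -db mydb
    -spin 50000       # latch acquisition retries before a process naps
    -lruskips 100     # buffer references to skip between moves to the MRU end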
At times, checkpoints at a rate of 6-8 during a 30-second period.
During these periods, what is the value of "sync time" (promon R&D | 3 | 4)? This is the time in seconds when your forward processing is frozen (no transaction activity). This could be related to users' reports of application freezes.
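For reference, the path to that screen ("mydb" is a placeholder, and menu labels vary a bit by OpenEdge version):

    promon mydb
      R&D          (type R&D at the main menu)
      3            Other Displays
      4            Checkpoints   <- check the Sync Time column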
Occasional buffer flushes in BI during checkpoints
Those would be buffer pool buffers being flushed at checkpoint, because the blocks are "dirty". Ideally you want zero buffers flushed at checkpoint. Increasing BI cluster size may help, provided your disks aren't saturated with I/O load.
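A hypothetical sketch of raising the cluster size, with the database shut down (the size is in KB; if after-imaging is enabled there are extra steps, so check the docs for your release):

    proutil mydb -C truncate bi -bi 16384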
All database files on same disk.
Not according to your structure file.
An 8k BI blocksize on a disk that is allocated at 2k.
I don't understand this. Do you mean the OS file system allocation unit is 2 KB instead of the NTFS default of 4 KB?
Having a file system block size (aka allocation unit) that is larger than the DB's block size carries a big performance penalty. But you have the opposite situation: the file system unit (2 KB) is smaller than the BI block (8 KB), so each database disk I/O becomes multiple file system block I/Os. That means a theoretical possibility of "torn pages" (physically completing only part of an atomic logical I/O to disk), but not a severe performance penalty as far as I know. But of course I could be mistaken.
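You can confirm the NTFS allocation unit size from a command prompt; look for "Bytes Per Cluster" in the output (the drive letter is a placeholder):

    fsutil fsinfo ntfsinfo D: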
A BI file that is on a Raid5 system drive with very high split I/O, AI on same drive
Not according to your structure file.
More errors in DB logs than I can shake a stick at
Perhaps relevant, but we can't help you without knowing what the errors are. You should know also that some errors in DB logs are actually written by or on behalf of clients, and they actually reflect client-side problems, not DB problems. But you need the error numbers (and in some cases, the error message text) to make that determination.
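If you want a quick inventory, Progress log messages end with a message number in parentheses, so you can tally them. A PowerShell sketch, assuming a log file named mydb.lg:

    Select-String -Path .\mydb.lg -Pattern '\((\d+)\)\s*$' |
        ForEach-Object { $_.Matches[0].Groups[1].Value } |
        Group-Object -NoElement | Sort-Object Count -Descending |
        Select-Object -First 20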
Client application(s):
More end-user complaints than I have ever seen with any database of any size
Hopefully if the other items in this list (and others in this thread) are dealt with, this one will take care of itself.
Shared locks that are held long after a transaction is complete
That's a function of record and transaction scope in the application code. Without at least one of (a) a knowledgeable developer with tools and source, or (b) a willing and capable vendor, this one isn't going away any time soon.
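For illustration only (hypothetical table and field names), the classic ABL pattern behind this is a record buffer whose scope outlives the transaction, so at transaction end the exclusive lock downgrades to a share lock that lingers until the buffer goes out of scope:

    /* Sketch 1: weak scope. The customer buffer is scoped to the whole
       procedure, so when the transaction ends the EXCLUSIVE-LOCK
       downgrades to a SHARE-LOCK held until the procedure ends. */
    DO TRANSACTION:
        FIND FIRST customer EXCLUSIVE-LOCK.
        ASSIGN customer.comments = "updated".
    END.
    /* ... share lock still held down here ... */

    /* Sketch 2 (a separate .p): strong-scoping the buffer to the
       transaction block releases the lock at END. */
    DO FOR customer TRANSACTION:
        FIND FIRST customer EXCLUSIVE-LOCK.
        ASSIGN customer.comments = "updated".
    END.

Finding and fixing those patterns still requires the source, which is the point above.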
500 application crash logs in a 15-day period
What to do about this depends on what the errors are.
Storage:
Misaligned disks across the board
I'm not quite sure what you mean by this.
A pagefile on same disks as database files
Yeah, not ideal. While you're at it, check if you have anti-virus or other anti-malware software scanning the DB directories. If so, add exceptions as needed.
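For what it's worth, you can confirm where the pagefile actually lives with PowerShell:

    Get-CimInstance Win32_PageFileUsage | Select-Object Name, CurrentUsage, PeakUsage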
I could go on, but I don't want to bore you. When I say "our database is choking," please don't misconstrue my statement as a slight against Progress. This system is exhibiting issues from the client keyboard all the way to the back end.
Not bored. But the devil is in the details. At this rate, despite our best efforts and yours, it will take a while to make meaningful headway.
Time is money, and you'll burn through lots of both trying to figure this out (mostly) on your own. And there is only so much that can be done with remote hands-off help. You may want to consider acquiring the services of a Progress consultant to come on site and help you. It would likely turn around your situation faster and better than you could do on your own, and it would be a fantastic lesson for you to boot.