[rescue] Looking for I/O performance metrics

Patrick Giagnocavo rescue at sunhelp.org
Fri Nov 23 12:45:19 CST 2001


Chris Byrne wrote:
> 
> Patrick,
> 
> Everything you said makes perfect sense, I just wish it were true ;-)
> 
> Here's the situation
> 
> The problem is occurring under light load or, more precisely, essentially
> no load.
> It's only occurring on the Sun systems attached to the SAN, not the Windows
> systems.
> They are using Veritas to manage UFS filesystems on virtual mount points
> created out of a single large LUN being presented to them from the SAN.
> 
> Yes I know that these are screwy
> 
> I've already recommended that they restructure their filesystems and put
> their data on Oracle raw but I need to give them hard numbers on the current
> upgefucked setup.
> 
> The system is basically set up as if it was on a JBOD array and they were
> trying to reduce spindle contention, but with a large scale storage array
> the array itself handles the contention and resource management issues so
> anything you do on the filesystem side will just make things worse.
> 
> Chris Byrne

My first instinct is to run away screaming, but this is not an option
for you :-)

My first response would be to run a background task that creates light
to medium load and see if that fixes the problem, even if it is just a
cron job that runs every minute and writes files in /tmp.  Or just a
shell script that stats some of the files in each directory or
filesystem you care about.
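
A minimal sketch of that idea, in Python purely for illustration (a
shell script would do just as well).  The function name tickle, the
.keepalive marker file and the five-entry limit are all made up for
this example.  Two assumptions worth stating: on a default Solaris
install /tmp is tmpfs, so the marker file is written into each watched
directory rather than /tmp, and a stat() can be answered from cache,
so the write is followed by fsync() to push something out to the
device.

```python
#!/usr/bin/env python3
"""Keepalive sketch: generate a little I/O on each watched filesystem."""
import os
import sys
import time


def tickle(directory, max_entries=5):
    """stat() up to max_entries entries in directory, then write and
    fsync a small marker file there. Returns elapsed wall-clock seconds."""
    start = time.time()
    for name in sorted(os.listdir(directory))[:max_entries]:
        try:
            os.stat(os.path.join(directory, name))
        except OSError:
            pass  # entry vanished or is unreadable; not fatal here
    marker = os.path.join(directory, ".keepalive")  # hypothetical name
    with open(marker, "w") as fh:
        fh.write("%f\n" % start)
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to push the write to the device
    return time.time() - start


def main(directories):
    # One line per directory, so slow or failed runs show up in the cron mail.
    for d in directories:
        try:
            print("%s %.3fs" % (d, tickle(d)))
        except OSError as exc:
            print("%s FAILED: %s" % (d, exc))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it from cron once a minute with the mount points as arguments.
The elapsed times it prints are rough, but logging them over a day
would show whether the first access after an idle period is the slow
one.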

My guess is that somewhere in the maze of connections something is
"timing out", though that might not be exactly the phrase, or an exact
description of what is happening.  Creating more load will tickle the
connection and keep it alive.

You don't say what card you are using to talk to the SAN, but you might
want to look over any buffer size settings in the device driver.  If
data gets stuck in a buffer and doesn't get transmitted, bad things
will happen; or maybe the driver is waiting for the buffer to fill up
and, with no load, that takes a long time, longer than <some other
device>'s wait timeout.

I know, there are a lot of ifs and whatnot; hope this helps.

./patrick


