[geeks] A U30 puzzle

Phil Stracchino alaric at metrocast.net
Fri Jul 3 12:26:31 CDT 2009


Folks,
As previously discussed, I transferred my NAS to a new dual-Xeon box.  I
just reinstalled the U30 that was previously doing the job with Solaris
10u7 SPARC (200905).  One thing I discovered in the process was that
during the month or so it's been shut down, one of the SCSI controllers
(a SunSwift PCI) had gone south.  So I took that board out and replaced
it with a Symbios SYM22801 dual SCSI card (dual Ultra Wide, going by
the 53C875 chips in the prtdiag output below), which
fixed that issue and spread the twelve array disks across two SCSI
controllers they now have all to themselves, entirely separate from the
internal disks.  They're configured as raidz1 with a hot spare slice on
the second internal disk.
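
For the record, the layout amounts to something like the sketch below;
the pool name is inferred from the /spool mount point, and the device
names are placeholders rather than the actual targets:

  # pool name and device names are illustrative, not the real ones
  zpool create spool raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
      spare c0t1d0s7
  zpool status spool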

minbar:root:~:44 # prtdiag
System Configuration:  Sun Microsystems  sun4u Sun Ultra 30 UPA/PCI (UltraSPARC-II 248MHz)
System clock frequency: 83 MHz
Memory size: 512 Megabytes

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      248     1.0   US-II    1.1


========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 0   PCI    33     On-Board  network-SUNW,hme
 0   PCI    33     On-Board  scsi-glm/disk (block)         Symbios,53C875
 0   PCI    33   pcib slot 2  scsi-glm/disk (block)         Symbios,53C875
 0   PCI    33   pcib slot 2  scsi-glm/disk (block)         Symbios,53C875
 0   PCI    66   pcia slot 1  ethernet-pci8086,1001
 0   UPA    83           29  FFB, Double Buffered          SUNW,501-4788
 0   UPA    83           30  AFB, Double Buffered

No failures found in System
===========================

Of course, the machine could stand to have more RAM, but I only have
32MB DIMMs in it and don't have anything bigger.


Time to create a 1GB file on the array is actually slightly faster than
on the single boot disk, which rather argues against ZFS bringing the
machine to its knees:

minbar:root:~:42 # time dd if=/dev/zero bs=1M count=1000 of=/spool/export/bigfile
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 43.014 s, 24.4 MB/s

real    0m43.088s
user    0m0.037s
sys     0m11.405s
minbar:root:~:43 # time dd if=/dev/zero bs=1M count=1000 of=/bigfile
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 45.1339 s, 23.2 MB/s

real    0m45.197s
user    0m0.031s
sys     0m10.449s
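
While a dd like that runs, zpool iostat gives a per-vdev view of where
the writes are landing (pool name assumed from the mount point, as
above):

  zpool iostat -v spool 5    # per-vdev throughput, 5-second samples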


Now, here's the puzzling part.  When I was using this machine as my
primary NAS box, running Solaris 9 on it, I could get an honest 98Mb/s
transfer rate off it across the network over its internal hme.  But
after being shut down for a month or so then brought back up as a backup
cache of the data on babylon4's array, it's slower than a sick dog.  One
reason I reinstalled it with Solaris 10 was to see whether something had
gotten badly corrupted on the OS somehow.  Even just a copy across the
network from /dev/zero dumped to /dev/null is slow.  Here's 100MB from
babylon4 to /dev/null on minbar:

babylon4:root:~:52 # dd if=/dev/zero bs=1M count=100 | ssh minbar dd of=/dev/null
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 64.7485 s, 1.6 MB/s
204800+0 records in
204800+0 records out
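
The classic cause of this symptom is a speed/duplex mismatch, so the
hme negotiation is worth reading back with ndd before blaming anything
else; per the hme driver parameters, a 1 should mean link up, 100 Mbit,
and full duplex respectively:

  ndd -get /dev/hme link_status   # 1 = link up
  ndd -get /dev/hme link_speed    # 1 = 100 Mbit
  ndd -get /dev/hme link_mode     # 1 = full duplex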

Wondering if the onboard hme had developed a fault similar to the
SunSwift's, I popped a spare pci64 GigE interface into minbar's 64-bit
PCI slot.
This is the same copy, over a direct point-to-point connection from
babylon4's bge1 to minbar's e1000g0:

babylon4:root:~:53 # dd if=/dev/zero bs=1M count=100 | ssh minbar-sync dd of=/dev/null
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 50.4264 s, 2.1 MB/s
204800+0 records in
204800+0 records out

2.1 megabytes/s from /dev/zero to /dev/null over a point-to-point
gigabit connection is absurd.  rsync, zfs send, cp -av over an NFS
mount: everything is slow.
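
One more experiment on the list is taking ssh's cipher overhead out of
the path entirely with a raw TCP stream; this assumes a netcat build is
installed on both ends (listen/connect flags vary between netcat
variants):

  minbar# nc -l -p 5001 > /dev/null                    # port is arbitrary
  babylon4# dd if=/dev/zero bs=1M count=100 | nc minbar-sync 5001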



Anyone have any thoughts on the matter?  I've looked everywhere I can
think of, and I can't find anything obviously *wrong* (aside from the
failing SCSI controller that I already replaced) ... it just seems to be
running at a fraction of the throughput it ought to be, for no reason I
can figure out.  vmstat looks reasonable; the machine's
not swapping hard.  iostat -Cx looks sane, prstat shows the CPU sitting
at a couple of percent utilization.


minbar:root:~:47 # vmstat -p
     memory           page          executable      anonymous     filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 2093840 155632  6  20   0   0   2    0    0    0    0    0    0   18    0    0
minbar:root:~:48 # vmstat
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 rm s0 s3   in   sy   cs us sy id
 0 0 0 2093848 155600  6  20 18  0  0  0  2 -0 -0  0  3  469 1180  160  4  6 90
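
About the only thing I haven't watched yet is interrupt load; intrstat
and mpstat (both stock in Solaris 10) would be the next stop:

  intrstat 5    # interrupt time per device, 5-second samples
  mpstat 5      # per-CPU breakdown including interrupt and system time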


Typical iostat frame during a cp -av from an NFS mount of babylon4's
array to minbar's array:


                  extended device statistics
device      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
c0          0.2    0.0   17.0    0.0  0.0  0.0    8.8   0   0
sd0         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd3         0.2    0.0   17.0    0.0  0.0  0.0    8.8   0   0
sd6         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
c1          0.0    6.7    0.0   21.1  0.0  0.1   21.8   0   5
sd23        0.0    1.1    0.0    3.7  0.0  0.0   23.9   0   1
sd24        0.0    1.1    0.0    3.7  0.0  0.0   21.5   0   1
sd25        0.0    1.0    0.0    3.6  0.0  0.0   16.8   0   1
sd26        0.0    1.3    0.0    2.7  0.0  0.0   24.4   0   1
sd27        0.0    1.1    0.0    3.8  0.0  0.0   22.7   0   1
sd28        0.0    1.1    0.0    3.7  0.0  0.0   20.3   0   1
c2          0.0    8.6    0.0   19.4  0.0  0.3   31.1   0   7
sd38        0.0    1.0    0.0    3.7  0.0  0.0   28.2   0   1
sd39        0.0    1.2    0.0    1.9  0.0  0.0   19.3   0   1
sd40        0.0    1.5    0.0    3.6  0.0  0.1   33.4   0   1
sd41        0.0    1.5    0.0    3.6  0.0  0.0   32.4   0   1
sd42        0.0    1.5    0.0    3.6  0.0  0.0   29.6   0   1
sd43        0.0    1.8    0.0    2.9  0.0  0.1   39.0   0   1
fd0         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
ramdisk1    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
nfs1        0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
nfs2        6.8    0.0   29.4    0.0  0.0  0.4   63.4   0  43




Nothing here looks like a machine that's struggling to keep its head
above water or sitting hard up against a bottleneck.  I'm baffled.  The
closest I can get to the kind of throughput I *should* be seeing is a
direct dd to or from an NFS mount over the point-to-point gigE connection:

babylon4:root:~:67 # time dd if=/dev/zero bs=1M count=1000 of=/netstore/bigfile
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 16.6926 s, 62.8 MB/s

real    0m16.709s
user    0m0.015s
sys     0m3.395s

minbar:root:~:61 # time dd if=/netstore-sync/bigfile bs=1M of=/export/bigfile
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 113.532 s, 9.2 MB/s

real    1m53.764s
user    0m0.048s
sys     0m24.881s


This is clearly still nowhere near what it ought to be, though.
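
In case the NFS mount negotiated poor transfer parameters, nfsstat -m
on minbar should show what the /netstore-sync mount actually agreed to:

  nfsstat -m /netstore-sync    # shows proto, rsize/wsize, retrans settings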

At this point, I'm out of ideas.


-- 
  Phil Stracchino, CDK#2     DoD#299792458     ICBM: 43.5607, -71.355
  alaric at caerllewys.net   alaric at metrocast.net   phil at co.ordinate.org
         Renaissance Man, Unix ronin, Perl hacker, Free Stater
                 It's not the years, it's the mileage.


