RSM2000 Troubleshooting Guide




-------------------------------------------------------------------------------


Scenario One: RSM2000 patches fail to install

Background: System running 2.5.1, generic patch cluster installed.

The patches that failed to install:

> 104563-02  Sun RSM Array 2000 1.0: ICON Chip Failure Messages
> 104564-01  Sun RSM Array 2000 1.0: Rdriver deadlock detection code

Both patches failed with:

# ./installpatch ../104563-02

Checking installed packages and patches...
None of the packages included in patch 104563-02
are installed on this system.

Installpatch is terminating.

# ./installpatch ../104564-01

Checking installed packages and patches...
None of the packages included in patch 104564-01
are installed on this system.

Installpatch is terminating

Resolution:

The problem turned out to be the SUNWosar package at VERSION=1.0,REV=06.02,
which came from the early-access CD. We pulled the FCS version off of
cornmeal.eng; its SUNWosar package shows VERSION=1.0,REV=06.00.
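
To verify which revision you have, the standard Solaris packaging tools will
show it (a minimal sketch; the grep is just a convenience):

	# pkginfo -l SUNWosar | grep VERSION
	   VERSION:  1.0,REV=06.02

If it reports REV=06.02 (the early-access revision), replace the package with
the FCS REV=06.00 version before applying the patches.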

Now when you install the RSM2000 patches, they all install successfully!

-------------------------------------------------------------------------------

Scenario Two: Lose a controller (maybe)

Background: In the console window, you notice a sudden rash of "disk
not responding to selection" messages scrolling by.

Go to the RM6 window and click on the "status" icon. Do a health check.
It shows that you have a "Data Path Failure" on your RSM2000 unit....

Go to the RM6 window and click on the "recovery" icon. Inside the RM6 window,
you will see a one-liner at the bottom of the window telling you to start a
parity check/repair or perform a manual recovery.

Click on "start parity/check repair"

You will notice a horizontal bar that slowly turns blue as it checks the LUNs
on that RAID Module.

Note: The two RSM2000 controllers will still appear to be operational
(power light on and second green LED in heartbeat mode).

During the "parity/check process" you will see some messages in the console
conerning disk "offline".

Once you start the parity check/repair process, you can NOT start anything
else from that window until you hit the "cancel" button or the process has
completed.

After about thirty minutes of the parity check/repair process running, I got
very impatient as a user and hit the "cancel" button. This process is very
SLOW.

A popup window will ask you to confirm cancelling the process. Click the "ok"
button.

Click on the "Recovery Guru". It will show you the status of the RAID module.
At the bottom, it will tell you to obtain step-by-step instruction to 
recover from a "data path failure".

Click on the "fix" button.

A window will pop up and give you a summary of what could have caused the
"data path failure"...

Note: Make sure YOU read EVERYTHING in this window before going on!!!!

Click "ok"

Another window pops up. A caution message comes first (READ IT), then it
tells you to perform a series of steps:

Step 1: check cables and terminators
Step 2: Click "ok" after performing Step 1

I had checked the terminators and they appeared to be ok. I replaced both of
the cables and clicked "ok". Another window popped up telling me that I had
recovered from the "data path failure".

It then gives you a notice that file systems or logical units may not be
accessible and that you might have to reboot...


Click "ok"

The Recovery "module Information" window shows that the RAID module has been
fixed.

Click on "Status" from the main RM6 window, click on "health check", now
it shows the RAID module as being in optimal state now.

-------------------------------------------------------------------------------

Scenario Three: Power interruption to the RSM2000 (one of the sequencers dies)

Background: Alarms on all RSM trays go off, and the second light on the
controller box comes on (amber).

Click on "status", do a "health check", it shows that there has been a 
drive tray failure and module component failure.

Note: The RSM2000 controller will soon start blasting messages to the console.

Click on the "show details", tells user to go thru "Recovery Guru".....

Click on "Recovery Guru", gives procedure to follow along with a caution
about do not operate drive trays with a fan module failure for more then 
5 minutes.

Resolution:              

The problem was due to one of the sequencers being tripped off. Flipped it
back on, and the summary information window shows the RAID module as being
fixed.

The alarms on the RSM trays shut off, as does the amber light on the RSM2000
controller box...

Goto the "status", do a "health check", shows that RAID module is optimal now.

-------------------------------------------------------------------------------

Scenario Four: Lose the host that the RSM2000 is attached to....

Background: System crashes....

The RSM2000 will keep on trucking: no fault LEDs or anything. If you looked
at the unit, you would think that life is good....


I would strongly suggest using something like SyMON to monitor the host that
the RSM2000 is attached to, so you can keep an eye on it.

Bring the system back up. Right before the login prompt appears, the array
monitor daemon gets loaded and all the LEDs on the drives in the RSM trays
will flicker for a brief moment.
(Of course, this assumes that you did not have to reinstall any software.)

Fire up RM6

Click "status" icon

click on "health check"

Shows that RAID module is in optimal state!!!!!!

-------------------------------------------------------------------------------

Scenario Five: Lose complete power to the RSM2000 unit

Background: Lose access to the RSM2000


After a few moments, you will see "disk not responding" messages in the
console, as well as "offline" messages.....

click on "status" icon (will be slow in responding)

Shows "data path failure" on RAID Module

Click on "show summary"

It tells you to go thru the "Recovery Guru"

click on "Recovery Guru" (slow in responding)

click "fix"

Note: READ EVERYTHING about "data path failure"

click "ok"

Check connections as suggested....

Click "ok"...

A window will pop up telling you that it is checking the data paths (this
will take about five minutes)

Note: You will still be getting "disk not responding" warnings in the
console...

A window will then pop up concerning "Data Path Recovery", basically asking
if you want to remove the RAID Module from the system. If you click "yes", it
will do so..... If you click "no", it will give you the steps to replace the
controllers....

Click "no"

A "Controller Failure Replacement" window will then pop up, telling you to
unmount all file systems and stop all I/O to the RSM2000.

click "ok"

Another "Controller Failure Replacement" window will popup. READ THIS
CAREFULLY, gives you a caution, then steps to replace the controllers. There
is a notice at the bottom of the window telling you about the new serial 
numbers on the new controller boards, and that the RAID Module numbers may
change....

Well, we opened the cabinet of the RSM2000 and discovered that there was NO
power to the unit... we ended up replacing the sequencers....

Power up the RSM2000

Soon afterwards, "disk okay" messages will appear in the console window.
Since we did NOT have to replace the controllers, we clicked "cancel"....

A window pops up telling you that you have exited the Recovery Guru and that
your controllers are still marked failed....

clikc "ok"

click "status"

click "health check"

summary information shows that life is good on the RSM2000
-------------------------------------------------------------------------------

Scenario Six: Writing data to the RSM2000, lose the host!

Background: Doing a remote copy (rdist) to the RSM2000 system...

When you are writing data to the RSM2000, the cache light will go on and the
lights on the RSM drives will begin to flicker on and off.....

The system crashes to the "ok" prompt; the lights on the drives will continue
to be lit for a few more moments. Then those lights will stop flickering and
the cache light will go off.

The cache light will stay on until all pending writes to the RSM drives in
the RSM2000 have completed.

Bring the host back up, go to "status", and do a "health check"; it shows
that the RAID Module is in an optimal state.

-------------------------------------------------------------------------------

Scenario Seven: Write data to RSM2000 and lose the RSM2000 due to complete
power failure

Background: Again doing a remote copy of data to the RSM2000

Lose power to the RSM2000 and the remote copy hangs on that host. You'll soon
see a bunch of "rdaemon" messages blast across the console of the RSM2000
host.

Stop the remote copy.

Got power restored to the RSM2000.

Powered the RSM2000 back up. After the drives spun up and came online, the
cache light lit up and the controller completed all the writes that had been
stored in the RSM2000's cache before it lost power.

Note: You will also see some "disk okay" messages come across the console of
the RSM2000 host as it comes online.

-------------------------------------------------------------------------------

Scenario Eight: Writing data to the RSM2000, lose a drive in that LUN, and
during the failover to the hot spare, you lose a second drive in that same
LUN.....

You have completely lost that LUN! In addition, you will see LOTS of
messages blow across the console and screen of your RSM2000 host.

You'll see messages like "out of inodes" and "sense key" errors...

Replace the two failed drives from that LUN and click "refresh" from the
OpenWindows menu. A few moments after this was done, LOTS of "sense key"
error messages blew across the screen for about 10 seconds and then stopped.

Did "health check", shows life is good.

Cache light is still on at this point.

Went to "configuration" and looked at "module Information" and still shows
that the failed LUN is dead.

Power-cycled the RSM2000; it came back up with the cache light still on, and
every few moments there is activity on the LUN that did not fail.

Click "status" and in the console window, got error messages from rm6stat
saying "SysDevOpen Failed (I/O error)"

At this point, the RSM2000 host was halted and rebooted....


Fired up RM6 and went into "configuration"; it still shows that LUN as being
dead. Deleted the failed LUN (the drives go to the unassigned state), then
created the LUN again with the drives I had just freed. The cache light went
OFF at this point.

Note: Before the failed LUN was deleted, we brought up format and found that
the label associated with that LUN had been blown away... that would explain
why we got those "SysDevOpen" error messages earlier.....

What does all this mean?

If you lose more than one drive in a LUN, then that LUN is DEAD....

To bring that LUN back, you need to do the following....

1. Replace the failed drives.
2. Delete the failed LUN from the "configuration" window.
3. Re-add that LUN.
4. After those drives are formatted again, restore any data.

Note: The format takes a while, so if you have a large set of drives in a
LUN, you'll be waiting for a while...

After the format is complete, the configuration window shows that the LUN is
in an optimal state.

-------------------------------------------------------------------------------

Scenario Nine: After setting up swap on the RSM2000 and rebooting the system,
during the reboot you get a message stating that "overlapping of swap is not
allowed"...


The problem is that the rdac driver needs to be loaded before the system adds
the swap that lives on the RSM2000. In the /etc/rcS.d directory, you need to
edit S40standardmounts.sh and comment out the line /sbin/swapadd -l. Then go
into the S46rdac file and put that line at the VERY END of the file. This
causes the rdac drivers to get loaded first, and then the swap that is on the
RSM2000 gets added. The two edits are sketched below.
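
A minimal sketch of the two edits (the swapadd line is shown as quoted above;
match whatever is actually in your S40standardmounts.sh):

	In /etc/rcS.d/S40standardmounts.sh, comment out the swapadd line:

		#/sbin/swapadd -l

	At the very end of /etc/rcS.d/S46rdac, add that same line:

		/sbin/swapadd -l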

In the /etc/vfstab file, you need to use the /dev/dsk/cXtXdXsX naming
convention. I have filed a Sev 1 bug (BugID 4039017) against using
/dev/rRAID_Module/0s0 in /etc/vfstab: if you use that naming convention, you
will get the above error message on reboot, even though the system shows that
it is swapping on the RSM2000.
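
For example, a swap entry in /etc/vfstab would look something like this (the
device name is only an illustration):

	/dev/dsk/c2t5d0s1	-	-	swap	-	no	-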

Symbios is currently working on the problem.

-------------------------------------------------------------------------------

Scenario Ten: I cannot newfs anything >2gb on s2 of an RSM2000....

There have been reports of folks not being able to newfs anything greater
than 2gb on s2 of an RSM2000. When folks have tried this, they have seen
this.....

seek error : <a number>
wtfs invalid argument.


There are a couple of things that could be going on: a possible bug, a disk
label that went out to lunch, or something reconfigured on the RSM2000.

Bring up format; can you see the drive? Also, did you just create a new RAID
group/LUN? If you did, you need to do a boot -r. During the reboot, you will
see a couple of messages concerning the "reconfiguration" of the rdnexus.conf
file and a second .conf file. These are configuration files for the RSM2000.

After the system comes up, bring up format; you will be able to see the
drive. Format the drive with everything on s2, then do a newfs to slice 2,
and you're all set and ready to go (see the sketch below).
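
A minimal sketch (the device name is an example; substitute your LUN's
cXtXdX):

	# format                      (select the LUN; put the whole disk on s2)
	# newfs /dev/rdsk/c2t5d0s2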

It is also possible that your label went out to "lunch"; doing a boot -r has
cleaned this up. I have asked the engineer in the field if he had seen the
"reconfiguration" messages. He cannot remember; if there were
"reconfiguration" messages, then there were some mods done to the RSM2000
that he was not aware of.

Another workaround is to move the entire slice to something other than slice
2 on the RSM2000 disk and then do your newfs.

I will file a bug so we can get this looked at, just to be on the safe side
(BugID 4053588).

-------------------------------------------------------------------------------

Scenario 11

The user has set up a RAID 1+0 LUN with 6 drives and a second 1+0 with 20
drives, and also has two hot spares....

Format has been running on the LUNs for the last two days, and it appears
that the RSM2000 and/or RM6 is hung out to dry....


Resolution:


You can use ps to find the process, which must be hung as it should not take
that long, then use the kill command to terminate it (see the sketch below).
I've done this without any problems when I just wanted to stop the config
and redo it.
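
A minimal sketch (the name to grep for is an assumption; match whichever
RM6/format process is actually hung on your system):

	# ps -ef | grep rm6
	# kill <PID>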

Then check for hardware problems and make sure you have all the current
patches. The current patch list is at:
http://storageweb.eng/techmark_site/arrays/tmkt_rsmpatch


-------------------------------------------------------------------------------

Scenario 12 


Similar to Scenario 11: if the RAID Manager comes to an abnormal termination
(i.e., a crash), the LUN can remain hidden from the other RAID Manager
utilities. If the user notices a hung or missing LUN that he or she knows is
configured, the user should remove the lunlocks file
(rm /etc/raid/locks/lunlocks), then exit and re-enter the RAID Manager
application that could not see the LUN.


-------------------------------------------------------------------------------

Scenario 13

Deleting LUNs on an RSM2000 using RM6 and getting the following messages....


 svr8# WARNING: /sbus@1f,0/QLGC,isp@1,10000/sd@4,1 (sd130):
         offline
 
 WARNING: /sbus@1f,0/QLGC,isp@3,10000/sd@5,1 (sd187):
         offline
 
 WARNING: /sbus@1f,0/QLGC,isp@3,10000/sd@5,1 (sd187):
         offline
 
 WARNING: /sbus@1f,0/QLGC,isp@1,10000/sd@4,1 (sd130):
         offline
 
 WARNING: /sbus@1f,0/QLGC,isp@3,10000/sd@5,1 (sd187):
         offline
 
 WARNING: /sbus@1f,0/QLGC,isp@1,10000/sd@4,1 (sd130):
         offline
 
 WARNING: /sbus@1f,0/QLGC,isp@3,10000/sd@5,1 (sd187):
         offline

Solution:

	These are NORMAL messages...

This is due to the fact that Sonoma LUNs look like real disks to the system,
so deleting them is like unplugging (offlining) an actual disk.

--------------------------------------------------------------------------------------

Scenario 14

Every time I do a boot -r, my RSM2000 device paths get renamed. Is there a
way to stop this.....


Solution:
This is a known issue with RM6 6.0 and has been fixed in 6.1.

You might have to do an rm of the /dev/dsk and /dev/rdsk entries for those
Sonoma controller numbers. Then you would need to do an rm -r /dev/osa and a
boot -r to get things straightened out (see the sketch below).
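
A minimal sketch (the controller number c5 is an example; use the numbers of
your Sonoma paths):

	# rm /dev/dsk/c5t* /dev/rdsk/c5t*
	# rm -r /dev/osa
	# touch /reconfigure; init 6        (or do a boot -r from the ok prompt)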

--------------------------------------------------------------------------------------

Scenario 15

During a failover, I go into the RM6 GUI and turn the "reconstruction rate"
up to max; I can see in the graph that it is working.

I come back the next morning and the GUI shows that it is still running, and
the LEDs on the disks involved are still flashing.....

What's going on here....

Solution:

The short answer is to quit RM6 after a period of time when you think
the failover has completed.

Then restart RM6; it will then tell you that the resync is done, and the
LEDs on the disks involved in the failover process will stop flashing.


The longer answer is that a P1 bug (4060003) has been filed against this.

--------------------------------------------------------------------------------------

Scenario 16

Using a fully loaded RSM2000 on a UE6000 with SEVM 2.4 and Oracle: after a
system reboot, the file system ownership changed from oracle to root.

How do you set the ownership to "oracle" permanently?

Solution:

srdb 11430

Use the following Volume Manager command:

# vxedit set user=oracle group=oracle mode=600 volume_name
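
The reason this sticks where a plain chown on the /dev entries would not:
vxedit records the user, group, and mode in the Volume Manager configuration,
so the settings are reapplied whenever the volume's device nodes are rebuilt.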

----------------------------------------------------------------------------------------
Scenario 17

We used vxdiskadm to put a LUN under Veritas control, and we see this on the
console when Veritas 2.4 initializes the disk (LUN):

 NOTICE: vxvm:vxio: Disk c3t5d0s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c3t5d1s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c3t5d2s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c3t5d3s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c3t5d4s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c3t5d5s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c5t5d0s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c5t5d1s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c5t5d2s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c5t5d3s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c5t5d4s2: Unexpected status on close: 19
 NOTICE: vxvm:vxio: Disk c5t5d5s2: Unexpected status on close: 19

Solution:

The message you're getting is OK to ignore for VM 2.4 with the RSM2000.
------------------------------------------------------------------------------------------


Scenario 18

I am using format on a 160gb LUN and format isn't showing the entire LUN?

Solution:

The short answer is BugID 4036085.

This has been fixed in 2.6 as well as in 2.5.1 8/97, or you can use VxFS if
you're using an earlier version of Solaris.
----------------------------------------------------------------------------------------

Scenario 19

Node name issues

1) In Phase I Sonoma, the node names can become corrupted. This can cause
all sorts of problems and is essentially the root of most, if not all, evil
with regards to software-related problems with the product.

Fortunately, there is an easy fix for even the worst manifestations of this
problem: a complete removal of the /dev/osa directory followed by a
re-configuration boot will rebuild the RSM2000 node names correctly. For a
given hardware configuration, you will get the same answer every time.
You should also remove the `rdriver' nodes from /dev/rdsk and /dev/dsk.
The script chk.rmodules.sh can be used to check the node names and also
their binding in VxVM. It creates a file in the current directory called
rdsk.rdriver.names; this is a list of the rdriver names that need to be
removed from /dev/rdsk and /dev/dsk.

examples (csh syntax):
	# rebuild the node names with a reconfiguration boot
	rm -r /dev/osa; touch /reconfigure; init 6

	# list the rdriver nodes reported by chk.rmodules.sh
	cat rdsk.rdriver.names
	/dev/rdsk/c0t4d1s6
	/dev/rdsk/c0t4d4s6
	/dev/rdsk/c0t4d5s6
	/dev/rdsk/c0t4d6s6
	/dev/rdsk/c2t5d0s6
	/dev/rdsk/c2t5d2s6
	/dev/rdsk/c2t5d3s6

	# remove the stale rdriver nodes from /dev/rdsk...
	foreach d (`cat rdsk.rdriver.names`)
		rm $d
	end
	# ...and their /dev/dsk counterparts
	foreach d (`cat rdsk.rdriver.names | sed -e 's/rdsk/dsk/g'`)
		rm $d
	end

----------------------------------------------------------------------------------------

Scenario 20

Hung process:

Sonoma uses a file to serialize access to shared resources, i.e. for
locking. If rm6 or one of the CLI utilities is hung, chances are it is a
leftover lock condition. Check for and remove the file /etc/osa/lunlocks.
Be careful *not* to delete the symlink /etc/raid/locks. This is a pointer
to a directory that contains more lock files. It is far less common, but it
is possible to have a hang condition because of an entry in this directory.

If a command is hung, use truss(1M) to observe the command. If it is in a
loop that involves a reference to a file in this directory, that file is a
likely culprit.

Be sure no legitimate use of the array is in progress before removing the
file.

If a process is hung while accessing node names under /dev/osa/dev/rdsk,
then rebuild the LUN names as per 1) above.

examples:
	truss -f lad
	truss -f -p PID
	rm /etc/osa/lunlocks

----------------------------------------------------------------------------------------

Scenario 21

Unable to perform an operation on a LUN:

Both controllers in an RSM2000 have a relationship to every LUN; each is the
primary or the secondary path. The primary controller has ownership of and
exclusive write access to the LUN's metadata. If there is a problem
performing an operation, you can access the LUN via the other channel and/or
by changing the LUN's ownership. This will likely shed more light on the
nature of the problem and generate additional diagnostic messages, or it may
resolve it.


example:
 lad
c0t4d1s0 1T63350903 LUNS: 1 4 5 6 
c2t5d0s0 1T63350944 LUNS: 0 2 3 
 

raidutil -c c0t4d1s0 -D 0
LUNs found on c0t4d1s0.
  LUN 1    RAID 5    16000 MB
  LUN 4    RAID 5    16000 MB
  LUN 5    RAID 5    16000 MB
  LUN 6    RAID 5    16000 MB
Deleting LUN 0.
Press Control C to abort.

Deleting LUN Failed: ** Check Condition **
SENSE Data:
7000050000000098 0000000094010000 
0000000000000000 0000000000000000 
0000000000000000 0000000000000000 
0000000000000000 0000000000000000 
0000000000000000 0000000000000000 
Sense Key: 05
ASC:     94
ASCQ:    01

raidutil program failed.

This terrifying message is saying that the LUN belongs to the other
controller (note in the lad output above that LUN 0 is owned by c2t5d0s0,
not c0t4d1s0). I intend to file RFEs on the error message handling.

-------------------------------------------------------------------------------------

Scenario 22

If you have multiple Sonomas, how do you make the connection between which
controllers go with which RAID Module?


In Phase Two, there is a file called "mnf", located in /etc/osa. This is an
ASCII file; inside, it has the RAID Module names along with the controllers
that go with each.
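
Since it is plain ASCII, a quick look is all it takes (the file's exact
layout is not shown here):

	# cat /etc/osa/mnf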

---------------------------------------------------------------------------------------

Scenario 23

VxVM/RSM2000 LUN errors:


1) The condition:
mopti# vxdisk define c2t4d6s2
vxvm:vxdisk: ERROR: Device c2t4d6s2: define failed:
	Disk is not usable

2) The LUN is labeled, but VxVM does not like the layout. Take a vtoc from a
good LUN of the same kind:

prtvtoc /dev/rdsk/c13t4d6s2 > good.r1.vtoc

fmthard -s good.r1.vtoc /dev/rdsk/c2t4d6s2
fmthard: Partition 4 specified as 16752640 sectors starting at 4096
	does not fit. The full disk contains 16375808 sectors.

(adjust the vtoc to the correct size reported above)

mopti# fmthard -s  fix.r1.vtoc  /dev/rdsk/c2t4d6s2
fmthard:  New volume table of contents now in place.

mopti# vxdisk init c2t4d6s2
mopti# vxdisk list | grep c2t4d6s2
c2t4d6s2     sliced    -            -            online

-----------------------------------------------------------------------------------------------------

Scenario 24

I created a 6-disk RAID 10 LUN that is 9 GB; the disks are 4.2 GB. I removed
one of the drives and the LUN started reconstructing. I have no hot spares
configured. What is it reconstructing the failed disk to?
Why is it doing this?

Solution:

This is the "awe and mystery" of "hot-relocation". Essentially, if vxrelocd
detects errors on a volume/drive, it scans your disks to see if there is
sufficient UNallocated space somewhere and relocates the data automagically.

This has been met with mixed reactions, so some people disable vxrelocd and
turn on vxsparecheck (the old hot-spare paradigm), as sketched below.
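
A sketch of how that swap is typically done (the file path and daemon lines
are assumptions from VxVM 2.x; check your own startup script for the exact
invocations):

	In /etc/rc2.d/S95vxvm-recover:

		#vxrelocd root &        <- comment out hot-relocation
		vxsparecheck root &     <- enable the old hot-spare daemon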

------------------------------------------------------------------------------------------------------

Scenario 25


Has anyone run the new RM6.1 with plain Solaris 2.5.1 (NOT 2.5.1 4/97 or
8/97)? I installed all the patches in the matrix, and upon booting I got
this error:

	Spill 5 normal

and it came back to the boot prom.

Any way around this?

Solution:

What I did to recover the system was to boot from CD-ROM, mount the root
file system as /mnt, and then use pkgrm with the "-R /mnt" option to tell it
that the root file system is under /mnt instead of /. I removed all five of
the "osa" packages and the system was back to normal (a sketch follows).
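
A minimal sketch (the root slice and package names are examples; use
pkginfo -R /mnt to find the actual "osa" package names on your system):

	ok boot cdrom -s
	# mount /dev/dsk/c0t3d0s0 /mnt            (your root slice)
	# pkginfo -R /mnt | grep -i osa
	# pkgrm -R /mnt SUNWosar SUNWosau ...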

-----------------------------------------------------------------------------------------------------

Scenario 26

An existing StorEDGE A3000 is running normally; the user then edits the
rmparams file to increase the number of LUNs to something greater than 8.
Everything appears to be normal until the user soon discovers that RM6
cannot see any controllers or drives anymore. However, he can still see his
LUNs.

Background:

As of this date, we still do not have >8 LUN support, so when he created a
>8 LUN config, this information was written out to what a lot of people call
the "private region" (the proper name is DACstore). The procedure below
describes how to recover from this scenario.


WARNING: These steps will completely destroy any previously existing
configuration.  These steps should only be done as a last resort and only
if the reset configuration option via the GUI and CLI does not work!!!

1.  To ensure that the problem wasn't caused by a problem in the
    rmparams file, reset the rmparams file to its original default
    state (this file should have been saved shortly after the initial
    install of RM6, but it is available from the CDROM if it wasn't).

2.  Remove all power from host and RSM2K.

3.  Disconnect all disks from the RSM2K save one.  The goal here is
    to force RM6 to rebuild a default configuration by denying it
    access to the previous (and corrupted) one.  This is done by
    removing the 3 disks that contain the existing configuration
    information.  Since there is no certainty as to where these disks
    will be located, I advocate removing them all except the 1 disk
    the RAID controllers need to build a new configuration.

4.  Disconnect the battery backup (this will ensure that NVRAM is
    drained on the RAID controllers) for about 20-30 seconds.

5.  Power on the RSM2K with the single disk.

6.  Power on and boot the host.

7.  After the host is up, start RM6 and verify that the configuration
    is accessible and reset.  If you still have problems, try another
    disk in step #3. 

8.  Re-install all remaining disks into the RSM2K.  You should observe
    the number of disks increase via the RM6 GUI Configuration application.

9.  To ensure that everything works, use the reset configuration option
    from the GUI or CLI.
-----------------------------------------------------------------------------------------------------

Scenario 27

IHAC with a newly installed RSM/2000 that started getting these errors
30 hours after the installation. Healthck reports no errors.
His LUNs are concatenated. His configuration consists of 8 LUNs: 7 LUNs
have 4 disks and 1 LUN has 6 disks.

Feb 19 06:39:03 ssht45 raid: Parity event Host=ssht45 Ctrl=1T71525352 Dev=c4t5d2s0
Feb 19 06:39:03 ssht45 raid:   Start Blk=032986E1 End Blk=032987E9 # Blks=0000000A LUN=02
Feb 19 07:58:46 ssht45 raid: Parity event Host=ssht45 Ctrl=1T73942619 Dev=c5t4d3s0
Feb 19 07:58:46 ssht45 raid:   Start Blk=021BAF61 End Blk=021BAFE9 # Blks=0000000A LUN=03
Feb 19 09:21:50 ssht45 raid: Parity event Host=ssht45 Ctrl=1T71525352 Dev=c4t5d4s0

Solution:


	[1] Replace whichever controller is yellow-lighted; in our case it
	    was the top one.

	[2] Replace the cache and processor memory. DO NOT transfer the
	    memory from the controller you are replacing! It is impossible
	    to tell which memory SIMM is bad, and it could be either. There
	    are NO DIAGNOSTICS for this problem.

	[3] Make certain that the controller firmware is at least 2.4.4.1.

	[4] Upgrade to RM6.1 on the server.

-------------------------------------------------------------------------------------------------------------

Scenario 28

 My customer has an SC2000 with 3 RSM2000s.
 One is in rootdg.

 As he discovered this is unsupported, we are trying to move this RSM2000
 out of rootdg.

 The RSM2000 is configured as 7 RAID 5 LUNs.

 Is there a way to transfer these disks from rootdg to a newly created disk
 group WITHOUT a data backup and restore procedure?

Solution:

srdb 12177
------------------------------------------------------------------------------------------------------------

Scenario 29

I am getting an error on Module 10, LUN 1; how can I tell what disk is
involved....


/var/adm/messages:
The errored I/O is being routed to the Resolution daemon
The Array Resolution Daemon is retrying an I/O on Module 10, 
LUN 1 at sector 129


So which cXtXdXsX is Module 10, LUN 1?

Look in the following file first:

/kernel/drv/rdriver.conf:
name="rdriver" module=10 lun=1 target=5 
parent="/pseudo/rdnexus@5" dev_a=0x801040 dev_b=0x801628;

(/devices/pseudo/rdnexus@5/rdriver@5,1:c,raw)

Find this device under /dev/rdsk:

ls -l /dev/rdsk/* | grep 'rdnexus@5/rdriver@5,1:c,r'

It is /dev/rdsk/c5t5d1s2 that is associated with Module 10, LUN 1.
--------------------------------------------------------------------------------------------------------