Saturday, January 30, 2010

TAF vs FAN vs FCF vs ONS


Introduction to Workload Management



What the differences and relationship among TAF/FAN/FCF/ONS?
1 Definition
1) TAF
a feature of Oracle Net Services for OCI8 clients. TAF is transparent application failover which will move a session to a backup connection if the session fails. With Oracle 10g Release 2, you can define the TAF policy on the service using dbms_service package. It will only work with OCI clients. It will only move the session and if the parameter is set, it will failover the select statement. For insert, update or delete transactions, the application must be TAF aware and roll back the transaction. YES, you should enable FCF on your OCI client when you use TAF, it will make the failover faster.

Note: TAF will not work with JDBC thin.

2) FAN
FAN is a feature of Oracle RAC which stands for Fast Application Notification. This allows the database to notify the client of any change (Node up/down, instance up/down, database up/down). For integrated clients, inflight transactions are interrupted and an error message is returned. Inactive connections are terminated. 
FCF is the client feature for Oracle Clients that have integrated with FAN to provide fast failover for connections. Oracle JDBC Implicit Connection Cache, Oracle Data Provider for .NET (ODP.NET) and Oracle Call Interface are all integrated clients which provide the Fast Connection Failover feature.
3) FCF
FCF is a feature of Oracle clients that are integrated to receive FAN events and abort inflight transactions, clean up connections when a down event is received as well as create new connections when a up event is received. Tomcat or JBOSS can take advantage of FCF if the Oracle connection pool is used underneath. This can be either UCP (Universal Connection Pool for JAVA) or ICC (JDBC Implicit Connection Cache). UCP is recommended as ICC will be deprecated in a future release. 
4) ONS


ONS is part of the clusterware and is used to propagate messages both between nodes and to application-tiers

ONS is the foundation for FAN upon which is built FCF.

RAC uses FAN to publish configuration changes and LBA events. Applications can react as those published events in two way :
- by using ONS api (you need to program it)
- by using FCF (automatic by using JDBC implicit connection cache on the application server)

you can also respond to FAN event by using server-side callout but this on the server side (as their name suggests it
)


Rodrigo Mufalani
"ONS send/receive messages about failures automatically. It is a daemon process that runs on each node notifying status from components of database, nodeapps.
If listener process fails on node1 his failure is notified by EVMD, then local ONS communicates the failure to remote ONS in remote nodes, then local ONS on these nodes notifying all aplications about failure that occurred on node1."


2 Relationship
ONS --> FAN --> FCF
ONS -> send/receive messages on local and remote nodes.
FAN -> uses ONS to notify other processes about changes in configuration of service level
FCF -> uses FAN information working with conection pools JAVA and others.

http://forums.oracle.com/forums/thread.jspa?messageID=3566976

3 To use TAF/FAN/FCF/ONS, do you need to configure/install in server or client side?

Does ONS automatically send messages ? 
or is there any settings to be done ?
Does ONS only broadcast msgs ?

ONS is part of the clusterware and is used to propagate messages both between nodes and to application-tiers

ONS is the foundation for FAN upon which is built FCF.

RAC uses FAN to publish configuration changes and LBA events. Applications can react as those published events in two way :
- by using ONS api (you need to program it)
- by using FCF (automatic by using JDBC implicit connection cache on the application server)

you can also respond to FAN event by using server-side callout but this on the server side (as their name suggests it)


Rodrigo Mufalani
"ONS send/receive messages about failures automatically. It is a daemon process that runs on each node notifying status from components of database, nodeapps.
If listener process fails on node1 his failure is notified by EVMD, then local ONS communicates the failure to remote ONS in remote nodes, then local ONS on these nodes notifying all aplications about failure that occurred on node1."


5 Are TAF and FAN mutually exclusive? or if TAF and FCF are mutually exclusive?
No. You can use both TAF and FAN at the same time, or both TAF and FCF, it depends on what you want to achieve with it. 

6 TAF Basic Configuration with FAN: Example
Oracle Database 10g Release 2 supports server-side TAF with FAN. 
To use server-side TAF:
1) create 
and start your service using SRVCTL
$ srvctl add service -d RACDB -s AP -r I1,I2
$ srvctl start service -d RACDB -s AP
2) configure TAF in the RDBMS by using the 
DBMS_SERVICE package.
execute dbms_service.modify_service ( ,-
service_name => 'AP' ,-
aq_ha_notifications => true ,-
failover_method => dbms_service.failover_method_basic ,-
failover_type => dbms_service.failover_type_session ,-
failover_retries => 180, failover_delay => 5 ,-
clb_goal => dbms_service.clb_goal_long);
3) When done, make sure that you define a TNS entry 
for it in your tnsnames.ora file. 
AP =
(DESCRIPTION =(FAILOVER=ON)(LOAD_BALANCE=ON)
(ADDRESS=(PROTOCOL=TCP)(HOST=N1VIP)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=N2VIP)(PORT=1521))
(CONNECT_DATA = (SERVICE_NAME = AP)))
Note that this TNS name does not need to specify TAF parameters 
as with the previous slide.

7 TAF Basic Configuration without FAN: Example
1) 
Before using TAF, it is recommended that you create and start a service that is used during 
connections. 
By doing so, you benefit from the integration of TAF and services. When you want to 
use BASIC TAF with a service, you should have the -P BASIC option when creating the service.
After the service is created, you simply start it on your database.
$ srvctl add service -d RACDB -s AP -r I1,I2 
 -P BASIC
$ srvctl start service -d RACDB -s AP
2) 
Then, your application needs to connect to the service by using a connection descriptor similar to the 
one shown in the slide. The FAILOVER_MODE parameter must be included in the CONNECT_DATA 
section of your connection descriptor.
AP =
(DESCRIPTION =(FAILOVER=ON)(LOAD_BALANCE=ON)
(ADDRESS=(PROTOCOL=TCP)(HOST=N1VIP)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=N2VIP)(PORT=1521))
(CONNECT_DATA =
(SERVICE_NAME = AP)
(FAILOVER_MODE =
(TYPE=SESSION)
(METHOD=BASIC)
(RETRIES=180)
(DELAY=5))))

Note: If using TAF, do not set the GLOBAL_DBNAME parameter in your listener.ora file.

8 Metalink notes
  • --Understanding Transparent Application Failover (TAF) and Fast Connection Failover (FCF) [ID 334471.1]
  • --How To Verify And Test Fast Connection Failover (FCF) Setup From a JDBC Thin Client Against a 10.2.x RAC Cluster [ID 433827.1]

  • --Fast Connection Failover (FCF) Test Client Using 11g JDBC Driver and 11g RAC Cluster [ID 566573.1]
  • --
    Questions about how ONS and FCF work with JDBC [ID 752595.1]
  • --
    How To Implement (Fast Connection Failover) FCF Using JDBC driver ? [ID 414199.1]


  • --How to Implement Load Balancing With RAC Configured System Using JDBC [ID 247135.1]

Friday, January 29, 2010

ORA-16146: standby destination control file enqueue unavailable

In one of my standby db, I found this error in the alert log file: 
ORA-16146: standby destination control file enqueue unavailable.


Cause: high workload in the primary db: many update & insert & delete activities > many commits > many archived log files > slow I/O in standby db cannot work fast enough to process those archived log files.


ORA-16146 occurs in primary when the control file enqueue is hold by some process for long time and while archiving to the standby, control file enqueue is not available.

There is no error related to control file in the standby this means, the control file enqueue is not getting blocked but it is getting locked for long time.This can happen due to slow I/O to the control file.

Solution: move the control files on the standby database to the faster disk and check whether issue reproduces.

 The standby controlfile is updated by several sources namely the RFS  processes and the local process which performs recovery or the MRP process  which is responsible for sustained recovery at the standby. The standby's  controlfile is updated with the latest logfile when its shipped from the source to the destination. The standby controlfile is updated with the logfile information for the sustained recovery to proceed. Also the SRL's need to be updated when the online logs are copied. The error 16146 is returned when the controlfile enqueue in the standby cannot be obtained. There were bugs before reported where primary hung because of the standby CF enqueue problems. Actually this error can indicate latent controlfile
 activities on the remote side and hence the stack might not be necessary.
 .
 - All processes waiting on CF and causing 16146
 - This does not affect the physical standby
 - This error on itself might indicate OS specific problems
 - The holder here is waiting for "logfile sequential read"




Read more »

Tuesday, January 26, 2010

PGA Monitoring Scripts using v$sql_workarea and v$sql_workarea_active

1. Finding top ten work areas requiring the most cache memory:

select * 
from
 (select workarea_address, operation_type, policy, estimated_optimal_size
    from v$sql_workarea
  order by estimated_optimal_size DESC)
where ROWNUM <=10;

2. Finding the percentage of work areas using maximum memory:

select operation_type, total_executions * 100  / optimal_executions "%cache"
From v$sql_workarea
Where policy='AUTO'
And optimal_executions > 0
Order By operation_type;

3. Finding the top ten biggest work areas currently allocated in the system:

select c.sql_text, w.operation_type, top_ten.wasize
From (Select *
      From (Select workarea_address, actual_mem_used wasize
            from v$sql_workarea_active
            Order by actual_mem_used)
      Where ROWNUM <=10) top_ten,
      v$sql_workarea w,
      v$sql c
Where    w.workarea_address=top_ten.workarea_address
        And c.address=w.address
        And c.child_number = w.child_number
        And c.hash_value=w.hash_value;          

4. Finding the percentage of memory that is over and under allocated:

select  total_used,
        under*100/(total_used+1) percent_under_use,
        over*100/(total_used+1)   percent_over_used
From
        ( Select
                sum(case when expected_size > actual_mem_used 
                                       then actual_mem_used else 0 end) under,
                sum(case when expected_size<> actual_mem_used 
                                       then actual_mem_used else 0 end) over,
                sum(actual_mem_used) total_used
        From v$sql_workarea_active
        Where policy='AUTO') usage; 

Labels: , , ,

Query returns a percentage of work areas used with optimal memory size.

select
trunc (
       (sum(case when name like 'workarea executions - optimal' 
                                       then value else 0 end) *100) /
         (
                  sum(case when name like 'workarea executions - optimal'   
                                             then value else 0 end)         +
                  sum(case when name like 'workarea executions - one pass'  
                                             then value else 0 end)         +
                  sum(case when name like 'workarea executions - multipass' 
                                             then value else 0 end)
         )
        ) optimal_percent
from v$sysstat
where name like 'workarea executions - %'
/

Note
  • If this feature is not enabled the previous query will fail with ORA-01476: divisor is equal to zero
  • 'workarea memory allocated' - the total amount of PGA memory dedicated to work areas allocated in Kb.
  • 'workarea executions - optimal' - the cumulative count of work areas which had an optimal size. For example optimal size is defined if the sort does not need to spill to the disk.
  • 'workarea executions - onepass' - the cumulative count of work areas using the one pass size. One pass is generally used for big work areas where spilling to disk cannot be avoided.
  • 'workarea executions - multipass' - the cumulative count of work areas running in more than one pass. This should be avoided and is the symptom of poorly tuned system.
    
A workarea execution is either optimal, onepass or multipass.

Labels: , ,

Query to see total PGA memory used by the instance

select sn.name, sum(s.value)
from v$sesstat s, v$statname sn
where s.statistic# = sn.statistic#
and sn.name like '%pga%'
group by sn.name

Labels: , ,

Friday, January 22, 2010

Finding statement/s which use lots of shared pool memory

SELECT substr(sql_text,1,40) "Stmt", 
count(*),
sum(sharable_mem) "Mem",
sum(users_opening)   "Open",
sum(executions)      "Exec"
FROM v$sql
GROUP BY substr(sql_text,1,40)
HAVING sum(sharable_mem) > ;


MEMSIZE is about 10% of the shared pool size in bytes

Labels: ,

Checking for high version counts

"Versions" of a statement occur where the SQL is character for character identical but the underlying objects or binds or etc.. are different.
For example:
User scott has a dept table:
SQL>select * from dept
User Tom has his own dept table:
SQL>select * from dept

When Scott and Tom executes the above same SQL in terms of characters, because the base object "dept" is different, Oracle will treat the two statements as two different versions of SQL.

SELECT address, 
hash_value,
version_count,
users_opening,
users_executing,
substr(sql_text,1,40) "SQL"
FROM v$sqlarea
WHERE version_count > 10;

Labels: ,

Finding the Library Cache hit ratio


    SELECT SUM(PINS) "EXECUTIONS",SUM(RELOADS) "CACHE MISSES WHILE EXECUTING"  
    FROM V$LIBRARYCACHE;
    If the ratio of misses to executions is more than 1%, then try to reduce the library cache misses.

Labels: ,

SQL Invalidation

Use the following query to identify cursors that are invalidated:
*****************************
set lines 300
col sql format a50
select substr(sql_text, 1, 40) "SQL", sql_id, invalidations 
from v$sqlarea 
where invalidations > &value 
order by invalidations;

Note: use the "&value" to dynamically enter a value to limit the output

Or use the following SQL to observer invalidations:

select sql_text,version_count,loads,invalidations,parse_calls
from v$sqlarea 
where sql_text like '%xxxxx%';

*****************************
What will cause cursor invalidation?
Activities like TRUNCATE, ANALYZE or DBMS_STATS.GATHER_XXX on tables or indexes, grants changes on unerlying objects.  The associated cursors will stay in the SQL area but when it will be reference next time, it should be reloaded and reparsed fully, so it will have negative impact on library cache.
****************************
Related metalink notes

  1. [ID 123214.1]Truncate - Causes Invalidations in the LIBRARY CACHE 
  2. WAIT SCENARIOS REGARDING LIBRARY CACHE PIN AND LIBRARY CACHE LOAD LOCK [ID 115656.1]

Labels: ,

Query shows SQL in the SGA where there are a large number of similar statements

The following query shows SQL in the SGA where there are a large number of similar statements:
SELECT substr(sql_text,1,40) "SQL",count(*),sum(executions) "TotalExecs"
FROM v$sqlarea
WHERE executions < 5
GROUP BY substr(sql_text,1,40)
HAVING count(*) > 30
ORDER BY 2;

Note: If there is latch contention for the library cache latches the above statement may cause yet further contention problems.

Labels: ,

v$sql vs. v$sqlarea, v$sql_shared_cursor and Cursor Sharing

V$SQL vs. V$SQLAREA
Each SQL statement has 1 parent cursor, 1 or more child cursors.

V$SQLAREA: parent cursor, it contains one row for each parent cursor.

V$SQL: each parent cursor can have one or more child cursors.

-- Compared to parent cursor, child cursors have heap 6 (execution plan).

-- Each parent requires at lest 1 child cursor.

-- Multiple parent cursors are created because of differences in sql statement text.
The following statements all require separate parent cursors:
SELECT COUNT(*) FROM tab1;
2 select count(*) from TAB1;
3 SELECT /*comment*/ COUNT(*) FROM tab1;
4 SELECT COUNT(*) FROM tab1;


Cursor Sharing: EXACT, FORCE and SIMILAR
-- If the dynamic initialization parameter cursor_sharing is set to "EXACT", this cursor sharing feature is disabled.  If it's set to "FORCE" or "SIMILAR", the feature is enabled.


-- This parameter can be set at session or system level.


-- You can explicitly disable cursor sharing at the SQL statement level by specifying the hint "cursor_sharing_exact".


--Use "SELECT address, child_address, sql_text FROM v$sql WHERE sql_text LIKE '.....%' to check if two sql statements are the same: use same parent and child cursor. 
--The following two similar sql statements will have different results when running the above query to check parent and child cursors, when the cursor sharing paramenter is set to different values: EXACT and FORCE.
select col1 from tab1 where col1=0;
select col1 from tab1 where col1=1;


cursor_sharing=exact: 1 parent cursor with 2 child cursors


cursor_sharing=force:  1 parent cursor with 1 child cursor.  That means Oracle treats the above two sql statements as the same statement. 


cursor_sharing=similar:
The histogram factor on the column statistics will have different impacts on the parent and child cursors. 
Without histogram, Oracle will treat the following 4 sql statements as one statement: 1 parent cursor vs. 1 child cursor:
select col1 from tab1 where col1=0;
select col1 from tab1 where col1=1;
select col1 from tab1 where col1=2;

select col1 from tab1 where col1=3;
****************************************************************
cursor_sharing=force
Problems: a single child cursor is used for all SQL statements sharing the same text after the replacement of the literals.
"Consequently, the literals are peeked only during the generation of the execution plan related to the first submitted SQL statement. Naturally, this could lead to suboptimal execution plans because literals used in subsequent SQL statements will lead to different execution plans" - Christian Antognini's book (Troubleshooting Oracle Performance).
****************************************************************

========================================================================
According to Christian Antognini's book (Troubleshooting Oracle Performance):
"if an application uses literals and cursor sharing is set to similar, the behavior depends on the existence of relevant histograms. If they do exists, similar behaves like exact.  If they don't exist, similar behaves like force.  This means that if you are facing parsing problems, more often than not, it is pointless to use similar."
=================================================================================


With histogram, the above 4 sql statements will have 2 parent cursors and 3 child cursors.  


V$SQL_SHARED_CURSOR
--describes each loaed child cursor
--contains set of boolean values describing why cursors could not be shared.
--boolean values for 1st child of each parent will always be false.


SQL> select address, child_address, sql_text from v$sql where sql_text like '.....%';
SQL> select child_number, child_address, optimizer_mode_mismatch from v$sql_shared_cursor where address='xxxxxx';





Labels: , , ,

Friday, January 15, 2010

My First Blog Post

I decide to create my own Oracle related blog, so that I can share some of Oracle tips with friends.


1 To identify the total amount and type of parsing going on in the system execute the following SQL statement:
SQL> select name, value from v$sysstat where name like 'parse count%';


2 To associate SQL statements with their bind variable names and values:
SQL> select s.address, s.sql_text, m.bind_name, d.value
 from v$sql s, v$sql_bind_metadata m, v$sql_bind_data d, v$sql_cursor c, x$kglcursor k
 where s.address=c.parent_handle 
         and s.address=k.kglhdpar
         and m.address=k.kglhdadr 
         and c.curno=d.cursor_num;

3 Diagnostic LIBRARY CACHE hit ratio:
 SQL> select namespace, gets, gethitratio GET_HR, pins, pinhitratio PIN_HR, reloads, invalidations INV from v$librarycache;

4 Monitoring SGA usage:
select * from v$sgastat;

Use V$SQLAREA to identify SQL with high parse calls and high resource usage (DISK_READS, BUFFER_GETS). These SQL statements and the application should be
tuned and optimized to minimize this activity. 
 
5 Locate similar SQL not using bind variables
select substr(sql_text,1,40) SQL, count(*) from v$sqlarea group by substr(sql_text,1,40) having count(*) > 5;

6 Locate cursors with different object definitions
select substr(sql_text,1,50), version_count from v$sqlarea where version_count > 5;

7 Estimate memory usage
select sum(sharable_mem) from v$sqlarea;

8 Locate anonymous blocks in order to convert them to package, so that to share SQL.
Avoid sizing the shared pool too large if SQL is not shared; iteratively arrive at the optimal value. Find anonymous blocks and convert to packages if possible.
SQL> select sql_text from v$sqlarea where command_type = 47;
Note: 47 is the command type for an anonymous PL/SQL block.

9 To identify the statements that are receiving a lot of parse calls execute the following:
SQL> select sql_text, parse_calls, executions from v$sqlarea where parse_calls > 100 and executions < 2*parse_calls;

10 Latch usage info:  "Normally excessive library cache contention will show up in waits for latch free event in V$SYSTEM_EVENT and V$SESSION_WAIT. Then the next step would be to determine which of the latches are contributing excessively to the waits for the latch free event. If it is the library cache or the shared pool latches then probably the library cache and shared pool tuning is required. V$SESSION_WAIT will display high number of waits for event latch free. V$LATCH will show that the contention occurs on the library cache latch."

select latch#, substr(name,1,25) Latch, sleeps from v$latch where sleeps != 0 order by sleeps desc;

11 Real time indication of library cache latch contention
select a.sid, a.username, a.program, a.status, b.name, c.wait_time, c.state
from v$session a, v$latch b, v$session_wait c
where a.sid = c.sid and c.p2 = b.latch# and c.event = 'latch free' and b.name like 'library%';

12 Determine the distribution of the load among the library cache child latches.
select gets, misses, sleeps, spin_gets, sleep1, sleep2, sleep3
from v$latch_children
where latch# = 99
order by sleeps desc;


13 Find sessions with the Highest CPU Consumption
set lines 300
col module format a50
SELECT s.sid, s.serial#, p.spid as "OS PID",s.username, s.module, st.value/100 as "CPU sec"
FROM v$sesstat st, v$statname sn, v$session s, v$process p
WHERE sn.name = 'CPU used by this session' -- CPU
AND st.statistic# = sn.statistic#
AND st.sid = s.sid
AND s.paddr = p.addr
AND s.last_call_et < 1800 -- active within last 1/2 hour
AND s.logon_time > (SYSDATE - 240/1440) -- sessions logged on within 4 hours
ORDER BY st.value;

14 Find Sessions with Highest Waits of a Certain Type
-- sessions with the highest time for a certain wait
SELECT s.sid, s.serial#, p.spid as "OS PID", s.username, s.module, se.time_waited
FROM v$session_event se, v$session s, v$process p
WHERE se.event = '&event_name'
AND s.last_call_et < 1800 -- active within last 1/2 hour
AND s.logon_time > (SYSDATE - 240/1440) -- sessions logged on within 4 hours
AND se.sid = s.sid
AND s.paddr = p.addr
ORDER BY se.time_waited;

15 Find Sessions with Waits
set lines 300
col module format a50
col event format a40
col username format a15
SELECT s.sid, s.serial#, p.spid as "OS PID", s.username, se.event, s.module, se.time_waited
FROM v$session_event se, v$session s, v$process p
where s.last_call_et < 1800 -- active within last 1/2 hour
AND s.logon_time > (SYSDATE - 240/1440) -- sessions logged on within 4 hours
AND se.sid = s.sid
AND s.paddr = p.addr
ORDER BY se.time_waited desc

16 10g or higher: Find Sessions with the Highest DB Time
-- sessions with highest DB Time usage
SELECT s.sid, s.serial#, p.spid as "OS PID", s.username, s.module, st.value/100 as "DB Time
(sec)", stcpu.value/100 as "CPU Time (sec)", round(stcpu.value / st.value * 100,2) as "% CPU"
FROM v$sesstat st, v$statname sn, v$session s, v$sesstat stcpu, v$statname sncpu, v$process p
WHERE sn.name = 'DB time' -- CPU
AND st.statistic# = sn.statistic#
AND st.sid = s.sid
AND sncpu.name = 'CPU used by this session' -- CPU
AND stcpu.statistic# = sncpu.statistic#
AND stcpu.sid = st.sid
AND s.paddr = p.addr
AND s.last_call_et < 1800 -- active within last 1/2 hour
AND s.logon_time > (SYSDATE - 240/1440) -- sessions logged on within 4 hours
AND st.value > 0;

17 Find blocking and blocked sessions and corresponding SQL statement
--Find blocked sessions
select sid,serial#, SQL_ID,status,ACTION, BLOCKING_SESSION,BLOCKING_SESSION_STATUS, EVENT from v$session where blocking_session is not null;
--Find blocked sessions' sql_id and sql_text
select sql_id,sql_text from v$sql where sql_id in (select sql_id from v$session where blocking_session is not null);

18 Look for SQL stmts with more than one parent cursor
-- (more than 1024 child cursors).
-- We hope that it returns no rows.

select sql_id, count(*) n
from v$sql
where child_number = 0
group by sql_id
having count(*) > 1 ; 

-- If any rows are returned from the previous query,
-- then please run the following
-- statement to see why the cursors are not sharing.

select *
from v$sql_shared_cursor
where sql_id in
(
select sql_id from v$sql where child_number = 0
group by sql_id having count(*) > 1
) ;

==========================================
Note: Most of the above scripts are found from Oracle guide, metalink, and some performance tuning books.