ORACLE: Yun'an (运安) outage analysis, 2020-11-27, redo log [not yet resolved], ASH_FLUSH

2020-11-27


[2020-11-30] New alert

QUESTION

1. Active Session History (ASH) performed an emergency flush. This may mean that ASH is undersized. If emergency flushes are a recurring issue, you may consider increasing ASH size by setting the value of _ASH_SIZE to a sufficiently large value. Currently, ASH size is 134217728 bytes. Both ASH size and the total number of emergency flushes since instance startup can be monitored by running the following query:

select total_size,awr_flush_emergency_count from v$ash_info;

CAUSE

Typically, some activity on the system creates more active sessions than usual, which fills the ASH buffers faster and causes this message to be displayed. It is not a problem per se; it only indicates that the buffers might need to be increased to support peak activity on the database.

SOLUTION

The current ASH size is displayed in the message in the alert log, or can be found using the following SQL statement.

select total_size from v$ash_info;

Then increase the value for _ash_size by some value, like 50% more than what is currently allocated. For example if total_size = 16MB, then an increase of 50% more would be (16MB + (16MB * 50%)) = 24MB.

sqlplus / as sysdba
alter system set "_ash_size"=25165824;

You can verify the change using the following select:

select total_size from v$ash_info;

Resolution:

Size: _ash_size = 134217728 + (134217728 * 50%) = 201326592

alter system set "_ash_size"=201326592;
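
As a rough sketch of the sizing arithmetic above, the candidate value can be computed directly from v$ash_info instead of by hand. The alias proposed_ash_size and the factor 1.5 (the 50% increase suggested above) are illustrative choices, not a recommendation that fits every system.

```sql
-- Current ASH buffer size, emergency flush count since startup, and a
-- candidate value 50% larger (1.5 is the illustrative growth factor).
select total_size,
       awr_flush_emergency_count,
       total_size * 1.5 as proposed_ash_size
from   v$ash_info;
```

With total_size = 134217728, the third column works out to 201326592, the value used in the alter system statement above.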

[2020-11-25]

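Symptoms:

After the Yun'an server was replaced (cutover on 2020-11-03), the shop-floor MES application frequently lost its service connection. Tracing the logs showed that the database was going down for no apparent reason, reporting: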

DBW6 (ospid: 11184): terminating the instance due to error 472

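Log analysis

When this happens, Oracle kills the process and shuts the database down automatically, so front-end applications can no longer connect.

According to the material consulted, error 472 may be caused by high memory usage, third-party software, or a bug.

Troubleshooting steps: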

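High memory usage: Zabbix monitoring was installed and showed that memory was normal at the time of every crash. In addition, it is only when memory runs too high that Linux invokes the kernel's Out of Memory (OOM) killer, which kills a process to free memory for the system. In the current case, however…
Bug: research showed that this bug was already fixed in 10g and should not recur in the current 11g release;
Third-party software: Kaspersky is installed on the database server, but neither the Kaspersky logs nor the system logs show any action that killed a database process. The problem still occurred after Kaspersky was turned off;
ADG: an ADG switchover was performed on the 3rd; with ADG stopped, the problem still occurred;
RMAN: with the RMAN backups stopped, the problem still occurred.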

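Final investigation:

A detailed review of alert.log showed that every redo log switch was accompanied by the following message, although no error was raised: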

Thread 1 cannot allocate new log, sequence 5814

Checkpoint not complete

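Possible causes:

1> The log files are too small, so switches happen too frequently;

2> There are too few log groups to handle the normal transaction volume;

3> The disk holding the log files has an I/O bottleneck, so writes are slow and block normal database operation;

4> An I/O bottleneck on the data file disks makes DBWR write too slowly;

5> The transaction volume is very large and DBWR is overloaded.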
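
Causes 1> and 2> can be checked from the dynamic performance views. The queries below are a minimal sketch: the first lists the size and status of each redo log group, the second counts log switches per hour from v$log_history. The aliases size_mb, switch_hour, and switches, and the 7-day window, are illustrative choices only.

```sql
-- Size, member count, and status (CURRENT / ACTIVE / INACTIVE) of each group.
select group#, thread#, bytes / 1024 / 1024 as size_mb, members, status
from   v$log
order  by group#;

-- Number of log switches per hour over the last 7 days.
select trunc(first_time, 'HH24') as switch_hour,
       count(*)                  as switches
from   v$log_history
where  first_time > sysdate - 7
group  by trunc(first_time, 'HH24')
order  by switch_hour;
```

Hours with a high switch count combined with small groups point toward cause 1>; groups that stay ACTIVE for a long time point toward the I/O or DBWR causes instead.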

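The current redo log configuration was 3 groups of 50 MB each, switching about once every 5 minutes. Oracle's guidance is that one switch every 15 to 20 minutes is normal, so the log size was changed from 50 MB to 500 MB. Subsequent log switches no longer produced "Checkpoint not complete", and the problem was resolved.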
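
The post does not show how the resize was carried out. Online redo log files cannot be resized in place, so a common approach is to add new, larger groups and drop the old ones once they are INACTIVE. The following is a sketch under assumptions: the group numbers 4 to 6 and the file paths under /u01/oradata/ORCL are hypothetical placeholders, not taken from this system.

```sql
-- Add three new 500 MB groups (group numbers and paths are hypothetical).
alter database add logfile group 4 ('/u01/oradata/ORCL/redo04.log') size 500m;
alter database add logfile group 5 ('/u01/oradata/ORCL/redo05.log') size 500m;
alter database add logfile group 6 ('/u01/oradata/ORCL/redo06.log') size 500m;

-- Switch away from the old CURRENT group and force a checkpoint so the
-- old groups can become INACTIVE; repeat until v$log shows them INACTIVE.
alter system switch logfile;
alter system checkpoint;
select group#, status from v$log;

-- Drop each old 50 MB group only once its status is INACTIVE.
alter database drop logfile group 1;
alter database drop logfile group 2;
alter database drop logfile group 3;
```

Dropping a group does not delete the operating-system file unless Oracle Managed Files are in use, so the old 50 MB files usually have to be removed by hand afterwards.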

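History analysis:

"Checkpoint not complete" first appeared at 02:25:02 on 2020-09-16 [prepress server replaced]

Checkpoint issue before the first crash

First crash, reporting error 472

Second occurrence: this time no checkpoint issue appeared

Date of the switch to the new server: 2020-11-03

First occurrence of the checkpoint message

First crash:

After that the crashes became fairly frequent: 10.10, 10.13, 10.14, 10.15, 10.18, 10.19, 10.22, 10.23, 10.24, 10.25, 10.26, 10.27, once a day at no fixed time, mostly around 2 a.m., 6 a.m., and 10 p.m.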

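Conclusion: the final analysis is that the default redo log files were too small for the heavy shop-floor activity. Of the 3 groups, 2 were in the ACTIVE state and 1 was CURRENT. The checkpoints covering the two ACTIVE groups had not completed, meaning the dirty buffers they protect had not yet been fully written to the data files by DBWR, so LGWR could not reuse those groups at the next log switch and had to wait. This is what produced the "Checkpoint not complete" message.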

[2020-11-27]

This work is licensed under a Creative Commons Attribution 4.0 International License.
