by shigemk2

For the time being, I will only write about technical topics

How to Collect Information for Troubleshooting Enterprise Servers #linucon

Preparations for kernel space problems

kernel panic

kdump may fail due to hardware problems

BIOS/firmware on the latest hardware may not be sufficiently tested

system freeze

Set /proc/sys/kernel/sysrq to 1 using /etc/sysctl.conf

An unexpected system freeze may happen during the boot process
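As a rough illustration of checking the SysRq setting mentioned above, here is a minimal Python sketch. The helper names `read_sysrq` and `sysrq_fully_enabled` are made up for this example and are not from the talk. It relies only on the fact that a value of 1 in /proc/sys/kernel/sysrq enables all SysRq functions.

```python
from pathlib import Path

# Path named in the notes above.
SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")


def sysrq_fully_enabled(raw: str) -> bool:
    """Return True when the raw file content is exactly 1 (all SysRq functions on)."""
    return raw.strip() == "1"


def read_sysrq(path: Path = SYSRQ_PATH):
    """Return the raw content of the sysrq file, or None if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_sysrq()
    if raw is None:
        print("sysrq setting is not readable on this system")
    elif sysrq_fully_enabled(raw):
        print("SysRq is fully enabled")
    else:
        print(f"SysRq value is {raw.strip()}; the notes suggest setting it to 1")
```

To make the setting persistent, the corresponding entry in /etc/sysctl.conf is `kernel.sysrq = 1`.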

quiet and rhgb command line parameters

OOM killer

A built-in feature which tries to survive an out-of-memory condition by killing some processes.

The Linux kernel is not designed carefully enough to completely avoid OOM killer deadlock

do not rely on the OOM killer too much

Linux Keywords - OOM Killer: ITpro

unexpected system reboot?

One of the most annoying troubles in Linux, because it is difficult to understand the reason for the reboot.

/var/log/messages

serial console

a relatively reliable way to capture kernel messages

/sys/module/printk/parameters/time (RHEL 6), /etc/rc.local (RHEL 5)
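When printk timestamps are enabled, each kernel message captured on the console starts with a bracketed time in seconds. The following is a small sketch of splitting such a line into its timestamp and text. The function name `split_timestamp` and the sample message are hypothetical, chosen only for illustration.

```python
import re
from typing import Optional, Tuple

# Matches a leading "[  seconds.microseconds]" prefix followed by the message text.
_TIMESTAMPED = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def split_timestamp(line: str) -> Optional[Tuple[float, str]]:
    """Return (seconds, message) for a timestamped kernel message line.

    Returns None when the line carries no timestamp prefix.
    """
    match = _TIMESTAMPED.match(line.rstrip("\n"))
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


if __name__ == "__main__":
    # Hypothetical captured line, not taken from a real log.
    print(split_timestamp("[   12.345678] eth0: link up"))
```

Having the timestamp as a number makes it easier to see how much time passed between the last message and a freeze or reboot.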

Some hardware supports redirection of the serial console

eth0

The network interface is not available during boot-up and initialization of the kernel kdump

How to use the kernel dump tool kdump and how to analyze its output - ymko's diary

a utility for saving kernel messages is available

service fail over

The failover will happen without any prior warning if the timeout of the watchdog is shorter than the timeout of the kernel warning mechanisms.

/proc/sys/kernel/hung_task_timeout_secs
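The relationship between the two timeouts can be sketched as a simple comparison. The function name and the example values below are assumptions for illustration, not settings recommended in the talk. The sketch also treats a hung task timeout of 0 as "check disabled", which is how the kernel interprets that value.

```python
def warning_precedes_failover(watchdog_timeout_secs: int,
                              hung_task_timeout_secs: int) -> bool:
    """Return True if the kernel hung-task warning can fire before the watchdog.

    A hung task timeout of 0 disables the check, so no warning is ever
    printed in that case.
    """
    if hung_task_timeout_secs == 0:
        return False
    return hung_task_timeout_secs < watchdog_timeout_secs


if __name__ == "__main__":
    # Hypothetical values: a 60 second watchdog with a 120 second hung task timeout.
    if not warning_precedes_failover(60, 120):
        print("failover would happen before any hung task warning is logged")
```

With these example numbers the watchdog fires first, which is the "no prior warning" situation described above.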

The cause of the timeout can be within hardware drivers or the hardware itself.

Programs like shell scripts are vulnerable to this kind of disturbance

Heartbeat/watchdog software needs painstaking error checking and retry mechanisms

What tools can we use for recording unexpected events?

  • System call auditing
  • SystemTap

What is SystemTap and how do I use it? - Red Hat Customer Portal

Red Hat + CentOS FAQ

http://fedoraproject.org/wiki/Red_Hat_Enterprise_Linux/ja#Red_Hat_Enterprise_Linux.E3.81.A8_Fedora.E3.81.AE.E9.81.95.E3.81.84.E3.81.AF.E4.BD.95.E3.81.A7.E3.81.97.E3.82.87.E3.81.86.E3.81.8B.EF.BC.9F

Watch out for integer overflow problems

  • This kind of problem suddenly happens one day
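To show why such a problem appears "suddenly", here is a small sketch of a 32-bit unsigned counter wrapping around. The tick rate of 100 per second and all function names are assumptions made for this example. Python integers do not overflow, so the 32-bit behaviour is simulated with a modulus.

```python
UINT32_MODULUS = 2 ** 32
SECONDS_PER_DAY = 86400


def naive_delta(previous: int, current: int) -> int:
    """Difference computed without considering wraparound."""
    return current - previous


def wrap_safe_delta(previous: int, current: int) -> int:
    """Difference between two 32-bit counter readings, tolerant of one wrap."""
    return (current - previous) % UINT32_MODULUS


def days_until_wrap(ticks_per_second: int) -> float:
    """Days of uptime before a 32-bit counter starting at zero wraps."""
    return UINT32_MODULUS / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    before = UINT32_MODULUS - 5   # reading taken just before the wrap
    after = 10                    # reading taken just after the wrap
    print(naive_delta(before, after))      # large negative number
    print(wrap_safe_delta(before, after))  # 15
    print(days_until_wrap(100))            # a little over 497 days
```

Code that uses the naive difference works for well over a year under this assumed tick rate and then misbehaves the moment the counter wraps.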

User Datagram Protocol - Wikipedia

Is SystemTap good at everything?

SystemTap can be used not only for measuring the performance of functionality but also for tracing functionality

Unfortunately, SystemTap is not a tool designed for monitoring over many years

Preparations for user space problems

When system trouble happens, you need to retrieve and examine log files as soon as possible

TOMOYO Linux

  • A tool for tracking/restricting various operations from boot.
  • The mainlined version has been available since the Linux 2.6.30 kernel
  • Name based access tracking. (/sbin/rsyslogd accessing resources)

the listener process of ssh daemon /usr/sbin/sshd is accessing

What is nice about TOMOYO Linux as a tracking tool?

\Akkari~n/

Single

AKARI is an unexpected usage of the LSM interface, but a useful technique for implementing various "single function LSM" modules

Linux Security Modules - Wikipedia

Loadable kernel module - Wikipedia

CaitSith

A new type of rule-based in-kernel access auditing and restricting tool.

TOMOYO AKARI

Conclusion

  • Troubleshooting is a matter of comparing the normal state and the current state.
  • It is important that you know what the normal state is before you encounter trouble
  • There are parameters and tools which help you understand what the normal state of your system is and what is happening to your system.
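The idea of comparing the normal state with the current state can be sketched as a dictionary diff. This is only an illustration: the function name `diff_state` and the sample values are hypothetical, and how the snapshots are collected is left open.

```python
from typing import Dict, Optional, Tuple


def diff_state(baseline: Dict[str, str],
               current: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (baseline_value, current_value)} for every key that differs.

    A key missing on one side is reported with None for that side.
    """
    changes = {}
    for key in sorted(set(baseline) | set(current)):
        before = baseline.get(key)
        after = current.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots recorded while healthy and during an incident.
    normal = {"kernel.sysrq": "1", "kernel.hung_task_timeout_secs": "120"}
    now = {"kernel.sysrq": "0", "kernel.hung_task_timeout_secs": "120"}
    for name, (before, after) in diff_state(normal, now).items():
        print(f"{name}: {before} -> {after}")
```

Recording the baseline while the system is healthy is the important part; the diff is only useful if the normal state was captured before the trouble.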