Preparations for kernel space problems
kernel panic
kdump may fail due to a hardware problem
BIOS/firmware on the latest hardware may not be sufficiently tested
system freeze
set /proc/sys/kernel/sysrq to 1 using /etc/sysctl.conf
an unexpected system freeze may happen during the boot process
remove the quiet and rhgb command line parameters so that boot messages stay visible
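The two preparations above can be checked with a small script. This is only a sketch: the function names `sysrq_enabled` and `boot_message_hiders` are made up for illustration, and it assumes that `quiet` and `rhgb` are the parameters you want to see removed from the kernel command line.

```python
def sysrq_enabled(sysctl_conf_text):
    """Return True if the last kernel.sysrq assignment in the text sets it to 1."""
    value = None
    for line in sysctl_conf_text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' or ';' are ignored.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "kernel.sysrq":
            value = val.strip()  # a later assignment overrides an earlier one
    return value == "1"


def boot_message_hiders(cmdline):
    """Return which of 'quiet' and 'rhgb' appear on a kernel command line."""
    params = cmdline.split()
    return [p for p in ("quiet", "rhgb") if p in params]
```

On a real system the inputs would be the contents of /etc/sysctl.conf and /proc/cmdline, read with `open(...).read()`.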
OOM killer
a built-in feature which tries to survive an out-of-memory condition by killing some processes.
Linux kernel is not designed carefully enough to completely avoid OOM killer deadlock
do not rely on the OOM killer too much
unexpected system reboot?
one of the most annoying troubles in Linux because it is difficult to understand the reason for the reboot.
/var/log/messages
serial console
a relatively reliable way to capture kernel messages
/sys/module/printk/parameters/time (RHEL 6), /etc/rc.local (RHEL 5)
some hardware supports redirection of the serial console
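Once timestamps are enabled, kernel messages captured from the console start with a `[seconds.microseconds]` prefix. A rough sketch of how such a capture could be scanned for the longest silence before a reboot follows. The helper names `parse_stamped` and `largest_gap` are hypothetical, and lines without a timestamp are simply skipped.

```python
import re

# Matches the "[    1.234567] message" prefix added when printk timestamps are on.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_stamped(lines):
    """Return (timestamp, message) pairs for lines that carry a timestamp."""
    entries = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gap(lines):
    """Return (gap_seconds, message_after_gap), or None with fewer than two entries."""
    entries = parse_stamped(lines)
    best = None
    for (t0, _), (t1, msg) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, msg)
    return best
```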
eth0
the network interface is not available during boot-up and initialization of the kernel
kdump
How to use the kernel dump tool kdump and how to analyze its dumps - ymko's diary (ymkoの日記)
a utility for saving kernel messages is available
service fail over
The fail over will happen without any prior warning if the timeout of the watchdog is shorter than the timeout of the kernel warning mechanisms.
/proc/sys/kernel/hung_task_timeout_secs
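The comparison described above can be written down as a tiny check. This is a simplified sketch: `failover_precedes_warning` is a made-up name, the watchdog timeout is whatever your cluster software uses, and the real kernel warning may appear somewhat later than the configured value.

```python
def failover_precedes_warning(watchdog_timeout_secs, hung_task_timeout_secs):
    """Return True when the watchdog would fire before a hung task warning is printed."""
    # A hung_task_timeout_secs of 0 disables the check, so no warning is ever printed.
    if hung_task_timeout_secs == 0:
        return True
    return watchdog_timeout_secs < hung_task_timeout_secs
```

For example, a 60 second watchdog combined with a 120 second hung task timeout gives `True`, meaning the fail over would arrive with nothing in the logs.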
the cause of the timeout can be within hardware drivers or the hardware itself.
Programs like shell scripts are vulnerable to this kind of disturbance
Heartbeat/watchdog software needs painstaking error checking and retry mechanisms
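As an illustration of what error checking plus retry can look like, here is a minimal sketch. `call_with_retry` and its parameters are hypothetical and not taken from any heartbeat product. It retries only on the listed exception types and re-raises the last error once the attempts are used up.

```python
import time


def call_with_retry(func, attempts=3, delay_secs=0.0, retry_on=(OSError,)):
    """Call func(), retrying on the given exceptions up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_secs)  # wait before the next try
    # Every attempt failed: report the last error instead of hiding it.
    raise last_error
```

A health check that fails once because of a short stall then no longer triggers a fail over by itself, while a persistent failure is still reported.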
What tools can we use for recording unexpected events?
- System call auditing
- SystemTap
What is SystemTap and how is it used? - Red Hat Customer Portal
[http://fedoraproject.org/wiki/Red_Hat_Enterprise_Linux/ja#Red_Hat_Enterprise_Linux.E3.81.A8_Fedora.E3.81.AE.E9.81.95.E3.81.84.E3.81.AF.E4.BD.95.E3.81.A7.E3.81.97.E3.82.87.E3.81.86.E3.81.8B.EF.BC.9F:title]
watch out for integer overflow problems
- This kind of problem suddenly happens one day
User Datagram Protocol - Wikipedia
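The notes do not say which counters overflow. As one hedged illustration of why such a problem "suddenly happens one day", assume a 32-bit tick counter (an assumption made only for this sketch). The sketch shows a wrap-safe difference and how long the counter runs before it wraps.

```python
MASK32 = 0xFFFFFFFF


def wrapped_delta(earlier, later, mask=MASK32):
    """Difference between two readings of a counter that wraps at mask + 1."""
    return (later - earlier) & mask


def days_until_wrap(ticks_per_sec, bits=32):
    """Days a counter of the given width runs before wrapping around."""
    return (1 << bits) / ticks_per_sec / 86400
```

With 100 ticks per second a 32-bit counter wraps after roughly 497 days, so code that computes `later - earlier` without masking works for over a year and then fails.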
Is SystemTap good at everything?
SystemTap can be used not only for measuring performance but also for tracing functionality
unfortunately, SystemTap is not a tool designed for continuous monitoring over years
Preparations for user space problems
when a system trouble happens, you need to retrieve and examine log files as soon as possible
TOMOYO Linux
- A tool for tracking/restricting various operations from boot.
- Mainlined version is available since Linux 2.6.30 kernel
- Name based access tracking. (/sbin/rsyslogd accessing resources)
the listener process of ssh daemon /usr/sbin/sshd is accessing
what is nice with TOMOYO Linux as a tracking tool?
\Akkari~n/
Single
AKARI is an unexpected usage of the LSM interface but a useful technique for implementing various "single function LSM" modules
Linux Security Modules - Wikipedia
CaitSith
A new type of rule-based in-kernel access auditing and restricting tool.
TOMOYO AKARI
Conclusion
- Troubleshooting is comparing the normal state with the current state.
- It is important that you know what the normal state is before you encounter troubles
- There are parameters and tools which help you understand what the normal state of your system is and what is happening to your system.
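The comparison of normal state and current state can be sketched as a diff of two snapshots. `diff_state` is a hypothetical helper and the keys are only sample parameter names. How the snapshots are collected (sysctl values, process lists, and so on) is left open here.

```python
def diff_state(normal, current):
    """Return {key: (normal_value, current_value)} for every key that differs.

    A key missing from one snapshot is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(normal) | set(current)):
        before = normal.get(key)
        after = current.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes
```

Recording a snapshot while the system is healthy is the part that has to happen before the trouble. Without it there is nothing to pass as `normal`.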