VSnake notes: Where to start troubleshooting

2013-08-01

Suppose you have been handed a Unix system and asked to figure out why it is slow, glitchy, or crashing. Where do you start?

A handy cheat sheet for sysadmins:

A few “must have”:

What exactly are the symptoms of the issue? Unresponsiveness? Errors?

When did the problem start being noticed?

Is it reproducible?

Any pattern (e.g. happens every hour)?

What were the latest changes on the platform (code, servers, stack)?

Does it affect a specific user segment (logged in, logged out, geographically located…)?

Is there any documentation for the architecture (physical and logical)?

Is there a monitoring platform? Munin, Zabbix, Nagios, New Relic… Anything will do.

Any (centralized) logs? Loggly, Airbrake, Graylog…

…

Who’s there?

$ w

$ last

…

What was previously done?

$ history

…

What is running?

$ pstree -a

$ ps aux

…

Listening services

$ netstat -ntlp

$ netstat -nulp

$ netstat -nxlp

…

CPU and RAM

$ free -m

$ uptime

$ top

$ htop

…
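A raw load average from uptime is hard to judge without knowing how many CPUs the box has. The sketch below divides one by the other; the function name `load_per_cpu` is made up for this note, and the usage line assumes a Linux box with /proc/loadavg and getconf available.

```shell
# load_per_cpu LOAD NCPU: print the load average divided by the CPU count.
# Both values are passed in explicitly, so the helper works with any numbers.
# (hypothetical helper, not part of the original cheat sheet)
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# Typical use on Linux (assumption: /proc/loadavg and getconf exist):
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
```

As a rough rule of thumb, a value that stays well above 1 suggests more runnable (or IO-blocked) tasks than the CPUs can serve.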

Hardware

$ lspci

$ dmidecode

$ ethtool eth0 /* needs an interface name; eth0 is only an example */

…

IO Performance

$ iostat -kx 2

$ vmstat 2 10

$ mpstat 2 10

$ dstat --top-io --top-bio

…

Mount points and filesystems

$ mount

$ cat /etc/fstab

$ vgs

$ pvs

$ lvs

$ df -h

$ lsof +D / /* beware not to kill your box */

…
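On a box with many mounts, the df output is easier to scan when filtered down to the filesystems that are nearly full. A minimal sketch, assuming the POSIX `df -P` column layout (capacity in the 5th field, mount point in the 6th) and mount points without spaces; `full_filesystems` is a made-up helper name.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "mountpoint use%" for every filesystem at or above THRESHOLD percent.
# (hypothetical helper; assumes mount points contain no whitespace)
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign before comparing
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage: df -P | full_filesystems 90
```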

Kernel, interrupts and network usage

$ sysctl -a | grep ...

$ cat /proc/interrupts

$ cat /proc/net/ip_conntrack /* /proc/net/nf_conntrack on newer kernels; may take some time on busy servers */

$ netstat

$ ss -s

…
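Plain netstat or ss output can run to thousands of lines on a busy server, so a per-state summary is often more telling than the raw list. The sketch below assumes the `ss -tan` layout where the first line is a header and the first column is the TCP state; `count_tcp_states` is a made-up name.

```shell
# count_tcp_states: read `ss -tan` output on stdin and print "count state",
# one line per TCP state, most frequent first (ties sorted by state name).
# (hypothetical helper; assumes the state is the first column)
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}

# Usage: ss -tan | count_tcp_states
```

A large pile of connections in one state (for example TIME-WAIT or SYN-RECV) is a hint worth following up on.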

System logs and kernel messages

$ dmesg

$ less /var/log/messages

$ less /var/log/secure

$ less /var/log/auth.log

…
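Kernel logs are long, so it can help to filter them for a few well-known trouble signs first. The pattern list below is an assumption, not an exhaustive or authoritative set, and `kernel_red_flags` is a made-up helper name; extend the patterns for your hardware and kernel.

```shell
# kernel_red_flags: read kernel log lines on stdin and keep only the ones
# matching a few common trouble patterns (OOM kills, disk IO errors,
# segfaults, kernel stack traces). Case-insensitive.
# (hypothetical helper; the pattern list is only a starting point)
kernel_red_flags() {
    grep -i -E 'out of memory|oom-killer|i/o error|segfault|call trace'
}

# Usage: dmesg | kernel_red_flags
```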

Cronjobs

$ ls -l /etc/cron*

$ cat /etc/crontab /etc/cron.d/*

$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done /* run as root */

…

Application logs

There is a lot to analyze here, but it’s unlikely you’ll have time to be exhaustive at first. Focus on the obvious ones, for example in the case of a LAMP stack:

Apache & Nginx; chase down access and error logs, look for 5xx errors, look for possible limit_zone errors.

MySQL; look for errors in mysql.log, traces of corrupted tables, an InnoDB repair process in progress. Look at the slow logs and determine whether there are disk/index/query issues.

PHP-FPM; if you have php-slow logs on, dig in and try to find errors (php, mysql, memcache, …). If not, turn them on.

Varnish; in varnishlog and varnishstat, check your hit/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?

HAProxy; what is your backend status? Are your health-checks successful? Do you hit your max queue size on the frontend or your backends?
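Two of the checks above reduce to simple counting, sketched here. `count_5xx` assumes the common/combined access log format used by default in Apache and Nginx, where the status code is the 9th whitespace-separated field; `hit_ratio` just takes hit and miss counts as arguments (in Varnish they would come from the cache_hit and cache_miss counters in varnishstat, whose exact names vary between versions). Both function names are made up for this note.

```shell
# count_5xx: read access-log lines on stdin and print "count status"
# for each 5xx status code seen, sorted by status code.
# (hypothetical helper; assumes the status code is field 9)
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n[$9]++ } END { for (s in n) print n[s], s }' |
        sort -k2,2
}

# hit_ratio HITS MISSES: print the cache hit ratio as a percentage,
# or "n/a" when both counters are zero.
hit_ratio() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        if (h + m == 0) { print "n/a"; exit }
        printf "%.1f%%\n", 100 * h / (h + m)
    }'
}

# Usage (the log path is an example, adjust to your setup):
#   count_5xx < /var/log/nginx/access.log
#   hit_ratio 90 10
```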

Conclusion

After these first 5 minutes (give or take 10 minutes) you should have a better understanding of:

What is running.

Whether the issue seems to be related to IO/hardware/networking or configuration (bad code, kernel tuning, …).

Whether there’s a pattern you recognize: for example a bad use of the DB indexes, or too many apache workers.

http://devo.ps/blog/2013/03/06/troubleshooting-5minutes-on-a-yet-unknown-box.html

http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/

The author says all this preliminary investigation takes 5 to 15 minutes. Perhaps, if you do it at least once a week and don't procrastinate.

original post http://vasnake.blogspot.com/2013/08/blog-post.html

1 comment:

Unknown, Wednesday, 19 March 2014, 15:06:00 GMT+4
My "hot" set:
pstree -a
ps aux
top
netstat -tulnpv
iptables -L -vn