Suppose you have been handed a Unix system and asked to figure out why it is slow, glitchy, or crashing. Where do you start?
A handy cheat sheet for sysadmins:
A few “must have”:
What exactly are the symptoms of the issue? Unresponsiveness? Errors?
When did the problem start being noticed?
Is it reproducible?
Any pattern (e.g. happens every hour)?
What were the latest changes on the platform (code, servers, stack)?
Does it affect a specific user segment (logged in, logged out, geographically located…)?
Is there any documentation for the architecture (physical and logical)?
Is there a monitoring platform? Munin, Zabbix, Nagios, New Relic… Anything will do.
Any (centralized) logs? Loggly, Airbrake, Graylog…
…
Who’s there?
$ w
$ last
…
What was previously done?
$ history
…
What is running?
$ pstree -a
$ ps aux
…
Listening services
$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp
…
CPU and RAM
$ free -m
$ uptime
$ top
$ htop
…
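The load averages printed by uptime are easier to judge relative to the number of CPU cores. The helper below is a minimal sketch of that normalization; the name load_per_core is made up for illustration and is not part of the original checklist.

```shell
# load_per_core LOAD CORES: print the load average divided by the core count.
# (Hypothetical helper name, used only for this illustration.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1          # avoid dividing by zero
        printf "%.2f\n", load / cores
    }'
}
```

On Linux it could be fed with load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". A result well above 1.00 suggests more runnable or IO-blocked tasks than cores, which is a hint for where to look next rather than a diagnosis.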
Hardware
$ lspci
$ dmidecode
$ ethtool
…
IO performance
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
…
Mount points and filesystems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /   # beware not to kill your box
…
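When df -h lists many filesystems, a small filter helps spot the ones that are nearly full. This is a sketch, assuming the POSIX output format of df -P (usage percentage in column 5, mount point in column 6) and mount points without spaces; the function name full_filesystems is hypothetical.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "mountpoint use%" for every filesystem at or above THRESHOLD percent.
# Assumes mount points contain no whitespace.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {          # skip the header line
        use = $5
        sub(/%/, "", use)                # "95%" -> "95"
        if (use + 0 >= limit) print $6, $5
    }'
}
```

Typical use would be df -P | full_filesystems 90.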
Kernel, interrupts and network usage
$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack   # may take some time on busy servers; the file is /proc/net/nf_conntrack on newer kernels
$ netstat
$ ss -s
…
System logs and kernel messages
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth.log
…
Cronjobs
$ ls /etc/cron*   # then cat the files of interest
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done
…
Application logs
There is a lot to analyze here, but it’s unlikely you’ll have time to be exhaustive at first. Focus on the obvious ones, for example in the case of a LAMP stack:
Apache & Nginx; chase down access and error logs, look for 5xx errors, look for possible limit_zone errors.
MySQL; look for errors in mysql.log, traces of corrupted tables, an InnoDB repair process in progress. Look at the slow logs and determine whether there are disk, index, or query issues.
PHP-FPM; if you have php-slow logs on, dig in and try to find errors (php, mysql, memcache, …). If not, turn them on.
Varnish; in varnishlog and varnishstat, check your hit/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?
HA-Proxy; what is your backend status? Are your health-checks successful? Do you hit your max queue size on the frontend or your backends?
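As an illustration of the "look for 5xx errors" step, the sketch below tallies server-error status codes in an access log. It assumes the common/combined log format, where the status code is the 9th whitespace-separated field; custom log formats will need a different field number. The function name count_5xx is hypothetical.

```shell
# count_5xx FILE: print "status count" for each 5xx status found in FILE.
# Assumes the status code is field 9 (common/combined access log format).
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n[$9]++ }
         END { for (s in n) print s, n[s] }' "$1" | sort
}
```

It would be called with the path of whatever access log the server writes, for example count_5xx access.log.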
Conclusion
After these first 5 minutes (give or take 10 minutes) you should have a better understanding of:
What is running.
Whether the issue seems to be related to IO/hardware/networking or configuration (bad code, kernel tuning, …).
Whether there’s a pattern you recognize: for example a bad use of the DB indexes, or too many apache workers.
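To keep a record of this first pass, some of the read-only commands from the checklist can be saved to files for later comparison. The following is only a sketch: the function names capture and snapshot, the file names, and the choice of commands are assumptions made for illustration.

```shell
# capture DIR NAME CMD [ARGS...]: save the output of CMD to DIR/NAME.txt,
# silently skipping commands that are not installed on this box.
capture() {
    dir=$1; name=$2; shift 2
    command -v "$1" >/dev/null 2>&1 || return 0
    mkdir -p "$dir" || return 1
    "$@" >"$dir/$name.txt" 2>&1 || true   # keep going even if one command fails
}

# snapshot DIR: run a few read-only commands from the checklist above.
snapshot() {
    capture "$1" who w
    capture "$1" processes ps aux
    capture "$1" memory free -m
    capture "$1" load uptime
    capture "$1" disks df -h
    capture "$1" sockets ss -s
}
```

A call such as snapshot "/tmp/triage-$(date +%Y%m%d-%H%M%S)" leaves one text file per command in a timestamped directory.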
The author says that all of this preliminary investigation takes 5 to 15 minutes. Maybe so, if you do it at least once a week and don't give in to procrastination.
original post http://vasnake.blogspot.com/2013/08/blog-post.html
my "hot" set:
pstree -a
ps aux
top
netstat -tulnpv
iptables -L -vn