firewall-cmd

In CentOS 7 several firewall front ends coexist (firewalld, iptables). firewalld is the default tool for managing the kernel's netfilter subsystem, but under the hood it still drives netfilter through commands such as iptables.

firewalld is not a firewall by itself: like iptables, it relies on the kernel's netfilter to do the actual packet filtering. In other words, both firewalld and iptables only maintain rules, while netfilter in the kernel is what actually enforces them; they simply differ in structure and usage. firewalld is a wrapper around iptables that makes rule management easier. It is not a replacement for iptables, and although the iptables command can still be used alongside firewalld, it is recommended to stick to firewall-cmd commands when firewalld is in use.

A major difference between firewalld and the iptables service:

With the iptables service, every single change means flushing all old rules and re-reading all rules from /etc/sysconfig/iptables; in other words, every modification requires a full firewall restart, which disrupts existing connections.

firewalld, by contrast, does not recreate all rules; it only applies the differences, so it can change settings at runtime without dropping existing connections.

The reason for this difference is that the iptables service stores its configuration in the single file /etc/sysconfig/iptables, whereas firewalld stores its configuration in multiple XML files under /usr/lib/firewalld/ (system, predefined configuration) and /etc/firewalld/ (user configuration).
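
For orientation, a minimal sketch of inspecting those configuration locations (ssh.xml is one of the stock service definitions; file names vary by firewalld version):

ls /usr/lib/firewalld/zones/       # predefined zone definitions (XML)
ls /usr/lib/firewalld/services/    # predefined service definitions (XML)
ls /etc/firewalld/zones/           # user overrides written by firewall-cmd --permanent
cat /usr/lib/firewalld/services/ssh.xml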

Differences between firewalld and iptables:
(1) firewalld can modify individual rules dynamically and manage the rule set at runtime, updating rules without breaking existing sessions and connections; with iptables, any change requires flushing and reloading the whole rule set before it takes effect.

(2) firewalld uses zones and services instead of chain-based rules.

(3) Inbound: firewalld denies by default (ping and ssh are not blocked), and each service must be explicitly allowed before it is let through; iptables allows by default, and you only add rules for what you want to block.

(4) Outbound: neither firewalld nor iptables restricts outgoing traffic.

Basic concepts of firewalld

On CentOS 7 and later, firewalld is built around two basic concepts:

  • service
  • zone

What is a service?

Put simply, a service is an application-level protocol definition.
For example, web browsing typically uses TCP ports 80 and 443, DNS resolution uses UDP port 53, accessing shared folders uses UDP ports 137 and 138, and SSH login uses port 22.

firewalld ships with built-in definitions for these common applications, which makes firewall configuration and management simpler and more user friendly.
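
A minimal sketch of working with the built-in services (the zone and service names are the stock ones; adjust to your environment):

firewall-cmd --get-services                         # list all built-in service definitions
firewall-cmd --zone=public --list-services          # services currently allowed in the public zone
firewall-cmd --permanent --zone=public --add-service=http
firewall-cmd --reload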

What is a zone?

A zone is even simpler to understand: it is a pre-grouped collection of the built-in services and rules, i.e. a policy template.

All zones are predefined under the directory /usr/lib/firewalld/zones.

They can also be listed with the firewall-cmd command:

[root@procketmq1 ~]# firewall-cmd --get-zones
block dmz drop external home internal public trusted work

The meaning of each zone, ordered from most to least trusted:

  • trusted – accepts all network connections. I do not recommend using this zone for dedicated servers or VMs connected to a WAN.

  • home – for home machines such as laptops and desktops on a LAN where you trust the other computers. Only selected TCP/IP ports are allowed.

  • internal – for internal networks where you mostly trust the other servers and computers on the LAN.

  • work – for workplaces where you trust your colleagues and the other servers. All of the above accept ssh.

  • public – the default zone. Only explicitly selected incoming connections are accepted (the stock definition allows ssh and dhcpv6-client); anything else must be opened per service or per host.

  • external – useful for router-style setups. You also need LAN and WAN interfaces for masquerading (NAT) to work properly.

  • dmz – only SSH connections are accepted.

  • block – rejects all incoming network connections. Only connections initiated from within the system are possible.

  • drop – drops all incoming network connections without any reply. Only outgoing connections are allowed.

In short, zones are sets of firewall policies (policy templates) that firewalld prepares in advance;

users can pick the set that fits their scenario and thereby switch between firewall policies quickly.
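
A short sketch of switching between these policy templates (eth0 as in the outputs below; substitute your own interface):

firewall-cmd --get-default-zone
firewall-cmd --set-default-zone=home                                # switch the default policy template
firewall-cmd --permanent --zone=internal --change-interface=eth0    # move an interface into another zone
firewall-cmd --reload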

Viewing the default zone

[root@procketmq1 ~]# firewall-cmd --get-default-zone
public

The default zone is public.

Viewing the currently active zones

[root@procketmq1 ~]# firewall-cmd --get-active-zones
public
  interfaces: eth0

The currently active zone is public, and the interface assigned to it is eth0.

Determining which zone applies

The default zone is public, but that does not mean every connection is handled with the public zone's configuration.

For an incoming request, firewalld decides which zone applies in the following three ways:

source: the source address of the request.

interface: the network interface the request arrives on.

default zone: the configured default zone.

These are checked in decreasing order of priority: if a match is found by source, the interface is not consulted, and only if neither matches is the default zone used.
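
For example, a small sketch of binding a source network to a zone so that it wins over the interface match (192.168.1.0/24 is only an illustrative network):

firewall-cmd --permanent --zone=internal --add-source=192.168.1.0/24
firewall-cmd --reload
firewall-cmd --get-active-zones    # internal now lists the source; public keeps the interface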

Cheat sheet of common commands

Additional notes

  1. Commands issued with firewall-cmd take effect immediately but are not permanent; they are lost on restart. To keep them, use permanent mode by appending the --permanent flag (see the sketch after this list).
  2. Permanent-mode changes, however, do not take effect immediately; they only apply after the firewall is reloaded or the system restarts. To apply them right away, run firewall-cmd --reload.
  3. If you remove the ssh service or the ssh port, the current remote session is not dropped, but you will not be able to reconnect after logging out.
  4. Ports added without specifying a zone go into the default zone (public, unless you changed the default).
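
A minimal sketch of the runtime vs. permanent workflow (port 8080 is only an example):

firewall-cmd --zone=public --add-port=8080/tcp                # runtime only, effective immediately
firewall-cmd --permanent --zone=public --add-port=8080/tcp    # written to /etc/firewalld, not yet active
firewall-cmd --reload                                         # load the permanent configuration now
firewall-cmd --runtime-to-permanent                           # or: persist the current runtime rules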

Viewing the rules currently enabled in the firewall

[root@procketmq1 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client
  ports: 3181/tcp 4118/tcp 19100/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
        rule family="ipv4" source address="x.x.x.x/32" port port="22" protocol="tcp" accept

Since public is the default zone,

firewall-cmd --list-all --zone=public 

this command and the plain firewall-cmd --list-all above have exactly the same effect.

Some notes on the output

  • public (active) means this zone is currently in the active state. If you query with --zone=internal you will notice there is no "(active)" after internal.

  • target: default

    The target is the action applied to packets. It can be one of the following:

    • default: no explicit target is set; traffic that matches no rule is handled by firewalld's built-in behaviour (essentially rejected)
    • ACCEPT: apart from the explicitly written rules, all incoming packets are accepted
    • REJECT: apart from the explicitly allowed rules, all incoming packets are rejected, and the connecting machine receives a rejection reply
    • DROP: apart from the explicitly allowed rules, all incoming packets are dropped, and no reply is sent to the connecting machine

  To change the target, run:

firewall-cmd --permanent --zone=public --set-target=DROP   # --set-target requires --permanent; run firewall-cmd --reload to apply
  • icmp-block-inversion: no

    Toggles between blacklist and whitelist handling of ICMP types; the default is no, i.e. blacklist mode: all ICMP types are allowed, and only the types added to icmp-blocks are refused.

  • interfaces: eth0

    The network interfaces bound to this zone.

  • sources

    Sources, which can be IP addresses or MAC addresses.

  • services

    The allowed services. In this output ssh has been removed from the public zone, which is why only dhcpv6-client is listed.

  • ports:

    The allowed destination ports, i.e. ports opened locally. Ports added here are public: any source IP can reach them.

  • protocols:

    The protocols allowed through.

  • masquerade: no

    Whether masquerading (SNAT) is enabled (yes/no); it rewrites the source address of outgoing packets.

  • forward-ports:

    Ports that are forwarded.

  • source-ports:

    Allowed source ports.

  • icmp-blocks:

    ICMP types can be added here. When icmp-block-inversion is no (blacklist), these ICMP types are rejected; when it is yes (whitelist), only these ICMP types are allowed.

  • rich rules:
    Rich rules, i.e. finer-grained, more detailed firewall policies; they have the highest priority among all firewall policies.

Querying, listing, adding and removing open ports:

This is probably the most frequently used feature.

firewall-cmd --zone=public --query-port=8080/tcp   # returns whether the port is open

firewall-cmd --zone=public --list-ports

firewall-cmd --permanent --zone=public --add-port=8080/tcp

firewall-cmd --permanent --zone=public --remove-port=8080/tcp

A port can be either a single port number or a port range, e.g. 5060-5070.

The protocol can be specified as tcp or udp.
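
For instance, a hedged example of opening a UDP port range (the range is only illustrative):

firewall-cmd --permanent --zone=public --add-port=5060-5070/udp
firewall-cmd --reload
firewall-cmd --zone=public --list-ports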

rich rules

What can appear inside a rich rule:

  1. rule: the rule itself.

  2. family='ipv4': restricts the rule to IPv4 addresses.

  3. source address='10.0.10.1': the IP (or IP range) to accept or reject.

  4. service name='ssh': the service, here ssh.

  5. port: a port number.

  6. protocol: the protocol.

  7. target action: drop (discard matching packets), reject (refuse them and send a reply), or accept (allow them).

Examples:

Allow a specific IP to log in over ssh:

firewall-cmd --permanent --zone=public --add-rich-rule="rule family='ipv4' source address='x.x.x.x' service name='ssh' accept"

That uses the service form; alternatively, open TCP port 22 directly:

firewall-cmd --permanent --zone=public --add-rich-rule="rule family='ipv4' source address='x.x.x.x' port port='22' protocol='tcp' accept"

Finally, don't forget to run firewall-cmd --reload.
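
A short sketch of inspecting and removing rich rules afterwards:

firewall-cmd --zone=public --list-rich-rules
firewall-cmd --permanent --zone=public --remove-rich-rule="rule family='ipv4' source address='x.x.x.x' service name='ssh' accept"
firewall-cmd --reload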

Installing and using ctop

What is ctop

ctop is a command-line tool for viewing and monitoring container status; think of it as top for containers.

We routinely use top to check the state of a Linux server. When running Docker containers we also want to see each container's CPU, memory and so on. Graphical tools such as Portainer and Rancher can do this, but if you just want a quick glance without hunting for a web page, ctop comes in handy.

ctop can display many container metrics, such as CPU utilization, memory utilization, disk I/O reads/writes, process IDs (PIDs), and network traffic sent (TX, from this server) and received (RX, by this server).

Installing ctop

ctop is an open-source tool; the source is on GitHub: https://github.com/bcicen/ctop

The installation steps are described clearly in the README.

I am on a CentOS Linux server, so only the Linux installation is covered here.

sudo wget https://github.com/bcicen/ctop/releases/download/v0.7.5/ctop-0.7.5-linux-amd64 -O /usr/local/bin/ctop
sudo chmod +x /usr/local/bin/ctop

These two commands are all it takes:

1. download the binary;

2. make ctop executable.

Using ctop

It is as simple as top: just run ctop and it shows the real-time status of every container.

Keyboard shortcuts

Learn a few shortcuts and you will find it very comfortable to use.

Key     Action                                              Frequency
enter   Open the container action menu.                     Often
a       Show all containers, including stopped ones.        Occasionally
f       Enable filtering.                                   Often
H       Toggle the header display.                          Rarely
h       Show the help dialog.                               Occasionally
s       Sort, e.g. by CPU or memory.                        Occasionally
r       Reverse the sort order.                             Occasionally
o       Show detailed status of the selected container.     Often
l       View the selected container's logs.                 Often
e       Open a shell in the selected container.             Often
c       Configure which columns are displayed.              Rarely
S       Save the current settings.                          Rarely
q       Quit ctop.                                          Often

For me, the most frequently used keys are l and e.

Previously, to view a container's logs I had to run docker ps, copy the container ID,

and then run docker logs; now I just press l.

Previously, to get into a container I had to type docker exec -it <container id> bash (or sh); now I just press e.

Much more convenient.

Pressing o shows the container's internal details.

Troubleshooting high load on a Linux system

High system load can be split into two cases:

High CPU, high load
Low CPU, high load

Case 1: high CPU, high load

  • Use vmstat to look at CPU load from the system-wide perspective;
  • Use top to look at CPU load from the per-process perspective.

1.1. Using vmstat to view CPU load at the system level

vmstat shows CPU resource usage from the system point of view.

Format: vmstat -n 1

Here -n prints the header only once, and the trailing 1 refreshes the output every second.

[root@mongopaas2 ~]# vmstat -n 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 6341500 23114444 0 34589480 0 0 412 47 0 0 14 2 84 0 0
1 0 6341500 23110900 0 34590268 0 0 0 0 43874 49129 10 3 88 0 0
3 0 6341500 23111476 0 34591180 0 0 0 0 42170 46862 10 3 88 0 0
7 0 6341500 23111256 0 34592052 0 0 0 0 44789 47291 9 3 88 0 0

The main columns in the output:

  • r: the number of runnable threads waiting for CPU. Since a CPU core can run only one thread at a time, the larger this number, the slower the system generally runs.

  • b: the number of blocked processes, i.e. processes waiting on I/O or other resources.

  • us: user CPU time. On a server doing heavy encryption/decryption I have seen us close to 100 and the run queue (r) reach 80 (the machine was under stress testing and performing poorly).

  • sy: system CPU time. If it is too high, a lot of time is spent in system calls, for example because of frequent I/O operations.

  • wa: percentage of CPU time spent waiting for I/O. A high value indicates serious I/O wait, possibly caused by heavy random disk access or by a disk performance bottleneck.

  • id: percentage of CPU time that is idle. If this value stays at 0 while sy is twice us, the system is facing a CPU shortage.

    Common symptoms and interpretations:

  • If r is frequently greater than 4 and id frequently below 40, the CPU load is heavy.

  • If si and so are consistently non-zero, the system is short of memory.

  • If disk I/O (bi/bo) is consistently non-zero and the blocked queue (b) is greater than 3, disk I/O performance is poor.

1.2. Using top to view CPU load at the process level

top lets you view CPU, memory and other resource usage from the per-process perspective.

In the default view the third line shows overall CPU usage, and the process list below shows each process's resource usage.

Tip: press capital P in the interface to sort the results by CPU usage in descending order, which quickly pinpoints the processes using the most CPU.

Here the process using the most CPU is 195317.

From there you can analyse that process in detail.

For a Java program the usual steps are (see the sketch after this list):

  1. use top -Hp PID to find the thread (TID) using the most CPU;
  2. use jstack to dump the thread stacks;
  3. use printf %x tid to convert the hottest thread's TID to hexadecimal;
  4. look that thread up in the stack dump.
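
A hedged sketch of that workflow; the TID (195400) and its hex value are placeholders, the PID comes from the example above:

top -Hp 195317                                   # list threads of the busy process; suppose TID 195400 is hottest
printf '%x\n' 195400                             # -> 2fb48, the hexadecimal nid used in jstack output
jstack 195317 > /tmp/jstack.195317
grep -A 20 'nid=0x2fb48' /tmp/jstack.195317      # show that thread's stack frames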

Case 2: low CPU, high load

Problem description
The Linux system is running no business workload, yet top shows something like the screenshot below: the CPU is largely idle while load average is very high.

How to handle it

  • load average is an estimate of CPU load: the higher the value, the longer the task queue and the more tasks are waiting to run.

  • In this situation the culprit is often processes stuck in the uninterruptible D state. Use ps -axjf to check whether any D-state processes exist.

  • The D state is uninterruptible sleep. A process in this state cannot be killed and cannot exit on its own; the only remedies are restoring the resource it is waiting on or rebooting the system.

    Processes waiting on I/O are typically in uninterruptible sleep (the D state), so they can be found directly:

ps -eo state,pid,cmd | grep "^D"

Summary

1. High CPU, high load
  1. Use top to find the process (PID) using the most CPU;
  2. Use top -Hp PID to find the thread (TID) using the most CPU;
  3. For Java programs, use jstack to dump the thread stacks (and involve the application team if needed);
  4. Use printf %x tid to convert the hottest thread's TID to hexadecimal;
  5. Look that thread up in the stack dump.
2. Low CPU, high load
  1. Use top to check the CPU I/O wait time, i.e. %wa;
  2. Use ps -axjf to check for zombie or D-state processes;
  3. Use iostat -d -x -m 1 10 to check disk I/O (install with yum install -y sysstat);
  4. Use sar -n DEV 1 10 to check network I/O;
  5. Use the following command to find the processes generating I/O:
ps -e -L h o state,cmd  | awk '{if($1=="R"||$1=="D"){print $0}}' | sort | uniq -c | sort -k 1nr

Understanding Memory

Page frames and pages

Memory in Linux is organized in pages; by default a page is 4 KB. Contiguous addresses within a page map to physically contiguous RAM.

Contiguous pages, however, are not necessarily laid out contiguously in physical memory; they can be anywhere.

So the Linux kernel manages physical memory in units of whole pages rather than individual addresses.

Understanding page frames and pages
Memory in Linux is organized in the form of pages (typically 4 KB in size). Contiguous linear addresses within a page are mapped onto contiguous physical addresses on the RAM chip. However, contiguous pages can be present anywhere in physical RAM.
Access rights and physical address mapping in the kernel is done at a page level rather than for every linear address.
A page refers both to the set of linear addresses that it contains as well as to the data contained in this group of addresses.

The paging unit thinks of all physical RAM as partitioned into fixed-length page frames. Each page frame contains a page. A page frame is a constituent of main memory, and hence it is a storage area. It is important to distinguish a page from a page frame;
the former is just a block of data, which may be stored in any page frame or on disk. The paging unit translates linear addresses into physical ones. One key task in the unit is to check the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a Page Fault exception (see Chapter 4 and Chapter 8). The data structures that map linear to physical addresses are called page tables ; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.

Pages can optionally be 4 MB in size. However this is not advised except for applications where the expected data unit is large.

The kernel considers the following page frames as reserved:

Those falling in the unavailable physical address ranges
Those containing the kernel's code and initialized data structures
A page contained in a reserved page frame can never be dynamically assigned or swapped to disk. As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000 i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured. A typical configuration yields a kernel that can be loaded in less than 3 MB of RAM

The remaining portion of the RAM barring the reserved page frames is called dynamic memory. It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible.

The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that are used to contain pages that belong to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data. It is not free when the page frame contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on

Allocating memory to processes
A kernel function gets dynamic memory in a fairly straightforward manner since the kernel trusts itself. All kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.

When allocating memory to User Mode processes, the situation is entirely different:

Process requests for dynamic memory are considered non-urgent. When a process's executable file is loaded, for instance, it is unlikely that the process will address all the pages of code in the near future. Similarly, when a process invokes malloc( ) to get additional dynamic memory, it doesn't mean the process will soon access all the additional memory obtained. Thus, as a general rule, the kernel tries to defer allocating dynamic memory to User Mode processes.
Because user programs cannot be trusted, the kernel must be prepared to catch all addressing errors caused by processes in User Mode.
When a User Mode process asks for dynamic memory, it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space. This interval is called a "memory region". A memory region consists of a range of linear addresses representing one or more page frames. Each memory region therefore consists of a set of pages that have consecutive page numbers.

Following are some typical situations in which a process gets new memory regions:

A new process is created
A running process decides to load an entirely different program (using exec()). In this case, the process ID remains unchanged, but the memory regions used before loading the program are released and a new set of memory regions is assigned to the process
A running process may perform a "memory mapping" on a file
A process may keep adding data on its User Mode stack until all addresses in the memory region that map the stack have been used. In this case, the kernel may decide to expand the size of that memory region
A process may create an IPC-shared memory region to share data with other cooperating processes. In this case, the kernel assigns a new memory region to the process to implement this construct
A process may expand its dynamic area (the heap) through a function such as malloc( ). As a result, the kernel may decide to expand the size of the memory region assigned to the heap
Demand paging
The term demand paging denotes a dynamic memory allocation technique that consists of deferring page frame allocation until the last possible moment, i.e. until the process attempts to address a page that is not present in RAM, thus causing a Page Fault exception.

The motivation behind demand paging is that processes do not address all the addresses included in their address space right from the start; in fact, some of these addresses may never be used by the process. Moreover, the program locality principle ensures that, at each stage of program execution, only a small subset of the process pages are really referenced, and therefore the page frames containing the temporarily useless pages can be used by other processes. Demand paging is thus preferable to global allocation (assigning all page frames to the process right from the start and leaving them in memory until program termination), because it increases the average number of free page frames in the system and therefore allows better use of the available free memory. From another viewpoint, it allows the system as a whole to get better throughput with the same amount of RAM.

The price to pay for all these good things is system overhead: each Page Fault exception induced by demand paging must be handled by the kernel, thus wasting CPU cycles. Fortunately, the locality principle ensures that once a process starts working with a group of pages, it sticks with them without addressing other pages for quite a while. Thus page Fault exceptions may be considered rare events.

An addressed page may not be present in main memory either because the page was never accessed by the process, or because the corresponding page frame has been reclaimed by the kernel.

Overcommitting memory
Linux allows overcommitting memory to processes. As we have seen, even though a process may malloc() 1 GB, Linux does not hand it 1 GB immediately, but rather only issues memory when the process actually needs it. Additionally, Linux can overcommit the memory allocation. So if 5 processes each ask for 1 GB but the total amount of RAM and swap adds up to only 4 GB, Linux may still allocate the 5 GB without any error. The overcommit behaviour depends on the vm.overcommit_memory and vm.overcommit_ratio settings. Refer to http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt for further details on these parameters. In most cases overcommitting will not have any negative impact on the system unless you know your processes will utilize all of the memory they are granted, with no additional memory left over. On the other hand, overcommitting has no advantage in server environments where capacity planning and calculations should be performed accurately.
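
A small sketch of inspecting and tuning these knobs (the values shown in comments are the common defaults, not a recommendation):

sysctl vm.overcommit_memory        # 0 = heuristic, 1 = always overcommit, 2 = strict accounting
sysctl vm.overcommit_ratio         # percentage of RAM counted toward CommitLimit in mode 2
sysctl -w vm.overcommit_memory=2   # enable strict overcommit accounting (use with care)
grep -E 'CommitLimit|Committed_AS' /proc/meminfo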

atop shows the overcommit limit and the current committed memory, but this can be a bit misleading. I explain this calculation below

atop output:
MEM | tot 11.7G | free 75.5M | cache 3.9G | dirty 66.7M | buff 42.1M | slab 198.8M |
SWP | tot 2.0G | free 2.0G | vmcom 9.2G | vmlim 7.8G |

meminfo output
[user@server ~]$ cat /proc/meminfo
MemTotal: 12305340 kB
MemFree: 73672 kB
Buffers: 43120 kB
Cached: 4074220 kB
SwapTotal: 2048276 kB
SwapFree: 2047668 kB
Dirty: 62236 kB
Slab: 203948 kB
CommitLimit: 8200944 kB
Committed_AS: 9630052 kB
[user@server ~]$

Note: slight differences in the above two are from the fact that meminfo was run a few seconds after atop
From the above we can conclude the following -
Total memory: 11.7 GB
Memory used for the disk cache: 3.9 GB
Memory used for buffers and slab: ~240 MB
Memory free: ~75 MB
Memory actually used by processes: 11.7 - (3.9 + 0.24 + 0.075) => 7.485
Note: This can also be roughly estimated from the RSS of all processes. However the Resident size of each process will also contain shared memory making this difficult to estimate
Committed_AS field tells us the amount of memory we have committed to these processes => 9630052 KB => ~9.2 GB. Therefore these processes could theoretically ask for upto 9.2 GB
The documentation of the CommitLimit field tells us: based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to be allocated on the system. This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory').
On our system (and on most default systems) overcommit_memory is set to "1", which means the kernel pretends there is always enough memory until it actually runs out.
So this overcommit limit figure is largely irrelevant here. The only relevant point is that if the processes on the system do come to need 9.2 GB instead of their current 7.48 GB, that space will most likely be taken from the disk cache (currently at 3.9 GB).
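
As a sanity check of the CommitLimit shown above, assuming the default vm.overcommit_ratio of 50:

CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
            = 2048276 kB + 12305340 kB * 50 / 100
            = 2048276 kB + 6152670 kB
            ≈ 8200944 kB, which matches the /proc/meminfo output.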
Page faults and swapping
Page faults and swapping are two independent processes. A page fault takes place when a process requests an address that has been allocated to it but not yet backed by a page frame. Upon receiving this request the kernel confirms that the requested address belongs to the process and, if so, allocates a new page frame from memory and assigns it to the process.

Swapping occurs in one of two scenarios:

When the kernel needs to allocate a page of memory to a process and finds that there is no memory available. In this case the kernel must swap out the least used pages of an existing process into the swap space (on disk) and allocate those page frames to the requesting process.
There is a kernel parameter that determines the swappiness of the kernel. The value is between 1 and 100 and is set to around 60 by default. A value of 100 means the kernel will be considerably aggressive in preferring allocation of memory to the disk cache over processes. A value of 60 can result in occasional swapping out of process-owned pages onto disk to make room for additional disk-cache pages.
In general page faults are rare since they only occur when a process needs to access latent memory space. In fact, on a long-running server where no new processes are being forked, page faults should almost never occur.

Swapping is bad for performance and should also never occur in a well-planned deployment. Swapping almost always signifies that your server does not have adequate memory to run all its processes. In fact, during constant swapping all your memory is used up by existing processes, and there is no memory available for the disk cache either. Constant swapping can bring a server to a standstill. It is important to note that lack of memory for the page cache will never cause swapping; it is only when there is no memory available for your processes that swapping occurs.

A better description of swapping:
When the kernel has no free space it needs to free up memory.
It has the following options:
drop a disk buffer cache page that is not dirty
flush dirty pages and drop them
move a page used by a process to disk
It uses roughly the following algorithm to decide:
is there inactive memory that can be reclaimed by dropping a page?
if not, it can do one of the following:
write a dirty page to disk and reclaim it
reclaim an active, non-dirty disk buffer cache page
swap out a user-mode process page to disk
Depending on the value of swappiness it will prefer swapping out over reclaiming disk buffer pages, or vice versa.
Whenever that user-mode process needs the page again, it will be swapped back in.
In an ideal world there should be no swap-outs and definitely no swap-ins, since those signify that the system is low on memory.
Seeing swap-ins may also signify that the swappiness value is incorrectly set for the type of workload. For instance, on app servers where the only disk activity may be logging or similar ancillary activity, one may want to lower swappiness before concluding that memory has run out when swap-ins and swap-outs are observed.
VmSize, Resident size and Actual size of a process
The resident size of a process (as shown in top or ps) represents the amount of non-swapped memory the kernel has already allocated to the process. This number is inaccurate when totalled (especially in a multi-process app like postgres or apache) since it includes shared memory. It also does not include the swapped-out portion of the process. VmSize is the total memory of a program, including its resident size, swap size, code, data, shared libraries, etc. The SWAP column in top is calculated as VmSize - RSS, which I believe is an incorrect calculation. Let's take an example and understand these numbers better:

[user@server ~]$ cat /proc/9894/status
Name: java
State: S (sleeping)
VmPeak: 4109896 kB
VmSize: 4099492 kB
VmLck: 0 kB
VmHWM: 2855336 kB
VmRSS: 2848964 kB
VmData: 4000304 kB
VmStk: 84 kB
VmExe: 36 kB
VmLib: 65392 kB
VmPTE: 5940 kB

VmPeak: Peak virtual memory size.
VmSize: Virtual memory size.
VmLck: Locked memory size (see mlock(3)).
VmHWM: Peak resident set size ("high water mark").
VmRSS: Resident set size.
VmData, VmStk, VmExe: Size of data, stack, and text segments.
VmLib: Shared library code size.
VmPTE: Page table entries size (since Linux 2.6.10).
We can conclude from the above -

Total program size is 4099492 KB => 3.9 GB. I actually don't know exactly what this number signifies; I do know it accounts for the resident size of the program plus swap plus other mappings. However, at the time the above snapshot was taken there was zero swap utilization.
Current actual physical memory usage by the program = 2848964 KB => 2.71 GB
Maximum physical memory usage by the program in its history since startup => 2855336 KB => 2.72 GB
There is another aspect to remember here. Even though the resident size of the above program is 2.71 GB, that does not mean the program is actually using 2.71 GB at this moment. For instance, in the above Java program, java asks the kernel for additional memory whenever it needs it, up to the limit specified for the Java process. This memory is resident (unless a portion is swapped out). However, after running an intensive task, when java clears a large set of objects through a GC, this memory is not given back to the OS. The actual memory used by java at a point in time may be significantly less than RSS. This can be measured independently provided the process allows you to do so.

Note that the VmHWM parameter is interesting inasmuch as it signifies the amount of physical memory required for the process at peak times.

Types of page faults
Minor page fault: If the page is loaded in memory at the time the fault is generated, but is not marked in the memory management unit as being loaded in memory, then it is called a minor or soft page fault. The page fault handler in the operating system merely needs to make the entry for that page in the memory management unit point to the page in memory and indicate that the page is loaded in memory; it does not need to read the page into memory. This could happen if the memory is shared by different programs and the page is already brought into memory for other programs.

Major page fault: If the page is not loaded in memory at the time the fault is generated, then it is called a major or hard page fault. The page fault handler in the operating system needs to find a free page in memory, or choose a page in memory to be used for this page's data, write out the data in that page if it hasn't already been written out since it was last modified, mark that page as not being loaded into memory, read the data for that page into the page, and then make the entry for that page in the memory management unit point to the page in memory and indicate that the page is loaded in memory. Major faults are more expensive than minor page faults and may add disk latency to the interrupted program's execution. This is the mechanism used by an operating system to increase the amount of program memory available on demand. The operating system delays loading parts of the program from disk until the program attempts to use it and the page fault is generated.

Invalid page fault: If a page fault occurs for a reference to an address that's not part of the virtual address space, so that there can't be a page in memory corresponding to it, then it is called an invalid page fault. The page fault handler in the operating system then needs to terminate the code that made the reference, or deliver an indication to that code that the reference was invalid.

Understanding the Linux page cache
(More details available in the disk IO section)

The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes' read requests. If the page is not already in the cache, a new entry is added to the cache and filled with the data read from the disk. If there is enough free memory, the page is kept in the cache for an indefinite period of time and can then be reused by other processes without accessing the disk.

Similarly, before writing a page of data to a block device, the kernel verifies whether the corresponding page is already included in the cache; if not, a new entry is added to the cache and filled with the data to be written on disk. The I/O data transfer does not start immediately: the disk update is delayed for a few seconds (unless an explicit fsync() is called), thus giving a chance to the processes to further modify the data to be written (in other words, the kernel implements deferred write operations).

Typically the kernel will use as much of the dynamic memory available to it for the page cache, only reclaiming page frames from the page cache periodically or as and when needed by a process or by newer pages that need to be written into the page cache. When the system load is low, the RAM is filled mostly by the disk caches and the few running processes can benefit from the information stored in them. However, when the system load increases, the RAM is filled mostly by pages of the processes and the caches are shrunk to make room for additional processes. Page reclaiming by default uses an LRU algorithm.

Read http://www.redhat.com/magazine/001nov04/features/vm/ for details on the lifecycle of a memory page

Understanding the PFRA
The objective of the page frame reclaiming algorithm (PFRA ) is to pick up page frames and make them free. The PFRA is invoked under different conditions and handles page frames in different ways based on their content.

The PFRA is invoked on one of the following -

Low on memory reclaiming - The kernel detects a "low on memory" condition
Periodic reclaiming - A kernel thread is activated periodically to perform memory reclaiming, if necessary
The types of pages are as follows:

Unreclaimable
  Free pages (included in buddy system lists)
  Reserved pages (with PG_reserved flag set)
  Pages dynamically allocated by the kernel
  Pages in the Kernel Mode stacks of the processes
  Temporarily locked pages (with PG_locked flag set)
  Memory locked pages (in memory regions with VM_LOCKED flag set)
Swappable
  Anonymous pages in User Mode address spaces
  Mapped pages of the tmpfs filesystem (e.g., pages of IPC shared memory)
Syncable
  Mapped pages in User Mode address spaces
  Pages included in the page cache and containing data of disk files
  Block device buffer pages
  Pages of some disk caches (e.g., the inode cache)
Discardable
  Unused pages included in memory caches (e.g., slab allocator caches)
  Unused pages of the dentry cache
In the above table, a page is said to be mapped if it maps a portion of a file. For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as well as any other page included in the page cache. In almost all cases, mapped pages are syncable: in order to reclaim the page frame, the kernel must check whether the page is dirty and, if necessary, write the page contents in the corresponding disk file.

Conversely, a page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous). In order to reclaim the page frame, the kernel must save the page contents in a dedicated disk partition or disk file called the "swap area"; therefore, all anonymous pages are swappable.

When the PFRA must reclaim a page frame belonging to the User Mode address space of a process, it must take into consideration whether the page frame is shared or non-shared . A shared page frame belongs to multiple User Mode address spaces, while a non-shared page frame belongs to just one. Notice that a non-shared page frame might belong to several lightweight processes referring to the same memory descriptor. Shared page frames are typically created when a process spawns a child or when two or more processes access the same file by means of a shared memory mapping

PFRA algorithm considerations:

Free the "harmless" pages first: Pages included in disk and memory caches not referenced by any process should be reclaimed before pages belonging to the User Mode address spaces of the processes; in the former case, in fact, the page frame reclaiming can be done without modifying any Page Table entry. As we will see in the section "The Least Recently Used (LRU) Lists" later in this chapter, this rule is somewhat mitigated by introducing a "swap tendency factor."
Make all pages of a User Mode process reclaimable: With the exception of locked pages, the PFRA must be able to steal any page of a User Mode process, including the anonymous pages. In this way, processes that have been sleeping for a long period of time will progressively lose all their page frames.
Reclaim a shared page frame by unmapping at once all page table entries that reference it: When the PFRA wants to free a page frame shared by several processes, it clears all page table entries that refer to the shared page frame, and then reclaims the page frame.
Reclaim "unused" pages only: The PFRA uses a simplified Least Recently Used (LRU) replacement algorithm to classify pages as active and inactive. If a page has not been accessed for a long time, the probability that it will be accessed in the near future is low and it can be considered "inactive;" on the other hand, if a page has been accessed recently, the probability that it will continue to be accessed is high and it must be considered as "active." The main idea behind the LRU algorithm is to associate a counter storing the age of the page with each page in RAM that is, the interval of time elapsed since the last access to the page. This counter allows the PFRA to reclaim the oldest page of any process. Some computer platforms provide sophisticated support for LRU algorithms; unfortunately, 80 x 86 processors do not offer such a hardware feature, thus the Linux kernel cannot rely on a page counter that keeps track of the age of every page. To cope with this restriction, Linux takes advantage of the Accessed bit included in each Page Table entry, which is automatically set by the hardware when the page is accessed; moreover, the age of a page is represented by the position of the page descriptor in one of two different lists
Active vs inactive memory
The PFRA classifies memory into active and inactive. /proc/meminfo provides the current active and inactive memory. Here is an example:

[root@server]# cat /proc/meminfo
MemTotal: 132093140 kB
MemFree: 591272 kB
Buffers: 239488 kB
Cached: 125650056 kB
SwapCached: 0 kB
Active: 25157088 kB
Inactive: 103410468 kB
HighTotal: 0 kB
HighFree: 0 kB
<snip>

This shows that active memory is 25 GB while inactive is 103 GB. Starting from Linux Kernel 2.6.xx onwards these functions are handled by pdflush and kswapd and the Page Frame Reclaiming Algorithm.

Linux maintains two lists in the page cache - the Active List and the Inactive List. The Page Frame Reclaiming Algorithm gathers pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the PFRA gathers the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.

Clearly, two page states ("active" and "inactive") are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is "inactive" for most of the time, the access makes it "active," thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem, because the PFRA has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.

The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list; it is also used to double the number of "missing accesses" required to move a page from the active list to the inactive list (see below). For instance, suppose that a page in the inactive list has the PG_referenced flag set to 0. The first page access sets the value of the flag to 1, but the page remains in the inactive list. The second page access finds the flag set and causes the page to be moved in the active list. If, however, the second access does not occur within a given time interval after the first one, the page frame reclaiming algorithm may reset the PG_referenced flag.

The active and inactive memory can be used to infer a bunch of stuff as follows -

Active (file) can be used to determine what portion of the disk cache is actively in use
Inactive memory is the best candidate for reclaiming, so low inactive memory means you are low on memory and the kernel may have to swap out process pages, write cache pages back to disk, or, in the worst case, if it runs out of swap space, begin killing processes
Rough PFRA algorithm
Memory is divided into memory used by processes, disk cache, free memory and memory used by the kernel
Periodically, pages are marked as active or inactive based on whether they have been accessed recently
Periodically, or when memory is low, pages are reclaimed from the inactive list first and then from the active list, as follows:
The page to be reclaimed must be swappable, syncable or discardable
If the page is dirty it is written out to disk and reclaimed
If the page belongs to a user-mode process it is written out to swap space
Pages are reclaimed using the active/inactive lists in an LRU manner as described above
Depending on the "swappiness" variable, pages of a user-mode process may be preferred over disk cache pages when reclaiming memory
If there are very few discardable and syncable pages and the swap space is full, the system runs out of memory and invokes the OOM killer
OOM
Despite the PFRA effort to keep a reserve of free page frames, it is possible for the pressure on the virtual memory subsystem to become so high that all available memory becomes exhausted. This situation could quickly induce a freeze of every activity in the system: the kernel keeps trying to free memory in order to satisfy some urgent request, but it does not succeed because the swap areas are full and all disk caches have already been shrunken. As a consequence, no process can proceed with its execution, thus no process will eventually free up the page frames that it owns.

To cope with this dramatic situation, the PFRA makes use of a so-called out of memory (OOM) killer, which selects a process in the system and abruptly kills it to free its page frames. The OOM killer is like a surgeon that amputates the limb of a man to save his life: losing a limb is not a nice thing, but sometimes there is nothing better to do.

The out_of_memory() function is invoked when free memory is very low and the PFRA has not succeeded in reclaiming any page frames. The function selects a victim among the existing processes, then invokes oom_kill_process() to perform the sacrifice.

Of course the process is not picked at random. The selected process should satisfy several requisites:

The victim should own a large number of page frames, so that the amount of memory that can be freed is significant. (As a countermeasure against "fork-bomb" processes, the function also considers the amount of memory eaten by all children of the parent.)
Killing the victim should lose only a small amount of work; it is not a good idea to kill a batch process that has been working for hours or days.
The victim should be a low static-priority process; users tend to assign lower priorities to less important processes.
The victim should not be a process with root privileges; such processes usually perform important tasks.
The victim should not directly access hardware devices (such as the X Window server), because the hardware could be left in an unpredictable state.
The victim cannot be swapper (process 0), init (process 1), or any other kernel thread.
The function scans every process in the system, uses an empirical formula to compute from the above rules a value that denotes how good a candidate that process is, and returns the process descriptor address of the "best" candidate for eviction. The out_of_memory() function then invokes oom_kill_process() to send a deadly signal, usually SIGKILL, either to a child of that process or, if this is not possible, to the process itself. The oom_kill_process() function also kills all clones (LWPs) that share the same memory descriptor with the selected victim.
One indicator of running into OOM is to look at the combination of free memory, Inactive memory and free swap in /proc/meminfo as explained below -

[user@server ~]$ cat /proc/meminfo
MemTotal: 12305340 kB
MemFree: 79968 kB
Buffers: 165376 kB
Cached: 3500048 kB
SwapCached: 0 kB
Active: 9819744 kB
Inactive: 1787500 kB
SwapTotal: 2048276 kB
SwapFree: 2047668 kB
Dirty: 80108 kB

In the above example -

Free memory is 79 MB
This means whenever the kernel requires additional memory it must reclaim memory by swapping out process pages to swap or writing file pages to disk. The primary candidate for reclaiming memory is the Inactive memory which in the above case is a healthy 1.7 GB. If there is no inactive memory to reclaim then the kernel would look at active memory. Lastly if no active file pages are available to write to disk, and all active process pages have been swapped out OR the swap space is full then the OOM killer would be activated.
If your server ever has an issue where the OOM killer was activated you have seriously neglected your memory monitoring. This condition must NEVER take place on any server.

Using drop_cache
Check http://linux-mm.org/Drop_Caches to learn how to drop the page cache in Linux. You can experiment with this command in combination with the output of meminfo (Cached, Active memory, Inactive memory) and fincore to determine how much of your data store is typically loaded into cache within how much time and what portion of it is extremely active.
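
A hedged sketch of that experiment (dropping caches needs root; fincore is provided by util-linux or linux-ftools depending on the distro, and the data file path is only a placeholder):

sync                                    # flush dirty pages first
echo 3 > /proc/sys/vm/drop_caches       # drop page cache, dentries and inodes
grep -E 'Cached|Active|Inactive' /proc/meminfo
fincore /path/to/datafile               # later: see how much of the file has been pulled back into cache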

Measuring memory utilization
atop
MEM | tot 126.0G | free 6.4G | cache 113.2G | dirty 924.9M | buff 394.7M | slab 1.8G
SWP | tot 2.0G | free 2.0G | vmcom 10.1G | vmlim 65.0G |

atop shows the system memory as a whole broken up as -

MEM
tot: total physical memory
free: free physical memory
cache: amount of memory used for the page cache
dirty: amount of page cache that is currently dirty
buff: the amount of memory used for filesystem meta data
slab: amount of memory being used for kernel mallocs
SWP
tot: total amount of swap space on disk
free: amount of swap space that is free
PAG (appears only if there is data to show in the interval)
scan: number of scanned pages due to the fact that free memory drops below a particular threshold
stall: number of times that the kernel tries to reclaim pages due to an urgent need
swin/swout: Also the number of memory pages the system read from swap space ('swin') and the number of memory pages the system wrote to swap space ('swout') are shown
/proc/meminfo
[bhavin.t@mongo-history-1 ~]$ cat /proc/meminfo
MemTotal: 62168992 kB
MemFree: 287900 kB
Buffers: 12264 kB
Cached: 59953784 kB
SwapCached: 0 kB
Active: 29934172 kB
Inactive: 30168836 kB
Active(anon): 137004 kB
Inactive(anon): 24 kB
Active(file): 29797168 kB
Inactive(file): 30168812 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 10832 kB
Writeback: 0 kB
AnonPages: 136704 kB
Mapped: 863444 kB
Shmem: 68 kB
Slab: 1526616 kB
SReclaimable: 1498556 kB
SUnreclaim: 28060 kB
KernelStack: 1520 kB
PageTables: 110824 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 31084496 kB
Committed_AS: 393640 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 116104 kB
VmallocChunk: 34359620200 kB
DirectMap4k: 63496192 kB
DirectMap2M: 0 kB

MemTotal: Total usable ram (i.e. physical ram minus a few reserved bits and the kernel binary code)
MemFree: The sum of LowFree+HighFree (essentially total free memory)
Buffers: Relatively temporary storage for raw disk blocks; shouldn't get tremendously large (20 MB or so)
Cached: Page cache. Doesn't include SwapCached
SwapCached: Memory that once was swapped out, is swapped back in but still also is in the swapfile (if memory is needed it doesn't need to be swapped out AGAIN because it is already in the swapfile. This saves I/O)
Active: Memory that has been used more recently and usually not reclaimed unless absolutely necessary.
anon: active memory that is not file backed (see http://www.linuxjournal.com/article/10678 for a description of anonymous pages). This will typically be the larger chunk of active memory on an app-server machine that does not host a database
file: active memory that is file backed. This will typically be the larger chunk of active memory on a data-store machine that reads/writes from disk
Inactive: Memory which has been less recently used. It is more eligible to be reclaimed for other purposes
HighTotal/HighFree: Highmem is all memory above ~860MB of physical memory. Highmem areas are for use by userspace programs, or for the pagecache. The kernel must use tricks to access this memory, making it slower to access than lowmem.
LowTotal/LowFree: Lowmem is memory which can be used for everything that highmem can be used for, but it is also available for the kernel's use for its own data structures. Among many other things, it is where everything from the Slab is allocated. Bad things happen when you're out of lowmem.
SwapTotal: total amount of swap space available
SwapFree: Memory which has been evicted from RAM, and is temporarily on the disk
Dirty: Memory which is waiting to get written back to the disk
Writeback: Memory which is actively being written back to the disk
Mapped: files which have been mmaped, such as libraries
Slab: in-kernel data structures cache
Committed_AS — The total amount of memory, in kilobytes, estimated to complete the workload. This value represents the worst-case scenario, and also includes swap memory.
PageTables — The total amount of memory, in kilobytes, dedicated to the lowest page table level.
VMallocTotal — The total amount of memory, in kilobytes, of total allocated virtual address space.
VMallocUsed — The total amount of memory, in kilobytes, of used virtual address space.
VMallocChunk — The largest contiguous block of memory, in kilobytes, of available virtual address space.
/proc/vmstat
This file shows detailed virtual memory statistics from the kernel. Most of the counters explained below are available only if you have kernel compiled with VM_EVENT_COUNTERS config option turned on. That's so because most of the parameters below have no function for the kernel itself, but are useful for debugging and statistics purposes

[user@server proc]$ cat /proc/vmstat
nr_anon_pages 2014051
nr_mapped 11691
nr_file_pages 890051
nr_slab_reclaimable 128956
nr_slab_unreclaimable 9670
nr_page_table_pages 5628
nr_dirty 15158
nr_writeback 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 4737
pgpgin 2280999
pgpgout 76513335
pswpin 0
pswpout 152
pgalloc_dma 1
pgalloc_dma32 27997500
pgalloc_normal 108826482
pgfree 136842914
pgactivate 24663564
pgdeactivate 8083378
pgfault 266178186
pgmajfault 2228
pgrefill_dma 0
pgrefill_dma32 6154199
pgrefill_normal 19920764
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 3203616
pgscan_kswapd_normal 4431168
pgscan_direct_dma 0
pgscan_direct_dma32 1056
pgscan_direct_normal 2368
pginodesteal 0
slabs_scanned 391808
kswapd_steal 7598807
kswapd_inodesteal 0
pageoutrun 49495
allocstall 37
pgrotated 154

nr_anon_pages
nr_mapped - pages mapped by files
nr_file_pages -
nr_slab_reclaimable - pages from the kernel slab memory usage that can be reclaimed
nr_slab_unreclaimable 9670 - pages from the kernel slab memory usage that cannot be reclaimed
nr_page_table_pages 5628 - pages allocated to page tables
nr_dirty 15158 - dirty pages waiting to be written to disk
nr_writeback 0 - dirty pages currently being written to disk
nr_unstable 0
nr_bounce 0
nr_vmscan_write 4737
pgpgin 2280999 - page ins since last boot
pgpgout 76513335 - page outs since last boot
pswpin 0 - swap ins since last boot
pswpout 152 - swap outs since last boot
pgalloc_dma 1
pgalloc_dma32 27997500
pgalloc_normal 108826482
pgfree 136842914 - page frees since last boot
pgactivate 24663564 - page activations since last boot
pgdeactivate 8083378 - page deactivations since last boot
pgfault 266178186 - minor faults since last boot
pgmajfault 2228 - major faults since last boot
pgrefill_dma 0
pgrefill_dma32 6154199
pgrefill_normal 19920764 - page refills since last boot
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 3203616
pgscan_kswapd_normal 4431168 - pages scanned by kswapd since boot
pgscan_direct_dma 0
pgscan_direct_dma32 1056
pgscan_direct_normal 2368 - pages reclaimed since boot
pginodesteal 0
slabs_scanned 391808
kswapd_steal 7598807
kswapd_inodesteal 0
pageoutrun 49495 - number of times kswapd called page reclaim
allocstall 37 - number of times page reclaim was called directly (low memory)
pgrotated 154
Of the above the following are important -

nr_dirty - signifies amount of memory waiting to be written to disk. If you have a power loss you can expect to lose this much data, unless your application has some form of journaling (eg Transaction logs)
pswpin & pswpout - should never be positive. This means the kernel is having to write memory pages to disk to free up memory for some other process or disk cache. One may see occasional swapping on the machine due to the kernel swapping out a process page in favor of a disk cache page due to the swappiness factor set
pgfree 136842914 - page frees since last boot
pgactivate 24663564 - page activations since last boot
pgdeactivate 8083378 - page deactivations since last boot
pgmajfault 2228 - shouldn't be too many. Page faults are normal, but major page faults are generally rare; they may involve disk activity and hence should ideally not occur frequently.
allocstall 37 - should not occur often. This signifies that the periodic running of kswapd could not free up adequate pages and for these many number of times the kernel had to trigger page reclaims manually
vmstat
[user@server ~]$ vmstat -a -S M 5
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 2 6593 394 115893 0 0 690 767 1 2 32 12 53 4 0
3 0 2 6585 394 115901 0 0 204 6310 6005 23103 29 15 53 2 0
2 1 2 6549 394 115912 0 0 182 4707 5102 20867 38 13 48 2 0

[user@server ~]$ vmstat -S M 5
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free inact active si so bi bo in cs us sy id wa st
4 0 2 6390 48082 71527 0 0 690 767 1 2 32 12 53 4 0
2 0 2 6383 48082 71534 0 0 87 4614 5859 21944 34 13 51 1 0
3 0 2 6376 48082 71543 0 0 137 5164 4925 19994 23 12 64 1 0

vmstat shows the following memory related fields -

swpd: the amount of virtual memory used
free: the amount of idle memory
buff: the amount of memory used as buffers
cache: the amount of memory used as cache
inact: the amount of inactive memory (-a option)
active: the amount of active memory (-a option)
/proc - per process memory stats
[user@server ~]$ cat /proc/7278/status
<snip>
FDSize: 1024
Groups: 26
VmPeak: 3675100 kB
VmSize: 3675096 kB
VmLck: 0 kB
VmHWM: 81160 kB
VmRSS: 81156 kB
VmData: 944 kB
VmStk: 84 kB
VmExe: 3072 kB
VmLib: 2044 kB
VmPTE: 244 kB
StaBrk: 0ac3c000 kB
Brk: 0ac82000 kB
StaStk: 7fff35863220 kB
Threads: 1

FDSize: Number of file descriptor slots currently allocated.
Groups: Supplementary group list.
VmPeak: Peak virtual memory size.
VmSize: Virtual memory size.
VmLck: Locked memory size (see mlock(3)).
VmHWM: Peak resident set size ("high water mark").
VmRSS: Resident set size.
VmData, VmStk, VmExe: Size of data, stack, and text segments.
VmLib: Shared library code size.
VmPTE: Page table entries size (since Linux 2.6.10).
Threads: Number of threads in process containing this thread.
[user@server ~]$ cat /proc/7278/statm
918774 20289 20186 768 0 257 0

Table 1-2: Contents of the statm files (as of 2.6.8-rc3)
..............................................................................
Field Content
size total program size (pages) (same as VmSize in status)
resident size of memory portions (pages) (same as VmRSS in status)
shared number of pages that are shared (i.e. backed by a file)
trs number of pages that are 'code' (not including libs; broken,
includes data segment)
lrs number of pages of library (always 0 on 2.6)
drs number of pages of data/stack (including libs; broken,
includes library text)
dt number of dirty pages (always 0 on 2.6)
..............................................................................

[user@server ~]$ cat /proc/7278/stat
7278 (postgres) S 1 7257 7257 0 -1 4202496 36060376 10845160168 0 749 20435 137212 158536835 39143290 15 0 1 0 50528579 3763298304 20289 18446744073709551615 4194304 7336916 140734091375136 18446744073709551615 225773929891 0 0 19935232 84487 0 0 0 17 2 0 0 12

Table 1-3: Contents of the stat files (as of 2.6.22-rc3)
..............................................................................
Field Content
pid process id
tcomm filename of the executable
state state (R is running, S is sleeping, D is sleeping in an
uninterruptible wait, Z is zombie, T is traced or stopped)
ppid process id of the parent process
pgrp pgrp of the process
sid session id
tty_nr tty the process uses
tty_pgrp pgrp of the tty
flags task flags
min_flt number of minor faults
cmin_flt number of minor faults with child's
*maj_flt number of major faults
cmaj_flt number of major faults with child's
utime user mode jiffies
stime kernel mode jiffies
cutime user mode jiffies with child's waited for
cstime kernel mode jiffies with child's waited for
priority priority level
nice nice level
num_threads number of threads
it_real_value (obsolete, always 0)
start_time time the process started after system boot
vsize virtual memory size
rss resident set memory size
rsslim current limit in bytes on the rss
start_code address above which program text can run
end_code address below which program text can run
start_stack address of the start of the stack
esp current value of ESP
eip current value of EIP
pending bitmap of pending signals (obsolete)
blocked bitmap of blocked signals (obsolete)
sigign bitmap of ignored signals (obsolete)
sigcatch bitmap of caught signals (obsolete)
wchan address where process went to sleep
0 (place holder)
0 (place holder)
exit_signal signal to send to parent thread on exit
task_cpu which CPU the task is scheduled on
rt_priority realtime priority
policy scheduling policy (man sched_setscheduler)
blkio_ticks time spent waiting for block IO
..............................................................................

[user@server ~]$ cat /proc/7278/smaps
00400000-00700000 r-xp 00000000 08:03 6424710 /usr/local/postgres/pgsql8.2.3/bin/postgres
Size: 3072 kB
Rss: 2108 kB
Shared_Clean: 2108 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Swap: 0 kB
2b3a78a33000-2b3b5493f000 rw-s 00000000 00:09 1114115 /SYSV0052e2c1 (deleted)
Size: 3603504 kB
Rss: 2129800 kB
Shared_Clean: 54300 kB
Shared_Dirty: 2075500 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Swap: 0 kB

smaps shows, for each mapping in a process (libraries, data, program text), how much memory it uses and what portion of it is shared. For instance, the two entries snipped out above show the postgres executable taking about 2 MB of shared memory and the postgres shared-memory cache taking about 2 GB of shared memory.

[root@server]# pmap -x 30850 | less
Address Kbytes RSS Dirty Mode Mapping
0000000040000000 36 0 0 r-x-- java
0000000040108000 8 8 8 rwx-- java
0000000041373000 1469492 1469352 1469352 rwx-- [ anon ]
000000071ae00000 45120 44740 44740 rwx-- [ anon ]
000000071da10000 38848 0 0 ----- [ anon ]
0000000720000000 3670016 3670016 3670016 rwx-- [ anon ]
00007ff67286f000 12 0 0 ----- [ anon ]
00007ff672872000 1016 24 24 rwx-- [ anon ]
00007ff672970000 12 0 0 ----- [ anon ]
00007ff672973000 1016 24 24 rwx-- [ anon ]
...

top
Mem: 132093140k total, 128645860k used, 3447280k free, 413200k buffers
Swap: 2096472k total, 2596k used, 2093876k free, 122750144k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA nFLT nDRT COMMAND
21827 postgres 15 0 3626m 2.1g 2.0g S 15.5 1.6 10:20.94 1.5g 3072 32m 0 0 postgres
19638 postgres 15 0 3626m 2.1g 2.0g S 14.5 1.6 14:03.23 1.5g 3072 32m 0 0 postgres
27306 postgres 15 0 3618m 2.1g 2.0g R 11.6 1.6 9:34.90 1.5g 3072 24m 0 0 postgres
19673 postgres 15 0 3626m 2.1g 2.0g S 10.9 1.6 8:40.20 1.5g 3072 32m 0 0 postgres
22068 postgres 15 0 3626m 2.1g 2.0g S 10.2 1.6 15:20.89 1.5g 3072 32m 0 0 postgres
4339 postgres 15 0 3618m 2.1g 2.0g S 8.6 1.6 8:04.42 1.5g 3072 24m 0 0 postgres

top shows the following global memory related fields -

Mem: physical memory (total, used, free, used for buffers)
Swap: swap space (total, used, free); the last value on this line is actually 'cached', the amount of memory used for the disk (page) cache, not swap
top shows the following memory related fields per process -
%MEM – Memory usage (RES) - A task's currently used share of available physical memory
VIRT – Virtual Image (kb) - The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out. (Note: you can define the STATSIZE=1 environment variable and VIRT will be calculated from the VmSize field of /proc/#/status.)
SWAP – Swapped size (kb) - The swapped-out portion of a task's total virtual memory image. SWAP is simply calculated as VIRT - RES, so it also counts memory that was never swapped out; in my opinion this field is misleading.
RES – Resident size (kb) - The non-swapped physical memory a task has used. RES = CODE + DATA. RES includes SHR
CODE – Code size (kb) - The amount of physical memory devoted to executable code, also known as the 'text resident set' size or TRS.
DATA – Data+Stack size (kb) - The amount of physical memory devoted to other than executable code, also known as the 'data resident set' size or DRS.
SHR – Shared Mem size (kb) - The amount of shared memory used by a task. It simply reflects memory that could be potentially shared with other processes.
nFLT – Page Fault count - The number of major page faults that have occurred for a task. A page fault occurs when a process attempts to read from or write to a virtual page that is not currently present in its address space. A major page fault is when disk access is involved in making that page available.
nDRT – Dirty Pages count - The number of pages that have been modified since they were last written to disk. Dirty pages must be written to disk before the corresponding physical memory location can be used for some other virtual page.
vmtouch
vmtouch is a great tool for learning about and controlling the file-system cache on Unix and Unix-like systems. You can use it to find out how much of a file is in memory, to evict files from memory, and so on.

Example 1
How much of the /bin/ directory is currently in cache?

$ vmtouch /bin/
Files: 92
Directories: 1
Resident Pages: 348/1307 1M/5M 26.6%
Elapsed: 0.003426 seconds

Example 2
We have 3 big datasets, a.txt, b.txt, and c.txt but only 2 of them will fit in memory at once. If we have a.txt and b.txt in memory but would now like to work with b.txt and c.txt, we could just start loading up c.txt but then our system would evict pages from both a.txt (which we want) and b.txt (which we don't want).

So let's give the system a hint and evict a.txt from memory, making room for c.txt:

$ vmtouch -ve a.txt
Evicting a.txt

Files: 1
Directories: 0
Evicted Pages: 42116 (164M)
Elapsed: 0.076824 seconds

fincore
fincore is a great tool that can be used to measure how much of a file is currently in the disk cache. This can be used to determine rough cache usage for an application.

root@xxxxxx:/var/lib/mysql/blogindex# fincore --pages=false --summarize --only-cached *
stats for CLUSTER_LOG_2010_05_21.MYI: file size=93840384 , total pages=22910 , cached pages=1 , cached size=4096, cached perc=0.004365
stats for CLUSTER_LOG_2010_05_22.MYI: file size=417792 , total pages=102 , cached pages=1 , cached size=4096, cached perc=0.980392
stats for CLUSTER_LOG_2010_05_23.MYI: file size=826368 , total pages=201 , cached pages=1 , cached size=4096, cached perc=0.497512
stats for CLUSTER_LOG_2010_05_24.MYI: file size=192512 , total pages=47 , cached pages=1 , cached size=4096, cached perc=2.127660
stats for CLUSTER_LOG_2010_06_03.MYI: file size=345088 , total pages=84 , cached pages=43 , cached size=176128, cached perc=51.190476
stats for CLUSTER_LOG_2010_06_04.MYD: file size=1478552 , total pages=360 , cached pages=97 , cached size=397312, cached perc=26.944444
stats for CLUSTER_LOG_2010_06_04.MYI: file size=205824 , total pages=50 , cached pages=29 , cached size=118784, cached perc=58.000000
Optimizing memory usage
Optimizing memory usage consists of the following principles -

Ensure memory never runs out
This can be achieved as follows -

Reduce your application's memory footprint. Try to use memory efficiently within your application and choose memory-efficient data structures.
Perform proper capacity planning to determine memory usage during peak loads. Account for all concurrently running processes, the disk cache requirements of every running application, and the free memory needed by the OS and other applications. Also account for backup scripts, or any script that reads or writes large amounts of data from disk, since these typically wipe out your disk cache unless they are configured to use O_DIRECT.
Use LWPs (threads) instead of processes for concurrency. Even when using processes try to use shared memory for inter-process communication and common data
Monitor your memory utilization and determine if any process is hogging up too much memory
No swapping
Your server should never swap. Note: some swapping may occur if the kernel is configured to prefer using memory for the disk cache over process space, but even that should be minimal. Swapping is bad and should essentially never happen.
You may want to tune /proc/sys/vm/swappiness. Details at http://www.westnet.com/~gsmith/content/linux-pdflush.htm and http://people.redhat.com/nhorman/papers/rhel4_vm.pdf
In fact, if you have done appropriate capacity planning you can configure your machine without any swap space at all (refer to http://david415.wordpress.com/2009/11/21/running-linux-with-no-swap/). Of course, you then have to be completely certain about your capacity planning.
Rare page faulting
Page faults occur when a new process is forked or created, or when an existing process requests additional memory. On a constantly running server these situations should be infrequent, so you should see very little page faulting, especially major faults (minor faults are fine since they require no disk access; major faults may require disk access).
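To spot-check this, min_flt and maj_flt can be read straight from /proc/<pid>/stat (fields 10 and 12 in the table above). A minimal Java sketch, assuming a Linux host and a pid passed as the first argument:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PageFaults {
    public static void main(String[] args) throws IOException {
        String stat = new String(Files.readAllBytes(Paths.get("/proc/" + args[0] + "/stat")));
        // Field 2 (the executable name) is wrapped in parentheses and may contain spaces,
        // so cut at the closing ')' first and then split the rest on whitespace.
        String[] f = stat.substring(stat.indexOf(')') + 2).trim().split("\\s+");
        System.out.println("minor faults: " + f[7]);  // min_flt (field 10 overall)
        System.out.println("major faults: " + f[9]);  // maj_flt (field 12 overall)
    }
}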

Maximize disk cache hits for reads
Determine your disk cache needs appropriately. For example, if your database is 100 GB and about 10% of it is accessed about 95% of the time, then you need about 10 GB to be available in the disk cache. You can use fincore to determine what portion of your data files is in the page cache at any given time; this tells you which files are being cached and what percentage of them stays in cache once the cache is warm. You can also combine drop_caches with periodic fincore dumps to measure the rate at which your data files are loaded back into cache, which gives you an idea of what portion of your data is most frequently accessed. Lastly, if you have implemented flashcache in LRU mode, the cache hits and misses in flashcache give a rough sense of how much data is frequently used and how much RAM you may want to dedicate to the disk cache.
Disk cache replacement is LRU, which works for normal access patterns. However, a backup script, or any script that reads or writes a large amount of data in one pass, can wipe out your disk cache. It is therefore important to make sequential backup and similar scripts use O_DIRECT, which bypasses the disk cache and prevents it from being wiped out, and to run these kinds of jobs when your system has minimal IO load.
Avoid double buffering - it only wastes space. For example, if your application caches data from disk in its own heap, you are spending memory twice on the most frequently used data. It may make sense to manage your own application cache, since caching data in exactly the format it is needed in can save significant CPU cycles compared with the page cache that Linux maintains. In that case, consider using O_DIRECT when reading data that is not in your application cache. Alternatively, leave caching entirely to the operating system and do not maintain a cache in your application at all.
Backup processes or processes that linearly read / write to a large chunk of the disk should not wipe out the page cache.
The disk cache replacement algorithm is provided by the operating system and is LRU. It is not currently practical to change this, although depending on your application a different algorithm might be more optimal.
Check the disk IO operations. If they are predominantly write operations with minimal reads then your disk cache is likely serving most of the reads. If however the disk IO operations are predominantly reads you could improve performance by optimizing your disk cache
Each application has its own data-access behavior. Combining multiple applications on a single machine so that they share their dynamic memory results in sub-optimal memory utilization. For example, consider a web-hosting server holding both site data and a database. Site data typically consists of static files, HTML, code, images, videos, and other media, and its total volume is usually much larger than the database (on typical servers we have seen site data run into terabytes while databases are in gigabytes). The database, however, generates more IOPS than the site content. If both are deployed on the same server and share RAM, a larger portion of the RAM ends up caching site data rather than database data, even though the latter is accessed more frequently and deserves a larger share of the disk cache. Simply separating the two applications lets you optimize what ends up in each disk cache.
Here is a small probabilistic model that shows how segregating the disk caches for different applications can help optimize memory usage -
Say we have 6 blocks of data - A, B, C, D, E, F
A gets accessed 10 times every minute
B, C, D, E, F get accessed 2 times every minute
Let's say your cache can only store one of these blocks.
The probability of finding A in the cache is 50%, and the probability of finding one of B, C, D, E, F is also 50%.
However, if the cache holds one of B, C, D, E, or F, it is only useful 2 times per minute, so half of the time the cache is being used sub-optimally.
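The arithmetic can be checked with a small sketch (the rates are the hypothetical ones from the list above):

public class CacheHitModel {
    public static void main(String[] args) {
        double accessesA = 10;              // A: 10 accesses per minute
        double accessesOthers = 5 * 2;      // B..F: 2 accesses per minute each
        double total = accessesA + accessesOthers;

        // With a one-block LRU cache, the cached block is simply the last one accessed,
        // so the chance of finding A there is A's share of the accesses.
        System.out.printf("P(A cached)             = %.0f%%%n", 100 * accessesA / total); // 50%
        System.out.printf("hits/min if A cached    = %.0f%n", accessesA);                 // 10
        System.out.printf("hits/min if B..F cached = %.0f%n", 2.0);                       // 2
    }
}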
Tuning disk cache
Refer to http://www.westnet.com/~gsmith/content/linux-pdflush.htm and http://www.cyberciti.biz/faq/linux-kernel-tuning-virtual-memory-subsystem/ for tips on tuning various parameters of pdflush and kswapd which control the page reclaim logic of your disk cache

Maximize disk cache merges for writes
If your application does not need to fsync() data immediately, you can get a considerable performance boost from the write-back nature of the disk cache. Most databases, mail servers, and so on fsync() on every write because they cannot afford to lose data. But if you have built a custom data store, you may have a model where the same data is written synchronously to multiple nodes. In that case not all of the nodes need to fsync() the data, since a replica is available in case of a total node failure. The total number of physical writes then drops, because many updates cancel earlier writes and multiple writes can be merged, resulting in fewer IOPS, which helps on both flash and SATA drives.

Tuning kernel vm parameters
Refer to http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt


Linux zero-copy

Many people have heard of zero-copy and know that plenty of high-performance middleware (RocketMQ, for example) uses it to improve efficiency.

But what exactly is zero-copy?

Today I want to work it out thoroughly.

What is zero-copy?

To understand a technique, we should first understand the problem it is meant to solve. Only by understanding the problem can we understand the answer.

Start with a simple scenario: a server needs to send the contents of a file on disk to a client over the network. What actually happens here?

You might think this is just a call between two systems, with data copied from one machine to another. In reality, what happens at the operating-system level is much more involved.

In the traditional approach the data is copied at least four times, as detailed below. Zero-copy was created to cut down the number of copies and improve efficiency.

Zero-copy does not mean the data is never copied at all; it means that, from the kernel's point of view, no copies are made between kernel space and user space.

Don't worry if these terms are unfamiliar; they are explained below.

Background

User space and kernel space in Linux

First, the Linux operating system distinguishes between user mode (user context) and kernel mode (kernel context).

Apart from the region used to boot the system, the memory space is divided into two main parts: kernel space and user space. They differ in size, privileges, and purpose. File and network operations involve switching between these two contexts, and data copying between them is unavoidable.

Kernel space is the memory used by Linux itself, mainly for scheduling, memory allocation, access to hardware, and other kernel logic.

User space is the memory given to each process. User space has no permission to access kernel-space resources, so when an application needs them it must go through a system call: switch from user space to kernel space, perform the operation, and then switch back to user space.

DMA

DMA (Direct Memory Access) allows peripheral devices to transfer I/O data directly to and from main memory without involving the CPU, which frees the CPU to do other work.
There is no DMA-like mechanism for moving data between user space and kernel space, so those transfers need the CPU for the entire copy. That is exactly why zero-copy techniques exist: to reduce or avoid unnecessary CPU data copies.

The traditional approach

In the ordinary approach, this simple transfer involves at least four data copies.

In the figure below, the upper half shows the context switches and the lower half shows the copy operations.

img

Using Java as an example:

Step 1: the JVM issues a read() system call to the operating system, which switches the context from user mode to kernel mode. The data is copied from disk into a buffer in kernel address space; this copy is performed by the DMA engine. This is the first copy.

Step 2: the read() call completes and returns, switching the context back to user mode, and the data is copied from the kernel buffer into the user-space buffer. This copy is done by the CPU. This is the second copy.

Next, the data has to be handed to the network card to be sent.

Step 3: the JVM issues a write() system call to the OS. The context switches from user mode to kernel mode, and write() copies the data from the user-space buffer into the kernel's socket buffer. The data is back in a kernel buffer, although not the same one as in step 1. This is the third copy, done by the CPU.

Step 4: write() returns and the system switches back to user mode. Meanwhile, the data in the socket buffer is copied by DMA to the hardware (NIC) buffer. (This is an independent, asynchronous operation, not tied to the write() call.)

The next figure shows this more clearly.

img

As you can see, the data is copied back and forth between kernel space and user space, accompanied by context switches. Context switching is CPU-intensive work and data copying is I/O-intensive work; if a simple transfer has to be this complicated, efficiency suffers badly. The ultimate goal of zero-copy is to eliminate the redundant context switches and data copies and so improve efficiency.
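In Java, this traditional path is just a read-and-write loop over streams. A minimal sketch (the file name and port are placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class TraditionalCopy {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9000);
             Socket client = server.accept();
             InputStream file = new FileInputStream("data.bin");
             OutputStream out = client.getOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            // read() copies kernel buffer -> user-space buf, write() copies
            // user-space buf -> socket buffer: the two CPU copies described above.
            while ((n = file.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}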

sendfile

To achieve this, the Linux kernel has provided the sendfile() system call since version 2.1. It reduces both the number of data copies and the number of context switches.

img

As the upper half of the figure shows, the sendfile approach needs only two context switches.

As for the data copies:

Step 1: copy from the disk into kernel space, via DMA.

Step 2: copy from the kernel buffer to the socket buffer, done by the CPU. Note that this copy stays entirely inside kernel space; user space is not involved.

Step 3: copy from the socket buffer to the NIC, via DMA.

So at this point we have saved two context switches and eliminated the CPU copies between kernel space and user space.

But there is still one CPU copy inside kernel space, and intuition says this copy is not really necessary either.

Why does this step exist?

Because with ordinary block DMA the source and destination physical addresses must be contiguous, only one physically contiguous block can be transferred at a time, with an interrupt after each block until the transfer completes. That is why the data has to be copied between the two kernel buffers.

Can this step be eliminated?

I/O with Scatter/Gather support

Scatter/Gather means that data scattered across memory can be gathered together during the transfer.

Starting with kernel 2.4, the socket buffer descriptors were improved: a descriptor now carries the starting address and length of the data. During a transfer the hardware simply walks the descriptor list and transfers the data in order, raising a single interrupt when everything is done, which is more efficient than block DMA. In other words, with Scatter/Gather DMA the hardware can fetch all the data directly from the kernel buffer; there is no longer any need to copy it from the kernel buffer into the socket buffer.

img

This removes the remaining CPU copy.

So far, from the user-space point of view there are zero copies, and from the CPU's point of view there are also zero copies.

Zero-copy is achieved.
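In Java this sendfile path is exposed through FileChannel.transferTo(), which the JVM can map to sendfile() on Linux. A minimal sketch (the file name and address are placeholders):

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class SendfileCopy {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = new FileInputStream("data.bin").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo() may transfer fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}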

So, is it perfect now? There is actually still one problem.

The data stays in kernel space the whole time and user space never sees it, so our program cannot modify the data in any way. If we need to modify it, the sendfile approach will not do.

Which brings us to the next technique.

mmap

mmap is memory mapping. It is less efficient than sendfile, but more efficient than the traditional approach.

img

The distinguishing feature of mmap is that the data is not copied from kernel space into user space; instead, user space shares the kernel-space data, so the program can modify it.

In terms of efficiency, mmap still involves four context switches and three data copies; compared with the traditional approach it saves one CPU copy.

Java NIO

In Java NIO, the commonly used FileChannel has a map() method that returns a MappedByteBuffer. A MappedByteBuffer is a direct byte buffer whose memory is a memory-mapped region of a file. map() is implemented on top of mmap, so once the file has been read from disk into the kernel buffer, user space and kernel space share that buffer.
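A minimal sketch of that map() usage (the file name is a placeholder); unlike sendfile, the mapped buffer can be read and modified in place:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapExample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file into this process's address space (mmap under the hood).
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            byte first = buffer.get(0);  // read straight from the shared page-cache pages
            buffer.put(0, first);        // modify in place, no read()/write() copy into user space
        }
    }
}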

Permission Control Design

Flow diagram

One-stop permission flow

Notes

  1. Parsing the IAM token.

    There are two options here:

    1. Parse the JWT locally; the IAM token currently appears to be a JWT.

    2. Call the IAM API remotely to parse it.

Frameworks

JWT

spring-security

Database table design

  • ums_admin: back-office user table. Since we key on the IDM employee id, this table is no longer needed.
  • ums_role: back-office role table
  • ums_admin_role_relation: mapping between employee ids and roles
  • ums_menu: menu table; controls which sidebar menus the front end displays
  • ums_role_menu_relation: mapping between roles and menus

Pages & API endpoints

  1. Get the menu list for an employee id
  2. Create a role
  3. Assign roles to employee ids
  4. Configure menus
  5. Configure the role-to-menu mapping

Interceptor

Intercept each request, parse the JWT, look up the permission information, and assemble the security context.

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.servlet.HandlerInterceptor;

public class AuthenticationInterceptor implements HandlerInterceptor {
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // Read the paas-token from the request
        // Parse the token to obtain the employee id
        // Look up the permission information
        // Assemble the security context
        return true;   // let the request proceed once the context is populated
    }
}
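For the interceptor to take effect it still has to be registered with Spring MVC. A minimal sketch of that registration, assuming a standard Spring MVC setup (the path pattern is an assumption):

import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class WebConfig implements WebMvcConfigurer {
    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Run the authentication interceptor for every request.
        registry.addInterceptor(new AuthenticationInterceptor()).addPathPatterns("/**");
    }
}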

Back-end API

The back-end API enables role-based access control.

  • Implement UserDetailsService to load the user's roles and authority list (a sketch appears at the end of this section)

  • Enable Spring Security with the @EnableWebSecurity annotation

  • Enable @EnableGlobalMethodSecurity(prePostEnabled = true, securedEnabled = true)

    With that in place, the @PreAuthorize, @PostAuthorize, and @Secured annotations can be used.

  • @Secured: checks whether the user has a given role; it can be placed on a method or a class, and its values must start with ROLE_

    @GetMapping("/helloUser")
    @Secured({"ROLE_normal", "ROLE_admin"}) // users with either the normal or the admin role may call helloUser()
    public String helloUser() {
        return "hello,user";
    }

If we require that only users who hold both the admin and the normal role may call helloUser(), @Secured cannot express that.

  • @PreAuthorize is better suited to method-level security; it supports the Spring Expression Language and provides expression-based access control.

    @PreAuthorize checks permissions before the annotated class or method executes; @PostAuthorize checks after execution and is rarely used.

@GetMapping("/helloUser")
@PreAuthorize("hasAnyRole('normal','admin')")
// users with either the normal or the admin role may call helloUser()
public String helloUser() {
    return "hello,user";
}
@GetMapping("/helloUser")
@PreAuthorize("hasRole('normal') AND hasRole('admin')")
// the user must hold both the normal and the admin role
public String helloUser() {
    return "hello,user";
}

Besides hasRole for checking roles, hasAuthority can be used to check authorities.

  • @RolesAllowed: there is also a @RolesAllowed annotation, used much like @Secured. The difference is that @Secured is a Spring Security annotation, while @RolesAllowed is a standard JSR-250 annotation and is not limited to Spring.
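A minimal sketch of the UserDetailsService mentioned above, assuming the roles come out of the ums_* tables through some repository (the hard-coded role list here is a placeholder for that lookup):

import java.util.List;
import java.util.stream.Collectors;
import org.springframework.security.core.GrantedAuthority;
import org.springframework.security.core.authority.SimpleGrantedAuthority;
import org.springframework.security.core.userdetails.User;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.core.userdetails.UsernameNotFoundException;
import org.springframework.stereotype.Service;

@Service
public class UmsUserDetailsService implements UserDetailsService {
    @Override
    public UserDetails loadUserByUsername(String employeeId) throws UsernameNotFoundException {
        // Placeholder: query ums_admin_role_relation / ums_role for this employee id.
        List<String> roles = List.of("ROLE_normal");
        List<GrantedAuthority> authorities = roles.stream()
                .map(SimpleGrantedAuthority::new)
                .collect(Collectors.toList());
        // No real password is needed here: the JWT has already been verified by the interceptor.
        return new User(employeeId, "N/A", authorities);
    }
}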

Because the free variables captured by a Java closure are read-only (effectively final), you cannot assign to a variable declared outside a lambda from inside it.
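A minimal example of that restriction; the usual workaround is to mutate an object (an array cell or an AtomicInteger) instead of reassigning the variable:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class EffectivelyFinal {
    public static void main(String[] args) {
        int sum = 0;
        // IntStream.range(0, 3).forEach(i -> sum += i);  // does not compile: sum must be effectively final
        AtomicInteger total = new AtomicInteger();
        IntStream.range(0, 3).forEach(total::addAndGet);   // mutate the object instead of reassigning
        System.out.println(total.get());                   // 3
    }
}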

Troubleshooting an unresponsive application

A colleague provided the jstack output and a dump file.

The jstack output was analyzed with the online tool https://heaphero.io

image-20220106143845851

image-20220106143855209

image-20220106143905149

It reported that the thread count was too high, and at first I really thought that was the cause: 700 Dubbo handler threads were all in the WAITING state.

Then I looked more carefully at those stacks: WAITING (parking).

image-20220106144140519

Dubbo handles business requests with a thread pool, and the SynchronousQueue in these stacks makes it almost certain that the pool is a CachedThreadPool.

image-20220106144243702

The other thread-pool models use a LinkedBlockingQueue.

A SynchronousQueue has a capacity of zero: when a task is offered, a worker thread must take it immediately or the producer blocks, and a worker that wants a task must wait until one is offered or it blocks. So here the Dubbo thread pool simply had no work to do. It was idle. Nothing wrong there.
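The relationship is easy to verify: the JDK's Executors.newCachedThreadPool() is built on a SynchronousQueue, so an idle worker spends its time parked on that queue. A minimal sketch:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolQueue {
    public static void main(String[] args) throws InterruptedException {
        // newCachedThreadPool() is a ThreadPoolExecutor backed by a SynchronousQueue.
        ExecutorService pool = Executors.newCachedThreadPool();
        System.out.println(((ThreadPoolExecutor) pool).getQueue().getClass().getSimpleName()); // SynchronousQueue

        // A SynchronousQueue holds nothing: offer() fails unless a consumer is already waiting.
        SynchronousQueue<String> queue = new SynchronousQueue<>();
        System.out.println(queue.offer("task", 100, TimeUnit.MILLISECONDS)); // false - no taker
        pool.shutdown();
    }
}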

Inspecting the heap dump with MAT showed nothing wrong either, apart from a rather large number of strings.

With nothing left to check there, I turned to GC. The container environment did not even have a JDK, so I copied one in.

image-20220106143754185

A single full GC took 330 seconds. Maddening.

The better approach would be to add JVM flags, restart, and capture a GC log:

-XX:+PrintGCDateStamps: print the timestamp at which each GC occurred.
-XX:+PrintTenuringDistribution: print the object age (tenuring) distribution at each GC.
-XX:+PrintGCApplicationStoppedTime: print how long the application was stopped by each GC.
-XX:+PrintGCApplicationConcurrentTime: print how long the application ran between GCs.
-XX:+PrintGCDetails: print GC details, including heap usage before and after each GC.
-Xloggc:/temp/gc.log.date: write the GC log to the given path.

But this is a business application and we were not allowed to touch it.

Universal JVM GC analyzer - Java Garbage collection log analysis made easy (gceasy.io)

image-20220106152415295

The heap is 4 GB, but the young generation was given only 256 MB. The default is young generation : old generation = 1 : 2, so the configured ratio looks questionable.

The earlier screenshot also showed an excessive number of young GCs, which again suggests the young generation is too small.

We asked the application team to remove -Xmn256m and fall back to the default.

The application uses the CMS garbage collector, so let's review the six phases of a CMS collection.

The six important phases of a CMS garbage collection

  • initial-mark (the first STW phase of CMS): marks the objects directly referenced by GC roots. There are not many of these, so this phase is fast.

    (GC roots include: 1. heap objects referenced from the currently active stack frames of all Java threads; 2. heap objects referenced from static structures.)

  • concurrent-mark: the concurrent marking phase, a reachability trace. Starting from the objects marked in the first phase, it marks every reachable object.

  • concurrent-preclean: another phase that runs concurrently. It looks for objects that were promoted from the young generation, newly allocated, or updated while the previous phase was running. By rescanning these objects concurrently, precleaning reduces the work left for the next stop-the-world remark phase.

  • concurrent-abortable-preclean: does the same work as the previous phase, again to reduce the workload of the upcoming STW remark phase. It exists so that we can control when the phase ends, for example after scanning for a given length of time (5 seconds by default) or once Eden usage reaches a target percentage (50% by default).

  • remark (the second STW phase of CMS): pauses all user threads and rescans the heap from the GC roots to mark the live objects. Note that although CMS only reclaims garbage in the old generation, this phase still has to scan the young generation, because many of the references keeping old-generation objects alive come from objects in the young generation; these are called cross-generational references.

  • concurrent-sweep: concurrently sweeps the dead objects.
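For reference, a sketch of the flags that drive the phases above on a pre-JDK-9 HotSpot CMS setup. Only the 4 GB heap comes from this incident; the occupancy threshold is just a common starting point, and the two preclean values are the defaults mentioned above:

-Xmx4g: keep the 4 GB heap; with -Xmn removed, NewRatio=2 sizes the young generation
-XX:+UseConcMarkSweepGC: use CMS for the old generation
-XX:+CMSParallelRemarkEnabled: run the remark (second STW) phase with multiple threads
-XX:CMSInitiatingOccupancyFraction=70: start a CMS cycle when the old generation is 70% full
-XX:+UseCMSInitiatingOccupancyOnly: use only that threshold instead of the adaptive heuristic
-XX:CMSMaxAbortablePrecleanTime=5000: the "5 seconds" limit of abortable preclean, in milliseconds
-XX:CMSScheduleRemarkEdenPenetration=50: the "Eden 50% full" cutoff for abortable preclean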