Language Mechanics On Memory Profiling

Prelude

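This is the third post in a four part series that will provide an understanding of the mechanics and design behind pointers, stacks, heaps, escape analysis and value/pointer semantics in Go. This post focuses on memory profiling and escape analysis.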

Watch this video to see a live demo of this code:

GopherCon Singapore (2017) - Escape Analysis

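Introduction

In the previous post, I taught the basics of escape analysis using an example that shared a value up a goroutine's stack. What I did not show you are the other scenarios that can cause values to escape. To help with this, I am going to debug a program that is allocating in surprising ways.

The Program

I wanted to learn more about the io package, so I gave myself a quick project. Given a stream of bytes, write a function that finds the string elvis and replaces it with the capitalized version, Elvis. We are talking about the King, so his name should always be capitalized.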

Here is a link to a solution:

https://play.golang.org/p/n_SzF4Cer4

Here is a link to the benchmark:

https://play.golang.org/p/TnXrxJVfLV

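The code lists two different functions that solve this problem. This post focuses on the algOne function since it uses the io package. Use the algTwo function to experiment with memory and cpu profiles on your own.

Here is the input data we are going to use and the output the algOne function is expected to produce.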

Listing 1

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis
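
For a quick sanity check of the expected output, the standard library's bytes.ReplaceAll can serve as a non-streaming reference. This is only a sketch and not part of the program being profiled: the helper name capitalizeKing is hypothetical, and the sample input is just the first few bytes of the input above.

```go
package main

import (
	"bytes"
	"fmt"
)

// capitalizeKing is a hypothetical reference helper. It replaces every
// occurrence of "elvis" with "Elvis" using the standard library instead
// of the streaming approach this post profiles.
func capitalizeKing(data []byte) []byte {
	return bytes.ReplaceAll(data, []byte("elvis"), []byte("Elvis"))
}

func main() {
	in := []byte("abcelvisaElvisabcelvise")
	fmt.Printf("%s\n", capitalizeKing(in)) // abcElvisaElvisabcElvise
}
```

Comparing a streaming implementation against a reference like this on a handful of inputs is a cheap way to catch mistakes before any profiling starts.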

Here is the complete listing of the algOne function.

Listing 2

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
 103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

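What I want to know is how well this function performs and what kind of pressure it puts on the heap. To learn this, we need to run a benchmark.

Benchmarking

Here is the benchmark function I wrote that calls into the algOne function to perform the data stream processing.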

Listing 3

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }

With this benchmark function in place, we can run it through go test using the -bench, -benchtime and -benchmem switches.

Listing 4

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	2000000 	     2522 ns/op       117 B/op  	      2 allocs/op

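After running the benchmark, we can see that the algOne function is allocating 2 values worth a total of 117 bytes per operation. This is great, but we need to know which lines of code in the function are causing these allocations. To learn this, we need to generate profiling data for this benchmark.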
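
The B/op and allocs/op figures can also be read programmatically. As a minimal sketch, assuming a stand-in workload (allocate and measure are hypothetical names, not part of the code in this post), testing.Benchmark runs a benchmark function outside of go test and returns a result whose AllocsPerOp and AllocedBytesPerOp methods report those two numbers.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the slice reachable so it has to be heap allocated.
var sink []byte

// allocate is a hypothetical workload that performs one heap
// allocation of 64 bytes on every call.
func allocate() {
	sink = make([]byte, 64)
}

// measure runs allocate under the benchmark harness and returns the
// number of allocations and the number of bytes allocated per operation.
func measure() (allocs, nbytes int64) {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			allocate()
		}
	})
	return res.AllocsPerOp(), res.AllocedBytesPerOp()
}

func main() {
	allocs, nbytes := measure()
	fmt.Printf("%d allocs/op, %d B/op\n", allocs, nbytes)
}
```

This can be handy when you want to assert on allocation counts in a script, though the go test switches above remain the simpler route for everyday work.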

Profiling

To generate profile data, we will run the benchmark again, but this time ask for a memory profile using the -memprofile switch.

Listing 5

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Once the benchmark is finished, the test tool has produced two new files.

Listing 6

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go

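The source code is in a folder named memcpu, with the algOne function in stream.go and the benchmark function in stream_test.go. The two new files are mem.out and memcpu.test. The mem.out file contains the profile data, and the memcpu.test file, named after the folder, contains the test binary we need in order to have access to symbols when looking at the profile data.

With the profile data and the test binary in place, we can now run the pprof tool to study the profile data.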

Listing 7

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _

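When profiling memory and looking for "low hanging fruit", you want to use the -alloc_space option instead of the default -inuse_space option. This shows you where every allocation happened, regardless of whether it was still in memory at the time the profile was taken.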
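
The same kind of profile can be produced outside of go test. As a rough sketch, assuming a hypothetical helper named heapProfile, the runtime/pprof package writes a heap profile to any io.Writer. The profile holds both the allocation samples and the in-use samples, so an option like -alloc_space only selects which view pprof displays.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapProfile is a hypothetical helper that returns the current heap
// profile in the format that go tool pprof reads.
func heapProfile() ([]byte, error) {
	// Run a collection first so the statistics are up to date.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	p, err := heapProfile()
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	fmt.Printf("heap profile is %d bytes\n", len(p))
}
```

In a real program you would more likely write to a file and hand that file to go tool pprof together with the binary.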

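From the (pprof) prompt, we can examine the algOne function using the list command. This command takes a regular expression as an argument to find the function(s) you want to look at.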

Listing 8

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _

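Based on this profile, we now know that input and the backing array of the buf slice are allocating to the heap. Since input is a pointer variable, what the profile is really saying is that the bytes.Buffer value the input pointer points to is allocating. So let's focus on the input allocation first and understand why it is allocating.

We might assume it is allocating because the call to bytes.NewBuffer shares the bytes.Buffer value it creates up the call stack. However, the presence of a value in the flat column (the first column in the pprof output) tells me the value is allocating because the algOne function is sharing it in a way that causes it to escape.

I know the flat column represents allocations made directly in the function because of what the list command shows for the Benchmark function, which calls into algOne.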

Listing 9

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _

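Since there is only a value in the cum column (the second column), this tells me the Benchmark function is not allocating anything directly. All of the allocations are happening in function calls made inside that loop. You can see that the allocation numbers from these two list calls match.

We still don't know why the bytes.Buffer value is allocating. This is where the -gcflags "-m -m" switch on the go build command comes in handy. The profiler can only tell you what values escape, but the build command can tell you why.

Compiler Reporting

Let's ask the compiler what decisions it is making as they relate to escape analysis for this code.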

Listing 10

$ go build -gcflags "-m -m"

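This command produces a lot of output. We only need to search the output for anything that contains stream.go:83, since stream.go is the name of the file that contains this code and line 83 contains the construction of the bytes.Buffer value. After the search we find 6 lines.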

Listing 11

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

The first line we found for stream.go:83 is interesting.

Listing 12

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

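It confirms that the escape is not caused by the bytes.Buffer value being shared up the call stack. The bytes.NewBuffer function was never actually called; the code inside the function was inlined.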
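
To see the effect of inlining on a constructor in isolation, here is a small sketch. The point type and both constructors are hypothetical, and the //go:noinline directive is there only to force the non-inlined behavior for comparison. testing.AllocsPerRun reports the average number of heap allocations per call of the function it is given.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// newPoint is small enough that the compiler is expected to inline it,
// which lets the caller construct the value in its own frame.
func newPoint(x, y int) *point {
	return &point{x, y}
}

// newPointNoInline is identical but can never be inlined, so the value
// it shares up the call stack has to be allocated on the heap.
//
//go:noinline
func newPointNoInline(x, y int) *point {
	return &point{x, y}
}

// total keeps the results observable so the work is not optimized away.
var total int

// allocsInlined measures heap allocations per call of newPoint.
func allocsInlined() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := newPoint(1, 2)
		total += p.x + p.y
	})
}

// allocsNoInline measures heap allocations per call of newPointNoInline.
func allocsNoInline() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := newPointNoInline(1, 2)
		total += p.x + p.y
	})
}

func main() {
	fmt.Println("inlined:    ", allocsInlined())
	fmt.Println("not inlined:", allocsNoInline())
}
```

With default compiler settings I would expect the inlined version to report no allocations, but the exact numbers depend on the compiler's inlining decisions, so treat them as something to measure.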

So this is the piece of code I wrote:

Listing 13

83     input := bytes.NewBuffer(data)

Because the compiler chose to inline the bytes.NewBuffer function call, the code I wrote is converted to this:

Listing 14

input := &bytes.Buffer{buf: data}

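That means the algOne function is constructing the bytes.Buffer value directly. So now the question is, what is causing the value to escape from the algOne stack frame? The answer is in the other 5 lines we found in the report.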

Listing 15

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

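These lines tell us that line 93 of the code is causing the escape. The input variable is being assigned to an interface value.

Interfaces

I definitely don't remember assigning anything to an interface value in the code. However, if you look at line 93, it becomes clear what is happening.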

Listing 16

 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }

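The call to io.ReadFull is causing the interface assignment. If you look at the definition of the io.ReadFull function, you can see how it accepts the input value through an interface type.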

Listing 17

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}

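It appears that passing the bytes.Buffer address down the call stack and storing it inside the Reader interface value is causing the escape. Now we know there is a cost to using an interface: allocation and indirection. So, if it is not clear how an interface makes the code better, you probably don't want to use one. Here are some guidelines I follow to validate the use of interfaces in my code.

Use an interface when:

users of the API need to provide an implementation detail.
the API has multiple implementations it needs to maintain internally.
parts of the API that can change have been identified and require decoupling.

Don't use an interface:

for the sake of using an interface.
to generalize an algorithm.
when users can declare their own interfaces.

Now we can ask ourselves, does this algorithm really need the io.ReadFull function? The answer is no, because the bytes.Buffer type has a method set we can use. Using methods against a value the function owns can prevent allocations.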
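
The difference between the two call styles can be measured in isolation. In this sketch, readVia is a hypothetical helper that stands in for any function accepting an io.Reader; it is marked //go:noinline so the compiler cannot see the concrete type behind the interface. testing.AllocsPerRun then reports the average number of heap allocations per call for each style.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"testing"
)

// readVia is a hypothetical helper that accepts the reader through an
// interface. It is marked noinline so the compiler cannot see the
// concrete type behind the interface value.
//
//go:noinline
func readVia(r io.Reader, p []byte) int {
	n, _ := r.Read(p)
	return n
}

// total keeps the results observable so the work is not optimized away.
var total int

// allocsInterface measures allocations when the bytes.Buffer is handed
// to a function through the io.Reader interface.
func allocsInterface(data []byte) float64 {
	return testing.AllocsPerRun(1000, func() {
		input := bytes.NewBuffer(data)
		var p [4]byte
		total += readVia(input, p[:])
	})
}

// allocsDirect measures allocations when the Read method is called
// directly against the value the function owns.
func allocsDirect(data []byte) float64 {
	return testing.AllocsPerRun(1000, func() {
		input := bytes.NewBuffer(data)
		var p [4]byte
		n, _ := input.Read(p[:])
		total += n
	})
}

func main() {
	data := []byte("abcelvis")
	fmt.Println("through io.Reader: ", allocsInterface(data))
	fmt.Println("direct method call:", allocsDirect(data))
}
```

The interface version should report more allocations than the direct version. Whether a call to a standard library function such as io.ReadFull allocates on a given compiler release is something to confirm with the compiler report rather than assume.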

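Let's change the code to stop using the io package and call the Read method directly against the input variable.

This code change removes the need to import the io package, so to keep all the line numbers the same I am using the blank identifier on the io package import. This allows the import to stay in the list.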

Listing 18

 12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
 103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

When we run the benchmark against this code change, we can see that the allocation for the bytes.Buffer value is gone.

Listing 19

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

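We also see a performance improvement of roughly 29%. The code went from 2570 ns/op to 1814 ns/op. With this solved, we can now focus on the allocation of the backing array for the buf slice. If we run the profiler once more against the new profile data we just produced, we should be able to identify what is causing the remaining allocation.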

Listing 20

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])

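The only allocation left is on line 89, which is for the backing array of the slice.

Stack Frames

We want to know why the backing array for buf is allocating. Let's run go build again using the -gcflags "-m -m" option and search for stream.go:89.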

Listing 21

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89

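The report says the backing array is "too large for stack". This message is very misleading. It is not that the backing array is too large, but that the compiler does not know the size of the backing array at compile time.

Values can only be placed on the stack if the compiler knows the size of the value at compile time. This is because the size of every stack frame, for every function, is calculated at compile time. If the compiler does not know the size of a value, the value is placed on the heap.

To show this, let's temporarily hard code the size of the slice to 5 and run the benchmark again.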

Listing 22

 89     buf := make([]byte, 5)

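This time when we run the benchmark, the allocations are gone. If you look at the compiler report once more, you will see that nothing is escaping.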

Listing 24

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape

Obviously we can't hard code the size of the slice, so we will need to live with 1 allocation for this algorithm.
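
The constant-size versus variable-size comparison can be reproduced without the rest of the program. This is a sketch with hypothetical function names; it uses testing.AllocsPerRun to count heap allocations per call. On the compiler used for this post the variable-sized version is the one that allocates. Compiler releases differ in how they treat small variable-sized slices, so the second number is worth measuring rather than assuming.

```go
package main

import (
	"fmt"
	"testing"
)

// total keeps the results observable so the work is not optimized away.
var total int

// allocsConstSize measures allocations for a slice whose length is a
// compile-time constant and whose backing array never leaves the function.
func allocsConstSize() float64 {
	return testing.AllocsPerRun(1000, func() {
		buf := make([]byte, 5)
		buf[0] = 1
		total += int(buf[0]) + len(buf)
	})
}

// allocsVarSize measures allocations for the same slice when the length
// is only known at run time. The size must be at least 1.
func allocsVarSize(size int) float64 {
	return testing.AllocsPerRun(1000, func() {
		buf := make([]byte, size)
		buf[0] = 1
		total += int(buf[0]) + len(buf)
	})
}

func main() {
	fmt.Println("constant size:", allocsConstSize())
	fmt.Println("variable size:", allocsVarSize(5))
}
```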

Allocations and Performance

Compare the different performance gains we achieved with each refactor.

Listing 25

Before any optimization
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op

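We got a ~29% increase in performance by removing the allocation of the bytes.Buffer, and a ~33% speed up when all the allocations were removed. Allocations are a place where application performance can suffer.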
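
For reference, the percentages come from the ns/op figures in Listing 25, each compared against the 2570 ns/op baseline. The improvement function below is a hypothetical helper written only to show the arithmetic.

```go
package main

import "fmt"

// improvement returns the percentage reduction in time per operation
// when going from before to after (both in ns/op).
func improvement(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	fmt.Printf("%.1f%%\n", improvement(2570, 1814)) // 29.4%
	fmt.Printf("%.1f%%\n", improvement(2570, 1720)) // 33.1%
}
```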

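Conclusion

Go has some amazing tooling that lets you understand the decisions the compiler is making as they relate to escape analysis. Based on this information, you can refactor code to be sympathetic with keeping values on the stack that do not need to be on the heap. You are not going to write zero allocation software, but you want to minimize allocations when possible.

That being said, never write code with performance as your first priority, because you don't want to be guessing about performance. Write code that optimizes for correctness as your first priority. This means focusing on integrity, readability and simplicity first. Once you have a working program, identify whether the program is fast enough. If it is not fast enough, use the tooling the language provides to find and fix your performance issues.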
