This document summarizes the common approaches to performance fine-tuning with
jemalloc (as of 5.1.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default settings are sometimes conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or make favorable trade-offs.
 
 
## Notable runtime options for performance tuning 
 
Runtime options can be set via 
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning). 
 
* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread) 
 
    Enabling jemalloc background threads generally improves the tail latency for 
    application threads, since unused memory purging is shifted to the dedicated 
    background threads.  In addition, unintended purging delay caused by 
    application inactivity is avoided with background threads. 
 
    Suggested: `background_thread:true` when jemalloc managed threads can be 
    allowed. 
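    Background threads can also be toggled at runtime through `mallctl()`;
    a sketch, assuming jemalloc 5.x linked without a symbol prefix (build
    with `-ljemalloc`):

    ```c
    #include <jemalloc/jemalloc.h>
    #include <assert.h>
    #include <stdbool.h>

    /* Enable (or disable) jemalloc's background purging threads at
     * runtime.  Returns 0 on success, an errno-style value otherwise. */
    static int set_background_threads(bool enable) {
        return mallctl("background_thread", NULL, NULL,
                       &enable, sizeof(enable));
    }

    int main(void) {
        /* This may fail if the jemalloc build lacks background thread
         * support, so check the return value. */
        return set_background_threads(true) == 0 ? 0 : 1;
    }
    ```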
 
* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp) 
 
    Allowing jemalloc to utilize transparent huge pages for its internal 
    metadata usually reduces TLB misses significantly, especially for programs 
    with large memory footprint and frequent allocation / deallocation 
    activities.  Metadata memory usage may increase due to the use of huge 
    pages. 
 
    Suggested for allocation intensive programs: `metadata_thp:auto` or 
    `metadata_thp:always`, which is expected to improve CPU utilization at a 
    small memory cost. 
 
* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and 
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms) 
 
    Decay time determines how fast jemalloc returns unused pages back to the 
    operating system, and therefore provides a fairly straightforward trade-off 
    between CPU and memory usage.  Shorter decay times purge unused pages faster
    to reduce memory usage (usually at the cost of more CPU cycles spent on
    purging), and vice versa.
 
    Suggested: tune the values based on the desired trade-offs. 
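    Decay times can also be adjusted at runtime.  A sketch using the
    `arenas.dirty_decay_ms` control, which sets the default for arenas
    created afterwards (per-arena control is available via
    `arena.<i>.dirty_decay_ms`); assumes jemalloc 5.x without a symbol
    prefix:

    ```c
    #include <jemalloc/jemalloc.h>
    #include <assert.h>
    #include <sys/types.h>  /* ssize_t */

    /* Set the default dirty decay time (in ms) for arenas created after
     * this call; -1 disables purging entirely. */
    static int set_default_dirty_decay(ssize_t decay_ms) {
        return mallctl("arenas.dirty_decay_ms", NULL, NULL,
                       &decay_ms, sizeof(decay_ms));
    }

    int main(void) {
        /* Shorter decay: trade CPU (more purging) for less memory. */
        return set_default_dirty_decay(5000) == 0 ? 0 : 1;
    }
    ```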
 
* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas) 
 
    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of parallelism
    is not expected at the allocator level, a lower arena count often improves
    memory usage.
 
    Suggested: if low parallelism is expected, try lower arena count while 
    monitoring CPU and memory usage. 
 
* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena) 
 
    Enables dynamic thread-to-arena association based on the CPU a thread is
    running on.  This has the potential to improve locality, e.g. when
    thread-to-CPU affinity is present.
     
    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if 
    thread migration between processors is expected to be infrequent. 
 
Examples: 
 
* High resource consumption application, prioritizing CPU utilization: 
 
    `background_thread:true,metadata_thp:auto` combined with relaxed decay time 
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`, 
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`). 
 
* High resource consumption application, prioritizing memory usage: 
 
    `background_thread:true` combined with shorter decay time (decreased 
    `dirty_decay_ms` and / or `muzzy_decay_ms`, 
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count 
    (e.g. number of CPUs). 
 
* Low resource consumption application: 
 
    `narenas:1,lg_tcache_max:13` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).
 
* Extremely conservative -- minimize memory usage at all costs, only suitable
  when allocation activity is very rare:
 
    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0` 
 
Note that it is recommended to combine these options with `abort_conf:true`,
which causes jemalloc to abort immediately on invalid options.
 
## Beyond runtime options 
 
In addition to the runtime options, there are a number of programmatic ways to 
improve application performance with jemalloc. 
 
* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create) 
 
    Manually created arenas can help performance in various ways, e.g. by 
    managing locality and contention for specific usages.  For example, 
    applications can explicitly allocate frequently accessed objects from a 
    dedicated arena with 
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve 
    locality.  In addition, explicit arenas often benefit from individually 
    tuned options, e.g. relaxed [decay 
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if 
    frequent reuse is expected. 
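    A minimal sketch of this pattern (assumes jemalloc 5.x without a
    symbol prefix; the size and flags are illustrative):

    ```c
    #include <jemalloc/jemalloc.h>
    #include <assert.h>
    #include <stddef.h>

    /* Create a dedicated arena and return its index, or (unsigned)-1 on
     * error. */
    static unsigned create_arena(void) {
        unsigned arena_ind;
        size_t sz = sizeof(arena_ind);
        if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0)
            return (unsigned)-1;
        return arena_ind;
    }

    int main(void) {
        unsigned arena_ind = create_arena();
        assert(arena_ind != (unsigned)-1);

        /* Allocate a frequently accessed object from the dedicated
         * arena; MALLOCX_TCACHE_NONE bypasses the thread cache so the
         * allocation really comes from that arena. */
        void *p = mallocx(4096,
                          MALLOCX_ARENA(arena_ind) | MALLOCX_TCACHE_NONE);
        assert(p != NULL);

        /* Deallocate with matching tcache flags. */
        dallocx(p, MALLOCX_TCACHE_NONE);
        return 0;
    }
    ```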
 
* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks) 
 
    Extent hooks allow customization of how the underlying memory is managed.
    One performance-oriented use case is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp) 
    uses explicit arenas with customized extent hooks to manage 1GB huge pages 
    for frequently accessed data, which reduces TLB misses significantly. 
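    To illustrate the mechanism only (not HHVM's actual hooks), the
    sketch below wraps the default hooks of a fresh arena so that extent
    allocations are logged.  The hook signature follows the jemalloc 5.x
    `extent_hooks_t` interface; the logging is a hypothetical stand-in
    for real customization such as huge page management:

    ```c
    #include <jemalloc/jemalloc.h>
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    static extent_hooks_t *original_hooks;  /* the arena's default hooks */
    static extent_hooks_t wrapped_hooks;    /* copy with alloc replaced */

    /* Wrapper around the default extent allocation hook; a real use case
     * might carve extents out of preallocated huge pages here instead. */
    static void *logging_alloc(extent_hooks_t *hooks, void *new_addr,
                               size_t size, size_t alignment, bool *zero,
                               bool *commit, unsigned arena_ind) {
        (void)hooks;
        fprintf(stderr, "extent alloc: %zu bytes (arena %u)\n",
                size, arena_ind);
        return original_hooks->alloc(original_hooks, new_addr, size,
                                     alignment, zero, commit, arena_ind);
    }

    /* Read the arena's default hooks, then install a copy whose alloc
     * callback is replaced by the wrapper above.  Returns 0 on success. */
    static int install_logging_hooks(unsigned arena_ind) {
        char ctl[64];
        snprintf(ctl, sizeof(ctl), "arena.%u.extent_hooks", arena_ind);

        size_t hsz = sizeof(original_hooks);
        if (mallctl(ctl, &original_hooks, &hsz, NULL, 0) != 0)
            return -1;
        wrapped_hooks = *original_hooks;
        wrapped_hooks.alloc = logging_alloc;
        extent_hooks_t *hooks_ptr = &wrapped_hooks;
        return mallctl(ctl, NULL, NULL, &hooks_ptr, sizeof(hooks_ptr));
    }

    int main(void) {
        unsigned arena_ind;
        size_t sz = sizeof(arena_ind);
        if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0)
            return 1;
        if (install_logging_hooks(arena_ind) != 0)
            return 1;

        /* A large allocation from the arena now goes through
         * logging_alloc. */
        void *p = mallocx(1 << 21,
                          MALLOCX_ARENA(arena_ind) | MALLOCX_TCACHE_NONE);
        dallocx(p, MALLOCX_TCACHE_NONE);
        return 0;
    }
    ```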
 
* [Explicit thread-to-arena 
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena) 
 
    It is common for some threads in an application to have different memory 
    access / allocation patterns.  Threads with heavy workloads often benefit 
    from explicit binding, e.g. binding very active threads to dedicated arenas 
    may reduce contention at the allocator level.
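    A sketch of explicit binding via the `thread.arena` mallctl (assumes
    jemalloc 5.x without a symbol prefix):

    ```c
    #include <jemalloc/jemalloc.h>
    #include <assert.h>
    #include <stddef.h>

    /* Bind the calling thread to the given arena; subsequent malloc() /
     * free() calls from this thread are served by that arena. */
    static int bind_current_thread(unsigned arena_ind) {
        return mallctl("thread.arena", NULL, NULL,
                       &arena_ind, sizeof(arena_ind));
    }

    int main(void) {
        /* e.g. a dedicated arena for a very active worker thread */
        unsigned arena_ind;
        size_t sz = sizeof(arena_ind);
        if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0)
            return 1;
        return bind_current_thread(arena_ind) == 0 ? 0 : 1;
    }
    ```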