aboutsummaryrefslogtreecommitdiffstats
path: root/contrib/restricted/aws/aws-c-compression/README.md
blob: 2edaedb6cc7e645dad61da143f16f765b4b869bd (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
## AWS C Compression

This is a cross-platform C99 implementation of compression algorithms such as
gzip, and huffman encoding/decoding. Currently only huffman is implemented.

## License

This library is licensed under the Apache 2.0 License.

## Usage

### Building

Note that aws-c-compression has a dependency on aws-c-common:

```
git clone git@github.com:awslabs/aws-c-common.git
cmake -DCMAKE_PREFIX_PATH=<install-path> -DCMAKE_INSTALL_PREFIX=<install-path> -S aws-c-common -B aws-c-common/build
cmake --build aws-c-common/build --target install

git clone git@github.com:awslabs/aws-c-compression.git
cmake -DCMAKE_PREFIX_PATH=<install-path> -DCMAKE_INSTALL_PREFIX=<install-path> -S aws-c-compression -B aws-c-compression/build
cmake --build aws-c-compression/build --target install
```

### Huffman

The Huffman implemention in this library is designed around the concept of a
generic "symbol coder" object, which defines how each symbol (value between 0
and 255) is encoded and decoded. This object looks like this:
```c
typedef struct aws_huffman_code (*aws_huffman_symbol_encoder)(uint8_t symbol, void *userdata);
typedef uint8_t (*aws_huffman_symbol_decoder)(uint32_t bits, uint8_t *symbol, void *userdata);

struct aws_huffman_symbol_coder {
    aws_huffman_symbol_encoder encode;
    aws_huffman_symbol_decoder decode;
    void *userdata;
};
```
These callbacks may be implemented manually, or you may use the included
Huffman coder generator to generate one from a table definition file. The
generator expects to be called with the following arguments:
```shell
$ aws-c-compression-huffman-generator path/to/table.def path/to/generated.c coder_name
```

The table definition file should be in the following format:
```c
/*           sym               bits   code len */
HUFFMAN_CODE(  0,      "1100101110", 0x32e, 10)
HUFFMAN_CODE(  1,      "1100101111", 0x32f, 10)
/* ... */
```
The HUFFMAN_CODE macro expects 4 arguments:
* sym: the symbol value [0-255]
* bits: the bits representing the symbol in string form
* code: the bits representing the symbol in numeric form
* len: the number of bits used to represent the symbol

> #### Note
> This file may also be `#include`d in the following way to generate a static
> list of codes:
> ```c
> /* Provides the HUFFMAN_CODE macro */
> #include <aws/testing/compression/huffman.h>
>
> static struct huffman_test_code_point code_points[] = {
> #include "test_huffman_static_table.def"
> };
> ```

This will emit a c file which exports a function with the following signiture:
```c
struct aws_huffman_symbol_coder *{coder_name}_get_coder();
```
Note that this function does not allocate, but maintains a static instance of
the coder.


An example implementation of this file is provided in
`tests/test_huffman_static_table.def`.


To use the coder, forward declare that function, and pass the result as the
second argument to `aws_huffman_encoder_init` and `aws_huffman_decoder_init`.
```c
struct aws_huffman_encoder encoder;
aws_huffman_encoder_init(&encoder, {coder_name}_get_coder());

struct aws_huffman_decoder decoder;
aws_huffman_decoder_init(&decoder, {coder_name}_get_coder())
```

#### Encoding
```c
/**
 * Encode a symbol buffer into the output buffer.
 *
 * \param[in]       encoder         The encoder object to use
 * \param[in]       to_encode       The symbol buffer to encode
 * \param[in,out]   length          In: The length of to_decode. Out: The number of bytes read from to_encode
 * \param[in]       output          The buffer to write encoded bytes to
 * \param[in,out]   output_size     In: The size of output. Out: The number of bytes written to output
 *
 * \return AWS_OP_SUCCESS if encoding is successful, AWS_OP_ERR the code for the error that occured
 */
int aws_huffman_encode(struct aws_huffman_encoder *encoder, const char *to_encode, size_t *length, uint8_t *output, size_t *output_size);
```
The encoder is built to support partial encoding. This means that if there
isn't enough space in `output`, the encoder will encode as much as possible,
update `length` to indicate how much was consumed, `output_size` won't change,
and `AWS_ERROR_SHORT_BUFFER` will be raised. `aws_huffman_encode` may then be
called again like the following pseudo-code:
```c
void encode_and_send(const char *to_encode, size_t size) {
    while (size > 0) {
        uint8_t output[some_chunk_size];
        size_t output_size = sizeof(output);
        size_t bytes_read = size;
        aws_huffman_encode(encoder, to_encode, &bytes_read, output, &output_size);
        /* AWS_ERROR_SHORT_BUFFER was raised... */
        send_output_to_someone_else(output, output_size);

        to_encode += bytes_read;
        size -= bytes_read;
    }
    /* Be sure to reset the encoder after use */
    aws_huffman_encoder_reset(encoder);
}
```

`aws_huffman_encoder` also has a `uint8_t` field called `eos_padding` that
defines how any unwritten bits in the last byte of output are filled. The most
significant bits will used. For example, if the last byte contains only 3 bits
and `eos_padding` is `0b01010101`, `01010` will be appended to the byte.

#### Decoding
```c
/**
 * Decodes a byte buffer into the provided symbol array.
 *
 * \param[in]       decoder         The decoder object to use
 * \param[in]       to_decode       The encoded byte buffer to read from
 * \param[in,out]   length          In: The length of to_decode. Out: The number of bytes read from to_decode
 * \param[in]       output          The buffer to write decoded symbols to
 * \param[in,out]   output_size     In: The size of output. Out: The number of bytes written to output
 *
 * \return AWS_OP_SUCCESS if encoding is successful, AWS_OP_ERR the code for the error that occured
 */
int aws_huffman_decode(struct aws_huffman_decoder *decoder, const uint8_t *to_decode, size_t *length, char *output, size_t *output_size);
```
The decoder is built to support partial encoding. This means that if there
isn't enough space in `output`, the decoder will decode as much as possible,
update `length` to indicate how much was consumed, `output_size` won't change,
and `AWS_ERROR_SHORT_BUFFER` will be raised. `aws_huffman_decode` may then be
called again like the following pseudo-code:
```c
void decode_and_send(const char *to_decode, size_t size) {
    while (size > 0) {
        uint8_t output[some_chunk_size];
        size_t output_size = sizeof(output);
        size_t bytes_read = size;
        aws_huffman_decode(decoder, to_decode, &bytes_read, output, &output_size);
        /* AWS_ERROR_SHORT_BUFFER was raised... */
        send_output_to_someone_else(output, output_size);

        to_decode += bytes_read;
        size -= bytes_read;
    }
    /* Be sure to reset the decoder after use */
    aws_huffman_decoder_reset(decoder);
}
```

Upon completion of a decode, the most significant bits of
`decoder->working_bits` will contain the final bits of `to_decode` that could
not match a symbol. This is useful for verifying the padding bits of a stream.
For example, to validate that a stream ends in all 1's (like HPACK requires),
you could do the following:
```c
AWS_ASSERT(decoder->working_bits == UINT64_MAX << (64 - decoder->num_bits));
```