aboutsummaryrefslogblamecommitdiffstats
path: root/contrib/libs/dtl/README.md
blob: 4a8ee208d6b27ec78c826c5050116adf365a64d2 (plain) (tree)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685











































































































































































































































































































































































































































































































































































































































































































                                                                                                                                                                                              
# dtl

`dtl` is the diff template library written in C++. The name of template is derived C++'s Template.

# Table of contents

 * [Features](#features)
 * [Getting started](#getting-started)
     * [Compare two strings](#compare-two-strings)
     * [Compare two data has arbitrary type](#compare-two-data-has-arbitrary-type)
     * [Merge three sequences](#merge-three-sequences)
     * [Patch function](#patch-function)
     * [Difference as Unified Format](#difference-as-unified-format)
     * [Compare large sequences](#compare-large-sequences)
     * [Unserious difference](#unserious-difference)
     * [Calculate only Edit Distance](#calculate-only-edit-distance)
 * [Algorithm](#algorithm)
     * [Computational complexity](#computational-complexity)
     * [Comparison when difference between two sequences is very large](#comparison-when-difference-between-two-sequences-is-very-large)
     * [Implementations with various programming languages](#implementations-with-various-programming-languages)
 * [Examples](#examples)
     * [strdiff](#strdiff)
     * [intdiff](#intdiff)
     * [unidiff](#unidiff)
     * [unistrdiff](#unistrdiff)
     * [strdiff3](#strdiff3)
     * [intdiff3](#intdiff3)
     * [patch](#patch)
     * [fpatch](#fpatch)
 * [Running tests](#running-tests)
     * [Building test programs](#building-test-programs)
     * [Running test programs](#running-test-programs)
 * [Old commit histories](#old-commit-histories)
 * [License](#license)

# Features

`dtl` provides the functions for comparing two sequences have arbitrary type. But sequences must support random access\_iterator.

# Getting started

To start using this library, all you need to do is include `dtl.hpp`.

```c++
#include "dtl/dtl.hpp"
```

## Compare two strings

First of all, calculate the difference between two strings.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.compose();
```

When the above code is run, `dtl` calculates the difference between A and B as Edit Distance and LCS and SES.

The meaning of these three terms is below.

| Edit Distance | Edit Distance is numerical value for declaring a difference between two sequences. |
|:--------------|:-----------------------------------------------------------------------------------|
| LCS           | LCS stands for Longest Common Subsequence.                                         |
| SES           | SES stands for Shortest Edit Script. I mean SES is the shortest course of action for tranlating one sequence into another sequence.|

If one sequence is "abc" and another sequence is "abd", Edit Distance and LCS and SES is below.

| Edit Distance | 2               |
|:--------------|:----------------|
| LCS           | ab              |
| SES           | C a C b D c A d |

 * 「C」:Common
 * 「D」:Delete
 * 「A」:ADD

If you want to know in more detail, please see [examples/strdiff.cpp](https://github.com/cubicdaiya/dtl/blob/master/examples/strdiff.cpp).

This calculates Edit Distance and LCS and SES of two strings received as command line arguments and prints each.

When one string is "abc" and another string "abd", the output of `strdiff` is below.

```bash
$ ./strdiff abc abd
editDistance:2
LCS:ab
SES
 a
 b
-c
+d
$
```

## Compare two data has arbitrary type

`dtl` can compare data has aribtrary type because of the C++'s template.

But the compared data type must support the random access\_iterator.

In the previous example, the string data compared,

`dtl` can also compare two int vectors like the example below.

```c++
int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int b[10] = {3, 5, 1, 4, 5, 1, 7, 9, 6, 10};
std::vector<int> A(&a[0], &a[10]);
std::vector<int> B(&b[0], &b[10]);
dtl::Diff< int > d(A, B);
d.compose();
```

If you want to know in more detail, please see [examples/intdiff.cpp](https://github.com/cubicdaiya/dtl/blob/master/examples/intdiff.cpp).

## Merge three sequences

`dtl` has the diff3 function.

This function is that `dtl` merges three sequences.

Additionally `dtl` detects the confliction.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("qqqabc");
sequence B("abc");
sequence C("abcdef");
dtl::Diff3<elem, sequence> diff3(A, B, C);
diff3.compose();
if (!diff3.merge()) {
  std::cerr << "conflict." << std::endl;
  return -1;
}
std::cout << "result:" << diff3.getMergedSequence() << std::endl;
```

When the above code is run, the output is below.

```console
result:qqqabcdef
```

If you want to know in more detail, please see [examples/strdiff3.cpp](https://github.com/cubicdaiya/dtl/blob/master/examples/strdiff3.cpp).

## Patch function

`dtl` can also translates one sequence to another sequence with SES.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
string s1(A);
string s2 = d.patch(s1);
```

When the above code is run, s2 becomes "abd".
The SES of A("abc") and B("abd") is below.

```console
Common a
Common b
Delete c
Add    d
```

The patch function translates a sequence as argument with SES.
For this example, "abc" is translated to "abd" with above SES.

Please see dtl's header files about the data structure of SES.

## Difference as Unified Format

`dtl` can also treat difference as Unified Format. See the example below.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("acbdeaqqqqqqqcbed");
sequence B("acebdabbqqqqqqqabed");
dtl::Diff<elem, sequence > d(A, B);
d.compose();             // construct an edit distance and LCS and SES
d.composeUnifiedHunks(); // construct a difference as Unified Format with SES.
d.printUnifiedFormat();  // print a difference as Unified Format.
```

The difference as Unified Format of "acbdeaqqqqqqqcbed" and "acebdabbqqqqqqqabed" is below.

```diff
@@ -1,9 +1,11 @@
 a
 c
+e
 b
 d
-e
 a
+b
+b
 q
 q
 q
@@ -11,7 +13,7 @@
 q
 q
 q
-c
+a
 b
 e
 d
```

The data structure Unified Format is below.

```c++
/**
 * Structure of Unified Format Hunk
 */
template <typename sesElem>
struct uniHunk {
  int a, b, c, d;                   // @@ -a,b +c,d @@
  std::vector<sesElem> common[2];   // anteroposterior commons on changes
  std::vector<sesElem> change;      // changes
  int inc_dec_count;                // count of increace and decrease
};
```

The actual blocks of Unified Format is this structure's vector.

If you want to know in more detail, please see [examples/unistrdiff.cpp](https://github.com/cubicdaiya/dtl/blob/master/examples/unistrdiff.cpp)
and [examples/unidiff.cpp](https://github.com/cubicdaiya/dtl/blob/master/examples/unidiff.cpp) and dtl's header files.

In addtion, `dtl` has the function translates one sequence to another sequence with Unified Format.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
d.composeUnifiedHunks()
string s1(A);
string s2 = d.uniPatch(s1);
```

When the above code is run, s2 becomes "abd".
The uniPatch function translates a sequence as argument with Unified Format blocks.

For this example, "abc" is translated to "abd" with the Unified Format block below.

```diff
@@ -1,3 +1,3 @@
 a
 b
-c
+d
```

## Compare large sequences

When compare two large sequences, `dtl` can optimizes the calculation of difference with the onHuge function.

This function is available when the compared data type is std::vector.

When you use this function, you may call this function before calling compose function.

```c++
typedef char elem;
typedef  std::vector<elem> sequence;
sequence A;
sequence B;
/* ・・・ */
dtl::Diff< elem, sequence > d(A, B);
d.onHuge();
d.compose();
```

## Unserious difference

The calculation of difference is very heavy.
`dtl` uses An O(NP) Sequence Comparison Algorithm.

Though this Algorithm is sufficiently fast,
when difference between two sequences is very large,

the calculation of LCS and SES needs massive amounts of memory.

`dtl` avoids above-described problem by dividing each sequence into plural subsequences
and joining the difference of each subsequence finally.

As this way repeats allocating massive amounts of memory,
`dtl` provides other way. It is the way of calculating unserious difference.

For example, The normal SES of "abc" and "abd" is below.

```console
Common a
Common b
Delete c
Add    d
```

The unserious SES of "abc" and "abd" is below.

```console
Delete a
Delete b
Delete c
Add    a
Add    b
Add    d
```

Of course, when "abc" and "abd" are compared with `dtl`, above difference is not derived.

`dtl` calculates the unserious difference when `dtl` judges the calculation of LCS and SES
needs massive amounts of memory and unserious difference function is ON.

`dtl` joins the calculated difference before `dtl` judges it and unserious difference finally.

As a result, all difference is not unserious difference when unserious difference function is ON.

When you use this function, you may call this function before calling compose function.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onUnserious();
d.compose();
```

## Calculate only Edit Distance

As using onOnlyEditDistance, `dtl` calculates the only edit distance.

If you need only edit distance, you may use this function,
because the calculation of edit distance is lighter than the calculation of LCS and SES.

When you use this function, you may call this function before calling compose function.

```c++
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onOnlyEditDistance();
d.compose();
```

# Algorithm

The algorithm `dtl` uses is based on "An O(NP) Sequence Comparison Algorithm" by described by Sun Wu, Udi Manber and Gene Myers.

An O(NP) Sequence Comparison Algorithm(following, Wu's O(NP) Algorithm) is the efficient algorithm for comparing two sequences.

## Computational complexity

The computational complexity of Wu's O(NP) Algorithm is averagely O(N+PD), in the worst case, is O(NP).

## Comparison when difference between two sequences is very large

Calculating LCS and SES efficiently at any time is a little difficult.

Because that the calculation of LCS and SES needs massive amounts of memory when a difference between two sequences is very large.

The program uses that algorithm don't consider that will burst in the worst case.

`dtl` avoids above-described problem by dividing each sequence into plural subsequences and joining the difference of each subsequence finally. (This feature is supported after version 0.04)

## Implementations with various programming languages

There are the Wu's O(NP) Algorithm implementations with various programming languages below.

https://github.com/cubicdaiya/onp

# Examples

There are examples in [dtl/examples](https://github.com/cubicdaiya/dtl/tree/master/examples).
`dtl` uses [SCons](http://scons.org/) for building examples and tests. If you build and run examples and tests, install SCons.

## strdiff

`strdiff` calculates a difference between two string sequences, but multi byte is not supported.

```bash
$ cd dtl/examples
$ scons strdiff
$ ./strdiff acbdeacbed acebdabbabed
editDistance:6
LCS:acbdabed
SES
  a
  c
+ e
  b
  d
- e
  a
- c
  b
+ b
+ a
+ b
  e
  d
$
```

## intdiff

`intdiff` calculates a diffrence between two int arrays sequences.

```bash
$ cd dtl/examples
$ scons intdiff
$ ./intdiff # There are data in intdiff.cpp
1 2 3 4 5 6 7 8 9 10 
3 5 1 4 5 1 7 9 6 10 
editDistance:8
LCS: 3 4 5 7 9 10 
SES
- 1
- 2
  3
+ 5
+ 1
  4
  5
- 6
+ 1
  7
- 8
  9
+ 6
  10
$
```

## unidiff

`unidiff` calculates a diffrence between two text file sequences,
and output the difference between files with unified format.

```bash
$ cd dtl/examples
$ scons unidiff
$ cat a.txt
a
e
c
z
z
d
e
f
a
b
c
d
e
f
g
h
i
$ cat b.txt
a
d
e
c
f
e
a
b
c
d
e
f
g
h
i
$ ./unidiff a.txt b.txt
--- a.txt       2008-08-26 07:03:28 +0900
+++ b.txt       2008-08-26 03:02:42 +0900
@@ -1,11 +1,9 @@
 a
-e
-c
-z
-z
 d
 e
+c
 f
+e
 a
 b
 c
$
```

## unistrdiff

`unistrdiff` calculates a diffrence between two string sequences.
and output the difference between strings with unified format.

```bash
$ cd dtl/examples
$ scons unistrdiff
$ ./unistrdiff acbdeacbed acebdabbabed
editDistance:6
LCS:acbdabed
@@ -1,10 +1,12 @@
 a
 c
+e
 b
 d
-e
 a
-c
 b
+b
+a
+b
 e
 d
$
```

## strdiff3

`strdiff3` merges three string sequence and output the merged sequence.
When the confliction has occured, output the string "conflict.".

```bash
$ cd dtl/examples
$ scons strdiff3
$ ./strdiff3 qabc abc abcdef
result:qabcdef
$
```

There is a output below when conflict occured.

```bash
$ ./strdiff3 adc abc aec
conflict.
$
```

## intdiff3

`intdiff3` merges three integer sequence(vector) and output the merged sequence.

```bash
$ cd dtl/examples
$ scons intdiff3
$ ./intdiff3
a:1 2 3 4 5 6 7 3 9 10
b:1 2 3 4 5 6 7 8 9 10
c:1 2 3 9 5 6 7 8 9 10
s:1 2 3 9 5 6 7 3 9 10
intdiff3 OK
$
```

## patch

`patch` is the test program. Supposing that there are two strings is called by A and B,
`patch` translates A to B with Shortest Edit Script or unified format difference.

```bash
$ cd dtl/examples
$ scons patch
$ ./patch abc abd
before:abc
after :abd
patch successed
before:abc
after :abd
unipatch successed
$
```

## fpatch

`fpatch` is the test program. Supposing that there are two files is called by A and B,
`fpatch` translates A to B with Shortest Edit Script or unified format difference.

```bash
$ cd dtl/examples
$ scons fpatch
$ cat a.txt
a
e
c
z
z
d
e
f
a
b
c
d
e
f
g
h
i
$ cat b.txt
$ cat b.txt
a
d
e
c
f
e
a
b
c
d
e
f
g
h
i
$ ./fpatch a.txt b.txt
fpatch successed
unipatch successed
$
```

# Running tests

`dtl` uses [googletest](https://github.com/google/googletest) and [SCons](http://www.scons.org/) with testing dtl-self.

# Building test programs

If you build test programs for `dtl`, run `scons` in test direcotry.

```bash
$ scons
```

# Running test programs

If you run all tests for `dtl`, run 'scons check' in test direcotry. (it is necessary that gtest is compiled)

```bash
$ scons check
```

If you run sectional tests, you may exeucte `dtl_test` directly after you run `scons`.
Following command is the example for testing only Strdifftest.

```bash
$ ./dtl_test --gtest_filter='Strdifftest.*'
```

`--gtest-filters` is the function of googletest. googletest has many useful functions for testing software flexibly.
If you want to know other functions of googletest, run `./dtl_test --help`.

# Old commit histories

Please see [cubicdaiya/dtl-legacy](https://github.com/cubicdaiya/dtl-legacy).

# License

Please read the file [COPYING](https://github.com/cubicdaiya/dtl/blob/master/COPYING).