diff options
| author | pefavel <[email protected]> | 2026-03-16 17:44:57 +0300 |
|---|---|---|
| committer | pefavel <[email protected]> | 2026-03-17 11:40:58 +0300 |
| commit | 6eecc739c342dbfca9be6328231233dd8e77d9f4 (patch) | |
| tree | 491834a1c01185c100a79d420a7492c7e53ba32a /contrib/tools/python/src/Objects | |
| parent | 58b88dfd7db837890ffc2edbe80e5235298cec10 (diff) | |
revert piglet config change
commit_hash:d068d68a89226c414a3d5a1f8ad102579bdd233b
Diffstat (limited to 'contrib/tools/python/src/Objects')
| -rw-r--r-- | contrib/tools/python/src/Objects/dictnotes.txt | 270 | ||||
| -rw-r--r-- | contrib/tools/python/src/Objects/listsort.txt | 755 | ||||
| -rw-r--r-- | contrib/tools/python/src/Objects/lnotab_notes.txt | 129 | ||||
| -rw-r--r-- | contrib/tools/python/src/Objects/stringlib/README.txt | 40 |
4 files changed, 0 insertions, 1194 deletions
diff --git a/contrib/tools/python/src/Objects/dictnotes.txt b/contrib/tools/python/src/Objects/dictnotes.txt deleted file mode 100644 index d3aa77401cf..00000000000 --- a/contrib/tools/python/src/Objects/dictnotes.txt +++ /dev/null @@ -1,270 +0,0 @@ -NOTES ON OPTIMIZING DICTIONARIES -================================ - - -Principal Use Cases for Dictionaries ------------------------------------- - -Passing keyword arguments - Typically, one read and one write for 1 to 3 elements. - Occurs frequently in normal python code. - -Class method lookup - Dictionaries vary in size with 8 to 16 elements being common. - Usually written once with many lookups. - When base classes are used, there are many failed lookups - followed by a lookup in a base class. - -Instance attribute lookup and Global variables - Dictionaries vary in size. 4 to 10 elements are common. - Both reads and writes are common. - -Builtins - Frequent reads. Almost never written. - Size 126 interned strings (as of Py2.3b1). - A few keys are accessed much more frequently than others. - -Uniquification - Dictionaries of any size. Bulk of work is in creation. - Repeated writes to a smaller set of keys. - Single read of each key. - Some use cases have two consecutive accesses to the same key. - - * Removing duplicates from a sequence. - dict.fromkeys(seqn).keys() - - * Counting elements in a sequence. - for e in seqn: - d[e] = d.get(e,0) + 1 - - * Accumulating references in a dictionary of lists: - - for pagenumber, page in enumerate(pages): - for word in page: - d.setdefault(word, []).append(pagenumber) - - Note, the second example is a use case characterized by a get and set - to the same key. There are similar use cases with a __contains__ - followed by a get, set, or del to the same key. Part of the - justification for d.setdefault is combining the two lookups into one. - -Membership Testing - Dictionaries of any size. Created once and then rarely changes. - Single write to each key. - Many calls to __contains__() or has_key(). - Similar access patterns occur with replacement dictionaries - such as with the % formatting operator. - -Dynamic Mappings - Characterized by deletions interspersed with adds and replacements. - Performance benefits greatly from the re-use of dummy entries. - - -Data Layout (assuming a 32-bit box with 64 bytes per cache line) ----------------------------------------------------------------- - -Smalldicts (8 entries) are attached to the dictobject structure -and the whole group nearly fills two consecutive cache lines. - -Larger dicts use the first half of the dictobject structure (one cache -line) and a separate, continuous block of entries (at 12 bytes each -for a total of 5.333 entries per cache line). - - -Tunable Dictionary Parameters ------------------------------ - -* PyDict_MINSIZE. Currently set to 8. - Must be a power of two. New dicts have to zero-out every cell. - Each additional 8 consumes 1.5 cache lines. Increasing improves - the sparseness of small dictionaries but costs time to read in - the additional cache lines if they are not already in cache. - That case is common when keyword arguments are passed. - -* Maximum dictionary load in PyDict_SetItem. Currently set to 2/3. - Increasing this ratio makes dictionaries more dense resulting - in more collisions. Decreasing it improves sparseness at the - expense of spreading entries over more cache lines and at the - cost of total memory consumed. - - The load test occurs in highly time sensitive code. Efforts - to make the test more complex (for example, varying the load - for different sizes) have degraded performance. - -* Growth rate upon hitting maximum load. Currently set to *2. - Raising this to *4 results in half the number of resizes, - less effort to resize, better sparseness for some (but not - all dict sizes), and potentially doubles memory consumption - depending on the size of the dictionary. Setting to *4 - eliminates every other resize step. - -* Maximum sparseness (minimum dictionary load). What percentage - of entries can be unused before the dictionary shrinks to - free up memory and speed up iteration? (The current CPython - code does not represent this parameter directly.) - -* Shrinkage rate upon exceeding maximum sparseness. The current - CPython code never even checks sparseness when deleting a - key. When a new key is added, it resizes based on the number - of active keys, so that the addition may trigger shrinkage - rather than growth. - -Tune-ups should be measured across a broad range of applications and -use cases. A change to any parameter will help in some situations and -hurt in others. The key is to find settings that help the most common -cases and do the least damage to the less common cases. Results will -vary dramatically depending on the exact number of keys, whether the -keys are all strings, whether reads or writes dominate, the exact -hash values of the keys (some sets of values have fewer collisions than -others). Any one test or benchmark is likely to prove misleading. - -While making a dictionary more sparse reduces collisions, it impairs -iteration and key listing. Those methods loop over every potential -entry. Doubling the size of dictionary results in twice as many -non-overlapping memory accesses for keys(), items(), values(), -__iter__(), iterkeys(), iteritems(), itervalues(), and update(). -Also, every dictionary iterates at least twice, once for the memset() -when it is created and once by dealloc(). - -Dictionary operations involving only a single key can be O(1) unless -resizing is possible. By checking for a resize only when the -dictionary can grow (and may *require* resizing), other operations -remain O(1), and the odds of resize thrashing or memory fragmentation -are reduced. In particular, an algorithm that empties a dictionary -by repeatedly invoking .pop will see no resizing, which might -not be necessary at all because the dictionary is eventually -discarded entirely. - - -Results of Cache Locality Experiments -------------------------------------- - -When an entry is retrieved from memory, 4.333 adjacent entries are also -retrieved into a cache line. Since accessing items in cache is *much* -cheaper than a cache miss, an enticing idea is to probe the adjacent -entries as a first step in collision resolution. Unfortunately, the -introduction of any regularity into collision searches results in more -collisions than the current random chaining approach. - -Exploiting cache locality at the expense of additional collisions fails -to payoff when the entries are already loaded in cache (the expense -is paid with no compensating benefit). This occurs in small dictionaries -where the whole dictionary fits into a pair of cache lines. It also -occurs frequently in large dictionaries which have a common access pattern -where some keys are accessed much more frequently than others. The -more popular entries *and* their collision chains tend to remain in cache. - -To exploit cache locality, change the collision resolution section -in lookdict() and lookdict_string(). Set i^=1 at the top of the -loop and move the i = (i << 2) + i + perturb + 1 to an unrolled -version of the loop. - -This optimization strategy can be leveraged in several ways: - -* If the dictionary is kept sparse (through the tunable parameters), -then the occurrence of additional collisions is lessened. - -* If lookdict() and lookdict_string() are specialized for small dicts -and for largedicts, then the versions for large_dicts can be given -an alternate search strategy without increasing collisions in small dicts -which already have the maximum benefit of cache locality. - -* If the use case for a dictionary is known to have a random key -access pattern (as opposed to a more common pattern with a Zipf's law -distribution), then there will be more benefit for large dictionaries -because any given key is no more likely than another to already be -in cache. - -* In use cases with paired accesses to the same key, the second access -is always in cache and gets no benefit from efforts to further improve -cache locality. - -Optimizing the Search of Small Dictionaries -------------------------------------------- - -If lookdict() and lookdict_string() are specialized for smaller dictionaries, -then a custom search approach can be implemented that exploits the small -search space and cache locality. - -* The simplest example is a linear search of contiguous entries. This is - simple to implement, guaranteed to terminate rapidly, never searches - the same entry twice, and precludes the need to check for dummy entries. - -* A more advanced example is a self-organizing search so that the most - frequently accessed entries get probed first. The organization - adapts if the access pattern changes over time. Treaps are ideally - suited for self-organization with the most common entries at the - top of the heap and a rapid binary search pattern. Most probes and - results are all located at the top of the tree allowing them all to - be located in one or two cache lines. - -* Also, small dictionaries may be made more dense, perhaps filling all - eight cells to take the maximum advantage of two cache lines. - - -Strategy Pattern ----------------- - -Consider allowing the user to set the tunable parameters or to select a -particular search method. Since some dictionary use cases have known -sizes and access patterns, the user may be able to provide useful hints. - -1) For example, if membership testing or lookups dominate runtime and memory - is not at a premium, the user may benefit from setting the maximum load - ratio at 5% or 10% instead of the usual 66.7%. This will sharply - curtail the number of collisions but will increase iteration time. - The builtin namespace is a prime example of a dictionary that can - benefit from being highly sparse. - -2) Dictionary creation time can be shortened in cases where the ultimate - size of the dictionary is known in advance. The dictionary can be - pre-sized so that no resize operations are required during creation. - Not only does this save resizes, but the key insertion will go - more quickly because the first half of the keys will be inserted into - a more sparse environment than before. The preconditions for this - strategy arise whenever a dictionary is created from a key or item - sequence and the number of *unique* keys is known. - -3) If the key space is large and the access pattern is known to be random, - then search strategies exploiting cache locality can be fruitful. - The preconditions for this strategy arise in simulations and - numerical analysis. - -4) If the keys are fixed and the access pattern strongly favors some of - the keys, then the entries can be stored contiguously and accessed - with a linear search or treap. This exploits knowledge of the data, - cache locality, and a simplified search routine. It also eliminates - the need to test for dummy entries on each probe. The preconditions - for this strategy arise in symbol tables and in the builtin dictionary. - - -Readonly Dictionaries ---------------------- -Some dictionary use cases pass through a build stage and then move to a -more heavily exercised lookup stage with no further changes to the -dictionary. - -An idea that emerged on python-dev is to be able to convert a dictionary -to a read-only state. This can help prevent programming errors and also -provide knowledge that can be exploited for lookup optimization. - -The dictionary can be immediately rebuilt (eliminating dummy entries), -resized (to an appropriate level of sparseness), and the keys can be -jostled (to minimize collisions). The lookdict() routine can then -eliminate the test for dummy entries (saving about 1/4 of the time -spent in the collision resolution loop). - -An additional possibility is to insert links into the empty spaces -so that dictionary iteration can proceed in len(d) steps instead of -(mp->mask + 1) steps. Alternatively, a separate tuple of keys can be -kept just for iteration. - - -Caching Lookups ---------------- -The idea is to exploit key access patterns by anticipating future lookups -based on previous lookups. - -The simplest incarnation is to save the most recently accessed entry. -This gives optimal performance for use cases where every get is followed -by a set or del to the same key. diff --git a/contrib/tools/python/src/Objects/listsort.txt b/contrib/tools/python/src/Objects/listsort.txt deleted file mode 100644 index c6fcb34a0bf..00000000000 --- a/contrib/tools/python/src/Objects/listsort.txt +++ /dev/null @@ -1,755 +0,0 @@ -Intro ------ -This describes an adaptive, stable, natural mergesort, modestly called -timsort (hey, I earned it <wink>). It has supernatural performance on many -kinds of partially ordered arrays (less than lg(N!) comparisons needed, and -as few as N-1), yet as fast as Python's previous highly tuned samplesort -hybrid on random arrays. - -In a nutshell, the main routine marches over the array once, left to right, -alternately identifying the next run, then merging it into the previous -runs "intelligently". Everything else is complication for speed, and some -hard-won measure of memory efficiency. - - -Comparison with Python's Samplesort Hybrid ------------------------------------------- -+ timsort can require a temp array containing as many as N//2 pointers, - which means as many as 2*N extra bytes on 32-bit boxes. It can be - expected to require a temp array this large when sorting random data; on - data with significant structure, it may get away without using any extra - heap memory. This appears to be the strongest argument against it, but - compared to the size of an object, 2 temp bytes worst-case (also expected- - case for random data) doesn't scare me much. - - It turns out that Perl is moving to a stable mergesort, and the code for - that appears always to require a temp array with room for at least N - pointers. (Note that I wouldn't want to do that even if space weren't an - issue; I believe its efforts at memory frugality also save timsort - significant pointer-copying costs, and allow it to have a smaller working - set.) - -+ Across about four hours of generating random arrays, and sorting them - under both methods, samplesort required about 1.5% more comparisons - (the program is at the end of this file). - -+ In real life, this may be faster or slower on random arrays than - samplesort was, depending on platform quirks. Since it does fewer - comparisons on average, it can be expected to do better the more - expensive a comparison function is. OTOH, it does more data movement - (pointer copying) than samplesort, and that may negate its small - comparison advantage (depending on platform quirks) unless comparison - is very expensive. - -+ On arrays with many kinds of pre-existing order, this blows samplesort out - of the water. It's significantly faster than samplesort even on some - cases samplesort was special-casing the snot out of. I believe that lists - very often do have exploitable partial order in real life, and this is the - strongest argument in favor of timsort (indeed, samplesort's special cases - for extreme partial order are appreciated by real users, and timsort goes - much deeper than those, in particular naturally covering every case where - someone has suggested "and it would be cool if list.sort() had a special - case for this too ... and for that ..."). - -+ Here are exact comparison counts across all the tests in sortperf.py, - when run with arguments "15 20 1". - - Column Key: - *sort: random data - \sort: descending data - /sort: ascending data - 3sort: ascending, then 3 random exchanges - +sort: ascending, then 10 random at the end - %sort: ascending, then randomly replace 1% of elements w/ random values - ~sort: many duplicates - =sort: all equal - !sort: worst case scenario - - First the trivial cases, trivial for samplesort because it special-cased - them, and trivial for timsort because it naturally works on runs. Within - an "n" block, the first line gives the # of compares done by samplesort, - the second line by timsort, and the third line is the percentage by - which the samplesort count exceeds the timsort count: - - n \sort /sort =sort -------- ------ ------ ------ - 32768 32768 32767 32767 samplesort - 32767 32767 32767 timsort - 0.00% 0.00% 0.00% (samplesort - timsort) / timsort - - 65536 65536 65535 65535 - 65535 65535 65535 - 0.00% 0.00% 0.00% - - 131072 131072 131071 131071 - 131071 131071 131071 - 0.00% 0.00% 0.00% - - 262144 262144 262143 262143 - 262143 262143 262143 - 0.00% 0.00% 0.00% - - 524288 524288 524287 524287 - 524287 524287 524287 - 0.00% 0.00% 0.00% - -1048576 1048576 1048575 1048575 - 1048575 1048575 1048575 - 0.00% 0.00% 0.00% - - The algorithms are effectively identical in these cases, except that - timsort does one less compare in \sort. - - Now for the more interesting cases. Where lg(x) is the logarithm of x to - the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for - the best any comparison-based sorting algorithm can do on average (across - all permutations). When a method gets significantly below that, it's - either astronomically lucky, or is finding exploitable structure in the - data. - - - n lg(n!) *sort 3sort +sort %sort ~sort !sort -------- ------- ------ ------- ------- ------ ------- -------- - 32768 444255 453096 453614 32908 452871 130491 469141 old - 448885 33016 33007 50426 182083 65534 new - 0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new - - 65536 954037 972699 981940 65686 973104 260029 1004607 - 962991 65821 65808 101667 364341 131070 - 1.01% 1391.83% -0.19% 857.15% -28.63% 666.47% - - 131072 2039137 2101881 2091491 131232 2092894 554790 2161379 - 2057533 131410 131361 206193 728871 262142 - 2.16% 1491.58% -0.10% 915.02% -23.88% 724.51% - - 262144 4340409 4464460 4403233 262314 4445884 1107842 4584560 - 4377402 262437 262459 416347 1457945 524286 - 1.99% 1577.82% -0.06% 967.83% -24.01% 774.44% - - 524288 9205096 9453356 9408463 524468 9441930 2218577 9692015 - 9278734 524580 524633 837947 2916107 1048574 - 1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30% - -1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212 - 19606028 1048958 1048941 1694896 5832445 2097150 - 1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38% - - Discussion of cases: - - *sort: There's no structure in random data to exploit, so the theoretical - limit is lg(n!). Both methods get close to that, and timsort is hugging - it (indeed, in a *marginal* sense, it's a spectacular improvement -- - there's only about 1% left before hitting the wall, and timsort knows - darned well it's doing compares that won't pay on random data -- but so - does the samplesort hybrid). For contrast, Hoare's original random-pivot - quicksort does about 39% more compares than the limit, and the median-of-3 - variant about 19% more. - - 3sort, %sort, and !sort: No contest; there's structure in this data, but - not of the specific kinds samplesort special-cases. Note that structure - in !sort wasn't put there on purpose -- it was crafted as a worst case for - a previous quicksort implementation. That timsort nails it came as a - surprise to me (although it's obvious in retrospect). - - +sort: samplesort special-cases this data, and does a few less compares - than timsort. However, timsort runs this case significantly faster on all - boxes we have timings for, because timsort is in the business of merging - runs efficiently, while samplesort does much more data movement in this - (for it) special case. - - ~sort: samplesort's special cases for large masses of equal elements are - extremely effective on ~sort's specific data pattern, and timsort just - isn't going to get close to that, despite that it's clearly getting a - great deal of benefit out of the duplicates (the # of compares is much less - than lg(n!)). ~sort has a perfectly uniform distribution of just 4 - distinct values, and as the distribution gets more skewed, samplesort's - equal-element gimmicks become less effective, while timsort's adaptive - strategies find more to exploit; in a database supplied by Kevin Altis, a - sort on its highly skewed "on which stock exchange does this company's - stock trade?" field ran over twice as fast under timsort. - - However, despite that timsort does many more comparisons on ~sort, and - that on several platforms ~sort runs highly significantly slower under - timsort, on other platforms ~sort runs highly significantly faster under - timsort. No other kind of data has shown this wild x-platform behavior, - and we don't have an explanation for it. The only thing I can think of - that could transform what "should be" highly significant slowdowns into - highly significant speedups on some boxes are catastrophic cache effects - in samplesort. - - But timsort "should be" slower than samplesort on ~sort, so it's hard - to count that it isn't on some boxes as a strike against it <wink>. - -+ Here's the highwater mark for the number of heap-based temp slots (4 - bytes each on this box) needed by each test, again with arguments - "15 20 1": - - 2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort - 32768 16384 0 0 6256 0 10821 12288 0 16383 - 65536 32766 0 0 21652 0 31276 24576 0 32767 - 131072 65534 0 0 17258 0 58112 49152 0 65535 - 262144 131072 0 0 35660 0 123561 98304 0 131071 - 524288 262142 0 0 31302 0 212057 196608 0 262143 -1048576 524286 0 0 312438 0 484942 393216 0 524287 - - Discussion: The tests that end up doing (close to) perfectly balanced - merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort - also ends up doing balanced merges, but systematically benefits a lot from - the preliminary pre-merge searches described under "Merge Memory" later. - %sort approaches having a balanced merge at the end because the random - selection of elements to replace is expected to produce an out-of-order - element near the midpoint. \sort, /sort, =sort are the trivial one-run - cases, needing no merging at all. +sort ends up having one very long run - and one very short, and so gets all the temp space it needs from the small - temparray member of the MergeState struct (note that the same would be - true if the new random elements were prefixed to the sorted list instead, - but not if they appeared "in the middle"). 3sort approaches N//3 temp - slots twice, but the run lengths that remain after 3 random exchanges - clearly has very high variance. - - -A detailed description of timsort follows. - -Runs ----- -count_run() returns the # of elements in the next run. A run is either -"ascending", which means non-decreasing: - - a0 <= a1 <= a2 <= ... - -or "descending", which means strictly decreasing: - - a0 > a1 > a2 > ... - -Note that a run is always at least 2 long, unless we start at the array's -last element. - -The definition of descending is strict, because the main routine reverses -a descending run in-place, transforming a descending run into an ascending -run. Reversal is done via the obvious fast "swap elements starting at each -end, and converge at the middle" method, and that can violate stability if -the slice contains any equal elements. Using a strict definition of -descending ensures that a descending run contains distinct elements. - -If an array is random, it's very unlikely we'll see long runs. If a natural -run contains less than minrun elements (see next section), the main loop -artificially boosts it to minrun elements, via a stable binary insertion sort -applied to the right number of array elements following the short natural -run. In a random array, *all* runs are likely to be minrun long as a -result. This has two primary good effects: - -1. Random data strongly tends then toward perfectly balanced (both runs have - the same length) merges, which is the most efficient way to proceed when - data is random. - -2. Because runs are never very short, the rest of the code doesn't make - heroic efforts to shave a few cycles off per-merge overheads. For - example, reasonable use of function calls is made, rather than trying to - inline everything. Since there are no more than N/minrun runs to begin - with, a few "extra" function calls per merge is barely measurable. - - -Computing minrun ----------------- -If N < 64, minrun is N. IOW, binary insertion sort is used for the whole -array then; it's hard to beat that given the overheads of trying something -fancier (see note BINSORT). - -When N is a power of 2, testing on random data showed that minrun values of -16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost -in binary insertion sort clearly hurt, and at 8 the increase in the number -of function calls clearly hurt. Picking *some* power of 2 is important -here, so that the merges end up perfectly balanced (see next section). We -pick 32 as a good value in the sweet range; picking a value at the low end -allows the adaptive gimmicks more opportunity to exploit shorter natural -runs. - -Because sortperf.py only tries powers of 2, it took a long time to notice -that 32 isn't a good choice for the general case! Consider N=2112: - ->>> divmod(2112, 32) -(66, 0) ->>> - -If the data is randomly ordered, we're very likely to end up with 66 runs -each of length 32. The first 64 of these trigger a sequence of perfectly -balanced merges (see next section), leaving runs of lengths 2048 and 64 to -merge at the end. The adaptive gimmicks can do that with fewer than 2048+64 -compares, but it's still more compares than necessary, and-- mergesort's -bugaboo relative to samplesort --a lot more data movement (O(N) copies just -to get 64 elements into place). - -If we take minrun=33 in this case, then we're very likely to end up with 64 -runs each of length 33, and then all merges are perfectly balanced. Better! - -What we want to avoid is picking minrun such that in - - q, r = divmod(N, minrun) - -q is a power of 2 and r>0 (then the last merge only gets r elements into -place, and r < minrun is small compared to N), or q a little larger than a -power of 2 regardless of r (then we've got a case similar to "2112", again -leaving too little work for the last merge to do). - -Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a -power of 2, or if that isn't possible, is close to, but strictly less than, -a power of 2. This is easier to do than it may sound: take the first 6 -bits of N, and add 1 if any of the remaining bits are set. In fact, that -rule covers every case in this section, including small N and exact powers -of 2; merge_compute_minrun() is a deceptively simple function. - - -The Merge Pattern ------------------ -In order to exploit regularities in the data, we're merging on natural -run lengths, and they can become wildly unbalanced. That's a Good Thing -for this sort! It means we have to find a way to manage an assortment of -potentially very different run lengths, though. - -Stability constrains permissible merging patterns. For example, if we have -3 consecutive runs of lengths - - A:10000 B:20000 C:10000 - -we dare not merge A with C first, because if A, B and C happen to contain -a common element, it would get out of order wrt its occurrence(s) in B. The -merging must be done as (A+B)+C or A+(B+C) instead. - -So merging is always done on two consecutive runs at a time, and in-place, -although this may require some temp memory (more on that later). - -When a run is identified, its base address and length are pushed on a stack -in the MergeState struct. merge_collapse() is then called to see whether it -should merge it with preceding run(s). We would like to delay merging as -long as possible in order to exploit patterns that may come up later, but we -like even more to do merging as soon as possible to exploit that the run just -found is still high in the memory hierarchy. We also can't delay merging -"too long" because it consumes memory to remember the runs that are still -unmerged, and the stack has a fixed size. - -What turned out to be a good compromise maintains two invariants on the -stack entries, where A, B and C are the lengths of the three righmost not-yet -merged slices: - -1. A > B+C -2. B > C - -Note that, by induction, #2 implies the lengths of pending runs form a -decreasing sequence. #1 implies that, reading the lengths right to left, -the pending-run lengths grow at least as fast as the Fibonacci numbers. -Therefore the stack can never grow larger than about log_base_phi(N) entries, -where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice -for very large arrays. - -If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the -freshness-in-cache reason), and the new run replaces the A,B or B,C entries; -e.g., if the last 3 entries are - - A:30 B:20 C:10 - -then B is merged with C, leaving - - A:30 BC:30 - -on the stack. Or if they were - - A:500 B:400: C:1000 - -then A is merged with B, leaving - - AB:900 C:1000 - -on the stack. - -In both examples, the stack configuration after the merge still violates -invariant #2, and merge_collapse() goes on to continue merging runs until -both invariants are satisfied. As an extreme case, suppose we didn't do the -minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2, -and 2. Nothing would get merged until the final 2 was seen, and that would -trigger 7 perfectly balanced merges. - -The thrust of these rules when they trigger merging is to balance the run -lengths as closely as possible, while keeping a low bound on the number of -runs we have to remember. This is maximally effective for random data, -where all runs are likely to be of (artificially forced) length minrun, and -then we get a sequence of perfectly balanced merges (with, perhaps, some -oddballs at the end). - -OTOH, one reason this sort is so good for partly ordered data has to do -with wildly unbalanced run lengths. - - -Merge Memory ------------- -Merging adjacent runs of lengths A and B in-place, and in linear time, is -difficult. Theoretical constructions are known that can do it, but they're -too difficult and slow for practical use. But if we have temp memory equal -to min(A, B), it's easy. - -If A is smaller (function merge_lo), copy A to a temp array, leave B alone, -and then we can do the obvious merge algorithm left to right, from the temp -area and B, starting the stores into where A used to live. There's always a -free area in the original area comprising a number of elements equal to the -number not yet merged from the temp array (trivially true at the start; -proceed by induction). The only tricky bit is that if a comparison raises an -exception, we have to remember to copy the remaining elements back in from -the temp area, lest the array end up with duplicate entries from B. But -that's exactly the same thing we need to do if we reach the end of B first, -so the exit code is pleasantly common to both the normal and error cases. - -If B is smaller (function merge_hi, which is merge_lo's "mirror image"), -much the same, except that we need to merge right to left, copying B into a -temp array and starting the stores at the right end of where B used to live. - -A refinement: When we're about to merge adjacent runs A and B, we first do -a form of binary search (more on that later) to see where B[0] should end up -in A. Elements in A preceding that point are already in their final -positions, effectively shrinking the size of A. Likewise we also search to -see where A[-1] should end up in B, and elements of B after that point can -also be ignored. This cuts the amount of temp memory needed by the same -amount. - -These preliminary searches may not pay off, and can be expected *not* to -repay their cost if the data is random. But they can win huge in all of -time, copying, and memory savings when they do pay, so this is one of the -"per-merge overheads" mentioned above that we're happy to endure because -there is at most one very short run. It's generally true in this algorithm -that we're willing to gamble a little to win a lot, even though the net -expectation is negative for random data. - - -Merge Algorithms ----------------- -merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo -deals with runs where A <= B, and merge_hi where A > B. They don't know -whether the data is clustered or uniform, but a lovely thing about merging -is that many kinds of clustering "reveal themselves" by how many times in a -row the winning merge element comes from the same run. We'll only discuss -merge_lo here; merge_hi is exactly analogous. - -Merging begins in the usual, obvious way, comparing the first element of A -to the first of B, and moving B[0] to the merge area if it's less than A[0], -else moving A[0] to the merge area. Call that the "one pair at a time" -mode. The only twist here is keeping track of how many times in a row "the -winner" comes from the same run. - -If that count reaches MIN_GALLOP, we switch to "galloping mode". Here -we *search* B for where A[0] belongs, and move over all the B's before -that point in one chunk to the merge area, then move A[0] to the merge -area. Then we search A for where B[0] belongs, and similarly move a -slice of A in one chunk. Then back to searching B for where A[0] belongs, -etc. We stay in galloping mode until both searches find slices to copy -less than MIN_GALLOP elements long, at which point we go back to one-pair- -at-a-time mode. - -A refinement: The MergeState struct contains the value of min_gallop that -controls when we enter galloping mode, initialized to MIN_GALLOP. -merge_lo() and merge_hi() adjust this higher when galloping isn't paying -off, and lower when it is. - - -Galloping ---------- -Still without loss of generality, assume A is the shorter run. In galloping -mode, we first look for A[0] in B. We do this via "galloping", comparing -A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding -the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most -roughly lg(B) comparisons, and, unlike a straight binary search, favors -finding the right spot early in B (more on that later). - -After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 -consecutive elements, and a straight binary search requires exactly k-1 -additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we -copy all the B's up to that point in one chunk, and then copy A[0]. Note -that no matter where A[0] belongs in B, the combination of galloping + binary -search finds it in no more than about 2*lg(B) comparisons. - -If we did a straight binary search, we could find it in no more than -ceiling(lg(B+1)) comparisons -- but straight binary search takes that many -comparisons no matter where A[0] belongs. Straight binary search thus loses -to galloping unless the run is quite long, and we simply can't guess -whether it is in advance. - -If data is random and runs have the same length, A[0] belongs at B[0] half -the time, at B[1] a quarter of the time, and so on: a consecutive winning -sub-run in B of length k occurs with probability 1/2**(k+1). So long -winning sub-runs are extremely unlikely in random data, and guessing that a -winning sub-run is going to be long is a dangerous game. - -OTOH, if data is lopsided or lumpy or contains many duplicates, long -stretches of winning sub-runs are very likely, and cutting the number of -comparisons needed to find one from O(B) to O(log B) is a huge win. - -Galloping compromises by getting out fast if there isn't a long winning -sub-run, yet finding such very efficiently when they exist. - -I first learned about the galloping strategy in a related context; see: - - "Adaptive Set Intersections, Unions, and Differences" (2000) - Erik D. Demaine, Alejandro L�pez-Ortiz, J. Ian Munro - -and its followup(s). An earlier paper called the same strategy -"exponential search": - - "Optimistic Sorting and Information Theoretic Complexity" - Peter McIlroy - SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp - 467-474, Austin, Texas, 25-27 January 1993. - -and it probably dates back to an earlier paper by Bentley and Yao. The -McIlroy paper in particular has good analysis of a mergesort that's -probably strongly related to this one in its galloping strategy. - - -Galloping with a Broken Leg ---------------------------- -So why don't we always gallop? Because it can lose, on two counts: - -1. While we're willing to endure small per-merge overheads, per-comparison - overheads are a different story. Calling Yet Another Function per - comparison is expensive, and gallop_left() and gallop_right() are - too long-winded for sane inlining. - -2. Galloping can-- alas --require more comparisons than linear one-at-time - search, depending on the data. - -#2 requires details. If A[0] belongs before B[0], galloping requires 1 -compare to determine that, same as linear search, except it costs more -to call the gallop function. If A[0] belongs right before B[1], galloping -requires 2 compares, again same as linear search. On the third compare, -galloping checks A[0] against B[3], and if it's <=, requires one more -compare to determine whether A[0] belongs at B[2] or B[3]. That's a total -of 4 compares, but if A[0] does belong at B[2], linear search would have -discovered that in only 3 compares, and that's a huge loss! Really. It's -an increase of 33% in the number of compares needed, and comparisons are -expensive in Python. - -index in B where # compares linear # gallop # binary gallop -A[0] belongs search needs compares compares total ----------------- ----------------- -------- -------- ------ - 0 1 1 0 1 - - 1 2 2 0 2 - - 2 3 3 1 4 - 3 4 3 1 4 - - 4 5 4 2 6 - 5 6 4 2 6 - 6 7 4 2 6 - 7 8 4 2 6 - - 8 9 5 3 8 - 9 10 5 3 8 - 10 11 5 3 8 - 11 12 5 3 8 - ... - -In general, if A[0] belongs at B[i], linear search requires i+1 comparisons -to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons. -The advantage of galloping is unbounded as i grows, but it doesn't win at -all until i=6. Before then, it loses twice (at i=2 and i=4), and ties -at the other values. At and after i=6, galloping always wins. - -We can't guess in advance when it's going to win, though, so we do one pair -at a time until the evidence seems strong that galloping may pay. MIN_GALLOP -is 7, and that's pretty strong evidence. However, if the data is random, it -simply will trigger galloping mode purely by luck every now and again, and -it's quite likely to hit one of the losing cases next. On the other hand, -in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it -"should be" then. So the MergeState struct keeps a min_gallop variable -that merge_lo and merge_hi adjust: the longer we stay in galloping mode, -the smaller min_gallop gets, making it easier to transition back to -galloping mode (if we ever leave it in the current merge, and at the -start of the next merge). But whenever the gallop loop doesn't pay, -min_gallop is increased by one, making it harder to transition back -to galloping mode (and again both within a merge and across merges). For -random data, this all but eliminates the gallop penalty: min_gallop grows -large enough that we almost never get into galloping mode. And for cases -like ~sort, min_gallop can fall to as low as 1. This seems to work well, -but in all it's a minor improvement over using a fixed MIN_GALLOP value. - - -Galloping Complication ----------------------- -The description above was for merge_lo. merge_hi has to merge "from the -other end", and really needs to gallop starting at the last element in a run -instead of the first. Galloping from the first still works, but does more -comparisons than it should (this is significant -- I timed it both ways). For -this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT) -functions have a "hint" argument, which is the index at which galloping -should begin. So galloping can actually start at any index, and proceed at -offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index. - -In the code as I type it's always called with either 0 or n-1 (where n is -the # of elements in a run). It's tempting to try to do something fancier, -melding galloping with some form of interpolation search; for example, if -we're merging a run of length 1 with a run of length 10000, index 5000 is -probably a better guess at the final result than either 0 or 9999. But -it's unclear how to generalize that intuition usefully, and merging of -wildly unbalanced runs already enjoys excellent performance. - -~sort is a good example of when balanced runs could benefit from a better -hint value: to the extent possible, this would like to use a starting -offset equal to the previous value of acount/bcount. Doing so saves about -10% of the compares in ~sort. However, doing so is also a mixed bag, -hurting other cases. - - -Comparing Average # of Compares on Random Arrays ------------------------------------------------- -[NOTE: This was done when the new algorithm used about 0.1% more compares - on random data than does its current incarnation.] - -Here list.sort() is samplesort, and list.msort() this sort: - -""" -import random -from time import clock as now - -def fill(n): - from random import random - return [random() for i in xrange(n)] - -def mycmp(x, y): - global ncmp - ncmp += 1 - return cmp(x, y) - -def timeit(values, method): - global ncmp - X = values[:] - bound = getattr(X, method) - ncmp = 0 - t1 = now() - bound(mycmp) - t2 = now() - return t2-t1, ncmp - -format = "%5s %9.2f %11d" -f2 = "%5s %9.2f %11.2f" - -def drive(): - count = sst = sscmp = mst = mscmp = nelts = 0 - while True: - n = random.randrange(100000) - nelts += n - x = fill(n) - - t, c = timeit(x, 'sort') - sst += t - sscmp += c - - t, c = timeit(x, 'msort') - mst += t - mscmp += c - - count += 1 - if count % 10: - continue - - print "count", count, "nelts", nelts - print format % ("sort", sst, sscmp) - print format % ("msort", mst, mscmp) - print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp) - -drive() -""" - -I ran this on Windows and kept using the computer lightly while it was -running. time.clock() is wall-clock time on Windows, with better than -microsecond resolution. samplesort started with a 1.52% #-of-comparisons -disadvantage, fell quickly to 1.48%, and then fluctuated within that small -range. Here's the last chunk of output before I killed the job: - -count 2630 nelts 130906543 - sort 6110.80 1937887573 -msort 6002.78 1909389381 - 1.80 1.49 - -We've done nearly 2 billion comparisons apiece at Python speed there, and -that's enough <wink>. - -For random arrays of size 2 (yes, there are only 2 interesting ones), -samplesort has a 50%(!) comparison disadvantage. This is a consequence of -samplesort special-casing at most one ascending run at the start, then -falling back to the general case if it doesn't find an ascending run -immediately. The consequence is that it ends up using two compares to sort -[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be -taught how to deal with mixtures of ascending and descending runs -efficiently in all cases. - - -NOTES ------ - -BINSORT -A "binary insertion sort" is just like a textbook insertion sort, but instead -of locating the correct position of the next item via linear (one at a time) -search, an equivalent to Python's bisect.bisect_right is used to find the -correct position in logarithmic time. Most texts don't mention this -variation, and those that do usually say it's not worth the bother: insertion -sort remains quadratic (expected and worst cases) either way. Speeding the -search doesn't reduce the quadratic data movement costs. - -But in CPython's case, comparisons are extraordinarily expensive compared to -moving data, and the details matter. Moving objects is just copying -pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary -user-supplied Python code), but even in simple cases (like 3 < 4) _all_ -decisions are made at runtime: what's the type of the left comparand? the -type of the right? do they need to be coerced to a common type? where's the -code to compare these types? And so on. Even the simplest Python comparison -triggers a large pile of C-level pointer dereferences, conditionals, and -function calls. - -So cutting the number of compares is almost always measurably helpful in -CPython, and the savings swamp the quadratic-time data movement costs for -reasonable minrun values. - - -LEFT OR RIGHT -gallop_left() and gallop_right() are akin to the Python bisect module's -bisect_left() and bisect_right(): they're the same unless the slice they're -searching contains a (at least one) value equal to the value being searched -for. In that case, gallop_left() returns the position immediately before the -leftmost equal value, and gallop_right() the position immediately after the -rightmost equal value. The distinction is needed to preserve stability. In -general, when merging adjacent runs A and B, gallop_left is used to search -thru B for where an element from A belongs, and gallop_right to search thru A -for where an element from B belongs. - - -REGION OF UNCERTAINTY -Two kinds of confusion seem to be common about the claim that after finding -a k such that - - B[2**(k-1) - 1] < A[0] <= B[2**k - 1] - -then a binary search requires exactly k-1 tries to find A[0]'s proper -location. For concreteness, say k=3, so B[3] < A[0] <= B[7]. - -The first confusion takes the form "OK, then the region of uncertainty is at -indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 = -3"; or the region is viewed as a Python slice and the objection is "but that's -the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already -compared A[0] against B[3] and against B[7], so A[0]'s correct location is -already known wrt _both_ endpoints. What remains is to find A[0]'s correct -location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the -slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1 -inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has - (2**k-1)-1 - 2**(k-1) + 1 = - 2**k-1 - 2**(k-1) = - 2*2**k-1 - 2**(k-1) = - (2-1)*2**(k-1) - 1 = - 2**(k-1) - 1 -elements. - -The second confusion: "k-1 = 2 binary searches can find the correct location -among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we -could make this more efficient by arranging for the region of uncertainty to -span 2**(k-1) elements." Resolution: that confuses "elements" with -"locations". In a slice with N elements, there are N+1 _locations_. In the -example, with the region of uncertainty B[4], B[5], B[6], there are 4 -locations: before B[4], between B[4] and B[5], between B[5] and B[6], and -after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1) -locations. That's why k-1 binary searches are necessary and sufficient. diff --git a/contrib/tools/python/src/Objects/lnotab_notes.txt b/contrib/tools/python/src/Objects/lnotab_notes.txt deleted file mode 100644 index 515375772e6..00000000000 --- a/contrib/tools/python/src/Objects/lnotab_notes.txt +++ /dev/null @@ -1,129 +0,0 @@ -All about co_lnotab, the line number table. - -Code objects store a field named co_lnotab. This is an array of unsigned bytes -disguised as a Python string. It is used to map bytecode offsets to source code -line #s for tracebacks and to identify line number boundaries for line tracing. - -The array is conceptually a compressed list of - (bytecode offset increment, line number increment) -pairs. The details are important and delicate, best illustrated by example: - - byte code offset source code line number - 0 1 - 6 2 - 50 7 - 350 207 - 361 208 - -Instead of storing these numbers literally, we compress the list by storing only -the difference from one row to the next. Conceptually, the stored list might -look like: - - 0, 1, 6, 1, 44, 5, 300, 200, 11, 1 - -The above doesn't really work, but it's a start. An unsigned byte (byte code -offset) can't hold negative values, or values larger than 255, a signed byte -(line number) can't hold values larger than 127 or less than -128, and the -above example contains two such values. So we make two tweaks: - - (a) there's a deep assumption that byte code offsets increase monotonically, - and - (b) if byte code offset jumps by more than 255 from one row to the next, or if - source code line number jumps by more than 127 or less than -128 from one row - to the next, more than one pair is written to the table. In case #b, - there's no way to know from looking at the table later how many were written. - That's the delicate part. A user of co_lnotab desiring to find the source - line number corresponding to a bytecode address A should do something like - this: - - lineno = addr = 0 - for addr_incr, line_incr in co_lnotab: - addr += addr_incr - if addr > A: - return lineno - if line_incr >= 0x80: - line_incr -= 0x100 - lineno += line_incr - -(In C, this is implemented by PyCode_Addr2Line().) In order for this to work, -when the addr field increments by more than 255, the line # increment in each -pair generated must be 0 until the remaining addr increment is < 256. So, in -the example above, assemble_lnotab in compile.c should not (as was actually done -until 2.2) expand 300, 200 to - 255, 255, 45, 45, -but to - 255, 0, 45, 128, 0, 72. - -The above is sufficient to reconstruct line numbers for tracebacks, but not for -line tracing. Tracing is handled by PyCode_CheckLineNumber() in codeobject.c -and maybe_call_line_trace() in ceval.c. - -*** Tracing *** - -To a first approximation, we want to call the tracing function when the line -number of the current instruction changes. Re-computing the current line for -every instruction is a little slow, though, so each time we compute the line -number we save the bytecode indices where it's valid: - - *instr_lb <= frame->f_lasti < *instr_ub - -is true so long as execution does not change lines. That is, *instr_lb holds -the first bytecode index of the current line, and *instr_ub holds the first -bytecode index of the next line. As long as the above expression is true, -maybe_call_line_trace() does not need to call PyCode_CheckLineNumber(). Note -that the same line may appear multiple times in the lnotab, either because the -bytecode jumped more than 255 indices between line number changes or because -the compiler inserted the same line twice. Even in that case, *instr_ub holds -the first index of the next line. - -However, we don't *always* want to call the line trace function when the above -test fails. - -Consider this code: - -1: def f(a): -2: while a: -3: print 1, -4: break -5: else: -6: print 2, - -which compiles to this: - - 2 0 SETUP_LOOP 19 (to 22) - >> 3 LOAD_FAST 0 (a) - 6 POP_JUMP_IF_FALSE 17 - - 3 9 LOAD_CONST 1 (1) - 12 PRINT_ITEM - - 4 13 BREAK_LOOP - 14 JUMP_ABSOLUTE 3 - >> 17 POP_BLOCK - - 6 18 LOAD_CONST 2 (2) - 21 PRINT_ITEM - >> 22 LOAD_CONST 0 (None) - 25 RETURN_VALUE - -If 'a' is false, execution will jump to the POP_BLOCK instruction at offset 17 -and the co_lnotab will claim that execution has moved to line 4, which is wrong. -In this case, we could instead associate the POP_BLOCK with line 5, but that -would break jumps around loops without else clauses. - -We fix this by only calling the line trace function for a forward jump if the -co_lnotab indicates we have jumped to the *start* of a line, i.e. if the current -instruction offset matches the offset given for the start of a line by the -co_lnotab. For backward jumps, however, we always call the line trace function, -which lets a debugger stop on every evaluation of a loop guard (which usually -won't be the first opcode in a line). - -Why do we set f_lineno when tracing, and only just before calling the trace -function? Well, consider the code above when 'a' is true. If stepping through -this with 'n' in pdb, you would stop at line 1 with a "call" type event, then -line events on lines 2, 3, and 4, then a "return" type event -- but because the -code for the return actually falls in the range of the "line 6" opcodes, you -would be shown line 6 during this event. This is a change from the behaviour in -2.2 and before, and I've found it confusing in practice. By setting and using -f_lineno when tracing, one can report a line number different from that -suggested by f_lasti on this one occasion where it's desirable. diff --git a/contrib/tools/python/src/Objects/stringlib/README.txt b/contrib/tools/python/src/Objects/stringlib/README.txt deleted file mode 100644 index ab506d60f94..00000000000 --- a/contrib/tools/python/src/Objects/stringlib/README.txt +++ /dev/null @@ -1,40 +0,0 @@ -bits shared by the stringobject and unicodeobject implementations (and -possibly other modules, in a not too distant future). - -the stuff in here is included into relevant places; see the individual -source files for details. - --------------------------------------------------------------------- -the following defines used by the different modules: - -STRINGLIB_CHAR - - the type used to hold a character (char or Py_UNICODE) - -STRINGLIB_EMPTY - - a PyObject representing the empty string, only to be used if - STRINGLIB_MUTABLE is 0 - -Py_ssize_t STRINGLIB_LEN(PyObject*) - - returns the length of the given string object (which must be of the - right type) - -PyObject* STRINGLIB_NEW(STRINGLIB_CHAR*, Py_ssize_t) - - creates a new string object - -STRINGLIB_CHAR* STRINGLIB_STR(PyObject*) - - returns the pointer to the character data for the given string - object (which must be of the right type) - -int STRINGLIB_CHECK_EXACT(PyObject *) - - returns true if the object is an instance of our type, not a subclass - -STRINGLIB_MUTABLE - - must be 0 or 1 to tell the cpp macros in stringlib code if the object - being operated on is mutable or not |
