author | apollo1321 <apollo1321@yandex-team.com> | 2025-05-01 13:35:41 +0300 |
---|---|---|
committer | apollo1321 <apollo1321@yandex-team.com> | 2025-05-01 13:51:43 +0300 |
commit | e1b2809b60d8b79857b8515832a51056101516e2 (patch) | |
tree | fc29e43b2c99a914d90745e576f2f1d39f7b6040 /contrib/python/matplotlib/py2/src | |
parent | 4463ac0859eddd33f1129c64ec620719ef364cca (diff) | |
YT-10317: Simplify unordered chunk pool slicing algorithm
This PR simplifies the calculation of `data_weight_per_job` within the `TUnorderedChunkPool`.
**Current Workflow:**
1. **TJobSizeConstraints:**
   - Users define constraints in the job specification, such as `data_weight_per_job`, `job_count`, etc.
   - These user constraints are transformed into `job_count`.
   - `data_weight_per_job` is then calculated based on this `job_count`.
2. **TUnorderedChunkPool:**
   - Within this pool, `data_weight_per_job` is again transformed into `job_count`.
   - The ideal `data_weight_per_job` for slicing is calculated as `remaining_data_weight / remaining_job_count`.
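The round trip described above can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ code: the function names are hypothetical, and the use of ceiling division is an assumption about how the conversions are rounded.

```python
import math

# Hypothetical helpers; names and rounding rules are assumptions for illustration.

def derive_job_count(total_data_weight, data_weight_per_job):
    """Constraint -> job_count (assumed to round up)."""
    return max(1, math.ceil(total_data_weight / data_weight_per_job))

def derive_data_weight_per_job(total_data_weight, job_count):
    """job_count -> data_weight_per_job (assumed to round up)."""
    return math.ceil(total_data_weight / job_count)

def ideal_slice_weight(remaining_data_weight, remaining_job_count):
    """Target the pool re-derives before building each job."""
    return remaining_data_weight / max(1, remaining_job_count)

total = 2100          # hypothetical total input data weight
user_weight = 400     # hypothetical user-specified data_weight_per_job

# Step 1 (TJobSizeConstraints): weight -> count -> weight.
job_count = derive_job_count(total, user_weight)               # 6
derived_weight = derive_data_weight_per_job(total, job_count)  # 350

# Step 2 (TUnorderedChunkPool): weight -> count again, then a moving target.
pool_job_count = derive_job_count(total, derived_weight)       # 6
first_target = ideal_slice_weight(total, pool_job_count)       # 350.0
# If the first job ends up with 450, the next target shrinks.
second_target = ideal_slice_weight(total - 450, pool_job_count - 1)  # 330.0
```

Under these assumptions, the value the user asked for (400) is already replaced by 350 before slicing starts, and it keeps moving once jobs overshoot their target.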
**Proposed Changes:**
This PR simplifies the algorithm by using the `data_weight_per_job` from `TJobSizeConstraints` directly in the `TUnorderedChunkPool`. With the previous approach, the effective `data_weight_per_job` could drift up or down as slicing progressed. For instance, with an initial `data_weight_per_job` of `400`, the previous algorithm might split inputs into jobs with data weights of `[433, 433, 394, 394, 394]`. The updated algorithm keeps the target fixed, so the same input is split into `[433, 433, 433, 433, 316]`: every job except the last has the same size.
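The difference between the two strategies can be illustrated with a toy model. This is a minimal sketch, not the pool's real logic: it assumes indivisible input slices of a hypothetical weight of 150 each, a job that closes as soon as it reaches its target, and ceiling division for the job count. It is not meant to reproduce the exact figures quoted above.

```python
import math

def slice_fixed(slices, data_weight_per_job):
    """Updated behaviour (as described): one fixed target for every job."""
    jobs, current = [], 0
    for weight in slices:
        current += weight
        if current >= data_weight_per_job:
            jobs.append(current)
            current = 0
    if current:
        jobs.append(current)  # leftover tail job
    return jobs

def slice_adaptive(slices, data_weight_per_job):
    """Previous behaviour (as described): target = remaining weight / remaining jobs."""
    remaining_weight = sum(slices)
    # Assumption: the job count is derived by rounding up.
    remaining_jobs = max(1, math.ceil(remaining_weight / data_weight_per_job))
    target = remaining_weight / remaining_jobs
    jobs, current = [], 0
    for weight in slices:
        current += weight
        if current >= target:
            jobs.append(current)
            remaining_weight -= current
            remaining_jobs = max(1, remaining_jobs - 1)
            target = remaining_weight / remaining_jobs  # target moves after each job
            current = 0
    if current:
        jobs.append(current)
    return jobs

slices = [150] * 14  # hypothetical input, total weight 2100

fixed_jobs = slice_fixed(slices, 400)        # [450, 450, 450, 450, 300]
adaptive_jobs = slice_adaptive(slices, 400)  # [450, 450, 300, 300, 300, 300]
```

In this model the adaptive variant starts with larger jobs and then shrinks them as the remaining weight is spread over the remaining job count, while the fixed variant produces equal jobs plus one smaller tail job.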
**Additional Notes:**
- The current algorithm has special handling for the AutoMerge task, using `data_weight_per_job` directly from `TJobSizeConstraints`.
- Although the current algorithm may be faster in some specific scenarios, it does not help consistently. To reduce tail latency in operations more effectively, it is preferable to use a job splitting mechanism.
- The simplified logic makes it easier to introduce slicing based on compressed data size in the future, which the old approach would complicate.
commit_hash:2d450fb007e35c6a59dc136f504e2e77f46db625
Diffstat (limited to 'contrib/python/matplotlib/py2/src')
0 files changed, 0 insertions, 0 deletions