path: root/library/cpp/yt/rseq
Commit message | Author | Age | Files | Lines
* YT-28458: Make per-CPU rseq fast path dlopen-safe | babenko | 7 days | 4 | -8/+118

Hardens `library/cpp/yt/rseq` for the case where it is linked into a dlopen'd, position-independent module (e.g. a YQL UDF `.so`). Extracted from the profiling work that enables the rseq fast path by default.

**TLS model.** The weak `__rseq_abi` gets the `global-dynamic` TLS model under `__PIC__`/`__PIE__` (`initial-exec` otherwise), mirroring `contrib/libs/tcmalloc`. `initial-exec` needs a slot in the static TLS block reserved at startup, which the loader cannot grant to a module dlopen'd later: the module would fail to load with "cannot allocate memory in static TLS block". This only changes the cold `&__rseq_abi` accesses; the hot path still reads `*(thread_pointer + CpuIdFieldOffset)`.

**Runtime safety probe `IsPerCpuFastPathSafe()`.** The cached thread-pointer offset is valid only when `__rseq_abi` sits at a fixed offset from the thread pointer, that is, in a glibc-owned area or in the static TLS block (incl. tcmalloc), which is the common case. When our `__rseq_abi` instead lands in a dlopen'd module's *dynamically allocated* TLS, the offset is valid only on the thread that computed it; on other threads the hot path's first store (`area->rseq_cs`) would corrupt unrelated memory. The probe spawns one thread and checks, by pointer comparison and without ever dereferencing the suspect offset, that the offset names that thread's rseq area; if not, callers use the atomic fallback. The result is decided once and cached (one thread spawn at first use).

commit_hash:633f58f500d9d097800da81f526c56283445ffc7
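The probe described in the message above can be illustrated with a minimal sketch. It is not the library's code: `DemoArea`, `ThreadAnchor()`, and `IsOffsetSafeDemo()` are hypothetical names, and `pthread_self()` stands in for the thread pointer on the assumption (true for glibc and musl, as far as we know) that it sits at a fixed distance from it. The TLS-model selection follows the `__PIC__`/`__PIE__` rule from the message.

```cpp
#include <cassert>
#include <cstdint>
#include <pthread.h>
#include <thread>

// Pick the TLS model as the commit message describes (assumed macro name).
#if defined(__PIC__) || defined(__PIE__)
#define DEMO_TLS_MODEL "global-dynamic"
#else
#define DEMO_TLS_MODEL "initial-exec"
#endif

// Hypothetical stand-in for the per-thread rseq area.
thread_local std::uint32_t DemoArea[8] __attribute__((tls_model(DEMO_TLS_MODEL)));

// Assumption: pthread_self() is a fixed distance from the thread pointer.
std::uintptr_t ThreadAnchor() {
    return (std::uintptr_t)pthread_self();
}

std::uintptr_t AreaAddress() {
    return reinterpret_cast<std::uintptr_t>(&DemoArea[0]);
}

// Pure pointer comparison: nothing at (anchor + offset) is dereferenced.
bool OffsetNamesThisThreadsArea(std::uintptr_t offset) {
    return ThreadAnchor() + offset == AreaAddress();
}

// Decided once and cached; costs one thread spawn on first use.
bool IsOffsetSafeDemo() {
    static const bool safe = [] {
        const std::uintptr_t offset = AreaAddress() - ThreadAnchor();
        // The offset is always valid on the thread that computed it.
        assert(OffsetNamesThisThreadsArea(offset));
        bool ok = false;
        std::thread probe([&] { ok = OffsetNamesThisThreadsArea(offset); });
        probe.join();
        return ok;
    }();
    return safe;
}
```

A caller would consult `IsOffsetSafeDemo()` before taking the offset-based path and use the atomic fallback when it returns `false`.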
* Add lock-free per-CPU primitives to library/cpp/yt/rseq | babenko | 10 days | 7 | -3/+781

Introduce AddPerCpu and StorePerCpu over an rseq-sharded per-CPU array. On the x86-64 Linux fast path the update is committed by a hand-rolled rseq critical section (non-atomic, migration-safe): addq for the 8-byte accumulate, movq / movdqu for the 8- or 16-byte store. The kernel restarts the sequence on preemption or migration, and only one thread runs on a CPU at a time, so no atomic or lock is needed. Off the fast path (other arches, no kernel rseq) the operation falls back to an atomic on the slot indexed by sched_getcpu(). A naturally-aligned 8-byte store is single-copy atomic on x86-64, so it is never observed torn; the 16-byte store may be, which is acceptable for a last-writer-wins gauge.

commit_hash:6250f6e9e35cf3895ebafe0b534ec12cca50b03b
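The fallback path described above (an atomic on the slot indexed by sched_getcpu()) can be sketched as follows. This is only an illustration under assumptions: `ShardedCounterDemo` and `PaddedSlot` are hypothetical names, the slot count is chosen by the caller, and the rseq critical section of the fast path is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__linux__)
#include <sched.h>
#endif

// One slot per shard, padded to reduce false sharing between CPUs.
struct alignas(64) PaddedSlot {
    std::atomic<std::int64_t> Value{0};
};

class ShardedCounterDemo {
public:
    explicit ShardedCounterDemo(std::size_t slots) : Slots_(slots) {
        assert(slots > 0);
    }

    // Fallback-style add: a relaxed atomic on the current CPU's slot.
    void Add(std::int64_t delta) {
        Slots_[CurrentSlot()].Value.fetch_add(delta, std::memory_order_relaxed);
    }

    // Readers aggregate across all slots.
    std::int64_t Sum() const {
        std::int64_t total = 0;
        for (const auto& slot : Slots_) {
            total += slot.Value.load(std::memory_order_relaxed);
        }
        return total;
    }

private:
    std::size_t CurrentSlot() const {
#if defined(__linux__)
        const int cpu = sched_getcpu();
        return cpu >= 0 ? static_cast<std::size_t>(cpu) % Slots_.size() : 0;
#else
        return 0;  // no CPU id source: everything lands in slot 0
#endif
    }

    std::vector<PaddedSlot> Slots_;
};
```

The atomic is needed here because a thread may migrate between reading the CPU id and performing the update; the rseq fast path avoids it because the kernel restarts the sequence in that case.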
* Make library/cpp/yt/rseq a Linux-only dependency of library/cpp/yt/system | babenko | 2026-06-15 | 1 | -0/+4

Make library/cpp/yt/rseq a Linux-only dependency of library/cpp/yt/system

commit_hash:7d6f5e738658447529440425b55b2891f6664d81
* Fix rseq fast path on glibc < 2.35: read the shared __rseq_abi area | babenko | 2026-06-14 | 1 | -28/+43

The own-area approach did not deliver the fast path on glibc 2.31 (YT's current runtime). There tcmalloc registers the conventional `__rseq_abi` area for every thread; our attempt to register a separate area was rejected by the kernel with EINVAL (a thread may have only one rseq area), so `cpu_id` stayed -1 and every `GetCurrentCpuId()` fell back to `sched_getcpu()` (~17-20 ns, slower than the rdtscp it replaced).

Read the shared `__rseq_abi` symbol instead -- the area tcmalloc, librseq and pre-2.35 glibc all register. Our definition is weak, so it coalesces with theirs when present (the common case -- tcmalloc owns it) and stands alone otherwise (e.g. musl), with us registering it. We register with the conventional signature `0x53053053` and size 32, so re-registering an already-registered area returns EBUSY (treated as success) rather than EINVAL -- coexisting cleanly with tcmalloc. glibc >= 2.35 still takes the `__rseq_offset` path unchanged.

Measured on sas2-2769 (glibc 2.31 + tcmalloc): `GetCurrentCpuId()` 20.0 ns -> 0.60 ns. Verified via strace that our registration now returns EBUSY against tcmalloc's `__rseq_abi` (was EINVAL against a separate area).

commit_hash:509809deeb5f7c671817fcd9ebcc8499eabf096e
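The registration rule above (signature `0x53053053`, size 32, EBUSY treated as success) might look roughly like the sketch below. All names here (`RseqAreaDemo`, `DemoRseqArea`, `IsRegistrationUsable`, `TryRegisterCurrentThread`) are hypothetical, and the struct is a hand-written approximation of the original 32-byte rseq layout, not the library's definition. Note that the demo registers its own separate area, so in a process where another area is already registered it would hit exactly the EINVAL case the message describes; the actual fix shares the `__rseq_abi` symbol instead.

```cpp
#include <cerrno>
#include <cstdint>
#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Approximation of the original 32-byte rseq area (assumed layout).
struct alignas(32) RseqAreaDemo {
    std::uint32_t CpuIdStart;
    std::uint32_t CpuId;      // stays 0xFFFFFFFF (-1) until registered
    std::uint64_t RseqCs;
    std::uint32_t Flags;
    std::uint32_t Padding[3];
};
static_assert(sizeof(RseqAreaDemo) == 32, "original rseq ABI size is 32 bytes");

constexpr std::uint32_t RseqSignatureDemo = 0x53053053;

thread_local RseqAreaDemo DemoRseqArea = {0, 0xFFFFFFFFu, 0, 0, {0, 0, 0}};

// EBUSY means this exact area is already registered, which is fine for us.
bool IsRegistrationUsable(long rc, int err) {
    return rc == 0 || (rc == -1 && err == EBUSY);
}

bool TryRegisterCurrentThread() {
#if defined(__linux__) && defined(__NR_rseq)
    errno = 0;
    const long rc = syscall(__NR_rseq, &DemoRseqArea, sizeof(DemoRseqArea),
                            0, RseqSignatureDemo);
    return IsRegistrationUsable(rc, errno);
#else
    return false;  // no rseq syscall available: caller takes the slow path
#endif
}
```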
* Add library/cpp/yt/rseq: NYT::GetCurrentCpuId() via Linux rseq | babenko | 2026-06-14 | 4 | -0/+197

Self-contained current-CPU-id reader backed by Linux **rseq** (restartable sequences), with **no third-party dependency** (no librseq):

* The rseq ABI is hand-defined; the calling thread is registered lazily via the rseq syscall.
* Fast path is a single inlined, **branch-free** thread-local read. The offset always points at a readable `cpu_id` -- the glibc-owned area when glibc registers rseq (>= 2.35, via the weak `__rseq_offset`/`__rseq_size`), otherwise our own area -- so an unregistered thread reads `-1` and routes to the slow path.
* Falls back to `sched_getcpu()` (Linux) or `0` (darwin/windows).

Works on glibc **and musl** alike (librseq does not build on musl).

Fiber-TLS contract: the inlined read must be reached only via a non-inlinable, fiber-switch-free frame (a virtual call or `YT_PREVENT_TLS_CACHING`).

#### Benchmark -- cost of one cpu-id read

| source | time / call |
|---|---|
| `GetCurrentCpuId()` (rseq) | **0.34 ns** |
| `sched_getcpu()` (vDSO) | 3.5 ns |
| `rdtscp` (what `TTscp::Get()` does) | 23 ns |

This is an alternative to the librseq-based review/13886037 -- same speed, but no contrib dependency and it also covers musl.

The unit test pins to each allowed CPU and asserts the reported id matches.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit_hash:09d282c2f48755836b1cd68cedbffc3c6a662eed
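The read-then-fallback chain in the list above can be sketched as below. This is a simplified illustration, not the library's implementation: `DemoCpuIdField` is a hypothetical thread-local standing in for the `cpu_id` field of the rseq area (the real code reads it at an offset from the thread pointer), and `GetCurrentCpuIdDemo()` and `SlowPathCpuId()` are assumed names.

```cpp
#include <cstdint>
#if defined(__linux__)
#include <sched.h>
#endif

// Stand-in for the rseq cpu_id field; -1 means "thread not registered".
thread_local std::int32_t DemoCpuIdField = -1;

// Slow path: sched_getcpu() on Linux, 0 elsewhere (as in the message).
int SlowPathCpuId() {
#if defined(__linux__)
    const int cpu = sched_getcpu();
    return cpu >= 0 ? cpu : 0;
#else
    return 0;
#endif
}

// The read itself is a plain thread-local load; only a negative value
// (an unregistered thread) routes to the slow path.
inline int GetCurrentCpuIdDemo() {
    const std::int32_t id = DemoCpuIdField;
    return id >= 0 ? id : SlowPathCpuId();
}
```

In the real library the kernel keeps `cpu_id` up to date once the thread is registered; here the field would have to be set by hand to exercise the first branch.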