
gh-146073: Add fitness/exit quality mechanism for JIT trace frontend #147966

Draft
cocolato wants to merge 9 commits into python:main from cocolato:jit-tracer-fitness

Conversation

@cocolato
Member

@cocolato cocolato commented Apr 1, 2026

Background

See: #146073

This introduces a preliminary fitness/exit quality mechanism for the JIT trace frontend, enabling the tracer to:

  • Stop proactively: Stop at the appropriate time without waiting for the buffer to fill up or hitting a dead end
  • Stop at good locations: Prioritize stopping at the entry points of existing executors (ENTER_EXECUTOR) to avoid stopping on instructions that can be optimized
  • Control frame depth: Apply a penalty that increases with depth for function call inlining
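The three behaviors above can be sketched as a fitness budget that is spent as tracing proceeds. All names and constants below are hypothetical illustrations of the idea, not the PR's actual code:

```c
#include <stdint.h>

/* Illustrative constants in the spirit of the PR's defaults. */
#define FITNESS_PER_INSTRUCTION 2
#define FITNESS_INITIAL (800 * FITNESS_PER_INSTRUCTION)

/* Spend fitness for one traced instruction plus any event penalty
 * (branch bias, frame push/pop, etc.). */
static int32_t
spend_fitness(int32_t fitness, int32_t instr_cost, int32_t penalty)
{
    return fitness - instr_cost - penalty;
}

/* Stop proactively when the budget is exhausted, and prefer stopping
 * early at an ENTER_EXECUTOR, since that is a "good" exit that already
 * has an executor attached. */
static int
should_stop_tracing(int32_t fitness, int is_enter_executor)
{
    if (fitness <= 0) {
        return 1;
    }
    return is_enter_executor && fitness < FITNESS_INITIAL / 4;
}
```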

Member

@Fidget-Spinner Fidget-Spinner left a comment


Great work. Thanks for doing this!

@Fidget-Spinner
Member

I also forgot to mention: we should adjust for the following:

  1. Penalize frame depth underflow more so than normal frame push/pop.
  2. Stop tracing when we hit MAX_ABSTRACT_FRAME_DEPTH, as we can't optimize it anyways.
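A sketch of those two adjustments (constant values and names are illustrative assumptions, not the PR's):

```c
#include <stdint.h>

#define MAX_ABSTRACT_FRAME_DEPTH 10   /* illustrative cap */
#define FRAME_PENALTY 50              /* cost of a normal frame push/pop */
#define UNDERFLOW_MULTIPLIER 4        /* underflow is penalized harder */

/* Returns the fitness penalty for a frame event, or -1 to signal that
 * tracing must stop because the depth cap was reached (the optimizer
 * can't do anything useful past it anyway). */
static int32_t
frame_event_penalty(int depth_after, int is_underflow)
{
    if (depth_after >= MAX_ABSTRACT_FRAME_DEPTH) {
        return -1;  /* stop tracing: beyond the abstract frame depth cap */
    }
    if (is_underflow) {
        /* Returning below the trace's entry frame costs more than an
         * ordinary push/pop. */
        return FRAME_PENALTY * UNDERFLOW_MULTIPLIER;
    }
    return FRAME_PENALTY;
}
```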

Member

@markshannon markshannon left a comment


This looks good, and thanks for doing this.

Overall, the approach looks good.

I originally had in mind that the fitness would reduce geometrically, not arithmetically, but your arithmetic approach looks easier to reason about and at least as good in terms of trace quality.

We need to reduce the number of configurable parameters to just one or two.
Then we can make sure that fitness and penalties are set such that we are guaranteed not to overflow the trace or optimizer buffers.

I'm not sure that rewinding is worth it. As long as "good" exits have a much higher score than "bad" exits, then we should (almost) always end up at a good exit.

/* Default fitness configuration values for trace quality control.
* These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */
#define FITNESS_PER_INSTRUCTION 2
#define FITNESS_INITIAL (UOP_MAX_TRACE_LENGTH * FITNESS_PER_INSTRUCTION)
Member


I'd stick with 2000. We really don't need traces of over 1000 uops; it is far more important to find a good stopping point than to have very long traces.

Member


@markshannon 1000 uops is only roughly 300 uops after optimization, which corresponds to roughly only 60 bytecode instructions. That's too low. A reminder that trace recording spews out a lot of uops that are only optimized away later by the optimizer.
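Making that back-of-envelope arithmetic explicit (the 30% survival rate and ~5 uops per bytecode are rough figures implied by the comment, not measured constants):

```c
/* Rough model from the comment: the optimizer keeps about 30% of the
 * recorded uops, and one bytecode instruction expands to ~5 uops. */
static int optimized_uops(int raw_uops)  { return raw_uops * 3 / 10; }
static int bytecode_instrs(int opt_uops) { return opt_uops / 5; }
```

So a 1000-uop recording budget yields roughly 300 optimized uops, i.e. about 60 bytecode instructions of coverage.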

Member


@cocolato I suspect the richards perf regression might be due to too low a trace length; I experimented with richards before, and it's quite sensitive to how well the optimizer performs.

Member Author

@cocolato cocolato Apr 3, 2026


Yes, and I tried reducing the branch/frame penalty, which also improved richards' performance.

trace->end -= needs_guard_ip;

int space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode));
if (uop_buffer_remaining_space(trace) < space_needed) {
Member


Suggested change
assert (uop_buffer_remaining_space(trace) > space_needed);

If we choose the fitness and exit values correctly, we can't run out of space.
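The sizing argument behind that suggestion can be sketched as follows. The per-uop charging model here is my assumption for illustration, not the PR's exact accounting:

```c
/* If every emitted uop costs at least min_cost_per_uop units of fitness,
 * the trace can never hold more than fitness_initial / min_cost_per_uop
 * uops. Sizing the budget so that bound is at or below the buffer
 * capacity makes "out of space" unreachable -- which is why a runtime
 * check can become an assert. */
static int
can_overflow(int fitness_initial, int min_cost_per_uop, int buffer_len)
{
    int max_uops = fitness_initial / min_cost_per_uop;
    return max_uops > buffer_len;
}
```

For example, a budget of 4000 with a minimum cost of 2 per uop cannot overflow a 2000-slot buffer, while the same budget at cost 1 can.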

@bedevere-app

bedevere-app bot commented Apr 1, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@cocolato cocolato marked this pull request as draft April 2, 2026 13:36
@cocolato
Member Author

cocolato commented Apr 2, 2026

I'll try to optimize the relevant parameters further, as performance in benchmarks like richards has declined, and I'll add some debug logs to make it easier to track fitness-related metrics later on.

int off_trace = 16 - on_trace_count;
/* Quadratic scaling: off_trace^2 ranges from 0 (fully biased our way)
* to 256 (fully biased against us, e.g. 15/16 left but traced right). */
return FITNESS_BRANCH_BASE + off_trace * off_trace;
Member


Suggested change
return FITNESS_BRANCH_BASE + off_trace * off_trace;
return FITNESS_BRANCH_BASE * off_trace;

The fitness should (approximately) reflect the chance of staying on trace, so linear might be better.
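The two forms side by side, for comparison. The `FITNESS_BRANCH_BASE` value is illustrative, and the 16 comes from the `on_trace_count` scale in the snippet above:

```c
#define FITNESS_BRANCH_BASE 8   /* illustrative base cost */

/* The PR's quadratic form: penalty grows from BASE (fully biased toward
 * the traced side) to BASE + 256 (fully biased against it). */
static int penalty_quadratic(int on_trace_count)
{
    int off_trace = 16 - on_trace_count;
    return FITNESS_BRANCH_BASE + off_trace * off_trace;
}

/* The suggested linear form: penalty proportional to the estimated
 * probability of leaving the trace at this branch. */
static int penalty_linear(int on_trace_count)
{
    int off_trace = 16 - on_trace_count;
    return FITNESS_BRANCH_BASE * off_trace;
}
```

The quadratic form barely charges mildly unbiased branches but charges heavily biased-against ones; the linear form tracks the off-trace probability directly.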

Member Author


I used off_trace * 2, and the richards benchmark result looks better now, going from 1.2x slower to 1.13x slower:

+----------------------+----------+-----------------------+
| Benchmark            | baseline | fitness               |
+======================+==========+=======================+
| raytrace             | 250 ms   | 212 ms: 1.18x faster  |
+----------------------+----------+-----------------------+
| go                   | 83.3 ms  | 78.1 ms: 1.07x faster |
+----------------------+----------+-----------------------+
| pickle_pure_python   | 279 us   | 264 us: 1.06x faster  |
+----------------------+----------+-----------------------+
| regex_compile        | 102 ms   | 99.0 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| xml_etree_generate   | 75.3 ms  | 74.3 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| deltablue            | 2.21 ms  | 2.19 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| xml_etree_iterparse  | 70.2 ms  | 69.5 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| unpickle_pure_python | 169 us   | 167 us: 1.01x faster  |
+----------------------+----------+-----------------------+
| json_dumps           | 7.60 ms  | 7.66 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| json_loads           | 19.2 us  | 19.4 us: 1.01x slower |
+----------------------+----------+-----------------------+
| telco                | 5.84 ms  | 5.92 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| spectral_norm        | 56.5 ms  | 57.4 ms: 1.02x slower |
+----------------------+----------+-----------------------+
| pyflate              | 263 ms   | 279 ms: 1.06x slower  |
+----------------------+----------+-----------------------+
| richards             | 16.2 ms  | 18.3 ms: 1.13x slower |
+----------------------+----------+-----------------------+
| Geometric mean       | (ref)    | 1.01x faster          |
+----------------------+----------+-----------------------+

@markshannon
Member

It might be worth thinking about a few scenarios to help choose parameters.
For example:

  • A loop should only be unrolled once, but we do want to unroll short loops.
  • There should be at most ~4 unbiased branches in a trace
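One way to turn scenarios like these into parameter constraints is to pick each penalty so the budget admits only the desired count of that event. A purely illustrative sketch:

```c
/* How many fully unbiased branches (an even split in the branch history)
 * fit in one trace before the fitness budget is spent? Choosing
 * unbiased_branch_penalty = fitness_initial / 4 caps traces at roughly
 * four such branches, matching the scenario above. */
static int
max_unbiased_branches(int fitness_initial, int unbiased_branch_penalty)
{
    return fitness_initial / unbiased_branch_penalty;
}
```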

Comment on lines +633 to +637
static inline int32_t
compute_frame_penalty(const _PyOptimizationConfig *cfg)
{
return (int32_t)cfg->fitness_initial / 10 + 1;
}
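For concreteness, a standalone mirror of the helper above: with the penalty set to a tenth of the initial budget (plus one), roughly ten frame events exhaust the fitness. The function name here is mine; only the formula comes from the snippet:

```c
#include <stdint.h>

/* Same formula as compute_frame_penalty in the diff above:
 * one tenth of the initial fitness budget, rounded up by one. */
static int32_t
compute_frame_penalty_like(int32_t fitness_initial)
{
    return fitness_initial / 10 + 1;
}
```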
Member Author


The newest result with lower frame penalty:

+----------------------+----------+-----------------------+
| Benchmark            | baseline | fitness               |
+======================+==========+=======================+
| raytrace             | 262 ms   | 212 ms: 1.24x faster  |
+----------------------+----------+-----------------------+
| pickle_pure_python   | 277 us   | 254 us: 1.09x faster  |
+----------------------+----------+-----------------------+
| go                   | 83.7 ms  | 78.8 ms: 1.06x faster |
+----------------------+----------+-----------------------+
| xml_etree_iterparse  | 74.1 ms  | 69.9 ms: 1.06x faster |
+----------------------+----------+-----------------------+
| xml_etree_process    | 50.4 ms  | 48.3 ms: 1.04x faster |
+----------------------+----------+-----------------------+
| xml_etree_generate   | 77.3 ms  | 74.1 ms: 1.04x faster |
+----------------------+----------+-----------------------+
| xml_etree_parse      | 122 ms   | 119 ms: 1.03x faster  |
+----------------------+----------+-----------------------+
| regex_compile        | 103 ms   | 99.7 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| deltablue            | 2.19 ms  | 2.14 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| unpickle_pure_python | 171 us   | 167 us: 1.02x faster  |
+----------------------+----------+-----------------------+
| regex_effbot         | 2.16 ms  | 2.12 ms: 1.02x faster |
+----------------------+----------+-----------------------+
| fannkuch             | 253 ms   | 249 ms: 1.02x faster  |
+----------------------+----------+-----------------------+
| json_loads           | 19.3 us  | 19.1 us: 1.01x faster |
+----------------------+----------+-----------------------+
| json_dumps           | 7.60 ms  | 7.66 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| pidigits             | 136 ms   | 138 ms: 1.01x slower  |
+----------------------+----------+-----------------------+
| pyflate              | 267 ms   | 273 ms: 1.02x slower  |
+----------------------+----------+-----------------------+
| float                | 45.2 ms  | 46.2 ms: 1.02x slower |
+----------------------+----------+-----------------------+
| richards             | 16.5 ms  | 17.3 ms: 1.05x slower |
+----------------------+----------+-----------------------+
| Geometric mean       | (ref)    | 1.03x faster          |
+----------------------+----------+-----------------------+
