[RFC] Ongoing comparative performance analysis. #312
Description
There have been some simple comparisons of selected head commits month-to-month (such as the one linked in the HarfRust README), but these have a tendency to become stale because they are updated manually, and they involve small and potentially-unrepresentative tasks. Furthermore, the performance of HarfBuzz is also a moving target, which multiplies the staleness of these comparisons as they are performed right now.
I am looking to build an external, continuously operating performance-analysis harness for HarfRust and HarfBuzz that samples a more representative corpus (more fonts, more texts, more parameter sets). The approach is to set up a dedicated machine (with scheduler isolation, fixed clock rates, quiet IRQs, etc.) that continuously selects and measures samples, improving the time resolution and precision of the analysis for both HarfRust and HarfBuzz, and giving better information on how their performance develops over time and relative to each other.
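To make the isolation part concrete, here is a rough sketch of the kind of checks the harness could run before each batch. This is illustrative only, not an existing implementation: the CPU set is a placeholder, and it assumes a Linux host already booted with the appropriate kernel parameters (e.g. `isolcpus`/`nohz_full`).

```python
import os

ISOLATED_CPUS = {2, 3}  # placeholder: CPUs assumed reserved via isolcpus/nohz_full


def pin_to_isolated_cpus(cpus=ISOLATED_CPUS):
    """Restrict this process to the isolated CPUs, if the platform allows it.

    Returns True only when the affinity mask actually took effect.
    """
    try:
        os.sched_setaffinity(0, cpus)  # Linux-only API
        return os.sched_getaffinity(0) == set(cpus)
    except (AttributeError, OSError):
        return False


def read_governor(cpu=0):
    """Report the cpufreq governor, so runs on a host without fixed clock
    rates can be flagged rather than silently mixed into the data."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor"
    try:
        with open(path) as f:
            return f.read().strip()  # want "performance" for stable clocks
    except OSError:
        return "unknown"
```

The idea is that the harness refuses to record samples (or records them with a "noisy host" tag) whenever these preconditions do not hold, rather than trusting that the machine is quiet.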
This could also be adapted to preferentially sample PR base and HEAD commits, but for a variety of reasons I think it would probably not be desirable to block CI on this form of analysis, since it is tricky to establish performance metrics that are comparable between revisions within a single run. I'd be interested to hear any ideas on this.
The reason I am looking at a controlled machine (rather than a CI task) for this purpose is that it is very difficult to get comparable performance measurements from a single build/link of a given codebase, or from a single run, and CI runners sit on shared resources with cache contention, FU contention, little control over interrupts, and so on.
Benchmarking tools tend to overstate the meaningfulness of their comparative numbers (e.g. Criterion displaying double-digit percent progressions/regressions between two builds of functionally identical code, with ‘p < 0.05’ next to them).
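A small simulation illustrates why (hypothetical numbers, not HarfRust measurements): if each benchmark run of byte-identical code lands in a slightly different machine state (cache layout, clock state, link order), the run-to-run deltas routinely exceed a few percent even though nothing changed.

```python
import random


def apparent_change(seed, n=50, jitter=2.0, state_drift=5.0):
    """Compare two benchmark runs of *identical* code.

    Each run gets a run-wide machine-state offset (uniform within
    +/- state_drift percent of the true cost) plus per-sample jitter;
    the return value is the apparent percent change between the runs.
    """
    rng = random.Random(seed)

    def run():
        offset = rng.uniform(-state_drift, state_drift)  # run-wide state
        return sum(100.0 + offset + rng.gauss(0, jitter) for _ in range(n)) / n

    a, b = run(), run()
    return (b - a) / a * 100.0


changes = [apparent_change(s) for s in range(200)]
big = sum(abs(c) > 2.0 for c in changes) / len(changes)
print(f"pairs of runs differing by >2% despite identical code: {big:.0%}")
```

With these (made-up) drift parameters, well over half the paired runs show an apparent change above 2%, which is exactly the kind of delta a per-run comparison will happily annotate as significant. Continuous sampling across many runs is what lets the run-wide offsets average out.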
A continuous sampling approach gives an opportunity to use randomization of task order/grouping, linking order, and other parameters, which should converge to something more representative.
I am writing here to track this process, to gather feedback on the approach/technique, and to request help in developing a representative performance corpus. If you have any techniques for stabilizing system state across runs, any thoughts or resources on corpus selection, or any general feedback on this effort, I would love to hear it.