
gh-144569: Avoid creating temporary objects in BINARY_SLICE for list, tuple, and unicode #144590

Open

cocolato wants to merge 15 commits into python:main from cocolato:optimize/BINARY_SLICE

Conversation

@cocolato (Contributor) commented Feb 8, 2026

Optimize BINARY_SLICE for list, tuple, and unicode objects with int/None indices.

This is the first step in optimizing BINARY_SLICE. We will implement more JIT optimizations after this PR.
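For context, here is a simplified sketch of the generic path this PR specializes (paraphrased from the pattern quoted later in this review, not verbatim bytecodes.c): the interpreter builds a temporary slice object, uses it once, and throws it away.

```c
#include <Python.h>

/* Simplified sketch of the generic BINARY_SLICE path (paraphrased,
 * not verbatim bytecodes.c): a temporary slice object is created,
 * used once for PyObject_GetItem, and destroyed immediately. The
 * fast paths in this PR skip that allocation for list, tuple, and
 * str containers. */
static PyObject *
generic_binary_slice(PyObject *container, PyObject *start_o, PyObject *stop_o)
{
    PyObject *slice = PySlice_New(start_o, stop_o, NULL);  /* temporary */
    if (slice == NULL) {
        return NULL;
    }
    PyObject *res = PyObject_GetItem(container, slice);
    Py_DECREF(slice);  /* the temporary dies right away */
    return res;
}
```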

@cocolato's comment was marked as outdated.

@Fidget-Spinner (Member) left a comment

Pretty close, thanks!

@Fidget-Spinner (Member) commented

You forgot to add the new recorder to the BINARY_SLICE macro op.

@Fidget-Spinner (Member) left a comment

LGTM, just one comment. Please also add a NEWS entry.

@Fidget-Spinner (Member) commented

Actually, I've changed my mind. Can you please open a PR to convert BINARY_SLICE to the POP_TOP form? Then, after we merge that PR, we can update this one.

@Fidget-Spinner (Member) left a comment

Let's leave the DECREF_INPUTS alone for now and try to remove it in a future PR.

The code has repeated instances that should be factored out into a micro-op.

@Fidget-Spinner (Member) commented

This is pretty close. I'm going to push some changes, then wait a day to see if Mark has any objections.

@cocolato (Contributor, Author) commented Feb 8, 2026

Thanks for your guidance!

1. We cannot exit after unpacking indices, as the stack now contains tagged ints, which may lead to a crash. We must insert a guard for the type before unpacking the indices.
2. It is possible for an index not to fit in a tagged int, in which case we must deopt (see the sketch after this list).
3. Recorded values do not always match the actual type we will see at runtime, so we must guard on that as well.
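A hedged sketch of the fit check from point 2, assuming a tagging scheme that steals the low bit of the machine word (the helper name and scheme are illustrative, not CPython's actual tagged-int API):

```c
#include <stdint.h>

/* Illustrative only: assumes a tagged int stores the value shifted
 * left by one with a tag in the low bit, so an index fits only if it
 * survives the shift round-trip. Also assumes arithmetic right shift
 * for negative values, as on mainstream compilers. If this check
 * fails, the trace must deopt to the generic path. */
static inline int
fits_in_tagged_int(intptr_t index)
{
    intptr_t shifted = (intptr_t)((uintptr_t)index << 1);
    return (shifted >> 1) == index;
}
```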
@Fidget-Spinner (Member) commented

@cocolato please check out the latest commit and the extended commit message for the bugs fixed. Thanks!

@Fidget-Spinner (Member) commented

add_int.py

def testfunc(n):
    data = [1, 2, 3, 4, 5]
    a, b = 1, 3
    for _ in range(n):
        x = data[a:b]
    return x

testfunc(50000000)

Results using hyperfine:

Summary
  PYTHON_JIT=1 ./python ../add_int.py ran
    1.12 ± 0.02 times faster than PYTHON_JIT=0 ./python ../add_int.py
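(For reference: hyperfine compares the commands passed to it as arguments, so the run above was presumably invoked as hyperfine 'PYTHON_JIT=0 ./python ../add_int.py' 'PYTHON_JIT=1 ./python ../add_int.py'.)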

So we made it >10% faster. Nice!

@cocolato (Contributor, Author) commented Feb 9, 2026

> @cocolato please check out the latest commit and the extended commit message for the bugs fixed. Thanks!

LGTM! Thanks again for the fix, I learned a lot from it.

> Results using hyperfine:

This is my first time learning about this tool, and it looks great. In the past, when I needed to run benchmarks locally, timeit was somewhat inaccurate, and pyperformance was a bit too heavy.

@cocolato's comment was marked as resolved.

@markshannon (Member) left a comment

Thanks for doing this.
This looks sound, but there are some inefficiencies.

Beware the size of uops. They will be converted to machine code, so keeping code size down is important for performance.

@bedevere-app (bot) commented Feb 9, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@cocolato changed the title from "gh-144569: Scalar replacement of BINARY_SLICE for list, tuple, and unicode" to "gh-144569: Avoid creating temporary objects in BINARY_SLICE for list, tuple, and unicode" on Feb 10, 2026
@cocolato (Contributor, Author) commented

Thank you for your review and corrections! I have made the requested changes; please review again.

@bedevere-app (bot) commented Feb 10, 2026

Thanks for making the requested changes!

@markshannon: please review the changes made to this pull request.

@bedevere-app (bot) requested a review from markshannon, February 10, 2026 16:04
@markshannon (Member) commented Feb 18, 2026

Sorry for the delay in getting back to this.

I don't think we want _UNPACK_INDICES. It seems to make the code both slower and more complex.

I think we should split this into two PRs: the first with just the interpreter changes (bytecodes.c), leaving the JIT optimizations for another PR.

To make it worth splitting the operation into uops, we need to be able to optimize the uops effectively. That means handling the index adjustments in the optimizer as much as possible and leaving only minimal checking to the runtime.
To do that, we will have to add a few new uops for loading tagged ints and bounds checking.

For example, take x[:-1], where x is known to be a list. We want it to be converted to something like:

// `x` is already on the stack and known to be a list
_LOAD_TAGGED_INT 0 // No bounds checking needed for 0.
_LOAD_TAGGED_INT -1
_LIST_LEN 3 // Operand is the stack depth of the list
_INDEX_ADJUST  // Convert the -1 to an in-bounds value.
_LIST_SLICE // No need to check indices here.

Which looks bulky, but each of the uops is only a few machine instructions (_LIST_SLICE will be a function call).

(In the above, _LIST_LEN pushes the length of the list as a tagged int, and _INDEX_ADJUST replaces index and len with adjusted_index, following the slice index semantics.)
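For illustration, the adjustment follows the usual slice-index semantics; a minimal C sketch (a hypothetical helper, not the actual uop implementation):

```c
#include <Python.h>

/* Hypothetical sketch of what _INDEX_ADJUST would compute: clamp a
 * possibly negative slice index into [0, len], per slice semantics.
 * Not the actual uop implementation. */
static inline Py_ssize_t
adjust_slice_index(Py_ssize_t index, Py_ssize_t len)
{
    if (index < 0) {
        index += len;      /* e.g. -1 becomes len - 1 */
        if (index < 0) {
            index = 0;     /* still out of range: clamp to the start */
        }
    }
    else if (index > len) {
        index = len;       /* clamp to the end */
    }
    return index;
}
```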

So, let's keep it simple for the first PR, and just add special cases for lists, tuples and strings in BINARY_SLICE.
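A minimal sketch of the kind of special case being suggested, built here from the public PySlice_AdjustIndices and PyList_GetSlice APIs for the list case (illustrative only, not the PR's actual code):

```c
#include <Python.h>
#include <assert.h>

/* Illustrative fast path for list[start:stop] with int/None indices.
 * Returns NULL with no error set when the caller should fall back to
 * the generic slice path. Not the PR's actual code. */
static PyObject *
list_slice_fast_path(PyObject *list, PyObject *start_o, PyObject *stop_o)
{
    assert(PyList_CheckExact(list));
    Py_ssize_t len = PyList_GET_SIZE(list);
    Py_ssize_t start = 0, stop = len;   /* None means the open end */
    if (start_o != Py_None) {
        if (!PyLong_CheckExact(start_o)) {
            return NULL;                /* not an int: generic path */
        }
        start = PyLong_AsSsize_t(start_o);
        if (start == -1 && PyErr_Occurred()) {
            PyErr_Clear();              /* doesn't fit in Py_ssize_t */
            return NULL;
        }
    }
    if (stop_o != Py_None) {
        if (!PyLong_CheckExact(stop_o)) {
            return NULL;
        }
        stop = PyLong_AsSsize_t(stop_o);
        if (stop == -1 && PyErr_Occurred()) {
            PyErr_Clear();
            return NULL;
        }
    }
    /* Clamp negative/out-of-range indices per slice semantics. */
    PySlice_AdjustIndices(len, &start, &stop, 1);
    return PyList_GetSlice(list, start, stop);  /* no temporary slice */
}
```

The point of the special case is that PyList_GetSlice copies the elements directly, so no temporary PySliceObject is ever allocated.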

PyObject *slice = PySlice_New(start_o, stop_o, NULL);
...
else {
    res_o = PyObject_GetItem(PyStackRef_AsPyObjectBorrow(container), slice);
    Py_DECREF(slice);
}
@cocolato (Contributor, Author) commented on the diff above

To reduce redundant code, we've modified this to call PySlice_New to create the slice. This incurs a slight performance overhead; if needed, I can revert to the original implementation.

@cocolato (Contributor, Author) commented

@markshannon Thanks for the review! I have reimplemented the fast path for BINARY_SLICE in bytecodes.c. After this PR is merged, I will attempt the other JIT optimizations.

