Skip to content

Comments

[18.0] Add 'Allow Commit' option on job functions#899

Open
guewen wants to merge 7 commits intoOCA:18.0from
guewen:18.0-add-optional-new-cr
Open

[18.0] Add 'Allow Commit' option on job functions#899
guewen wants to merge 7 commits intoOCA:18.0from
guewen:18.0-add-optional-new-cr

Conversation

@guewen
Copy link
Member

@guewen guewen commented Feb 19, 2026

It is forbidden to commit inside a job, because it releases the job lock and can cause it to start again, while still being run, by the dead jobs requeuer. For some use cases, it may actually be legitimate, or at least be needed in the short term before actual updates in the code.

A new option on the job function, false by default, allow to run the job in a new transaction, at the cost of an additional connection + transaction overhead.

Related to #889

@OCA-git-bot
Copy link
Contributor

Hi @sbidoul,
some modules you are maintaining are being modified, check this out!

@guewen guewen force-pushed the 18.0-add-optional-new-cr branch 2 times, most recently from 75725a3 to 8f24e19 Compare February 19, 2026 15:21
"enable, func_name, kwargs.\n"
"See the module description for details.",
)
run_in_new_cursor = fields.Boolean(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to name this option with something that conveys "allow commit in job". Not sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that, just allow_commit should be enough?

related_action_func_name=None,
related_action_kwargs={},
job_function_id=None,
run_in_new_cursor=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of magically un-breaking other modules that depend on queue_job, perhaps run_in_new_cursor could default to True for 18.0 and then back to False for 19.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or get the default value from a system parameter (False by default) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System parameter that is False by default but with a migration that sets it to True automatically on existing databases?

We could modify _prevent_commit to update the job function option when the error happens, that would give the best of both worlds, but might be a bit too much; also I fear it could create other problems with several jobs attempting to update the same job function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System parameter that is False by default but with a migration that sets it to True automatically on existing databases?

Why not. I'd be fine with breaking by default with a message explaining all the solutions and tradeoff, in order to raise awareness, but if people prefer to not break by default, good for me too.

We could modify _prevent_commit to update the job function option when the error happens, that would give the best of both worlds, but might be a bit too much; also I fear it could create other problems with several jobs attempting to update the same job function.

Sounds overkill to me too.

Copy link
Member Author

@guewen guewen Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have mixed feelings about the 2 options. My preference would go about breaking by default with an explanation, because otherwise nobody will care and databases will have the option enabled on all jobs forever. On the other side, I'm sure that in many situations nobody monitor the jobs and the issue will be discovered after some time, maybe after real issues due to non-running jobs (well there is a point about accountability regarding monitoring errors and jobs here).

I updated the PR yesterday evening with a system parameter and migration to allow commits by default, but I tend to favor breaking by default after a night sleep, so I'm tempted to remove it. We could add a link in the error message to a page such as this draft: https://github.com/OCA/queue/wiki/%5BDRAFT%5D-Upgrade-warning:-commits-inside-jobs

What do you think @amh-mw @sbidoul ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would go about breaking by default with an explanation, because otherwise nobody will care and databases will have the option enabled on all jobs forever. On the other side, I'm sure that in many situations nobody monitor the jobs and the issue will be discovered after some time, maybe after real issues due to non-running jobs (well there is a point about accountability regarding monitoring errors and jobs here).

Yeah, doubling the number of database cursors used for all jobs is not the best. 😓 I also lean towards breaking by default and removing the config parameter. Make sure to semver major!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated with a link to the explanation

@guewen guewen force-pushed the 18.0-add-optional-new-cr branch 2 times, most recently from 018331b to 7bc3c77 Compare February 19, 2026 15:58
@guewen guewen changed the title [18.0] Add 'Run in new cursor' option on job functions [18.0] Add 'Allow Commit' option on job functions Feb 19, 2026
@guewen guewen force-pushed the 18.0-add-optional-new-cr branch from 7bc3c77 to 2dec167 Compare February 19, 2026 16:34
It is forbidden to commit inside a job, because it releases the job lock
and can cause it to start again, while still being run, by the dead jobs
requeuer. For some use cases, it may actually be legitimate, or at least
be needed in the short term before actual updates in the code.

A new option on the job function, false by default, allow to run the job
in a new transaction, at the cost of an additional connection +
transaction overhead.

Related to OCA#889
False on new databases, True on existing databases.
Should always be False by default on future versions.
@guewen guewen force-pushed the 18.0-add-optional-new-cr branch from 2dec167 to b4f3bec Compare February 19, 2026 16:36
@guewen guewen requested review from amh-mw and sbidoul February 23, 2026 13:54
As the controller changes env on Job instances.
Copy link
Member

@amh-mw amh-mw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During functional testing, encountered error:

web-1    | 2026-02-24 13:23:47,685 80 ERROR odoo odoo.http: Exception during request handling. 
web-1    | Traceback (most recent call last):
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 364, in execute
web-1    |     res = self._obj.execute(query, params)
web-1    |           ^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 510, in __getattr__
web-1    |     raise psycopg2.InterfaceError("Cursor already closed")
web-1    | psycopg2.InterfaceError: Cursor already closed
web-1    | 
web-1    | During handling of the above exception, another exception occurred:
web-1    | 
web-1    | Traceback (most recent call last):
web-1    |   File "/usr/lib/python3/dist-packages/odoo/http.py", line 2578, in __call__
web-1    |     response = request._serve_db()
web-1    |                ^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/http.py", line 2105, in _serve_db
web-1    |     return self._transactioning(
web-1    |            ^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/http.py", line 2168, in _transactioning
web-1    |     return service_model.retrying(func, env=self.env)
web-1    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/service/model.py", line 157, in retrying
web-1    |     result = func()
web-1    |              ^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/http.py", line 2135, in _serve_ir_http
web-1    |     response = self.dispatcher.dispatch(rule.endpoint, args)
web-1    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/http.py", line 2296, in dispatch
web-1    |     return self.request.registry['ir.http']._dispatch(endpoint)
web-1    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/addons/base/models/ir_http.py", line 333, in _dispatch
web-1    |     result = endpoint(**request.params)
web-1    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/http.py", line 756, in route_wrapper
web-1    |     result = endpoint(self, *args, **params_ok)
web-1    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/mnt/extra-addons/oca/queue/queue_job/controllers/main.py", line 221, in runjob
web-1    |     self._runjob(env, job)
web-1    |   File "/mnt/extra-addons/oca/queue/queue_job/controllers/main.py", line 190, in _runjob
web-1    |     cls._enqueue_dependent_jobs(env, job)
web-1    |   File "/mnt/extra-addons/oca/queue/queue_job/controllers/main.py", line 118, in _enqueue_dependent_jobs
web-1    |     with job.env.cr.savepoint():
web-1    |          ^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 189, in savepoint
web-1    |     return _FlushingSavepoint(self)
web-1    |            ^^^^^^^^^^^^^^^^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 126, in __init__
web-1    |     super().__init__(cr)
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 101, in __init__
web-1    |     cr.execute('SAVEPOINT "%s"' % self.name)
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 367, in execute
web-1    |     _logger.error("bad query: %s\nERROR: %s", self._obj.query or query, e)
web-1    |                                               ^^^^^^^^^
web-1    |   File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 510, in __getattr__
web-1    |     raise psycopg2.InterfaceError("Cursor already closed")
web-1    | psycopg2.InterfaceError: Cursor already closed

I have not managed to chase down the origin of this error yet, but I will quickly note that my development environment...

  • makes heavy use of chain and group to order jobs
  • contains #767, #768 and some as-yet unpublished changes to queue_job_batch
  • is running 8 queue job workers out of total 12 workers
  • made it about 7,500 jobs in before I encountered this error
  • contains jobs that use savepoint to encapsulate partial work and commit the cursor after success

@guewen guewen force-pushed the 18.0-add-optional-new-cr branch from 4647d3e to 33ca9e7 Compare February 24, 2026 15:10
@guewen
Copy link
Member Author

guewen commented Feb 24, 2026

@amh-mw

During functional testing, encountered error:

Thanks for the details, I think the problem is the same as this one: #883 (comment)

Looks good. While you are at it, could you add a return at line 174 of the controller (#883 (comment))?

We try to enqueue the dependent jobs using a closed cursor, but we actually should not try to enqueue them because the job failed. I added the return in the except RetryableJobError as err: branch.

I pushed a new commit with the DRY simplifications you propose, and took benefit of the new temporary env method of the job to use it in the controller:

-            with Registry(job.env.cr.dbname).cursor() as new_cr:
-                job.env = job.env(cr=new_cr)
+            with job.in_temporary_env():

So I made the env setter private in Job, as we no longer change it from the outside.

I created many job graphs / single jobs with the last changes with variations of (added a new failure_retry_seconds to have the job raise a retryable error)

http://localhost:8069/queue_job/create_test_job?commit_within_job=True&failure_rate=0.1&size=400&failure_retry_seconds=1
http://localhost:8069/queue_job/create_test_job?commit_within_job=True&failure_rate=0.2&size=1
http://localhost:8069/queue_job/create_test_job?commit_within_job=True&failure_rate=1&size=1&failure_retry_seconds=1

No issue on 10k jobs, but of course it does not reflect the reality of production.
I tried to find a way to write tests for the controller but could find a convincing way :/

Copy link
Member

@amh-mw amh-mw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During functional testing, started encountering this error after about 70,000 jobs:

web-1    | 2026-02-24 15:56:06,362 88 ERROR ? odoo.addons.queue_job.jobrunner.runner: exception in GET http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=cd2eb546-40e4-4f8b-acce-04d5aa974157 
web-1    | Traceback (most recent call last):
web-1    |   File "/mnt/extra-addons/oca/queue/queue_job/jobrunner/runner.py", line 106, in urlopen
web-1    |     response.raise_for_status()
web-1    |   File "/usr/lib/python3/dist-packages/requests/models.py", line 1021, in raise_for_status
web-1    |     raise HTTPError(http_error_msg, response=self)
web-1    | requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=cd2eb546-40e4-4f8b-acce-04d5aa974157

All instances of this error have the same process id, which I have confirmed is the WorkerJobRunner. This coincides with a number of jobs raising FailedJobError (internally handled FileNotFoundError for absent production data). Things seem to be continuing as expected (currently at job 200,000), so maybe this is just some additional verbosity?

Edit: I'm going to increase my odoo.addons.queue_job.jobrunner.runner log verbosity and see if I can get a better picture.

@amh-mw
Copy link
Member

amh-mw commented Feb 24, 2026

Correlation between raised FailedJobError and HTTPError 500 on my development system:

web-1    | 2026-02-24 17:04:34,487 67 DEBUG odoo odoo.addons.queue_job.controllers.main: <Job 6068b4b1-6255-41be-963a-d750f0b66391, priority:801> started 
web-1    | 2026-02-24 17:04:34,497 67 ERROR odoo odoo.addons.queue_job.controllers.main: Traceback (most recent call last):
...
web-1    |     raise FailedJobError(e)
web-1    | odoo.addons.queue_job.exception.FailedJobError: handled FileNotFoundError
web-1    | 2026-02-24 17:04:34,535 88 DEBUG ? odoo.addons.queue_job.jobrunner.channels: job 6068b4b1-6255-41be-963a-d750f0b66391 marked failed in channel root(C:8,Q:60064,R:7,F:1628) 
web-1    | 2026-02-24 17:04:34,537 88 ERROR ? odoo.addons.queue_job.jobrunner.runner: exception in GET http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=6068b4b1-6255-41be-963a-d750f0b66391 
web-1    | Traceback (most recent call last):
web-1    |   File "/mnt/extra-addons/oca/queue/queue_job/jobrunner/runner.py", line 106, in urlopen
web-1    |     response.raise_for_status()
web-1    |   File "/usr/lib/python3/dist-packages/requests/models.py", line 1021, in raise_for_status
web-1    |     raise HTTPError(http_error_msg, response=self)
web-1    | requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=6068b4b1-6255-41be-963a-d750f0b66391

Contrast to HTTPError 400 on staging system running queue_job 18.0.1.7.0

2026-02-24 02:10:06,499 63 ERROR ? odoo.addons.queue_job.jobrunner.runner: exception in GET http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=a8068655-eb98-4bc8-895d-449462d86ccf
--
Traceback (most recent call last):
File "/usr/local/src/metricwise/oca/queue/queue_job/jobrunner/runner.py", line 253, in urlopen
response.raise_for_status()
File "/usr/lib/python3/dist-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST for url: http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=a8068655-eb98-4bc8-895d-449462d86ccf

@amh-mw
Copy link
Member

amh-mw commented Feb 24, 2026

I don't have an easy way to git bisect the origin of the HTTPError 400 to 500 change, but if anyone else can confirm that it wasn't introduced by this pull request, I don't have any other objections to approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants