[18.0] Add 'Allow Commit' option on job functions#899
[18.0] Add 'Allow Commit' option on job functions#899
Conversation
|
Hi @sbidoul, |
75725a3 to
8f24e19
Compare
| "enable, func_name, kwargs.\n" | ||
| "See the module description for details.", | ||
| ) | ||
| run_in_new_cursor = fields.Boolean( |
There was a problem hiding this comment.
Would it make sense to name this option with something that conveys "allow commit in job". Not sure.
There was a problem hiding this comment.
I'll do that, just allow_commit should be enough?
| related_action_func_name=None, | ||
| related_action_kwargs={}, | ||
| job_function_id=None, | ||
| run_in_new_cursor=False, |
There was a problem hiding this comment.
In the interest of magically un-breaking other modules that depend on queue_job, perhaps run_in_new_cursor could default to True for 18.0 and then back to False for 19.0?
There was a problem hiding this comment.
Or get the default value from a system parameter (False by default) ?
There was a problem hiding this comment.
System parameter that is False by default but with a migration that sets it to True automatically on existing databases?
We could modify _prevent_commit to update the job function option when the error happens, that would give the best of both worlds, but might be a bit too much; also I fear it could create other problems with several jobs attempting to update the same job function.
There was a problem hiding this comment.
System parameter that is False by default but with a migration that sets it to True automatically on existing databases?
Why not. I'd be fine with breaking by default with a message explaining all the solutions and tradeoff, in order to raise awareness, but if people prefer to not break by default, good for me too.
We could modify
_prevent_committo update the job function option when the error happens, that would give the best of both worlds, but might be a bit too much; also I fear it could create other problems with several jobs attempting to update the same job function.
Sounds overkill to me too.
There was a problem hiding this comment.
I have mixed feelings about the 2 options. My preference would go about breaking by default with an explanation, because otherwise nobody will care and databases will have the option enabled on all jobs forever. On the other side, I'm sure that in many situations nobody monitor the jobs and the issue will be discovered after some time, maybe after real issues due to non-running jobs (well there is a point about accountability regarding monitoring errors and jobs here).
I updated the PR yesterday evening with a system parameter and migration to allow commits by default, but I tend to favor breaking by default after a night sleep, so I'm tempted to remove it. We could add a link in the error message to a page such as this draft: https://github.com/OCA/queue/wiki/%5BDRAFT%5D-Upgrade-warning:-commits-inside-jobs
There was a problem hiding this comment.
My preference would go about breaking by default with an explanation, because otherwise nobody will care and databases will have the option enabled on all jobs forever. On the other side, I'm sure that in many situations nobody monitor the jobs and the issue will be discovered after some time, maybe after real issues due to non-running jobs (well there is a point about accountability regarding monitoring errors and jobs here).
Yeah, doubling the number of database cursors used for all jobs is not the best. 😓 I also lean towards breaking by default and removing the config parameter. Make sure to semver major!
There was a problem hiding this comment.
I updated with a link to the explanation
018331b to
7bc3c77
Compare
7bc3c77 to
2dec167
Compare
It is forbidden to commit inside a job, because it releases the job lock and can cause it to start again, while still being run, by the dead jobs requeuer. For some use cases, it may actually be legitimate, or at least be needed in the short term before actual updates in the code. A new option on the job function, false by default, allow to run the job in a new transaction, at the cost of an additional connection + transaction overhead. Related to OCA#889
False on new databases, True on existing databases. Should always be False by default on future versions.
2dec167 to
b4f3bec
Compare
As the controller changes env on Job instances.
amh-mw
left a comment
There was a problem hiding this comment.
During functional testing, encountered error:
web-1 | 2026-02-24 13:23:47,685 80 ERROR odoo odoo.http: Exception during request handling.
web-1 | Traceback (most recent call last):
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 364, in execute
web-1 | res = self._obj.execute(query, params)
web-1 | ^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 510, in __getattr__
web-1 | raise psycopg2.InterfaceError("Cursor already closed")
web-1 | psycopg2.InterfaceError: Cursor already closed
web-1 |
web-1 | During handling of the above exception, another exception occurred:
web-1 |
web-1 | Traceback (most recent call last):
web-1 | File "/usr/lib/python3/dist-packages/odoo/http.py", line 2578, in __call__
web-1 | response = request._serve_db()
web-1 | ^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/http.py", line 2105, in _serve_db
web-1 | return self._transactioning(
web-1 | ^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/http.py", line 2168, in _transactioning
web-1 | return service_model.retrying(func, env=self.env)
web-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/service/model.py", line 157, in retrying
web-1 | result = func()
web-1 | ^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/http.py", line 2135, in _serve_ir_http
web-1 | response = self.dispatcher.dispatch(rule.endpoint, args)
web-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/http.py", line 2296, in dispatch
web-1 | return self.request.registry['ir.http']._dispatch(endpoint)
web-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/addons/base/models/ir_http.py", line 333, in _dispatch
web-1 | result = endpoint(**request.params)
web-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/http.py", line 756, in route_wrapper
web-1 | result = endpoint(self, *args, **params_ok)
web-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/mnt/extra-addons/oca/queue/queue_job/controllers/main.py", line 221, in runjob
web-1 | self._runjob(env, job)
web-1 | File "/mnt/extra-addons/oca/queue/queue_job/controllers/main.py", line 190, in _runjob
web-1 | cls._enqueue_dependent_jobs(env, job)
web-1 | File "/mnt/extra-addons/oca/queue/queue_job/controllers/main.py", line 118, in _enqueue_dependent_jobs
web-1 | with job.env.cr.savepoint():
web-1 | ^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 189, in savepoint
web-1 | return _FlushingSavepoint(self)
web-1 | ^^^^^^^^^^^^^^^^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 126, in __init__
web-1 | super().__init__(cr)
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 101, in __init__
web-1 | cr.execute('SAVEPOINT "%s"' % self.name)
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 367, in execute
web-1 | _logger.error("bad query: %s\nERROR: %s", self._obj.query or query, e)
web-1 | ^^^^^^^^^
web-1 | File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 510, in __getattr__
web-1 | raise psycopg2.InterfaceError("Cursor already closed")
web-1 | psycopg2.InterfaceError: Cursor already closed
I have not managed to chase down the origin of this error yet, but I will quickly note that my development environment...
- makes heavy use of
chainandgroupto order jobs - contains #767, #768 and some as-yet unpublished changes to
queue_job_batch - is running 8 queue job workers out of total 12 workers
- made it about 7,500 jobs in before I encountered this error
- contains jobs that use savepoint to encapsulate partial work and commit the cursor after success
4647d3e to
33ca9e7
Compare
Thanks for the details, I think the problem is the same as this one: #883 (comment)
We try to enqueue the dependent jobs using a closed cursor, but we actually should not try to enqueue them because the job failed. I added the return in the I pushed a new commit with the DRY simplifications you propose, and took benefit of the new temporary env method of the job to use it in the controller: - with Registry(job.env.cr.dbname).cursor() as new_cr:
- job.env = job.env(cr=new_cr)
+ with job.in_temporary_env():So I made the I created many job graphs / single jobs with the last changes with variations of (added a new failure_retry_seconds to have the job raise a retryable error) http://localhost:8069/queue_job/create_test_job?commit_within_job=True&failure_rate=0.1&size=400&failure_retry_seconds=1 No issue on 10k jobs, but of course it does not reflect the reality of production. |
There was a problem hiding this comment.
During functional testing, started encountering this error after about 70,000 jobs:
web-1 | 2026-02-24 15:56:06,362 88 ERROR ? odoo.addons.queue_job.jobrunner.runner: exception in GET http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=cd2eb546-40e4-4f8b-acce-04d5aa974157
web-1 | Traceback (most recent call last):
web-1 | File "/mnt/extra-addons/oca/queue/queue_job/jobrunner/runner.py", line 106, in urlopen
web-1 | response.raise_for_status()
web-1 | File "/usr/lib/python3/dist-packages/requests/models.py", line 1021, in raise_for_status
web-1 | raise HTTPError(http_error_msg, response=self)
web-1 | requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:8069/queue_job/runjob?db=odoo&job_uuid=cd2eb546-40e4-4f8b-acce-04d5aa974157
All instances of this error have the same process id, which I have confirmed is the WorkerJobRunner. This coincides with a number of jobs raising FailedJobError (internally handled FileNotFoundError for absent production data). Things seem to be continuing as expected (currently at job 200,000), so maybe this is just some additional verbosity?
Edit: I'm going to increase my odoo.addons.queue_job.jobrunner.runner log verbosity and see if I can get a better picture.
|
Correlation between raised FailedJobError and HTTPError 500 on my development system: Contrast to HTTPError 400 on staging system running |
|
I don't have an easy way to |
It is forbidden to commit inside a job, because it releases the job lock and can cause it to start again, while still being run, by the dead jobs requeuer. For some use cases, it may actually be legitimate, or at least be needed in the short term before actual updates in the code.
A new option on the job function, false by default, allow to run the job in a new transaction, at the cost of an additional connection + transaction overhead.
Related to #889