Skip to content

fix(engine) prevent MVCC race in blockRunWithWaitpoint pending check#3075

Open
ericallam wants to merge 2 commits intomainfrom
fix/waitpoint-mvcc-race
Open

fix(engine) prevent MVCC race in blockRunWithWaitpoint pending check#3075
ericallam wants to merge 2 commits intomainfrom
fix/waitpoint-mvcc-race

Conversation

@ericallam
Copy link
Member

Split the CTE in blockRunWithWaitpoint so the pending waitpoint check
is a separate SQL statement. In READ COMMITTED isolation, each statement
gets its own snapshot, so a separate SELECT sees the latest committed
state from concurrent completeWaitpoint calls.

Previously, the CTE did INSERT + pending check in one statement (one
snapshot). If completeWaitpoint committed between the CTE start and
the SELECT, the SELECT would still see PENDING due to the stale
snapshot. Neither side would enqueue continueRunIfUnblocked, leaving
the run stuck forever.

@changeset-bot
Copy link

changeset-bot bot commented Feb 17, 2026

⚠️ No Changeset found

Latest commit: 39fa6d2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 17, 2026

Walkthrough

The change restructures waitpoint blocking into two statements: (1) a CTE inserts blocking TaskRunWaitpoint rows and corresponding historical _WaitpointRunConnections rows; (2) a separate pendingCheck SELECT counts pending waitpoints using a fresh MVCC snapshot. The code no longer uses the insert’s returned pending_count; instead isRunBlocked is derived from the independent pending-check result. Added comments explain READ COMMITTED MVCC snapshot behavior and why the separate pending-check is required. Control flow and scheduling now depend on the pending-check outcome.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description clearly explains the technical problem and solution but is missing several required template sections: checklist completion, testing steps, changelog, and screenshots. Complete the PR description template by adding the checklist, testing steps, and changelog sections to meet repository requirements.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: fixing an MVCC race condition in the pending check logic for blockRunWithWaitpoint, which aligns with the core issue addressed in the PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/waitpoint-mvcc-race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

430-442: Excellent fix for the MVCC snapshot race — minor type nit on BigInt.

The separation into a distinct SQL statement is the correct approach under READ COMMITTED isolation, and the explanatory comment is thorough. One small type issue:

Prisma's $queryRaw returns PostgreSQL bigint (from COUNT(*)) as the JavaScript primitive bigint, not the wrapper object BigInt. This won't cause a runtime error since Number() handles both, but the type annotation is technically inaccurate.

Nit: use lowercase bigint for the primitive type
-      const pendingCheck = await prisma.$queryRaw<{ pending_count: BigInt }[]>`
+      const pendingCheck = await prisma.$queryRaw<{ pending_count: bigint }[]>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` around
lines 430 - 442, The type annotation on the result of prisma.$queryRaw is using
the wrapper BigInt instead of the JS primitive bigint; update the generic on
prisma.$queryRaw from { pending_count: BigInt }[] to { pending_count: bigint }[]
so the type accurately reflects PostgreSQL COUNT(*) returning a primitive
bigint, leaving the surrounding logic (pendingCheck,
Number(pendingCheck.at(0)?.pending_count ?? 0), Prisma.join($waitpoints))
unchanged.
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7af789b and c603987.

📒 Files selected for processing (1)
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧠 Learnings (1)
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
🔇 Additional comments (2)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (2)

400-428: Clean separation of the INSERT CTE from the pending check.

The CTE now correctly focuses solely on inserting the blocking associations, with its result intentionally discarded since the pending count is determined by the separate query below. The inline comments clearly explain why the SELECT COUNT(*) is needed (to force PostgreSQL to execute the inserted CTE branch). Looks good.


493-504: Debounce-based safety net is sound.

Even if both blockRunWithWaitpoint (seeing 0 pending) and a concurrent completeWaitpoint both enqueue continueRunIfUnblocked for the same run, the shared job ID continueRunIfUnblocked:${runId} ensures the work is debounced. The two-statement approach combined with this debounce makes the race harmless from both directions.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts`:
- Around line 430-442: The type annotation on the result of prisma.$queryRaw is
using the wrapper BigInt instead of the JS primitive bigint; update the generic
on prisma.$queryRaw from { pending_count: BigInt }[] to { pending_count: bigint
}[] so the type accurately reflects PostgreSQL COUNT(*) returning a primitive
bigint, leaving the surrounding logic (pendingCheck,
Number(pendingCheck.at(0)?.pending_count ?? 0), Prisma.join($waitpoints))
unchanged.

Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

@ericallam ericallam force-pushed the fix/waitpoint-mvcc-race branch from c603987 to 64ddf12 Compare February 17, 2026 12:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

435-435: Nit: use lowercase bigint for the TypeScript primitive type.

BigInt (uppercase) refers to the BigInt constructor/interface, not the primitive type. Prisma's $queryRaw returns the bigint primitive for PostgreSQL COUNT(*) results.

Proposed fix
-      const pendingCheck = await prisma.$queryRaw<{ pending_count: BigInt }[]>`
+      const pendingCheck = await prisma.$queryRaw<{ pending_count: bigint }[]>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` at line
435, The type for the raw query result uses the wrong BigInt identifier; update
the type annotation on the prisma.$queryRaw call (the pendingCheck declaration)
to use the TypeScript primitive `bigint` instead of the `BigInt`
constructor/interface so the inferred type matches PostgreSQL COUNT(*) results
(e.g., change `{ pending_count: BigInt }[]` to `{ pending_count: bigint }[]` for
pendingCheck).
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c603987 and 64ddf12.

📒 Files selected for processing (1)
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧠 Learnings (1)
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

400-442: Well-designed fix for the MVCC snapshot race condition.

The separation of the INSERT CTE from the pending-status check into distinct SQL statements is the correct approach under PostgreSQL's READ COMMITTED isolation. The inline comments clearly explain the rationale. 👍

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts`:
- Line 435: The type for the raw query result uses the wrong BigInt identifier;
update the type annotation on the prisma.$queryRaw call (the pendingCheck
declaration) to use the TypeScript primitive `bigint` instead of the `BigInt`
constructor/interface so the inferred type matches PostgreSQL COUNT(*) results
(e.g., change `{ pending_count: BigInt }[]` to `{ pending_count: bigint }[]` for
pendingCheck).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

451-458: Nit: BigInt (wrapper object) vs bigint (primitive) in the type annotation.

Prisma's $queryRaw returns the JavaScript bigint primitive for PostgreSQL bigint/int8 columns. The type annotation on Line 451 uses BigInt (capital B), which is the wrapper object type. This doesn't cause a runtime issue (TypeScript types are erased), but it's technically the wrong type.

Proposed fix
-      const pendingCheck = await prisma.$queryRaw<{ pending_count: BigInt }[]>`
+      const pendingCheck = await prisma.$queryRaw<{ pending_count: bigint }[]>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` around
lines 451 - 458, The type annotation for the result of prisma.$queryRaw uses the
wrapper object `BigInt` instead of the primitive `bigint`; update the generic on
prisma.$queryRaw to use `{ pending_count: bigint }[]` (referencing the
`pendingCheck` variable and the `prisma.$queryRaw` call) so the `pending_count`
property has the correct primitive type and subsequent
`Number(pendingCheck.at(0)?.pending_count ?? 0)` usage (used to compute
`isRunBlocked`) aligns with the actual returned type.
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64ddf12 and 86190be.

📒 Files selected for processing (1)
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧠 Learnings (1)
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
🔇 Additional comments (3)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (3)

367-383: Well-documented MVCC rationale — this is excellent.

The comment thoroughly explains the snapshot isolation semantics, why the two statements must be separate, and why the pending check covers all requested waitpoint IDs (not just newly inserted ones). This is exactly the kind of documentation that prevents future maintainers from "optimizing" the two statements back into one.


416-444: CTE restructuring looks correct.

The CTE now only handles the INSERT side effects, and the outer SELECT COUNT(*) FROM inserted satisfies the syntactic requirement for a final SELECT while keeping the data-modifying CTEs executable. Discarding the result is intentional since the blocking decision is now made by the separate query.


509-520: Blocking logic correctly derived from the fresh pending check — LGTM.

The isRunBlocked flag now reflects the latest committed state, ensuring that if a concurrent completeWaitpoint committed between the CTE and this check, the run won't get stuck. The debounced continueRunIfUnblocked enqueue is safe even if both this path and completeWaitpoint fire it (idempotent job ID).

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts`:
- Around line 451-458: The type annotation for the result of prisma.$queryRaw
uses the wrapper object `BigInt` instead of the primitive `bigint`; update the
generic on prisma.$queryRaw to use `{ pending_count: bigint }[]` (referencing
the `pendingCheck` variable and the `prisma.$queryRaw` call) so the
`pending_count` property has the correct primitive type and subsequent
`Number(pendingCheck.at(0)?.pending_count ?? 0)` usage (used to compute
`isRunBlocked`) aligns with the actual returned type.

@ericallam ericallam marked this pull request as ready for review February 17, 2026 12:58
Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 3 additional findings in Devin Review.

Open in Devin Review

Split the CTE in blockRunWithWaitpoint so the pending waitpoint check
is a separate SQL statement. In READ COMMITTED isolation, each statement
gets its own snapshot, so a separate SELECT sees the latest committed
state from concurrent completeWaitpoint calls.

Previously, the CTE did INSERT + pending check in one statement (one
snapshot). If completeWaitpoint committed between the CTE start and
the SELECT, the SELECT would still see PENDING due to the stale
snapshot. Neither side would enqueue continueRunIfUnblocked, leaving
the run stuck forever.
@ericallam ericallam force-pushed the fix/waitpoint-mvcc-race branch from 86190be to 39fa6d2 Compare February 18, 2026 13:30
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

444-444: Optional: replace SELECT COUNT(*) FROM inserted with SELECT 1.

The CTE result is discarded. Since PostgreSQL data-modifying CTEs execute regardless of outer-query reference (as the comment correctly notes), SELECT COUNT(*) FROM inserted can be replaced with SELECT 1 to avoid the unnecessary aggregation.

♻️ Proposed simplification
-        SELECT COUNT(*) FROM inserted`;
+        SELECT 1`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` at line
444, Replace the discarded CTE result expression "SELECT COUNT(*) FROM inserted"
with "SELECT 1" so the data-modifying CTE still runs but avoids an unnecessary
aggregation; locate the SQL string containing "SELECT COUNT(*) FROM inserted" in
waitpointSystem.ts and update it to "SELECT 1" while keeping the surrounding CTE
structure and any parameter bindings unchanged.
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86190be and 39fa6d2.

📒 Files selected for processing (1)
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🧠 Learnings (1)
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: typecheck / typecheck
🔇 Additional comments (3)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (3)

367-383: LGTM — comment is accurate and well-structured.

The READ COMMITTED snapshot semantics are correctly described, and the rationale for querying all requested waitpoint IDs (not just successfully inserted ones) is clearly explained.


458-520: LGTM — the isRunBlocked derivation and enqueue logic are correct.

The two-statement design correctly eliminates the MVCC race: pendingCheck gets a fresh snapshot that reflects any completeWaitpoint that committed after the CTE started. The debounce key (continueRunIfUnblocked:${runId}) matches completeWaitpoint's key, so concurrent enqueueing from both sides is safely idempotent.


438-442: The _WaitpointRunConnections A/B column assignment is correct. The migration confirms "A" references "TaskRun" and "B" references "Waitpoint", matching the raw SQL's assignment of "A" = runId and "B" = w.id. The table name reflects the explicit @relation("WaitpointRunConnections") name in the schema rather than implicit alphabetical model order; regardless, the A/B columns follow the correct alphabetical model ordering (TaskRun < Waitpoint).

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts`:
- Line 451: The type annotation for the result of prisma.$queryRaw in
pendingCheck is using the constructor type "BigInt" but should use the primitive
"bigint"; update the generic from <{ pending_count: BigInt }[]> to <{
pending_count: bigint }[]> (locate the prisma.$queryRaw call that assigns
pendingCheck in waitpointSystem.ts) so TypeScript correctly types the COUNT(*)
result.

---

Nitpick comments:
In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts`:
- Line 444: Replace the discarded CTE result expression "SELECT COUNT(*) FROM
inserted" with "SELECT 1" so the data-modifying CTE still runs but avoids an
unnecessary aggregation; locate the SQL string containing "SELECT COUNT(*) FROM
inserted" in waitpointSystem.ts and update it to "SELECT 1" while keeping the
surrounding CTE structure and any parameter bindings unchanged.

// isolation, each statement gets its own snapshot. The CTE's snapshot is taken when
// it starts, so if a concurrent completeWaitpoint commits during the CTE, the CTE
// won't see it. This fresh query gets a new snapshot that reflects the latest commits.
const pendingCheck = await prisma.$queryRaw<{ pending_count: BigInt }[]>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

BigInt type annotation should be bigint (lowercase primitive type).

BigInt (capital) is the TypeScript constructor object type, not the bigint primitive. Prisma returns bigint primitives from COUNT(*). While Number() handles both at runtime, the type annotation is incorrect and may cause type-checker confusion.

🛠️ Proposed fix
-      const pendingCheck = await prisma.$queryRaw<{ pending_count: BigInt }[]>`
+      const pendingCheck = await prisma.$queryRaw<{ pending_count: bigint }[]>`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const pendingCheck = await prisma.$queryRaw<{ pending_count: BigInt }[]>`
const pendingCheck = await prisma.$queryRaw<{ pending_count: bigint }[]>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` at line
451, The type annotation for the result of prisma.$queryRaw in pendingCheck is
using the constructor type "BigInt" but should use the primitive "bigint";
update the generic from <{ pending_count: BigInt }[]> to <{ pending_count:
bigint }[]> (locate the prisma.$queryRaw call that assigns pendingCheck in
waitpointSystem.ts) so TypeScript correctly types the COUNT(*) result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments