Building a Temporary Admin Access Workflow

The Problem

Every IT team knows the conversation. A developer needs to install a dependency. An engineer needs to run a system-level diagnostic. A designer needs to update a font cache. The ask is always the same: "Can you just make me a local admin for a bit?"

The traditional answers are both bad. Option A: give them permanent local admin rights, accept the expanded attack surface, and hope they don't accidentally break their system or run a malicious installer. Option B: have IT remote in every single time, creating a bottleneck that interrupts both the user and the IT team.

I wanted a third path: just-in-time admin access, approved in Slack, time-limited to 5–30 minutes, and fully audited. No persistent privilege. No IT babysitting. A complete audit trail of every command run during the window.

Goal: Users get self-service access in under 60 seconds. IT maintains approval control. Every session auto-expires. Every sudo command is logged and shipped to the Slack thread.

How It Works

User opens Iru Self Service and clicks "Request Admin Access"

A script prompts for a reason, a duration (5, 10, 15, or 30 minutes), and a reason category (Install, Debug, Config, Security, Developer, Other) via osascript dialogs. It then collects the device identity (hostname, serial) and POSTs a signed request to an API Gateway endpoint.

IT receives an interactive Slack approval message

The message, posted to a dedicated IT channel, shows user, hostname, serial, reason, and category. IT sees four duration-labeled Approve buttons (5/10/15/30 min), so they can approve at the user's requested duration or override it, plus a Deny button.

IT clicks Approve: duration-specific Iru tag assigned, background monitor detects it

A background LaunchDaemon polls /status every 20 seconds. On approval, it calls iru run directly, the fastest path to processing Library Items. The user sees an "approved" alert within 20 seconds. Each duration maps to a distinct Iru tag, which scopes its own SAP Privileges MDM profile with the matching ExpirationInterval.

Device runs elevation-start.sh via Iru Library Item

The script calls PrivilegesCLI --add to grant admin, enables a sudoers drop-in for command logging, notifies the backend to start the N-minute timer (where N is the approved duration), and installs a network monitor LaunchDaemon.

EventBridge sends a 5-minute warning DM, then fires expiration at T+N

Timers are anchored to when the device confirms elevation, not when IT clicks Approve, so the user always gets the full approved duration. The warning DM is skipped entirely for 5-minute sessions (a T+0 warning would be instant noise).

On expiration, Iru removes the tag and collects the sudo log

A second Iru tag triggers collect-sudo-log.sh, which ships the sudoers log to the backend. The backend uploads it as a file attachment in the original Slack approval thread.

Architecture

The backend is a fully serverless AWS SAM application. There's no always-on infrastructure — all compute is Lambda functions invoked by API Gateway or EventBridge Scheduler.

API Gateway + Lambda

Nine Lambda functions handle request intake, Slack actions, device confirmations, log receipt, status polling, and expiration. All endpoints sit behind API key authentication.

DynamoDB

Single-table design stores each request's full lifecycle: status, timestamps, Slack thread IDs, device ID, and actor identity for every state transition.

EventBridge Scheduler

One-time schedules per session for the 5-minute warning and the expiration (T+25 and T+30 for a 30-minute session). Schedules auto-delete after firing via ActionAfterCompletion: DELETE.

Iru MDM

Two tags act as signals. Elevation tag triggers the Privileges profile. Log-collection tag triggers log shipping. Device-side iru run forces immediate processing.

SAP Privileges

Open-source macOS app providing controlled, time-limited local admin via a LaunchAgent. Scoped via an Iru config profile — only activates on tagged devices.

System Keychain

API key stored in the macOS system keychain (accessible by root) via a provisioning script. Retrieved at runtime — never hardcoded in source.

The Slack ↔ Lambda Handshake

Slack requires a 200 response within 3 seconds of an interactive action. But processing an approval — hitting Iru, writing to DynamoDB, creating EventBridge schedules — takes longer. The solution is a two-Lambda pattern:

handleSlackAction verifies the Slack HMAC-SHA256 signature and immediately invokes processSlackAction asynchronously (InvocationType: 'Event').
handleSlackAction returns 200 to Slack within milliseconds.
processSlackAction runs independently and handles all the heavy work.
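
The acknowledge-then-process split can be sketched roughly as follows. This is a minimal illustration, not the project's actual handler: verifySignature and invokeAsync are hypothetical injected helpers, and in a real deployment invokeAsync would wrap the Lambda Invoke API with InvocationType: 'Event'.

```javascript
// Sketch of the two-Lambda handshake. Both dependencies are injected so the
// example runs without AWS; their names are assumptions for illustration.
function makeHandleSlackAction({ verifySignature, invokeAsync }) {
  return async function handleSlackAction(event) {
    // Reject anything that does not carry a valid Slack signature.
    if (!verifySignature(event)) {
      return { statusCode: 401, body: 'invalid signature' };
    }
    // Hand the payload to the worker. With InvocationType 'Event' the Invoke
    // call returns once the event is queued, not when the worker finishes.
    await invokeAsync({ body: event.body });
    // Acknowledge right away so Slack's 3-second deadline is met.
    return { statusCode: 200, body: '' };
  };
}
```

Injecting the two helpers keeps the fast path free of any Iru, DynamoDB, or EventBridge calls, which is the whole point of the split.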

Timer Anchoring

An early design mistake: the 30-minute timer was started at approval time. But there's latency between IT clicking Approve and the device actually being elevated — MDM check-in, Iru running the script, PrivilegesCLI executing. A user could lose 3–5 minutes before they even had admin.

The fix: elevation-start.sh POSTs to a /start endpoint when elevation is confirmed on device. The backend creates EventBridge schedules from that timestamp. The user always gets the full approved duration (up to 30 minutes) from the moment they're actually elevated.

bash

# elevation-start.sh — notify backend that elevation is confirmed
HTTP_STATUS=$(curl -s -o "$ELEVATION_RESPONSE_FILE" -w "%{http_code}" \
  -X POST "$API_ENDPOINT" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  --max-time 15 \
  -d "{\"requestId\":\"$REQUEST_ID\",\"serial\":\"$SERIAL\"}")
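
On the backend side, the /start handler has to turn that confirmed timestamp into schedules. A hedged sketch of the arithmetic, assuming hypothetical schedule names (warn-…, expire-…) and a made-up return shape; only the at(...) expression format and the ActionAfterCompletion value come from EventBridge Scheduler itself:

```javascript
// Sketch: derive one-time schedules from the device-confirmed start time.
// Schedule names and the returned object shape are assumptions.
const WARNING_LEAD_MINUTES = 5;

// EventBridge Scheduler one-time expressions look like at(yyyy-mm-ddThh:mm:ss).
// toISOString() is UTC, which matches the scheduler's default time zone.
function toAtExpression(epochMs) {
  return `at(${new Date(epochMs).toISOString().slice(0, 19)})`;
}

function buildSchedulePlan(requestId, confirmedAtMs, durationMinutes) {
  const expiresAtMs = confirmedAtMs + durationMinutes * 60_000;
  const plan = [{
    name: `expire-${requestId}`,
    expression: toAtExpression(expiresAtMs),
    actionAfterCompletion: 'DELETE',
  }];
  // A 5-minute session would warn at T+0, so the warning is skipped.
  if (durationMinutes > WARNING_LEAD_MINUTES) {
    plan.unshift({
      name: `warn-${requestId}`,
      expression: toAtExpression(expiresAtMs - WARNING_LEAD_MINUTES * 60_000),
      actionAfterCompletion: 'DELETE',
    });
  }
  return plan;
}
```

Because both times are computed from the confirmation timestamp rather than the approval timestamp, MDM delivery latency never eats into the session.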

Security Features

After eleven rounds of security audits, the system incorporates defense-in-depth across every layer:

Slack Signature Verification

Every webhook verified with HMAC-SHA256. Requests older than 5 minutes rejected. Timing-safe comparison via crypto.timingSafeEqual.
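
A minimal sketch of that check, following Slack's documented v0 signing scheme. The function name and parameter object are assumptions for illustration; the caller is expected to pass the raw (unparsed) request body and the X-Slack-Request-Timestamp and X-Slack-Signature header values.

```javascript
const crypto = require('node:crypto');

const MAX_SKEW_SECONDS = 5 * 60;

// Returns true only for a fresh request whose signature matches.
function verifySlackSignature({
  signingSecret,
  timestamp,   // X-Slack-Request-Timestamp header
  rawBody,     // request body exactly as received
  signature,   // X-Slack-Signature header
  nowSeconds = Math.floor(Date.now() / 1000),
}) {
  // Reject stale or malformed timestamps to limit replay attacks.
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) {
    return false;
  }
  const expected = 'v0=' + crypto
    .createHmac('sha256', signingSecret)
    .update(`v0:${timestamp}:${rawBody}`)
    .digest('hex');
  const a = Buffer.from(expected, 'utf8');
  const b = Buffer.from(String(signature), 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```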

DynamoDB Conditional Writes

Every status transition is a conditional write (ConditionExpression), so the check and the update happen atomically. Two IT admins clicking Approve simultaneously results in exactly one approval.
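
As a rough illustration, the parameters for such a write might look like the sketch below. The table name, key, and attribute names are assumptions, and the object is shaped for the DynamoDB document client's update call; the losing writer receives a ConditionalCheckFailedException instead of silently overwriting.

```javascript
// Sketch: parameters for an update that approves a request only if it is
// still pending. Table name, key, and attribute names are assumptions.
function buildApproveUpdate(requestId, approverId, nowIso) {
  return {
    TableName: 'AdminRequests',
    Key: { requestId },
    UpdateExpression:
      'SET #status = :approved, approvedBy = :actor, approvedAt = :now',
    // The write succeeds only while the item is still pending.
    ConditionExpression: '#status = :pending',
    // "status" is a DynamoDB reserved word, so it is aliased.
    ExpressionAttributeNames: { '#status': 'status' },
    ExpressionAttributeValues: {
      ':approved': 'approved',
      ':pending': 'pending',
      ':actor': approverId,
      ':now': nowIso,
    },
  };
}
```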

Input Validation Everywhere

UUID format validation on all device endpoints. Field length limits. Serial validated as 8–14 uppercase alphanumeric. Lambda endpoints reject non-object JSON bodies.
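
A minimal sketch of these checks for a device endpoint. The requestId and serial field names match the curl payload shown earlier; the function name, the body length limit, and the error strings are assumptions.

```javascript
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const SERIAL_RE = /^[A-Z0-9]{8,14}$/;
const MAX_BODY_LENGTH = 4096; // assumed limit, not from the original system

function parseDeviceRequest(rawBody) {
  if (typeof rawBody !== 'string' || rawBody.length > MAX_BODY_LENGTH) {
    return { ok: false, error: 'body missing or too long' };
  }
  let body;
  try {
    body = JSON.parse(rawBody);
  } catch {
    return { ok: false, error: 'invalid JSON' };
  }
  // JSON.parse happily returns strings, numbers, arrays, and null.
  if (body === null || typeof body !== 'object' || Array.isArray(body)) {
    return { ok: false, error: 'body must be a JSON object' };
  }
  if (typeof body.requestId !== 'string' || !UUID_RE.test(body.requestId)) {
    return { ok: false, error: 'invalid requestId' };
  }
  if (typeof body.serial !== 'string' || !SERIAL_RE.test(body.serial)) {
    return { ok: false, error: 'invalid serial' };
  }
  // Return only the validated fields so nothing else leaks downstream.
  return { ok: true, value: { requestId: body.requestId, serial: body.serial } };
}
```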

Slack mrkdwn Injection Prevention

All user-controlled fields passed through escapeSlack() before embedding in Block Kit messages. Prevents link injection via <URL|text> syntax.
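
The body of escapeSlack() isn't shown here, so the following is an assumed implementation based on the three characters Slack's formatting documentation says must be escaped:

```javascript
// Sketch of an escapeSlack() helper. Slack treats &, <, and > as control
// characters in mrkdwn, so they are replaced with HTML entities.
function escapeSlack(value) {
  return String(value)
    .replace(/&/g, '&amp;') // must run first so later entities stay intact
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
```

With the angle brackets neutralized, a reason such as `<https://evil.example|click here>` renders as literal text instead of a link.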

Device Identity Binding

Serial stored at request time is validated against every subsequent device call. A device can only interact with its own session — not another device's.

Network Loss Revocation

LaunchDaemon polls every 60 seconds. Network loss triggers immediate admin removal. Auth errors (401/403) fail-secure rather than retry.

IT Slash Command

/admin-status restricted to a configured Slack user ID allowlist. Empty allowlist defaults to denying all access (fail closed).
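
A minimal sketch of a fail-closed allowlist check, assuming the allowlist arrives as a comma-separated string (for example from an environment variable). The function names and the input format are assumptions.

```javascript
// Parse "U123, U456" into a set of IDs, ignoring blanks and whitespace.
function parseAllowlist(raw) {
  return new Set(
    String(raw ?? '')
      .split(',')
      .map((id) => id.trim())
      .filter(Boolean),
  );
}

// An empty or missing allowlist yields an empty set, so every lookup fails
// and access is denied by default.
function isAuthorized(userId, rawAllowlist) {
  const allowlist = parseAllowlist(rawAllowlist);
  return typeof userId === 'string' && allowlist.has(userId);
}
```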

Off-Hours Delegation

Optional off-hours auto-approval routes to an on-call admin. Configuration errors fail closed — requests held for manual review, never auto-approved on misconfiguration.

Transient Failure Resilience

Iru API calls retry 5xx and 429 responses with exponential backoff (1s, then 2s), up to 3 attempts. Any other 4xx throws immediately. This prevents a single rate-limit response from dropping an entire operation.
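
One way to express that retry policy, as a sketch. It assumes the wrapped operation throws an error carrying a numeric status property; that convention, the function name, and the injectable sleep are all assumptions for illustration.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation on 5xx/429 with delays of 1s, then 2s.
async function withRetry(
  operation,
  { maxAttempts = 3, baseDelayMs = 1000, sleep = defaultSleep } = {},
) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      const retryable =
        err.status === 429 || (err.status >= 500 && err.status <= 599);
      // Non-retryable errors and the final attempt surface immediately.
      if (!retryable || attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Making sleep injectable keeps the policy testable without real delays.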

Partial Failure Resilience

Elevation removal is the critical path — if it fails, EventBridge retries. Log collection failure is non-critical: session is still marked expired and IT is alerted.

Audit Trail & Delayed Notifications

Every transition records timestamp and actor. User DMs are deliberately delayed until sudo log collection succeeds — audit trail secured before user is notified.

Secrets Management

API key in macOS system keychain — never in scripts. Lambda secrets via AWS SSM. Module-load-time validation ensures Lambdas fail fast if secrets are missing.

Atomic Metadata Writes

Session metadata at /var/root/.iru-elevation/meta.json (mode 600). mktemp + mv pattern — a crash mid-write never leaves a partial file.

iru run Mutex Lock

File lock at /var/run/iru-run.lock serializes all iru run invocations across three daemons. PID-aware — detects and clears stale locks from killed processes.

Post-Run State Verification

After each iru run, daemons verify the expected state change occurred. Single retry after 120s if not confirmed — absorbs Iru tag propagation latency.

Key Design Decisions

Why Iru Tags as Signals?

Iru Library Items can be scoped to specific device tags. By scoping a Library Item to the temp-admin-elevation tag, we get Iru's built-in delivery guarantees: retry on failure, run-at-install semantics, and immediate execution on iru run. We don't need to build our own device delivery mechanism — Iru handles it.

Why SAP Privileges Instead of dseditgroup?

Direct dseditgroup calls add the user to the local admin group and require explicit cleanup. SAP Privileges integrates with macOS's authorization model, provides a visible UI indicator, supports an ExpirationInterval MDM key as a safety-net fallback, and is open-source with active maintenance. The MDM profile approach means the app only works on tagged devices.

Why Two Iru Tags?

Separation of concerns. The elevation tag is removed on revocation or expiration. The log-collection tag is assigned on revocation or expiration. These are often simultaneous but not always. Keeping them separate avoids race conditions and makes each Library Item's trigger unambiguous.

Why EventBridge Scheduler Instead of SQS Delayed Messages?

EventBridge Scheduler supports named one-time schedules that can be deleted by name. This is critical for the revoke flow: if IT revokes a 30-minute session at T+15, we cancel the T+25 warning and T+30 expiration schedules. SQS delayed messages can't be cancelled after enqueuing.

Lessons Learned

The real attack surface is the device, not the backend

Most of the interesting security findings were in shell scripts — unvalidated data in generated scripts, metadata files with wrong permissions, Python subprocesses without timeouts. Lambda code is easy to reason about; device-side bash is where subtle bugs hide. Treat shell scripts as first-class security artifacts.

Race conditions require database-level guards, not application-level checks

The "fetch → check status → update" pattern is a TOCTOU race. Two concurrent Lambda invocations can both pass the check and both apply the update. DynamoDB's ConditionExpression moves the check into the atomic write. Non-negotiable for state machines where each transition must happen exactly once.

Anchor timers to device confirmation, not approval

Any time you have an async pipeline (approve → MDM deliver → device run → confirm), the user experience is only as good as the last step. Anchoring timers to device confirmation cost one extra API call, and in return users always get the full window they were approved for.

Iterative security auditing finds what point-in-time reviews miss

Eleven rounds of security audits found meaningful issues in almost every round — not because earlier rounds were bad, but because fixing issues and adding features creates new surface area. Build security review into your iteration cycle, not just your launch gate.

Zero open findings is achievable — accept risk explicitly, not by omission

The accepted-risk items were each evaluated deliberately. Accepted risk with documented rationale is categorically different from unfixed risk with no explanation.

New features are new attack surface — review them immediately

The slash command was the highest-severity new finding: signature verification was in place, but no authorization check existed. Any workspace member could enumerate all active admin sessions. The fix was three lines. The gap between "implemented" and "authorized" is where high-severity findings live.

Fail closed beats fail open, especially for access control

The off-hours auto-approval feature had a subtle misconfiguration path that auto-approved every request on invalid configuration. The correct behavior: if you're not sure it's off-hours, require manual approval. Any access-control feature that grants permissions by default on config error is a security risk.
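
A sketch of what failing closed can look like for this kind of routing decision. The configuration shape (startHour, endHour, onCallUserId), the function name, and the mode strings are all hypothetical; the point is only that every invalid input lands on the manual path.

```javascript
const isHour = (h) => Number.isInteger(h) && h >= 0 && h <= 23;

// Decide whether a request may take the off-hours path. Anything that
// cannot be validated falls back to manual approval.
function decideApprovalRoute(config, hourOfDay) {
  const valid =
    config !== null &&
    typeof config === 'object' &&
    isHour(config.startHour) &&
    isHour(config.endHour) &&
    config.startHour !== config.endHour &&
    typeof config.onCallUserId === 'string' &&
    config.onCallUserId.length > 0 &&
    isHour(hourOfDay);
  if (!valid) {
    return { mode: 'manual', reason: 'invalid-config' };
  }
  const { startHour, endHour } = config;
  // The off-hours window may wrap past midnight (for example 18 to 8).
  const offHours = startHour < endHour
    ? hourOfDay >= startHour && hourOfDay < endHour
    : hourOfDay >= startHour || hourOfDay < endHour;
  return offHours
    ? { mode: 'off-hours', approver: config.onCallUserId }
    : { mode: 'manual', reason: 'business-hours' };
}
```

Note that the validation runs before any time comparison, so a string such as '18' in place of a number can never be coerced into an accidental match.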

Escape at the output boundary, not at ingestion

Early versions sanitized input at ingestion time — this leads to double-encoding bugs and false confidence. Store raw data; escape at every output boundary. Each output context (Slack mrkdwn, JSON, shell) has different escaping requirements.

Never call a process runner from within a script being run by that process runner

elevation-start.sh was calling iru run at the end of its own execution — but it runs inside an iru run triggered by the approval monitor. The agent holds an internal lock during execution; the nested call deadlocked the outer agent. A nested iru run inside an Iru script is a deadlock by construction.

Verify the check actually checks something

The UTF-8 validation in receiveLog was Buffer.from(x).equals(Buffer.from(x)) — a tautology that always returns true. It passed code review because it looked correct. Always test security checks with input that should fail. A check that never rejects is not a check.
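
For contrast, here is one check that can actually reject input. It is a sketch, not necessarily the fix that went into receiveLog: it relies on the standard TextDecoder with fatal: true, which throws on malformed byte sequences.

```javascript
// Returns true only if the bytes decode as well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on invalid sequences
    // instead of substituting U+FFFD replacement characters.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}
```

Feeding it a deliberately broken buffer such as Buffer.from([0xff, 0xfe]) is the kind of negative test that would have exposed the original tautology.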

By the Numbers

Lambda functions: 9

Max elevation window: 30m

Approval → elevated: <60s

Slack latency: ~3s

Recently Shipped

Background approval monitor: Self Service script exits in under 5 seconds. A background LaunchDaemon polls for approval every 20 seconds — no more blocking the Iru Self Service app.
Removed all blankPush calls: blankPush triggers an MDM check-in, not a Library Item run. Device-side iru run is now the exclusive mechanism for picking up tag changes.
iru run mutex lock: Three background daemons could previously call iru run concurrently. A PID-aware file lock serializes all invocations.
Delayed revocation DMs: Users are not notified until after sudo logs are successfully collected. Audit trail secured first.
Slack original message lifecycle: The approval message is updated to a "completed" state when log collection succeeds — outcome and timeline shown, all buttons removed.
Fixed nested iru run deadlock: Removed the redundant inner iru run call inside elevation-start.sh that deadlocked the outer agent.
Switched to iru run --reset-daily: Forces full re-evaluation even if the agent's daily run already completed.
Post-run state verification with 120s retry: Daemons confirm expected state changes after every iru run.
Off-hours failure visibility: If off-hours auto-approval fails, a warning posts to the IT Slack thread immediately.
Degraded log warning: If timezone conversion fails during log collection, a visible warning is prepended to the uploaded log.

What's Next

Extended audit retention: Export DynamoDB records to S3 before TTL expiration for long-term compliance storage.
Trend dashboard: Surface reason categories and per-team request frequency in a read-only view for IT leadership.
Rotation-aware off-hours: Pull the on-call rotation from PagerDuty rather than a static Slack user ID.

Built for a macOS-first environment using Iru as the MDM, but the core pattern (Slack-gated JIT access with EventBridge timers and MDM tag signaling) translates to other MDM platforms with an API. Source is available on GitHub.
