SM-DP+ topology and tenant isolation for eSIM hosts

Operator-hosted eSIM for MVNE/MVNO tenants rises or falls on the chosen SM-DP+ design, certificate scope, and lifecycle controls. The anchor is the SM-SR authority, but the economic and operational shape is set by the SM-DP+ ingress, storage, and discovery alignment. Selecting the right SM-DP+ topology, and proving multi-tenant SM-SR isolation, determines onboarding speed, incident blast radius, and audit scope. We unpack where bottlenecks appear, how to size them, and what a clean SM-DS discovery looks like when the SM-DP+ topology is shared.

Trust map and control plane: SM-DP+, SM-SR, eUICC, LPA

Before topology selection, the host has to draw the trust map. GSMA SGP.21 and SGP.22 define the protocol path, but they do not decide the commercial risk owner. That is set by certificate custody, audit perimeter, and the way tenant authority is reflected across the eUICC, LPA, SM-DP+, and SM-SR domains.

The SM-SR should be treated as the enable, disable, and binding authority. Tenant isolation then needs more than a tenant_id column. It should be pinned to EID ranges, ISD-R policy, and explicit API tenancy boundaries. An SM-SR call that can address an EID outside the tenant realm is a control-plane defect, even if no profile material is exposed.
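The realm check described above can be sketched as a guard that runs before any SM-SR operation is dispatched. This is a minimal illustration, not a reference implementation: `TenantRealm`, `CrossTenantAccess`, and `authorize_eid` are hypothetical names, and prefix matching stands in for whatever EID range model the host actually uses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TenantRealm:
    """Hypothetical realm record: the EID prefixes assigned to one tenant."""
    tenant_id: str
    eid_prefixes: Tuple[str, ...]


class CrossTenantAccess(Exception):
    """Raised when a caller addresses an EID outside its own realm."""


def authorize_eid(realm: TenantRealm, eid: str) -> None:
    # An EID is 32 decimal digits; reject malformed input before matching.
    if len(eid) != 32 or not (eid.isascii() and eid.isdigit()):
        raise ValueError("EID must be 32 decimal digits")
    # Deny by default: the EID must start with one of the tenant's prefixes.
    if not any(eid.startswith(prefix) for prefix in realm.eid_prefixes):
        raise CrossTenantAccess(f"{realm.tenant_id} may not address this EID")
```

Raising an exception rather than returning a boolean means a caller that forgets to check the result still fails closed.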

On the SM-DP+ side, terminate TLS and require mutual authentication (mTLS) on RSP APIs. Keep DSC private keys in FIPS 140-2/3 Level 3 HSM partitions where the tenant risk model warrants it. OCSP and CRL handling should be bounded, not improvised. Short OCSP caches and stapling reduce responder dependency during brownouts; CRLs remain the safety net when revocation checks degrade.

GSMA specs in scope: SGP.21, SGP.22, SAS-SM
Trust anchors: EUM root, CI delegate certificates
HSM policy: FIPS 140-2/3 L3; tenant-aligned partitions optional
OCSP/CRL strategy: OCSP stapling + 5–15m cache, CRL as safety net
Certificate rotation: TLS 90d; DSC 24–36m with staged rollover
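
One way to read the OCSP/CRL row above is as a bounded cache with an explicit fallback. The sketch below assumes a caller-supplied `ocsp_lookup` function that returns True (revoked), False (good), or None (responder unavailable). That function, the `RevocationChecker` class, and the 10-minute TTL are illustrative assumptions; no real OCSP client library is used here.

```python
import time
from typing import Callable, Dict, Optional, Set, Tuple

OCSP_CACHE_TTL_S = 10 * 60  # assumed value inside the 5-15 minute window


class RevocationChecker:
    def __init__(self,
                 ocsp_lookup: Callable[[str], Optional[bool]],
                 crl_serials: Set[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ocsp_lookup = ocsp_lookup  # hypothetical responder client
        self._crl_serials = crl_serials  # serials from the last CRL snapshot
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def is_revoked(self, serial: str) -> bool:
        now = self._clock()
        cached = self._cache.get(serial)
        if cached is not None and now - cached[0] < OCSP_CACHE_TTL_S:
            return cached[1]  # fresh cached OCSP answer
        answer = self._ocsp_lookup(serial)
        if answer is None:
            # Responder degraded: fall back to the CRL snapshot, do not cache.
            return serial in self._crl_serials
        self._cache[serial] = (now, answer)
        return answer
```

The fallback result is deliberately not cached, so the next call retries the responder instead of pinning a CRL-derived answer for the full TTL.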

Data roles also belong in the architecture pack, not the legal annex. RSP metadata, activation codes, and device identifiers can be treated as personal data in several markets. Lawful intercept, audit export, and retention policy should be decided before tenants are attached. The user path should be equally explicit: SM-DS discovery first, QR fallback second, with activation code realms routing to the intended SM-DP+ rather than to a shared default.

Three SM-DP+ deployment topologies for operator-hosted tenants

The first pattern is a shared multi-tenant SM-DP+. One PKI domain, one profile store, and one ingress tier serve multiple tenants. It is the lowest operating-cost model because worker pools, monitoring, certificate operations, and storage lifecycle controls are shared. The trade-off is a wider failure surface. A malformed tenant-side callback, an HSM concurrency spike, or a profile packaging defect can stress the common plane unless the ingress enforces quotas early.

The second pattern is virtualized per-tenant SM-DP+. Each tenant receives an isolated instance or namespace behind a common ingress layer. This raises instance count and increases certificate, HSM partition, and release-management work. It also provides a cleaner blast-radius boundary for regulated tenants and for brands with heavy launch peaks. For an MVNE servicing 12+ tenants in EMEA, this model often becomes the middle ground: common operations, but isolated profile storage and tenant-specific rate limits.

The third pattern is an external partner SM-DP+ with an operator-hosted SM-SR. It moves DP+ operations and packaging load outside the host’s direct runbook. It does not remove accountability for discovery alignment, tenant onboarding, or audit evidence. Contracts need to state which party owns SM-DS registration, certificate rollover, profile retention, incident notification, and failed-download settlement.

Discovery is where topology becomes visible. Activation code realms must map to tenant endpoints, and SM-DS registration updates need operational controls that match DNS TTLs. A stale realm can prolong misroutes after migration. Profile package storage should be encrypted at rest with per-tenant keys, and concurrent downloads should be throttled per tenant to protect object storage, HSM signing operations, and database locks.

json

{
  "tenantRoutes": [
    {
      "tenantId": "tenant-emea-07",
      "eidPrefix": "89049032",
      "mccMnc": ["23450", "26209"],
      "activationRealm": "rsp.tenant-emea-07.example",
      "smdpEndpoint": "https://dpplus-emea-07.rsp.example/gsma/rsp2"
    },
    {
      "tenantId": "tenant-apac-02",
      "eidPrefix": "89049088",
      "mccMnc": ["52512"],
      "activationRealm": "rsp.tenant-apac-02.example",
      "smdpEndpoint": "https://dpplus-apac-02.rsp.example/gsma/rsp2"
    }
  ],
  "defaultAction": "reject"
}

The default action matters. A shared ingress should reject unmatched MCC/MNC or EID ranges rather than fall back to a general SM-DP+ endpoint. Silent fallback looks convenient during test cycles and becomes expensive during commercial launch.
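
A resolver over the route table above might look like the following sketch. It assumes that a request must match both the EID prefix and the MCC/MNC of a single route, which is one possible reading of the configuration; `resolve_endpoint` and `RouteRejected` are hypothetical names.

```python
from typing import Any, Dict

ROUTES: Dict[str, Any] = {
    "tenantRoutes": [
        {"tenantId": "tenant-emea-07", "eidPrefix": "89049032",
         "mccMnc": ["23450", "26209"],
         "smdpEndpoint": "https://dpplus-emea-07.rsp.example/gsma/rsp2"},
        {"tenantId": "tenant-apac-02", "eidPrefix": "89049088",
         "mccMnc": ["52512"],
         "smdpEndpoint": "https://dpplus-apac-02.rsp.example/gsma/rsp2"},
    ],
    "defaultAction": "reject",
}


class RouteRejected(Exception):
    """No tenant route matched; the request must not be forwarded."""


def resolve_endpoint(config: Dict[str, Any], eid: str, mcc_mnc: str) -> str:
    # Only an explicit reject policy is supported; refuse to guess otherwise.
    if config.get("defaultAction") != "reject":
        raise ValueError("only defaultAction=reject is supported")
    for route in config["tenantRoutes"]:
        if eid.startswith(route["eidPrefix"]) and mcc_mnc in route["mccMnc"]:
            return route["smdpEndpoint"]
    # Unmatched traffic is rejected rather than sent to a shared endpoint.
    raise RouteRejected(f"no route for MCC/MNC {mcc_mnc}")
```

Refusing to load any `defaultAction` other than `reject` keeps a test-cycle fallback from reaching production configuration unnoticed.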

Designing a multi-tenant SM-SR without cross-tenant bleed

A multi-tenant SM-SR concentrates authority. It binds profiles, enables and disables them, and records the state transitions that tenants use to reconcile customers, devices, and charges. The design should bind each ISD-R to a tenant by EID range and serving MCC/MNC. Allow-lists should block accidental attachment to a tenant outside the intended realm. This is not only a security control; it prevents audit disputes after bulk provisioning windows.

Lifecycle events should be modeled as idempotent webhooks into tenant BSS/OSS and OCS domains. Each download, bound, enable, disable, and delete event needs a tenant correlation ID, replay protection, and a sequence number. Tenants must be able to replay a day of events without changing charge state. Conversely, the host must be able to prove that a retry did not execute the same SM-SR operation twice.
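
On the tenant side, replay safety can be sketched as a ledger that ignores event IDs it has already applied and sequence numbers it has already passed. The `sequence` field, the per-EID ordering, and the rule that only `enable` changes charge state are assumptions made for illustration; `TenantLedger` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class TenantLedger:
    seen_event_ids: Set[str] = field(default_factory=set)
    last_sequence: Dict[str, int] = field(default_factory=dict)  # keyed by EID
    charges: int = 0

    def apply(self, event: Dict[str, Any]) -> str:
        # Replay protection: an event ID is applied at most once.
        if event["eventId"] in self.seen_event_ids:
            return "duplicate"
        # Ordering: drop events at or below the last sequence seen for the EID.
        if event["sequence"] <= self.last_sequence.get(event["eid"], 0):
            return "stale"
        self.seen_event_ids.add(event["eventId"])
        self.last_sequence[event["eid"]] = event["sequence"]
        if event["eventType"] == "enable":  # assumed charge trigger
            self.charges += 1
        return "applied"
```

Replaying the same batch of events through `apply` leaves `charges` unchanged, which is the property a tenant needs before it can safely re-ingest a day of webhooks.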

Keep MNP and device swap logic separate from RSP state. The SM-DP+ issues profile material; the SM-SR orchestrates enable and disable. Porting events can trigger provisioning workflows, but they are not equivalent to profile lifecycle events. Conflating them creates orphan profiles, incorrect charge triggers, and confused tenant support paths.

Retention and export design need the same precision. LAES boundaries, SM-SR event metadata, and tenant audit extracts should follow the host’s policy while remaining separable at tenant level. One Tier-2 MNO in Southeast Asia, with roughly 8M subscribers, found that tenant audit evidence took longer to approve than the API integration because retention classes had not been mapped to each tenant contract.

Disaster recovery is a sequencing problem. Dual-site SM-SR designs should align database write-ahead state and message offsets, with narrow sequence windows to prevent double execution after failover. During OEM batch onboarding, watch EID prefixes closely. Collisions or mis-entered ranges can route profiles to the wrong tenant during peak windows and remain invisible until activation support tickets arrive.
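
EID prefix collisions can be caught before a batch is accepted. The sketch below flags any pair of prefixes from different tenants where one is a prefix of the other; `find_prefix_collisions` is a hypothetical helper, and the pairwise comparison is only suitable for the modest number of ranges a host typically onboards.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def find_prefix_collisions(
    prefixes_by_tenant: Dict[str, List[str]],
) -> List[Tuple[str, str, str, str]]:
    """Return (tenant_a, prefix_a, tenant_b, prefix_b) for overlapping ranges."""
    flat = [(tenant, prefix)
            for tenant, prefixes in prefixes_by_tenant.items()
            for prefix in prefixes]
    collisions = []
    for (t1, p1), (t2, p2) in combinations(flat, 2):
        # Two ranges overlap when either prefix contains the other.
        if t1 != t2 and (p1.startswith(p2) or p2.startswith(p1)):
            collisions.append((t1, p1, t2, p2))
    return collisions
```

Running this check as a gate on the onboarding request, rather than as a nightly report, surfaces a mis-entered range before any profile is routed with it.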

Provisioning workflow, discovery, and settlement touchpoints

Provisioning starts before the device talks to the RSP. Activation codes can be single-use or pooled, with TTL windows set by fraud appetite and expected channel latency. Pre-provision checks should confirm KYC, stock state, and OCS eligibility before a token is issued. A failed download has cost: SM-DS lookups, profile reservation, storage reads, HSM operations, and support handling.
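
The pre-provision gate and token TTL can be sketched together. This is a simplified single-use model with an in-memory store; `TokenIssuer` and its boolean inputs are hypothetical, and a real deployment would back the checks with KYC, inventory, and OCS systems.

```python
import secrets
import time
from typing import Callable, Dict


class TokenIssuer:
    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._expiry_by_token: Dict[str, float] = {}

    def issue(self, kyc_ok: bool, in_stock: bool, ocs_eligible: bool) -> str:
        # No token is issued unless every pre-provision check passes.
        if not (kyc_ok and in_stock and ocs_eligible):
            raise PermissionError("pre-provision checks failed")
        token = secrets.token_urlsafe(16)
        self._expiry_by_token[token] = self._clock() + self._ttl_s
        return token

    def redeem(self, token: str) -> bool:
        # Single use: the token is removed whether or not it is still valid.
        expiry = self._expiry_by_token.pop(token, None)
        return expiry is not None and self._clock() <= expiry
```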

In the SGP.22 flow, the LPA reaches SM-DS first when discovery is enabled. The activation code realm then needs to land on the intended SM-DP+. QR fallback should use short TTLs and should not outlive the discovery state. This matters during tenant migrations, where old QR assets and cached realms can keep devices pointed at retired endpoints.

Callbacks are part of charging integrity. The RSP should send downloadProgress and final result events with idempotency keys. Tenants need to de-duplicate those events so retries do not create double charges or orphan customer states. Retry and backoff policy should respect device-side timers and cap attempts during transient Wi-Fi or radio impairment. Unlimited retries burn operational capacity and can reserve licenses that never convert.
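
A capped retry policy can be expressed as a precomputed delay schedule. The sketch uses exponential backoff with full jitter; the attempt limit, base delay, and ceiling are placeholder values rather than figures from any specification, and they would need tuning against device-side timers.

```python
import random
from typing import Callable, List


def backoff_schedule(max_attempts: int = 5,
                     base_s: float = 2.0,
                     cap_s: float = 60.0,
                     rng: Callable[[], float] = random.random) -> List[float]:
    """Delays (seconds) to wait before each retry; the first attempt has none."""
    delays = []
    for retry in range(1, max_attempts):
        # Exponential growth, bounded by cap_s so late retries stay predictable.
        ceiling = min(cap_s, base_s * 2 ** (retry - 1))
        # Full jitter spreads retries from many devices across the window.
        delays.append(ceiling * rng())
    return delays
```

Because the schedule has a fixed length, a download that exhausts it can release its reserved profile instead of holding inventory indefinitely.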

json

{
  "eventType": "profileDownloadResult",
  "eventId": "9f4c7b6e-3d1a-4f3f-94df-0c81b4d1a912",
  "tenantId": "tenant-emea-07",
  "correlationId": "order-7844129",
  "eid": "89049032123456789012345678901234",
  "iccid": "8944501234567890123",
  "result": "SUCCESS",
  "resultCode": "RSP-2000",
  "attempt": 1,
  "completedAt": "2026-05-07T08:41:23Z"
}

Settlement then needs a defined basis. Some contracts rate successful downloads only. Others charge expired tokens, reserved inventory, profile storage, or retries above a threshold. A greenfield MVNO launched after 2023 on a multi-IMSI stack can absorb profile inventory cost differently from a mature host with millions of dormant eSIM profiles. The wrong rating unit makes unit economics opaque within one quarter.

License model: Per activation vs per-inventory slot
Discovery fees: SM-DS registration and per-lookup (contractual)
HSM operations: CapEx for modules, OpEx for RMA, firmware audits
DC footprint: Dual-site compute + encrypted object storage
Compliance cost: SAS-SM audits, penetration tests, data residency
NOC coverage: 24/7 monitoring, synthetic activations per tenant
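
The effect of the rating unit can be made concrete by normalizing both license models to cost per successful download. The function below is a simplified sketch with fees in integer minor currency units; the fee values used when calling it are placeholders, not market prices.

```python
from typing import Dict


def effective_cost_per_success(successes: int,
                               reserved_slots: int,
                               activation_fee_minor: int,
                               slot_fee_minor: int) -> Dict[str, float]:
    """Cost per successful download under each license model."""
    if successes <= 0:
        raise ValueError("need at least one successful download")
    return {
        # Per-activation: cost tracks successful downloads directly.
        "per_activation": float(activation_fee_minor),
        # Per-slot: every reserved profile is billed, converted or not.
        "per_inventory_slot": reserved_slots * slot_fee_minor / successes,
    }
```

With 1,000 successes against 5,000 reserved slots, a placeholder slot fee of 10 minor units works out to 50 per success, which shows how dormant inventory inflates the effective unit cost under slot-based rating.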

Sizing, SLOs, and failure scenarios

Sizing should start with launch behavior, not average monthly activations. Device campaigns, OEM releases, retail drops, and number migration windows create short peaks that expose weak isolation. DP+ stateless workers can scale horizontally, but databases, object storage, HSM partitions, and SM-SR sequencing usually set the real ceiling. Pre-warm token pools and tenant caches before launch windows.

Set explicit SLO targets per topology. A practical baseline is 96–98% first-attempt success, sub-300 ms median DP+ time to first byte in-region, and 99.9% monthly webhook delivery with retries. These figures should be measured at tenant level, not only platform level. A shared platform can look healthy while one tenant is blocked by a certificate, realm, or quota defect.
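
Measuring at tenant level can be sketched as a grouping step over raw attempt records. The record fields (`tenantId`, `attempt`, `success`, `ttfb_ms`) are assumed for illustration, and the thresholds take the lower bound of the baseline above: 96% first-attempt success and a median time to first byte under 300 ms.

```python
from statistics import median
from typing import Any, Dict, List


def tenant_slo_report(attempts: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    by_tenant: Dict[str, List[Dict[str, Any]]] = {}
    for row in attempts:
        by_tenant.setdefault(row["tenantId"], []).append(row)

    report = {}
    for tenant, rows in by_tenant.items():
        first = [r for r in rows if r["attempt"] == 1]
        # Share of first attempts that succeeded; 0.0 if none were recorded.
        rate = sum(1 for r in first if r["success"]) / len(first) if first else 0.0
        med = median([r["ttfb_ms"] for r in rows])
        report[tenant] = {
            "first_attempt_success": rate,
            "median_ttfb_ms": med,
            "meets_slo": rate >= 0.96 and med < 300,
        }
    return report
```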

Blast-radius controls belong at ingress. Per-tenant rate limits, circuit breakers, storage quotas, and HSM concurrency ceilings prevent one tenant’s launch from degrading another tenant’s activation path. Resilience targets should be explicit: RPO 0 for license and token databases, and RTO under 30 minutes for active/standby recovery. Failover tests should include synthetic device activations, not only database promotion.
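
A per-tenant rate limit at ingress is commonly built as one token bucket per tenant. The sketch below is a single-process illustration with hypothetical names; a shared ingress tier would need the bucket state in a store that all ingress nodes can reach.

```python
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last_ts)

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (self._burst, now))
        # Refill for the elapsed time, never beyond the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Each tenant draws only from its own bucket, so one tenant exhausting its allowance during a launch does not reduce the capacity available to another.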

Observability should focus on tenant-visible failure. Run synthetic LPA journeys per tenant, monitor certificate expiry, track OCSP health, and expose activation metrics that distinguish discovery failures from DP+ download failures and SM-SR state failures. Change windows should align certificate rotations and discovery updates with OEM release calendars. Rollback plans must cover DNS, SM-DS entries, token issuance, and tenant webhook endpoints.

Operator-hosted RSP works when SM-DP+ placement, SM-SR tenancy, and discovery are engineered for isolation and audit. The commercial upside is not only lower platform cost. It is lower incident externality, clearer accountability, and predictable activation unit cost across tenants with different launch profiles and regulatory obligations.