SDN Controller vs NMS in Optical Networks: The Control–Management Split
Two pieces of software touch the same ROADM and want different things from it on different clocks. One decides and acts in real time; the other records and oversees across the equipment's whole life. This is what each is for, the interfaces that divide them, the trust boundary between them, and where the clean split breaks.
1. Introduction
Send a 400G wavelength request into a modern optical network and two distinct systems react. One computes a path, picks a wavelength, programs the ROADM and the transponder, and then keeps optimizing the result as the network changes around it. The other never touches that decision — it inventories the card you just lit, watches its pre-FEC error rate for the next three years, and raises a ticket when an amplifier upstream starts to age. The first is the SDN controller. The second is the Network Management System.
The two get conflated because both, loosely, "manage the network." The distinction that actually matters is older and sharper: control versus management. Control is real-time service operations — deciding and acting on the live network on a clock measured in milliseconds to seconds. Management is lifecycle operations — the long-horizon bookkeeping of equipment, faults, performance, and contracts, on a clock measured in minutes to years. The Open Networking Foundation's SDN architecture (TR-521) frames the two not as separate boxes but as different aspects of the same thing, modelled as roles rather than planes.
This article works through that split in the optical context: what the controller is expected to do, what the NMS is expected to do, the southbound and northbound interfaces that both divide and connect them, the trust boundary that keeps them apart, the optical-specific control problems a packet controller never faces, how the layers stack in a real multi-vendor deployment, and the place where the clean line breaks down in production.
2. The control–management split
The ONF SDN architecture defines the controller as the node that performs the real-time convergence of a changing resource pool and a changing service demand toward an optimum, where even the optimization criteria can shift over time. "Real-time" and "convergence" carry the weight of that definition. The controller is a feedback element: it senses network state, computes against demand, acts, and senses again. Management is what the control and data planes are deliberately not built to do or not permitted to do — commissioning equipment, isolating faults, holding inventory, archiving performance, reconciling bills.
The architecture's first structural move is to separate the control plane from the data plane, the same split that lets a disaggregated OpenROADM control layer drive heterogeneous optics through one consistent model. Underneath sits the data plane: the physical glass and silicon — the reconfigurable add/drop node, the transponders, the line amplifiers. The diagram below places the two roles over that data plane. Management is drawn orthogonal to the control stack, not above it: it sets policy and absorbs telemetry across every plane at once rather than sitting in the service path.
The table below is the heart of the contrast. Read it as two answers to the same question — "what does this layer owe the network?" — that diverge on almost every axis: clock, decision authority, the state each holds as authoritative, the interfaces it speaks, and the trust domain it lives in.
| Dimension | SDN controller | NMS (with EMS / OSS management layer) |
|---|---|---|
| Core purpose | Real-time service control — decide and act on the live network | Lifecycle and assurance — record, monitor, and oversee |
| Defining function | Converge resources and demand toward an optimum, continuously (TR-521) | FCAPS: fault, configuration, accounting, performance, security |
| Timescale | ~50 ms (protection) to tens of seconds (provisioning, restoration) | Minutes to years (commissioning, PM history, upgrades, SLA) |
| Decision authority | Computes paths, RWA, restoration; programs the network elements | Provisions through the controller; monitors, correlates, audits |
| Authoritative state | Live topology and spectrum occupancy | System-of-record: inventory, alarm and PM history |
| Primary interfaces | Northbound TAPI / intent; southbound NETCONF, gNMI over OpenConfig / OpenROADM | Northbound to OSS / BSS; southbound TL1 / SNMP / NETCONF; consumes controller APIs |
| Multi-vendor posture | Abstracts a multi-vendor optical layer behind one open northbound interface | Historically single-vendor per EMS domain, federated upward into an NMS |
| Trust domain | May sit in a tenant or orchestrator domain; default-deny exposure | Core FCAPS stays provider-side; holds the audit boundary |
| Optical example | Retune a transponder and reroute a wavelength in seconds | Trend Q-margin to flag an aging EDFA weeks before it fails |
Takeaway: Controller and NMS are not competing products; they are two clocks and two trust postures over the same glass. The controller decides and acts in real time and holds the live model it needs to do so. The NMS remembers, oversees, and holds the system-of-record. Every other difference follows from those two facts.
3. What the controller is expected to do
A 400G wavelength request lands between two metro sites. The controller is expected to turn that intent into lit photons without a human in the loop, and to do it in seconds. First it consults its authoritative model of the network: the topology of fiber spans and degrees, and the spectrum occupancy across the C-band — which slots are taken, which are free, where the guard bands fall. Then it solves the routing-and-wavelength-assignment problem against that occupancy, picks a transponder mode and forward-error-correction setting, checks that the chosen path will actually close optically, and programs two things: the wavelength-selective-switch attenuation profiles at every ROADM the channel traverses, and the transponder's frequency and modulation.
The job does not end when the service comes up. The controller then holds that service against a changing network: re-optimizing channel power as spans age, defragmenting spectrum when a circuit tears down, and rerouting around a fiber cut. It does all of this through standard northbound and southbound interfaces, the same discipline that turns vendor-specific CLI work into repeatable optical network automation. The protection numbers set the clock: an optical 1+1 selector switches in under 50 ms (a typical figure for loss-of-light triggered switching), while shared-mesh optical restoration completes in tens of seconds. Both are control-plane timescales, and both depend on the controller holding state that is current to the second.
The interface the controller exposes northward is increasingly the ONF Transport API. A connectivity request to a TAPI-speaking controller is a RESTCONF call against a YANG-modeled context — the controller takes the intent (endpoints, layer, capacity) and does the path computation and device programming itself.
POST /restconf/data/tapi-common:context/tapi-connectivity:connectivity-context
Content-Type: application/yang-data+json

{
  "tapi-connectivity:connectivity-service": [{
    "uuid": "och-svc-metro-0407",
    "connectivity-constraint": {
      "service-layer": "PHOTONIC_MEDIA",
      "requested-capacity": { "total-size": { "value": "400", "unit": "GBPS" } }
    },
    "end-point": [
      { "local-id": "A", "service-interface-point": {
          "service-interface-point-uuid": "sip-MET07-deg1" } },
      { "local-id": "Z", "service-interface-point": {
          "service-interface-point-uuid": "sip-MET08-deg2" } }
    ]
  }]
}
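For a client that has to issue many such requests, the body is easier to generate than to hand-write. The following is a minimal Python sketch that builds the same path, header, and JSON body from four parameters. The function name `build_connectivity_request` is an assumption made for illustration, the field layout simply mirrors the request shown here, and no network call is made.

```python
import json

# Resource path and media type copied from the request shown in the text.
TAPI_PATH = "/restconf/data/tapi-common:context/tapi-connectivity:connectivity-context"
HEADERS = {"Content-Type": "application/yang-data+json"}


def build_connectivity_request(uuid, sip_a, sip_z, capacity_gbps):
    """Return (path, headers, body) for a connectivity-service POST.

    Hypothetical helper: it only assembles the payload, it does not send it.
    """
    payload = {
        "tapi-connectivity:connectivity-service": [{
            "uuid": uuid,
            "connectivity-constraint": {
                "service-layer": "PHOTONIC_MEDIA",
                "requested-capacity": {
                    "total-size": {"value": str(capacity_gbps), "unit": "GBPS"}
                },
            },
            # One end-point entry per side of the service (A and Z).
            "end-point": [
                {
                    "local-id": local_id,
                    "service-interface-point": {
                        "service-interface-point-uuid": sip
                    },
                }
                for local_id, sip in (("A", sip_a), ("Z", sip_z))
            ],
        }]
    }
    return TAPI_PATH, HEADERS, json.dumps(payload, indent=2)


path, headers, body = build_connectivity_request(
    "och-svc-metro-0407", "sip-MET07-deg1", "sip-MET08-deg2", 400
)
print(path)
print(body)
```

The caller supplies intent only (endpoints and capacity); nothing in the payload names a route, a frequency, or a device, which is the point of the northbound abstraction.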
A backhoe severs the working span between two metro ROADMs at 08:14. The controller already holds the live topology and knows two facts the instant the loss-of-light propagates: which wavelengths rode that span, and which alternate route has free spectrum with enough margin. Within tens of seconds it retunes the affected transponders, repaints the WSS attenuation along a diverse path, and brings the circuits back — before a second failure can strand them. No ticket was opened and no engineer dispatched. The NMS, meanwhile, has logged the root-cause alarm and the restoration event for the post-incident review the next morning.
Takeaway: The controller is measured on real-time outcomes — how fast it provisions, how fast it restores, how well it keeps services optimal as the network drifts. To do that it must own a live, trustworthy model of topology and spectrum and be programmable end-to-end. A controller that has to ask a human is not doing its job.
4. What the NMS is expected to do
One cut fiber produces forty alarms. The loss-of-light at the break is the root cause; the backward-defect indications, loss-of-frame, and downstream client failures cascading across ten network elements are symptoms. The NMS is expected to correlate that storm into a single root cause and a single ticket, so the operations team chases one fault instead of forty. That is the "F" in FCAPS — the fault, configuration, accounting, performance, and security functions that the ITU-T's TMN management model (M.3400) groups together as the work of the management layer.
Walk the rest of the set. Configuration management holds the inventory — every shelf, card, and pluggable, with software and firmware versions, plus configuration backups and the commissioning of new elements. Accounting covers usage records, SLA tracking, and the billing adjustment when a service breaches its commitment. Performance management archives the long history: pre-FEC bit-error-rate, Q-margin, and received optical power, trended over weeks and months so a slow degradation shows up as a line on a graph rather than a 3 a.m. outage. Security management owns the logs, role-based access control, and credential lifecycle. None of this runs on a control-plane clock; it runs on minutes-to-years. In optics, this layer grew out of the per-vendor element management system, federated upward into an NMS — the world before NETCONF and YANG gave operators a vendor-neutral way in, when each vendor's gear meant its own TL1 or SNMP interface and its own GUI.
The fault and performance side is where the NMS earns its keep day to day. The alarm log below shows the distinction concretely: the root-cause LOS, the downstream symptoms the NMS should suppress under it, the controller's restoration action, and the clear.
2026-06-24T08:14:02Z NODE-MET-07 CRITICAL LOS och-os 1/3/2 (root: fiber cut, span MET07-MET08)
2026-06-24T08:14:02Z NODE-MET-08 MAJOR BDI odu4 1/3/2 (downstream, suppressed under root LOS)
2026-06-24T08:14:03Z NODE-MET-08 MAJOR LOF otu4 1/3/2 (downstream, suppressed)
2026-06-24T08:14:09Z CONTROLLER INFO restoration: reroute MET07->MET12->MET08, retune 193.95 THz
2026-06-24T08:14:31Z NODE-MET-08 OK LOS cleared och-os 1/3/2 (service restored, 29 s)
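The suppression logic in that log can be sketched in a few lines of Python. This is an illustrative toy, not how any particular NMS implements correlation: the `Alarm` class, the `SYMPTOM_KINDS` rule, the adjacency map, and the five-second window are all assumptions chosen to reproduce the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    time: datetime
    node: str
    kind: str  # e.g. "LOS", "BDI", "LOF"


# Assumed rule: these kinds are symptoms when a LOS exists on the same
# or an adjacent node within the correlation window.
SYMPTOM_KINDS = {"BDI", "LOF"}


def correlate(alarms, adjacency, window=timedelta(seconds=5)):
    """Return (roots, symptoms_by_root_index, uncorrelated)."""
    roots = [a for a in alarms if a.kind == "LOS"]
    symptoms = {i: [] for i in range(len(roots))}
    uncorrelated = []
    for alarm in alarms:
        if alarm.kind == "LOS":
            continue
        for i, root in enumerate(roots):
            near = alarm.node == root.node or alarm.node in adjacency.get(root.node, ())
            in_window = abs(alarm.time - root.time) <= window
            if alarm.kind in SYMPTOM_KINDS and near and in_window:
                symptoms[i].append(alarm)  # suppress under this root
                break
        else:
            uncorrelated.append(alarm)  # no root explains it
    return roots, symptoms, uncorrelated


ts = datetime.fromisoformat
alarms = [
    Alarm(ts("2026-06-24T08:14:02+00:00"), "NODE-MET-07", "LOS"),
    Alarm(ts("2026-06-24T08:14:02+00:00"), "NODE-MET-08", "BDI"),
    Alarm(ts("2026-06-24T08:14:03+00:00"), "NODE-MET-08", "LOF"),
]
adjacency = {"NODE-MET-07": {"NODE-MET-08"}}  # span MET07-MET08

roots, symptoms, uncorrelated = correlate(alarms, adjacency)
print(f"{len(roots)} root cause, {len(symptoms[0])} suppressed, {len(uncorrelated)} uncorrelated")
```

A production correlator would work from the full topology and service model rather than a flat adjacency map, but the shape is the same: one ticket per root, with symptoms attached rather than raised.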
An inline EDFA's noise figure creeps up over six weeks as its pump ages. No alarm fires — every channel is still above threshold. But the NMS has been trending received OSNR and Q-margin on the wavelengths crossing that amplifier, and the slope is unmistakable: about 0.1 dB of margin lost per week. The performance-management function raises a maintenance flag long before any service degrades, an engineer is scheduled, and the card is swapped in a planned window. The controller never sees this; it only knows the live margin is still adequate. Catching the slow drift is squarely the management layer's job, and it depends on history the controller does not keep.
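The trend check behind that maintenance flag is ordinary least-squares on archived samples. The Python sketch below assumes weekly Q-margin readings that start at 3.0 dB and a maintenance threshold of 2.0 dB; both figures are invented for illustration, and only the 0.1 dB per week slope comes from the example above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_to_threshold(weeks, margins_db, threshold_db):
    """Weeks from the last sample until the fitted trend crosses the threshold.

    Returns None when the margin is flat or improving.
    """
    slope, intercept = fit_line(weeks, margins_db)
    if slope >= 0:
        return None
    crossing_week = (threshold_db - intercept) / slope
    return max(0.0, crossing_week - weeks[-1])


# Assumed history: seven weekly samples losing 0.1 dB per week.
weeks = [0, 1, 2, 3, 4, 5, 6]
margins_db = [3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 2.4]

slope, _ = fit_line(weeks, margins_db)
remaining = weeks_to_threshold(weeks, margins_db, threshold_db=2.0)
print(f"slope {slope:.2f} dB/week, about {remaining:.1f} weeks to threshold")
```

Every sample in this history is above threshold, so no alarm would fire; the flag comes from the slope, which only exists because the history was kept.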
Takeaway: The NMS is measured on the long view — clean root-cause correlation, an inventory that matches the ground, performance history that predicts failures, and an audit trail that survives. It is the system-of-record. Where the controller asks "what is true right now," the NMS asks "what has been true, and what is drifting."
5. The interfaces between them
The protocol on the wire tells you which layer you are in. Southbound, on the data-controller plane interface, the controller programs network elements with NETCONF (over SSH, port 830) or RESTCONF (over HTTPS), and subscribes to streaming telemetry with gNMI. The data itself is shaped by YANG device models: OpenConfig where operators want a vendor-neutral model, OpenROADM where the optical line system is disaggregated. Legacy management still rides TL1 and SNMP into the NMS. gNMI exists because polling does not scale to optical telemetry density; its streaming subscriptions are reported to deliver roughly a hundred times the resolution of SNMP polling, a vendor-and-operator claim from the OpenConfig community rather than a fixed standard figure.
Northbound — the application-controller plane interface — is where the ONF Transport API sits. TAPI is a RESTCONF/YANG interface defined explicitly for the boundary between SDN controllers, orchestrators, management systems, and OSS. It exposes topology, connectivity, path-computation, notification, and OAM services across layers 0, 1, and 2 (photonic, OTN, and Ethernet), and the v2.4.0 release added impairment awareness and a unified alarm-and-performance model. The same TAPI runs between a domain controller and a hierarchical orchestrator — the upper interface carries an abstracted map, the lower one carries device-level detail, but the model is identical. This is the interoperability that multi-vendor operations consolidation depends on, and it is how a controller drives a third-party open line system it did not build. The IETF runs a parallel framework, ACTN (Abstraction and Control of TE Networks), for the same multi-domain problem.
| Interface / protocol | Direction | What it carries |
|---|---|---|
| NETCONF (RFC 6241, SSH:830) | Southbound | Transactional device config with candidate/running datastores and confirmed-commit rollback |
| RESTCONF (RFC 8040) | South / North | HTTPS access (JSON or XML) to YANG-modeled data; the transport TAPI rides on |
| gNMI (gRPC) | Southbound | Streaming telemetry by subscription; high-resolution PM and state |
| OpenConfig (YANG) | Southbound model | Operator-driven, vendor-neutral device models |
| OpenROADM (YANG) | Southbound model | Disaggregated ROADM, transponder, and OLS models plus a device API |
| TAPI v2.4.0 (RESTCONF/YANG) | Northbound | Topology, connectivity, path-compute, notification, OAM across L0/L1/L2; impairment-aware |
| SNMP | Southbound | Legacy polling and traps into the NMS |
| TL1 | Southbound | Transaction language to legacy optical element managers |
Takeaway: Southbound is how the controller programs the glass — NETCONF and gNMI over OpenConfig and OpenROADM. Northbound is how intent and abstraction flow — TAPI, the same model whether the consumer is an orchestrator or an OSS. The NMS increasingly consumes the controller's northbound API rather than touching devices directly; that shift is the whole direction of travel.
6. Trust domains: the split as a security boundary
The controller can sit in a trust domain that is not the network owner's. A wholesale customer, or a higher-level orchestrator run by a different team, can hold a controller that requests services across an operator's optical layer — while that operator keeps the equipment's lifecycle, fault management, and audit firmly on its own side. The ONF architecture is explicit about why: the single strongest reason to keep a task out of the SDN control and data planes is that the controller may live in a customer trust domain, while business and security reasons demand that core management stay in the provider domain. The recommended default is to expose nothing rather than everything.
That default-deny posture is the mechanism. A network element exposes to its controller only the resources and the actions that policy permits; everything else is invisible, and an unexpected signal on an unconfigured port is treated as an exception to be raised, not traffic to be accepted. The management layer, sitting in the provider domain, instantiates that policy, installs the enforcement, and audits expected-versus-discovered state. This is the same boundary that makes multi-vendor optical line system integration tractable: the line system can be driven by a customer-chosen controller precisely because the provider's management retains the lock, the audit, and the system-of-record.
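The expose-nothing default is easy to state in code: a request is permitted only if policy names it explicitly, and every decision is logged for the provider-side audit. The Python sketch below is a toy under stated assumptions; `ExposurePolicy`, `handle_request`, and the resource names are hypothetical and do not correspond to any ONF-defined API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExposurePolicy:
    # Explicit (resource, action) grants. Empty by default: expose nothing.
    permitted: frozenset = frozenset()

    def allows(self, resource, action):
        return (resource, action) in self.permitted

    def visible(self, resources):
        """Resources the controller is allowed to see at all."""
        exposed = {resource for resource, _ in self.permitted}
        return [r for r in resources if r in exposed]


def handle_request(policy, resource, action, audit_log):
    """Permit or deny a controller request and record the decision."""
    ok = policy.allows(resource, action)
    audit_log.append((resource, action, "permit" if ok else "deny"))
    return ok


# Hypothetical tenant policy installed by the provider's management layer.
tenant = ExposurePolicy(frozenset({
    ("sip-MET07-deg1", "connect"),
    ("sip-MET08-deg2", "connect"),
}))
inventory = ["sip-MET07-deg1", "sip-MET08-deg2", "edfa-MET07-3", "wss-MET08-1"]

audit_log = []
print(tenant.visible(inventory))
print(handle_request(tenant, "sip-MET07-deg1", "connect", audit_log))
print(handle_request(tenant, "edfa-MET07-3", "read", audit_log))
```

The design choice worth noticing is that the policy object and the audit log live outside the controller: the party that requests services is not the party that decides what is visible or that keeps the record.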
Trust boundary: The default posture is expose-nothing. A network element shows the controller only what policy allows; the management layer holds the enforcement and the audit. Treat the split as a security control, not an org-chart artifact — collapsing the controller and the NMS into one trust domain is a decision, and on a wholesale network it is usually the wrong one.
A content provider runs its own transport controller and buys spectrum from a carrier's open line system. The provider's controller issues a TAPI connectivity request for a wavelength between two handoff points; the carrier's domain controller computes the path within its own line system and lights it. The provider sees an abstracted service — two endpoints and a capacity — and nothing of the carrier's internal ROADM topology, amplifier inventory, or other tenants. The carrier's NMS, in its own trust domain, holds the real inventory, correlates faults on the physical plant, and audits that the provider receives exactly the contracted spectrum and no more. Two controllers, two trust domains, one wavelength.
Takeaway: The control-management split is also a trust split. The controller can be delegated outward; core FCAPS, equipment lifecycle, and audit stay with the resource owner. Default-deny exposure is what makes that delegation safe, and it is why the NMS is not merely a viewer of the controller's world.
7. Optical-specific control: RWA, impairments, restoration
In a packet network the controller picks a path and the bits arrive or they don't. In an optical network the controller must also prove the light will arrive intact before it commits the circuit, and that is a problem no packet controller faces. Routing-and-wavelength-assignment is constrained: a wavelength must satisfy continuity (the same frequency end to end, unless a regenerator intervenes) and contiguity (enough adjacent spectrum for the channel's width). On top of that sits the physical-layer question — will the chosen path close? The controller computes an impairment-aware feasibility estimate, comparing the generalized signal-to-noise ratio the path will deliver against the minimum the transponder mode needs.
The feasibility check reduces to a margin, computed per candidate path before a wavelength is committed:

$$\mathrm{Margin}\ [\mathrm{dB}] = \mathrm{OSNR}_{rx} - \mathrm{OSNR}_{req}(M, R_s, \mathrm{FEC})$$

OSNRreq depends on the modulation format M, the symbol rate Rs, and the FEC scheme; OSNRrx is built from the accumulated ASE, nonlinear, and filtering penalties along the route. The received estimate is only as good as the per-span fiber and amplifier data behind it, and that data is maintained by the management layer. A positive margin means commit; a thin or negative margin means choose another path or insert a regenerator.
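A rough Python sketch of that margin check follows. It assumes each span contributes an independent noise term, so the inverse signal-to-noise ratios add in linear units, and it lumps filtering and other penalties into a single dB figure. The per-span values, the penalty, and the required OSNR are illustrative numbers, not vendor data.

```python
import math


def db_to_linear(value_db):
    return 10 ** (value_db / 10)


def linear_to_db(value):
    return 10 * math.log10(value)


def path_osnr_db(span_osnr_db):
    """Combine per-span OSNR figures into an end-to-end estimate.

    Noise powers add, so 1/OSNR adds in linear units.
    """
    inverse_total = sum(1 / db_to_linear(osnr) for osnr in span_osnr_db)
    return linear_to_db(1 / inverse_total)


def margin_db(span_osnr_db, required_osnr_db, penalties_db=0.0):
    """Received estimate minus lumped penalties minus the mode's requirement."""
    return path_osnr_db(span_osnr_db) - penalties_db - required_osnr_db


# Assumed inputs: four identical spans, 1 dB of lumped penalties,
# and a transponder mode that needs 20 dB.
spans = [30.0, 30.0, 30.0, 30.0]
received = path_osnr_db(spans)
margin = margin_db(spans, required_osnr_db=20.0, penalties_db=1.0)
print(f"received {received:.2f} dB, margin {margin:.2f} dB")
print("commit" if margin > 0 else "choose another path or regenerate")
```

Four equal spans cost 10·log10(4), roughly 6 dB, against a single span, which is why a longer restoration route can turn a comfortable margin into a thin one.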
Restoration is where this becomes real-time. A colorless-directionless-contentionless ROADM lets the controller reroute a wavelength onto a diverse path and retune the transponder to whatever frequency is free there, the same switching that underpins optical protection and restoration. Shared-mesh optical restoration completes in tens of seconds — slower than the sub-50 ms reroute available to IP-over-DWDM protection, because the controller is recomputing feasibility and retuning hardware, not flipping a pre-armed selector. The mechanism has a clear boundary: impairment-aware path computation is only as trustworthy as the controller's link model, and that model drifts. Splices age, EDFA gain tilts, connectors degrade — and the record of that drift lives in the NMS's performance history, not in the controller's live snapshot.
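Finding "whatever frequency is free there" is a spectrum-assignment search under the continuity and contiguity constraints. A minimal first-fit sketch in Python is shown below. First-fit is one common heuristic, not necessarily what any given controller uses, and the slot numbering, span names, and occupancy sets are made up for the example.

```python
def first_fit(path_spans, occupancy, n_slots, width):
    """Return the lowest starting slot that is free on every span, or None.

    occupancy maps span id -> set of occupied slot indices.
    Continuity: the same slots must be free on all spans of the path.
    Contiguity: the channel needs `width` adjacent slots.
    """
    # A slot is unusable if any span on the path already uses it.
    used = set().union(*(occupancy.get(span, set()) for span in path_spans))
    for start in range(n_slots - width + 1):
        if all(start + k not in used for k in range(width)):
            return start
    return None


# Hypothetical diverse path MET07 -> MET12 -> MET08 with 16 slots per span.
path = ["MET07-MET12", "MET12-MET08"]
occupancy = {
    "MET07-MET12": {0, 1, 2, 5},
    "MET12-MET08": {3, 8},
}

print(first_fit(path, occupancy, n_slots=16, width=4))  # -> 9
print(first_fit(path, occupancy, n_slots=16, width=2))  # -> 6
print(first_fit(path, occupancy, n_slots=16, width=8))  # -> None
```

Slots 4, 6, and 7 are free on both spans but never form four adjacent slots, so the 4-slot channel lands at slot 9. That is the fragmentation problem the controller's defragmentation work exists to relieve.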
A 400G channel at 193.95 THz rides a working path that fails. The controller's first candidate reroute has free spectrum, but only at a different frequency, and the longer path costs about 2 dB of OSNR. The controller checks the margin: the transponder's 16QAM mode needs more headroom than the new path offers, so it drops to a lower-order format (QPSK), which tolerates more noise and therefore has a lower required OSNR, retunes the transponder, and commits. The circuit comes back at reduced capacity but stays up. That entire decision (feasibility, mode selection, retune) is the controller's, executed in seconds. The NMS records that the service is running degraded and that a margin event occurred on that path.
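The mode fallback in this scenario amounts to walking a capacity-ordered list and taking the first mode that clears a minimum margin. The Python sketch below uses invented required-OSNR and capacity values and a hypothetical 1 dB minimum margin; real figures come from the transponder's data sheet and the operator's engineering rules.

```python
# Hypothetical mode table, highest capacity first:
# (name, capacity in Gb/s, required OSNR in dB). Values are illustrative only.
MODES = [
    ("16QAM", 400, 22.0),
    ("QPSK", 200, 15.0),
]


def pick_mode(path_osnr_db, min_margin_db=1.0, modes=MODES):
    """Return (name, capacity_gbps, margin_db) for the best feasible mode.

    Returns None when no mode closes, meaning: try another path or regenerate.
    """
    for name, capacity_gbps, required_db in modes:
        margin = path_osnr_db - required_db
        if margin >= min_margin_db:
            return name, capacity_gbps, margin
    return None


working = pick_mode(24.0)         # assumed OSNR on the working path
restored = pick_mode(24.0 - 2.0)  # reroute costs about 2 dB
print(working)
print(restored)
```

With these assumed numbers the working path closes at 16QAM, and the 2 dB penalty on the reroute pushes the same transponder down to QPSK, matching the "reduced capacity but stays up" outcome described above.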
Takeaway: Optical control is path selection plus a physics proof. The controller owns RWA, impairment feasibility, and hardware retune in real time. But the feasibility math is fed by fiber and amplifier ground truth that the management layer keeps current, which is the first place the clean split starts to leak.
8. The hierarchy in practice
In a real multi-vendor network the stack has four tiers, and the controller-versus-NMS question resolves differently at each. At the bottom are the physical network elements — ROADMs, transponders, amplifiers — grouped into domains, often one domain per vendor or per region. Above each domain sits an optical domain controller doing real-time control within its own scope: RWA, power balancing, and wavelength restoration for its slice of the network. Above the domain controllers sits a multi-domain orchestrator — a hierarchical controller that stitches services end to end across domains using the same TAPI, but at a higher abstraction; it sees a map of inter-domain links, not every device. And alongside the whole stack sits the OSS/NMS/BSS, holding the cross-domain system-of-record and running fulfillment and assurance.
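One way to picture what "a map of inter-domain links, not every device" means is to collapse a domain's internal graph into connectivity between its border nodes. The Python sketch below does that with a breadth-first search. It is a simplification made for illustration: the node names are invented, and a real domain controller would also advertise capacity and impairment attributes on each abstract link.

```python
from collections import deque


def reachable(adjacency, src, dst):
    """Breadth-first search: is dst reachable from src inside the domain?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def abstract_domain(adjacency, border_nodes):
    """Return abstract links between border nodes, hiding internal nodes."""
    borders = sorted(border_nodes)
    return [
        (a, b)
        for i, a in enumerate(borders)
        for b in borders[i + 1:]
        if reachable(adjacency, a, b)
    ]


# Hypothetical domain: B1 and B2 are joined through internal ROADMs R1 and R2;
# B3 is a border node with no internal connectivity yet.
domain = {
    "B1": ["R1"],
    "R1": ["B1", "R2"],
    "R2": ["R1", "B2"],
    "B2": ["R2"],
    "B3": [],
}

print(abstract_domain(domain, {"B1", "B2", "B3"}))  # -> [('B1', 'B2')]
```

The orchestrator sees one link from B1 to B2 and never learns that R1 and R2 exist; the detail stays with the domain controller that has to program them.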
This maps onto the IETF's ACTN naming directly: the domain controller is a provisioning network controller, the orchestrator a multi-domain service coordinator. The vendor reality matches the network-as-code picture — Cisco's Optical Network Controller exposes a TAPI northbound as an optical domain controller; Ribbon's Muse provides SDN domain orchestration above its Apollo line system, which itself speaks NETCONF/YANG with OpenROADM interworking; Ciena's MCP and Nokia's NSP play the domain-controller and management-and-control role for their platforms. The orchestrator that sits above them is frequently driven by intent-based networking at the top of the stack, turning a service request into multi-domain connectivity without an operator touching a device.
Takeaway: "Controller" is not one box — it is a domain controller doing real-time control and an orchestrator doing real-time stitching, both distinct from the NMS that holds the cross-domain record. The same TAPI runs between the control tiers; the NMS increasingly consumes that API rather than owning a parallel southbound path to the devices.
9. Convergence — and the failure mode at the seam
The line between controller and NMS has been migrating toward the controller for a decade, and the ONF architecture predicted it. The document is explicit that most functions historically performed by element and network management systems fall within the long-term scope of the SDN controller, and that today's managers may find themselves recast as application clients sitting north of the controller rather than as a separate stack beside it. The market has followed: vendor domain controllers now absorb fault and performance management, and platforms that began life as an NMS expose programmatic, model-driven provisioning. What survives the migration is the system-of-record and the audit: they shift north and become the thing the controller is measured against, rather than disappearing.
That migration creates a specific failure mode, and it lives exactly at the seam between the two layers. Real-time control needs a current, trustworthy model of the network. But the ground truth — what is physically installed, which alarms are real versus suppressed symptoms, what the fiber plant degraded to last quarter — is maintained by the management layer. When the two disagree, control acts on fiction. A controller restoring against stale inventory reroutes a wavelength onto a span the operator retired weeks ago. A controller computing feasibility against an outdated impairment model commits a circuit that will not close. The reconciliation between the controller's live state and the NMS's system-of-record is where most real integration pain concentrates, and it is the reason the NMS cannot be reduced to a read-only dashboard.
Reconciliation hazard: A controller that restores against stale inventory reroutes a wavelength onto capacity that is in the model but not in the ground. Audit expected-versus-discovered state continuously, and treat the NMS as the system-of-record the controller is reconciled against — not a passive viewer. The most dangerous moment is the one where the live model and the record have quietly diverged and nothing has yet forced them back into agreement.
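An expected-versus-discovered audit can be as simple as a three-way diff between the controller's live model and the NMS record, plus a filter that keeps the controller off anything the two views disagree about. The Python sketch below assumes both views are dictionaries from span ID to a state string; the state names and span IDs are hypothetical.

```python
def reconcile(live_model, record):
    """Diff the controller's live model against the system-of-record."""
    live, rec = set(live_model), set(record)
    return {
        "model_only": sorted(live - rec),    # in the model, not in the record
        "record_only": sorted(rec - live),   # in the record, not yet discovered
        "state_mismatch": sorted(
            span for span in live & rec if live_model[span] != record[span]
        ),
    }


def safe_spans(live_model, record):
    """Spans the controller may route over: in service in both views."""
    return sorted(
        span
        for span, state in live_model.items()
        if state == "in-service" and record.get(span) == "in-service"
    )


# Hypothetical divergence: the record says MET09-MET10 was retired,
# but the controller's model still shows it in service.
live_model = {
    "MET07-MET08": "in-service",
    "MET07-MET12": "in-service",
    "MET09-MET10": "in-service",
}
record = {
    "MET07-MET08": "in-service",
    "MET07-MET12": "in-service",
    "MET09-MET10": "retired",
    "MET12-MET08": "in-service",
}

print(reconcile(live_model, record))
print(safe_spans(live_model, record))
```

Run continuously, a diff like this turns a silent divergence into an explicit exception before restoration has a chance to pick the retired span.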
Takeaway: Convergence is real — the controller keeps absorbing the NMS's old jobs — but it does not erase the management layer. It relocates the system-of-record north and makes reconciliation the new design problem. Build the seam deliberately, or the controller will eventually act on a network that no longer exists.
10. Conclusion
The controller-versus-NMS question is not a product comparison; it is two clocks and two trust postures over the same glass. The controller decides and acts in real time, holds the live topology and spectrum, and is measured on how fast it provisions and restores. The NMS remembers, oversees, holds the audit, and is measured on whether its record matches the ground. Treat them as competing tools and you will either starve the control plane of automation or strip the management plane of its system-of-record — both are expensive mistakes.
As TAPI and model-driven telemetry mature, the controller will keep absorbing functions the NMS used to own, and the management layer will keep moving north toward orchestration and assurance. The boundary that endures is the one that matters most under failure: someone has to know what is actually installed and what has actually drifted, and the controller has to be reconciled against that truth before it acts. Design that seam first, and the rest of the architecture — the interfaces, the hierarchy, the trust domains — tends to fall into place around it.
References
- Open Networking Foundation, SDN Architecture (TR-521), Open Networking Foundation.
- Open Networking Foundation, Transport API (TAPI) Functional Requirements (TR-547), Open Networking Foundation.
- ITU-T M.3400 — TMN Management Functions, International Telecommunication Union.
- IETF, RFC 6241 — Network Configuration Protocol (NETCONF), Internet Engineering Task Force.
- IETF, RFC 8453 — Framework for Abstraction and Control of TE Networks (ACTN), Internet Engineering Task Force.
Developed by MapYourTech Team
For educational purposes in Optical Networking Communications Technologies
Note: This guide is based on industry standards, best practices, and real-world implementation experiences. Specific implementations may vary based on equipment vendors, network topology, and regulatory requirements. Always consult with qualified network engineers and follow vendor documentation for actual deployments.
Optical Communications & Network Automation Expert | Author of 3 Books for Optical Engineers | Founder, MapYourTech
Optical networking engineer with nearly two decades of experience across DWDM, OTN, coherent optics, submarine systems, and cloud infrastructure. Founder of MapYourTech.