Network Outages are going to happen. The marketing department talks about Zero Outages — and that’s a great goal to have. But as the pragmatic engineering and operations team, you can prepare for outages to prevent them and to remediate instantly.
Skip the Pretend Redundancy
Detect and Triage Faults
PCAP: Packet capture
Ready for Remote Testing
Logging Enabled & Synchronized
Prepare the humans
Skip the Pretend Redundancy
Everybody knows you need hot spares. Pretend-Redundancy is what you get when you spend twice as much on equipment and software but never test it. Salesmen love it because they sell everything you need to be reliable.
Many companies buy spare equipment and even plug it in, but actually implementing working redundancy is far more challenging. Proper physical design can be elusive, and testing time-consuming. But it’s worth it to have a robust network. Outage-Prepared Networks not only build redundancy, they test their redundancy.
Redundancy in networks means that the system:
Has adequate replicas (e.g., spare servers, spare routers)
Continuously synchronizes all necessary information between replicas
Automatically detects faults
Automatically takes the faulted component offline (so that traffic doesn’t keep going to the faulted component)
Automatically selects a replacement component replica
Causes the replacement component to become active or start to take the additional workload
Detect & Triage Faults
To prepare for outages, you’ve got to be equipped to know when the faults are occurring.
Get the basics: Get the status and health information your devices have to offer using SNMP. Is the Interface up? Get alerts when components go offline.You can this fault detection and reporting with a variety of systems, including OpenNMS, SolarWinds, and others.
Measure the normals: Beyond the basic good/bad readings, you can define Key Performance Indicators (KPIs) like number of active users, concurrent calls, CPU and memory usage, and set thresholds for what is normal or abnormal. CPU of 50% may be normal, but 90% will be normal for some workloads and abnormal for others.
Get the alerts to a human. Generating alerts isn’t enough: some responsible human has to get the alert and do the right thing! The human analysts also need an escalation path to alert the problems-solvers.
Analyze. When you get an alert, you need to troubleshoot it to determine its severity. Fault-detections are imperfect, so you need to determine how true the alert is, and triage the situation for seriousness. Few automated alerts are reliable enough to be trusted.
In the June 2018 Visa Europe outage, the problem was bad hardware. The Network Operations team needed to contact people who could travel to the equipment site. The network monitoring analysts need a reliable communication path to other teams.
PCAP: Packet Capture
Outage-Ready Networks have packet capture capabilities online throughout the network — at least at the low-speed disaggregated points. Capturing 100 Gbps Ethernet may be impractical today. But capturing the traffic going in and out of each 1 Gbps load balancer should be achievable.
Capturing traffic should not require a dispatch to a physical site: the system should be built to allow troubleshooters to activate packet capture, collect and analyze data within minutes.
Ready for Remote Testing
The Operations staff need the ability to test remotely to replicate problems remotely.
Voice (Phone) services — It’s easy to REGISTER a SIP / IMS phone over the Internet, or to build VPNs, to replicate the experience of users attaching at different points in the network
Web services — It’s easy to use VPN or DNS to route traffic to a particular entry point in the network
Local ISPs — If you have a global service, using advanced BGP or DNS based routing, you can setup service with local ISPs in your markets and provide test units there to let you test the experience of local users in those markets.
After the outage has begun, it’s usually too late to build Remote Testing capability. Outage-Prepared organizations build these capabilities in advance.
Logging Enabled & Synchronized
All of our systems have logging. In telecom, though, many of those logs are disabled automatically. To prepare for outages, enable the logging you can, and be sure the logs are useful.
Enable the logging. Turn on the logging you can afford to enable without crippling the device. Debug logging is often too much.
Synchronize the clocks. To find out what truly happened in a problem, you often need to know the sequence of events that led up to it. But so often the clocks in systems are not synchronized, so that uncovering what occurred fast enough to remediate an outage is too difficult.
Centralize the logging. Whenever possible, aggregate all your logs to a central location bearing in mind you’ll lose the logging locations as well. Centralized logging with an analyzer tool like Splunk can radically reduce remediation time, but if your centralized-logging sites are down or unreachable, you still need to troubleshoot the problems. I prefer systems that store some logs locally, and send a copy to the centralized log storage.
Prepare the Humans
To prepare for an outage, you need all of your staff ready to help. Too many organizations depend on their senior most engineers for outage troubleshooting, but this is a big mistake. You need your full staff prepared to triage and analyze hard problems.
Training should focus on how to determine whether each component in your network is working. Components can be servers (like www2, or the DNS server 220.127.116.11) or services (like the SIP SBC at 18.104.22.168 or the REST API at https://foo.com/api/v2/).
The 24×7 troubleshooting staff should have:
List of supported product. (Supported means that if it breaks, we have to fix it.)
Diagrams of how the product works, for purposes of understanding dependencies.
List of components involved in each product
Method for testing each component. E.g.,
REGISTER with SIP to sip:firstname.lastname@example.org and make a call to sip:+email@example.com;user=phone
tcpdump is more efficient than tshark at raw writing to disk; e.g.,
tcpdump -s 1514 -i eth2 -w file.pcap
will tend to capture more than a similar tshark command.
(b) A busy Linux box, or high packet rate, will lose some data because tcpdump or tshark are not running all the time. You can run tcpdump at a higher priority with the “nice” command and a negative nice level:
Sometimes the disk system just cannot keep up with the rate of traffic, and the disk buffers aren’t large enough. Without tuning kernel disk buffers, you can make a ramdisk. This example checks to see there’s about 1660 MB of RAM doing nobody any good, and it makes a 1000 MB ramdisk using the “tmpfs” filesystem feature, and writes a big capture to it.
[root@sniffer /]# free -m
total used free shared buffers cached
Mem: 2010 1748 262 0 159 1239
-/+ buffers/cache: 349 1660
Swap: 3999 0 3999
# mkdir /tmp/ramdisk
# mount -t tmpfs -o size=1000m tmpfs /tmp/ramdisk/
# nice --adjustment=-10 tcpdump -s 1514 -i eth2 -w /tmp/ramdisk/ecg_sniffer_eth2_20180321.pcap
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 1514 bytes
1199062 packets captured
1199075 packets received by filter
0 packets dropped by kernel
ECG would be glad to help with your Voice and Video Network Engineering, and 24×7 customer support. Ping us to learn more.
“DNS is one of those things that gets overlooked… You make your voice servers super-redundant, but you take it for granted that DNS will always work.” — Fonality
The Domain Name System (DNS) conventionally helps devices convert domain names, like VoIPCarrier.com, to IP addresses, like 22.214.171.124. But in Voice and Unified Communication (UC) services, it serves an additional key role in fault tolerance, and enabling encryption. For large networks, it can also assist in balancing load across many data centers.
Background: In typical Voice Network implementations, the UC client or SIP phone is configured to make requests to its local “recursive caching” DNS server. (So called “recursive”, because it makes multiple steps to arrive at the answer, and “caching” because it stores the resulting answers for some time.) This server consults with the Root and Top-Level-Domain (TLD) servers that handle domains like .com. Those TLD servers direct the DNS client’s recursive caching DNS to the authoritative DNS server for the Voice/UC provider, and that authoritative server provides the IP addresses, failover addresses, load balancing settings, and encryption settings for the UC or Voice Service.
There are some common variations on this theme:
1. Voice & UC Operators that don’t use the Internet. For Cable operators, “Metropolitan Service Operators” (MSOs) and Fiber to the Premise (FTTx) Providers, DNS can be used, but it doesn’t use the Root Servers or TLDs. These operators cannot provide their service across the Internet.
2. Operators that don’t use DNS at all. Some networks are built with the IP addresses configured in the devices, so no DNS is ever used. As 2600hz argues, this strategy can mean that inevitable IP address renumbering becomes extremely expensive and can create outages for customers. “This can be dangerous if you have customers who provision their own phones and, later on, you move data centers or change IP blocks for some reason and have no way to change the IPs of those phones.”
Nine Common Vulnerabilities Most Voice & UC Providers Are Exposed To
I. SIP Phones & Clients depend on their local DNS servers.
The most common mistake is that SIP Phones and Desktop UC Clients are relying on the local DNS servers in their data networks. For some users, this is the outdated Windows server in the closet, but it’s often the underpowered DNS software provided in the local router by the Internet Service Provider (ISP).
The risk is that when this local, underpowered, or under-managed DNS server malfunctions, the Voice/UC service fails. The Voice/UC provider is responsible for important features, like the ability to make emergency phone calls, or conduct business (video calling, desktop sharing, messaging, etc.), but all of those services fail where the local DNS server fails.
Key Remedy Options:
Option A: Configure the Voice/UC Clients to use DNS servers provided by the Voice/UC Provider itself.
Option B: Configure the Voice/UC Clients to use reliable DNS servers. For example, you may configure the clients to use Google (126.96.36.199, for example) and Level(3) (188.8.131.52, for example).
II. Depending on a single organization to run the DNS server for authoritative service.
Any one DNS server on the Internet can be attacked and overloaded using botnet attacks. Even most organizations that offer DNS services can be overwhelmed. History shows that attacks tend to be launched against individual DNS operators. (Note that the Global Root DNS servers are under continuous attack, and have defenses that are effective so far.)
Option A: Buy DNS hosting from multiple (Authoritative) hosting companies, confirming that they’re not using the same underlying infrastructure.
Option B: Run your own DNS servers, but also use another company’s DNS Authoritative server’s too. To be effective, the systems should not be synchronized automatically. Some human review should be required to make the change in each place.
III. Failing to use locally cached answers in case all DNS fails.
Some SIP Phones can store a cache of DNS records locally, provided to the device by the configuration file. For example, the Polycom UC SIP devices support these configurations, and only use them if no DNS replies are returned. The Polycom implementation supports basic A records, as well as the advanced records, SRV and NAPTR, used for load balancing and encryption management.
Yealink SIP Phones has a Static DNS Cache supporting A, SRV, and NAPTR that can be configured either as the preferred source of DNS settings, or can be used as as backup.
Software vendors who provide desktop and Mobile software should build in this capability, so that the configuration file delivered to the software has a DNS cache backup in it.
IV. Too short of Time-To-Live Cache Lifetimes.
Every DNS records has a “Time To Live” (TTL) setting, which specifies how long caching servers should store the record.
Large Values, such as four hours (which would be configured as 14400 seconds), have the advantage of being stable. If there’s an interruption in access in the DNS system, these ensure that the old records will stay in the cache for some time. But the downside is that when changes are needed, the operator may have to wait four hours for the change to go into effect.
Small Values, such as 60 seconds, have the advantage of allowing relatively quick changes. For example, if a provider learns that their primary Session Border Controller (SBC) site is having problems, they can make a change, and expect all DNS requests within 60 seconds to get the new value and route requests to the new location. But small values have a downside: if there’s a short interruption in the client’s access to the DNS servers, then the clients will lose all old records.
Large TTLs don’t provide complete protection against DNS failures. Note that even with a large TTL, you should expect that records will begin expiring immediately upon failure of your authoritative caching service. For example, if you have a TTL of four hours, and an outage occurs at 12:00pm, then by 1:00pm approximately 25% of your users’ caches will have expired and by 2:00pm, half of your users are offline.
Key Remedy Options:
Option A: Use a relatively large DNS TTL value, such as four hours, for stability.
Option B: Use a large DNS TTL, but change it to a smaller number in the hours before a scheduled change
V. All DNS Servers in one BGP prefix or BGP Autonomous System
BGP is the system used by large organizations to configure the routes for IP traffic. An IP Prefix, such as 184.108.40.206/21, specifies the routing path for 2,048 IP addresses.
But because BGP Prefixes are subject to attack and downtime too, you don’t want to have all of your Recursive Caching, or your DNS servers, in the same BGP prefix advertisement.
(For the same reasons, you also don’t want to have all of your Session Border Controllers (SBCs), or all of your Load Balancers, or all of your Production Firewalls in the same BGP prefix if you can avoid it.)
Hijacking one BGP prefix is generally easier than hijacking more than one, so by having a diverse set of BGP prefixes, you make this attack more difficult and improve the robustness of your network.
A rarer attack that is still possible is hijacking of a BGP Autonomous System Number (ASN). The ASN is used in Internet Routing to identify a large company or Internet Service Provider. An ASN attack would allow an attacker to interrupt BGP traffic for an entire ISP or large enterprise.
Key Remedy Options:
Option A: Ensure that your DNS servers (both authoritative and the recursive caching servers used by client devices) are from at least two different BGP prefix advertisements. Typically this is easy to check because the IP addresses begin with different digits.
Option B: Ensure that DNS servers are advertised via two different Autonomous System Numbers (ASN) in BGP. This can often be accomplished by buying authoritative and recursive-caching service from an additional provider.
VI. Depending on a single link to the Internet for your DNS authoritative service.
Any one Internet link is subject to attack and overload. One of the most common forms of Internet Attack is a Distribute Denial of Service (DDoS) attack, where millions of vulnerable devices are used to blast traffic at a single destination. DDoS mitigation is an active industry, with a number of firms providing tools and services to fight DDoS attacks.
Arbor Networks is one of the leading firms providing equipment built to fight DDoS attacks. Akamai and AT&T offer services to clients to help fight DDoS attacks.
However good all these services are, while the DDoS attack is in early stages, access to the DNS servers, Session Border Controllers, and Load Balancers will be interrupted for some time.
Many operators have redundant Internet links to provide fault tolerance. In the normal implementation, all Internet links are setup to be capable of handling all of the traffic. However, this means that every DDoS attack will hit every link simultaneously.
Key Remedy Options:
Option A: Separate traffic for each site, so that you have distinct IP blocks at each site. For example, your “Eastern Site” may advertise 220.127.116.11/21 and your “Western Site” may advertise 18.104.22.168/21.
Option B: For the classic form of IP redundancy you may want the same IPs through both site. Use Option A, but also advertise a common block through both providers.
VII. Depending on a single DNS registrar.
The DNS registrar is responsible for loading domain name information into the root and Top-Level Domain (TLD) servers. For example, if you own VoIPCarrierX.com, you’ll pay a registrar, like Register.com, GoDaddy, or EasyDNS a small annual fee to keep your data loaded in the root name servers.
But every registrar is susceptible to attack or malfunction. For example, in 2016, the DNS Registrar for New York Times was attacked, leading to an outage for the news site.
Use Multiple Registrars; this will necessitate multiple domain names.
VIII. Depending on a single Top Level Domain (.com, or .net)
As discussed above, if a DNS registrar is attacked, then the attackers can control the access to the domain name. But most of the Top Level Domains (TLDs) are managed separately from one another, and it is conceivable that an entire Top Level Domain could be attacked.
For example, the servers that handle .com are different from the servers that handle .net. Each country’s top level domain, like .us (for the United States) and .br (for Brazil) are also distinct.
By using different Top Level Domains, you can protect your services against excessive dependence on any one TLD server. For US-based companies, .com and .net are distinct domains, but there may be value in finding more diversity outside a single nation’s DNS management.
Voice/UC services have a distinct advantage here: users are not normally aware of the domain name in use. Google cannot easily use multiple domain names: everyone expects “gmail.com” to stay put at that name. But SIP phones and UC clients can be configured to use multiple domain names.
This is an important capability at the heart of many of the options here. To use multiple domain names, the client has to be capable of failing over.
Use Multiple Domain Names from distinct Top-Level Domains.
IX. Voice & UC Servers Not Verifying your servers with TLS
TLS (formerly known as SSL) has an important capability of validating that the server used is the legitimate and intended server. For example, if your SIP phone is configured to connect to VoIPCarrierX.net, then the SIP phone can validate that, once it is finally connected to some IP address (say, 22.214.171.124), the server provides it a certificate that is cryptographically signed by a key that the phone already trusts. The list of trusted parties is stored in the “Root Chain” on the TLS client; if the SIP phone has been configured to trust, say, Verisign, then the certificate for could be signed by Verisign identifying the server as legitimate for VoIPCarrierX.net. TLS also has a mechanism so that the client confirms mathematically that the server actually has some internal secret information — a long password key — proving it is truly the owner of the signed certificate. StackExchange has a nice introduction to this technology.
This means that after the entire DNS process has played out, using all the techniques above, the SIP phone or UC client can be built to verify that the server it finally reached was legitimate. This is a key requirement so that, if the DNS protections put in place fail, the phone will refuse to talk to an illegitimate hijacker.
Key Remedy Option:
Use SIP over TLS, and verify the Server Certificate.
DNS can be an effective and powerful way to manage Unified Communications services, but it does expose the operator to a number of attack methods. Smart Service Providers are taking the challenge seriously, implementing a variety of techniques to introduce diversity and genuine redundancy, including BGP, TLS, Top Level Domains.
Careful testing is crucial for stability before launching new SIP Access Device models
The Product Definition must be baked into the Test Plan used to approve the device, & new software for it
Customer Technical Support teams need extensive early access to devices before they are deployed
Space Probes are designed to be sent far away, beyond our reach. They run software, and send back data to Home Base. If the Space Probe, dies, it may never call home, and there’s not much we can do to fix them from here.
For Voice Operators, the SIP Phones, IADs, User Agents, and Soft-phones clients are a bit like Space Probes: Once you deploy them, you can’t go touch them. If they never “phone home,” you may be out of luck. So you have to plan well to make SIP clients work properly, before they’re launched.
Planning the Launch
The Challenge. There you are, in
charge of making the amazing new Hologram SIP phone work on your platform. The Product and Marketing folks LOVE this new phone, especially the way it integrates with your clients’ PCs and does video calls via Google Hangouts and Apple Facetime. And it’s your job to integrate the device into your BroadSoft BroadWorks network.
Avoiding Device Management Disasters
You’ve known about the SIP device disasters:
…when your Support Staff got calls from customers using the Yealink T-58V phone before they had ever heard of it
…and when the Engineering team had a week to roll out video calling on all the Polycom VVX phones, because customers were buying cameras
…or when the CEO of your company got the latest Mitel 6869 phone, and read that the phone could show him his company directory in Microsoft Exchange, but then it didn’t work.
…and who could forget when you upgraded 1000 phones to discover the new firmware required a different file format?
BroadWorks for Device Management?
BroadWorks provides some mature tools for managing new device types, including some important key features:
The BroadWorks system owner operator can add a new device type at any time, without BroadSoft’s assistance.
BroadWorks records many key SIP parameters of the device type.
BroadWorks stores the files like software and images, for delivery to the device.
BroadWorks automatically generates configuration files as the device user’s changes are made
BroadWorks system owners can define custom tags to handle new features and situations. Example: Most phones don’t support Apple Facetime, but if you have one phone with a FaceTime gateway, you can define “%FACETIME_GATEWAY%” as a custom tag in BroadWorks. Then BroadWorks can replace that tag macro with the appropriate value for each customer.
BroadWorks can deliver the files to the device with HTTP or HTTPS, confirming username/password or SSL client certificate.
BroadWorks can signal the device to download the new configuration file.
Key Milestones for Productizing a SIP Access Device
Decide: Which of the Features will you support?
Prototyping & Testing: Do the Features Work?
Management: Can you Control It?
Support: Can you help users use it?
Features: You Can’t Support Them All
Every SIP phone comes with a bazillion features, but no one network can support them all. You have to decide which features you’re going to try to support. Some of the top features are:
Ordinary calling with Caller ID
“HD Voice,” usually using G.722
Power over Ethernet
LLDP-MED for automatic Ethernet VLAN selection
Busy Lamp Field (BLF) / Line State Monitoring
Shared Call Appearance / Shared Lines
Programmable “Soft Keys”
Customizable Dial Plans (so the phone knows when you’re done dialing)
Bluetooth headset support
USB headset support
Multi-party Video Calling
Multicast Paging (where the speakerphone)
3-Way and N-Way Conferencing (where a large number of people can be conferenced together using the conference button)
Customizable Contact List
LDAP or Microsoft Active Directory or Exchange directory access
Customizable background logos
…But many will try
But there are exotic features on every phone. Are you going to help your customers if they contact your help center asking about these?
Practically, new phones are far too complex to support every feature. Choose the features you actually intend to fully, thoroughly support.
Prototyping and Testing: Prove Those Features Work
Once you’ve chosen your features, you can develop prototype templates so that BroadWorks can generate configuration files for that phone. In most cases, in the BroadSoft community, responsible vendors will post prototype templates with example configuration files to BroadSoft’s site, Xchange. These configuration file samples can be in one of the categories, “Interoperability – CPE Kit” or “Interoperability – Configuration Guide”.
Essential Parts of an Identity/Device Profile Type
An Identity/Device Profile Type (IDPT) is a specific model of device, such as a SIP phone.
The key parts are:
Profile: Identity/Device Profile Type (IDPT) Profile
File Templates: Identity/Device Profile Type (IDPT) Files & Authentication
In the IDPT Profile, you can set key settings that apply to every device of that type, such as:
What should the SIP Request URI look like?
Does the device supports RTP early media?
Can the device register?
Does it support HTTP authentication?
How does it handle Privacy headers?
Ideally, your vendor will specify these settings exactly.
For each IDPT, BroadWorks allows you to upload any number of templates and static files. Templates can be translated by an elaborate process of search-and-replace, and are usually used for configuration files. Static files are typically firmware/software and graphic images.
An excerpt from the template BWDEVICE_%BWMACADDRESS%.cfg is shown below, for the device type “Polycom VVX 500”. When a “Polycom VVX 500” phone is defined on the system, and assigned the MAC address 004fb2012345, then the file BWDEVICE_0004fb2012345.cfg would be created.
As shown in this example, the BroadWorks-standard tags begin with “%BW”, including
%BWLINEPORT-1%, the SIP Address of Record (AoR), user portion
%BWHOST-1%, the SIP domain for the AoR
%BWEXTENSION-1%, the extension that can be dialed to reach the user within the user’s organization
But you can define any number of other tags, which BroadWorks can track and manage. In the example above, you see %CONFERENCE%, which BroadWorks can replace with a value you specify.
A Tag Set can be created for each IDPT, with special settings appropriate for that device type. These can be set at the system level, and then overridden at the enterprise, service provider, group, or device levels of the hierarchy.
When prototyping your new device, you adjust the configuration files to implement the features in your particular network. You do this by revising the templates, and adjusting the tag values to work in your environment.
The Test Plan is Product Definition
The test plan you define is the definition of the product. If it’s an important part of the product, then you must put it in the test plan.
This test plan will be the Regression Test Plan later. It should be written so that all of the tests pass at the point the product is approved.
In a Test Plan, you have a “Device Under Test” (DUT) and tell the tester precisely what steps to take, and what to expect. You have to be specific: define the specific features and settings in your BroadWorks lab.
Record the actual results so that you can review and analyze them later. The person who’s doing testing, the Test Engineer, should record subtleties of behavior.
Answer the question What Did We Learn, because in each test, you’re discovering something about the system.
Example Test Environment
Device Under Test, DUT: The Apple Blackhole Phone Model 1
User built in BroadWorks as 229-316-1002
BLF monitoring on position 1 for 229-316-1001
Lab Phone 1: Polycom VVX 500 at 229-316-1000
Lab Phone 2: Yealink at 229-316-1001
Cell Phone 1: iPhone X at 919-559-6000
Example Test Plan
Test Case ID: TC01
Feature: Call with Caller ID
Detailed Procedure: On DUT, while the handset is on hook, press the Speakerphone button, and dial 919-559-6000. Do not press “Send”.
Expected Results: On the 919-559-6000 cell phone, it should ring, and you should see the caller ID 229-316-1000. Answer on the cell phone, and confirm that you get two-way audio. Hangup on the cell phone, and confirm that the call is disconnected on the DUT.
Actual Results: _______________________________________
What Did We Learn? _______________________________________
Test Pass? TRUE / FALSE
Test Case ID: TC02
Feature: Busy Lamp Field – monitoring
Detailed Procedure: On DUT, watch Busy Lamp Field Position 1. Place a call on test phone 229-316-1001 to 919-559-6000. Check Expected results, then disconnect by hanging up on 229-316-1001.
Expected Results: When you can hear ringback on the 229-316-1001 device, you should see the BLF indicator in position 1 on DUT. When the call is answered on cell phone 919-559-6000, the BLF indicator should change to the red bar on position 1 within 1 second. When you hang up on 229-316-1001, the BLF indicator on DUT should return to black within 1 second.
Actual Results: _______________________________________
What Did We Learn? _______________________________________
Test Pass? TRUE / FALSE
Management: Can You Control It?
Once you’ve proven that the core features work, you need to prove you can control it.
Prove you can replace the software. The simplest, naive way is to replace the firmware file, and trigger phones to reboot. This can cause an unlimited phones to retrieve the new software each time.
Instead, most operator wish to selectively upgrade a few phones at a time. You can do this with a custom tag, such as %FIRMWARE%, that specifies the filename, such as “sip-1.56.2.bin”. Then you can change the tag on specific phones that should have the new version of software.
ECG Alpaca, can be used to selectively update tags on devices within a BroadWorks platform. This is routinely used to upgrade specific devices, e.g., 1000 devices per maintenance window.
Alpaca can also selectively send the command to the SIP phones to reset them, and then monitor to confirm that all phones reboot properly.
All software generates logs. A manageable SIP access device puts the logs somewhere useable. With BroadWorks, you can have the device upload its logs, and BroadWorks can retain those logs for viewing on the Profile Server.
Alternately, some operators configure devices to send logs to a central log collector via syslog.
New Device Procedures.
When onboarding a new device, be sure to specify the requirements for new devices. What should happen to a new device from the manufacturer?
But this means that a new-in-box phone cannot be sent directly to the end user, without the user’s intervention in the process.
Some vendors, including Polycom and Yealink, offer a redirection service. To use these, new, or factory defaulted, phone with Internet service can reach out to the phone manufacturer to get a URL to the voice operator. For example, the phone 0004b2012345 could connect to Polycom to be told to go to https://xsp.voipcarrier.co/dms/vvx600.This works only because the voice operator has registered that MAC address with Polycom, and registered the URL.
Support: Enabling yourself to do a great job
The final step before rollout of a new device type is ensuring your Customer Support department have adequate access and experience.
Provide the Customer Support with the device to use. It’s remarkably common for operators to neglect to give their support folks experience on the devices that customers have. This leads to stress and poor customer service.
Have Customer Support run through the Test Plan to be sure they understand all of the supported features and settings. Since the Test Plan encapsulates the entire product definition, this means that Customer Support will know how to do it well.
The consulting firm, ECG, offers Voice Carrier technical consulting, including the proper rollout of new devices types, and support of Voice access devices in BroadWorks and other multi-vendor environments. www.ecg.co, firstname.lastname@example.org
IP Fragmentation of SIP Messages is an enduring source of trouble.
Fragmentation of SIP traffic is a problem on the rise. It appears when everything has been working fine, and seemingly without cause, some SIP messages are lost in the network. The result is a frustrating scenario where some SIP messages are delivered fine, but others are not.
To explain SIP fragmentation, let’s start at the beginning: Layer 2. Every link on an internet has a Maximum Transfer Unit (MTU) size which determines the maximum size of a packet that can traverse the link, in bytes. For Ethernet, this is often 1500 bytes. This means that no one Ethernet frame — and therefore one packet of data — can be transmitted across a standard Ethernet network that is larger than 1500 bytes. The duty of the Ethernet interface is to transmit only frames that meet this standard.
However, many applications need to send more data than this in a message. So the Operating System must accommodate both the application’s need for large messages, and the network’s requirement to send packets of a limited size. How is this done?
Fragmentation is fairly cheap for the fragmenter, but reassembling the fragments when they arrive is a fairly expensive operation. Simon Dredge (Metaswitch) has discussed the computational costs of re-assembling UDP fragments, arguing that the re-assembly should be done in a specialized kernel module of a Session Border Controller, rather than where the user applications run:
[T]he receiver has the tricky job of taking these seemingly miscellaneous packet fragments, deciphering them from other packets or packet fragments arriving simultaneously and piecing them back together – somewhat akin to a jigsaw puzzle – but without the aid of the picture on the lid of the box. Naturally, this process takes memory resources to store packet fragments, while waiting for their counterparts to arrive, then processing cycles to compile them . . . If fragmented packets are not successfully reassembled in a timely manner, then a retransmission will be requested or initiated, thereby further compounding the reassembly issue.
When a large message is fragmented, the separate fragments travel as separate IP datagram packets through the network. It’s possible for any one of those to be lost, but if one fragment is lost, IP has no mechanism to detect that and recover. The Internet Protocol software merely discards all the other fragments at the receiver. It depends on something else to retransmit the entire message again, on the hope that all fragments will be delivered.
Consider this analogy: This is like posting five boxes for shipment with no insurance: if four of them arrive, the mailman doesn’t track down the fifth box. The recipient must determine that a box is missing and ask for its contents to be replaced.
Worse than losing a packet, the loss of a single fragment of message D wastes all the network bandwidth (capacity) that was used sending the remaining fragments. They’re useless.
TCP, which is built on top of IP, typically does not use IP fragmentation. Instead, TCP has segmentation. TCP segmentation is optimized for the case of lost segments; when one is lost, TCP slows down the transmission, and retransmits the missing segment.
Fragmentation and SIP
SIP is usually used over UDP. When the SIP messages are small, this is no problem. In fact, for normal phone calls (e.g., SIP on PSTN gateways and SIP trunks), individual SIP messages almost always fit in a single UDP message, well under 1500 bytes, and therefore no fragmentation occurs at all (See Figure 3).
But as services become more sophisticated, the size of SIP messages grows. In particular, “Busy Lamp Field”, also known as “Line State Monitoring”, will often be responsible for sending large data sets via SIP messages. To receive the status information on the 11 people monitored on my Polycom VVX 600, my phone receives around 11,000 bytes of data in a single SIP NOTIFY message. If I were using SIP over UDP, that would take 8 IP fragments.
Even basic fragmentation has been a problem for some SIP systems. Back in 2009, Eric Hernaez (SkySwitch) reported a “major vendor’s” switch crashed by SIP fragments. But today, most SIP/UDP platforms support fragmentation with some reliability.
There are some approaches for reducing SIP message size in SIP. Alex Balashov (Evariste Systems) describes the challenges of reducing SIP header size in a pure-proxy system, but suggests some SIP headers you can remove in many cases. In 2009, Thomas Gelf suggested dropping unnecessary codecs, and using SIP compressed headers, like “m” instead of “Contact:”
SIP Fragmentation problems spiked again in 2016, when we saw numerous incidents where SIP fragmentation contributed substantially to network problems. Networks often run fine until the SIP messages grow just enough to cross the MTU boundary.
The buffer limitations in these devices can be a real problem. The 200-fragment capacity of a Cisco ASA5500 can easily be overwhelmed by normal traffic of thousands of SIP phones. (Add this as a reason data firewalls create trouble with large-scale VoIP deployments.)
Session Border Controllers and Fragmentation
Some versions of the market leading Session Border Controller, the Oracle Acme Packet SBC, has limitations on handling fragments. The “Traffic Manager” from the Network Processors to the CPU controls the rate that IP fragments are delivered from the network to the CPU in the popular 4250, 3800 and 4500 platforms. Terry Kim has a great depiction of the traffic manager for the Oracle Acme Packet SBC. He highlights that the Oracle Acme Packet SBC handles fragmented SIP as “untrusted” traffic — so if trusted endpoint devices, like customers, are sending fragmented SIP regularly, then the SBC doesn’t provide them the same capacity that they would have if they were sending non-fragment capacity.
The Oracle Acme Packet SBC was designed in an era that believed SIP over TCP would become more dominant, so that fragmented UDP should be rare. But the single rate limit for SIP fragmented traffic can be a real performance limitation when fragmented SIP becomes very common.
The Metaswitch Perimeta was designed much later, after the industry discovered that most SIP is still operating over UDP. The Perimeta was designed for fragmented SIP to be very common.
TCP: Segments over Fragments
The SIP standard, RFC 3261 mandates that TCP should be used to prevent fragmented SIP. Indeed, SIP over TCP does solve many of the problems by replacing IP fragmentation with TCP segmentation.
TCP segments slice up the stream of SIP messages into neat segments that fit within MTUs. Critically, TCP provides a fast and efficient mechanism for filling in gaps in the stream. Contrast this to IP fragmentation, where the entire SIP message must be re-sent any time one segment is lost.
However, a commonly-encountered problem using SIP/TCP is in limitations of Highly-Available Session Border Controllers, where the TCP synchronization and retransmission mechanism means that the state of any SIP/TCP connection may not be efficiently replicated to the standby SBC. For example, if SBC-A is active, then reboots, the SIP Phone using TCP typically has to reconnect to SBC-B. The SIP phone has to detect that the link to SBC-A has been lost. Both the Oracle Acme Packet SBC and the Metaswitch Perimeta SBC transmit a TCP RST (Reset) message when possible to notify the SIP phones that they need to re-register.
When using SIP/UDP, the failover to SBC-A to SBC-B can be a non-event: the IP address in use simply moves from one SBC instance to another because both SBC units in the pair know all of the state of the SIP registrations, subscriptions, and phone calls to all endpoints. But with SIP/TCP, the re-registration storm can be substantial, and disruptive. Imaging a network of 50,000 endpoints, all forced to re-register each time you reboot a single SBC instance.
The Fragments Are Coming
If you operate basic SIP trunking, or even some forms of today, you might not experience a lot of problems with fragmentation. But you should expect messages are growing larger, especially with integration of Fixed and Mobile networks, and the large SIP messages of IMS networks. How do you test? Send big SIP messages!
SIP messages are only getting bigger, so it is something you need to design and test for now, or it will be a major headache when it does happen. … That could involve pro-active pressure testing to find and resolve the problem areas before they bite you for real, and also monitoring largest message sizes flowing around your network so that you can predict when you’re going to start seeing more fragmentation and act accordingly.
When we teach classes on VoIP networks, we discuss the variety of SIP standards that can be used by working systems. For example, identifying the calling party varies between platforms: One system uses P-Asserted-Identity to indicate the caller, and another uses From. Then there’s the codec used for audio: One system uses G.722 and another uses G.722.2. One system expects national telephone numbers, and another system expects international-formatted numbers. The list goes on.
When setup two SIP systems to exchange phone calls, you need to be aware of the issues. If you’re lucky, the two systems have a lot in common. If you’re not, you may have a lot to fix.
Number Formatting Interop Checklist
1. Establish format for local telephone numbers.
Not always required.
2. Establish format for national telephone numbers. E.164 With Plus is recommended, e.g., +12292442099
3. Establish format for international telephone numbers. E.164 With Plus is recommended, e.g., +12292442099
4. Determine which short codes are supported, such as emergency codes 911, 111, and 0
5. Determine what control codes are supported, such as *67
Popular Telephone Number Formats
One of the most common questions for SIP interop is how the called telephone number will be formatted. There are several popular formats, and they occur in the Request-URI (after the “INVITE”) and in the To header.
This article is intended to familiarize you with many of the common options. The software you’re using — BroadWorks, Metaswitch, Asterisk, Sonus, Genband, etc. — will have certain capabilities and defaults. If you can find a way to interop end-to-end between the two call-control servers, without having to rewrite the number in an Session Border Controller in the middle, then you prevent substantial hassle.
Complete National Number
Perhaps the most common format is the complete national number. For example, in the United States, you might see this complete national number: 9193160013.
Within the UK, you might call a London number by dialing eight digits:
INVITE sip:email@example.com:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "UK User 1" <sip:firstname.lastname@example.org>;tag=90C2A5B1-AAFDEC86
When a human is dialing the number to place a call across the PSTN, the receiving system (INVITE User Agent Server, UAS) will need to determine the intended PSTN telephone number. One popular method is to apply the local area code of the calling party. For example, if this call is dialed by user “u2292442099”, and their US area code is “229”, then the UAS may reasonably interpret the dialed digits 3160013 as +1-229-316-0013 by prepending the national and area code to the dialed digits.
This is a very awkward format to support between carriers or systems, and quite impractical if multiple local areas are involved.
Dial by Extension
In PBX-like environments, “extensions” can be dialed. An extension is a locally-defined telephone number, but which is not useable routing between carriers. You can expect to only see this when the INVITE Request URI represents digits that a user is dialing personally.
Preferred: I recommend this format whenever possible, because it’s the only one with a token that signals the type of number it is: the leading Plus. By including the Plus before the number, all parties involve understand that the next digits will be the country code, followed by the national and local number.
While it’s easy to say this is preferred, it’s not the right fit for most cases where an ordinary human is dialing the call. Most keypads don’t have an obvious + dialing symbol, and users conventionally dial a shorter number.
The “ext.6100” in this example makes it incompatible with a telephone number; there’s no automatic mapping between ext.6100 and a telephone number. This is intended to route a call to the user known to the UAS with the username “ext.6100”.
VoIP Drove down the cost of making phone calls. We love that about VoIP: free long distance! In the telecom industry now, the idea that calls within a country would cost a retail user more than local calls seems quaint.
But the low cost of VoIP calling has brought a new adversary: “Robocalling,” so called because “Robots” (computers with pre-recorded messages) place phone calls. Many of these calls are placed in violation of existing laws. According to US Federal Trade Commission official Lois Greisman, lawbreakers “can place robocalls for less than one cent per minute,” and I suspect that only counts the calls that are actually answered. And though the call may be dialed by an automaton, it may be connected to a live person when its answered.
Elderly Victimized by VoIP Robocalling
The main ones who suffer are non-VoIP Users, disproportionately the elderly. Older people see no reason to switch from the same telephone service they’ve had most of their lives, and they retain legacy, POTS-based phone service. In rural areas, this is often served by feature-poor TDM equipment with only “Custom Local Area Signal Services” (CLASS) features, developed in the 1980s in the rotary phone era. The best available Robocall blocking uses Simultaneous Ring, which is not available in the CLASS feature set in classic telephone service.
Because the pain of these automated calls is felt so acutely by elderly citizens in America, the US Senate Special Committee on Aging is pushing for laws to force Service Providers to Provide Blocking. That the Aging Committee is pushing to regulate telephone services serves as evidence of the gravity.
By building efficient networks, VoIP technologists unwittingly created a special problem for users of legacy technology. (Hint: if you have aging relatives, be sure to upgrade their phone service to a modern option.)
“We now have this onslaught—it’s terrible,” said Indiana Attorney General Greg Zoeller, who added that the scams are “primarily directed at seniors.” Scammers attempt to trick seniors into paying for taxes they don’t owe, or providing banking information for lotteries they haven’t won.
The Wall Street Journal reports on one 88-year-old Arkansan who lost $110,932. And often, blocking or changing the Caller ID of the Robocaller is a key part of the scam. Another 57-year-old person lost nearly $250,000 in a scam relying on a “device that conceals the identity and location of the caller.”
The Nature of the Problem
p computers to launch outbound attacks, undoubtedly through VoIP protocols, and send them through grey-market PSTN Termination Providers. These are firms that offer low rates to accept a SIP telephone call and deliver it to a PSTN phone. But if they do nothing to ensure that the caller IDs used are accurate, and may be respond slowly to complaints against their customers. Further, they may choose to not participate in CDR-based Call Traces, the de facto industry standard method used to determine the source of a call. These can be anywhere in the world, and these firms need not exist for long.
The PSTN Termination Providers typically don’t actually operate PSTN Gateways themselves. That is, they have no ability to directly send the traffic into the legacy TDM SS7/C7 telephone network to reach any phone in the world. So the PSTN Termination Providers buy services from more legitimate PSTN Gateway (PSTN GW) Providers. These PSTN Gateway Providers then have direct connections to the major telephone carriers of the world, and paths to reach all the smaller ones as well.
AT&T is an example of the PSTN Gateway (GW). In an FCC Hearing on Robocalling in September, 2015, Adam Panagia of AT&T reports that they often receive a single coordinated Robocalling campaign through 20 or more wholesale carriers. If even a legitimate call center in India is routing calls the US, AT&T may receive calls from that single call center through 30 to 50 wholesale customers.
VoIP Fraud Plays a Role
In addition to the use of PSTN Termination Providers to perpetrate Robocalling fraud, some attackers use exploited VoIP Telephone Providers. In these cases, a VoIP provider has a security vulnerability, such as a SIP account with a simple password, or a customer device that is not protected by a firewall.
When attackers get access to this service provider, they use it as one of the paths to send calls to the victims. The defrauded Service Providers whose service is stolen are victims along with those who are called and then scammed. The poor security of many VoIP Service Providers contributes to the Robocalling problem, but Service Providers have the power to improve their security.
The FCC made a welcome change, and though some service providers were already providing some types of blocking before the June 2015 ruling, it opens options for blocking calls even among the scrupulous. So the question is now: which calls should be blocked?
Caller ID Trusted Currently, Caller ID fundamentally trusted from the original caller. Unlike modern email, there is no technical mechanism to confirm that the caller ID provided on a call is actually genuine. VoIP technology, such as SIP From, “Trust Domains” P-Asserted-Identity, and SIP “Identity” have not helped because they only apply to limited areas of the network, and don’t provide any certainty as a call travels from a call center in India to a retiree in Iowa.
Verifiable Caller ID is a key requirement to block scammers, according to Scott Mullen, VP Technology of Bandwidth.com. They’re a major PSTN termination and origination firm in the USA with extensive SS7/TDM interconnects, noted in the FCC workshop that toll-free fraud is a major problem they deal with. Even International calls come through in a way that tricks the US recipient into believing it’s a call originating from the US.
Problem Calls Distributed Further, calls from scammers are distributed across many networks. One PSTN GW Provider has some data, while another has different data. It’s not easy to coalesce the information into a unified view to make smart decisions about the call.
Can’t scan a phone call before delivery Another difficulty for scanning telephone calls lies in the real-time nature: unlike an email that can be analyzed in its entirety before it is deposited into your mailbox, only a few bits of information are available to a call-blocking system: (a) Time of call, (b) Claimed Calling party number, (c) Called party number, (d) Input source (such as a particular wholesale customer link).
I should note that we can learn about calls after they are delivered: Robocalls that are answered are usually disconnected very quickly. Short call durations provide some clues after some of the calls are delivered, which may be used to improve the go/no-go decision for future calls.
Technology Types Calls do flow across and SIP and TDM networks: but scam calls almost always originate as SIP. The cost of maintaining TDM infrastructure appears to be too high for scammers, as it probably invalidates the business model.
Give me a call, never
For the moment, Call Blocking based on Caller ID reputation is providing some relief. These work by using a database of known scammer caller IDs, then blocking the calls that are in that database.
The industry de facto standard for Personal Blocking Services is NOMOROBO, released in 2013 after winning a US Federal Trade Commission competition for technical solutions to Robocalling. At the September 2015, NOMOROBO was that annoying better sibling, with the FCC mom asking all the other participants, in effect, “Why can’t you be more like NOMOROBO?”
Nomorobo is free, and blocks calls from Caller IDs that are repeatedly used for annoying calls. It’s available to anyone who has Simultaneous Ring, a feature found on Broadworks, Metaswitch, and most other modern telecom platforms.
How Grandma Gets Attacked
Let’s first look at the normal call path from an attacker to the Victim. In this case, we’ll say that the victim is a subscriber of “Adams Rural Telephone Company.” The Attacker launches calls, dialing outbound calls, and some of those calls route to “PSTN Termination Provider #1”. This company determine where the calls should route, and, in those cases where Adams’ subscribers are the victim, routes the calls to Adams’. This supposes that both Adams and the Attacker do business with the same PSTN Termination Provider, but in many cases there will be several intermediate PSTN Providers.
How Grandma Blocks the Attack
NOMOROBO depends on the Simultaneous Ring feature. To use it, Grandma would setup her calls to ring both her phone and the Personal Blocking Service at the same time.
In this way, NOMOROBO receives an inbound call with the Caller ID provided by the Attacker. If NOMOROBO finds this Caller ID in its blacklist database, it can answer the call – thus ending the call for the Victim. The victim experiences only a single telephone ring.
NOMOROBO is clearly a market leader. They are benefiting from the “Network Effect,” where their popularity makes the service work better by aggregating data. NOMOROBO certainly receives many simultaneous calls during Robocall campaigns, and can exploit the concurrency to make smarter decisions.
But NOMOROBO must depend on the caller ID and called ID only, and cannot do anything to improve the quality of the data. NOMOROBO does not know when calls end (i.e., the duration of answered calls), so it cannot improve its decisions based on calls that turn out to be undesirable and are quickly ended.
NOMOROBO also cannot analyze the audio of the phone call. In the US, many Robocalls begin with dead silence, which is unnatural for ordinary person-to-person calls. Normal humans use this to decide which calls to hang up quickly. This mismatch between robocalls and human-dialed calls should be useful to an automated blocking system, if it were available.
Service-Provider Based Blocking
What happens when the victim does not have the Simultaneous Ring service, or lacks the skills to set it up? Or what if a Service Provider would like to block Robocalls for all of its subscribers? Telephone Service Providers can also block calls by using a network-based service.
Consider the network path shown. Normally calls flow from the Attacker, to an intermediate, and finally to the victim’s service provider.
Service Providers (SPs) can provide protection by using SIP call routing to route calls through an intermediate service. Instead of immediately sending every call to the victim, the SP can route the call to an intermediate service (or device) that checks the blacklist database. To borrow the term from email, only the calls that are not spam are ultimately delivered to the end user.
The Metaswitch SIP Robocall Blocking Service is a current market example of a service using SIP to address this problem. Service Providers can route their calls to the service, which blocks calls coming from caller ID numbers in the blacklist database.
This network model can provide robocall-blocking for an entire Service Provider, and not just for a single user. It’s also far more efficient than Personal Call Blocking, because it doesn’t require all the setup for a media path through the legacy PSTN to check the database for each call.
It also has opportunities for future development: with a stateful SIP proxy, a SP robocall blocking service could know when calls start and end. And once privacy concerns are handled, this approach analyze the audio for hints, such as dead-silence at the start of a call.
But like Personal Call Blocking, this approach still relies on the caller ID, which can be faked for each call.
Nothing is any good if other people like it
Blacklist based blocking services work today precisely because they are not popular. Today’s blocking services rely on calling party ID, as if that’s trustworthy. Scammers do have some incentive to place calls from the same caller ID repeatedly: once they find a telephone number with a matching CNAM caller name that people will answer – with names like “Internal Rev Service,” or “EDUC LOTTERY” – they seem to stick with that same number.
But Robocallers are already adapting to robocall blocking services. Some are calling from randomly-selected working, legal telephone numbers. This method completely defeats simplistic blacklist databases.
This means we really need a trustworthy caller ID. And some in the industry are working to provide it.
Engineers in the Internet Engineering Task Force (IETF) and the Alliance for Telecommunications Industry Solutions (ATIS) have developed a standard called STIR, “Secure Telephony Identifiers Revisited”. When used as designed, each call using STIR will include a signature as evidence that the calling party has the right to call from the telephone number they’re using.
This would be implemented at the entry to the SIP-PSTN network, ideally at the customer’s PBX or at their first Service Provider Interface. For example, a BroadWorks service provider may use SIP authentication to confirm the identity of a caller, and then create the STIR cryptographic signature to confirm the legitimacy of the caller ID.
Don’t strip that header
Currently, there are no SIP headers that must be retained end-to-end through the VoIP networks. All headers can be reconstructed at each step, though a few elements are reused (such as the calling and called party numbers). STIR assumes that VoIP carriers will be able to pass a SIP header through their network from the origin to the terminating carrier. This is certainly technically feasible, but will require substantial coordination – and likely a few SBC software updates for some carriers.
STIR promises a world where you can be certain of the calling party while your phone is ringing. But will it happen? STIR requires substantial technical work on VoIP network infrastructure. Practically every SIP carrier peering/trunk on every SBC deployed will have to be updated.
STIR will require establishment of a Certificate Authority (CA) who can provide the certificates verifying the right to use telephone numbers. We already have Certificate Authorities in the industry servicing the Web industry, so this should not be a major hurdle. You can expect big carriers to desire to be CAs for themselves – likely a smart solution for many cases. For example, AT&T has been, in effect, the “owner” of millions of telephone numbers for years, though they were permaently assigned to their subscribers. It makes sense for AT&T to be the CA for the numbers it already “owns”.
Who Goes There?
To succeed, STIR will have to engage the business models of the modern VoIP PSTN. Private companies and government entities alike use the flexibility of the PSTN to route their calls through any carrier that is convenient. If STIR requires evidence that the telephone number is being used legitimately, then credentials to use the telephone numbers must be distributed to all of the owners of telephone numbers.
Video Relay Service. For example, at the SIP Forum SIPNOC meeting in June 2016, one major Video Relay Service (VRS) for the Deaf and Hard-of-Hearing community commented that they effectively place calls on behalf of their users. A full STIR implementation will require the VRS providers to place the calls outbound for these users, even though the audio portion is actually connected to a Sign-Language Interpreter.
Government Agencies. Government agencies using COTS platforms like BroadWorks sometimes use a variety of paths for routing their calls outbound. They, too, will need the tools and technology to prove their right to use the caller ID, because preventing spoofing of calls from public institutions is one of hopes for STIR.
Call Center Services. Today it’s also common for a firm to hire a call center service to place outbound calls, representing a firm. STIR will require the Call Center firms to be capable of providing a signature showing the right to place calls from that entity. For example, if the Call Center for Delta Airlines needs to call you, then the Call Center service will need credentials (like a password) to allow them to place that outbound call from Delta’s telephone number. The Call Center will need to be upgraded to be capable of generating the STIR Identities.
Ten Years of Work
Once Caller IDs are verifiable, telephone users can make smart decisions about which calls they want to allow.
So How long will it take to block fraudulent caller ID, and get true caller ID? It took about seven years to prevent spoofing of domain names, like “gmail.com”, from the start of the IESG SPF experiment in 2005 to the implementation of DMARC in 2012. The PSTN moves even slower.
But solving the spoofed-caller-ID problem – and the Robocalling it enables – are worth doing. Service Providers should monitor the progress of STIR, and Vendors (like BroadSoft, Oracle Communications, Metaswitch, Genband, Sonus) should plan to support STIR.
Push for STIR Identity Support from your SIP Software Vendor, especially SBC, Feature Servers, and SIP Trunking Devices
Configure your equipment to allow the new Identity header to flow through
Find out how you can confirm and display the identity of calls you receive using STIR
Plan to enable users to legitimately delegate authority to place outbound calls
Unlike software, systems, network, and voice engineering, regulated engineering disciplines require licensing. According to the National Society of Professional Engineers, a college engineering graduate candidate can “begin to accumulate qualifying engineering experience”. What happens during these four years of post-graduate experience? “The experience must be supervised. That is, it must take place under the ultimate responsibility of one or more qualified engineers.” Further, the experience is expected to be “high quality, requiring the candidate to develop technical skill and initiative in the application of engineering principles.”
These are the engineers building the bridges you drive over, the flammable electrical parts you install in your walls, and the explosive power plants you live near. Society expects high standards because of the risks to health and safety.
So to advance in their field, those engineers seek out supervision. And because of the licensing, senior engineers must participate in the mentoring process. They have a valuable structure of mentoring.
But what about the “engineers” who run email servers, build voice networks, and write software? Generally, these information technology (IT) and computing disciplines are unregulated. Anybody can do IT as badly as they like, and we have no mentoring structures in place.
ASCII question, but got no ANSI
Computing has grown without formal, legally-mandated mentoring requirements because computers serve as self-checking devices: if the new programmer tries to do something crazy or foolish, it won’t compile or it won’t work. Unlike faulty bridge designs, computing systems are relatively easy to test. If you try to build a network but don’t know what you’re doing, that network won’t function.
But in IT/computing fields, mentoring is still needed:
IT Engineers often have questions that cannot quickly be googled or tested by experimentation
IT Engineers need to learn from the mistakes & experience of others
IT Engineers need review from other brains to help check their own ideas
So if you’re learning something new, how do you get your questions answered if you’re among people who’d rather avoid people?
A Mentor commits to answer questions
With so many demands on a skilled professional’s life, the key scarcity is the willingness to provide answers when questioned. So if you’re going to be a mentor engineer, be sure you’re available to hear questions, and provide good answers.
That is, mentoring in IT Engineering is first about the mentor’s commitment. The mentor has to be willing to take questions from junior engineers, and commit to answer them.
I had some great mentors as I was learning. Jud McCranie is a professional programmer I met through through a university-operated Bulletin Board System in the early 1990s. He answered a thousand questions from me on programming, and even decrypted my amateurish encryption algorithms. He was a mentor because he took my questions and didn’t ignore them. He challenged my thinking, often asking me questions I couldn’t answer myself. I’m always thankful for his patient consideration of numerous questions from a teenager. (McCranie is cited in one of Knuth’s new volumes, the Art of Computer Programming.)
Another superb question-answering mentor was the late Jon Hamlin, an ex-CIA intern and graduate of Valdosta State University and University of Georgia. Jon was far more interested in system administration, networking and UNIX, and in his role as computer science lab manager at Valdosta State University, Jon had access to extensive SunOS/Solaris resources and the time to think about my questions. He setup a Linux computer and gave me access, and helped me clean up a few messes.
Both of these men put in hours of their lives to answer questions I wrote through email, and they’re part of my motivation to answer questions for others.
Why it’s so hard to get somebody to really answer a question
I just claimed that there’s a scarcity of willingness to answer questions, so the first role of a mentor is to commit to be available to answer questions. Why is that?
There are lots of ignorant people willing to give bad answers. But you don’t want them to be your mentor.
Nervousness. “They don’t want to be embarrassed in front of their boss or co-workers,” Adams writes. This can be caused by simple anxiety, or excessive pride if they don’t want to admit there’s something they don’t know. Antidote: So the mentor must make it easy and unthreatening to ask questions by readily enjoying the curious discovery of new facts. Computing fields move famously fast, so there are always new things to learn.
Avoiding annoyance. They don’t want to ask questions because they perceive it to be inconvenient. Antidote: Therefore the mentor must actively invite the questions.
Bewilderment. The mentee doesn’t know where to start because everything feels so unfamiliar. Antidote: The mentor should set milestones of capability to encourage steady improvements in comprehension; e.g., crawl, then walk, then run. E.g., Login to the server first. Then locate the logs. Then read the logs. Then understand the logs.
Previous trauma. The world is full of jerks, and many people have been criticized by those jerks for asking good questions. The mentee may need to recover from unhealthy work environment where legitimate questions were met with caustic attitudes.
Lack of curiosity. One of the most dangerous problems is a lack of curiosity in the subject matter. Mauricio Porfiri, a robotics researcher at New York University, says that “Being creative and being curious is more important than being the smartest or the best at equations if you want to be a great engineer.” Albert Einstein said that “The important thing is not to stop questioning. Curiosity has its own reason for existing.” The lack of curiosity leads to a dangerous complacency, causing the mentee not to care enough to bother to formulate questions or challenge their own assumptions, but to muddle through with the current level of ignorance. Computing is great because it encourages curiosity. Have a question? Try it out. A chemist or physicist or mechanical engineer needs equipment for a lab, but the Computer Scientist needs only a computer and the means to program it. Ayodeji Awosika writes about the dangers of suppressed curiosity in “The Theory of Nothing: Why Lack of Curiosity Leads to Mediocrity.”
Beyond Q&A: The Weekly Mentoring Meeting
To support growth of a mentee, mentors can schedule regular time with the mentee. Just like the commitment to answer questions, the mentor must make make these meetings a commitment. Commonly, this happens as a weekly meeting to ask questions and plan progress.
Jim Anderson, a Computer Scientist at UNC-Chapel Hill, followed a simple model of tracking progress for his advisees: he wrote the next steps for each of his mentees on the whiteboard in his office. Then they could easily see what was expected. And in each weekly meeting, he could easily recall their responsibilities.
Plan the route out of ignorance. Even in non-supervisory mentoring, it’s helpful for the mentor to plan and track progress so the mentor ensures that the incomprehensible is coming into focus for the mentee. Without this guidance, the mentee may be trapped in a complicated area without a path forward, and without the ability to ask questions to get out of it.
Review recent accomplishments. The weekly meeting is also a good time to review samples of the mentee’s work. The mentor can praise progress and identify the most important improvements the mentee can work on next. But even when identifying the next growth area, the mentee should recognize the mentor’s achievements.
For example, for Computing/IT professionals, reviewing work can mean:
Discussing interesting troubleshooting problems and the approach to troubleshooting.
Reviewing system configurations to see how a mentee’s task was accomplished.
Reviewing source code.
Establishing Healthy Mentorship
To begin healthy IT/Computing mentoring,
Get experienced engineers genuinely willing to answer questions: call them Mentors.
Get other engineers with curiosity, who are willing to ask questions.
If a team has lots of technical work to do, and only a few brilliant engineers available, how do you get work to the right people?
In this article, I discuss methods for managing work in IT and technology teams, such as those doing Network Operations or Engineering, Devops, Security Management, System Administration when you have a mixture of skill levels, including “star engineers” with 10+ years of experience. I would give different advice to a team focused on developing software.
Call the Brain Surgeon!
A hospital is full of medical staff. But when a life-threatening case rolls in, you want the most experienced physician available to handle it. And if you’re doing important, high-profile, mission-critical technical work, you might want the most experienced technician to do the work.
But if the problem at hand is to restore the telephone service for the hospital, or to enable a failed 911 Emergency calling service, then it makes sense to put the best available people on it.
…to suture this wound
Yet often, the risks are not so high, and the need is not actually so urgent. You really don’t want your most-senior staff doing relatively low-risk simple work. Escalating all work to the most-skilled person robs the lesser-skilled staff of experience and deprives the organization of good ideas.
How do you prevent all the hard work from being done by the best and brightest? How do you prevent all the work from flowing up to the highest-skilled people?
Make junior staff work at the highest level possible
Make senior staff support and include junior staff in strategy and planning
Define the barriers between junior and senior staff
Define the workflow for projects
Excelling as a Junior Engineer
Suppose you’re a junior engineer in a team, and all of the challenging work goes to the senior engineers. You know you could do that work, but it’s always going to the senior guy. How do you get the chance to work on hard problems?
It’s possible your management is fumbling, but be sure you’re doing a great job on the problems you have. It’s easy to dream about solving challenging problems, but you have to be sure you’re doing fabulous work in your current tasks:
Are you doing as well as anyone in the world?
Are you aware of how other people solve this problem?
Are you documenting your methods?
Are you listening to the ideas of your colleagues?
Are you clearly explaining your results?
Are you recording your questions and other unknowns, then working to get answers to those questions?
Are you confounded by how difficult it is to make progress?
Are you losing track of a detail, date, times, or meetings?
Are you recording results of the project so others can benefit from it later?
Are you making progress before the deadlines, so that you have substantial progress to demonstrate, and questions to ask?
Are you communicating clearly, with the requester of the work (the “customer”) and others with whom you collaborate?
When you have a chance to talk about the project, are you engaged? Are you contributing ideas?
Are you doing what you can to be pleasant, even fun, to work with?
Make Yourself Available to Hard Problems. Find out how the interesting projects come in, and look for ways to be involved.
Try to make sure your work environment isn’t too hidden and isolated.
Ask to join conference calls on interesting projects
Take notes in meetings, and offer to provide those notes to other people.
Ask questions about everything you don’t understand.
Senior Engineers Plan Projects
What is the role of the Senior Engineers?
Make design decisions.
(Anything that can be done either better or worse is a design decision.)
Analyze the problems and projects
Write plans for the projects
Guide the solutions toward doing the best thing (“what should be done”), minimizing involvement in method chosen.
The distinction of a star technician is perspective and context: they know how things are done and why. They can consider pros and cons of various options for change. They know potential for risks. They should know the organization’s strategy and mission.
With this context, a senior engineer should be planning projects:
What are the objective and the goals?
What are each of the phases of work along the way? Tip: keep each phase <4 hours of work
What kind of data needs to be collected before proceeding?
How will we know when we’re done?
How long should it take to complete each phase?
A star technician is as mature as the plans they can provide to others to execute.
Define the Junior/Senior Boundary
What kind of work should be done by senior engineers, and what kind by junior engineers?
Senior Engineers are obligated to:
Maintain or improve architectural unity of the system.
Provide opinions based on mature technical aesthetic
Research the available options for solving problems (i.e., not just their favorite options)
Write plans for accomplishing the work
Determine which things to do (E.g.: Install Apache on a Linux VM, or buy an F5 Local Traffic Manger?)
Answer questions raised by junior staff
Review and approve the work done by junior staff
Be aware of the skills and capabilities of the junior staff
Do the work that no one else is able or willing to do
Junior Engineers should:
Follow the plan to accomplish the work
Debate and question the plan if they see better ways
Ask good questions to fully understand the rationale behind the plan
Discuss the problems with Senior Engineers when they find something interesting or unexpected.
Define the Workflow
Send all work to the Junior Staff?
A popular remedy for Work-Flows-Up Syndrome is to send all problems to the junior staff. For example, all tickets come in and start at Tier 1, then the hardest 80% escalate to Tier 2, then only the worst get escalated to Tier 3.
Hacks. My experience with this method for actual problem solving is poor: Junior staff are prone to develop Workarounds, Tricks, and Hacks (collectively known as “WTH Solutions”) without adequate context or insight into the best way to solve problems. Without knowing the long-term risks, WTH Solutions make the system more fragile.
Blockages. Just as a star engineer can be a bottleneck, junior staff can block the successful completion of work. To know when to ask for help is a learned skill, and junior staff are often prone to keep poking around the edges of a problem. They can tend to sit on interesting problems too long.
Lack of Senior Insight. Senior staff need the insight on the real-world problems with their system. When every problem starts at the junior staff, the senior staff may be unaware of the problems.
For example: In one case, the network had a serious, but minor issue, occurring that could be worked-around through an easy procedure. The senior staff gave instructions for the junior staff in how to solve it, expecting the junior staff to need to do it no more than weekly.
But the problem grew more frequent. The junior staff were using the workaround dozens of times per day, faithfully following their instructions. But the senior staff were unaware of the scope of the problem because the junior staff were dutifully executing the procedure.
In this case, the senior staff needed to be aware of the scale to develop a more permanent solution. In that case, when a problem was mostly-delegated to the junior staff, the senior staff couldn’t see what was happening and rethink their proposed “solution.”
Plan Before Proceeding
Rather than “fill all the work from the bottom” skill levels, projects should be planned from the beginning. The first step in processing every new problem should be writing a project plan.
What’s a project?
Anything that has several steps is a “project” as I’m using it here.
What should be designed?
A design decision is anything that can be done incorrectly but still appear to work. For example, you could build a network using one large subnet in one big ethernet, or you could set up the network with routing and subnets. Both may appear to work, but one will be better for the situation.
Writing the project plan may require 10% to 20% of the total work time on the task, and should be done by Senior Engineers. The Senior staff is thus responsible for considering options and maintaining architectural and conceptual integrity of the system.
After the plan is developed, the project can be assigned to somebody at the right skill level.
Could the execution of the project affect architectural integrity of the solution? Senior Engineers
Does the junior staff have the skills to do it, or can they learn? Delegate to Junior Engineers
Senior / Junior Followup
Senior engineers need to routinely follow-up with junior staff to talk about their projects.
Most projects cannot be completed start-to-finish without interruptions. Sometimes we need a new license file; or we need to open a ticket with the vendor; or we need access to a protected site; or information from the customer. Due to these interactions with other vendors and customers, I find many technicians need to have 3-5 projects in order to stay busy.
With the proper distribution of tasks, even a team of two engineers or technicians can be more effective. Junior staff have responsibilities to grow technically and professionally, and Senior staff have an opportunity to increase their team’s effective throughput.