“DNS is one of those things that gets overlooked… You make your voice servers super-redundant, but you take it for granted that DNS will always work.” — Fonality

The Domain Name System (DNS) conventionally helps devices convert domain names, like VoIPCarrier.com, to IP addresses, like 216.128.128.50. But in Voice and Unified Communication (UC) services, it serves an additional key role in fault tolerance and in enabling encryption. For large networks, it can also assist in balancing load across many data centers.

Background: In typical Voice Network implementations, the UC client or SIP phone is configured to make requests to its local “recursive caching” DNS server. (So called “recursive”, because it makes multiple steps to arrive at the answer, and “caching” because it stores the resulting answers for some time.) This server consults with the Root and Top-Level-Domain (TLD) servers that handle domains like .com. Those TLD servers direct the DNS client’s recursive caching DNS to the authoritative DNS server for the Voice/UC provider, and that authoritative server provides the IP addresses, failover addresses, load balancing settings, and encryption settings for the UC or Voice Service.

There are some common variations on this theme:

1. Voice & UC Operators that don’t use the Internet. For Cable operators, “Multiple System Operators” (MSOs) and Fiber to the Premise (FTTx) Providers, DNS can be used, but it doesn’t use the Root Servers or TLDs. These operators cannot provide their service across the Internet.

2. Operators that don’t use DNS at all. Some networks are built with the IP addresses configured in the devices, so no DNS is ever used. As 2600hz argues, this strategy can mean that inevitable IP address renumbering becomes extremely expensive and can create outages for customers. “This can be dangerous if you have customers who provision their own phones and, later on, you move data centers or change IP blocks for some reason and have no way to change the IPs of those phones.”

Nine Common Vulnerabilities Most Voice & UC Providers Are Exposed To

I. SIP Phones & Clients depend on their local DNS servers.

The most common mistake is that SIP Phones and Desktop UC Clients are relying on the local DNS servers in their data networks. For some users, this is the outdated Windows server in the closet, but it’s often the underpowered DNS software provided in the local router by the Internet Service Provider (ISP).

The risk is that when this local, underpowered, or under-managed DNS server malfunctions, the Voice/UC service fails. The Voice/UC provider is responsible for important features, like the ability to make emergency phone calls, or conduct business (video calling, desktop sharing, messaging, etc.), but all of those services fail where the local DNS server fails.

Key Remedy Options:

  • Option A: Configure the Voice/UC Clients to use DNS servers provided by the Voice/UC Provider itself.
  • Option B: Configure the Voice/UC Clients to use reliable DNS servers. For example, you may configure the clients to use Google (8.8.8.8, for example) and Level(3) (4.2.2.2, for example).
  • Option C: If the client supports it, locally cache DNS entries for backup use. Polycom devices support this “hosts file” style of entry, so that if no DNS responses are available, the Polycom will use those configured entries as a backup. (Note that Packet8 was using this in 2009 when they reported that Register.com caused an outage by providing false DNS responses.)

II. Depending on a single organization to run the DNS server for authoritative service.

Any one DNS server on the Internet can be attacked and overloaded using botnet attacks. Even most organizations that offer DNS services can be overwhelmed. History shows that attacks tend to be launched against individual DNS operators. (Note that the Global Root DNS servers are under continuous attack, and have defenses that are effective so far.)

For example, in March 2017, GoDaddy’s DNS service in Europe was shut down. This resulted in a serious outage for operators depending exclusively on GoDaddy’s DNS services.

GoDaddy isn’t alone. In October 2016, DNS provider Dyn was hit, shutting down their customers.

In September 2016, Voice/UC provider Fonality suffered a four-hour outage due to DNS problems. This was due to having a single, integrated DNS system. A single human error brought down all parts of their DNS servers. After this outage, they added diversity to their DNS system so that it wasn’t hosted by a single provider (themselves!).

Key Remedy Options:

  • Option A: Buy DNS hosting from multiple (Authoritative) hosting companies, confirming that they’re not using the same underlying infrastructure.
  • Option B: Run your own DNS servers, but also use another company’s authoritative DNS servers too. To be effective, the systems should not be synchronized automatically. Some human review should be required to make the change in each place.

III. Failing to use locally cached answers in case all DNS fails.

Some SIP Phones can store a cache of DNS records locally, provided to the device by the configuration file. For example, the Polycom UC SIP devices support these configurations, and only use them if no DNS replies are returned. The Polycom implementation supports basic A records, as well as the advanced records, SRV and NAPTR, used for load balancing and encryption management.

Yealink SIP Phones have a Static DNS Cache supporting A, SRV, and NAPTR records that can be configured either as the preferred source of DNS settings, or as a backup.

Mitel SIP Phones have similar support for A and SRV records in their DNS Pre-Caching functionality.

Software vendors who provide desktop and Mobile software should build in this capability, so that the configuration file delivered to the software has a DNS cache backup in it.
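For software vendors, the backup behavior described above can be sketched in a few lines of Python. This is only an illustration, not any phone vendor's implementation: the function name `resolve_with_backup`, the hostname `sip.voipcarrier.example`, and the shape of the static cache are assumptions made for the example. Live DNS is tried first, and the configured entry is used only when the lookup fails.

```python
import socket
from typing import Callable, Mapping


def resolve_with_backup(
    hostname: str,
    static_cache: Mapping[str, str],
    lookup: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Return an IP address for hostname, preferring live DNS.

    static_cache is a hypothetical name-to-address map that would be
    delivered in the client's configuration file.
    """
    try:
        return lookup(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError, so any lookup
        # failure falls through to the configured backup entry.
        if hostname in static_cache:
            return static_cache[hostname]
        raise


# Hypothetical backup entry, as it might arrive in a configuration file.
STATIC_CACHE = {"sip.voipcarrier.example": "216.128.128.50"}
```

The `lookup` parameter exists so the fallback path can be exercised in a lab without breaking real DNS. A fuller version would also need SRV and NAPTR entries, which this sketch leaves out.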

IV. Too short of Time-To-Live Cache Lifetimes.

Every DNS record has a “Time To Live” (TTL) setting, which specifies how long caching servers should store the record.

Large Values, such as four hours (which would be configured as 14400 seconds), have the advantage of being stable. If there’s an interruption in access in the DNS system, these ensure that the old records will stay in the cache for some time. But the downside is that when changes are needed, the operator may have to wait four hours for the change to go into effect.

Small Values, such as 60 seconds, have the advantage of allowing relatively quick changes. For example, if a provider learns that their primary Session Border Controller (SBC) site is having problems, they can make a change, and expect all DNS requests within 60 seconds to get the new value and route requests to the new location. But small values have a downside: if there’s a short interruption in the client’s access to the DNS servers, then the clients will lose all old records.

This is such a common problem that Cisco’s OpenDNS SmartCache continues to store and use records after they have expired — but only if the authoritative servers are unavailable. This only helps if the Voice/UC clients are using OpenDNS as the recursive caching servers.

Large TTLs don’t provide complete protection against DNS failures. Even with a large TTL, you should expect records to begin expiring immediately upon failure of your authoritative service, because each cache fetched its copy at a different time. For example, if you have a TTL of four hours, and an outage occurs at 12:00pm, then by 1:00pm approximately 25% of your users’ caches will have expired, and by 2:00pm half of your users are offline.
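The arithmetic behind that example can be written out as a small Python sketch. It assumes that caches last refreshed their records at times spread evenly across one TTL period, which is a simplification; the function name is made up for the illustration.

```python
def fraction_expired(ttl_hours: float, hours_since_outage: float) -> float:
    """Share of caches whose record has expired, assuming the caches
    last refreshed at times spread uniformly across one TTL period."""
    if ttl_hours <= 0:
        raise ValueError("ttl_hours must be positive")
    return min(max(hours_since_outage / ttl_hours, 0.0), 1.0)


# Four-hour TTL, outage starting at 12:00pm.
for hours in (1, 2, 4):
    print(f"{hours}h after outage: {fraction_expired(4, hours):.0%} expired")
# 1h after outage: 25% expired
# 2h after outage: 50% expired
# 4h after outage: 100% expired
```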

Key Remedy Options:

  • Option A: Use a relatively large DNS TTL value, such as four hours, for stability.
  • Option B: Use a large DNS TTL, but change it to a smaller number in the hours before a scheduled change.

V. All DNS Servers in one BGP prefix or BGP Autonomous System

BGP is the system used by large organizations to configure the routes for IP traffic. An IP Prefix, such as 216.128.128.0/21, specifies the routing path for 2,048 IP addresses.

But because BGP Prefixes are subject to attack and downtime too, you don’t want to have all of your recursive caching servers, or all of your authoritative DNS servers, in the same BGP prefix advertisement.

The takeover of a BGP Prefix is called BGP Prefix Hijacking, and has been a known threat on the Internet for over a decade. Zach Julian has an overview of how BGP Hijacking is done, and reports on several instances of successful BGP hijacking where traffic is intercepted and rerouted.

(For the same reasons, you also don’t want to have all of your Session Border Controllers (SBCs), or all of your Load Balancers, or all of your Production Firewalls in the same BGP prefix if you can avoid it.)

Hijacking one BGP prefix is generally easier than hijacking more than one, so by having a diverse set of BGP prefixes, you make this attack more difficult and improve the robustness of your network.

A rarer attack that is still possible is hijacking of a BGP Autonomous System Number (ASN). The ASN is used in Internet Routing to identify a large company or Internet Service Provider. An ASN attack would allow an attacker to interrupt BGP traffic for an entire ISP or large enterprise.

Key Remedy Options:

  • Option A: Ensure that your DNS servers (both authoritative and the recursive caching servers used by client devices) are from at least two different BGP prefix advertisements. Typically this is easy to check because the IP addresses begin with different digits.
  • Option B: Ensure that DNS servers are advertised via two different Autonomous System Numbers (ASN) in BGP. This can often be accomplished by buying authoritative and recursive-caching service from an additional provider.
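A rough way to check Option A is sketched below with Python's standard `ipaddress` module. The prefix list is an assumption for the example (the second prefix is a documentation range, not a real advertisement); in practice you would fill it from your own routing data or a BGP looking glass. It also shows that addresses can look different and still share one prefix.

```python
import ipaddress

# Hypothetical advertised prefixes; replace with real routing data.
ADVERTISED_PREFIXES = [
    ipaddress.ip_network("216.128.128.0/21"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def covering_prefix(ip: str):
    """Return the advertised prefix that contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for net in ADVERTISED_PREFIXES:
        if addr in net:
            return net
    return None


def is_prefix_diverse(server_ips) -> bool:
    """True when the servers fall in at least two distinct prefixes."""
    prefixes = {covering_prefix(ip) for ip in server_ips}
    prefixes.discard(None)  # ignore addresses we have no prefix for
    return len(prefixes) >= 2


print(ADVERTISED_PREFIXES[0].num_addresses)                     # 2048
print(is_prefix_diverse(["216.128.128.50", "216.128.130.9"]))   # False
print(is_prefix_diverse(["216.128.128.50", "198.51.100.53"]))   # True
```

The first pair differs in the third octet, yet both addresses sit inside the same /21, so a hijack of that one prefix would affect both servers.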

VI. Depending on a single link to the Internet for your DNS authoritative service.

Any one Internet link is subject to attack and overload. One of the most common forms of Internet Attack is a Distributed Denial of Service (DDoS) attack, where millions of vulnerable devices are used to blast traffic at a single destination. DDoS mitigation is an active industry, with a number of firms providing tools and services to fight DDoS attacks.

Arbor Networks is one of the leading firms providing equipment built to fight DDoS attacks. Akamai and AT&T offer services to clients to help fight DDoS attacks.

However good all these services are, while the DDoS attack is in early stages, access to the DNS servers, Session Border Controllers, and Load Balancers will be interrupted for some time.

Many operators have redundant Internet links to provide fault tolerance. In the normal implementation, all Internet links are set up to be capable of handling all of the traffic. However, this means that every DDoS attack will hit every link simultaneously.

Key Remedy Options:

  • Option A: Separate traffic for each site, so that you have distinct IP blocks at each site. For example, your “Eastern Site” may advertise 216.128.128.0/21 and your “Western Site” may advertise 68.10.2.0/21.
  • Option B: For the classic form of IP redundancy you may want the same IPs through both sites. Use Option A, but also advertise a common block through both providers.

VII. Depending on a single DNS registrar.

The DNS registrar is responsible for loading domain name information into the Top-Level Domain (TLD) servers. For example, if you own VoIPCarrierX.com, you’ll pay a registrar, like Register.com, GoDaddy, or EasyDNS, a small annual fee to keep your data loaded in the .com name servers.

But every registrar is susceptible to attack or malfunction. For example, in 2016, the DNS Registrar for New York Times was attacked, leading to an outage for the news site.

In early 2017, a Brazilian Bank’s operation was taken over after its Registrar was hacked.

Key Remedy Option:

  • Use Multiple Registrars; this will necessitate multiple domain names.

VIII. Depending on a single Top Level Domain (.com, or .net)

As discussed above, if a DNS registrar is attacked, then the attackers can control the access to the domain name. But most of the Top Level Domains (TLDs) are managed separately from one another, and it is conceivable that an entire Top Level Domain could be attacked.

For example, the servers that handle .com are different from the servers that handle .net. Each country’s top level domain, like .us (for the United States) and .br (for Brazil), is also distinct.

It’s easy to assume that all top-level domains are well managed. But because they are run separately, we shouldn’t assume they are. For example, China’s .cn TLD was successfully attacked in 2013.

By using different Top Level Domains, you can protect your services against excessive dependence on any one TLD server. For US-based companies, .com and .net are distinct domains, but there may be value in finding more diversity outside a single nation’s DNS management.

Voice/UC services have a distinct advantage here: users are not normally aware of the domain name in use. Google cannot easily use multiple domain names: everyone expects “gmail.com” to stay put at that name. But SIP phones and UC clients can be configured to use multiple domain names.

This is an important capability at the heart of many of the options here. To use multiple domain names, the client has to be capable of failing over.

This has been a proven technique. Voice Provider 2600hz uses multiple domain names internally (one in .com and one in .net), and configures its devices to know how to query both.

Key Remedy Option:

  • Use Multiple Domain Names from distinct Top-Level Domains.
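The client-side failover this depends on could look roughly like the following Python sketch. It is a simplification, not how any particular SIP client works: `try_register` stands in for whatever registration attempt the client makes, and both it and the function name are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_working_domain(
    domains: Iterable[str],
    try_register: Callable[[str], bool],
) -> Optional[str]:
    """Try each configured domain in order and return the first one
    where registration succeeds, or None if every domain fails."""
    for domain in domains:
        try:
            if try_register(domain):
                return domain
        except OSError:
            # DNS or network failure for this domain: move to the next.
            continue
    return None
```

A client configured with, say, `["voipcarrierx.com", "voipcarrierx.net"]` would fall through to the .net name if the .com name could not be resolved or reached.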

IX. Voice & UC Clients Not Verifying Your Servers with TLS

TLS (formerly known as SSL) has an important capability of validating that the server used is the legitimate and intended server. For example, if your SIP phone is configured to connect to VoIPCarrierX.net, then once it is finally connected to some IP address (say, 216.128.128.255), the SIP phone can validate that the server provides a certificate that is cryptographically signed by a key that the phone already trusts. The list of trusted parties is stored in the “Root Chain” on the TLS client; if the SIP phone has been configured to trust, say, Verisign, then the server’s certificate could be signed by Verisign, identifying the server as legitimate for VoIPCarrierX.net. TLS also has a mechanism so that the client confirms mathematically that the server actually holds some internal secret information (its private key), proving it is truly the owner of the signed certificate. StackExchange has a nice introduction to this technology.

This means that after the entire DNS process has played out, using all the techniques above, the SIP phone or UC client can be built to verify that the server it finally reached was legitimate. This is a key requirement so that, if the DNS protections put in place fail, the phone will refuse to talk to an illegitimate hijacker.

Key Remedy Option:

  • Use SIP over TLS, and verify the Server Certificate.
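As a minimal sketch of what verifying the server certificate means on the client side, the Python standard library's `ssl` module can be set up as below. The function names are made up for the example, and a real SIP stack has its own TLS settings; port 5061 is the usual SIP-over-TLS port.

```python
import socket
import ssl


def make_verifying_context() -> ssl.SSLContext:
    """Build a TLS context that rejects untrusted or mismatched certificates."""
    context = ssl.create_default_context()
    # These are already the defaults of create_default_context();
    # they are stated explicitly because they are the point of the example.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def connect_verified(host: str, port: int = 5061, timeout: float = 5.0) -> ssl.SSLSocket:
    """Open a TLS connection; the handshake raises an ssl.SSLError
    subclass if the certificate does not validate for host."""
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        return make_verifying_context().wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

Passing `server_hostname` is what ties the certificate check to the name the client meant to reach, so a hijacker answering at the same IP address with a different certificate is refused.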

Conclusion

DNS can be an effective and powerful way to manage Unified Communications services, but it does expose the operator to a number of attack methods. Smart Service Providers are taking the challenge seriously, implementing a variety of techniques to introduce diversity and genuine redundancy, including multiple BGP prefixes, TLS verification, and multiple Top Level Domains.

  • Careful testing is crucial for stability before launching new SIP Access Device models
  • The Product Definition must be baked into the Test Plan used to approve the device, & new software for it
  • Customer Technical Support teams need extensive early access to devices  before they are deployed

Space Probes are designed to be sent far away, beyond our reach. They run software, and send back data to Home Base. If a Space Probe dies, it may never call home, and there’s not much we can do to fix it from here.

For Voice Operators, the SIP Phones, IADs, User Agents, and soft-phone clients are a bit like Space Probes: once you deploy them, you can’t go touch them. If they never “phone home,” you may be out of luck. So you have to plan well to make SIP clients work properly before they’re launched.

Planning the Launch

The Challenge. There you are, in charge of making the amazing new Hologram SIP phone work on your platform. The Product and Marketing folks LOVE this new phone, especially the way it integrates with your clients’ PCs and does video calls via Google Hangouts and Apple FaceTime. And it’s your job to integrate the device into your BroadSoft BroadWorks network.

The $99 Hologram Phone does SIP, FaceTime, Google Hangouts, and your job is to get it to market. Concept, Josselin Zaïgouche

Avoiding Device Management Disasters

You’ve known about the SIP device disasters:

  • …when your Support Staff got calls from customers using the Yealink T-58V phone before they had ever heard of it
  • …and when the Engineering team had a week to roll out video calling on all the Polycom VVX phones, because customers were buying cameras
  • …or when the CEO of your company got the latest Mitel 6869 phone, and read that the phone could show him his company directory in Microsoft Exchange, but then it didn’t work.
  • …and who could forget when you upgraded 1000 phones to discover the new firmware required a different file format?

BroadWorks for Device Management?

BroadSoft BroadWorks Identity/Device Profile Type
BroadWorks R20sp1 includes TLS Client Certificate Authentication, key for modern device management.

BroadWorks provides some mature tools for managing new device types, including some important key features:

  1. The BroadWorks system owner operator can add a new device type at any time, without BroadSoft’s assistance.
  2. BroadWorks records many key SIP parameters of the device type.
  3. BroadWorks stores the files like software and images, for delivery to the device.
  4. BroadWorks automatically generates configuration files as the device user’s changes are made
  5. BroadWorks system owners can define custom tags to handle new features and situations. Example: Most phones don’t support Apple Facetime, but if you have one phone with a FaceTime gateway, you can define “%FACETIME_GATEWAY%” as a custom tag in BroadWorks. Then BroadWorks can replace that tag macro with the appropriate value for each customer.
  6. BroadWorks can deliver the files to the device with HTTP or HTTPS, confirming username/password or SSL client certificate.
  7. BroadWorks can signal the device to download the new configuration file.

Key Milestones for Productizing a SIP Access Device

  1. Decide: Which of the Features will you support?
  2. Prototyping & Testing: Do the Features Work?
  3. Management: Can you Control It?
  4. Support: Can you help users use it?

Features: You Can’t Support Them All

Every SIP phone comes with a bazillion features, but no one network can support them all. You have to decide which features you’re going to try to support. Some of the top features are:

  • Ordinary calling with Caller ID
  • “HD Voice,” usually using G.722
  • Power over Ethernet
  • LLDP-MED for automatic Ethernet VLAN selection
  • Busy Lamp Field (BLF) / Line State Monitoring
  • Shared Call Appearance / Shared Lines
  • Programmable “Soft Keys”
  • Customizable Dial Plans (so the phone knows when you’re done dialing)
  • Distinctive Ring
  • Video Calling
  • Bluetooth headset support
  • USB headset support
  • Multi-party Video Calling
  • Multicast Paging (where the speakerphone plays a page sent to a group of phones)
  • 3-Way and N-Way Conferencing (where a large number of people can be conferenced together using the conference button)
  • Customizable Contact List
  • LDAP or Microsoft Active Directory or Exchange directory access
  • Customizable background logos
  • Calendar integration

…But many will try

But there are exotic features on every phone. Are you going to help your customers if they contact your help center asking about these?

Practically, new phones are far too complex to support every feature. Choose the features you actually intend to fully, thoroughly support.

Prototyping and Testing: Prove Those Features Work

Once you’ve chosen your features, you can develop prototype templates so that BroadWorks can generate configuration files for that phone. In most cases, in the BroadSoft community, responsible vendors will post prototype templates with example configuration files to BroadSoft’s site, Xchange. These configuration file samples can be in one of the categories,  “Interoperability – CPE Kit” or “Interoperability – Configuration Guide”.

Essential Parts of an Identity/Device Profile Type

An Identity/Device Profile Type (IDPT) is a specific model of device, such as a SIP phone.

BroadWorks supports an unlimited number of Identity/Device Profile Types (IDPTs).

The key parts are:

  • Profile: Identity/Device Profile Type (IDPT) Profile
  • File Templates: Identity/Device Profile Type (IDPT) Files & Authentication
  • Custom Tags

Once a specific device (with a specific MAC address) is created in BroadWorks, you can’t change its Identity/Device Profile Type (IDPT). But ECG Alpaca can move a device to a new device type, retaining all of its custom settings, users, authentication, Shared Call Appearance Settings, and even custom files.

Profile

In the IDPT Profile, you can set key settings that apply to every device of that type, such as:

  • What should the SIP Request URI look like?
  • Does the device support RTP early media?
  • Can the device register?
  • Does it support HTTP authentication?
  • How does it handle Privacy headers?

Ideally, your vendor will specify these settings exactly.

The Identity/Device Profile Type (IDPT) Profile page allows you to set key SIP features and controls for the IDPT. These are applied to every device of this type on the platform.

Files

For each IDPT, BroadWorks allows you to upload any number of templates and static files. Templates can be translated by an elaborate process of search-and-replace, and are usually used for configuration files. Static files are typically firmware/software and graphic images.

BroadWorks can store a set of files for each IDPT: some are templates to be customized for each phone, and some files are static, like firmware.

File Template Example: BWDEVICE_%BWMACADDRESS%.cfg

An excerpt from the template BWDEVICE_%BWMACADDRESS%.cfg is shown below, for the device type “Polycom VVX 500”. When a “Polycom VVX 500” phone is defined on the system and assigned the MAC address 004fb2012345, the file BWDEVICE_004fb2012345.cfg is created.

Sample Template: BWDEVICE_%BWMACADDRESS%.cfg

<voIpProt.SIP voIpProt.SIP.enable="1">
<voIpProt.SIP.outboundProxy
  voIpProt.SIP.outboundProxy.address="%OUTBOUND_PROXY%"
  voIpProt.SIP.outboundProxy.transport="%TRANSPORT%">
</voIpProt.SIP.outboundProxy>

<voIpProt.SIP.conference
  voIpProt.SIP.conference.address="%CONFERENCE%">
</voIpProt.SIP.conference>
</voIpProt.SIP>
</voIpProt>
<call call.callsPerLineKey="24">  </call>
<reg
  reg.1.displayName="%BWCLID-1%"
  reg.1.address="%BWLINEPORT-1%"
  reg.1.server.1.address="%BWHOST-1%"
  reg.1.server.1.port=""
  reg.1.server.1.transport="%TRANSPORT%"
  reg.1.auth.password="%BWAUTHPASSWORD-1%"
  reg.1.auth.userId="%BWAUTHUSER-1%"
  reg.1.label="%BWEXTENSION-1%"
  reg.1.type="%BWSHAREDLINE-1%"
  reg.1.acd-agent-available="%FEATURE_ACD%"
  reg.1.acd-login-logout="%FEATURE_ACD%"
  reg.1.serverFeatureControl.dnd="%FEATURE_SYNC_DND%"
  reg.1.serverFeatureControl.cf="%FEATURE_SYNC_CF%"
...


BW and Custom Tags

BroadWorks defines a large number of tags that begin with the standard tag %BW, which are defined in the BroadWorks Device Management Configuration Guide (Login Wall).

As shown in this example, the BroadWorks-standard tags begin with “%BW”, including

  • %BWLINEPORT-1%, the SIP Address of Record (AoR), user portion
  • %BWHOST-1%, the SIP domain for the AoR
  • %BWEXTENSION-1%, the extension that can be dialed to reach the user within the user’s organization

But you can define any number of other tags, which BroadWorks can track and manage. In the example above, you see %CONFERENCE%, which BroadWorks can replace with a value you specify.

A Tag Set can be created for each IDPT, with special settings appropriate for that device type. These can be set at the system level, and then overridden at the enterprise, service provider, group, or device levels of the hierarchy.

The PolycomDemo tag set includes six settings. These can be customized for each device.
Each tag name is arbitrary, and the value can be overridden as any text.
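To make the search-and-replace idea concrete, here is a small Python sketch of tag substitution with overrides. It is not BroadWorks code and says nothing about how BroadWorks implements this internally; the function name, the `UDP` system default, and the two-level hierarchy are assumptions chosen for the example.

```python
import re
from collections import ChainMap

# A tag is a name wrapped in percent signs, e.g. %TRANSPORT% or %BWHOST-1%.
TAG_PATTERN = re.compile(r"%([A-Za-z0-9_-]+)%")


def render_template(template: str, *levels: dict) -> str:
    """Replace %TAG% markers in template.

    levels are dictionaries ordered from most specific (device)
    to least specific (system); the first one holding a tag wins.
    """
    values = ChainMap(*levels)

    def substitute(match):
        name = match.group(1)
        # Unknown tags are left in place so gaps are easy to spot.
        return str(values.get(name, match.group(0)))

    return TAG_PATTERN.sub(substitute, template)


system_tags = {"TRANSPORT": "UDP", "CONFERENCE": "conf@voipcarrier.co"}
device_tags = {"TRANSPORT": "TLS"}  # override for one device
line = 'reg.1.server.1.transport="%TRANSPORT%"'
print(render_template(line, device_tags, system_tags))
# reg.1.server.1.transport="TLS"
```

The device-level value wins over the system-level default, which mirrors the override order described above, with the intermediate levels omitted for brevity.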

Sample Generated Config File: BWDEVICE_004fb2012345.cfg

<voIpProt.SIP voIpProt.SIP.enable="1">
<voIpProt.SIP.outboundProxy
  voIpProt.SIP.outboundProxy.address="proxy.voipcarrier.co"
  voIpProt.SIP.outboundProxy.transport="TLS">
</voIpProt.SIP.outboundProxy>

<voIpProt.SIP.conference
  voIpProt.SIP.conference.address="conf@voipcarrier.co">
</voIpProt.SIP.conference>
</voIpProt.SIP>
</voIpProt>
<call call.callsPerLineKey="24">  </call>
<reg
  reg.1.displayName="Frederick Brooks"
  reg.1.address="9195906001"
  reg.1.server.1.address="proxy.voipcarrier.co"
  reg.1.server.1.port=""
  reg.1.server.1.transport="TLS"
  reg.1.auth.password="YouMayNeedYourSanitySomeday"
  reg.1.auth.userId="AlwaysInvestInYourSanity"
  reg.1.label="6001"
  reg.1.type="private"
  reg.1.acd-agent-available=""
  reg.1.acd-login-logout=""
  reg.1.serverFeatureControl.dnd="0"
  reg.1.serverFeatureControl.cf="0"
...


The Process of Testing

When prototyping your new device, you adjust the configuration files to implement the features in your particular network. You do this by revising the templates, and adjusting the tag values to work in your environment.

The Test Plan is Product Definition

The test plan you define is the definition of the product. If it’s an important part of the product, then you must put it in the test plan.

This test plan will be the Regression Test Plan later. It should be written so that all of the tests pass at the point the product is approved.

In a Test Plan, you have a “Device Under Test” (DUT) and tell the tester precisely what steps to take, and what to expect. You have to be specific: define the specific features and settings in your BroadWorks lab.

Record the actual results so that you can review and analyze them later. The person who’s doing testing, the Test Engineer, should record subtleties of behavior.

Answer the question What Did We Learn, because in each test, you’re discovering something about the system.

Example Test Environment

  • Device Under Test, DUT: The Apple Blackhole Phone Model 1
  • DUT Setup:
    • User built in BroadWorks as 229-316-1002
    • BLF monitoring on position 1 for 229-316-1001
  • Lab Phone 1: Polycom VVX 500 at 229-316-1000
  • Lab Phone 2: Yealink at 229-316-1001
  • Cell Phone 1: iPhone X at 919-559-6000

Example Test Plan

  • Test Case ID: TC01
  • Feature: Call with Caller ID
  • Detailed Procedure: On DUT, while the handset is on hook, press the Speakerphone button, and dial 919-559-6000. Do not press “Send”.
  • Expected Results: On the 919-559-6000 cell phone, it should ring, and you should see the caller ID 229-316-1002 (the DUT’s number). Answer on the cell phone, and confirm that you get two-way audio. Hang up on the cell phone, and confirm that the call is disconnected on the DUT.
  • Actual Results: _______________________________________
  • What Did We Learn? _______________________________________
  • Test Pass? TRUE / FALSE
  • Test Case ID: TC02
  • Feature: Busy Lamp Field – monitoring
  • Detailed Procedure: On DUT, watch Busy Lamp Field Position 1. Place a call on test phone 229-316-1001 to 919-559-6000. Check Expected results, then disconnect by hanging up on 229-316-1001.
  • Expected Results: When you can hear ringback on the 229-316-1001 device, you should see the BLF indicator in position 1 on DUT. When the call is answered on cell phone 919-559-6000, the BLF indicator should change to the red bar on position 1 within 1 second. When you hang up on 229-316-1001, the BLF indicator on DUT should return to black within 1 second.
  • Actual Results: _______________________________________
  • What Did We Learn? _______________________________________
  • Test Pass? TRUE / FALSE

Management: Can You Control It?

Once you’ve proven that the core features work, you need to prove you can control it.

Remote Upgrade.

Prove you can replace the software. The simplest, naive way is to replace the firmware file, and trigger phones to reboot. This can cause an unlimited number of phones to retrieve the new software each time.

Instead, most operators wish to selectively upgrade a few phones at a time. You can do this with a custom tag, such as %FIRMWARE%, that specifies the filename, such as “sip-1.56.2.bin”. Then you can change the tag on specific phones that should have the new version of software.

A custom tag can be used to specify the filename the phone should download.

ECG Alpaca can be used to selectively update tags on devices within a BroadWorks platform. This is routinely used to upgrade specific devices, e.g., 1000 devices per maintenance window.

Alpaca can also selectively send the command to the SIP phones to reset them, and then monitor to confirm that all phones reboot properly.

Logging.

All software generates logs. A manageable SIP access device puts the logs somewhere useable. With BroadWorks, you can have the device upload its logs, and BroadWorks can retain those logs for viewing on the Profile Server.

Alternately, some operators configure devices to send logs to a central log collector via syslog.

New Device Procedures.

When onboarding a new device, be sure to specify the requirements for new devices. What should happen to a new device from the manufacturer?

Some operators require that a certain URL be provisioned in the phone, e.g., https://xsp.voipcarrier.co/dms/start

But this means that a new-in-box phone cannot be sent directly to the end user, without the user’s intervention in the process.

Some vendors, including Polycom and Yealink, offer a redirection service. With it, a new or factory-defaulted phone with Internet service can reach out to the phone manufacturer to get a URL for the voice operator. For example, the phone 0004b2012345 could connect to Polycom to be told to go to https://xsp.voipcarrier.co/dms/vvx600. This works only because the voice operator has registered that MAC address, and the URL, with Polycom.

Support: Enabling yourself to do a great job

The final step before rollout of a new device type is ensuring your Customer Support department has adequate access and experience.

Provide Customer Support with the device to use. It’s remarkably common for operators to neglect to give their support folks experience on the devices that customers have. This leads to stress and poor customer service.

Have Customer Support run through the Test Plan to be sure they understand all of the supported features and settings. Since the Test Plan encapsulates the entire product definition, this means that Customer Support will know how to do it well.

The consulting firm, ECG, offers Voice Carrier technical consulting, including the proper rollout of new devices types, and support of Voice access devices in BroadWorks and other multi-vendor environments. www.ecg.co, info@ecg.co

IP Fragmentation of SIP Messages is an enduring source of trouble.

Fragmentation of SIP traffic is a problem on the rise. It appears when everything has been working fine, and seemingly without cause,  some SIP messages are lost in the network. The result is a frustrating scenario where some SIP messages are delivered fine, but others are not.

To explain SIP fragmentation, let’s start at the beginning: Layer 2. Every link on an internet has a Maximum Transmission Unit (MTU) size, which determines the maximum size, in bytes, of a packet that can traverse the link. For Ethernet, this is often 1500 bytes. This means that no single packet larger than 1500 bytes can be carried in one frame across a standard Ethernet network. The duty of the Ethernet interface is to transmit only frames that meet this standard.

However, many applications need to send more data than this in a message. So the Operating System must accommodate both the application’s need for large messages, and the network’s requirement to send packets of a limited size. How is this done?

Basic Internet Protocol has a standard for fragmenting messages so they fit inside the MTU. For example, with an MTU of 1500 bytes, a single 2500 byte SIP message can fit in two frames, or IP datagrams: one fragment may have 1500 bytes, and the remaining 1000 bytes (plus some bytes for headers) will be in the second fragment.
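The arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not a packet-accurate model: it assumes IPv4 with a 20-byte header and no options, an 8-byte UDP header, and a 1500-byte MTU. The function name fragment_sizes is made up for this example.

```python
IP_HEADER = 20   # bytes; assumes an IPv4 header with no options
UDP_HEADER = 8   # bytes


def fragment_sizes(sip_bytes, mtu=1500):
    """Return the IP payload size carried by each fragment of one UDP datagram."""
    # The UDP header is part of the IP payload, so it is fragmented along with the SIP text.
    remaining = sip_bytes + UDP_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes of payload.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


print(fragment_sizes(2500))        # [1480, 1028]
print(len(fragment_sizes(11000)))  # 8
```

Each fragment also carries its own 20-byte IP header on the wire, which is why the first fragment holds 1480 bytes of payload rather than the full 1500.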

All the King’s Horses & All the King’s Men

Fragmentation is fairly cheap for the fragmenter, but reassembling the fragments when they arrive is a fairly expensive operation. Simon Dredge (Metaswitch) has discussed the computational costs of re-assembling UDP fragments, arguing that the re-assembly should be done in a specialized kernel module of a Session Border Controller, rather than where the user applications run:

[T]he receiver has the tricky job of taking these seemingly miscellaneous packet fragments, deciphering them from other packets or packet fragments arriving simultaneously and piecing them back together – somewhat akin to a jigsaw puzzle – but without the aid of the picture on the lid of the box. Naturally, this process takes memory resources to store packet fragments, while waiting for their counterparts to arrive, then processing cycles to compile them . . .  If fragmented packets are not successfully reassembled in a timely manner, then a retransmission will be requested or initiated, thereby further compounding the reassembly issue.

Figure 1. The Applications in the operating system send messages to be delivered as packets on a network. Smaller messages, A, B, and C, are already small enough to fit into a single frame each. But the larger message, D, is fragmented to fit into three frames.

When a large message is fragmented, the separate fragments travel as separate IP datagram packets through the network. It’s possible for any one of those to be lost, but if one fragment is lost, IP has no mechanism to detect that and recover. The Internet Protocol software merely discards all the other fragments at the receiver. It depends on something else to retransmit the entire message again, on the hope that all fragments will be delivered.

Consider this analogy: This is like posting five boxes for shipment with no insurance: if four of them arrive, the mailman doesn’t track down the fifth box. The recipient must determine that a box is missing and ask for its contents to be replaced.

Worse than losing a packet, the loss of a single fragment of message D wastes all the network bandwidth (capacity) that was used sending the remaining fragments. They’re useless.

Figure 2. A single fragment of message D was lost; this results in loss of the entire message. All bandwidth consumed sending the other fragments of D was wasted. Packet loss occurs primarily due to network congestion inside router queues.
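To put a number on this, here is a small sketch of how the chance of losing a whole message grows with the number of fragments. It assumes each fragment is lost independently with the same probability, which is a simplification: real congestion loss tends to come in bursts. The function name is hypothetical.

```python
def message_loss_probability(fragment_loss_rate, fragments):
    """Chance that at least one fragment is lost, which loses the whole message."""
    return 1 - (1 - fragment_loss_rate) ** fragments


# With 1% packet loss, compare an unfragmented message to 2 and 8 fragments.
for n in (1, 2, 8):
    print(n, round(message_loss_probability(0.01, n), 4))
```

Under these assumptions, a message split into 8 fragments is lost nearly eight times as often as a message that fits in a single packet.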

TCP, which is built on top of IP, typically does not use IP fragmentation. Instead, TCP has segmentation. TCP segmentation is optimized for the case of lost segments; when one is lost, TCP slows down the transmission, and retransmits the missing segment.

Fragmentation and SIP

Figure 3. A small SIP message, such as this NOTIFY, easily fits in a single UDP message.

SIP is usually used over UDP. When the SIP messages are small, this is no problem. In fact, for normal phone calls (e.g., SIP on PSTN gateways and SIP trunks), individual SIP messages almost always fit in a single UDP message, well under 1500 bytes, and therefore no fragmentation occurs at all (See Figure 3).

But as services become more sophisticated, the size of SIP messages grows. In particular, “Busy Lamp Field”, also known as “Line State Monitoring”, will often be responsible for sending large data sets via SIP messages. To receive the status information on the 11 people monitored on my Polycom VVX 600, my phone receives around 11,000 bytes of data in a single SIP NOTIFY message. If I were using SIP over UDP, that would take 8 IP fragments.

Figure 4. This Polycom VVX600 receives around 11,000 bytes of data in a single SIP NOTIFY to refresh the 11 Busy Lamp Field monitoring sessions.

Even basic fragmentation has been a problem for some SIP systems. Back in 2009, Eric Hernaez (SkySwitch) reported that a “major vendor’s” switch was crashed by SIP fragments. But today, most SIP/UDP platforms support fragmentation with some reliability.

There are some approaches for reducing SIP message size. Alex Balashov (Evariste Systems) describes the challenges of reducing SIP header size in a pure-proxy system, but suggests some SIP headers you can remove in many cases. In 2009, Thomas Gelf suggested dropping unnecessary codecs, and using SIP compact header forms, like “m” instead of “Contact”.
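As an illustration of the compact-header idea, the sketch below rewrites long header names to the single-letter forms defined in RFC 3261. It is a minimal example that assumes a well-formed message with CRLF line endings, and it leaves folded (continuation) lines alone. The function name compact_headers is made up here; in a real network this rewriting would normally be a setting on the phone, proxy, or SBC rather than a script.

```python
# Compact header forms defined in RFC 3261.
COMPACT = {
    "call-id": "i", "contact": "m", "content-encoding": "e",
    "content-length": "l", "content-type": "c", "from": "f",
    "subject": "s", "supported": "k", "to": "t", "via": "v",
}


def compact_headers(message):
    """Rewrite header names in the head of a SIP message to compact form."""
    head, sep, body = message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    out = [lines[0]]  # the request or status line is left untouched
    for line in lines[1:]:
        if line[:1] in (" ", "\t"):  # folded continuation line: keep as-is
            out.append(line)
            continue
        name, colon, value = line.partition(":")
        short = COMPACT.get(name.strip().lower())
        out.append(f"{short}:{value}" if short and colon else line)
    return "\r\n".join(out) + sep + body
```

The savings are only a few bytes per header, so this helps mainly when a message is hovering just above the MTU boundary.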

Some vendors are opting to avoid SIP over UDP entirely. For example, Simwood’s Mobile SIM Registration routinely sends SIP messages over 1500 bytes, and they adopted SIP over TCP as their standard. Yann Espanet illustrates the inefficiency of multiple fragmentation steps in his 2009 article. He points out that Microsoft OCS didn’t support SIP over UDP at all, choosing to support only TCP.

Figure 5. When the input MTU is larger than the output MTU, a standards-compliant IP router will perform fragmentation again, breaking the IP datagrams into more fragments. In this example, the fragmented message D may be inefficiently fragmented further.

SIP Fragmentation problems spiked again in 2016, when we saw numerous incidents where SIP fragmentation contributed substantially to network problems. Networks often run fine until the SIP messages grow just enough to cross the MTU boundary.

Figure 6. Sending a Large SIP message via UDP, like this one with 2811 bytes of payload and perhaps 500 bytes of headers, will cause fragmentation. All of the fragments have to be received for any of them to be useful.

Routing, Switching, Fragmenting

The Internet Protocol Standard requires that routers perform fragmentation. Fragmentation is often worse going through non-standards-compliant Layer 3 Switches. These devices can often do some of the functions of Routers, but not quite all. For example, the Cisco Catalyst 3750 can route packets, but cannot perform fragmentation in some MPLS scenarios, and cannot signal that it cannot fragment. This results in MPLS VPNs with blackholes for certain packets. With SIP, this appears when some SIP messages are delivered while others are dropped silently by the network. And it’s hard to know just how much overhead the MPLS network will take because the MPLS size depends on the exact MPLS path in use at that moment.

Figure 7. Some “router”-like devices refuse to perform IP fragmentation in violation of RFC 1812. A common scenario occurs when a “Layer 3 Switch” needs to transmit a 1500-byte frame to a link with MPLS overhead. Because the resulting frame of 1518 bytes is too large to be transmitted, the frame is silently dropped.

 

Some IP routing and switching devices along the path reassemble IP datagrams from fragments. For example, the Cisco ASA firewall has a fragment buffer that, by default, allows only 200 fragments to await reassembly.

Figure 8. The Cisco ASA reassembles incoming fragments to perform packet analysis on the complete IP datagram to enforce security policy, storing them in a fixed-size buffer. Any fragments that overflow the buffer are discarded. For SIP, this requires retransmission of the entire SIP message.

The buffer limitations in these devices can be a real problem. The 200-fragment capacity of a Cisco ASA5500 can easily be overwhelmed by normal traffic of thousands of SIP phones. (Add this as a reason data firewalls create trouble with large-scale VoIP deployments.)

Session Border Controllers and Fragmentation

Some versions of the market-leading Session Border Controller, the Oracle Acme Packet SBC, have limitations on handling fragments. The “Traffic Manager” between the Network Processors and the CPU controls the rate at which IP fragments are delivered from the network to the CPU in the popular 4250, 3800 and 4500 platforms. Terry Kim has a great depiction of the traffic manager for the Oracle Acme Packet SBC. He highlights that the Oracle Acme Packet SBC handles fragmented SIP as “untrusted” traffic. So if trusted endpoint devices, like customers, are sending fragmented SIP regularly, then the SBC doesn’t provide them the same capacity that they would have if they were sending non-fragmented traffic.

Figure 9. The Oracle Acme Packet SBC’s “Intelligent Traffic Manager” regulates traffic, including fragments, to the Signaling Processor. Diagram: Terry Kim.

The Oracle Acme Packet SBC was designed in an era that believed SIP over TCP would become more dominant, so that fragmented UDP should be rare. But the single rate limit for SIP fragmented traffic can be a real performance limitation when fragmented SIP becomes very common.

The Metaswitch Perimeta was designed much later, after the industry discovered that most SIP is still operating over UDP. The Perimeta was designed for fragmented SIP to be very common.

TCP: Segments over Fragments

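The SIP standard, RFC 3261, mandates a congestion-controlled transport such as TCP for large requests: those within 200 bytes of the path MTU, or larger than 1300 bytes when the path MTU is unknown. This is meant to prevent fragmented SIP. Indeed, SIP over TCP does solve many of the problems by replacing IP fragmentation with TCP segmentation.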
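The size rule in RFC 3261 (a request within 200 bytes of the path MTU, or larger than 1300 bytes when the path MTU is unknown, must use a congestion-controlled transport such as TCP) can be expressed as a short check. This is a sketch of the rule only; the function name is hypothetical, and it reads “within 200 bytes” as “at or above the MTU minus 200”.

```python
def needs_reliable_transport(request_bytes, path_mtu=None):
    """Apply the RFC 3261 size rule for choosing TCP over UDP for a request."""
    if path_mtu is None:
        # Path MTU unknown: anything over 1300 bytes should not go over UDP.
        return request_bytes > 1300
    # Path MTU known: stay at least 200 bytes under it.
    return request_bytes >= path_mtu - 200


print(needs_reliable_transport(900))          # False
print(needs_reliable_transport(2811))         # True
print(needs_reliable_transport(1350, 1500))   # True
```

The 200-byte margin exists because responses and proxied requests tend to grow as headers such as Via are added along the path.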

TCP segments slice up the stream of SIP messages into neat segments that fit within MTUs. Critically, TCP provides a fast and efficient mechanism for filling in gaps in the stream. Contrast this to IP fragmentation, where the entire SIP message must be re-sent any time one fragment is lost.

However, a commonly-encountered problem using SIP/TCP is in limitations of Highly-Available Session Border Controllers, where the TCP synchronization and retransmission mechanism means that the state of any SIP/TCP connection may not be efficiently replicated to the standby SBC. For example, if SBC-A is active, then reboots, the SIP Phone using TCP typically has to reconnect to SBC-B. The SIP phone has to detect that the link to SBC-A has been lost. Both the Oracle Acme Packet SBC and the Metaswitch Perimeta SBC transmit a TCP RST (Reset) message when possible to notify the SIP phones that they need to re-register.

When using SIP/UDP, the failover from SBC-A to SBC-B can be a non-event: the IP address in use simply moves from one SBC instance to another because both SBC units in the pair know all of the state of the SIP registrations, subscriptions, and phone calls to all endpoints. But with SIP/TCP, the re-registration storm can be substantial, and disruptive. Imagine a network of 50,000 endpoints, all forced to re-register each time you reboot a single SBC instance.

The Fragments Are Coming

If you operate basic SIP trunking, or even some forms of today, you might not experience a lot of problems with fragmentation. But you should expect messages to grow larger, especially with integration of Fixed and Mobile networks, and the large SIP messages of IMS networks. How do you test? Send big SIP messages!
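One way to “send big SIP messages” in a lab is to pad an otherwise ordinary request until it exceeds the MTU. The sketch below builds an OPTIONS request of a chosen size. Everything specific in it is an assumption for illustration: the X-Padding header, the host name, the 192.0.2.10 test address, and the function names. Some SIP stacks may reject very long header lines, so treat a rejection as a finding in its own right, and only point this at equipment you own.

```python
import socket


def build_big_options(target_bytes, host="sbc.example.invalid"):
    """Build a SIP OPTIONS request padded to at least target_bytes."""
    lines = [
        f"OPTIONS sip:{host} SIP/2.0",
        "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bKfragtest1",
        "Max-Forwards: 70",
        "From: <sip:fragtest@192.0.2.10>;tag=fragtest",
        f"To: <sip:{host}>",
        "Call-ID: fragtest-1@192.0.2.10",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
    ]
    base = "\r\n".join(lines) + "\r\n\r\n"
    # Hypothetical padding header; its only job is to inflate the message.
    pad_needed = max(0, target_bytes - len(base) - len("X-Padding: \r\n"))
    lines.insert(-1, "X-Padding: " + "x" * pad_needed)
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_udp(message, host, port=5060):
    """Send one datagram; the operating system fragments it if it exceeds the MTU."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))


print(len(build_big_options(3000)))  # 3000
```

Stepping the target size from just below 1500 bytes to several thousand, and watching which sizes stop getting answers, is a simple way to find where fragments are being dropped.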

Michael Dell (Metaswitch) sums up the state of the art:

SIP messages are only getting bigger, so it is something you need to design and test for now, or it will be a major headache when it does happen.  … That could involve pro-active pressure testing to find and resolve the problem areas before they bite you for real, and also monitoring largest message sizes flowing around your network so that you can predict when you’re going to start seeing more fragmentation and act accordingly.

 

When we teach classes on VoIP networks, we discuss the variety of SIP standards that can be used by working systems. For example, identifying the calling party varies between platforms: One system uses P-Asserted-Identity to indicate the caller, and another uses From. Then there’s the codec used for audio: One system uses G.722 and another uses G.722.2. One system expects national telephone numbers, and another system expects international-formatted numbers. The list goes on.

When you set up two SIP systems to exchange phone calls, you need to be aware of these issues. If you’re lucky, the two systems have a lot in common. If you’re not, you may have a lot to fix.


Number Formatting Interop Checklist

1. Establish format for local telephone numbers.
Not always required.

2. Establish format for national telephone numbers.
E.164 With Plus is recommended, e.g., +12292442099

3. Establish format for international telephone numbers.
E.164 With Plus is recommended, e.g., +12292442099

4. Determine which short codes are supported, such as emergency codes 911, 111, and 0

5. Determine what control codes are supported, such as *67


Popular Telephone Number Formats

One of the most common questions for SIP interop is how the called telephone number will be formatted. There are several popular formats, and they occur in the Request-URI (after the “INVITE”) and in the To header.

This article is intended to familiarize you with many of the common options. The software you’re using (BroadWorks, Metaswitch, Asterisk, Sonus, Genband, etc.) will have certain capabilities and defaults. If you can find a way to interop end-to-end between the two call-control servers, without having to rewrite the number in a Session Border Controller in the middle, then you prevent substantial hassle.

Complete National Number

Perhaps the most common format is the complete national number. For example, in the United States, you might see this complete national number: 9193160013.

INVITE sip:9193160013@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:9193160013@vwave.net;user=phone>

Local Number

When the SIP invite represents the digits a user actually dialed on his keypad, the local number will be most common.

INVITE sip:3160013@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:3160013@vwave.net;user=phone>

Within the UK, you might call a London number by dialing eight digits:

INVITE sip:74931232@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "UK User 1" <sip:07981555555@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:74931232@vwave.net;user=phone>

When a human is dialing the number to place a call across the PSTN, the receiving system (INVITE User Agent Server, UAS) will need to determine the intended PSTN telephone number. One popular method is to apply the local area code of the calling party. For example, if this call is dialed by user “u2292442099”, and their US area code is “229”, then the UAS may reasonably interpret the dialed digits 3160013 as +1-229-316-0013 by prepending the national and area code to the dialed digits.
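The “apply the caller’s area code” method can be sketched as follows. This is a minimal example that assumes the North American Numbering Plan (country code 1, 3-digit area code, 7-digit local number). The function name local_to_e164 is made up for illustration.

```python
def local_to_e164(dialed, caller_number, country_code="1", area_code_len=3):
    """Expand a 7-digit local number using the caller's area code (NANP assumed)."""
    digits = "".join(ch for ch in dialed if ch.isdigit())
    caller = "".join(ch for ch in caller_number if ch.isdigit())
    # Drop the country code from the caller's number if it is present.
    if len(caller) == 11 and caller.startswith(country_code):
        caller = caller[len(country_code):]
    if len(digits) != 7 or len(caller) != 10:
        raise ValueError("expected a 7-digit dialed number and a 10-digit caller number")
    return f"+{country_code}{caller[:area_code_len]}{digits}"


print(local_to_e164("3160013", "2292442099"))  # +12293160013
```

The function has to know the caller’s own number to make sense of the dialed digits, which is a hint at why this format travels poorly between carriers.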

This is a very awkward format to support between carriers or systems, and quite impractical if multiple local areas are involved.

Dial by Extension

In PBX-like environments, “extensions” can be dialed. An extension is a locally-defined telephone number, but one that is not usable for routing between carriers. You can expect to see this only when the INVITE Request-URI represents digits that a user is dialing personally.

INVITE sip:2207@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:2207@vwave.net;user=phone>

The extension 2207 probably only makes sense in the group owned by “u2292442099”. The UAS for “vwave.net” has to know what 2207 means in that context.

E.164 Without Plus

This model places the call with a leading country code. In the North American Numbering Plan (US, Canada, Caribbean), the country code is 1; for the UK, it is 44.

INVITE sip:19193160013@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:19193160013@vwave.net;user=phone>

INVITE sip:442074931232@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:442074931232@vwave.net;user=phone>

E.164 With Plus

Preferred: I recommend this format whenever possible, because it’s the only one with a token that signals the type of number it is: the leading Plus. By including the Plus before the number, all parties involved understand that the next digits will be the country code, followed by the national and local number.

INVITE sip:+19193160013@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:+19193160013@vwave.net;user=phone>

INVITE sip:+442074931232@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:+442074931232@vwave.net;user=phone>

 

While it’s easy to say this is preferred, it’s not the right fit for most cases where an ordinary human is dialing the call. Most keypads don’t have an obvious + dialing symbol, and users conventionally dial a shorter number.
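Because the leading Plus is an unambiguous token, a receiving system can recognize this format with a simple pattern. The sketch below assumes the usual E.164 limit of 15 digits and a country code that does not start with 0. The function name is hypothetical.

```python
import re

# A plus sign, then 2 to 15 digits, the first of which is not 0.
E164_PLUS = re.compile(r"^\+[1-9][0-9]{1,14}$")


def is_e164_with_plus(user_part):
    """Return True when the user part of a URI is a global number with a leading plus."""
    return bool(E164_PLUS.match(user_part))


print(is_e164_with_plus("+19193160013"))  # True
print(is_e164_with_plus("19193160013"))   # False
```

Without the Plus, the same test cannot be written: “19193160013” could be a US number with its country code or a national number in some other plan.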

Steering Digits

Zounds, it’s historical! This method dates from around 1950, when “The subscriber, having dialed the first steering digits, then dials the ordinary number of the distant subscriber and the call is completed without any operator intervening.” It prepends digits to the actual telephone number to mean certain things; for example, “99911” prepended to the telephone number may mean to use Long Distance Gateway 1, and “99912” indicates to use Long Distance Gateway 2.

INVITE sip:999119193160013@vwave.net:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
To: <sip:19193160013@vwave.net;user=phone>

Despite the vintage charm, this method is worth avoiding when possible. There are several good alternatives in SIP:

  • “tgrp” from RFC4904: In this method, a trunk group identifier is a tag in the user portion of the URI.
    INVITE sip:+19193160013;tgrp=TG-1@vwave.net:5060;user=phone SIP/2.0
    Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
    From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
    To: <sip:+19193160013@vwave.net;user=phone>
  • “dtg” for Destination Trunk Group: In this method, a trunk group identifier is a URI parameter. This seems to be a BroadSoft BroadWorks extension.
    INVITE sip:+19193160013@192.168.42.2:5060;dtg=1;user=phone SIP/2.0
    Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
    From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
    To: <sip:+19193160013@192.168.42.2;dtg=1;user=phone>
  • Domain: In this back-to-basics method, a distinct domain name represents the service to be used.
    INVITE sip:+19193160013@tg-1.vwave.net:5060;user=phone SIP/2.0
    Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
    From: "u2292442099" <sip:u2292442099@vwave.net>;tag=90C2A5B1-AAFDEC86
    To: <sip:+19193160013@tg-1.vwave.net;user=phone>
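To show why the “tgrp” approach is easier to handle than steering digits, here is a small sketch that separates the telephone number from any parameters carried in the user portion of the URI. It is a deliberately simple parser that assumes a URI shaped like the examples above; the function name is made up, and a production system should use its SIP stack’s own URI parser.

```python
def split_user_params(uri):
    """Split the user part of a SIP URI into the number and its parameters."""
    _scheme, _, rest = uri.partition(":")
    user, at, _hostpart = rest.partition("@")
    if not at:
        raise ValueError("URI has no user part")
    number, *params = user.split(";")
    # Each parameter looks like name=value; keep the name and the value.
    parsed = dict(p.partition("=")[::2] for p in params)
    return number, parsed


print(split_user_params("sip:+19193160013;tgrp=TG-1@vwave.net:5060;user=phone"))
# ('+19193160013', {'tgrp': 'TG-1'})
```

The number comes out untouched and the routing hint is a named field, whereas with steering digits the receiver has to know in advance how many leading digits to strip.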

SIP Username

In non-PSTN environments, you can see calls placed to non-numeric identities.

INVITE sip:alice@atlanta.com:5060 SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:bob@biloxi.com>;tag=90C2A5B1-AAFDEC86
To: <sip:alice@atlanta.com>

 

The SIP URI “sip:alice@atlanta.com” makes sense within the SIP network, but has no direct mapping to the E.164-dominated telephone network. This is essentially equivalent to dialing by extension.

Don’t be misled by usernames that appear to be a number, such as this:

INVITE sip:ext.6100@26.128.128.1:5060 SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "u2292442099" <sip:bob@biloxi.com>;tag=90C2A5B1-AAFDEC86
To: <sip:ext.6100@26.128.128.1>

The “ext.6100” in this example makes it incompatible with a telephone number; there’s no automatic mapping between ext.6100 and a telephone number. This is intended to route a call to the user known to the UAS with the username “ext.6100”.

What is “user=phone”?

The examples I’ve shown mostly include “user=phone”. This represents that the user portion of the URI (to the left of the @ sign) is to be interpreted as a telephone number. These can be readily rewritten as “tel:” URIs, as described in RFC3966.
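As a sketch of that rewrite, the function below converts a SIP URI marked “user=phone” into a “tel:” URI. It handles only global numbers (those with a leading Plus), because RFC 3966 requires extra context for local numbers, and it is not a full URI parser. The function name is hypothetical.

```python
def sip_to_tel(uri):
    """Convert a sip: URI with user=phone and a global number into a tel: URI."""
    scheme, _, rest = uri.partition(":")
    user, at, hostpart = rest.partition("@")
    uri_params = hostpart.split(";")[1:]
    if scheme.lower() not in ("sip", "sips") or not at:
        raise ValueError("not a SIP URI with a user part")
    if "user=phone" not in uri_params:
        raise ValueError("user part is not marked as a telephone number")
    if not user.startswith("+"):
        raise ValueError("only global numbers (leading +) are handled in this sketch")
    return "tel:" + user


print(sip_to_tel("sip:+19193160013@vwave.net:5060;user=phone"))  # tel:+19193160013
```

The “ext.6100” style of username fails the first checks, which matches the point above: it is a username, not a telephone number.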

 

VoIP drove down the cost of making phone calls. We love that about VoIP: free long distance! In the telecom industry now, the idea that calls within a country would cost a retail user more than local calls seems quaint.

But the low cost of VoIP calling has brought a new adversary: “Robocalling,” so called because “Robots” (computers with pre-recorded messages) place phone calls. Many of these calls are placed in violation of existing laws. According to US Federal Trade Commission official Lois Greisman, lawbreakers “can place robocalls for less than one cent per minute,” and I suspect that only counts the calls that are actually answered. And though the call may be dialed by an automaton, it may be connected to a live person when it’s answered.

Elderly Victimized by VoIP Robocalling

The main ones who suffer are non-VoIP Users, disproportionately the elderly. Older people see no reason to switch from the same telephone service they’ve had most of their lives, and they retain legacy, POTS-based phone service. In rural areas, this is often served by feature-poor TDM equipment with only “Custom Local Area Signal Services” (CLASS) features, developed in the 1980s in the rotary phone era. The best available Robocall blocking uses Simultaneous Ring, which is not available in the CLASS feature set in classic telephone service.


Because the pain of these automated calls is felt so acutely by elderly citizens in America, the US Senate Special Committee on Aging is pushing for laws to force Service Providers to provide blocking. That the Aging Committee is pushing to regulate telephone services serves as evidence of the gravity of the problem.

By building efficient networks, VoIP technologists unwittingly created a special problem for users of legacy technology. (Hint: if you have aging relatives, be sure to upgrade their phone service to a modern option.)

“We now have this onslaught—it’s terrible,” said Indiana Attorney General Greg Zoeller, who added that the scams are “primarily directed at seniors.” Scammers attempt to trick seniors into paying for taxes they don’t owe, or providing banking information for lotteries they haven’t won.

The Wall Street Journal reports on one 88-year-old Arkansan who lost $110,932. And often, blocking or changing the Caller ID of the Robocaller is a key part of the scam. Another 57-year-old person lost nearly $250,000 in a scam relying on a “device that conceals the identity and location of the caller.”

The Nature of the Problem

Attackers set up computers to launch outbound attacks, undoubtedly through VoIP protocols, and send them through grey-market PSTN Termination Providers. These are firms that offer low rates to accept a SIP telephone call and deliver it to a PSTN phone. But they may do nothing to ensure that the caller IDs used are accurate, and may respond slowly to complaints against their customers. Further, they may choose to not participate in CDR-based Call Traces, the de facto industry standard method used to determine the source of a call. These firms can be anywhere in the world, and they need not exist for long.

Scammers and other illegal robocallers send calls through multiple networks, sometimes using Least Cost Routing, or stolen phone service.

The PSTN Termination Providers typically don’t actually operate PSTN Gateways themselves. That is, they have no ability to directly send the traffic into the legacy TDM SS7/C7 telephone network to reach any phone in the world. So the PSTN Termination Providers buy services from more legitimate PSTN Gateway (PSTN GW) Providers. These PSTN Gateway Providers then have direct connections to the major telephone carriers of the world, and paths to reach all the smaller ones as well.

AT&T is an example of a PSTN Gateway (GW) Provider. In an FCC Hearing on Robocalling in September 2015, Adam Panagia of AT&T reported that they often receive a single coordinated Robocalling campaign through 20 or more wholesale carriers. Even if a legitimate call center in India is routing calls to the US, AT&T may receive calls from that single call center through 30 to 50 wholesale customers.

VoIP Fraud Plays a Role

In addition to the use of PSTN Termination Providers to perpetrate Robocalling fraud, some attackers use exploited VoIP Telephone Providers. In these cases, a VoIP provider has a security vulnerability, such as a SIP account with a simple password, or a customer device that is not protected by a firewall.

When attackers get access to this service provider, they use it as one of the paths to send calls to the victims. The defrauded Service Providers whose service is stolen are victims along with those who are called and then scammed. The poor security of many VoIP Service Providers contributes to the Robocalling problem, but Service Providers have the power to improve their security.

Solving the Problem Technically

Blocking Historically Disallowed: Historically, Telephone Service Providers in the US were prevented from blocking calls sent to a subscriber. If you bought service from one, and had a telephone number, then the Service Provider was obligated to deliver every call. The FCC changed that in June 2015, in a Declaratory Ruling, and by September of that year they were running workshops to encourage the proper kind of blocking.

The FCC made a welcome change, and though some service providers were already providing some types of blocking before the June 2015 ruling, it opens options for blocking calls even among the scrupulous. So the question is now: which calls should be blocked?

Caller ID Trusted: Currently, Caller ID is fundamentally trusted from the original caller. Unlike modern email, there is no technical mechanism to confirm that the caller ID provided on a call is actually genuine. VoIP technologies, such as SIP From, “Trust Domains” P-Asserted-Identity, and SIP “Identity”, have not helped because they only apply to limited areas of the network, and don’t provide any certainty as a call travels from a call center in India to a retiree in Iowa.

Verifiable Caller ID is a key requirement to block scammers, according to Scott Mullen, VP Technology of Bandwidth.com, a major PSTN termination and origination firm in the USA with extensive SS7/TDM interconnects. He noted in the FCC workshop that toll-free fraud is a major problem they deal with. Even International calls come through in a way that tricks the US recipient into believing it’s a call originating from the US.

Problem Calls Distributed: Further, calls from scammers are distributed across many networks. One PSTN GW Provider has some data, while another has different data. It’s not easy to coalesce the information into a unified view to make smart decisions about the call.

Can’t scan a phone call before delivery: Another difficulty for scanning telephone calls lies in the real-time nature: unlike an email that can be analyzed in its entirety before it is deposited into your mailbox, only a few bits of information are available to a call-blocking system: (a) Time of call, (b) Claimed Calling party number, (c) Called party number, (d) Input source (such as a particular wholesale customer link).

I should note that we can learn about calls after they are delivered: Robocalls that are answered are usually disconnected very quickly. Short call durations provide some clues after some of the calls are delivered, which may be used to improve the go/no-go decision for future calls.
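As a rough illustration of using call duration as a clue, the sketch below computes, for each claimed calling number, the fraction of its answered calls that ended very quickly. The record layout (caller, duration in seconds), the 10-second threshold, and the function name are all assumptions made for this example; real CDR formats differ by platform, and a high ratio is only a hint, not proof.

```python
from collections import defaultdict


def short_call_ratio(cdrs, threshold_seconds=10):
    """Map each calling number to the fraction of its answered calls under the threshold."""
    totals = defaultdict(int)
    shorts = defaultdict(int)
    for caller, duration in cdrs:
        totals[caller] += 1
        if duration < threshold_seconds:
            shorts[caller] += 1
    return {caller: shorts[caller] / totals[caller] for caller in totals}


# Hypothetical answered-call records: (claimed calling number, duration in seconds).
cdrs = [
    ("+12025550100", 3), ("+12025550100", 4), ("+12025550100", 120),
    ("+12292442099", 300),
]
print(short_call_ratio(cdrs))
```

Because Caller ID can be falsified, a number that scores badly may belong to an innocent party whose number was borrowed by the scammer.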

Technology Types: Calls do flow across both SIP and TDM networks, but scam calls almost always originate as SIP. The cost of maintaining TDM infrastructure appears to be too high for scammers, as it probably invalidates the business model.

Give me a call, never

For the moment, Call Blocking based on Caller ID reputation is providing some relief. These services work by keeping a database of known scammer caller IDs, then blocking calls from numbers in that database.

The industry de facto standard for Personal Blocking Services is NOMOROBO, released in 2013 after winning a US Federal Trade Commission competition for technical solutions to Robocalling. At the September 2015 FCC workshop, NOMOROBO was that annoying better sibling, with the FCC mom asking all the other participants, in effect, “Why can’t you be more like NOMOROBO?”

NOMOROBO is free, and blocks calls from Caller IDs that are repeatedly used for annoying calls. It’s available to anyone who has Simultaneous Ring, a feature found on BroadWorks, Metaswitch, and most other modern telecom platforms.

How Grandma Gets Attacked

Calls normally route through one or more PSTN Termination Providers before they reach an ordinary telephone company and their victim.

Let’s first look at the normal call path from an attacker to the Victim. In this case, we’ll say that the victim is a subscriber of “Adams Rural Telephone Company.” The Attacker launches outbound calls, and some of those calls route to “PSTN Termination Provider #1”. This company determines where the calls should route, and, in those cases where Adams’ subscribers are the victims, routes the calls to Adams. This supposes that both Adams and the Attacker do business with the same PSTN Termination Provider, but in many cases there will be several intermediate PSTN Providers.

How Grandma Blocks the Attack

NOMOROBO depends on the Simultaneous Ring feature. To use it, Grandma would set up her calls to ring both her phone and the Personal Blocking Service at the same time.

A Personal Blocking Service, like NOMOROBO, receives the call through Simultaneous Ring / Call Forking set up on the Victim’s phone. If the Caller ID is in the blacklist, NOMOROBO answers the call, which prevents the Victim from answering.

In this way, NOMOROBO receives an inbound call with the Caller ID provided by the Attacker. If NOMOROBO finds this Caller ID in its blacklist database, it can answer the call – thus ending the call for the Victim. The victim experiences only a single telephone ring.
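The decision such a service makes on each forked call can be sketched in a few lines. This is an illustration of the general blacklist idea only, not a description of how NOMOROBO is actually implemented. The function names and the sample numbers are made up.

```python
def normalize(caller_id):
    """Keep digits only, so '+1 (202) 555-0100' and '12025550100' compare equal."""
    return "".join(ch for ch in caller_id if ch.isdigit())


def should_answer(caller_id, blacklist):
    """Return True when the blocking service should answer, and so absorb, the call."""
    return normalize(caller_id) in blacklist


# Hypothetical blacklist with a single known-bad Caller ID.
blacklist = {normalize("+1 202 555 0100")}
print(should_answer("+12025550100", blacklist))  # True
print(should_answer("+12292442099", blacklist))  # False
```

When the function returns False, the service simply lets its leg of the call ring unanswered, and the Victim’s phone keeps ringing as usual.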

NOMOROBO is clearly a market leader. They are benefiting from the “Network Effect,” where their popularity makes the service work better by aggregating data. NOMOROBO certainly receives many simultaneous calls during Robocall campaigns, and can exploit the concurrency to make smarter decisions.

But NOMOROBO must depend on the caller ID and called ID only, and cannot do anything to improve the quality of the data. NOMOROBO does not know when calls end (i.e., the duration of answered calls), so it cannot improve its decisions based on calls that turn out to be undesirable and are quickly ended.

NOMOROBO also cannot analyze the audio of the phone call. In the US, many Robocalls begin with dead silence, which is unnatural for ordinary person-to-person calls. Normal humans use this to decide which calls to hang up quickly. This mismatch between robocalls and human-dialed calls should be useful to an automated blocking system, if it were available.

Service-Provider Based Blocking

What happens when the victim does not have the Simultaneous Ring service, or lacks the skills to set it up? Or what if a Service Provider would like to block Robocalls for all of its subscribers? Telephone Service Providers can also block calls by using a network-based service.

Consider the network path shown. Normally calls flow from the Attacker, to an intermediate, and finally to the victim’s service provider.

A Service-Provider Based Blocking Service can provide protection for every customer of a service provider. Using SIP, the calls can be routed through an intermediate device that checks the caller ID. Or, potentially, it could analyze the audio or look for other signatures of fraud.

Service Providers (SPs) can provide protection by using SIP call routing to route calls through an intermediate service. Instead of immediately sending every call to the victim, the SP can route the call to an intermediate service (or device) that checks the blacklist database. To borrow the term from email, only the calls that are not spam are ultimately delivered to the end user.

The Metaswitch SIP Robocall Blocking Service is a current market example of a service using SIP to address this problem. Service Providers can route their calls to the service, which blocks calls coming from caller ID numbers in the blacklist database.

This network model can provide robocall-blocking for an entire Service Provider, and not just for a single user. It’s also far more efficient than Personal Call Blocking, because it doesn’t require all the setup for a media path through the legacy PSTN to check the database for each call.

It also has opportunities for future development: with a stateful SIP proxy, an SP robocall blocking service could know when calls start and end. And once privacy concerns are handled, this approach could analyze the audio for hints, such as dead silence at the start of a call.
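To make the call-duration idea concrete, here is a minimal Python sketch of the bookkeeping a stateful proxy could feed. It is an assumption-laden illustration, not a description of any vendor's product: the class name, the thresholds (6 seconds, 5 calls, 80%), and the event methods are all hypothetical.

```python
from collections import defaultdict

class CallDurationTracker:
    """Track answered-call durations per caller ID (hypothetical sketch)."""

    def __init__(self, short_call_seconds=6.0, min_calls=5, short_ratio=0.8):
        self.short_call_seconds = short_call_seconds  # "quickly ended" threshold
        self.min_calls = min_calls                    # evidence needed before judging
        self.short_ratio = short_ratio                # share of short calls that is suspicious
        self._active = {}                             # call_id -> (caller_id, answer time)
        self._durations = defaultdict(list)           # caller_id -> list of durations

    def call_answered(self, call_id, caller_id, timestamp):
        self._active[call_id] = (caller_id, timestamp)

    def call_ended(self, call_id, timestamp):
        caller_id, answered_at = self._active.pop(call_id)
        self._durations[caller_id].append(timestamp - answered_at)

    def looks_like_robocaller(self, caller_id):
        durations = self._durations.get(caller_id, [])
        if len(durations) < self.min_calls:
            return False
        short = sum(1 for d in durations if d <= self.short_call_seconds)
        return short / len(durations) >= self.short_ratio
```

A caller ID whose answered calls are almost all hung up within a few seconds becomes a candidate for the blacklist, which is information a Personal Blocking Service never sees.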

But like Personal Call Blocking, this approach still relies on the caller ID, which can be faked for each call.

Nothing is any good if other people like it

Blacklist based blocking services work today precisely because they are not popular. Today’s blocking services rely on calling party ID, as if that’s trustworthy. Scammers do have some incentive to place calls from the same caller ID repeatedly: once they find a telephone number with a matching CNAM caller name that people will answer – with names like “Internal Rev Service,” or “EDUC LOTTERY” – they seem to stick with that same number.

But Robocallers are already adapting to robocall blocking services. Some are calling from randomly-selected working, legal telephone numbers. This method completely defeats simplistic blacklist databases.

This means we really need a trustworthy caller ID. And some in the industry are working to provide it.

Engineers in the Internet Engineering Task Force (IETF) and the Alliance for Telecommunications Industry Solutions (ATIS) have developed a standard called STIR, “Secure Telephone Identity Revisited”. When used as designed, each call using STIR will include a signature as evidence that the calling party has the right to call from the telephone number they’re using.

This would be implemented at the entry to the SIP-PSTN network, ideally at the customer’s PBX or at their first Service Provider Interface. For example, a BroadWorks service provider may use SIP authentication to confirm the identity of a caller, and then create the STIR cryptographic signature to confirm the legitimacy of the caller ID.
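As a rough illustration of what “a signature as evidence” means, the Python sketch below builds a token shaped like the PASSporT object that STIR uses (RFC 8225): a header, a set of claims naming the originating and destination numbers, and a signature over both. Real STIR signs with ECDSA (ES256) and a certificate; to stay within the standard library, this sketch substitutes an HMAC with a shared secret, which is an assumption made purely for illustration. The function names and telephone numbers are hypothetical.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWT-style tokens."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def sign_call(orig_tn: str, dest_tn: str, iat: int, key: bytes) -> str:
    """Build a PASSporT-shaped token. HMAC is a stand-in for the real ES256 signature."""
    header = {"alg": "HS256", "typ": "passport"}  # real STIR uses "ES256" plus a certificate URL
    claims = {"dest": {"tn": [dest_tn]}, "iat": iat, "orig": {"tn": orig_tn}}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":"), sort_keys=True).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

def verify_call(token: str, claimed_orig_tn: str, key: bytes) -> bool:
    """Check the signature, then check that the signed number matches the caller ID."""
    try:
        head, body, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(key, f"{head}.{body}".encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        return False
    padded = body + "=" * (-len(body) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims["orig"]["tn"] == claimed_orig_tn
```

The point of the exercise: a terminating carrier that receives the token can reject a call whose caller ID does not match the signed originating number, or whose signature does not verify.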

Don’t strip that header

Currently, there are no SIP headers that must be retained end-to-end through the VoIP networks. All headers can be reconstructed at each step, though a few elements are reused (such as the calling and called party numbers). STIR assumes that VoIP carriers will be able to pass a SIP header through their network from the origin to the terminating carrier. This is certainly technically feasible, but will require substantial coordination – and likely a few SBC software updates for some carriers.
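In code terms, the change being asked of each carrier is small. The Python sketch below models an intermediary that rebuilds the headers of a call at each hop, as described above, but copies the Identity header through untouched. It is a simplified illustration under stated assumptions: headers are modeled as a plain dictionary, and the function and constant names are hypothetical rather than taken from any SBC product.

```python
# Headers that must survive every hop unchanged. SIP header names are
# case-insensitive, so comparisons are done in lowercase.
PASSTHROUGH_HEADERS = {"identity"}

def rebuild_headers(incoming: dict, local_contact: str) -> dict:
    """Rebuild outbound headers at an intermediary, preserving pass-through headers."""
    outgoing = {
        # A few elements are reused from the inbound call...
        "From": incoming["From"],
        "To": incoming["To"],
        # ...while everything else is regenerated locally.
        "Contact": local_contact,
    }
    for name, value in incoming.items():
        if name.lower() in PASSTHROUGH_HEADERS:
            outgoing[name] = value  # copied verbatim so the signature stays valid
    return outgoing
```

Any hop that drops or rewrites the header breaks verification for every carrier downstream, which is why the coordination matters more than the code.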

STIR promises a world where you can be certain of the calling party while your phone is ringing. But will it happen? STIR requires substantial technical work on VoIP network infrastructure. Practically every SIP carrier peering/trunk on every SBC deployed will have to be updated.

STIR will require establishment of a Certificate Authority (CA) who can provide the certificates verifying the right to use telephone numbers. We already have Certificate Authorities servicing the Web industry, so this should not be a major hurdle. You can expect big carriers to desire to be CAs for themselves – likely a smart solution for many cases. For example, AT&T has been, in effect, the “owner” of millions of telephone numbers for years, though they were permanently assigned to their subscribers. It makes sense for AT&T to be the CA for the numbers it already “owns”.

Who Goes There?

To succeed, STIR will have to engage the business models of the modern VoIP PSTN. Private companies and government entities alike use the flexibility of the PSTN to route their calls through any carrier that is convenient. If STIR requires evidence that the telephone number is being used legitimately, then credentials to use the telephone numbers must be distributed to all of the owners of telephone numbers.

Video Relay Service. For example, at the SIP Forum SIPNOC meeting in June 2016, one major Video Relay Service (VRS) for the Deaf and Hard-of-Hearing community commented that they effectively place calls on behalf of their users. A full STIR implementation will require the VRS providers to place the calls outbound for these users, even though the audio portion is actually connected to a Sign-Language Interpreter.

Government Agencies. Government agencies using COTS platforms like BroadWorks sometimes use a variety of paths for routing their calls outbound. They, too, will need the tools and technology to prove their right to use the caller ID, because preventing spoofing of calls from public institutions is one of the hopes for STIR.

Call Center Services. Today it’s also common for a firm to hire a call center service to place outbound calls, representing a firm. STIR will require the Call Center firms to be capable of providing a signature showing the right to place calls from that entity.  For example, if the Call Center for Delta Airlines needs to call you, then the Call Center service will need credentials (like a password) to allow them to place that outbound call from Delta’s telephone number. The Call Center will need to be upgraded to be capable of generating the STIR Identities.

Ten Years of Work

Once Caller IDs are verifiable, telephone users can make smart decisions about which calls they want to allow.

Call Control from Kedlin Company operates on the phone endpoint.

So how long will it take to block fraudulent caller ID, and get true caller ID? It took about seven years to prevent spoofing of domain names, like “gmail.com”, from the start of the IESG SPF experiment in 2005 to the implementation of DMARC in 2012. The PSTN moves even more slowly.

But solving the spoofed-caller-ID problem – and the Robocalling it enables – is worth doing. Service Providers should monitor the progress of STIR, and Vendors (like BroadSoft, Oracle Communications, Metaswitch, Genband, Sonus) should plan to support STIR.

  • Push for STIR Identity Support from your SIP Software Vendor, especially SBC, Feature Servers, and SIP Trunking Devices
  • Configure your equipment to allow the new Identity header to flow through
  • Find out how you can confirm and display the identity of calls you receive using STIR
  • Plan to enable users to legitimately delegate authority to place outbound calls

Committing time to answer questions is the crucial first step

This is Part 3 in my Series on Supporting/Managing Engineers

Configuring bridging, building bridges

Unlike software, systems, network, and voice engineering, regulated engineering disciplines require licensing. According to the National Society of Professional Engineers, a college engineering graduate candidate can “begin to accumulate qualifying engineering experience”. What happens during the four years of post-graduate experience? “The experience must be supervised. That is, it must take place under the ultimate responsibility of one or more qualified engineers.” Further, the experience is expected to be “high quality, requiring the candidate to develop technical skill and initiative in the application of engineering principles.”

These are the engineers building the bridges you drive over, the flammable electrical parts you install in your walls, and the explosive power plants you live near. Society expects high standards because of the risks to health and safety.

So to advance in their field, those engineers seek out supervision. And because of the licensing, senior engineers must participate in the mentoring process. They have a valuable structure of mentoring.

But what about the “engineers” who run email servers, build voice networks, and write software? Generally, these information technology (IT) and computing disciplines are unregulated. Anybody can do IT as badly as they like, and we have no mentoring structures in place.

ASCII question, but got no ANSI

Computing has grown without formal, legally-mandated mentoring requirements because computers serve as self-checking devices: if the new programmer tries to do something crazy or foolish, it won’t compile or it won’t work.  Unlike faulty bridge designs, computing systems are relatively easy to test. If you try to build a network but don’t know what you’re doing, that network won’t function.

But in IT/computing fields, mentoring is still needed:

  1. IT Engineers often have questions that cannot quickly be googled or tested by experimentation
  2. IT Engineers need to learn from the mistakes & experience of others
  3. IT Engineers need review from other brains to help check their own ideas

It can be hard for a junior engineer to get a solid answer to a question. The best engineers are always busy, and they’d rather spend their time with computers, not people. For example, the Myers-Briggs personality type INTJ is used to link introverted personality types to Computer Programmers and Engineers. For these introverts, answering your question is draining. “If you’re a true introvert, networking is excruciating,” wrote Susan Adams in Forbes in 2014.

So if you’re learning something new,  how do you get your questions answered if you’re among people who’d rather avoid people?

A Mentor commits to answer questions

With so many demands on a skilled professional’s life, the key scarcity is the willingness to provide answers when questioned. So if you’re going to be a mentor engineer,  be sure you’re available to hear questions, and provide good answers.

That is, mentoring in IT Engineering is first about the mentor’s commitment. The mentor has to be willing to take questions from junior engineers, and commit to answer them.

As a professional programmer, Jud McCranie answered hundreds of this author’s written questions about programming through 1989-1993. He’s a great example of a mentor, investing in answering questions of a curious mentee.

I had some great mentors as I was learning. Jud McCranie is a professional programmer I met through a university-operated Bulletin Board System in the early 1990s. He answered a thousand questions from me on programming, and even decrypted my amateurish encryption algorithms. He was a mentor because he took my questions and didn’t ignore them. He challenged my thinking, often asking me questions I couldn’t answer myself. I’m always thankful for his patient consideration of numerous questions from a teenager. (McCranie is cited in one of the new volumes of Knuth’s The Art of Computer Programming.)

Another superb question-answering mentor was the late Jon Hamlin, an ex-CIA intern and graduate of Valdosta State University and University of Georgia. Jon was far more interested in system administration, networking and UNIX, and in his role as computer science lab manager at Valdosta State University, Jon had access to extensive SunOS/Solaris resources and the time to think about my questions. He set up a Linux computer and gave me access, and helped me clean up a few messes.

Both of these men put in hours of their lives to answer questions I wrote through email, and they’re part of my motivation to answer questions for others.

Why it’s so hard to get somebody to really answer a question

I just claimed that there’s a scarcity of willingness to answer questions, so the first role of a mentor is to commit to be available to answer questions. Why is that?

  1. There are lots of ignorant people willing to give bad answers. But you don’t want them to be your mentor.
  2. It takes real time and expertise to answer questions. As Erica Friedman writes, sometimes you’re expecting a simple answer about a complex system: “Some answers attempt extremely top-level analysis, but few people will have time or expertise to answer a truly complex simple question.”

    Competent Computing/IT Engineers are very busy. Getting a senior engineer to commit the time to mentor is a big deal. Photo: Tim Regan.

  3. Competent engineers usually prefer to stay busy engineering, not doing chit-chat. According to  John Hales of Global Knowledge, communication and explanation are not key character attributes for successful IT Professionals.
  4. IT Engineers are often expected to make progress quickly, so they can’t wait long for answers. As one contributor said on StackExchange, there’s no time pressure to answer questions in online forums, and so it’s easier to get questions answered there.

Because engineers prefer results over talking, and because of the demands on competent engineers, it’s hard to get a timely answer from a competent engineer.

The curious case of the unasked question

“I would rather have questions that can’t be answered than answers that can’t be questioned.” ~ Richard Feynman

Once a mentor is available to answer questions in a timely way, the mentee must be willing and able to ask questions. Often they are not.

Michael Adams of Quizbean identifies several common reasons junior staff don’t ask questions. For engineers, some key reasons they don’t ask questions are:

  1. Nervousness. “They don’t want to be embarrassed in front of their boss or co-workers,” Adams writes. This can be caused by simple anxiety, or excessive pride if they don’t want to admit there’s something they don’t know. Antidote: So the mentor must make it easy and unthreatening to ask questions by readily enjoying the curious discovery of new facts. Computing fields move famously fast, so there are always new things to learn.
  2. Avoiding annoyance. They don’t want to ask questions because they perceive it to be inconvenient. Antidote: Therefore the mentor must actively invite the questions.
  3. Bewilderment. The mentee doesn’t know where to start because everything feels so unfamiliar. Antidote: The mentor should set milestones of capability to encourage steady improvements in comprehension; e.g., crawl, then walk, then run. For instance: log in to the server first. Then locate the logs. Then read the logs. Then understand the logs.
  4. Previous trauma. The world is full of jerks, and many people have been criticized by those jerks for asking good questions. The mentee may need to recover from an unhealthy work environment where legitimate questions were met with caustic attitudes.
  5. Lack of curiosity. One of the most dangerous problems is a lack of curiosity in the subject matter. Maurizio Porfiri, a robotics researcher at New York University, says that “Being creative and being curious is more important than being the smartest or the best at equations if you want to be a great engineer.” Albert Einstein said that “The important thing is not to stop questioning. Curiosity has its own reason for existing.” The lack of curiosity leads to a dangerous complacency: the mentee does not care enough to formulate questions or challenge their own assumptions, and instead muddles through with the current level of ignorance. Computing is great because it encourages curiosity. Have a question? Try it out. A chemist or physicist or mechanical engineer needs equipment for a lab, but the Computer Scientist needs only a computer and the means to program it. Ayodeji Awosika writes about the dangers of suppressed curiosity in “The Theory of Nothing: Why Lack of Curiosity Leads to Mediocrity.”

Beyond Q&A: The Weekly Mentoring Meeting

To support growth of a mentee, mentors can schedule regular time with the mentee. Just like the commitment to answer questions, the mentor must make these meetings a commitment. Commonly, this happens as a weekly meeting to ask questions and plan progress.

Jim Anderson’s approach to mentoring includes a weekly meeting. The advisee’s next steps were documented on his office whiteboard for clarity and easy reference.

Jim Anderson, a Computer Scientist at UNC-Chapel Hill, followed a simple model of tracking progress for his advisees: he wrote the next steps for each of his mentees on the whiteboard in his office. Then they could easily see what was expected. And in each weekly meeting, he could easily recall their responsibilities.

Plan the route out of ignorance. Even in non-supervisory mentoring, it’s helpful for the mentor to plan and track progress so the mentor ensures that the incomprehensible is coming into focus for the mentee. Without this guidance, the mentee may be trapped in a complicated area without a path forward, and without the ability to ask questions to get out of it.

Review recent accomplishments. The weekly meeting is also a good time to review samples of the mentee’s work. The mentor can praise progress and identify the most important improvements the mentee can work on next. But even when identifying the next growth area, the mentor should recognize the mentee’s achievements.

For example, for Computing/IT professionals, reviewing work can mean:

  • Discussing interesting troubleshooting problems and the approach to troubleshooting.
  • Reviewing system configurations to see how a mentee’s task was accomplished.
  • Reviewing source code.

 

Establishing Healthy Mentorship

To begin healthy IT/Computing mentoring,

  • Get experienced engineers genuinely willing to answer questions: call them Mentors.
  • Get other engineers with curiosity, who are willing to ask questions.

Photo: Merrimack College Mentoring Program.

Sometimes junior technical staff are starved for interesting work while senior staff are overworked

This is Part 2 in my Series on Supporting/Managing Engineers

If a team has lots of technical work to do, and only a few brilliant engineers available, how do you get work to the right people?

In this article, I discuss methods for managing work in IT and technology teams, such as those doing Network Operations or Engineering, DevOps, Security Management, or System Administration, when you have a mixture of skill levels, including “star engineers” with 10+ years of experience. I would give different advice to a team focused on developing software.

Call the Brain Surgeon!

A hospital is full of medical staff.  But when a life-threatening case rolls in, you want the most experienced physician available to handle it.  And if you’re doing important, high-profile, mission-critical technical work, you might want the most experienced technician to do the work.

Likewise, if the problem at hand is to restore the telephone service for the hospital, or to restore a failed 911 Emergency calling service, then it makes sense to put the best available people on it.

…to suture this wound

Yet often, the risks are not so high, and the need is not actually so urgent. You really don’t want your most-senior staff doing relatively low-risk simple work. Escalating all work to the most-skilled person robs the lesser-skilled staff of experience and deprives the organization of good ideas.

How do you prevent all the hard work from being done by the best and brightest?  How do you prevent all the work from flowing up to the highest-skilled people?

  • Make junior staff work at the highest level possible
  • Make senior staff support and include junior staff in strategy and planning
  • Define the barriers between junior and senior staff
  • Define the workflow for projects

Excelling as a Junior Engineer

Suppose you’re a junior engineer in a team, and all of the challenging work goes to the senior engineers. You know you could do that work, but it’s always going to the senior guy. How do you get the chance to work on hard problems?

It’s possible your management is fumbling, but be sure you’re doing a great job on the problems you have. It’s easy to dream about solving challenging problems, but you have to be sure you’re doing fabulous work in your current tasks:

  • Technically:
    • Are you doing as well as anyone in the world?
    • Are you aware of how other people solve this problem?
    • Are you documenting your methods?
    • Are you listening to the ideas of your colleagues?
    • Are you clearly explaining your results?
    • Are you recording your questions and other unknowns, then working to get answers to those questions?
  • Organizationally:
    • Are you confounded by how difficult it is to make progress?
    • Are you losing track of details, dates, times, or meetings?
    • Are you recording results of the project so others can benefit from it later?
    • Are you making progress before the deadlines, so that you have substantial progress to demonstrate, and questions to ask?
  • Inter-personally:
    • Are you communicating clearly, with the requester of the work (the “customer”) and others with whom you collaborate?
    • When you have a chance to talk about the project, are you engaged? Are you contributing ideas?
    • Are you doing what you can to be pleasant, even fun, to work with?

Make Yourself Available to Hard Problems. Find out how the interesting projects come in, and look for ways to be involved.

  • Try to make sure your work environment isn’t too hidden and isolated.
  • Ask to join conference calls on interesting projects
  • Take notes in meetings, and offer to provide those notes to other people.
  • Ask questions about everything you don’t understand.

Senior Engineers Plan Projects

What is the role of the Senior Engineers?

  • Make design decisions.
    (Anything that can be done either better or worse is a design decision.)
  • Analyze the problems and projects
  • Write plans for the projects
  • Guide the solutions toward doing the best thing (“what should be done”), minimizing involvement in the method chosen.
  • Solve problems when no-one else will handle them
  • Mentor the Junior Engineers

The distinction of a star technician is perspective and context: they know how things are done and why. They can consider pros and cons of various options for change. They know the potential for risks. They should know the organization’s strategy and mission.

With this context, a senior engineer should be planning projects:

  • What are the objective and the goals?
  • What are the phases of work along the way?
    Tip: keep each phase <4 hours of work
  • What kind of data needs to be collected before proceeding?
  • How will we know when we’re done?
  • How long should it take to complete each phase?

A star technician is as mature as the plans they can provide to others to execute.

Define the Junior/Senior Boundary

What kind of work should be done by senior engineers, and what kind by junior engineers?

Senior Engineers are obligated to:

  • Maintain or improve architectural unity of the system.
  • Provide opinions based on mature technical aesthetic
  • Research the available options for solving problems (i.e., not just their favorite options)
  • Write plans for accomplishing the work
  • Determine which things to do (E.g.: Install Apache on a Linux VM, or buy an F5 Local Traffic Manager?)
  • Answer questions raised by junior staff
  • Review and approve the work done by junior staff
  • Be aware of the skills and capabilities of the junior staff
  • Do the work that no one else is able or willing to do

Junior Engineers should:

  • Follow the plan to accomplish the work
  • Debate and question the plan if they see better ways
  • Ask good questions to fully understand the rationale behind the plan
  • Discuss the problems with Senior Engineers when they find something interesting or unexpected.

Define the Workflow

Send all work to the Junior Staff?

A popular remedy for Work-Flows-Up Syndrome is to send all problems to the junior staff. For example, all tickets come in and start at Tier 1, then the hardest 80% escalate to Tier 2, then only the worst get escalated to Tier 3.

Hacks.  My experience with this method for actual problem solving is poor: Junior staff are prone to develop Workarounds, Tricks, and Hacks (collectively known as “WTH Solutions”) without adequate context or insight into the best way to solve problems. Without knowing the long-term risks, WTH Solutions make the system more fragile.

Blockages. Just as a star engineer can be a bottleneck, junior staff can block the successful completion of work. To know when to ask for help is a learned skill, and junior staff are often prone to keep poking around the edges of a problem. They can tend to sit on interesting problems too long.

Lack of Senior Insight. Senior staff need the insight on the real-world problems with their system. When every problem starts at the junior staff,  the senior staff may be unaware of the problems.

For example: In one case, the network had a real but seemingly minor recurring issue that could be worked around through an easy procedure. The senior staff gave the junior staff instructions for how to solve it, expecting the junior staff to need to do it no more than weekly.

But the problem grew more frequent. The junior staff were using the workaround dozens of times per day, faithfully following their instructions. But the senior staff were unaware of the scope of the problem because the junior staff were dutifully executing the procedure.

In this case, the senior staff needed to be aware of the scale of the problem to develop a more permanent solution. Because the problem was almost entirely delegated to the junior staff, the senior staff couldn’t see what was happening and rethink their proposed “solution.”
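One lightweight way to keep senior staff in the loop is to count how often a delegated workaround actually runs and flag any week that exceeds what the senior engineer expected. The Python sketch below is a hypothetical illustration of that idea, not a description of what the team in this story did; the function name and the once-per-week expectation are assumptions.

```python
from collections import Counter
from datetime import date

def weeks_over_expectation(run_dates, expected_per_week=1):
    """Return the (ISO year, ISO week) pairs in which a workaround ran more
    often than the senior engineer expected when delegating it."""
    runs_per_week = Counter(tuple(d.isocalendar()[:2]) for d in run_dates)
    return sorted(
        week for week, runs in runs_per_week.items() if runs > expected_per_week
    )
```

If junior staff log each use of the procedure, a non-empty result is the signal to revisit the “solution” instead of continuing to execute it.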

Plan Before Proceeding

Rather than filling all the work in “from the bottom” of the skill levels, projects should be planned from the beginning. The first step in processing every new problem should be writing a project plan.

What’s a project?

Anything that has several steps is a “project” as I’m using it here.

What should be designed?

A design decision is anything that can be done incorrectly but still appear to work. For example, you could build a network using one large subnet in one big Ethernet, or you could set up the network with routing and subnets. Both may appear to work, but one will be better for the situation.

Writing the project plan may require 10% to 20% of the total work time on the task, and should be done by Senior Engineers. The Senior staff is thus responsible for considering options and maintaining architectural and conceptual integrity of the system.

After the plan is developed, the project can be assigned to somebody at the right skill level.

  • Could the execution of the project affect architectural integrity of the solution? Assign to Senior Engineers.
  • Does the junior staff have the skills to do it, or can they learn? Delegate to Junior Engineers.

Senior / Junior Followup

Senior engineers need to routinely follow-up with junior staff to talk about their projects.

Most projects cannot be completed start-to-finish without interruptions. Sometimes we need a new license file; or we need to open a ticket with the vendor; or we need access to a protected site; or information from the customer. Due to these interactions with other vendors and customers, I find many technicians need to have 3-5 projects in order to stay busy.

Senior Engineers should have routine conversations with junior engineers to discuss the projects, and how they are going. The Project Plans will never be perfect, so discussing the project as it proceeds is a healthy part of mentoring.

Work Flows Where You Pump It

With the proper distribution of tasks, even a team of two engineers or technicians can be more effective. Junior staff have responsibilities to grow technically and professionally, and Senior staff have an opportunity to increase their team’s effective throughput.

Photo Credit: Ewan Cross

 

Most experienced professionals value the opportunity to mentor others; not so for some elite technologists

To enable more people in a technical team to do work, more people have to know how to do it. But for many engineers, training doesn’t come naturally.

This is Part 1 in my Series on Managing Engineers

Is one-on-one training a rational activity, or just a feel-good strategy from the HR department? Many engineers don’t believe in the value of training and mentoring junior people because they perceive no personal benefit in training junior people. But there are benefits for everyone involved.

The reasons high-tech engineers have for not mentoring others are distinctive to the field.

  • Time Calculus: The time and effort to do a task they’ve done before is less than the time and effort to explain it to somebody else and then ask them to do it. For example: suppose a 2-hour routine-cognitive task requires three hours of training. It’s easier just to do it than to train someone else in how to do it.
  • Mathematical, Non-Verbal People: For some people, it’s much easier to control a computer and get the problem done than clearly express ideas verbally.
  • Hero Syndrome: The senior person likes to be the one to solve problems because they get their self-worth from it. And the Customers (those who need the help) would rather just get the problem solved the fastest way possible.
  • Too Much Variety in tasks. Some organizations rarely solve the same problem twice. In these cases, the trainable skills are relatively few.
  • Weak Junior Staff. Sometimes the people available to train aren’t trainable (brains) or coachable (pride). It is the responsibility of the organization to find appropriate staff.
  • Job Protection. There’s a myth that some people won’t train others because the would-be trainer would thus be subject to losing their job. I’m not sure I’ve seen real evidence of this.
  • Technology changes too fast. Technical skills have a very short lifetime: what you learn today may be good for about two years, according to IT World.
  • My other trainee is a compiler. Many senior technicians argue that automating the task is better than training someone else to do it; often that is true. But our desire to write code must be tempered by reality: code doesn’t last very long, and will need humans to maintain it. As Alan Perlis famously said, “It is easier to write an incorrect program than understand a correct one.” Without another human’s review of the problem, you may be solving the wrong problem.

Good Reasons to Train and Mentor Junior Engineers

“In learning you will teach, and in teaching you will learn.”
― Phil Collins

Despite the objections listed above, there are ample good reasons to train and mentor junior engineers.

  • Explaining a practical art to another human is good for humanity, by enabling more people to do more useful work.
  • The Teacher Learns. Explaining and answering questions increases the capabilities of the trainer substantially. The effort to verbally explain a system or technology forces the explainer to crystallize a model of the system they may not have had earlier. (My knowledge of SIP, VoIP Call Control, and RTP grew vastly as I was forced by teaching to verbalize my knowledge, and confront areas I didn’t understand.)
  • Senior Staff should focus on the Hardest Problems. After the training is done, the time and effort required of the senior staff is substantially reduced. Suppose that two-hour task required three hours of training; now when the task needs to be repeated, the junior staff can complete it. This makes the senior staff more available for the challenging problems for which they are distinctively qualified. When senior staff are doing the easy stuff, they’re usually leaving the harder problems unconsidered.
  • Technology is fundamentally about improving efficiency through use of effective tools, and having more processors capable of completing a task makes the clients more efficient than having to wait on a single-threaded processor. By having only one person capable of providing the work, you’re creating a bottleneck and failure mode for the rest of the world.
  • Redundant Array of Imperfect Humans (RAIH): Everybody gets sick or needs a holiday. By mentoring someone else, you’re making it easier for yourself to get a break. You may not want to take a break, but eventually you’re going to get sick, die, take vacation, or get a better job.
  • Pair Programming Benefits. Even when automating a task, you can get benefit by bringing in somebody else. There are many documented benefits of pair programming; the Mentoring benefit is key. (Yes, pair programming takes more person-hours — but only about 15%. And for that 15% you get better results.)
  • Fundamental principles change little. While technology moves fast, there are key fundamentals that don’t: the Buffer (1950s); Interrupt (1950s); Relational Database (1960s); Packet-Switched Networking (1960s); Structured Programming (1960s); multi-user Operating Systems and Security (1970s); Digital Audio and Video (1960s – 1980s).

How to Accomplish Mentoring & Training

“You cannot help people permanently by doing for them what they could and should do for themselves.”
― William J. H. Boetcker (often attributed to Abraham Lincoln)

When you decide to start training another technician, start with the right assumptions.
  1. I cannot learn for you.
  2. I cannot replace your own curiosity.
  3. If you won’t ask questions, you’re probably not teachable.
  4. You can’t learn by watching or listening: you have to learn by doing.

More on this in a future post.

Photo Credit: Guido Gloor Modjib

If you use BroadWorks with CSV CDRs, you’re probably accustomed to reading these:

00255943365CF3FC1CF15820160404194616.3061-040000,ECG,Normal,+12296543428,+19125293400,Originating,+12296543428,Public,+12296543409,20160404194616.306,1-040000,Yes,20160404194624.106,20160404194650.684,016,VoIP,,3409,private,,,,local,Group,,PCMU/8000,216.128.52.5,fd390cb4-eb6d3d15-f8fb7fd2@10.23.6.217,,,,Hermes Communications,,,,,,,,,,n,,,17552886061:0,17552886091:0,Ordinary,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2296543428@vwave.net,Chris Brice,Public,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,33.292,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Group,,,,,,,,,,,,,,,,+19125293400,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2296543428@vwave.net,Primary Device,33.292,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,PolycomSoundPointIP-SPIP_650-UA/3.2.2.0477,,,,,,,,,,,,,,,,,,,,,,,,,,10626401:2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Should you tire of counting commas, here’s a CDR decoder meant to run on Mac or Linux systems with gawk: bwcdrdecoder script

With it, you can get output like this:

[Image: sample bwcdrdecoder output]

bwcdrdecoder was made to understand BroadWorks Release 21 CDRs, and it generally handles older CDRs as well.
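
For a sense of what “not counting commas” means in practice, here is a minimal sketch. It is not the bwcdrdecoder script, and it knows nothing about BroadWorks field names. The function name cdr_fields is made up for this illustration. It simply numbers the non-empty fields of each record, and it assumes that no field contains a quoted comma, which may not hold for every CDR.

```shell
#!/bin/sh
# cdr_fields (hypothetical helper, not part of bwcdrdecoder):
# list the non-empty fields of each CSV record with their 1-based
# position, so nobody has to count commas by hand.
# Assumption: fields never contain quoted commas; a plain split on ","
# would mis-number the fields of any record where they do.
cdr_fields() {
    awk -F',' '
        {
            # One heading per input record, then position<TAB>value.
            printf "record %d\n", NR
            for (i = 1; i <= NF; i++)
                if ($i != "")
                    printf "%d\t%s\n", i, $i
        }
    ' "$@"
}

# Usage (reads the named files, or standard input if none are given):
#   cdr_fields BW-CDR-20160501000000-2-5CF3FC1CF158-001455.csv
```

Run against the record shown above, the listing would start with the long record identifier at position 1, ECG at position 2, and Normal at position 3. Mapping positions to meaningful names is the part a real decoder adds.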

How To Make bwcdrdecoder Your Own

  1. Download the bwcdrdecoder script
  2. Save it as “bwcdrdecoder” someplace in your path, or in your home directory, ~
  3. Change the permission on the file to allow you to execute it, e.g.,
    chmod u+x ~/bwcdrdecoder
  4. Execute it to read in some BroadWorks CSV CDRs, e.g.,
    ~/bwcdrdecoder BW-CDR-20160501000000-2-5CF3FC1CF158-001455.csv

Troubleshooting

If you don’t have GNU awk (“gawk”) installed, you’re probably using Mac OS X, and you should install it. I recommend Homebrew (“brew”), i.e., running brew install gawk.

If gawk is installed somewhere other than /usr/local/bin/gawk, you’ll need to edit the first line of bwcdrdecoder. For example, if gawk is in /usr/bin, change the path on the first line to /usr/bin/gawk.
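
The check described above can be sketched as a couple of small shell functions. This assumes the first line of bwcdrdecoder is an ordinary “#!” interpreter line; the function names are invented for this example and are not part of the script.

```shell
#!/bin/sh
# Hypothetical helpers for checking a script's interpreter line.

# Print the interpreter path named on a script's first (#!) line,
# dropping the leading "#!" and any arguments such as "-f".
shebang_interpreter() {
    head -n 1 "$1" | sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//'
}

# Report whether the interpreter the script asks for exists and is
# executable; if not, show where gawk was found on the PATH, if anywhere.
check_interpreter() {
    interp=$(shebang_interpreter "$1")
    if [ -x "$interp" ]; then
        echo "ok: $interp"
    else
        echo "missing: $interp (gawk found at: $(command -v gawk || echo nowhere))"
        return 1
    fi
}

# Usage:
#   check_interpreter ~/bwcdrdecoder
```

If the check reports the interpreter as missing, the path it prints for gawk is the one to put on the first line of bwcdrdecoder.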