MIAMI, FLORIDA – BroadSoft Connections 2018 – The US Department of Justice (DOJ) has now allowed ECG to publish information about its long-term relationship with ECG, Inc., which has supported Federal law enforcement since 2007. This partnership is one of ECG’s proudest contributions to reliable communications for critical, life-saving goals.
The US DOJ approved release of the information in October 2018, saying:
“ECG fills several engineering roles for the United States Department of Justice to support the deployment of a BroadSoft BroadWorks Unified Communications Suite which enables more than 25,000 end users in multiple geographically dispersed locations throughout the U.S. Each of these sites requires hosted communications functionality from the centralized BroadWorks servers.”
This is a classic case of Voice that Absolutely Has To Work — All the Time. The statement continues:
“In addition, ECG provides platform design requirements, project management, engineering process enhancements, training, security analysis, evaluation of NIST 800-53 control assessments, and vulnerability remediation and support of Plan of Action and Milestone (POA&M) management. The DOJ also deployed Alpaca, ECG’s java programming tool developed specifically for the BroadWorks platform.”
Alpaca is the BroadWorks Management Tool, transforming and migrating users between BroadWorks systems for replacement, scaling, and management. Alpaca is the only commercially supported platform that migrates 100% of BroadWorks user features and allows seamless (no-server-downtime, no-user-downtime) operation.
Original Release Date: September 19, 2018. Updated 18:19 UTC
SIP Service Providers and Enterprises
Background & Description
SIP-based Enterprises and Service Providers (SIP Operators) that provide SIP UA configuration files (such as for Cisco, Polycom, Yealink, and Mitel devices) but do not authenticate those downloads effectively are vulnerable to attack through download of those configuration files. The configurations contain SIP server addresses and authentication credentials; once disclosed, they can be used to launch attacks against those SIP networks. In the past, firewall rules were considered sufficient authentication, because they confirmed that downloads originated from known networks.
In attacks on September 19, 2018 (UTC), evidence emerged that attackers are successfully retrieving the SIP UA configuration files, including authentication credentials, then sending REGISTER requests and launching outbound calls via SIP to high-cost destinations, even in networks where IP access lists and firewall rules are in place to limit access. The attack methods appear consistent with the use of botnet agents installed within the networks of the attacked entities. These attacks are succeeding in production, interconnected voice networks that do have firewall rules and access lists in place.
Key traffic-pumping destinations in this attack are in country code +224 (Guinea) and in +1-876 (Jamaica).
The observed use of legitimate user IP address space from which to launch SIP attacks represents a substantial escalation in the strategy used by attackers.
Even with strong SIP authentication and firewall rules, SIP Operators may be exploited for the fraudulent economic benefit of the attackers. Toll fraud to high-cost destinations based on traffic pumping can create substantial costs for SIP Operators, and potentially a loss of confidentiality.
ECG recommends the following immediate measures to prevent this type of attack:
Use TLS with Client Certificate Authentication to restrict SIP UA configuration downloads, so that only legitimate devices with manufacturer-signed client TLS certificates (“manufacturer installed certificates”, or MICs) are able to download configuration files. For SIP UA Configuration platforms that do not have intrinsic TLS Client Certificate Authentication and Authorization support, implement an intermediate HTTPS proxy to verify client certificates.
After restricting SIP UA Configuration Downloads with TLS and Client Certificate Authentication, so that attackers can no longer retrieve the SIP authentication credentials, update the SIP authentication credentials to use SIP passwords of 12 characters or longer.
Block outbound calling to high-cost destinations whenever possible.
Manage firewall rules to minimize access to SIP-UA Configuration Servers.
Monitor for outbound calling to high-cost destinations and block attacks, using toll-fraud monitoring tools.
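The blocking and monitoring recommendations above can be sketched in a few lines of Python. This is only an illustration: the prefix list is taken from the destinations named in this advisory, and the function names are hypothetical rather than part of any SIP platform or toll-fraud product.

```python
# Hypothetical blocklist built from the destinations named in this advisory.
HIGH_COST_PREFIXES = ("+224", "+1876")


def normalize(number: str) -> str:
    """Strip punctuation so '+1-876-555-0100' becomes '+18765550100'."""
    return "".join(ch for ch in number if ch.isdigit() or ch == "+")


def is_high_cost(number: str, prefixes=HIGH_COST_PREFIXES) -> bool:
    """Return True when the dialed number starts with a blocked prefix."""
    return normalize(number).startswith(tuple(prefixes))


def count_suspect_calls(dialed_numbers, prefixes=HIGH_COST_PREFIXES) -> int:
    """Count calls to blocked prefixes, for example to feed a toll-fraud alert."""
    return sum(1 for n in dialed_numbers if is_high_cost(n, prefixes))
```

A real deployment would likely apply the same check in the SBC or softswitch routing policy and alert when the count rises above what is normal for the customer.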
The SIP UA models can operate without TLS Client Certificate Authentication on Config, but have been reported by the manufacturers to have the capability.
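As a rough illustration of the client-certificate recommendation, the following Python sketch builds a server-side TLS context that refuses clients that do not present a certificate signed by a trusted CA. The function name and the CA bundle path are assumptions for illustration; a real server would also load its own certificate and key, and many operators would configure this in their HTTPS server or proxy rather than in application code.

```python
import ssl


def config_server_tls_context(mic_ca_bundle=None):
    """Build a server-side TLS context that requires a valid client certificate."""
    # Purpose.CLIENT_AUTH yields a context meant for the server side of a connection.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Reject any client that does not present a certificate we can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if mic_ca_bundle is not None:
        # Trust only the phone manufacturers' CA certificates (hypothetical file path).
        ctx.load_verify_locations(cafile=mic_ca_bundle)
    return ctx
```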
The information you have accessed or received is provided “as is” for informational purposes only. ECG, Inc. (“ECG”) does not provide any warranties of any kind regarding this information. In no event shall ECG or its contractors or subcontractors be liable for any damages, including but not limited to, direct, indirect, special or consequential damages, arising out of, resulting from, or in any way connected with this information, whether or not based upon warranty, contract, tort, or otherwise, whether or not arising out of negligence, and whether or not injury was sustained from, or arose out of the results of, or reliance upon the information.
ECG does endorse certain commercial products or services, including in some cases the subjects of analysis. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation, or favoring by ECG.
Many BroadWorks service providers are migrating to new data centers and cloud hosting providers. This change brings many aspects into consideration, such as:
Rollback recovery options
Managing BroadWorks Configurations
Choosing Operating System versions
To look more into the migration, Mark Lindsey spoke with ECG AMTS Ashley Lee for a Q&A. Ashley is an ECG Associate Member of Technical Staff and Software Support Engineer who provides technical expertise and creative solutions for the enterprise and service provider markets. His clients include CenturyLink, Momentum Telecom, UVA, and Vonage.
Mark: Will the existing servers be migrated or will new servers be set up to replace the original BroadWorks Servers?
Ashley: When conducting a migration, you may want to upgrade to newer hardware or use different hardware due to the distance between the original and new data center. Using new hardware also provides you with a rollback option should you need it, as the original hardware will still be in place in the original data center during the final switchover/migration.
Mark: Will the BroadWorks servers require new IP addresses and hostnames in the new data center?
Ashley: In some cases using the same IP addresses will not be possible in the new data center. This is generally the case when your BroadWorks platform is using private addresses. Additionally, if your original systems are using hostnames which identify the location, then you may want to update the hostnames to either remove the location identifier or update the identifier. Changing either of these, the hostname or IP, requires additional BroadWorks configuration adjustments. Several of these adjustments are outlined in the BroadWorks Documentation. However, there are also other items you will need to ensure are updated, such as internal DNS, /etc/hosts files on all servers, and some BroadWorks settings/references not mentioned in the BroadWorks documentation/guides.
Mark: How do you decide which OS to use when migrating BroadWorks servers?
Ashley: When choosing which supported OS to install on the new BroadWorks servers there are a few things to consider. First, what OS is installed on the original server? Second, is this OS still supported? Finally, what are the impacts of running 2 different OS versions between the peer BroadWorks Servers?
Generally it is recommended to install the same OS on the new BroadWorks servers, primarily due to the ease of maintenance and updates for the servers. However, if the OS installed on the original BroadWorks server is no longer supported, it is recommended to upgrade to a supported OS from a security standpoint. Installing a newer/supported OS will also mean you will not have to upgrade this server’s OS when the peer server is scheduled to be upgraded later. This will save you time and help out with planning the upgrade of the peer server.
Lastly, you may ask whether it is supported to run 2 different OS versions between peer servers and, more importantly, what impact this has on the BroadWorks servers, especially the AS and DBS. To quote BroadSoft TAC, “BroadWorks for the most part is OS blind.” This means generally there will not be any impact between the BroadWorks Servers. Additionally, in ECG’s experience, we’ve seen a production BroadWorks platform with upwards of 30,000 subscribers run successfully for 6 months with different OS versions installed on each geo-redundant peer, including the AS and DBS. At the end of the 6 months the peer server was upgraded to a matching OS. The only consideration here would be if any of the 3rd party applications currently in use are not supported by the new OS. In this case it would be recommended to upgrade the 3rd party application to a version supported on the new BroadWorks server’s OS.
The final caveat here would be the DBS. In some older OS versions, the DBS requires additional 3rd party software or drivers, which must be manually installed to match the current kernel version. In newer OS versions, this software is available via repositories and is much simpler to install and manage.
Mark: What are the special challenges in relocating a BroadWorks Database Server (DBS)?
Ashley: There are a couple of special challenges with relocating a BroadWorks DBS. To be specific, if you are simply relocating the BroadWorks servers and not changing the IP address or hostname, then the move should be fairly simple and straightforward. However, if you are changing the IP address/hostname of the DBS, then you will need to plan accordingly, as the procedure to adjust the IP address/hostname of the DBS differs vastly from that of other BroadWorks Servers. This is in part due to the scripts which the BroadWorks documentation outlines, but also due to the database itself referencing the hostname in several places.
In some cases it would be easier to simply reinstall BroadWorks DBS on the server and set the server up as a new peer. This would likely be the easier option because, in ECG’s experience, not many engineers are familiar with the scripts recommended by the BroadSoft documentation. Also, should something fail during the execution of one of the scripts, it is generally faster to reinstall the BroadWorks DBS software than to attempt to recover or fix what has been mangled by the script.
Mark: In the past, you’ve mentioned a special challenge in planning disk space for the BroadWorks DBS. Can you tell us more about that?
Ashley: Generally when you are installing a new BroadWorks server you will need to ensure you follow the BroadWorks documentation and guides. However, in a migration you already have a server which is healthy and working fine. In most cases you can audit the existing server to confirm how the new server should be built. For most of the BroadWorks servers, auditing the existing server should be fairly easy, and it is hard to miss any fine details. For example, on an AS or NS, executing ‘df -h’ will generally provide you with the drive setup for the new server. However, for the DBS, ‘df -h’ will not provide you with information regarding the drives set up for the Oracle database. If your audit is not conducted by someone who is knowledgeable about BroadWorks DBS installation and setup, then the additional drives for the Oracle database may be neglected.
In the event the DBS’s new chassis has room for additional drives, omitting these drives from the audit can be remedied easily. However, should the chassis not have room for the additional drives, then the hardware would not be usable. In some cases you may expect to be able to use a NAS to resolve the problem with missing drives. However, using a NAS for the Oracle drives is not recommended. This is because the DBS requires access to the NAS at all times, so by using a NAS, your DBS in the new Data Center becomes dependent on two different devices always being online.
Network Outages are going to happen. The marketing department talks about Zero Outages — and that’s a great goal to have. But as the pragmatic engineering and operations team, you can prepare for outages to prevent them and to remediate instantly.
Skip the Pretend Redundancy
Detect and Triage Faults
PCAP: Packet capture
Ready for Remote Testing
Logging Enabled & Synchronized
Prepare the humans
Skip the Pretend Redundancy
Everybody knows you need hot spares. Pretend-Redundancy is what you get when you spend twice as much on equipment and software but never test it. Salesmen love it because they sell everything you need to be reliable.
Many companies buy spare equipment and even plug it in, but actually implementing working redundancy is far more challenging. Proper physical design can be elusive, and testing time-consuming. But it’s worth it to have a robust network. Outage-Prepared Networks not only build redundancy, they test their redundancy.
Redundancy in networks means that the system:
Has adequate replicas (e.g., spare servers, spare routers)
Continuously synchronizes all necessary information between replicas
Automatically detects faults
Automatically takes the faulted component offline (so that traffic doesn’t keep going to the faulted component)
Automatically selects a replacement component replica
Causes the replacement component to become active or start to take the additional workload
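The last four steps of that list can be sketched as a small Python routine. This is a toy model under stated assumptions: the Replica class and fail_over function are hypothetical names, the health flag stands in for whatever fault detection you actually run, and state synchronization between replicas is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True      # result of fault detection (assumed to exist elsewhere)
    in_service: bool = False  # whether traffic is currently sent to this replica


def fail_over(replicas):
    """Take faulted replicas out of service and make sure one healthy replica is active."""
    for r in replicas:
        if not r.healthy:
            r.in_service = False  # stop sending traffic to the faulted component
    active = [r for r in replicas if r.in_service]
    if active:
        return active[0]
    for r in replicas:
        if r.healthy:
            r.in_service = True  # promote the first healthy spare
            return r
    return None  # no healthy replica left: a real outage
```

Running a routine like this against a deliberately failed component, on a schedule, is one way to turn Pretend-Redundancy into tested redundancy.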
Detect & Triage Faults
To prepare for outages, you’ve got to be equipped to know when the faults are occurring.
Get the basics: Get the status and health information your devices have to offer using SNMP. Is the interface up? Get alerts when components go offline. You can implement this fault detection and reporting with a variety of systems, including OpenNMS, SolarWinds, and others.
Measure the normals: Beyond the basic good/bad readings, you can define Key Performance Indicators (KPIs) like number of active users, concurrent calls, CPU and memory usage, and set thresholds for what is normal or abnormal. CPU of 50% may be normal, but 90% will be normal for some workloads and abnormal for others.
Get the alerts to a human. Generating alerts isn’t enough: some responsible human has to get the alert and do the right thing! The human analysts also need an escalation path to alert the problem-solvers.
Analyze. When you get an alert, you need to troubleshoot it to determine its severity. Fault-detections are imperfect, so you need to determine how true the alert is, and triage the situation for seriousness. Few automated alerts are reliable enough to be trusted.
In the June 2018 Visa Europe outage, the problem was bad hardware. The Network Operations team needed to contact people who could travel to the equipment site. The network monitoring analysts need a reliable communication path to other teams.
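To make the "Measure the normals" idea concrete, here is a minimal Python sketch of per-workload thresholds. The workload names and threshold values are invented for illustration; real numbers would come from measuring your own network over time.

```python
def classify_kpi(value, warn_above, critical_above):
    """Return 'normal', 'warning', or 'critical' for one KPI reading."""
    if value > critical_above:
        return "critical"
    if value > warn_above:
        return "warning"
    return "normal"


# Hypothetical per-workload CPU thresholds, in percent: (warning, critical).
CPU_THRESHOLDS = {
    "sip-proxy": (60, 85),
    "batch-transcoder": (95, 99),
}


def classify_cpu(workload, cpu_percent):
    """Classify a CPU reading using the thresholds for that workload."""
    warn, critical = CPU_THRESHOLDS[workload]
    return classify_kpi(cpu_percent, warn, critical)
```

With these assumed thresholds, a 90% CPU reading is critical on the SIP proxy but normal on the transcoder, which is the point made above about 90% being normal for some workloads and abnormal for others.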
PCAP: Packet Capture
Outage-Ready Networks have packet capture capabilities online throughout the network — at least at the low-speed disaggregated points. Capturing 100 Gbps Ethernet may be impractical today. But capturing the traffic going in and out of each 1 Gbps load balancer should be achievable.
Capturing traffic should not require a dispatch to a physical site: the system should be built to allow troubleshooters to activate packet capture, collect and analyze data within minutes.
Ready for Remote Testing
The Operations staff need the ability to run tests remotely in order to replicate problems.
Voice (Phone) services — It’s easy to REGISTER a SIP / IMS phone over the Internet, or to build VPNs, to replicate the experience of users attaching at different points in the network
Web services — It’s easy to use VPN or DNS to route traffic to a particular entry point in the network
Local ISPs — If you have a global service, using advanced BGP or DNS based routing, you can set up service with local ISPs in your markets and provide test units there to let you test the experience of local users in those markets.
After the outage has begun, it’s usually too late to build Remote Testing capability. Outage-Prepared organizations build these capabilities in advance.
Logging Enabled & Synchronized
All of our systems have logging. In telecom, though, many of those logs are disabled by default. To prepare for outages, enable the logging you can, and be sure the logs are useful.
Enable the logging. Turn on the logging you can afford to enable without crippling the device. Debug logging is often too much.
Synchronize the clocks. To find out what truly happened in a problem, you often need to know the sequence of events that led up to it. But so often the clocks in systems are not synchronized, so that uncovering what occurred fast enough to remediate an outage is too difficult.
Centralize the logging. Whenever possible, aggregate all your logs to a central location, bearing in mind that an outage may take out the logging location as well. Centralized logging with an analyzer tool like Splunk can radically reduce remediation time, but if your centralized-logging sites are down or unreachable, you still need to troubleshoot the problems. I prefer systems that store some logs locally, and send a copy to the centralized log storage.
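The clock-synchronization point can be illustrated with a short Python sketch that merges events from several hosts into one timeline after correcting for each host's known clock error. The function name and data layout are assumptions for illustration; in practice you would fix the clocks with NTP rather than correct for drift afterwards.

```python
from datetime import datetime, timedelta, timezone


def merge_logs(sources):
    """Merge per-host log events into one timeline ordered by corrected UTC time.

    `sources` maps a host name to (clock_offset, events), where clock_offset is
    how far that host's clock runs ahead of true time, and each event is a
    (timestamp, message) pair with a timezone-aware timestamp.
    """
    timeline = []
    for host, (clock_offset, events) in sources.items():
        for stamp, message in events:
            corrected = stamp.astimezone(timezone.utc) - clock_offset
            timeline.append((corrected, host, message))
    return sorted(timeline)
```

If one host's clock runs 30 seconds fast, an event that looks like a consequence of a failure may actually be its cause; the corrected timeline makes that visible.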
Prepare the Humans
To prepare for an outage, you need all of your staff ready to help. Too many organizations depend on their most senior engineers for outage troubleshooting, but this is a big mistake. You need your full staff prepared to triage and analyze hard problems.
Training should focus on how to determine whether each component in your network is working. Components can be servers (like www2, or the DNS server 18.104.22.168) or services (like the SIP SBC at 22.214.171.124 or the REST API at https://foo.com/api/v2/).
The 24×7 troubleshooting staff should have:
List of supported products. (Supported means that if it breaks, we have to fix it.)
Diagrams of how the product works, for purposes of understanding dependencies.
List of components involved in each product
Method for testing each component. E.g.,
REGISTER with SIP to sip:firstname.lastname@example.org and make a call to sip:+email@example.com;user=phone
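For components that listen on TCP, one very basic test method can be sketched in Python as below. The function name is hypothetical, and this only confirms that something accepts a TCP connection on the port; it does not replace a protocol-level test such as a SIP REGISTER or an HTTP request against the API.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A check this simple is something every member of the 24×7 staff can run and interpret, which is the goal of the component list.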
(a) tcpdump is more efficient than tshark at raw writing to disk; e.g.,
tcpdump -s 1514 -i eth2 -w file.pcap
will tend to capture more than a similar tshark command.
(b) A busy Linux box, or a high packet rate, will lose some data because tcpdump or tshark is not running all the time. You can run tcpdump at a higher priority with the “nice” command and a negative nice level:
nice --adjustment=-10 tcpdump -s 1514 -i eth2 -w file.pcap
(c) Sometimes the disk system just cannot keep up with the rate of traffic, and the disk buffers aren’t large enough. Rather than tuning kernel disk buffers, you can make a ramdisk. This example checks to see there’s about 1660 MB of RAM doing nobody any good, makes a 1000 MB ramdisk using the “tmpfs” filesystem feature, and writes a big capture to it.
[root@sniffer /]# free -m
             total       used       free     shared    buffers     cached
Mem:          2010       1748        262          0        159       1239
-/+ buffers/cache:        349       1660
Swap:         3999          0       3999
# mkdir /tmp/ramdisk
# mount -t tmpfs -o size=1000m tmpfs /tmp/ramdisk/
# nice --adjustment=-10 tcpdump -s 1514 -i eth2 -w /tmp/ramdisk/ecg_sniffer_eth2_20180321.pcap
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 1514 bytes
1199062 packets captured
1199075 packets received by filter
0 packets dropped by kernel
ECG would be glad to help with your Voice and Video Network Engineering, and 24×7 customer support. Ping us to learn more.
“DNS is one of those things that gets overlooked… You make your voice servers super-redundant, but you take it for granted that DNS will always work.” — Fonality
The Domain Name System (DNS) conventionally helps devices convert domain names, like VoIPCarrier.com, to IP addresses, like 126.96.36.199. But in Voice and Unified Communication (UC) services, it serves an additional key role in fault tolerance, and enabling encryption. For large networks, it can also assist in balancing load across many data centers.
Background: In typical Voice Network implementations, the UC client or SIP phone is configured to make requests to its local “recursive caching” DNS server. (So called “recursive”, because it makes multiple steps to arrive at the answer, and “caching” because it stores the resulting answers for some time.) This server consults with the Root and Top-Level-Domain (TLD) servers that handle domains like .com. Those TLD servers direct the DNS client’s recursive caching DNS to the authoritative DNS server for the Voice/UC provider, and that authoritative server provides the IP addresses, failover addresses, load balancing settings, and encryption settings for the UC or Voice Service.
There are some common variations on this theme:
1. Voice & UC Operators that don’t use the Internet. For Cable operators, “Multiple System Operators” (MSOs) and Fiber to the Premises (FTTx) Providers, DNS can be used, but it doesn’t use the Root Servers or TLDs. These operators cannot provide their service across the Internet.
2. Operators that don’t use DNS at all. Some networks are built with the IP addresses configured in the devices, so no DNS is ever used. As 2600hz argues, this strategy can mean that inevitable IP address renumbering becomes extremely expensive and can create outages for customers. “This can be dangerous if you have customers who provision their own phones and, later on, you move data centers or change IP blocks for some reason and have no way to change the IPs of those phones.”
Nine Common Vulnerabilities Most Voice & UC Providers Are Exposed To
I. SIP Phones & Clients depend on their local DNS servers.
The most common mistake is that SIP Phones and Desktop UC Clients are relying on the local DNS servers in their data networks. For some users, this is the outdated Windows server in the closet, but it’s often the underpowered DNS software provided in the local router by the Internet Service Provider (ISP).
The risk is that when this local, underpowered, or under-managed DNS server malfunctions, the Voice/UC service fails. The Voice/UC provider is responsible for important features, like the ability to make emergency phone calls, or conduct business (video calling, desktop sharing, messaging, etc.), but all of those services fail where the local DNS server fails.
Key Remedy Options:
Option A: Configure the Voice/UC Clients to use DNS servers provided by the Voice/UC Provider itself.
Option B: Configure the Voice/UC Clients to use reliable DNS servers. For example, you may configure the clients to use Google (8.8.8.8, for example) and Level(3) (4.2.2.2, for example).
II. Depending on a single organization to run the DNS server for authoritative service.
Any one DNS server on the Internet can be attacked and overloaded using botnet attacks. Even most organizations that offer DNS services can be overwhelmed. History shows that attacks tend to be launched against individual DNS operators. (Note that the Global Root DNS servers are under continuous attack, and have defenses that are effective so far.)
Option A: Buy DNS hosting from multiple (Authoritative) hosting companies, confirming that they’re not using the same underlying infrastructure.
Option B: Run your own DNS servers, but also use another company’s authoritative DNS servers. To be effective, the systems should not be synchronized automatically. Some human review should be required to make the change in each place.
III. Failing to use locally cached answers in case all DNS fails.
Some SIP Phones can store a cache of DNS records locally, provided to the device by the configuration file. For example, the Polycom UC SIP devices support these configurations, and only use them if no DNS replies are returned. The Polycom implementation supports basic A records, as well as the advanced records, SRV and NAPTR, used for load balancing and encryption management.
Yealink SIP Phones have a Static DNS Cache supporting A, SRV, and NAPTR records that can be configured either as the preferred source of DNS settings or as a backup.
Software vendors who provide desktop and Mobile software should build in this capability, so that the configuration file delivered to the software has a DNS cache backup in it.
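For software vendors, the fallback behavior described here might look like the following Python sketch. The function and parameter names are hypothetical, and the static cache is assumed to arrive in the client's configuration file; the sketch follows the "use the cache only if live DNS gives no answer" behavior.

```python
def resolve_with_fallback(name, live_lookup, static_cache):
    """Try live DNS first; fall back to the provisioned static cache on failure.

    `live_lookup` is any callable that returns a list of addresses or raises
    OSError; `static_cache` maps names to lists of addresses.
    """
    try:
        answers = live_lookup(name)
        if answers:
            return answers
    except OSError:
        pass  # treat resolver errors the same as "no DNS replies"
    return static_cache.get(name, [])
```

Passing the lookup in as a callable keeps the sketch independent of any particular resolver library, so the same logic could wrap A, SRV, or NAPTR queries.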
IV. Too short of Time-To-Live Cache Lifetimes.
Every DNS record has a “Time To Live” (TTL) setting, which specifies how long caching servers should store the record.
Large Values, such as four hours (which would be configured as 14400 seconds), have the advantage of being stable. If there’s an interruption in access in the DNS system, these ensure that the old records will stay in the cache for some time. But the downside is that when changes are needed, the operator may have to wait four hours for the change to go into effect.
Small Values, such as 60 seconds, have the advantage of allowing relatively quick changes. For example, if a provider learns that their primary Session Border Controller (SBC) site is having problems, they can make a change, and expect all DNS requests within 60 seconds to get the new value and route requests to the new location. But small values have a downside: if there’s a short interruption in the client’s access to the DNS servers, then the clients will lose all old records.
Large TTLs don’t provide complete protection against DNS failures. Note that even with a large TTL, you should expect that records will begin expiring immediately upon failure of your authoritative DNS service, because each cache fetched its copy at a different time. For example, if you have a TTL of four hours and an outage occurs at 12:00pm, then (assuming the caches were refreshed at evenly spread times) by 1:00pm approximately 25% of your users’ caches will have expired, and by 2:00pm half of your users are offline.
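The arithmetic behind that example can be written as a one-line model in Python. It rests on a simplifying assumption that is not guaranteed in practice: that cache refresh times were spread evenly across the TTL window before the outage began.

```python
def fraction_expired(elapsed_seconds, ttl_seconds):
    """Fraction of caches whose record has expired, assuming refresh times
    were spread evenly over the TTL window before the outage began."""
    if ttl_seconds <= 0:
        return 1.0
    return min(max(elapsed_seconds / ttl_seconds, 0.0), 1.0)
```

With a TTL of 14400 seconds, one hour of outage gives 0.25 and two hours gives 0.5, matching the figures above.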
Key Remedy Options:
Option A: Use a relatively large DNS TTL value, such as four hours, for stability.
Option B: Use a large DNS TTL, but change it to a smaller number in the hours before a scheduled change.
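For Option B, the timing matters: the smaller TTL only helps once every cache has dropped the old long-lived record. A minimal Python sketch of that calculation, with a hypothetical function name, is:

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(change_time, old_ttl_seconds):
    """Latest moment to publish the smaller TTL so that every cache holding
    the old, long-lived record has expired it before the scheduled change."""
    return change_time - timedelta(seconds=old_ttl_seconds)
```

For a change planned at 06:00 with a four-hour TTL, the smaller TTL would need to be published by 02:00 at the latest; in practice you would probably want extra margin.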
V. All DNS Servers in one BGP prefix or BGP Autonomous System
BGP is the system used by large organizations to configure the routes for IP traffic. An IP Prefix, such as 220.127.116.11/21, specifies the routing path for 2,048 IP addresses.
But because BGP prefixes are subject to attack and downtime too, you don’t want to have all of your recursive caching servers, or all of your authoritative DNS servers, in the same BGP prefix advertisement.
(For the same reasons, you also don’t want to have all of your Session Border Controllers (SBCs), or all of your Load Balancers, or all of your Production Firewalls in the same BGP prefix if you can avoid it.)
Hijacking one BGP prefix is generally easier than hijacking more than one, so by having a diverse set of BGP prefixes, you make this attack more difficult and improve the robustness of your network.
A rarer attack that is still possible is hijacking of a BGP Autonomous System Number (ASN). The ASN is used in Internet Routing to identify a large company or Internet Service Provider. An ASN attack would allow an attacker to interrupt BGP traffic for an entire ISP or large enterprise.
Key Remedy Options:
Option A: Ensure that your DNS servers (both authoritative and the recursive caching servers used by client devices) are from at least two different BGP prefix advertisements. Typically this is easy to check because the IP addresses begin with different digits.
Option B: Ensure that DNS servers are advertised via two different Autonomous System Numbers (ASN) in BGP. This can often be accomplished by buying authoritative and recursive-caching service from an additional provider.
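Comparing leading digits is a quick heuristic; a more careful check for Option A can be sketched with Python's standard ipaddress module. The function names are hypothetical, and the list of advertised prefixes is assumed to come from your own routing data (for example, a looking glass or your BGP configuration).

```python
import ipaddress


def covering_prefixes(server_ips, advertised_prefixes):
    """Map each server IP to the advertised prefix that contains it (or None)."""
    networks = [ipaddress.ip_network(p) for p in advertised_prefixes]
    result = {}
    for ip in server_ips:
        addr = ipaddress.ip_address(ip)
        match = next((n for n in networks if addr in n), None)
        result[ip] = str(match) if match else None
    return result


def has_prefix_diversity(server_ips, advertised_prefixes):
    """True when the servers fall in at least two different advertised prefixes."""
    found = set(covering_prefixes(server_ips, advertised_prefixes).values())
    found.discard(None)
    return len(found) >= 2
```

The same module confirms the address count mentioned earlier: ipaddress.ip_network("10.0.0.0/21").num_addresses is 2048.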
VI. Depending on a single link to the Internet for your DNS authoritative service.
Any one Internet link is subject to attack and overload. One of the most common forms of Internet Attack is a Distributed Denial of Service (DDoS) attack, where millions of vulnerable devices are used to blast traffic at a single destination. DDoS mitigation is an active industry, with a number of firms providing tools and services to fight DDoS attacks.
Arbor Networks is one of the leading firms providing equipment built to fight DDoS attacks. Akamai and AT&T offer services to clients to help fight DDoS attacks.
However good all these services are, while the DDoS attack is in early stages, access to the DNS servers, Session Border Controllers, and Load Balancers will be interrupted for some time.
Many operators have redundant Internet links to provide fault tolerance. In the normal implementation, all Internet links are set up to be capable of handling all of the traffic. However, this means that every DDoS attack will hit every link simultaneously.
Key Remedy Options:
Option A: Separate traffic for each site, so that you have distinct IP blocks at each site. For example, your “Eastern Site” may advertise 18.104.22.168/21 and your “Western Site” may advertise 22.214.171.124/21.
Option B: For the classic form of IP redundancy you may want the same IPs reachable through both sites. Use Option A, but also advertise a common block through both providers.
VII. Depending on a single DNS registrar.
The DNS registrar is responsible for loading domain name information into the root and Top-Level Domain (TLD) servers. For example, if you own VoIPCarrierX.com, you’ll pay a registrar, like Register.com, GoDaddy, or EasyDNS a small annual fee to keep your data loaded in the root name servers.
But every registrar is susceptible to attack or malfunction. For example, in 2016, the DNS Registrar for New York Times was attacked, leading to an outage for the news site.
Use Multiple Registrars; this will necessitate multiple domain names.
VIII. Depending on a single Top Level Domain (.com, or .net)
As discussed above, if a DNS registrar is attacked, then the attackers can control the access to the domain name. But most of the Top Level Domains (TLDs) are managed separately from one another, and it is conceivable that an entire Top Level Domain could be attacked.
For example, the servers that handle .com are different from the servers that handle .net. Each country’s top level domain, like .us (for the United States) and .br (for Brazil), is also distinct.
By using different Top Level Domains, you can protect your services against excessive dependence on any one TLD server. For US-based companies, .com and .net are distinct domains, but there may be value in finding more diversity outside a single nation’s DNS management.
Voice/UC services have a distinct advantage here: users are not normally aware of the domain name in use. Google cannot easily use multiple domain names: everyone expects “gmail.com” to stay put at that name. But SIP phones and UC clients can be configured to use multiple domain names.
This is an important capability at the heart of many of the options here. To use multiple domain names, the client has to be capable of failing over.
Use Multiple Domain Names from distinct Top-Level Domains.
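The client-side failover that this remedy depends on can be sketched in Python as follows. The function name and the example domains are hypothetical, and try_connect stands in for whatever registration or connection attempt the client actually makes.

```python
def first_reachable(domains, try_connect):
    """Return the first domain for which try_connect(domain) succeeds, else None.

    `try_connect` is any callable returning True on success; list the domains
    in priority order, ideally under different top-level domains.
    """
    for domain in domains:
        try:
            if try_connect(domain):
                return domain
        except OSError:
            continue  # resolution or connection failure: try the next name
    return None
```

Called with a list such as ["sip.example.com", "sip.example.net"], the client would move on to the .net name only when the .com name cannot be resolved or reached.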
IX. Voice & UC Clients Not Verifying Your Servers with TLS
TLS (formerly known as SSL) has an important capability of validating that the server used is the legitimate and intended server. For example, if your SIP phone is configured to connect to VoIPCarrierX.net, then the SIP phone can validate that, once it is finally connected to some IP address (say, 126.96.36.199), the server provides it a certificate that is cryptographically signed by a key that the phone already trusts. The list of trusted parties is stored in the “Root Chain” on the TLS client; if the SIP phone has been configured to trust, say, Verisign, then the certificate for the server could be signed by Verisign, identifying the server as legitimate for VoIPCarrierX.net. TLS also has a mechanism so that the client confirms mathematically that the server actually holds some internal secret information (a long private key), proving it is truly the owner of the signed certificate. StackExchange has a nice introduction to this technology.
This means that after the entire DNS process has played out, using all the techniques above, the SIP phone or UC client can be built to verify that the server it finally reached was legitimate. This is a key requirement so that, if the DNS protections put in place fail, the phone will refuse to talk to an illegitimate hijacker.
Key Remedy Option:
Use SIP over TLS, and verify the Server Certificate.
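To make the remedy concrete, here is a minimal sketch using Python's standard ssl module. It is not taken from any particular phone or UC client; the function names are illustrative, and port 5061 is assumed as the usual SIP-over-TLS port. The important part is that both certificate validation and hostname checking are switched on, so a hijacked DNS answer leads to a refused connection rather than a conversation with an impostor.

```python
import socket
import ssl

def make_verifying_context() -> ssl.SSLContext:
    """Build a client-side TLS context that insists on a valid server certificate."""
    ctx = ssl.create_default_context()   # loads the system's trusted root certificates
    ctx.check_hostname = True            # certificate must match the name we dialed
    ctx.verify_mode = ssl.CERT_REQUIRED  # refuse servers without a valid certificate
    return ctx

def connect_sip_tls(host: str, port: int = 5061) -> ssl.SSLSocket:
    """Open a verified TLS connection; raises ssl.SSLCertVerificationError otherwise."""
    raw = socket.create_connection((host, port), timeout=5)
    try:
        return make_verifying_context().wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()  # do not leak the TCP socket when verification fails
        raise
```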
DNS can be an effective and powerful way to manage Unified Communications services, but it does expose the operator to a number of attack methods. Smart Service Providers are taking the challenge seriously, implementing a variety of techniques to introduce diversity and genuine redundancy, including BGP, TLS, and multiple Top Level Domains.
Careful testing is crucial for stability before launching new SIP Access Device models
The Product Definition must be baked into the Test Plan used to approve the device, & new software for it
Customer Technical Support teams need extensive early access to devices before they are deployed
Space Probes are designed to be sent far away, beyond our reach. They run software, and send back data to Home Base. If a Space Probe dies, it may never call home, and there’s not much we can do to fix it from here.
For Voice Operators, the SIP Phones, IADs, User Agents, and soft-phone clients are a bit like Space Probes: Once you deploy them, you can’t go touch them. If they never “phone home,” you may be out of luck. So you have to plan well to make SIP clients work properly, before they’re launched.
Planning the Launch
The Challenge. There you are, in charge of making the amazing new Hologram SIP phone work on your platform. The Product and Marketing folks LOVE this new phone, especially the way it integrates with your clients’ PCs and does video calls via Google Hangouts and Apple Facetime. And it’s your job to integrate the device into your BroadSoft BroadWorks network.
Avoiding Device Management Disasters
You’ve known about the SIP device disasters:
…when your Support Staff got calls from customers using the Yealink T-58V phone before they had ever heard of it
…and when the Engineering team had a week to roll out video calling on all the Polycom VVX phones, because customers were buying cameras
…or when the CEO of your company got the latest Mitel 6869 phone, and read that the phone could show him his company directory in Microsoft Exchange, but then it didn’t work.
…and who could forget when you upgraded 1000 phones to discover the new firmware required a different file format?
BroadWorks for Device Management?
BroadWorks provides some mature tools for managing new device types, including these key features:
The BroadWorks system owner or operator can add a new device type at any time, without BroadSoft’s assistance.
BroadWorks records many key SIP parameters of the device type.
BroadWorks stores the files like software and images, for delivery to the device.
BroadWorks automatically generates configuration files as changes are made to the device’s users.
BroadWorks system owners can define custom tags to handle new features and situations. Example: Most phones don’t support Apple Facetime, but if you have one phone with a FaceTime gateway, you can define “%FACETIME_GATEWAY%” as a custom tag in BroadWorks. Then BroadWorks can replace that tag macro with the appropriate value for each customer.
BroadWorks can deliver the files to the device with HTTP or HTTPS, authenticating the device by username/password or SSL client certificate.
BroadWorks can signal the device to download the new configuration file.
Key Milestones for Productizing a SIP Access Device
Decide: Which of the Features will you support?
Prototyping & Testing: Do the Features Work?
Management: Can you Control It?
Support: Can you help users use it?
Features: You Can’t Support Them All
Every SIP phone comes with a bazillion features, but no one network can support them all. You have to decide which features you’re going to try to support. Some of the top features are:
Ordinary calling with Caller ID
“HD Voice,” usually using G.722
Power over Ethernet
LLDP-MED for automatic Ethernet VLAN selection
Busy Lamp Field (BLF) / Line State Monitoring
Shared Call Appearance / Shared Lines
Programmable “Soft Keys”
Customizable Dial Plans (so the phone knows when you’re done dialing)
Bluetooth headset support
USB headset support
Multi-party Video Calling
Multicast Paging (where the speakerphone plays announcements sent to a group of phones)
3-Way and N-Way Conferencing (where a large number of people can be conferenced together using the conference button)
Customizable Contact List
LDAP or Microsoft Active Directory or Exchange directory access
Customizable background logos
…But many will try
But there are exotic features on every phone. Are you going to help your customers if they contact your help center asking about these?
Practically, new phones are far too complex to support every feature. Choose the features you actually intend to fully, thoroughly support.
Prototyping and Testing: Prove Those Features Work
Once you’ve chosen your features, you can develop prototype templates so that BroadWorks can generate configuration files for that phone. In most cases, in the BroadSoft community, responsible vendors will post prototype templates with example configuration files to BroadSoft’s site, Xchange. These configuration file samples can be in one of the categories, “Interoperability – CPE Kit” or “Interoperability – Configuration Guide”.
Essential Parts of an Identity/Device Profile Type
An Identity/Device Profile Type (IDPT) is a specific model of device, such as a SIP phone.
The key parts are:
Profile: Identity/Device Profile Type (IDPT) Profile
File Templates: Identity/Device Profile Type (IDPT) Files & Authentication
In the IDPT Profile, you can set key settings that apply to every device of that type, such as:
What should the SIP Request URI look like?
Does the device support RTP early media?
Can the device register?
Does it support HTTP authentication?
How does it handle Privacy headers?
Ideally, your vendor will specify these settings exactly.
For each IDPT, BroadWorks allows you to upload any number of templates and static files. Templates can be translated by an elaborate process of search-and-replace, and are usually used for configuration files. Static files are typically firmware/software and graphic images.
An excerpt from the template BWDEVICE_%BWMACADDRESS%.cfg is shown below, for the device type “Polycom VVX 500”. When a “Polycom VVX 500” phone is defined on the system, and assigned the MAC address 004fb2012345, then the file BWDEVICE_004fb2012345.cfg would be created.
As shown in this example, the BroadWorks-standard tags begin with “%BW”, including
%BWLINEPORT-1%, the SIP Address of Record (AoR), user portion
%BWHOST-1%, the SIP domain for the AoR
%BWEXTENSION-1%, the extension that can be dialed to reach the user within the user’s organization
But you can define any number of other tags, which BroadWorks can track and manage. In the example above, you see %CONFERENCE%, which BroadWorks can replace with a value you specify.
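The search-and-replace idea behind templates can be sketched in a few lines of Python. This is not how BroadWorks is implemented, and the template line and values below are hypothetical rather than taken from a real Polycom file; the sketch only shows each %TAG% being swapped for a per-device value, with unknown tags left untouched.

```python
import re

# A tag is a name wrapped in percent signs, e.g. %BWHOST-1% or %CONFERENCE%.
TAG = re.compile(r"%([A-Z0-9_-]+)%")

def render(template: str, values: dict) -> str:
    """Replace each %TAG% with its value; tags with no value are left as-is."""
    return TAG.sub(lambda m: str(values.get(m.group(1), m.group(0))), template)

# Hypothetical template line and values, for illustration only.
template = 'user="%BWLINEPORT-1%" domain="%BWHOST-1%" conference="%CONFERENCE%"'
values = {
    "BWLINEPORT-1": "2293161002",
    "BWHOST-1": "voip.example.net",
    "CONFERENCE": "conference@voip.example.net",
}
print(render(template, values))
```

Leaving unknown tags in place, rather than blanking them, makes a missing tag value easy to spot when you review a generated file.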
A Tag Set can be created for each IDPT, with special settings appropriate for that device type. These can be set at the system level, and then overridden at the enterprise, service provider, group, or device levels of the hierarchy.
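The override behavior can be pictured as a merge from the broadest level to the most specific. The following is a sketch under that assumption, with hypothetical level names and tag values; it does not describe BroadWorks internals.

```python
# Levels listed from broadest to most specific; later levels win.
LEVELS = ["system", "enterprise_or_service_provider", "group", "device"]

def resolve_tags(tag_sets: dict) -> dict:
    """Merge tag values so that more specific levels override broader ones."""
    merged = {}
    for level in LEVELS:
        merged.update(tag_sets.get(level, {}))
    return merged

# Hypothetical values: one device overrides the system-wide firmware file.
example = resolve_tags({
    "system": {"FIRMWARE": "sip-1.56.1.bin", "CONFERENCE": "conference@voip.example.net"},
    "device": {"FIRMWARE": "sip-1.56.2.bin"},
})
```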
When prototyping your new device, you adjust the configuration files to implement the features in your particular network. You do this by revising the templates, and adjusting the tag values to work in your environment.
The Test Plan is Product Definition
The test plan you define is the definition of the product. If it’s an important part of the product, then you must put it in the test plan.
This test plan will be the Regression Test Plan later. It should be written so that all of the tests pass at the point the product is approved.
In a Test Plan, you have a “Device Under Test” (DUT) and tell the tester precisely what steps to take, and what to expect. You have to be specific: define the specific features and settings in your BroadWorks lab.
Record the actual results so that you can review and analyze them later. The person who’s doing testing, the Test Engineer, should record subtleties of behavior.
Answer the question What Did We Learn, because in each test, you’re discovering something about the system.
Example Test Environment
Device Under Test, DUT: The Apple Blackhole Phone Model 1
User built in BroadWorks as 229-316-1002
BLF monitoring on position 1 for 229-316-1001
Lab Phone 1: Polycom VVX 500 at 229-316-1000
Lab Phone 2: Yealink at 229-316-1001
Cell Phone 1: iPhone X at 919-559-6000
Example Test Plan
Test Case ID: TC01
Feature: Call with Caller ID
Detailed Procedure: On DUT, while the handset is on hook, press the Speakerphone button, and dial 919-559-6000. Do not press “Send”.
Expected Results: The 919-559-6000 cell phone should ring, and you should see the caller ID of the DUT, 229-316-1002. Answer on the cell phone, and confirm that you get two-way audio. Hang up on the cell phone, and confirm that the call is disconnected on the DUT.
Actual Results: _______________________________________
What Did We Learn? _______________________________________
Test Pass? TRUE / FALSE
Test Case ID: TC02
Feature: Busy Lamp Field – monitoring
Detailed Procedure: On DUT, watch Busy Lamp Field Position 1. Place a call on test phone 229-316-1001 to 919-559-6000. Check Expected results, then disconnect by hanging up on 229-316-1001.
Expected Results: When you can hear ringback on the 229-316-1001 device, you should see the BLF indicator in position 1 on DUT. When the call is answered on cell phone 919-559-6000, the BLF indicator should change to the red bar on position 1 within 1 second. When you hang up on 229-316-1001, the BLF indicator on DUT should return to black within 1 second.
Actual Results: _______________________________________
What Did We Learn? _______________________________________
Test Pass? TRUE / FALSE
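If you keep the test plan in a spreadsheet or a small tool, each test case maps naturally onto a record. Here is one possible Python sketch; the class and field names are my own invention, mirroring the fields in the example above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PlanCase:
    case_id: str
    feature: str
    procedure: str
    expected: str
    actual: str = ""                # filled in by the Test Engineer
    learned: str = ""               # answer to "What Did We Learn?"
    passed: Optional[bool] = None   # None until the test has been run

def regression_ready(cases: List[PlanCase]) -> bool:
    """True only when the plan is non-empty and every case has passed."""
    return len(cases) > 0 and all(c.passed is True for c in cases)

tc01 = PlanCase("TC01", "Call with Caller ID",
                "On DUT, press Speakerphone and dial 919-559-6000.",
                "Cell phone rings and shows the DUT's caller ID.")
```

Treating an unrun test as "not passed" matches the idea that the plan becomes the Regression Test Plan only once every test passes.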
Management: Can You Control It?
Once you’ve proven that the core features work, you need to prove you can control it.
Prove you can replace the software. The simplest, naive way is to replace the firmware file, and trigger phones to reboot. This can cause an unlimited number of phones to retrieve the new software at the same time.
Instead, most operators wish to selectively upgrade a few phones at a time. You can do this with a custom tag, such as %FIRMWARE%, that specifies the filename, such as “sip-1.56.2.bin”. Then you can change the tag on specific phones that should have the new version of software.
ECG Alpaca can be used to selectively update tags on devices within a BroadWorks platform. This is routinely used to upgrade specific devices, e.g., 1000 devices per maintenance window.
Alpaca can also selectively send the command to the SIP phones to reset them, and then monitor to confirm that all phones reboot properly.
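The planning side of a staged upgrade is simple to sketch. The Python below does not call Alpaca or BroadWorks (no API from either is shown or implied); it only splits a device list into maintenance-window batches and records which %FIRMWARE% value each device should receive. The device names are hypothetical.

```python
from typing import Dict, Iterator, List

def maintenance_batches(devices: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size devices."""
    for start in range(0, len(devices), batch_size):
        yield devices[start:start + batch_size]

def plan_upgrade(devices: List[str], new_firmware: str,
                 batch_size: int = 1000) -> List[Dict[str, Dict[str, str]]]:
    """Return one {device: {"FIRMWARE": file}} mapping per maintenance window."""
    return [{d: {"FIRMWARE": new_firmware} for d in batch}
            for batch in maintenance_batches(devices, batch_size)]

# 2500 hypothetical devices become three windows of 1000, 1000, and 500.
plan = plan_upgrade([f"dev{i}" for i in range(2500)], "sip-1.56.2.bin")
```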
All software generates logs. A manageable SIP access device puts the logs somewhere useable. With BroadWorks, you can have the device upload its logs, and BroadWorks can retain those logs for viewing on the Profile Server.
Alternately, some operators configure devices to send logs to a central log collector via syslog.
New Device Procedures.
When onboarding a new device, be sure to specify the requirements for new devices. What should happen to a new device from the manufacturer?
But this means that a new-in-box phone cannot be sent directly to the end user, without the user’s intervention in the process.
Some vendors, including Polycom and Yealink, offer a redirection service. With these, a new or factory-defaulted phone with Internet service can reach out to the phone manufacturer to get a URL for the voice operator. For example, the phone 0004b2012345 could connect to Polycom and be told to go to https://xsp.voipcarrier.co/dms/vvx600. This works only because the voice operator has registered that MAC address, and the URL, with Polycom.
Support: Enabling yourself to do a great job
The final step before rollout of a new device type is ensuring your Customer Support department has adequate access and experience.
Provide Customer Support with the device to use. It’s remarkably common for operators to neglect to give their support folks experience on the devices that customers have. This leads to stress and poor customer service.
Have Customer Support run through the Test Plan to be sure they understand all of the supported features and settings. Since the Test Plan encapsulates the entire product definition, this means that Customer Support will know how to do it well.
The consulting firm, ECG, offers Voice Carrier technical consulting, including the proper rollout of new device types, and support of Voice access devices in BroadWorks and other multi-vendor environments. www.ecg.co, firstname.lastname@example.org
IP Fragmentation of SIP Messages is an enduring source of trouble.
Fragmentation of SIP traffic is a problem on the rise. It appears when everything has been working fine, and seemingly without cause, some SIP messages are lost in the network. The result is a frustrating scenario where some SIP messages are delivered fine, but others are not.
To explain SIP fragmentation, let’s start at the beginning: Layer 2. Every link on an internet has a Maximum Transmission Unit (MTU), which determines the maximum size, in bytes, of a packet that can traverse the link. For Ethernet, this is often 1500 bytes. This means that no single packet larger than 1500 bytes can be carried in one frame across a standard Ethernet network. The duty of the Ethernet interface is to transmit only frames that meet this standard.
However, many applications need to send more data than this in a message. So the Operating System must accommodate both the application’s need for large messages, and the network’s requirement to send packets of a limited size. How is this done?
Fragmentation is fairly cheap for the fragmenter, but reassembling the fragments when they arrive is a fairly expensive operation. Simon Dredge (Metaswitch) has discussed the computational costs of re-assembling UDP fragments, arguing that the re-assembly should be done in a specialized kernel module of a Session Border Controller, rather than where the user applications run:
[T]he receiver has the tricky job of taking these seemingly miscellaneous packet fragments, deciphering them from other packets or packet fragments arriving simultaneously and piecing them back together – somewhat akin to a jigsaw puzzle – but without the aid of the picture on the lid of the box. Naturally, this process takes memory resources to store packet fragments, while waiting for their counterparts to arrive, then processing cycles to compile them . . . If fragmented packets are not successfully reassembled in a timely manner, then a retransmission will be requested or initiated, thereby further compounding the reassembly issue.
When a large message is fragmented, the separate fragments travel as separate IP datagram packets through the network. It’s possible for any one of those to be lost, but if one fragment is lost, IP has no mechanism to detect that and recover. The Internet Protocol software merely discards all the other fragments at the receiver. It depends on something else to retransmit the entire message again, on the hope that all fragments will be delivered.
Consider this analogy: it is like posting five boxes for shipment with no insurance. If four of them arrive, the mailman doesn’t track down the fifth box. The recipient must determine that a box is missing and ask for its contents to be replaced.
Worse than losing a packet, the loss of a single fragment of a message wastes all the network bandwidth (capacity) that was used sending the remaining fragments. They’re useless.
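A little arithmetic shows why this matters. Assuming each fragment is lost independently with the same probability (a simplification; real loss is often bursty), the whole message is lost whenever any one fragment is lost:

```python
def message_loss_probability(fragments: int, fragment_loss: float) -> float:
    """Chance that at least one of the fragments is lost, assuming independent loss."""
    return 1.0 - (1.0 - fragment_loss) ** fragments

# With 1% loss per fragment, an 8-fragment message fails far more often
# than a single-packet message does.
single = message_loss_probability(1, 0.01)
eight = message_loss_probability(8, 0.01)
```

Under these assumptions the 8-fragment message is lost roughly 7.7% of the time, compared with 1% for a message that fits in one packet.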
TCP, which is built on top of IP, typically does not use IP fragmentation. Instead, TCP has segmentation. TCP segmentation is optimized for the case of lost segments; when one is lost, TCP slows down the transmission, and retransmits the missing segment.
Fragmentation and SIP
SIP is usually used over UDP. When the SIP messages are small, this is no problem. In fact, for normal phone calls (e.g., SIP on PSTN gateways and SIP trunks), individual SIP messages almost always fit in a single UDP message, well under 1500 bytes, and therefore no fragmentation occurs at all (See Figure 3).
But as services become more sophisticated, the size of SIP messages grows. In particular, “Busy Lamp Field”, also known as “Line State Monitoring”, will often be responsible for sending large data sets via SIP messages. To receive the status information on the 11 people monitored on my Polycom VVX 600, my phone receives around 11,000 bytes of data in a single SIP NOTIFY message. If I were using SIP over UDP, that would take 8 IP fragments.
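The fragment count can be checked with a short calculation. This sketch assumes IPv4 with a 20-byte header and no options, an 8-byte UDP header, and a 1500-byte MTU; every fragment except the last must carry a payload that is a multiple of 8 bytes.

```python
import math

def udp_fragments(sip_bytes: int, mtu: int = 1500,
                  ip_header: int = 20, udp_header: int = 8) -> int:
    """Number of IPv4 packets needed to carry one SIP message over UDP."""
    datagram = sip_bytes + udp_header          # the UDP header rides in the first fragment
    per_fragment = (mtu - ip_header) // 8 * 8  # fragment payloads are multiples of 8 bytes
    return max(1, math.ceil(datagram / per_fragment))

notify_fragments = udp_fragments(11000)
```

With these defaults, any SIP message larger than 1472 bytes no longer fits in a single packet.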
Even basic fragmentation has been a problem for some SIP systems. Back in 2009, Eric Hernaez (SkySwitch) reported that a “major vendor’s” switch was crashed by SIP fragments. But today, most SIP/UDP platforms support fragmentation with some reliability.
There are some approaches for reducing SIP message size. Alex Balashov (Evariste Systems) describes the challenges of reducing SIP header size in a pure-proxy system, but suggests some SIP headers you can remove in many cases. In 2009, Thomas Gelf suggested dropping unnecessary codecs, and using SIP compact header forms, like “m” instead of “Contact”.
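To give a feel for the compact-form approach, here is a small Python sketch that rewrites header names to the single-letter forms defined in RFC 3261. It is an illustration only: it leaves folded (continuation) lines alone, does not touch the body, and the sample message is made up.

```python
# Compact header forms defined in RFC 3261.
COMPACT = {
    "call-id": "i", "contact": "m", "content-encoding": "e", "content-length": "l",
    "content-type": "c", "from": "f", "subject": "s", "supported": "k",
    "to": "t", "via": "v",
}

def compact_headers(message: str) -> str:
    """Rewrite long header names to compact forms; the body is left unchanged."""
    head, sep, body = message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    out = [lines[0]]  # the request or status line is never rewritten
    for line in lines[1:]:
        if line[:1] in (" ", "\t"):   # folded continuation line: keep as-is
            out.append(line)
            continue
        name, colon, value = line.partition(":")
        short = COMPACT.get(name.strip().lower())
        out.append(f"{short}:{value}" if short and colon else line)
    return "\r\n".join(out) + sep + body

sample = ("INVITE sip:bob@example.net SIP/2.0\r\n"
          "Via: SIP/2.0/UDP 192.0.2.1\r\n"
          "Contact: <sip:alice@192.0.2.1>\r\n"
          "Max-Forwards: 70\r\n\r\n")
smaller = compact_headers(sample)
```

The savings are only a few bytes per header, so this helps most when a message sits just above the MTU boundary.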
SIP Fragmentation problems spiked again in 2016, when we saw numerous incidents where SIP fragmentation contributed substantially to network problems. Networks often run fine until the SIP messages grow just enough to cross the MTU boundary.
The buffer limitations in these devices can be a real problem. The 200-fragment capacity of a Cisco ASA5500 can easily be overwhelmed by normal traffic of thousands of SIP phones. (Add this as a reason data firewalls create trouble with large-scale VoIP deployments.)
Session Border Controllers and Fragmentation
Some versions of the market-leading Session Border Controller, the Oracle Acme Packet SBC, have limitations on handling fragments. The “Traffic Manager” between the Network Processors and the CPU controls the rate at which IP fragments are delivered from the network to the CPU in the popular 4250, 3800 and 4500 platforms. Terry Kim has a great depiction of the traffic manager for the Oracle Acme Packet SBC. He highlights that the Oracle Acme Packet SBC handles fragmented SIP as “untrusted” traffic, so if trusted endpoint devices, like customers, are sending fragmented SIP regularly, then the SBC doesn’t provide them the same capacity that they would have if they were sending non-fragmented traffic.
The Oracle Acme Packet SBC was designed in an era that believed SIP over TCP would become more dominant, so that fragmented UDP should be rare. But the single rate limit for SIP fragmented traffic can be a real performance limitation when fragmented SIP becomes very common.
The Metaswitch Perimeta was designed much later, after the industry discovered that most SIP is still operating over UDP. The Perimeta was designed for fragmented SIP to be very common.
TCP: Segments over Fragments
The SIP standard, RFC 3261, mandates that large requests be sent over a congestion-controlled transport such as TCP, to prevent fragmented SIP. Indeed, SIP over TCP does solve many of the problems by replacing IP fragmentation with TCP segmentation.
TCP segments slice up the stream of SIP messages into neat segments that fit within MTUs. Critically, TCP provides a fast and efficient mechanism for filling in gaps in the stream. Contrast this to IP fragmentation, where the entire SIP message must be re-sent any time one fragment is lost.
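RFC 3261 (section 18.1.1) states its size rule in terms of two thresholds: a request larger than 1300 bytes when the path MTU is unknown, or one within 200 bytes of a known path MTU, must use a congestion-controlled transport such as TCP. A minimal sketch of that decision follows; the function name is mine, and the exact handling of the boundary value is my reading of the text.

```python
from typing import Optional

def choose_transport(request_bytes: int, path_mtu: Optional[int] = None) -> str:
    """Pick UDP or TCP for a SIP request using the RFC 3261 size thresholds."""
    if path_mtu is None:
        # Path MTU unknown: anything over 1300 bytes goes over TCP.
        return "TCP" if request_bytes > 1300 else "UDP"
    # Path MTU known: switch to TCP when the request is within 200 bytes of it.
    return "TCP" if request_bytes > path_mtu - 200 else "UDP"
```

In practice many deployments pin the transport per device in configuration rather than deciding per message, so treat this as a description of the rule, not of any product's behavior.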
However, a commonly-encountered problem using SIP/TCP is in limitations of Highly-Available Session Border Controllers, where the TCP synchronization and retransmission mechanism means that the state of any SIP/TCP connection may not be efficiently replicated to the standby SBC. For example, if SBC-A is active, then reboots, the SIP Phone using TCP typically has to reconnect to SBC-B. The SIP phone has to detect that the link to SBC-A has been lost. Both the Oracle Acme Packet SBC and the Metaswitch Perimeta SBC transmit a TCP RST (Reset) message when possible to notify the SIP phones that they need to re-register.
When using SIP/UDP, the failover from SBC-A to SBC-B can be a non-event: the IP address in use simply moves from one SBC instance to another because both SBC units in the pair know all of the state of the SIP registrations, subscriptions, and phone calls to all endpoints. But with SIP/TCP, the re-registration storm can be substantial, and disruptive. Imagine a network of 50,000 endpoints, all forced to re-register each time you reboot a single SBC instance.
The Fragments Are Coming
If you operate basic SIP trunking, or even some forms of today, you might not experience a lot of problems with fragmentation. But you should expect messages to grow larger, especially with integration of Fixed and Mobile networks, and the large SIP messages of IMS networks. How do you test? Send big SIP messages!
SIP messages are only getting bigger, so it is something you need to design and test for now, or it will be a major headache when it does happen. … That could involve pro-active pressure testing to find and resolve the problem areas before they bite you for real, and also monitoring largest message sizes flowing around your network so that you can predict when you’re going to start seeing more fragmentation and act accordingly.
When we teach classes on VoIP networks, we discuss the variety of SIP standards that can be used by working systems. For example, identifying the calling party varies between platforms: One system uses P-Asserted-Identity to indicate the caller, and another uses From. Then there’s the codec used for audio: One system uses G.722 and another uses G.722.2. One system expects national telephone numbers, and another system expects international-formatted numbers. The list goes on.
When you set up two SIP systems to exchange phone calls, you need to be aware of the issues. If you’re lucky, the two systems have a lot in common. If you’re not, you may have a lot to fix.
Number Formatting Interop Checklist
1. Establish format for local telephone numbers.
Not always required.
2. Establish format for national telephone numbers. E.164 With Plus is recommended, e.g., +12292442099
3. Establish format for international telephone numbers. E.164 With Plus is recommended, e.g., +12292442099
4. Determine which short codes are supported, such as emergency codes 911, 111, and 0
5. Determine what control codes are supported, such as *67
Popular Telephone Number Formats
One of the most common questions for SIP interop is how the called telephone number will be formatted. There are several popular formats, and they occur in the Request-URI (after the “INVITE”) and in the To header.
This article is intended to familiarize you with many of the common options. The software you’re using (BroadWorks, Metaswitch, Asterisk, Sonus, Genband, etc.) will have certain capabilities and defaults. If you can find a way to interop end-to-end between the two call-control servers, without having to rewrite the number in a Session Border Controller in the middle, then you avoid substantial hassle.
Complete National Number
Perhaps the most common format is the complete national number. For example, in the United States, you might see this complete national number: 9193160013.
Within the UK, you might call a London number by dialing eight digits:
INVITE sip:email@example.com:5060;user=phone SIP/2.0
Via: SIP/2.0/UDP 192.168.1.43:5060;branch=z9hG4bK6ae0c87576E0FF92
From: "UK User 1" <sip:firstname.lastname@example.org>;tag=90C2A5B1-AAFDEC86
When a human is dialing the number to place a call across the PSTN, the receiving system (INVITE User Agent Server, UAS) will need to determine the intended PSTN telephone number. One popular method is to apply the local area code of the calling party. For example, if this call is dialed by user “u2292442099”, and their US area code is “229”, then the UAS may reasonably interpret the dialed digits 3160013 as +1-229-316-0013 by prepending the country code and area code to the dialed digits.
This is a very awkward format to support between carriers or systems, and quite impractical if multiple local areas are involved.
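The prepending rule described above can be sketched as a small normalization function. This assumes a fixed-length numbering plan like the North American one (7-digit local numbers, 3-digit area codes); the function name and parameters are hypothetical, and other countries need different rules.

```python
def to_e164(dialed: str, country_code: str, area_code: str, local_length: int = 7) -> str:
    """Expand dialed digits to E.164 with a leading plus, using the caller's area code."""
    digits = "".join(ch for ch in dialed if ch.isdigit())
    if dialed.strip().startswith("+"):
        return "+" + digits                        # already international format
    if len(digits) == local_length:
        return f"+{country_code}{area_code}{digits}"   # local number: add both codes
    if len(digits) == local_length + len(area_code):
        return f"+{country_code}{digits}"              # national number: add country code
    raise ValueError(f"cannot interpret dialed digits: {dialed!r}")

local_call = to_e164("3160013", country_code="1", area_code="229")
```

The ValueError branch is where short codes such as 911 and control codes such as *67 would need their own handling.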
Dial by Extension
In PBX-like environments, “extensions” can be dialed. An extension is a locally-defined telephone number that cannot be used for routing between carriers. You can expect to see this only when the INVITE Request URI represents digits that a user is dialing personally.
Preferred: I recommend this format whenever possible, because it’s the only one with a token that signals the type of number it is: the leading Plus. By including the Plus before the number, all parties involved understand that the next digits will be the country code, followed by the national and local number.
While it’s easy to say this is preferred, it’s not the right fit for most cases where an ordinary human is dialing the call. Most keypads don’t have an obvious + dialing symbol, and users conventionally dial a shorter number.
The “ext.6100” in this example makes it incompatible with a telephone number; there’s no automatic mapping between ext.6100 and a telephone number. This is intended to route a call to the user known to the UAS with the username “ext.6100”.