Committing time to answer questions is the crucial first step

This is Part 3 in my Series on Supporting/Managing Engineers

Configuring bridging, building bridges

Unlike software, systems, network, and voice engineering, regulated engineering disciplines require licensing. According to the National Society of Professional Engineers, a candidate with a college engineering degree can “begin to accumulate qualifying engineering experience.” What happens during the four years of post-graduate experience required for licensure? “The experience must be supervised. That is, it must take place under the ultimate responsibility of one or more qualified engineers.” Further, the experience is expected to be “high quality, requiring the candidate to develop technical skill and initiative in the application of engineering principles.”

These are the engineers building the bridges you drive over, the flammable electrical parts you install in your walls, and the explosive power plants you live near. Society expects high standards because of the risks to health and safety.

So to advance in their field, those engineers seek out supervision. And because of the licensing, senior engineers must participate in the mentoring process. They have a valuable structure of mentoring.

But what about the “engineers” who run email servers, build voice networks, and write software? Generally, these information technology (IT) and computing disciplines are unregulated. Anybody can do IT as badly as they like, and we have no mentoring structures in place.

ASCII question, but got no ANSI

Computing has grown without formal, legally-mandated mentoring requirements because computers serve as self-checking devices: if the new programmer tries to do something crazy or foolish, it won’t compile or it won’t work.  Unlike faulty bridge designs, computing systems are relatively easy to test. If you try to build a network but don’t know what you’re doing, that network won’t function.

But in IT/computing fields, mentoring is still needed:

  1. IT Engineers often have questions that cannot quickly be googled or tested by experimentation
  2. IT Engineers need to learn from the mistakes & experience of others
  3. IT Engineers need review from other brains to help check their own ideas

It can be hard for a junior engineer to get a solid answer to a question. The best engineers are always busy, and they’d rather spend their time with computers, not people. For example, the introverted Myers-Briggs personality type INTJ is often associated with computer programmers and engineers. For these introverts, answering your question is draining. “If you’re a true introvert, networking is excruciating,” wrote Susan Adams in Forbes in 2014.

So if you’re learning something new,  how do you get your questions answered if you’re among people who’d rather avoid people?

A Mentor commits to answer questions

With so many demands on a skilled professional’s life, the key scarcity is the willingness to provide answers when questioned. So if you’re going to be a mentor engineer,  be sure you’re available to hear questions, and provide good answers.

That is, mentoring in IT Engineering is first about the mentor’s commitment. The mentor has to be willing to take questions from junior engineers, and commit to answer them.

As a professional programmer, Jud McCranie answered hundreds of this author’s written questions about programming from 1989 to 1993. He’s a great example of a mentor, investing in answering the questions of a curious mentee.

I had some great mentors as I was learning. Jud McCranie is a professional programmer I met through a university-operated Bulletin Board System in the early 1990s. He answered a thousand questions from me on programming, and even decrypted my amateurish encryption algorithms. He was a mentor because he took my questions and didn’t ignore them. He challenged my thinking, often asking me questions I couldn’t answer myself. I’m always thankful for his patient consideration of numerous questions from a teenager. (McCranie is cited in one of the newer volumes of Knuth’s The Art of Computer Programming.)

Another superb question-answering mentor was the late Jon Hamlin, an ex-CIA intern and graduate of Valdosta State University and University of Georgia. Jon was far more interested in system administration, networking and UNIX, and in his role as computer science lab manager at Valdosta State University, Jon had access to extensive SunOS/Solaris resources and the time to think about my questions. He set up a Linux computer and gave me access, and helped me clean up a few messes.

Both of these men put in hours of their lives to answer questions I wrote through email, and they’re part of my motivation to answer questions for others.

Why it’s so hard to get somebody to really answer a question

I just claimed that there’s a scarcity of willingness to answer questions, so the first role of a mentor is to commit to be available to answer questions. Why is that?

  1. There are lots of ignorant people willing to give bad answers. But you don’t want them to be your mentor.
  2. It takes real time and expertise to answer questions. As Erica Friedman writes, sometimes you’re expecting a simple answer about a complex system. “Some answers attempt extremely top-level analysis, but few people will have time or expertise to answer a truly complex simple question.”

    Competent Computing/IT Engineers are very busy. Getting a senior engineer to commit the time to mentor is a big deal. Photo: Tim Regan.

  3. Competent engineers usually prefer to stay busy engineering, not doing chit-chat. According to  John Hales of Global Knowledge, communication and explanation are not key character attributes for successful IT Professionals.
  4. IT Engineers are often expected to make progress quickly, so they can’t wait long for answers. As one contributor said on StackExchange, there’s no time pressure to answer questions in online forums, and so it’s easier to get questions answered there.

Because engineers prefer results over talking, and because of the demands on competent engineers, it’s hard to get a timely answer from a competent engineer.

The curious case of the unasked question

“I would rather have questions that can’t be answered than answers that can’t be questioned.”  ~ Richard Feynman

Once a mentor is available to answer questions in a timely way, the mentee must be willing and able to ask questions. Often they are not.

Michael Adams of Quizbean identifies several common reasons junior staff don’t ask questions. For engineers, some key reasons they don’t ask questions are:

  1. Nervousness. “They don’t want to be embarrassed in front of their boss or co-workers,” Adams writes. This can be caused by simple anxiety, or by excessive pride if they don’t want to admit there’s something they don’t know. Antidote: The mentor must make it easy and unthreatening to ask questions by openly enjoying the discovery of new facts. Computing fields move famously fast, so there are always new things to learn.
  2. Avoiding annoyance. They don’t want to ask questions because they perceive it to be inconvenient for the mentor. Antidote: The mentor must actively invite questions.
  3. Bewilderment. The mentee doesn’t know where to start because everything feels so unfamiliar. Antidote: The mentor should set milestones of capability to encourage steady improvements in comprehension: crawl, then walk, then run. For example: log in to the server first. Then locate the logs. Then read the logs. Then understand the logs.
  4. Previous trauma. The world is full of jerks, and many people have been criticized by those jerks for asking good questions. The mentee may need to recover from an unhealthy work environment where legitimate questions were met with caustic attitudes.
  5. Lack of curiosity. One of the most dangerous problems is a lack of curiosity in the subject matter. Mauricio Porfiri, a robotics researcher at New York University, says that “Being creative and being curious is more important than being the smartest or the best at equations if you want to be a great engineer.” Albert Einstein said that “The important thing is not to stop questioning. Curiosity has its own reason for existing.” The lack of curiosity leads to a dangerous complacency, causing the mentee not to care enough to bother to formulate questions or challenge their own assumptions, but to muddle through with the current level of ignorance. Computing is great because it encourages curiosity. Have a question? Try it out. A chemist or physicist or mechanical engineer needs equipment for a lab, but the Computer Scientist needs only a computer and the means to program it. Ayodeji Awosika writes about the dangers of suppressed curiosity in “The Theory of Nothing: Why Lack of Curiosity Leads to Mediocrity.”

Beyond Q&A: The Weekly Mentoring Meeting

To support growth of a mentee, mentors can schedule regular time with the mentee. Just like the commitment to answer questions, the mentor must make these meetings a commitment. Commonly, this happens as a weekly meeting to ask questions and plan progress.

Jim Anderson’s approach to mentoring includes a weekly meeting. The advisee’s next steps were documented on his office whiteboard for clarity and easy reference.

Jim Anderson, a Computer Scientist at UNC-Chapel Hill, followed a simple model of tracking progress for his advisees: he wrote the next steps for each of his mentees on the whiteboard in his office. Then they could easily see what was expected. And in each weekly meeting, he could easily recall their responsibilities.

Plan the route out of ignorance. Even in non-supervisory mentoring, it’s helpful for the mentor to plan and track progress so the mentor ensures that the incomprehensible is coming into focus for the mentee. Without this guidance, the mentee may be trapped in a complicated area without a path forward, and without the ability to ask questions to get out of it.

Review recent accomplishments. The weekly meeting is also a good time to review samples of the mentee’s work. The mentor can praise progress and identify the most important improvements the mentee can work on next. But even when identifying the next growth area, the mentor should recognize the mentee’s achievements.

For example, for Computing/IT professionals, reviewing work can mean:

  • Discussing interesting troubleshooting problems and the approach to troubleshooting.
  • Reviewing system configurations to see how a mentee’s task was accomplished.
  • Reviewing source code.

 

Establishing Healthy Mentorship

To begin healthy IT/Computing mentoring,

  • Get experienced engineers genuinely willing to answer questions: call them Mentors.
  • Get other engineers with curiosity, who are willing to ask questions.

Photo: Merrimack College Mentoring Program.

Sometimes junior technical staff are starved for interesting work while senior staff are overworked

This is Part 2 in my Series on Supporting/Managing Engineers

If a team has lots of technical work to do, and only a few brilliant engineers available, how do you get work to the right people?

In this article, I discuss methods for managing work in IT and technology teams, such as those doing Network Operations or Engineering, DevOps, Security Management, or System Administration, when you have a mixture of skill levels, including “star engineers” with 10+ years of experience. I would give different advice to a team focused on developing software.

Call the Brain Surgeon!

A hospital is full of medical staff.  But when a life-threatening case rolls in, you want the most experienced physician available to handle it.  And if you’re doing important, high-profile, mission-critical technical work, you might want the most experienced technician to do the work.

So if the problem at hand is to restore telephone service for the hospital, or to restore a failed 911 Emergency calling service, then it makes sense to put the best available people on it.

…to suture this wound

Yet often, the risks are not so high, and the need is not actually so urgent. You really don’t want your most-senior staff doing relatively low-risk simple work. Escalating all work to the most-skilled person robs the lesser-skilled staff of experience and deprives the organization of good ideas.

How do you prevent all the hard work from being done by the best and brightest?  How do you prevent all the work from flowing up to the highest-skilled people?

  • Make junior staff work at the highest level possible
  • Make senior staff support and include junior staff in strategy and planning
  • Define the barriers between junior and senior staff
  • Define the workflow for projects

Excelling as a Junior Engineer

Suppose you’re a junior engineer in a team, and all of the challenging work goes to the senior engineers. You know you could do that work, but it’s always going to the senior guy. How do you get the chance to work on hard problems?

It’s possible your management is fumbling, but be sure you’re doing a great job on the problems you have. It’s easy to dream about solving challenging problems, but you have to be sure you’re doing fabulous work in your current tasks:

  • Technically:
    • Are you doing as well as anyone in the world?
    • Are you aware of how other people solve this problem?
    • Are you documenting your methods?
    • Are you listening to the ideas of your colleagues?
    • Are you clearly explaining your results?
    • Are you recording your questions and other unknowns, then working to get answers to those questions?
  • Organizationally:
    • Are you confounded by how difficult it is to make progress?
    • Are you losing track of details, dates, times, or meetings?
    • Are you recording results of the project so others can benefit from it later?
    • Are you making progress before the deadlines, so that you have substantial progress to demonstrate, and questions to ask?
  • Inter-personally:
    • Are you communicating clearly, with the requester of the work (the “customer”) and others with whom you collaborate?
    • When you have a chance to talk about the project, are you engaged? Are you contributing ideas?
    • Are you doing what you can to be pleasant, even fun, to work with?

Make Yourself Available to Hard Problems. Find out how the interesting projects come in, and look for ways to be involved.

  • Try to make sure your work environment isn’t too hidden and isolated.
  • Ask to join conference calls on interesting projects
  • Take notes in meetings, and offer to provide those notes to other people.
  • Ask questions about everything you don’t understand.

Senior Engineers Plan Projects

What is the role of the Senior Engineers?

  • Make design decisions.
    (Anything that can be done either better or worse is a design decision.)
  • Analyze the problems and projects
  • Write plans for the projects
  • Guide the solutions toward doing the best thing (“what should be done”), while minimizing involvement in the method chosen.
  • Solve problems when no one else will handle them
  • Mentor the Junior Engineers

The distinction of a star technician is perspective and context: they know how things are done and why. They can consider the pros and cons of various options for change. They know the potential risks. They should know the organization’s strategy and mission.

With this context, a senior engineer should be planning projects:

  • What are the objective and the goals?
  • What are the phases of work along the way?
    Tip: keep each phase under 4 hours of work
  • What kind of data needs to be collected before proceeding?
  • How will we know when we’re done?
  • How long should it take to complete each phase?

A star technician is as mature as the plans they can provide to others to execute.

Define the Junior/Senior Boundary

What kind of work should be done by senior engineers, and what kind by junior engineers?

Senior Engineers are obligated to:

  • Maintain or improve architectural unity of the system.
  • Provide opinions based on a mature technical aesthetic
  • Research the available options for solving problems (i.e., not just their favorite options)
  • Write plans for accomplishing the work
  • Determine which things to do (E.g.: Install Apache on a Linux VM, or buy an F5 Local Traffic Manager?)
  • Answer questions raised by junior staff
  • Review and approve the work done by junior staff
  • Be aware of the skills and capabilities of the junior staff
  • Do the work that no one else is able or willing to do

Junior Engineers should:

  • Follow the plan to accomplish the work
  • Debate and question the plan if they see better ways
  • Ask good questions to fully understand the rationale behind the plan
  • Discuss the problems with Senior Engineers when they find something interesting or unexpected.

Define the Workflow

Send all work to the Junior Staff?

A popular remedy for Work-Flows-Up Syndrome is to send all problems to the junior staff. For example, all tickets come in and start at Tier 1, then the hardest 80% escalate to Tier 2, then only the worst get escalated to Tier 3.

Hacks.  My experience with this method for actual problem solving is poor: Junior staff are prone to develop Workarounds, Tricks, and Hacks (collectively known as “WTH Solutions”) without adequate context or insight into the best way to solve problems. Without knowing the long-term risks, WTH Solutions make the system more fragile.

Blockages. Just as a star engineer can be a bottleneck, junior staff can block the successful completion of work. To know when to ask for help is a learned skill, and junior staff are often prone to keep poking around the edges of a problem. They can tend to sit on interesting problems too long.

Lack of Senior Insight. Senior staff need insight into the real-world problems with their system. When every problem starts with the junior staff, the senior staff may be unaware of the problems.

For example: In one case, the network had a real but seemingly minor issue that could be worked around through an easy procedure. The senior staff gave the junior staff instructions on how to handle it, expecting them to need the procedure no more than weekly.

But the problem grew more frequent. The junior staff were using the workaround dozens of times per day, faithfully following their instructions. But the senior staff were unaware of the scope of the problem because the junior staff were dutifully executing the procedure.

Here, the senior staff needed to be aware of the scale to develop a more permanent solution. Because the problem was mostly delegated to the junior staff, the senior staff couldn’t see what was happening and rethink their proposed “solution.”

Plan Before Proceeding

Rather than sending all the work in at the bottom skill level, plan projects from the beginning. The first step in processing every new problem should be writing a project plan.

What’s a project?

Anything that has several steps is a “project” as I’m using it here.

What should be designed?

A design decision is anything that can be done incorrectly but still appear to work. For example, you could build a network using one large subnet in one big ethernet, or you could set up the network with routing and subnets. Both may appear to work, but one will be better for the situation.

Writing the project plan may require 10% to 20% of the total work time on the task, and should be done by Senior Engineers. The Senior staff is thus responsible for considering options and maintaining architectural and conceptual integrity of the system.

After the plan is developed, the project can be assigned to somebody at the right skill level.

  • Could the execution of the project affect architectural integrity of the solution? Assign to Senior Engineers
  • Does the junior staff have the skills to do it, or can they learn? Delegate to Junior Engineers

Senior / Junior Followup

Senior engineers need to routinely follow up with junior staff to talk about their projects.

Most projects cannot be completed start-to-finish without interruptions. Sometimes we need a new license file; or we need to open a ticket with the vendor; or we need access to a protected site; or information from the customer. Due to these interactions with other vendors and customers, I find many technicians need to have 3-5 projects in order to stay busy.

Senior Engineers should have routine conversations with junior engineers to discuss the projects, and how they are going. The Project Plans will never be perfect, so discussing the project as it proceeds is a healthy part of mentoring.

Work Flows Where You Pump It

With the proper distribution of tasks, even a team of two engineers or technicians can be more effective. Junior staff have responsibilities to grow technically and professionally, and Senior staff have an opportunity to increase their team’s effective throughput.

Photo Credit: Ewan Cross

 

Most experienced professionals value the opportunity to mentor others; not so for some elite technologists

To enable more people in a technical team to do work, more people have to know how to do it. But for many engineers, training doesn’t come naturally.

This is Part 1 in my Series on Managing Engineers

Is one-on-one training a rational activity, or just a feel-good strategy from the HR department? Many engineers don’t believe in the value of training and mentoring junior people because they perceive no personal benefit in it. But there are benefits for everyone involved.

The reasons high-tech engineers have for not mentoring others are distinctive to the field.

  • Time Calculus: The time and effort to do a familiar task is less than the time and effort to explain it to somebody else and then ask them to do it. For example: suppose a 2-hour routine-cognitive task requires three hours of training. It’s easier just to do it than to train someone else in how to do it.
  • Mathematical, Non-Verbal People: For some people, it’s much easier to control a computer and get the problem done than clearly express ideas verbally.
  • Hero Syndrome: The senior person likes to be the one to solve problems because they get their self-worth from it. And the Customers (those who need the help) would rather just get the problem solved the fastest way possible.
  • Too Much Variety in tasks. Some organizations rarely solve the same problem twice. In these cases, the trainable skills are relatively few.
  • Weak Junior Staff. Sometimes the people available to train aren’t trainable (brains) or coachable (pride). It is the organization’s responsibility to find appropriate staff.
  • Job Protection. There’s a myth that some people won’t train others because the would-be trainer would thus be subject to losing their job. I’m not sure I’ve seen real evidence of this.
  • Technology changes too fast. Technical skills have a very short lifetime: what you learn today may be good for about two years, according to IT World.
  • My other trainee is a compiler. Many senior technicians argue that automating the task is better than training someone else to do it; often that is true. But our desire to write code must be tempered by reality: code doesn’t last very long, and will need humans to maintain it. As Alan Perlis famously said, “It is easier to write an incorrect program than understand a correct one.” Without another human’s review of the problem, you may be solving the wrong problem.

Good Reasons to Train and Mentor Junior Engineers

“In learning you will teach, and in teaching you will learn.”
― Phil Collins

Despite the objections listed above, there are ample good reasons to train and mentor junior engineers.

  • Explaining a practical art to another human is good for humanity, by enabling more people to do more useful work.
  • The Teacher Learns. Explaining and answering questions increases the capabilities of the trainer substantially. The effort to verbally explain a system or technology forces the explainer to crystallize a model of the system they may not have had earlier. (My knowledge of SIP, VoIP Call Control, and RTP grew vastly as I was forced by teaching to verbalize my knowledge, and confront areas I didn’t understand.)
  • Senior Staff should focus on the Hardest Problems. After the training is done, the time/effort for the senior staff is substantially reduced. Suppose that two-hour task required three hours of training; now when the task needs to be repeated, the junior staff can complete it. This makes the senior staff more available for the challenging problems for which they are distinctively qualified. When senior staff are doing the easy stuff, they’re usually leaving the harder problems unconsidered.
  • Technology is fundamentally about improving efficiency through use of effective tools, and having more processors capable of completing a task makes the clients more efficient than having to wait on a single-threaded processor. By having only one person capable of providing the work, you’re creating a bottleneck and failure mode for the rest of the world.
  • Redundant Array of Imperfect Humans (RAIH): Everybody gets sick or needs a holiday. By mentoring someone else, you’re making it easier for yourself to get a break. You may not want to take a break, but eventually you’re going to get sick, die, take vacation, or get a better job.
  • Pair Programming Benefits. Even when automating a task, you can get benefit by bringing in somebody else. There are many documented benefits of pair programming; the Mentoring benefit is key. (Yes, pair programming takes more person-hours — but only about 15%. And for that 15% you get better results.)
  • Fundamental principles change little. While technology moves fast, key fundamentals don’t: the Buffer (1950s); Interrupt (1950s); Relational Database (1960s); Packet-Switched Networking (1960s); Structured Programming (1960s); multi-user Operating Systems and Security (1970s); Digital Audio and Video (1960s – 1980s).
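The time calculus behind the two-hour task with three hours of training can be worked out directly. This is only a rough sketch: it counts senior-engineer hours alone and assumes the junior engineer handles every repetition after training. The function name is hypothetical.

```python
import math

def break_even_repetitions(task_hours=2.0, training_hours=3.0):
    """Smallest number of repetitions at which training costs the senior
    engineer fewer hours than doing the task personally every time.

    Assumption: once trained, the junior engineer handles all repetitions,
    so the senior engineer's only cost is the one-time training.
    """
    if task_hours <= 0:
        raise ValueError("task_hours must be positive")
    # Doing it yourself costs task_hours * n; training costs training_hours once.
    return math.floor(training_hours / task_hours) + 1

n = break_even_repetitions()
print(f"Training pays for itself by repetition {n}")
```

With the numbers from the example, the senior engineer comes out ahead the second time the task comes up.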

How to Accomplish Mentoring & Training

“You cannot help people permanently by doing for them, what they could and should do for themselves.”
― Abraham Lincoln
When you decide to start training another technician, start with the right assumptions.
  1. I cannot learn for you.
  2. I cannot replace your own curiosity.
  3. If you won’t ask questions, you’re probably not teachable.
  4. You can’t learn by watching or listening: you have to learn by doing.

More on this in a future post.

Photo Credit: Guido Gloor Modjib

If you use BroadWorks with CSV CDRs, you’re probably accustomed to reading these:

00255943365CF3FC1CF15820160404194616.3061-040000,ECG,Normal,+12296543428,+19125293400,Originating,+12296543428,Public,+12296543409,20160404194616.306,1-040000,Yes,20160404194624.106,20160404194650.684,016,VoIP,,3409,private,,,,local,Group,,PCMU/8000,216.128.52.5,fd390cb4-eb6d3d15-f8fb7fd2@10.23.6.217,,,,Hermes Communications,,,,,,,,,,n,,,17552886061:0,17552886091:0,Ordinary,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2296543428@vwave.net,Chris Brice,Public,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,33.292,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Group,,,,,,,,,,,,,,,,+19125293400,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2296543428@vwave.net,Primary Device,33.292,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,PolycomSoundPointIP-SPIP_650-UA/3.2.2.0477,,,,,,,,,,,,,,,,,,,,,,,,,,10626401:2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Should you tire of counting commas, here’s a CDR decoder meant to run on Mac or Linux systems with gawk: bwcdrdecoder script

With it, you can get output like this:

[Screenshot: sample bwcdrdecoder output]

bwcdrdecoder was made to understand BroadWorks Release 21 CDRs; and generally older CDRs as well.
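If gawk is not at hand, a rough way to avoid counting commas is to print each non-empty field next to its position. This sketch is not the bwcdrdecoder script and does not name the fields, since the position-to-name mapping depends on the BroadWorks release. The function name and the shortened sample record are made up for illustration.

```python
import csv
import io

def nonempty_fields(cdr_line):
    """Return (position, value) pairs for every non-empty field, counting from 1."""
    row = next(csv.reader(io.StringIO(cdr_line)))
    return [(position, value)
            for position, value in enumerate(row, start=1)
            if value != ""]

# Shortened, made-up record for illustration only (not a complete CDR).
sample = "0025594336,ECG,Normal,+12296543428,+19125293400,Originating,,,,local"

for position, value in nonempty_fields(sample):
    print(f"{position:4d}  {value}")
```

Looking up what each position means still requires the CDR field list for your BroadWorks release.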

How To Make bwcdrdecoder Your Own

  1. Download bwcdrdecoder script
  2. Save it as “bwcdrdecoder” someplace in your path, or in your home directory, ~
  3. Change the permission on the file to allow execution, e.g.,
    chmod u+x ~/bwcdrdecoder
  4. Execute it to read in some BroadWorks CSV CDRs, e.g.,
    ~/bwcdrdecoder BW-CDR-20160501000000-2-5CF3FC1CF158-001455.csv

Troubleshooting

If you don’t have GNU awk (“gawk”) installed, you’re probably using Mac OS X, and you should install it. I recommend using Homebrew (“brew install gawk”).

If gawk is installed in another location besides /usr/local/bin/gawk, then you’ll need to edit bwcdrdecoder to change the first line. For example, if gawk is in /usr/bin, you can change the first line to reflect that.

Background

The Polycom SoundPoint IP SIP Phones and Adtran IADs are used as access devices for Hosted IP PBX service, managed by the BroadWorks platform. In a network without geographic redundancy, the devices use SIP to register to a single SIP SBC IP address.

To support geographic redundancy of SBCs, the devices must support registration to multiple IP addresses. Each device must select the proper IP address in each case to maintain its service and operation on the platform.

In a conventional non-redundant design, each SIP Access device registers with only a single SBC. In a geo-redundant environment the SIP Access Device has to decide properly when and if to use each of the two sites.

The behavior of a SIP Access Device controls the effectiveness of geo-redundant failover. There are no hard and fast standards on the proper behavior. For example:

  • How does the access device determine that the primary site is unavailable?
  • After determining the primary site is unavailable, what should happen to calls that had already been started through that site?
  • How long should the access device wait before attempting to register with the secondary site?
  • After successfully registering with the secondary site, when should the access device check the status of the primary site?
  • Should the access device check the status of the primary site with a SIP registration, or some other SIP method?
  • What happens to SIP subscriptions set up on the access device during a failover to a secondary site?

This TR documents the best results determined for supporting this service on the Polycom SIP phones and geo-redundant SBCs, supported by BroadWorks.

Failover Retransmission Timing

Polycom –

Failover is based on a lack of response for both Polycom 3.x and 4.x software. The phone uses an exponential backoff that doubles on each retry, starting at 500 ms with a maximum delay of 2000 ms.

Assuming the 200 OK response to the SIP REGISTER carries a Contact expires=30 value, so that the registration lasts 30 seconds and the phone re-registers at the halfway point, the worst-case timeline for failover is as follows, for both Adtran TA900 and Polycom phones.

  • Time <0: Device Under Test (DUT) receives 200 OK for SIP REGISTER
  • Time 0: DUT is successfully registered.
  • Time 15: DUT transmits REGISTER to lab-sd1
  • Time 15.5: DUT retransmits REGISTER to lab-sd1
  • Time 16.5: DUT retransmits REGISTER to lab-sd1
  • Time 18.5: DUT transmits REGISTER to lab-sd2
  • Time >18.5: DUT successfully registers via lab-sd2

Times are given in seconds.
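The timeline above follows directly from the backoff parameters. Here is a minimal sketch that reproduces it, assuming three transmissions to lab-sd1 before the device gives up (as the timeline implies) and a first re-REGISTER at 15 seconds. The function names are hypothetical.

```python
def retransmit_delays(initial=0.5, maximum=2.0, attempts=3):
    """Doubling backoff: 0.5 s, 1.0 s, 2.0 s, capped at the maximum."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        delay *= 2
    return delays

def failover_timeline(first_register=15.0, attempts=3):
    """Return (time_in_seconds, event) pairs for the worst-case failover."""
    events = [(first_register, "REGISTER to lab-sd1")]
    t = first_register
    delays = retransmit_delays(attempts=attempts)
    for i, delay in enumerate(delays):
        t += delay
        if i == len(delays) - 1:
            # After the last wait expires with no reply, move to the secondary.
            events.append((t, "REGISTER to lab-sd2"))
        else:
            events.append((t, "retransmit REGISTER to lab-sd1"))
    return events

for t, event in failover_timeline():
    print(f"Time {t:g}: {event}")
```

Under these assumptions the device spends 3.5 seconds (0.5 + 1.0 + 2.0) waiting on the primary before it tries the secondary.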

Adtran –

The IAD uses an exponential backoff that doubles on each retry, starting at 500 ms with a default maximum delay of 2000 ms.

Assuming default settings, the worst-case timeline for failover is as follows.

  • Time <0: Device Under Test (DUT) receives 200 OK for SIP REGISTER
  • Time 0: DUT is successfully registered.
  • Time 15: DUT transmits REGISTER to lab-sd1
  • Time 15.5: DUT retransmits REGISTER to lab-sd1
  • Time 16.5: DUT retransmits REGISTER to lab-sd1
  • Time 18.5: DUT transmits REGISTER to lab-sd2
  • Time >18.5: DUT successfully registers via lab-sd2

Times are given in seconds.

Subscriptions

Polycom –

SIP Subscriptions are lost when the DUT re-registers with the secondary SBC. They are re-established after an hour. DUT does not re-SUBSCRIBE when it switches to a new SBC. BroadWorks does not keep the subscriptions coupled to the registration Contact which would route through the user’s current SBC.

Because subscriptions are not maintained between SBCs, features such as these will not be fully functional during that hour:

  • Busy Lamp Field
  • Shared Call Appearance
  • Message Waiting Indicator

Adtran –

SIP Subscriptions are not applicable when dealing with Adtran IADs.

Testing
Polycom SIP Version 3

The Polycom 3.x software is the newest software available for many popular phones, including the SoundPoint IP 330 and 500. Therefore, for networks that include these and related generations of phones, the geo-redundancy behavior of these devices affects the core network operation significantly.

In the tests below, the DUT is a Polycom SoundPoint IP 330 running 3.1.2c software.

Configuration and DNS Lookups

The DUT was configured to perform DNS lookups for “lab2.e-c-group.com”, and configured for transport=”DNSnaptr”.

$ORIGIN lab2.e-c-group.com.
_sip._udp 600        IN        SRV 20 10 5060 lab-sd2
_sip._udp 600        IN        SRV 10 10 5060 lab-sd1
lab-sd1   600        IN        A              216.128.128.11
lab-sd2   600        IN        A              216.128.128.40
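
As a rough illustration of how a client reads these records: in DNS SRV, the record with the lowest priority value is tried first, so lab-sd1 (priority 10) is the primary and lab-sd2 (priority 20) is the secondary. The sketch below only sorts by priority. It ignores weight-based selection among equal priorities, and the Srv type is a made-up helper, not part of any phone's software.

```python
from typing import NamedTuple


class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_by_priority(records):
    """Lowest priority value first, as DNS SRV specifies."""
    return sorted(records, key=lambda r: r.priority)


# The two _sip._udp records from the zone above, in file order.
records = [
    Srv(20, 10, 5060, "lab-sd2"),
    Srv(10, 10, 5060, "lab-sd1"),
]

for r in order_by_priority(records):
    print(r.target, r.port)
```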

Fault Detection

DUT detects the fault only when it fails to receive a reply to a SIP REGISTER.

If the SBC returns a SIP 400 or SIP 503 response to DUT, DUT does not attempt lab-sd2.

Affinity for Active SBC

Every new registration request restarts on the primary SBC.

And, even after registering on the secondary SBC, lab-sd2, every new INVITE is attempted on the primary SBC.

Revert

Because every new request is attempted on the primary SBC, each new REGISTER or INVITE will cause an attempt to revert to the primary SBC.

Key Findings

Geographic Failover Should Work

The Polycom 3.x software should provide a functional failover option.

Overload-After-Recovery Risk

Polycoms running 3.x will attempt to register with a recovered SBC during the registration expiration interval. For example, if all devices are configured to re-register every 30 seconds, and the Polycoms re-attempt registration at half of the registration expiration time, then every Polycom 3.x device will attempt to register with the primary SBC, after its recovery, in the space of 15 seconds. This is likely to cause an overload on the newly-recovered SBC.
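
To get a feel for the size of that burst, here is a back-of-the-envelope sketch. The device count is hypothetical, and the calculation assumes registrations are spread evenly across the half-expiry window.

```python
def recovery_register_rate(devices, expires_s=30.0):
    """Average REGISTERs per second hitting the recovered primary SBC."""
    window_s = expires_s / 2  # every device re-registers within this window
    return devices / window_s


# Hypothetical network of 30,000 phones with expires=30.
print(recovery_register_rate(30_000))
```

Lengthening the registration interval widens the window and lowers the rate proportionally, at the cost of slower fault detection.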

Polycom SIP Version 4

The Polycom 4.x software is available for many of Polycom’s newer SoundPoint IP Phones, such as the 550 and 670. This software provides several additional configuration options for proper support of failover.

Parameters, with the default and our recommendation for each:

  • authOptimizedInFailover (default: 0; recommendation: 1)
    • If set to 1, when failover occurs, the first new SIP request is sent to the server that sent the proxy authentication request.
    • If set to 0, when failover occurs, the first new SIP request is sent to the server with the highest priority in the server list.
  • onlySignalWithRegistered (default: 1; recommendation: 1)
    • If 1, the phone determines if the user is registered (voIpProt.SIP.outboundProxy.failOver.RegisterOn must be enabled).
  • failRegistrationOn (default: 1; recommendation: 1)
    • If 1, the phone will silently invalidate an existing registration at the point of failing over.
  • failOver.failBack.mode (default: newRequests; recommendation: DNSTTL)
    • The mode for failover failback.
    • newRequests – all new requests are forwarded first to the primary server regardless of the last used server.
    • DNSTTL – the phone tries the primary server again after a timeout equal to the DNS TTL configured for the server that the phone is registered to.
    • registration – the phone tries the primary server again when the registration renewal signaling begins.
    • duration – the phone tries the primary server again after the time specified by …failOver.failBack.timeout expires.
  • reRegisterOn (default: 0; recommendation: 1)
    • If 1, the phone will first attempt to register with (or via) the server to which the signaling is to be diverted, and only if the registration succeeds (200 OK with valid expires) will the signaling diversion proceed with that server.

Configuration and DNS Lookups

The DUT was configured to perform DNS lookups for “lab2.e-c-group.com”, and configured for transport=”DNSnaptr”.

We tested with both TCP and UDP on the Polycom 4.x software.

TCP Configured

$ORIGIN lab2.e-c-group.com.
@         600        IN        NAPTR 50 10 "S" "SIP+D2T" "" _sip._tcp
_sip._udp 600        IN        SRV   20 10 5060             lab-sd2
_sip._udp 600        IN        SRV   10 10 5060             lab-sd1
lab-sd1   600        IN        A                            216.128.128.11
lab-sd2   600        IN        A                            216.128.128.40

UDP Configured

$ORIGIN lab2.e-c-group.com.
_sip._udp 600        IN        SRV 20 10 5060 lab-sd2
_sip._udp 600        IN        SRV 10 10 5060 lab-sd1
lab-sd1   600        IN        A              216.128.128.11
lab-sd2   600        IN        A              216.128.128.40

Fault Detection

DUT detects the fault only when it fails to receive a reply to a SIP REGISTER.

If the SBC returns a SIP 400 response to DUT, DUT does not attempt lab-sd2.

If the SBC returns a SIP 503 response, the DUT attempts to re-register with lab-sd2. But it does not stay registered properly with lab-sd2; it allows the registration to expire. In effect, when the primary SBC returns a SIP 503, the DUT stays registered only part of the time.

Affinity for Active SBC

Using the settings that we recommend in this document, DUT continues to consistently use a specific SBC until it fails, or the DNS TTL timer expires.

Revert

Using the recommendations of this document, DUT will continue to use the secondary SBC for the duration specified by DNS TTL value. After this expires, DUT will re-attempt the primary SBC.

Key Findings

Geographic Failover Should Work

The Polycom 4.x software should provide a functional failover option. Because the affinity/revert function is less aggressive than the 3.x software, the 4.x software should provide better functionality for a geographically-redundant system.

Adtran TA900E IAD

Testing began with firmware version A4.07.00E. However, this firmware version sends a malformed SIP REGISTER on failover, so a firmware upgrade is required for the Adtran DUT to support failover. The firmware was upgraded to the most recent version, R10.5.0E.

Configuration and DNS Lookups

The DUT was configured to perform DNS lookups for “lab2.e-c-group.com”. However, the Adtran IADs only support “SRV” lookup and have no support for “DNSnaptr”. The DUT does correctly prioritize the SBCs based on the DNS lookup.

Fault Detection

DUT detects the fault primarily when it fails to receive a reply to a SIP REGISTER. If the SIP Trunk setting “sip-server rollover service-unavailable-or-timeout” is set then failover on a SIP 503 response can occur for requests other than SIP REGISTER.

If the SBC returns a SIP 400 response to DUT, DUT does not attempt lab-sd2.

Affinity for Active SBC

Every new registration request restarts on the primary SBC.

And, even after registering on the secondary SBC, lab-sd2, every new INVITE is attempted on the primary SBC.

Revert

Because every new request is attempted on the primary SBC, each new REGISTER or INVITE will cause an attempt to revert to the primary SBC.

Key Findings

Geographic Failover Should Work with updated firmware

The A4.07.00E firmware sends malformed failover SIP messages. However, the R10.5.0E firmware for the Adtran IAD will provide a functional failover option.

Overload-After-Recovery Risk

Default Adtran IAD configurations will attempt to register with a recovered SBC during the registration expiration interval. For example, if all devices are configured to re-register every 30 seconds, and the IAD re-attempts registration at half of the registration expiration time, then every IAD will attempt to register with the primary SBC, after its recovery, in the space of 15 seconds. This is likely to cause an overload on the newly-recovered SBC.

Lab Testing

Test-By-Test Lab Testing records are available here. In these tests, we used the following equipment:

  • Polycom SoundPoint IP 330 0a1e
  • Polycom SoundPoint IP 601 35b2
  • Polycom SoundPoint IP 550 28ec8e
  • Adtran TA 908e
  • Acme Packet NN4250, 6.2 software, lab-sd1
  • Acme Packet NN4250, 6.2 software, lab-sd2
  • BroadWorks R18 Lab, lab2.e-c-group.com
  • Cisco Small Business SG300-10P PoE Ethernet Switch

 

This article is based on ECG Tech Report TR-ECG15273.
Lab Testing: Matt Keathley

 

Question:

How do you make a display filter that filters out most RTP frames, but leaves a representative sample? Sometimes it’s convenient to see a sampling of RTP frames in Wireshark, without having to see 50 per second.

Answer:

Rather than see 50 frames per second for every RTP flow, how about one frame every 5 seconds?

Wireshark display filter:

rtp[3:1]==0 or rtp.marker==1

Shows an RTP packet for each RTP stream
— about every 5 seconds
— or when the stream starts afresh

How does it work?

  • The 3rd and 4th bytes of the RTP frame are the sequence number
  • The sequence number increases monotonically (40704, 40705, 40706, etc.)
  • rtp[x:y] gives the Y-number of bytes that appear at X-offset in the RTP frame, where the first byte in the packet is at 0 offset
  • rtp[3:1] gives the 1 byte that appears in the 4th byte of the frame (see the “00” in attached screenshot). This is the least-significant byte of the number.
  • Normal VoIP RTP sends 1 frame every 20 milliseconds
  • Since the sequence number is a 2-byte value, 1 out of every 256 frames will have a least-significant-byte value of 0
  • 256 [sequence numbers] * 20 ms = 5.12 seconds
  • I’m glossing over some details in the previous two points
  • Each time a new RTP flow starts, the sender should send an RTP frame with rtp.marker==1
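
The same logic can be checked outside Wireshark. This is a small sketch that applies the filter's test to a run of sequence numbers. It assumes one frame every 20 ms and no packet loss, and the function name is made up for illustration.

```python
PACKET_INTERVAL_MS = 20  # assumed: one RTP frame every 20 ms


def passes_filter(seq, marker=False):
    """Mimic: rtp[3:1]==0 or rtp.marker==1"""
    low_byte = seq & 0xFF  # 4th byte of the RTP header (offset 3)
    return low_byte == 0 or marker


start = 40704  # example sequence number from the list above
shown = [s for s in range(start, start + 1000) if passes_filter(s)]
gap_ms = (shown[1] - shown[0]) * PACKET_INTERVAL_MS

print(shown)   # sequence numbers that survive the filter
print(gap_ms)  # milliseconds between displayed frames
```

Out of 1000 consecutive frames, only the four whose low byte is zero are displayed, 5120 ms apart, and a frame with the marker bit set is shown whatever its sequence number.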

When you’re connecting to the rest of the world to make and receive phone calls, you have several design options available. Or, more precisely, your Voice Service Providers have many options available.

VoIP via Layer-3 VPN

In this case, a Layer-3 VPN, such as MPLS over the Voice Provider’s own equipment, is used to connect a Voice customer to the Voice service provider. Shared infrastructure is used, but the traffic from the Internet cannot route to the Voice equipment. The same physical links might carry Internet traffic as well, as shown on black hairline from the Internet Service Providers to the Customer Data Network.

MPLS is the way we see this implemented most commonly. In this case, the customer has a location and an MPLS or Ethernet VPN path back to the Voice Service Provider.

Pros

+ Protection against bad days on the Internet. E.g., if global BGP is working poorly. Or if “Cyber Warriors” in a despotic regime decide to launch a Denial of Service (DoS) attack against voice networks.

+ Usually it’s easier to prioritize traffic in a VPN.

+ The Voice Service Provider can ensure end-to-end quality of service if they want to because they own or manage all the queues (i.e., routers and switches) along the path from their equipment to the enterprise.

Cons

– You have to depend on the reliability of the MPLS path; it’s usually harder to have redundant links to the Voice service provider.

VoIP via Internet Infrastructure

The Voice Service Provider and the Customer both connect to the Internet and exchange the IP addresses of their equipment. SIP and RTP are unencrypted. Devices accept or reject SIP calls based on the incoming IP address.

No special MPLS is used here, but we depend on the same shared routers.

Pros

+ The Voice Service Provider is also an ISP, and they manage all of the queues. So if they want to, they can provide high Quality of Service.

+ Usually there is no congestion inside Service provider networks. They can upgrade their links easily and inexpensively to get adequate capacity.

+ Potential for backup options via the public Internet.

Cons

– Bad things on the Internet might affect this; e.g., DoS attacks, or BGP malfunctions. However, within a service provider’s own network, the effects can often be mitigated.

– Sometimes harder to do QoS; not for technical reasons, but because ISPs are sometimes no good at prioritizing packets or guaranteeing bandwidth.

VoIP via VPN Over the Internet

In this model, the Voice Service Provider and the customer both connect to the Internet. A VPN, such as an IPSEC VPN or a permanent TLS connection, is set up across the Internet between the two parties. At least the SIP Signaling traverses the VPN.

Pros

+ Get to choose from among any ISP

+ Assuming the Customer has Internet access, there’s no construction time required to setup

+ Protection against IP address spoofing in either direction; so if you receive a SIP packet you can trust it was genuinely sent from the service provider

Cons

– No protection against unreliability on the Internet

– No quality of service can be guaranteed

– The links between the Voice service provider and the ISP may be questionable. For example, if a streaming video service, like Netflix, goes into business, certain Internet links that worked in the past may become saturated.

– IPSEC tunnels add extra complexity to the system.

VoIP via Internet

The Voice Service Provider and the Customer both connect to the Internet and  exchange the IP addresses of their equipment. SIP and RTP are unencrypted. Devices accept or reject SIP calls based on the incoming IP address. The Voice Service Provider may not have a ”Provider-Edge” router at all.

Pros

+ Get to choose from among any ISP

+ Assuming the Customer has Internet access, there’s no construction time required to setup

Cons

– No protection against unreliability on the Internet

– No quality of service can be guaranteed

– Risks of poor quality due to congestion on the Internet.

The BroadSoft BroadWorks DBS is a different animal than other BroadWorks servers, and it requires a special set of commands to keep it alive and well. The level of care and feeding required for the database reminds me of BroadWorks App Server releases R12 and R13; those were not happy days.

Check status of the FRA Disk Group

  • dbsctl diskinfo
  • /etc/init.d/oracleasm listdisks
    • On a healthy, normal system, this should list two entries: DATA and FRA
  • bwBackup.pl
  • srvctl status diskgroup -g FRA
  • As root: /sbin/blkid | grep asm

Installation Logs

These show a lot of what happens when installing the Oracle parts of the DBS:

  • /var/broadworks/logs/installation/oracle*
  • /u01/app/oraInventory/logs/installActions* show

Convert a standalone DBS to be a secondary DBS

  1. Clear ASM DATA and FRA disk groups
  2. Install DBS on DBS2 as a standalone primary
  3. On the primary/active DBS: use config-ssh to mesh both bwadmin and oracle accounts
  4. On the other DBS which will become the standby: sitectl convert bwCentralizedDbX
  5. On the primary/active DBS: peerctl add dbs2

Fix the kernel 2.6.18-400.1.1.el5 name mismatch for OracleASM

Upgrading to the BroadWorks DBS R20sp1 requires a kernel update. As of 2014 December 19, updating the RHEL 5 kernel installs version 2.6.18-400.1.1.el5.

However, only oracleasm-2.6.18-400.el5-2.0.5-1.el5 is available from Oracle.

Fortunately, the kernel module appears to be cross-compatible. Here’s how I moved it:

mkdir -p /lib/modules/2.6.18-400.1.1.el5/kernel/drivers/addon/oracleasm
cp /lib/modules/2.6.18-400.el5/kernel/drivers/addon/oracleasm/oracleasm.ko /lib/modules/2.6.18-400.1.1.el5/kernel/drivers/addon/oracleasm
cat /lib/modules/2.6.18-400.el5/modules.dep >> /lib/modules/2.6.18-400.1.1.el5/modules.dep

Submit your favorite tidbits in comments!

“Network Neutrality” advocates teach us that banning certain Active Queue Management (AQM) algorithms will result in greater freedom on the Internet. For example: Barack Obama wants to ban “Paid Prioritization,” which Cisco calls “Low Latency Queueing”.

Even folks interested in computers, but who don’t build or run networks, seem to have some downright strange opinions. For example, when Netflix became a customer of Comcast, Ben Gilbert at Engadget claimed Netflix was “paying off” Comcast. When I buy a 50 Mbps link from Time Warner, instead of a 5 Mbps link, is the extra $30/month “paying them off”?

Net Neutrality and Freedom of the Press

American Journalist A.J. Liebling famously wrote, “Freedom of the press is guaranteed only to those who own one.” The folks who run routers and network links are something like the owners of printing presses. If our government restricts the rights of someone who owns a computing device which copies and transmits information, isn’t that a violation of a basic right of humankind?

It’s Distributed Computing, Not Common Carriage

Somehow, ISPs have been confused with Fedex. While a common carrier delivers an entire package or a working application, ISPs carry only packets. British Consultant and Scientist, Martin Geddes, argues strongly that the Internet is not “common carriage” like the postal service, but truly a distributed computing platform. It doesn’t make sense to treat an ISP like UPS or the Royal Mail when “packets are merely arbitrary divisions of data flows. Broadband is fundamentally different from other forms of common carriage infrastructure. . . Packets are not people, and you don’t need to be ‘fair’ to them. There is no good karma in being even-handed to fragments of data flows in flight. It is only the fairness to people that we care about, which means only end user outcomes matter.”

Netflix’s Self-Destructive Taste in ISPs

Cogent is among the cheapest and worst of the “tier 1” ISPs. Netflix bought Internet access through Cogent, but unlike other web-hosting companies, Cogent doesn’t expect to pay money to interconnect with other ISPs. Historically, this type of unpaid peering has been agreeable because customers (downloaders) and data sources (servers) were distributed roughly equally across the Internet.

So when Cogent needed bigger links to deliver Netflix data — i.e., they wanted to install additional 10 Gbps Ethernet links to other ISPs — they expected to get them for free and pass the savings on to Netflix. Cogent would pay for their router and their ports, but they expected Level(3), Comcast, and others to provide free access to router ports on their side.

It normally doesn’t work that way. But because Cogent and Netflix had a business model built around the historical unpaid peering, it was very difficult for them to switch.

But why can’t we stop Comcast from Mangling our Data?

But Netflix’s choice of Cogent as an ISP, along with a naïve business model, has gotten conflated with Comcast’s 2008 penchant for throttling Internet speeds. The FCC promised to enforce a ban on Internet throttling. “We need to protect consumers’ access,” said FCC Chairman Kevin Martin, a Republican. “While Comcast has said it would stop the arbitrary blocking, consumers deserve to know that the commitment is backed up by legal enforcement.” So according to the FCC, “throttling” Internet access is already against the law.

Why is Netflix complaining? Because they don’t want to have to pay much for their Internet links.

But Why do I Have So Few ISP Choices? I hate my ISP!

Many homes and businesses in America have three ways of getting Internet access:

  • The telephone company’s voice-grade cable.
  • Coax from the cable company.
  • Wireless radio signal.

Telephone Wires, 1933 Edition

The telephone wires were built to support voice and voice only. But with the advent of “DSL,” the same wires are good enough for a substantial Internet connection. There are more tricks coming to increase spectral efficiency, but for most folks, their Internet speed via DSL is between 3 and 6 Mbps.

Why is there only one telephone company that owns the wire going to your building? In the early 1930s, offering telephone service for all Americans became a national priority. But laying the wires to deliver all this service is quite expensive. Through the early 20th century, a bargain was developed, so that each telephone company:

  • Had to promise to offer service to every address in its territory
  • …And would get a monopoly on some territory of the country
  • And would be guaranteed some minimum profits.

That national priority allowed the monopoly, and that’s why you have only one telephone company in most places.

If you’re guaranteed minimum profits, then if your costs go up 10%, then you can raise your rates at least 10% with the government’s blessing. So even as technology improved, the telephone companies had no incentive to minimize costs. Vendors took note.

Telephone Wires, 1996 Edition

In 1996, the Telecom Act allowed anybody to become a telephone company in the US. You had to follow all the same rules, but if you did, you could buy Unbundled Network Elements including access to the copper wires that connected to each home. The logic was something like this: the telephone companies got a fabulous deal and government-sponsored loans for most of the 20th century, and the infrastructure built with that government assistance should be available to other players for more innovation.

Many ISPs and telephone companies have been built on those new rules. And the price of local service and of long distance in the United States has dropped precipitously, along with the stock value of telecom vendors like Nortel that were built on the shoulders of the overpriced telecom rules from the 1930s.

While the FCC made a good start at implementing these rules, they became convinced that they weren’t really necessary. The telephone companies had no incentive to unbundle their network elements, and they found ways to get out of offering new services, like fiber links, to their “competitor” wholesale operators.

Dane Jasper of Sonic.net argues that reinvigoration of the 1996 Telecom Act would do a lot to bring more competition. “I call on the FCC to reconsider the decisions of that past era, and to take steps to reintroduce UNE-L (unbundled network element: loop) requirements, including access to available dark fiber, which is a critical backhaul component for competitive carriers. Copper unbundling is only fully viable when the middle mile fiber isn’t missing from the equation.” Ironically, the laws requiring this are already on the books.

Cable Television Wires

I’m no expert on cable television vendors, but in my experience on a project in South Georgia I learned that the cable companies are granted a local monopoly, a franchise, which allows them to sell services to limited households, and gives them the “pole-attach” rights they need to run their cables around town.

Entertainment services were not a major priority initially. But because of the vast bandwidth requirements of serial, analog television, the technology used — coax — delivers enormous capacity for Internet access. In the FCC parlance, Cable Internet providers are considered “fiber” because they use fiber optic cables to deliver “Hybrid-Fiber-Coax” Internet to neighborhoods. The total bandwidth available over coax cable is well over 1 Gbps.

So it is local regulation, and the high cost of construction, that mean most folks have only one provider for Oprah re-runs and for Coax-based Internet. My understanding is that there are no general rules preventing multiple coax or fiber providers from delivering service. If you can sign a contract with your local town, you can offer high-speed Internet there by building a new HFC network. If you have a small town of 10,000 homes, Cisco expects it to cost around $5.6 million (i.e., $561 per household-passed by the coax).

Wireless Internet

Wireless Internet, including LTE, is available to lots of folks as of my writing. But as of now, it’s also relatively expensive to deliver a megabit over LTE or EV-DO. You can expect this to change and become a viable alternative.

Engadget and others have it wrong.

Engadget’s deceptive depiction of ISPs suggests there’s some beautiful spaghetti cable connecting your ISP to all the web sites. In reality, each party, whether a house or a “website,” has to pay the ISP to provide a service to it.

Engadget seems to think that services like Netflix and Amazon are like the water we drink, falling freely from the sky, while ISPs are collecting it and charging you a fee. The model is completely wrong: Netflix and Amazon buy access to plug into the ISP’s equipment, just like homes and businesses do. And if Netflix doesn’t buy enough access, they’ll have slow Internet, just like you do.

Would Net Neutrality Raise Your Internet Bill?

Under modern telecom rules in the US, unlimited long distance talking is essentially free. Under the old rules, AT&T would offer a discount price of only $0.11 per minute — but only during certain days, after certain hours.

We know that telephone service was drastically overpriced prior to the 1996 Telecom Act; we know this because the prices dropped as soon as technology could be developed to exploit the new freedoms under Federal Law.

It seems unlikely that placing additional restrictions on telephone service would be one-sided, guaranteeing only additional freedoms and guarantees to consumers. It would be a compromise, with ISPs promised guaranteed minimum profitability in exchange for certain difficulties.

And that compromise — in the name of freedom — would be likely to introduce a new era of overpriced telecom.

So, What’s The Solution?