Get in on the ground floor – Android Alliance first meetup tomorrow!

The Android Alliance, a Philadelphia-born group centered on Android development, holds its first meetup tomorrow, and it looks to be a great start.

Read about the goals of the group from its founder, Corey Leigh Latislaw.

Read about the first meetup, scheduled for tomorrow, from Arpit Mathur.

Join the LinkedIn Group and Google Group.

And take part in the conversation by following the group on Twitter.

Professional Development

Comcast Interactive Media (CIM) is a group within Comcast Cable that built the XfinityTv website and mobile apps, among other sites. I have been working here for close to 5 years now and joined when XfinityTv was still in its infancy. But this post is not about work at CIM; it is about one of the extracurricular activities that I am involved in. It is called Professional Development.

The Professional Development Group was initiated in September 2008 by the HR group, and then the volunteers were allowed to take it over. It has been led by Robert Philibert, a designer, and me since then. We have invited speakers from within Comcast, from universities in the area and from other companies. The purpose of the group is to expose CIMmers to people and topics that they would not normally come across on their own. Some of the discussions we have had were on mergers and acquisitions, business models and intellectual property rights. The speakers have been very cooperative and willing to offer their time at our request. Once we had a few big names speak at our events, others were more willing to be approached and sometimes looked forward to it.

Our events happen from 12 to 1 pm once a month during the week so that people don’t have to miss work. They do have to push their lunch to later, because the room that we have these discussions in is an Amphitheater and we are not allowed to bring food into it. But this has not deterred people from attending.

Our approach to inviting speakers is to send them an email to introduce ourselves. We then request an interview to decide on the specifics of the presentation. Then we set a date that suits both the speaker and the availability of the Amphitheater. This has also helped us develop a good reputation in our company. Some of the speakers we approach already know us, having seen other presentations that we have organized in the past.

I recommend that every company have a group like ours to help its employees stay informed about other areas of the business or topics of interest.

Mission statement:
“We strive to sharpen the professional skills of CIMmers and connect them to the icons of our company and industry, so that they can effect the changes that will make CIM lead the competition in Choice and Control.”

So let’s go around the room and introduce ourselves…

Hi.

I’m Jack.

I work for Comcast Interactive Media as part of our product team. In my four years with CIM I’ve been involved with developing interactive technologies for online video, DVR access, and most recently controlling your TV using the Xfinity TV app.

I guess you can say I was tasked with helping bridge the gap between the Interwebs and home. I worked in the daily operation with our world-class (plug) creative and engineering talent to deliver these products with our customers in mind.

My reward?
…getting to engage our customers, providing tips for clever features, listening to your ideas, and sharing news regarding future product releases. I’ll be your inside source to the development shop. Cool?

Look to this blog for updates or say hi to me on Twitter (@xfinitytvapps). I’m looking forward to this new endeavor and learning more about all of you.

Work hard and have fun

When I was growing up, one of my favorite movies was Real Genius. The story revolves around some incredibly smart college students who work insanely hard, but only some of whom know how to have fun. I saw it and immediately knew that, when I grew up, I wanted to work somewhere where I would work hard on cool stuff, but also have fun.

All tech companies say the same thing: “We work hard and we play hard.” CIM is no different — we love working hard and having fun. But every tech company is a little different. To give you a peek behind the scenes, here’s how we do it.

In October of last year, we were issued a list of 60-day priorities, a large portion of which focused on the Xfinity TV web site. Our team reviewed the priorities and drafted a plan… to get nearly all of the items done in just 30 days. At the end of those 30 days, here’s what our tracking white board looked like:

30 days at CIM

With 30 more days to go, the team focused on code refactoring and optimization. At times, the team needed some distraction. One day, after watching a Strong Bad video, the team started imitating it:

link to video

The holidays were ushered in by a code release to launch all of the hard work we’d done in just 60 days. And so the team sat around the Yule Log to enjoy cookies and candy canes and wait for Santa:

Gathered 'round the ole Yule Log HD

That’s how we work hard and have fun at CIM!

JRugged – Making your code more RUGGED

Cowboy coding
We have all done some cowboy coding at some point in our lives.  I think we can even recognize when we are about to be asked to do it.  The scenario is something like:

Your boss or your boss’s boss comes down and says “… we have to have this new thing – and we have to have it by this deadline or the sky will fall and we will all die…”

It is at this point that, as a developer, you begin to run through how you might develop what is being asked for and work through in your head exactly how you will tackle the problem at hand.  It is also usually right at this point that decisions get made about which shortcuts need to be taken in order to get the job complete on time.  We, as developers, shortcut things like:

  • unit-testing (because we tend to do it after development, making it feel superfluous)
  • monitoring/data collecting about the system we are developing
  • resilience to failure

These last two items, monitoring and resilience to failure,  are the focus of this post.

Out of the box monitoring
When I mention monitoring what is the first thing that comes to mind?  Is the machine my software is deployed on running?  Does it have connectivity?  What is the program’s memory footprint?  How much free memory or disk is available?  While all of these are important base questions to know the answers to – they do very little to help you understand your running, deployed software.  To understand your running software you need to be able to answer questions like:

  • How many requests per second has my software performed in the past minute, hour, day?
  • How often did my software fail in the past minute, hour, day?
  • How often did my software succeed in fulfilling a request in the past minute, hour, day?
  • What was the latency for the calls that were made into my software in the past minute, hour, day?
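To make these questions concrete, here is a minimal sketch of a sliding-window recorder that can answer them for a single time window. It is an illustration only: the class name RequestStats and all of its methods are hypothetical, and it is not how any particular library collects its numbers. Timestamps are passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window recorder; an illustration, not a library API. */
class RequestStats {
    private static final class Sample {
        final long timestampMillis;
        final long latencyMillis;
        final boolean success;

        Sample(long timestampMillis, long latencyMillis, boolean success) {
            this.timestampMillis = timestampMillis;
            this.latencyMillis = latencyMillis;
            this.success = success;
        }
    }

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<Sample>();

    RequestStats(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Record one finished request and drop samples that fell out of the window. */
    synchronized void record(long nowMillis, long latencyMillis, boolean success) {
        samples.addLast(new Sample(nowMillis, latencyMillis, success));
        evict(nowMillis);
    }

    /** Requests seen in the window, divided by the window length in seconds. */
    synchronized double requestsPerSecond(long nowMillis) {
        evict(nowMillis);
        return samples.size() / (windowMillis / 1000.0);
    }

    /** Number of failed requests in the window. */
    synchronized long failureCount(long nowMillis) {
        evict(nowMillis);
        long failures = 0;
        for (Sample s : samples) {
            if (!s.success) {
                failures++;
            }
        }
        return failures;
    }

    /** Mean latency of successful requests in the window, or 0.0 if there were none. */
    synchronized double averageSuccessLatency(long nowMillis) {
        evict(nowMillis);
        long total = 0;
        long count = 0;
        for (Sample s : samples) {
            if (s.success) {
                total += s.latencyMillis;
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) total / count;
    }

    /** Remove samples whose timestamp is at or before (now - window). */
    private void evict(long nowMillis) {
        while (!samples.isEmpty()
                && samples.peekFirst().timestampMillis <= nowMillis - windowMillis) {
            samples.removeFirst();
        }
    }
}
```

One instance per window (minute, hour, day) would cover the questions above; a production version would more likely keep bucketed counters than every individual sample.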

Fault Tolerance
How often is it the case that the software you build has to call an outside resource?  Maybe your software needs to make an API call to another system to get some information/data or maybe your system has to integrate with a remote system in a specific way; how do you insulate yourself from that other system’s failures?  How do you go about keeping the system you develop responsive and allowing it to fail quickly and respond back to the end users in a gracefully degraded way?

I believe that making a software system that “gracefully and quickly” fails when an outside resource is not available is usually accomplished by introducing something like timeouts, retries or other systematic back-off mechanisms.  Adding timeouts or back-off can be problematic and can cause additional unforeseen issues like threads that hang or thread counts that run out of control.  What we really would like to do is detect errors and if the error rate is high enough turn off ALL remote calls to that resource to save the user and the system from the cost of having to ‘wait’ for timeouts to occur.
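The idea of turning off remote calls once errors pile up can be sketched in a few lines. The class below is a simplified, hypothetical illustration: it trips after a fixed number of consecutive failures rather than on a measured error rate, it returns a caller-supplied fallback while tripped, and it allows one trial call once the reset interval has passed. The names SimpleCircuitBreaker, failureLimit and resetMillis are assumptions made for this example and do not describe any library's actual implementation.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Hypothetical, simplified circuit breaker; an illustration only. */
class SimpleCircuitBreaker {
    private final int failureLimit;      // consecutive failures before tripping
    private final long resetMillis;      // how long to stay tripped before a trial call
    private final LongSupplier clock;    // injected so time can be controlled in tests
    private int consecutiveFailures = 0;
    private long lastTripTime = 0L;
    private boolean open = false;

    SimpleCircuitBreaker(int failureLimit, long resetMillis, LongSupplier clock) {
        this.failureLimit = failureLimit;
        this.resetMillis = resetMillis;
        this.clock = clock;
    }

    /**
     * Runs the call unless the breaker is tripped. While tripped, the fallback
     * is returned immediately, so callers never wait on a timeout.
     * Holding the lock during the call keeps the sketch short; a real
     * implementation would not serialize remote calls like this.
     */
    synchronized <T> T invoke(Callable<T> call, T fallback) {
        long now = clock.getAsLong();
        if (open && now - lastTripTime < resetMillis) {
            return fallback; // fail fast without touching the remote resource
        }
        try {
            T result = call.call();
            consecutiveFailures = 0;
            open = false; // a success closes the breaker again
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureLimit) {
                open = true; // trip, or re-trip after a failed trial call
                lastTripTime = now;
            }
            return fallback;
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

Usage would look something like breaker.invoke(callToRemoteSystem, cachedOrDefaultValue), where the second argument is whatever gracefully degraded response makes sense for the end user.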

Enter a Java project that helps you move beyond out of the box monitoring and timeout-based fault tolerance with ease: JRugged.

JRugged
JRugged provides straightforward add-ons to existing code to make it more tolerant of failures and easier to manage/monitor. In other words, it makes your Java code more rugged!

The purpose of the project is to help answer the questions we posed above in a straightforward and easy to understand way.  By answering questions like how many requests per second am I processing currently, JRugged makes it dead simple for any project to be able to gather and understand the metrics for their running systems as well as assisting in making those production systems as resilient to failure as possible.

For collecting performance statistics we have a PerformanceMonitor object that provides the following output:

RequestCount: 26
AverageSuccessLatencyLastMinute: 974.3247446446182
AverageSuccessLatencyLastHour: 1051.3236248591827
AverageSuccessLatencyLastDay: 1052.9298656896194
AverageFailureLatencyLastMinute: 0.0
AverageFailureLatencyLastHour: 0.0
AverageFailureLatencyLastDay: 0.0
TotalRequestsPerSecondLastMinute: 0.34042328314561054
SuccessRequestsPerSecondLastMinute: 0.34042328314561054
FailureRequestsPerSecondLastMinute: 0.0
TotalRequestsPerSecondLastHour: 0.006920235124926995
SuccessRequestsPerSecondLastHour: 0.006920235124926995
FailureRequestsPerSecondLastHour: 0.0
TotalRequestsPerSecondLastDay: 2.893097268271628E-4
SuccessRequestsPerSecondLastDay: 2.893097268271628E-4
FailureRequestsPerSecondLastDay: 0.0
TotalRequestsPerSecondLifetime: 0.6241298190023524
SuccessRequestsPerSecondLifetime: 0.6241298190023524
FailureRequestsPerSecondLifetime: 0.0
SuccessCount: 26
FailureCount: 0

For adding fault tolerance and resilience we have CircuitBreakers.  CircuitBreakers provide the following characteristics:

LastTripTime: 0
TripCount: 0
ByPassState: false
ResetMillis: 10000
HealthCheck: GREEN
Status: UP
FailureInterpreter: org.fishwife.jrugged.DefaultFailureInterpreter@39ce9085
ExceptionMapper: null

There are three modules in the project:

jrugged-core
Contained in jrugged-core are building block classes that can be used independently to build out functionality within your application.  Most of the items in jrugged-core utilize a simple decorator pattern allowing them to be easily inserted into your existing Java projects with little or no impact.

For example, if a developer wanted to wrap a performance monitor around a section of code for a backend call it might look like the following:

public BackEndData processArgument(final String myArg) throws Exception {
     // Capture the backend in a final local so the anonymous class can use it.
     final BackEndService theBackend = backend;
     // The monitor times the call and records whether it succeeded or failed.
     return perfMonitor.invoke(new Callable<BackEndData>() {
          public BackEndData call() throws Exception {
               return theBackend.processArgument(myArg);
          }
     });
}

If you were then interested in the number of requests per second made to that back-end you could interrogate the ‘perfMonitor’ object to find out.

jrugged-spring
The jrugged-spring library is built upon the classes and items exposed in jrugged-core to provide an easy Spring integration path.  jrugged-spring provides Spring interceptors that can be utilized in conjunction with a Spring proxy to ‘wrap’ methods in classes based on  regular Spring configuration files.   If you are currently using Spring, this is the way to go, as there is no change to your existing code needed.  Just add needed lines to the Spring config and you automatically have the performance information gathered into an object that can be exposed easily with JMX.

jrugged-aspects
Similar to the jrugged-spring library, the aspects library provides the user with handy annotations that can be used on methods to wrap the performance monitoring or circuitbreaker classes around the target method.  Getting statistics or incorporating graceful and quick failures becomes simply a matter of adding the appropriate annotations and assigning a name to get it to start collecting information or providing the fault tolerance of a circuit breaker.

What is the take away?
Having lots of information about how our software runs and handles failures is important to the business, and we should already be building mechanisms into our code to provide it; the problem is that there is rarely time to do so.  JRugged makes adding these critical components to your software so easy that it would be criminal not to add them.  Please go and check out the project at http://code.google.com/p/jrugged/; we are always looking for comments on how it works out for you and suggestions for future enhancements.

Benchmarking the HttpClient Caching Module

Overview

During the past year, CIM has submitted many features to the HttpClient Caching module. We recently ran some benchmarks to quantify the performance benefits and test a few failure scenarios.

Through our benchmarks, we aimed to do the following:

  • Characterize latency and capacity benefits provided by the caching module.
  • Characterize latency and capacity benefits of locally-bound memcached instances vs. a memcached pool (when is one better than the other?).
  • Verify failover behavior of consistent-hashing memcached algorithm by killing off one of the memcached instances after the cache has warmed.

We made a few choices to help simplify our testing:

  • The client issued unconditional requests. Conditional requests (If-Match, If-Modified-Since) were only sent from the cache to the origin.
  • The client issued an even distribution of requests.
  • The origin only responded with HTTP 200 or 304 status codes.
  • Target figures are for a warmed cache. We let the tests run long enough so that the warm-up period was negligible.

Environment

  • Java 1.6.0_22
  • HttpClient trunk, revision 1024393
  • 4 large EC2 instances running Ubuntu 10.04 (Lucid Lynx) (ami-4234de2b)
  • Ubuntu packages:
    • memcached 1.4.2-1ubuntu3
    • apache2 2.2.14-5ubuntu8.3
    • libapache2-mod-php5 5.3.2-1ubuntu4.5

Components

Client

At startup the client creates a connection pool and starts a group of worker threads, each of which submits requests to the server.

Request URLs are randomly selected from a configured range, resulting in an even distribution.

Each worker collects its own statistics, which are aggregated and logged by a reporter thread at the configured interval. The statistics include requests per second, latency (average, median, 95th and 99th percentile), a mapping of HTTP response codes to quantity, and a mapping of cache events (hits, misses, validations) to quantity.
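As an illustration of the latency figures listed above, the following sketch computes an average and nearest-rank percentiles from a batch of recorded latencies. The class name LatencySummary and the choice of the nearest-rank method are assumptions made for this example; the benchmark client's actual calculation is not shown in this post.

```java
import java.util.Arrays;

/** Hypothetical helper for summarizing latencies; not the benchmark client's code. */
class LatencySummary {

    /** Nearest-rank percentile: the smallest value with at least p% of samples at or below it. */
    static long percentile(long[] latencies, int p) {
        if (latencies.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latencies.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean of the samples. */
    static double average(long[] latencies) {
        if (latencies.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long total = 0;
        for (long latency : latencies) {
            total += latency;
        }
        return (double) total / latencies.length;
    }
}
```

With this helper the median is percentile(samples, 50), and the tail figures are percentile(samples, 95) and percentile(samples, 99).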

Origin

Our origin server is a PHP script which was dropped into an Apache instance’s docroot.

Our origin requirements:

  • Serve varied max-age values across the URL space to ensure staggered cache revalidation.
  • Recognize conditional requests, and be able to decide to return a 200 or 304 in a predictable way.
  • Sleep for a configured amount of time during each response in order to establish a minimum latency.
  • Generate responses of a configured size.

We looked at production access logs to get an idea of our real-world cache hit ratio. We also derived that 25% of our conditional requests (from cache revalidation) should return a 304 response.

Tests

We ran the following tests:

1. No Caching

All requests were sent directly to the origin.

2. Local Memcached – Bounded

Each client used its own memcached instance. memcached was configured to store around half of our total data set. (The total data set was 5MB; memcached stored only 2.5MB.)

3. Local Memcached – Unbounded

Each client used its own memcached instance. memcached was able to store our entire data set.

4. Pooled Memcached

Each client shared a memcached pool which used consistent hashing (ketama) to store cache entries.
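For readers unfamiliar with consistent hashing, here is a minimal sketch of the idea: each node is placed at many points on a ring, and a key belongs to the first node point at or after the key's own hash. When a node is removed, only the keys that lived on that node move. This is a simplified illustration and is not byte-compatible with ketama; the class name HashRing, the node names, and the choice of 100 points per node are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified consistent-hash ring; not the ketama implementation. */
class HashRing {
    private static final int POINTS_PER_NODE = 100; // assumed value for illustration
    private final TreeMap<Long, String> ring = new TreeMap<Long, String>();

    /** Place the node at many pseudo-random points around the ring. */
    void addNode(String node) {
        for (int i = 0; i < POINTS_PER_NODE; i++) {
            ring.put(hash(node + "-" + i), node);
        }
    }

    /** Remove all of the node's points; its keys fall through to the next node on the ring. */
    void removeNode(String node) {
        for (int i = 0; i < POINTS_PER_NODE; i++) {
            ring.remove(hash(node + "-" + i));
        }
    }

    /** The owner of a key is the first node point at or after the key's hash, wrapping around. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First four bytes of the MD5 digest, read as an unsigned 32-bit value. */
    private static long hash(String value) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[0] & 0xFF) << 24)
                    | ((d[1] & 0xFF) << 16)
                    | ((d[2] & 0xFF) << 8)
                    | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The property that matters for the failover test is that removing one node leaves every other key's owner unchanged, so the surviving nodes keep serving the entries they already hold.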

5. Consistent Hashing Memcached – Failover

Using the same shared memcached pool as test 4, we failed 2 of the memcached nodes one at a time and restored them.

Graphs

Overall

This graph shows latency over time for all tests. After the cache filled, latency was constant.

Note that the high latency for the “Local Memcached – Bounded” test can be attributed to our even distribution of request URLs. This ensured that items were constantly being evicted from the cache. A more Pareto-like distribution would have kept the more frequently accessed items in the cache for longer periods of time, which is similar to the patterns we see in our production environments.

overview graph

Memcached Failover – Cache Misses

The first post cache-warming spike corresponds to the first memcached node being killed. The second spike came after the second memcached node was killed.

The last 2 spikes correspond to the 2 failed memcached nodes being brought back to life, and the likely shuffling of data around the consistent hashing ring (with “shuffling” meaning that cache misses are occurring despite cache entries existing on some of the nodes that were alive for the entire test).

memcached failover - cache misses

Memcached Failover – Cache Events

You can see that despite multiple memcached nodes failing, our cache hit ratio never dropped by more than 20%.

memcached failover - cache events

Conclusion

The HttpClient caching module successfully reduces request latency and server load given the proper operating conditions.

Using the cache, we observed that while higher percentile latency was not heavily affected (cache misses will always dominate the tail end of latencies), median and average latency were greatly reduced.

Using a local memcached instance lowered request latency, which can be partially attributed to memcached running on the same host.

Using pooled memcached instances, we saw that load on the origin was greatly reduced at the expense of slightly higher latency. The lower load can be attributed to the cache filling more quickly and each node cooperating to keep the cache current. The higher latency can be attributed to the overhead of locating data and the cost of communicating with other memcached instances over the network.

Finally, we see that when using consistent hashing, node failure is handled gracefully.

Acknowledgements

Thanks to Michajlo Matijkiw for doing the bulk of the benchmarking work.

Also thanks to everyone at CIM who has contributed to the HttpClient caching module: Jon Moore, Ben Schmaus, Joe Campbell, Mohammed Uddin, Dave Mays, Brad Spenla, Dave Cleaver.

Two Talks at Øredev 2010

I was fortunate to be invited to give two talks at Øredev 2010 in Malmö, Sweden earlier this month. In general, this was one of the best-run technical conferences I’ve attended, including several innovations like simple rating of talks with green/yellow/red card “voting” near the exits of the rooms or having a dedicated “chalk talk” area for folks to have followup conversations with speakers they found intriguing.

Comcast Debuts iPad Xfinity TV App and You Can Download it NOW

CIM’s SVP & GM, Matt Strauss, posted a piece about the app on Comcast Voices.

YouTube: Watch Comcast Cable President, Neil Smit, Demo the Xfinity TV App at Web 2.0:

You can download the app and get your questions answered about it on Xfinity.com.

Presentation: The Subversion Command Line Client

YouTube: The Subversion Command Line Client Part 1 of 2:

YouTube: The Subversion Command Line Client Part 2 of 2:

Getting to Know CIM

A little video montage put together by Matt. You can follow him on Twitter at @MatthewCanning.