Archive for December 2009


December 21, 2009

I have a few difficulties with the way our team currently decides to plan what to work on for the next quarter or two. I’m not sure how to resolve this yet, but I thought I might just lay it out and see if anything interesting pops up.

Currently the managers decide what our priorities are for the next quarter by looking at all of the open bugs in Bugzilla. They take this list and present it to each of the major customers. Each customer looks at the bugs (really just the titles), picks the ones that are important to them, and rates each one with a priority. The managers collate this information and generate a list ordered from most important to least important. The top items that can be accomplished in the quarter are what we’re supposed to work on.

I have a few problems with this setup, however. First of all, it requires that all possible future work be written down as a bug in Bugzilla, and that the title of each bug be written not to reflect the technical nature of the problem but to market it to the users so that they might rate it higher. It also leaves all of the decisions up to the customers, who can be very feature focused, and does not allow us to perform important refactoring or rewrites of sections of the code. Building new features on an already unstable base can be dangerous if you don’t have the time to fix the foundation first. I do think that the customers should have a very large say in what the development team spends its time on. However, the development team is never involved in this process, which is unfortunate and bound for failure.

It is also difficult in that all ideas that you might want to work on or have discussed with coworkers have to be entered into Bugzilla to be considered. This isn’t too big of a problem, just an adjustment in workflow. However, managers often don’t like bugs to be opened if they can’t be fixed in the short term. We’d have to open up very long-term bugs to have some of them worked on at all due to the selection process.

We also have no real way of addressing some of our larger scalability problems. Many of the problems are due to our underlying data model. However, fixing the data model could break current user automations. No customer is going to vote on a bug that might make their work more difficult in the short term, even if it improves it in the long term. What most likely will result is that we will scale using our current model until it breaks and can’t scale any more. Then our users will be in a very uncomfortable situation, having to convert to a new data model that we’ll have to create quickly and under pressure. This new data model probably won’t be as robust as something we developed outside of crunch mode. The conversion process for customers will also be more painful, as there will have been more time for them to create automations and other processes based on the old data model.

I don’t have a solution to this, but perhaps rereading this post after a holiday break will provide new enlightenment to me. For everyone who’s about to start their vacations (if you get one), Happy Holidays. For those who still have to work for a few more days, I’ll be making my last post of the year on Wednesday.


December 18, 2009

There are only two more posts for me on here before 2010, and since I can’t think of anything else to post about, I’m going to list my top development-ish New Year’s resolutions, which I’m going to make up on the spot.

  1. Finish an iPhone app – I’ve started a few small apps but I haven’t had one fully finished in a usable form. I want to create at least one app and release it for free on the app store, just for fun.
  2. Design an elegant Report system – Our current application has difficulties generating reports due to bad code and weird business rules. I want to rewrite the entire report system into something elegant and fast.
  3. Play more Go – I really like Go, and I haven’t played much in the last year or so. I’d like to start more games in person and maybe schedule regular times to play online games. I feel that Go helps me as a developer both for its analytical possibilities and for the strategic maneuverings that are useful in any office place.
  4. Be more awesome – This one is sort of vague. By being more awesome I mean to increase my productivity, make myself someone reliable that people go to for help, push the boundaries on development and research for my team, and generally become more awesome at work.
  5. Work on more home projects – This goes a bit along with the first one. I do a lot of development at work, but it’s a very narrow area of development. In order to really expand my abilities as a software developer I have to free myself of the project restrictions placed on me by my current priorities at work and do more spare-time development. Maybe I’ll make a game or improve some process that I do by automating it. Maybe I’ll find an open source project to contribute to. Speaking of…
  6. Contribute to an Open Source project – It’s often quite a bit more difficult than people make it sound to contribute to an open source project and have your changes accepted. Just reading up on the code for some projects in order to understand it enough to make a change can take a lot of time and dedication. I’d like to pick a project to help out on and contribute to as I feel it will help me as a developer along with giving back to the community.
  7. Improve my average pomodoro count – Currently my average number of pomodoros is lower than I would like. I’d like to increase how many pomodoros I complete per day by an average of 3-4. Trying to remove procrastination activities and learning how to focus better is still an ongoing process.
  8. Continue to update this blog – I’d like to be able to do another post like this one in about a year. That means not giving up on the blog and making it part of my routine so that it doesn’t get abandoned.

That’s just a quick list that I came up with in 10 minutes or so. It’s not exhaustive and I’m not sure if I’ll actually meet all of the goals but I’d like to try.

A Month of the Pomodoro Technique

December 16, 2009

It has been one month as of today since I started using the Pomodoro Technique at work in an attempt to increase my productivity. I think that after a month of use I can provide an evaluation of the technique: what I find good about it, along with some of the negative sides. While I think that using pomodoros has increased my productivity overall, there are still some gaps that I have yet to discover ways of filling.

There were several reasons why I started to use the pomodoro technique. I was first drawn in to the concept by an article in the free PragPub magazine. This made me interested enough to read the Pomodoro Technique Illustrated book. I decided to give the technique a try to see if it could allow me to focus more on my work. I had been bouncing a bit too much from task to task as new tasks got added by management, and older tasks would get forgotten about until they were asked about again. Trying the technique was my own decision, made to improve my time management and work efficiency.

A quick intro for those who don’t know: The pomodoro technique is a time management system for minimizing multi-tasking along with prioritizing activities. Work is divided into 25 minute chunks called pomodoros that are separated by 5 minute breaks. The main idea is that for those 25 minutes you should be focusing 100% on the task at hand and avoiding internal and external distractions as much as possible.

My Process

There are many aspects of the pomodoro technique that I really like and that have helped me over the last month. I think the first and most important one is the artifacts that the technique produces. By artifacts I mean the Activity Inventory, the daily Todo sheets, the memory maps, and the Record sheet. These items add a lot of organization to my workflow along with preventing me from forgetting tasks I may have been working on.

For the activity inventory, I use a fairly basic legal notepad. I write down entries of things to do, one per line. The sheet often extends to multiple pages. At the end of the day I cross off anything that was completed on my Todo sheet for that day so that I can just look at the uncrossed entries when deciding what to do at the start of my day. I sometimes put my estimated number of pomodoros next to each item, but not always. I also try to write down everything I can think of, even if it isn’t work related but I want to get it done at some point. Some non-work related items I can get done during the day, but others are just good notes to myself that I can read each day as reminders, like making dinner plans for the week or calling the dentist.

My Todo sheet is done on a large spiral notebook of grid-paper. Grid paper is my favorite type of paper for notes, and it works very well for the Todo sheet. I leave 4 boxes empty to the left of each entry in case I need to add a U (unplanned) or small note to a task. I try to leave at least 7 boxes free on the right hand side so I can draw in the pomodoro boxes. I write down each task I want to do that day in pen and then I draw empty boxes in pencil to the right of it, each box indicating a pomodoro I think it will take. There’s usually enough space between those boxes and the item that I use that for marking interruptions. At the end of the day, I use the back of the page to draw up my memory map for the day.

The record sheet I keep in a spreadsheet (I use Numbers from iWork). I have 3 sheets on my record sheet spreadsheet. The first is an overview record sheet. Each row contains one day, with the columns indicating the date, completed pomodoros, internal interruptions, external interruptions, underestimated pomodoros, overestimated pomodoros, completed tasks, incomplete tasks, unplanned activities added to the todo, and unplanned activities added to the inventory. At the bottom of this sheet there is a line that gives averages for each of these columns. More on this a bit later.

The second sheet is an extended log. Here I make basically a copy of my todo sheet for the day. It contains the name of the activity, the number of pomodoros spent on it, two columns for guess errors in case I guessed the wrong number of pomodoros to complete the activity, the number of internal interruptions, external interruptions, if the activity was completed (I just put in an x if it was), and any comments. There are total lines for each day.

The final sheet is a chart sheet. Currently I only have one chart, but I’ll probably expand this when the task makes it to my todo list. The chart shows completed pomodoros for each day. I just use the data of the second column (completed pomodoros) from the first sheet to create this chart. The chart gives me a good indication of how well I was able to perform on each day and shows me some trends. I don’t show the record sheet to anyone else, so it’s just for my own purposes. This whole system isn’t something I would bring up to managers during a review or anything. That would bring too much stress into the project and would lead to the temptation to lie.

Positive Aspects

One aspect of the technique that I like is the recording of pomodoros spent each day. When starting a day, you can set a goal for yourself to beat your average number of pomodoros done in a day, which is recorded at the bottom of the first sheet of the record sheet. That average is also a good indicator of how much you’ll likely be able to accomplish in a day, so you can add tasks to the todo sheet until the number of pomodoro boxes is equal to or maybe one greater than your average. If you end up completing every task, you can spend 5 minutes to add some more on and congratulate yourself for being so productive.

I also enjoy the focus a pomodoro can give to me. I try to be tangentially aware that while I’m doing a pomodoro I should ignore all IMs and emails and try to be bothered by other people around me as little as possible. Those distractions can really get in the way and derail you. If something is really on my mind and I don’t want to forget it, I’ll write it down on my activity inventory or todo sheet and mark an interruption, but at least then it’s out of my head and tabled for later. It’s also really nice to be able to mark that X at the end of a pomodoro. There have been many times where I would think about getting up for some water or a snack or something but I have 12 minutes left in my pomodoro. I know that doing something like getting a snack can sometimes drive you completely off course of what you’re doing, and honestly I can wait 12 minutes. The pomodoro gives me the incentive to put off doing those things that end up being procrastinations. Basically it lets you procrastinate from procrastinating, which I feel is good.

I’m also finding that keeping things like the extended log in the record sheet, and tracking how long each task takes me, makes it easier to estimate new tasks that I get. I tend to have to break down tasks several times, but by doing that and adding all of the pieces together you can get a rough idea of how long a task will take. Also, since you know how many pomodoros you can complete in a day (it will be lower than you think), you can estimate how many days longer tasks will take. This is very useful since managers are all about the time estimates and wanting to know when something is going to be done.
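The arithmetic behind that kind of estimate is small enough to sketch. The snippet below is only an illustration: the daily counts are invented, and the function names `average` and `estimateDays` are hypothetical helpers, not part of any pomodoro tool.

```javascript
// Completed pomodoros per day, as they might appear in the overview
// record sheet. These numbers are invented for illustration.
const completedPerDay = [8, 10, 7, 9, 6];

// Mean of a list of numbers (hypothetical helper).
function average(values) {
   return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Days needed for a task estimated at `pomodoros`, given a daily average.
// Rounds up, because a partly used day is still a day on the calendar.
function estimateDays(pomodoros, dailyAverage) {
   return Math.ceil(pomodoros / dailyAverage);
}

const dailyAverage = average(completedPerDay);
const daysForBigTask = estimateDays(30, dailyAverage);

console.log("Average per day:", dailyAverage);
console.log("Days for a 30-pomodoro task:", daysForBigTask);
```

With these sample numbers the average is 8 pomodoros a day, so a task broken down into 30 pomodoros comes out at roughly 4 working days.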

Being able to make up a todo list daily and work on those items is also the main reason that I am able to write this blog. Often when I start a blog I make a few posts every so often but then I forget to post and it just ends up dying. Promising myself that I will post every Monday, Wednesday, and Friday, and adding an entry at the top of my todo list on each of those days, pushes me to not be as lax about writing entries. This has worked out fairly well so far, as I’ve only missed a few days and I feel more accomplished about my blogging abilities.

Not As Positive Aspects

Ok, this technique is not perfect. I didn’t really think it would be, and I don’t think that there exists a ‘perfect’ technique. Here are some of the faults I find in the technique that impede my productivity. I haven’t figured out a solution to all of them yet.

Pomodoros are an all or nothing affair. Either you work for 25 minutes straight to mark your X or you don’t complete a pomodoro. Since marking that X is the measurable sign of progress, you start to shy away from engaging in an activity if it won’t result in an X. For instance, managers love meetings. I think it is some sort of ambrosia to them to have people sit in a room together bored out of their mind as they rant about things that they think are important. Meetings get in the way of pomodoros. Say I have a meeting set for 4:30pm. It is currently 4:10pm, meaning I only have 20 minutes between now and the meeting. If I start a pomodoro, I won’t be able to finish it because I only have 20 minutes. Managers will come by your desk and poke you to go to the meeting at the exact time the meeting is supposed to start, so I can’t just show up 5 minutes late. In these instances I tend to not start a pomodoro because I won’t have enough time to complete it anyway. I try to fill up the time with administration tasks like catching up on email or organizing, but sometimes I honestly just end up reading blogs.

This can be a real problem, especially if you have lots of scheduled meetings every day or even daily meetings. I had a morning standup for a while that took all of 10 minutes, however it took place about 45 minutes after I got into work. I could spend the first few minutes putting together my Todo list and waiting for my computer to start up. After my todo list was created, I wouldn’t have enough time to do a full pomodoro before the meeting started. This meant that I couldn’t start any real work until that meeting was over, which was about an hour after I came in. So the first hour of my day was wasted because of a meeting and the technique’s inability to deal with time slots shorter than your designated pomodoro time (defaults to 25 minutes).

Long meetings, like presentations or informative meetings, are also a small difficulty for pomodoros in that you often don’t get those 5 minute breaks every half hour. This makes me cautious about counting meetings in my Todo sheet as something I can mark off as having spent a pomodoro or two on. Sometimes I feel I can do that, other times I’m not as sure.

I think overall my main problem might not even be with the technique at all, just meetings. It’s an area to consider.


In the end, I like using the pomodoro technique. It’s fairly simple and it keeps me organized. I found a very nice, simple application for the Mac for measuring pomodoros, so I don’t need a real kitchen timer. I like this because it’s easy to see on my computer, doesn’t get in the way, can provide the ticking sound and proper notifications, and has easy keyboard shortcuts. It’s also good that if I keep the volume at the correct level on my computer, it doesn’t irritate my coworkers like a kitchen timer would.

I’m going to continue my current path of using the technique, but I might tweak some things. There are additions I would like to make to my record sheet and I still need to work out a good way to deal with smaller units of time in a useful manner. Maybe a separate series of mini-pomodoros that are tracked differently.


December 15, 2009

At times there seems to be too much stagnation in the world of software development. I don’t mean that progress in new technologies or techniques is slowing; in fact it is probably chugging away at its standard accelerated pace compared to the rest of the sciences. I mean that developers themselves tend to stagnate.

This is a dangerous thing for any developer and for our community as a whole. Many developers become stuck in a rut of working on the same programs, writing in the same language, fixing bugs or adding minor incremental changes. Maintenance mode: it dominates many software shops and can cause your career and your company to become outdated.

As an example, take a look at Internet Explorer. Microsoft practically stopped developing IE in the mid-2000s because they had such a high market share they figured there wasn’t a need to do real work on a new version. Minor fixes were all that went into it. Meanwhile other browsers such as Firefox, Safari, and Opera were putting together new and exciting packages. Microsoft was at a loss and had to hurry up and create IE 7 and later 8 to try to catch up to the standards-compliant and innovative new browsers that were on the market. If they had kept pushing even when they felt secure, they would have had a better product and probably a better market position.

I feel the same thing occurs not only in the products we make, but in ourselves with our skills. Once you become a Java developer, or a web developer, or a Java Web developer, you stay there. Maybe you’ve been using C, or Java, or PHP all of your career and you’ve never really expanded out of that. Perhaps you’ve written mostly web applications and never really touched desktop or mobile applications. I feel that that is a significant loss to yourself and to your current position.

The thing is, people, and especially developers, need to change from time to time. If you keep programming in the same language on the same types of projects or are just in maintenance mode of the same project constantly, you’ll begin to stagnate. Your skills might be quite good in your area of expertise, but if you’re ever forced to leave your comfort zone you will find you struggle. Perhaps you get laid off and need to get a new job; having some versatility will look good on your resumé. Even if you’re staying in your same position, having skills in additional technologies, different languages, and programming on different platforms will help you in your current task by forcing you to change your point of view.

Change inevitably happens. I remember working at a government job in NY where all of the older programmers who were masters in COBOL were forced to retrain in Java or retire. Many of them had problems with that transition as Java and COBOL are fairly different. The few who were keeping up on their game for the last few decades and learning new strategies and processes on the side were poised to take over top positions after the transition. They already had the knowledge that they taught themselves, while the ones who stagnated either had to struggle to adapt or look for a new job.

My ambition typically involves trying to keep as many new languages in circulation in my head as I can. I use Java primarily at work, but when I start doing my own projects I tend to lean towards Objective-C, C++, Ruby, and Groovy. I mainly write web applications at work, so my home projects tend to be closer to command-line scripts, desktop applications, or mobile apps (the iPhone is a nice platform to play in as a developer). I may not be able to use these projects or the skills that I learn on them directly at my job, but I do become more versatile from the experiences. That versatility allows me to answer questions or think of designs differently than I would if I just focused on web based Java applications.

In addition, try to make sure that you do have side projects. I know that for a lot of people who work 8-14 hour days programming, the last thing they want to do when they go home is more programming. A lot of the time I would agree with them. However, keeping a few side projects that you work on several times a week in your spare time will keep you sharp and can be very rewarding, as you make things that you wouldn’t have the opportunity to make at work. I know that I’ve felt more accomplished by satisfying my own goals at home than by filling another bullet point on a feature list of a product that I have little passion for at work.

If you can’t think of a project, try to find an interesting open source one online to contribute to. It’s not always easy to get started, but those contributions will look great on a resumé and can help a multitude of people along with yourself.

Reusable Javascript: Write your own libraries

December 11, 2009

JavaScript is such an interesting language in that it is both extremely popular and hated at the same time. It is arguably the most used programming language in the world, thanks to its inherent involvement in web programming. It is the sauce behind AJAX, and creates the dynamic aspect of dynamic web pages. It’s easy to pick up and write, requiring no compiler and running on any system that has a web browser. Unfortunately this has generated a stigma of it not being a ‘real’ programming language. Often even professional programmers will treat JavaScript more dismissively than they would code written in Java, C, Perl, Python, Ruby, or one of the many other compiled and scripting languages.

JavaScript, or ECMAScript as it is technically known, is often written in a very procedural fashion. If you’re lucky you may see it being used as a functional language, when a developer decides to write functions at all. Rarely will you see it written in a true object-oriented fashion. This is unfortunate because JavaScript has great support for object-oriented programming. Objects can be created through prototypes of other objects, allowing you to create “classes” by merely building a base object from scratch and then cloning it via prototypes.
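As a minimal sketch of that prototype-based style, the snippet below builds a base object from scratch and derives a second object from it with Object.create, which is available in ECMAScript 5 and later engines. The object names (`baseWidget`, `tableWidget`) and their properties are made up for illustration.

```javascript
// A base object built from scratch; it plays the role of a "class".
var baseWidget = {
   name: "widget",
   describe: function () {
      return "I am a " + this.name;
   }
};

// A new object whose prototype is baseWidget. It inherits describe()
// and overrides only the property it cares about.
var tableWidget = Object.create(baseWidget);
tableWidget.name = "table widget";

// Behavior added to the base later is visible through the prototype chain.
baseWidget.shout = function () {
   return this.describe().toUpperCase() + "!";
};

console.log(baseWidget.describe());
console.log(tableWidget.describe());
console.log(tableWidget.shout());
```

Note that `tableWidget` does not copy `describe`; it looks the method up on `baseWidget` at call time, which is why the `shout` method added afterwards is also available on it.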

A lot of JavaScript is created as one-off functions or scripts for a single purpose. Many developers are rebuilding their same wheel multiple times in the same application or even web page. While there are great libraries that showcase how useable pluggable JavaScript can be (take a look at the fully open sourced YUI library some time), very few developers abstract their business logic in a way that is reusable.

I think the problem mostly stems from the ingrained thought of the coupling between JavaScript and the web browser. People tend to write JavaScript for a specific web page, rather than to perform specific functions. A reapplication of the MVC pattern can be quite helpful here to separate concerns and promote reusability. I think this shows greatest when dealing with AJAX/Web Services.

Many applications are using AJAX to create more interactive, faster, and user friendly applications (we’ll ignore the back button concern for now). Web designers love AJAX because it creates a great feel for the users due to instant feedback and the lack of page loading. Web programmers dislike AJAX because it makes writing programs more difficult and can cause a lot more work. However, by separating some of the JavaScript code into distinct units you can reuse common functions across many pages.

Here’s an example:

var WIDGET = {}; // we use this as a namespace

WIDGET.removeRowFromTable = function() {
   // WIDGET.getSelectedItem() is assumed to be defined elsewhere on the page
   var selectedItem = WIDGET.getSelectedItem();

   if (selectedItem != null) {
      var callback = WIDGET.getCallback(WIDGET.callbackDelegate);
      var postData = "itemId=" + selectedItem;

      YAHOO.util.Connect.asyncRequest('POST', '', callback, postData);
   }
};

/**
 * Override this method to change what to do on response (view stuff for response)
 * o - JSON object of response data
 */
WIDGET.callbackDelegate = function(o) {
   // Update view by removing row, maybe supplying text box to the user of what happened
   window.alert("The WIDGET.callbackDelegate method should be overridden per view.");
};

WIDGET.getCallback = function(successFunction) {
   return {
      // default timeout. YUI's Connection Manager expects a number of milliseconds here
      // and calls the failure handler when it expires (5000 is only an example value).
      timeout: 5000,
      success: successFunction,
      failure: function(o) {
         // default failure function. You could also change getCallback to pass this in if you want
         // to customize your failure.
      }
   };
};

This file can be part of your base “AJAX action” JavaScript, which provides the actual actions to perform your AJAX commands. You can then use another JavaScript file to declare custom versions of WIDGET.callbackDelegate, which will override the already declared version. Just make sure that your new file is included in your HTML file after the first one. By declaring the default visual response to be an alert, you will know which methods you should override on each new page.
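The override order can be tried outside a browser with a rough, self-contained sketch. Everything here besides the WIDGET.callbackDelegate and WIDGET.getCallback names is a hypothetical stand-in: `WIDGET.log` replaces window.alert, and `WIDGET.fakeRequest` replaces the YUI asynchronous request by calling the success handler immediately.

```javascript
// --- "base" file, loaded first (hypothetical stand-in for the shared actions) ---
var WIDGET = {};
WIDGET.log = []; // hypothetical: collects messages instead of calling window.alert

// Default delegate: loudly reports that the page forgot to override it.
WIDGET.callbackDelegate = function (o) {
   WIDGET.log.push("The WIDGET.callbackDelegate method should be overridden per view.");
};

WIDGET.getCallback = function (successFunction) {
   return { success: successFunction, failure: function (o) {} };
};

// Hypothetical replacement for the real asynchronous request:
// it invokes the success handler right away with a fake response.
WIDGET.fakeRequest = function (callback, response) {
   callback.success(response);
};

WIDGET.removeRowFromTable = function (itemId) {
   // The delegate is looked up when the action runs, not when this file
   // loads, so a file included later can still replace it.
   var callback = WIDGET.getCallback(WIDGET.callbackDelegate);
   WIDGET.fakeRequest(callback, { itemId: itemId });
};

WIDGET.removeRowFromTable(7); // no override yet: the default message is logged

// --- page-specific file, loaded second ---
WIDGET.callbackDelegate = function (o) {
   WIDGET.log.push("Removed row " + o.itemId);
};

WIDGET.removeRowFromTable(42); // the page-specific delegate runs now

console.log(WIDGET.log);
```

The design point is that the action code never needs to change per page; only the delegate assigned last decides how the view reacts.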

Douglas Crockford, a fellow Yahoo and a member of the ECMAScript standards group, has done a series of webcasts describing some additional JavaScript best practices that many developers may not follow. He also wrote a book that’s worth checking out.

The Game of Go

December 7, 2009

This post actually has very little to do with programming, but is about a topic that some programmers might find interesting anyway. As the title implies, it is about the game of Go, which is also my favorite board game.

As a quick recap for those who don’t know, Go is an ancient Chinese game developed around 4,000 years ago and played mostly by Chinese, Korean, and Japanese people. However, there is a growing player base in the US along with strong players in Germany and other areas of Europe. Go is the oldest board game still played in its original form, which is interesting because the rules are still being tweaked in official play. The game is played on a wooden board called a goban, which is marked with 19 horizontal and 19 vertical lines. Two players place white and black playing pieces called stones on the intersections of the grid in an attempt to surround empty space and/or opponent pieces. Each player is attempting to control more of the board than the other player.

The rules are actually really simple. Players take alternate turns, starting with black, placing stones on the board. Stones of the same color that touch each other are considered a unit and treated as a single stone for the purpose of capture; to capture one you must capture them all. To capture a stone you must surround it with your own stones so that the unit no longer has any empty spaces next to it. Stones cannot be moved after they are placed unless they are captured, in which case they are removed from the board. You cannot place a stone on the board if placing it would result in immediate capture. This means you can’t place a piece on an intersection that has no free space next to it, unless that piece would capture a neighboring piece of the opponent and therefore free up some space. At the end of the game each player counts up the number of empty spaces their stones are surrounding and subtracts from that the number of their stones that were captured (Japanese rules anyway… there are variations). Whoever has the most points wins.
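For programmers, the capture rule maps nicely onto a flood fill. The following is a rough sketch, not a full rules engine: the board encoding ("." for empty, "B" and "W" for stones), the tiny 5x5 size, and the function name `findUnit` are all assumptions made for illustration. It walks a unit of connected same-colored stones and counts the empty points next to it; a unit with zero such points is captured.

```javascript
// Finds the unit containing the stone at (row, col) and counts its
// adjacent empty points. Assumes a square board given as an array of
// strings and that the starting point holds a stone.
function findUnit(board, row, col) {
   const color = board[row][col];
   const size = board.length;
   const stones = new Set();
   const emptyNeighbors = new Set();
   const stack = [[row, col]];

   while (stack.length > 0) {
      const [r, c] = stack.pop();
      const key = r + "," + c;
      if (stones.has(key)) continue;
      stones.add(key);

      // Only horizontal and vertical neighbors count as touching.
      const neighbors = [[r - 1, c], [r + 1, c], [r, c - 1], [r, c + 1]];
      for (const [nr, nc] of neighbors) {
         if (nr < 0 || nr >= size || nc < 0 || nc >= size) continue;
         const cell = board[nr][nc];
         if (cell === ".") emptyNeighbors.add(nr + "," + nc);
         else if (cell === color) stack.push([nr, nc]);
      }
   }
   return { stones: stones.size, emptyNeighbors: emptyNeighbors.size };
}

// Two black stones completely surrounded by white.
const board = [
   ".....",
   ".WW..",
   "WBBW.",
   ".WW..",
   ".....",
];

const blackUnit = findUnit(board, 2, 1);
const whiteUnit = findUnit(board, 1, 1);
console.log("Black unit:", blackUnit, "captured:", blackUnit.emptyNeighbors === 0);
console.log("White unit:", whiteUnit);
```

In this position the two black stones form one unit with no empty neighbors, so both would be removed together, while the white pair above them still has four.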

Now there are a few extra parts to this. Since black goes first, black has first choice at placing a piece on the board. This gives black an advantage that is worth a few points. To compensate for this, white is given a ‘komi’ at the beginning of the game, currently often 6.5 points. This means that white starts with 6.5 points automatically for having gone second. The half point is there to prevent ties. There’s also an additional rule that the board cannot repeat itself. Certain configurations will result in what is called ‘ko’. From Sensei’s Library: “[Ko] describes a situation where two alternating single stone captures would repeat the original board position. Without a rule preventing such repetition, the players could repeat those two moves indefinitely, preventing the game from ending. To prevent this, the ko rule was introduced.”

The game is fairly easy to learn. Children as young as 2 can pick up the basic concepts and some strategies for the game. Wikipedia has one of my favorite descriptions about Go as it relates to mathematics: “In combinatorial game theory terms, Go is a zero sum, perfect information, partisan, deterministic strategy game, putting it in the same class as chess, checkers (draughts), and Reversi (Othello); however it differs from these in its game play. Although the rules are simple, the practical strategy is extremely complex.” Quite a bit more complex than chess I would say, while easier to learn. A game like that is something I find intriguing.

While Go is more popular in Asia, there is some support for it in the US as well. The American Go Association has a ranking system, official rulesets, holds tournaments, and tracks clubs where you can go to find players to learn or play the game. A friend of mine hosts a Promote Go website that has club listings in an easier to search form.

There are many websites and books to learn Go. A company called Slate and Shell publishes what is probably the largest set of English books on the topic. You can also play games online for free if you cannot get in contact with any players in your area. A great place to start would be KGS, which has a simple Java client that works on all platforms along with a server that hosts many games for free. There are other programs that use the Internet Go Server (IGS) protocol and can connect to a variety of servers, the most popular of which is Pandanet. There are also computer opponents (stay away from them for the most part) along with computer learning games and versions for other platforms such as the Xbox 360 and most mobile phones.

I encourage everyone to give the game a try as it can be one of the most mentally rewarding board games you’ll ever experience. If you’re ever interested in picking up a game with me, leave a comment or email me at jearil at gmail dot com.

Apache Pig Tips #1

December 2, 2009

Pig is a new and growing platform on top of Hadoop that makes writing jobs easier because you can avoid writing Map and Reduce functions in Java directly, while still allowing you to do so if you choose. Instead it provides a set of basic operators and functions, such as COUNT, FILTER, and FOREACH, that you would otherwise have to write yourself for each data manipulation you want to perform. Unfortunately, the Pig documentation is fairly sparse, and performing what you would think is a basic manipulation can become very difficult if there are no examples.

In this post, I’m going to provide some examples based on what I have learned about Pig in the last week. I labeled this as Apache Pig Tips #1 because I expect I may write more in the future as I uncover additional usages.

My problem domain includes a data set that has multiple IDs and a result field:

{tcid, tpid, tsid, date, result}

There are a few more fields but I’ll leave those out for brevity. A quick description of each field: the tcid is the Test Case id that a result was inserted for. The tpid is the Test Plan that the Test Case was a part of for this result. The tsid is the Test Suite that the Test Plan belongs to. The date is the date the result was added, and the result is the actual result (Pass, Fail, Postponed, etc.).

Now a Test Plan can have multiple Test Cases in it, however it can only have each test case in it once. A Test Case can also be in multiple Test Plans (though again only once for each Plan). A Test Suite can have multiple Test Plans, but each Test Plan belongs to exactly one Test Suite. Results for a test case in a test plan can be inserted multiple times. Maybe the first time it was tested it failed, so a Fail entry is added. At a later date it passes, so a new entry is made with a Pass result. We need to generate a report that shows how many Pass and Fail results each test suite has, using only the latest result for each test case in a test plan (ignoring previous ones).

The tab separated data is located on HDFS in a file named ‘allresults’. First we need to load the data into a variable:

A = LOAD 'allresults' USING PigStorage() AS (tcid:int, tpid:int, tsid:int, date:chararray, result:chararray);
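Before going further, it can help to confirm that the load picked up the fields you expect. A minimal sketch, assuming you are running in the Grunt shell or a script where diagnostic output is acceptable (the alias name peek is just an illustration):

```pig
-- Print the schema Pig has associated with A
DESCRIBE A;

-- Look at a handful of rows rather than dumping the whole file
peek = LIMIT A 10;
DUMP peek;
```

DUMP triggers an actual job, so on a large data set it is worth limiting the rows first as shown.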

Next we need to find all Test Case/Test Plan combinations and group by them. This will give us a list of items that has multiple results of different dates, but all for the same test case in a test plan.

B = GROUP A BY (tcid, tpid) PARALLEL 100;

The Pig documentation mentions that the GROUP keyword can be applied to a single alias and that BY can apply to a single item. What isn’t easily discovered in the documentation is that the item can be a tuple, which you can define inline by surrounding multiple fields with (). Normally your group by looks like: B = GROUP A BY tcid; However, to group on multiple fields so that each entry is a unique combination of those fields, you can surround them with () to make a tuple.
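When you group by a tuple, the group field of the result is itself a tuple, and its members can be pulled back out by name. A small sketch of that, where the alias pairs and the field name numresults are names chosen for this example only:

```pig
-- One row per (tcid, tpid) combination, with the number of results recorded for it
pairs = FOREACH B GENERATE group.tcid AS tcid, group.tpid AS tpid, COUNT(A) AS numresults;
```

Any row where numresults is greater than 1 is a test case that was run more than once in the same test plan.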

Currently the definition of B looks something like this:

{group: (tcid:int, tpid:int), A: {tcid:int, tpid:int, tsid:int, date:chararray, result:chararray}}

Basically we have a Bag where each item has a tuple named group containing the unique tcid and tpid, along with an inner bag named A that contains 1 or more result rows. We need to look at that inner bag and remove all but the most recent result row so that we have just the most recent result.

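X = FOREACH B {
    X1 = ORDER A BY date DESC;
    X2 = LIMIT X1 1;
    GENERATE FLATTEN(X2);
};

This will loop through all items that were grouped by tcid and tpid. For each one it orders the inner result bag by date. Pig's ORDER sorts ascending by default, so DESC is needed to put the most recent result first. We then take only the first item from each of those ordered bags (the most recent result) and export to X the flattened version of the limited bag. This produces a Bag of tuples with all the older results removed, leaving one row per test case in a test plan. Note that date is a chararray, so the ordering is a string comparison; it only matches chronological order if the dates are stored in a sortable format such as yyyy-MM-dd.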
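A quick way to sanity check this step is to confirm that X really contains at most one row per test case and test plan combination. This is only a sketch; the aliases Xbypair, paircounts and dupes are made up for the check:

```pig
-- Regroup the reduced data by the same key and count rows per key
Xbypair = GROUP X BY (tcid, tpid) PARALLEL 100;
paircounts = FOREACH Xbypair GENERATE group, COUNT(X) AS n;

-- Anything left here means a pair still has more than one result
dupes = FILTER paircounts BY n > 1;
DUMP dupes;
```

If everything worked, the dump should come back empty.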

After that we can split up all of the results into separate aliases by filtering on X multiple times:

pass = FILTER X BY result == 'Pass';
fail = FILTER X BY result == 'Fail';
postpone = FILTER X BY result == 'Postpone';
-- group by suite and count it
passbysuite = GROUP pass BY tsid PARALLEL 100;
failbysuite = GROUP fail BY tsid PARALLEL 100;
postponebysuite = GROUP postpone BY tsid PARALLEL 100;
-- generate counts
passcount = FOREACH passbysuite GENERATE group, COUNT(pass);
failcount = FOREACH failbysuite GENERATE group, COUNT(fail);
postponecount = FOREACH postponebysuite GENERATE group, COUNT(postpone);

We now have 3 Bags, each containing a group field, which holds the tsid, along with a count of how many results of that type exist for that test suite.
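The same tuple trick used for the first GROUP offers another way to get these numbers without a separate FILTER per result type. The following is a sketch of that alternative, not what I ran above; the aliases bysuiteresult and counts, the field name total, and the output path 'suitecounts' are all placeholders:

```pig
-- Group once on the (suite, result) combination instead of filtering three times
bysuiteresult = GROUP X BY (tsid, result) PARALLEL 100;

-- Flatten the group tuple back into columns and count the rows in each group
counts = FOREACH bysuiteresult GENERATE FLATTEN(group) AS (tsid, result), COUNT(X) AS total;

-- Write the report back to HDFS as tab separated text
STORE counts INTO 'suitecounts' USING PigStorage();
```

This yields one row per test suite and result type, and it picks up any result values you did not think to filter on. The three-alias version is still handy when you only care about specific results.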