Archive for November 2009

Pig Frustrations

November 30, 2009

My desires to implement better scalability through pre-processing reports via the Grid have lead me to Pig. Unfortunately, while Pig does remove some of the difficulties of writing for Hadoop (you no longer have to write all of the map-reduce jobs yourself in java), it has many limitations.

The biggest limitation I’ve found so far is just lack of documentation. There are a few tutorials, and a language reference, but not really all that much more. Even internally at Yahoo there is only a few more bits of documentation than found at the official apache site. I know that Pig is very new and hasn’t even hit a 1.0 release yet, however the documentation really has to improve in order for people to start using it more.

Even when looking at the language reference itself I find there are limitations. For instance, you cannot use a FOREACH inside of a FOREACH. There’s no nested looping like that. This makes working with complex data structures quite difficult. The system seems to be based off of doing simple sorting, filtering, and grouping. However if you want to group multiple times to create more advanced data structures: bags within bags within bags, and then sort the inner most bag, there isn’t a construct for that that I have found.

Also I’ve run into error messages that do not really explain what went wrong. Since there is little documentation and even users so far, it’s not easy to resolve some of these errors. A quick search will usually result in a snippet of code where the error was defined or a list of error codes on a web page with no real descriptions to speak of. Definitely no possible solutions.

Despite those flaws, it still seems beneficial to write in Pig rather than writing manual MapReduce jobs.

It’s also possible that some of these flaws can be fixed by writing extensions in Java to call from my Pig scripts. I haven’t gathered up all of the requirements to build and compile custom functions into a jar yet, and perhaps that will be the next step. It’s just unfortunate that some of these things which I feel should be fairly basic are not included in the base package.

I’m still researching however and its possible I just haven’t discovered the way to do what I need. If that changes I’ll be sure to update.

Dynamic Offline Reports

November 23, 2009

Many applications have the primary concern of storing and retrieving data. The raw data by itself is often not very useful, so an additional process is put into place to turn that data into useful information. Many applications generate these reports from the data quickly at the user’s request through either a narrow SQL select statement, in application data processing, or both. However, in larger applications where the data is too large to handle in memory at a time, processing is too heavy, and report customization is too varied, information generation needs to be pushed to its own scalable system.

A lot of data has various pieces of metadata associated with it naturally. Data such as what user added the record, the date it was added or modified (or both), maybe the size of the record, categories, tags, keywords, or other pieces of associate data that is used to break it up into more manageable chunks. This meta data is useful, but only if we can use it as an aggregate to generate specific information relating to the grouping of items based on this data.

Sometimes generating these more specific reports is as easy as adding additional WHERE or GROUP BY clauses in SQL. However, when more advanced business rules are taking place where there isn’t an easy or succinct way of extracting this information via a query, or if the query returns such a large amount of data as to cause memory issues, a different approach can be taken.

For instance; in an application I am currently working on we need to generate reports based on a table with about 5 million rows. The SQL queries we use can limit the amount of rows returned to perhaps a few hundred thousand for some of our larger reports. However, a lot of the data needs to be processed in application code rather than by the database itself due to some special business rules. Because of this, we end up creating a large number of objects in Java to hold these result rows. If multiple users are generating different reports we might end up holding too many of these objects in memory at a time and receive an OOM error. Also, the processing on this data can be intense enough that if the server is slammed with report requests that the entire system slows down, causing difficulties for people wanting to insert or modify data. This is the case I am in while I contemplate offline report generation.

The basic idea is that the main application should be concerned purely with manipulating the data of the system. That is basic CRUD stuff such as creating new records, updating them, the rare deletions, and showing single records to the user (so they can edit or delete it). We want that part of the application to remain fast, and not be effected by the purely read-only needs imposed by report generation. In order to nullify the impact, we move reporting to its own system that reads from a separate read-only replication of our production database.

When a report request comes in to our application, we send a request to the separate reporting system. This can be done either as a web service or maybe an RPC call. The reporting system uses its read-only copy of the data to generate the report numbers and send it back, causing no speed delay for insertion or regular operation of the main application.

This doesn’t solve our OOM issues however, as many drivers for our database (MySQL) return ResultSet objects with the entire contents of the results which might be too large to fit into memory. However, since we’re using a read-only list anyway we can convert the table or tables we use to process our results into flat files that can be read in on a line by line basis, perform some intermediate result processing, deallocate those lines and work on additional lines. Since our reports are mostly generating statistical data over a large data set, we can process results on that data set in parallel using multiple threads or possibly multiple computers using a Hadoop cluster.

By making report generation asynchronous to our applications general work flow we will free up the processing power and the connection pool that’s used to handle requests by asking users to either poll for when the result is finished or to notify the system when a report is finished and thereby avoid the instances where we use all of our available connections or resources processing reports. There is also the added possibility of continuously generating all possible report data on a separate machine or cluster to decrease access time by increasing storage requirements.

I’m currently researching the use of MapReduce in Hadoop for processing this flat file of all data for reports. In addition, I’m researching a few languages that are reported to be good at concurrent processing so that I can benefit from multiple cores when generating reports from our raw data. My current focus is on Scala, Erlang, and Clojure, but I do not have much to report on those areas yet. If anyone has any information those languages as far as report generation based on a largish data set (currently 5 million, but the rate of growth is fairly alarming), let me know.

Elegant Designs

November 18, 2009

One of the most difficult things I find I have to do is coming up with an elegant solution for a difficult problem. Many problems can be solved in a rough and tumble sort of way by tackling it with brute force. However, when you really want an application that will scale well and perform at its peak you seek out those elegant solutions that look so obvious to someone who reads it, but are so difficult to come up with initially.

My basic theory when faced with coming up with a solution, or looking at improvements to a previous solution, is that if it looks more complicated than you think it should be, it probably is. But how do you reduce the complexity that seems so inherent in the problem at hand? There really isn’t a single answer to that question, else I would be selling it to the highest bidder and retiring by the time I was 30. However, there often exists a few good ways to shift your mind in a way that makes an elegant solution easier to emerge.

First, sometimes it is good to just step back for a moment. When you’re looking at the details of a problem intensely, you miss out on some of the larger picture things that can help you find a better solution. Recently I had a problem that I was pounding my head against for days. The problem was partially due to old requirements restricting my progress. I found that if you have the ability to look at some business requirements and discover that perhaps they aren’t really required, you can change the entire scope of your problem so that an elegant solution can emerge. Sometimes it is not possible to change a business requirement. Even when you can, changing the requirements can cause new difficulties to arise, such as migrating data. However, a confusing or complicated business rule should be changed if it makes it both difficult for you to design around and difficult for the users to understand.

Another method that I used on the same problem is exactly the opposite. Instead of looking back at a high level, dive down deeper. When looking at a problem and trying to come up with a solution you might have several ideas floating around in your head. You jot them down and think a little bit about how they might solve your problem, but you’re still unsure. If you have these possible solutions but are not sure if they would work, pick one and dive further into it. Try to draw it out more fully. Maybe create a prototype or fully design a database table or series of tables. Perhaps you can get a better idea of how something will interact by drawing up the interfaces between pieces of a program or writing out an algorithm is pseudocode. Instead of just viewing a problem from a high level, coming up with a solution, and immediately coding it, intermediate steps such as these can help you find flaws you wouldn’t encounter until implementation time when they’re more expensive to fix.

The best feeling in the world is taking an old design that took thousands of lines of code, or dozens of lines of generated and complicated SQL and reducing it to a few hundred lines of code and a line or two of SQL. Sometimes those types of changes can’t be made unless you change your schema or base objects, but if you’re able to change them to reduce complexity from something monolithic and unreadable to something small and concise, then you’ve discovered elegance.

Your work will be smoother, and your confidence higher if you find that elegant design.

Pomodoro: Day 1 (and 2)

November 17, 2009

After having read a book on Pomodoro along with viewing some of the information on the technique’s website I have started to put the idea to practice. What is actually humorous is that while using a new (to me) technique for time management, I wasn’t able to fit in time to write this entry yesterday. I’m a day behind, but I put writing this up higher on my list and set aside two pomodoros for it.

I won’t really describe the concept too fully here. You can find a more thorough description at the official website or by reading the Pragmagic Programmer book that is released this week. Instead, I’ll describe my experiences in the very small amount of time that I’ve been practicing it, namely yesterday and today, along with my analysis of the value I think it provides.

So yesterday I created my first “ToDo Today” list with 7 items on it totaling an estimated 11 pomodoros. A pomodoro is basically a unit of time measuring 25 minutes plus a 5 minute break for a grand total of 30 minutes. After I started the day I ended up creating 6 more items that I added to my ToDo list for a total of 13. I completed 11 of these items, however several of them were so small and minor that they were done outside of any pomodoro (making dinner reservations for Friday, telling my fiancé about some carpooling, answering a question… etc). I completed a total of 7 pomodoros in the day, which totals to about 3 hours worth of focused work time plus a half hour or so on small, spread out breaks.

One interesting thing I noted, and was expecting from having read up on the technique, is how little of my time was spent being productive. 3 hours of solid productivity in an 8 hour work day. I discovered that there are a few reasons for this. One is that the beginning of my day has a 15 minute meeting at 10:30 that prevents me from doing many, if any, pomodoros before it. I normally arrive to work between 9:45 and 10:00. Since the first thing I need to do is add to my activity inventory and compile a ToDo list for the day, I generally don’t have a solid 25 minutes to focus on a pomodoro. by the time the meeting is over and I’m settled again it’s around 11, giving me an hour to work before lunch.

The afternoon hours are when the real work gets done and where I have the most solid stretch of uninterrupted time. Unfortunately one of the problems that comes up when trying to complete pomodoros is interruptions. I actually tried to write this blog yesterday, however I was interrupted so often by IMs that I had to void the pomodoro to address them. In retrospect I know that I could have possibly ignored some of them until the pomodoro was done, but it became difficult to do so when my fiancé was asking me about plans for the week and a friend of mine was pinging me constantly. It is also very distracting to have my email client constantly making noise when new mail arrives; which is around every 10 seconds due to the mailing lists I’m required to be on at work. I’ve since taken care of both distractions by creating a pomodoro away message for my IM client along with disabling all of the notifications from my email and IM clients.

Even though I only completed 7 pomodoros, I still felt more productive yesterday than I have in weeks. I know that I was able to really focus on something to the exclusion of everything else at least 7 times. I was able to hack out a database design that I had been banging my head against the wall for a week now, mainly by forcing myself to focus on it and avoid procrastination by not making it an option. It’s a difficult thing to experience if you’re used to doing dozens of things at once, working alongside your train of thought.

I also feel that I’m being more complete in my work. I often will go to a meeting or someone will tell me something over IM that I’m supposed to act on or do at some later time but then I forget. By forcing my work schedule and all of my work activities to be pulled from an activity inventory and today list I end up not having to worry about forgetting to do something because I write everything I need to get done down on a list. If I don’t get to it today, at least it will still be waiting in my inventory for tomorrow.

I’ve never been good at time management. I’ve often scoffed at several of the self-help books and I’ll probably continue to do so in all honesty. For me, forcing myself to timebox is something that allows me to gather up my scattered thoughts and see my own accomplishments as items are removed from a list and little Xs are marked in boxes. If I can keep this up for a few months, I think I’ll be in a better place than I am today.

(Oh, I finished in one pomodoro instead of two. I guess I can mark this task as being overestimated).


November 14, 2009

I’ve been having trouble focusing a lot recently. When faced with a difficult problem that I can not find an obvious solution to, I tend to procrastinate or shy away from the problem. Other things become important suddenly; like checking email, talking on IM, reading slashdot. All these things do not lead to a productive day.

If I’m not productive, my energy level in general goes down. I notice that when I finish a day of work where I felt like I accomplished something, my evenings are so much better. There is less stress, and I feel proud of my work and what I’ve done. Even if I didn’t finish what I set out to do, as long as I’ve made some notable progress I get this effect. But I’ve been failing to do that with a particular problem I’ve been working on recently.

I’m a regular reader of Pragmatic Programmers. I believe that the original The Pragmatic Programmer is probably the best book for programmers around. There’s a new one that is still in “beta” called Pomodoro Technique Illustrated: The Easy Way To Do More In Less Time. It’s a fairly basic technique about time boxing and learning to focus on focusing. The basic technique is to pick a task you want to complete, set a kitchen timer for 25 minutes, then do that task and only that task until the timer dings. This means avoiding any distraction, from checking email to going to the bathroom or answering the phone. Now since things come up that you can’t ignore (the phone sometimes, or a colleague), there’s a system for dealing with interruptions. If you complete the full 25 minutes without being interrupted, both internal and external interruptions, then you mark an X next to your task as having completed one pomodoro. The sense of accomplishment you get at the end of the day is how many X’s you’ve built up.

There’s a free PragPub magazine on the Pragmatic Programmer’s website that has an article in the November issue that talks about the basic technique. You can try it out with just the knowledge in there. I ended up buying the book as a way to really force myself to focus, but I’m sure the basic idea is enough to get many people started.

Scalable Data Models

November 11, 2009

Most applications involve storing and retrieving some form of data. It can be posts on a blog, financial information, status updates of friends, or any other large number of diverse topics. Some data is fairly simple and straight forward. A blog for instance is not a very complicated structure to model. There’s a title, the post body, an author, a creation date, maybe even a few fancy things like tags or categories. Overall though it isn’t that complicated, and the interaction between that piece of data and other data isn’t overly complex.

However, some data is complicated, and the interaction between that data and other pieces of data can get quite complicated. That complication needs to be pondered quite heavily when creating your data model. You might think you know all of the data fields of each piece of functionality and write up a quick data model based on the assumptions of what you’re storing. However, in large projects where the usage is high, choosing the correct data model can often be more about reading the data than writing.

Because let’s face it, most data is read, not written. Every time you write some data into your data store, that same piece of data is probably read multiple times by different users or used multiple times in different reports. Writing data is often infrequent compared to how often it is read. In highly scalable designs, you should account for the unbalanced nature of your reads to your writes. This might involve spending a little more time to make your writes by computing additional metadata in advance so that each read will already have that metadata rather than computing it at read time.

One structure you can use for increased read times and reduced computation times is a summary table. A summary table is basically a pre-computed set of data based on your real data table that can be pulled from during read requests rather than deriving the results from your real data on each request. For example, perhaps you have a one to many relationship between Foo and Bar. A single Foo can contain zero to many Bars. Each Bar has a Baz property that can be set to either 1, 2 or 3. Now imagine you want to list the number of Foo items that have Bars with a Baz of 2, and you want to display those results like so:

Foo  | #Bars with Baz of 2
Foo1 | 2
Foo2 | 4
Foo3 | 1

Your original data might look like:

Foo1 { Bar1 { Baz : 1 }, Bar2 { Baz : 2 }, Bar3 { Baz : 2 } },
Foo2 { Bar1 { Baz : 2 }, Bar2 { Baz : 2 }, Bar3 { Baz : 2 }, Bar4 {Baz : 2 } },
Foo3 { Bar1 { Baz : 3 }, Bar2 { Baz : 2 }, Bar3 { Baz : 1} },
Foo4 { Bar1 { Baz : 1 }, Bar2 { Baz : 1 }, Bar3 { Baz : 3 } }

So to create that above table you might need to do something like the following psudocode:

Map<Foo, BazNum> dataMap;
for(Foo f in getAllFoos()) {
   List<Bar> bars = f.getAllBars();
   for(Bar b in bars) {
      if (b.getBaz() == 2) {
         dataMap.set(f, dataMap.get(f)++);

for (Foo f in dataMap.keySet()) {
   print ( + " | " + dataMap.get(f));

To generate that table your program has to go through 3 for loops. Ok you can probably remove the last for loop if you put the print statement in the inner loop above it, however most of the time you’re going to be separating generating the data from displaying the data into different functions so the second iteration will be required. This is an O(n²) operation due to the nested loop.

Imagine if you have 200 Foo objects, each of which has on average 10 million Bar objects. Just storing the objects as tuples in your data store and compiling the report on each request is going to bog down your processing machine very quickly, especially if you receive loads ranging in the hundreds per second. However, if you pre-determine this information during your write, you can easily retrieve the result at any time with a very small processor cost.

insert parentFoo, name, baz into Bar values ("foo1", "bar9000", 2);
update FooBaz set bar2s = bar2s + 1 where foo = "foo1";

Now instead of selecting a count(), joining on Bar, and grouping by Foo, you can merely select from FooBaz with a possible where clause.

This is a rather contrived example that you might not see as great of a speed increase as in the real world, however if you imagine a much more complex data model with many nested objects that would require multiple joins to generate the proper data; or even worse, multiple joins followed by application level processing followed by additional selects followed by more processing; you will see where pre-calculating results upon writes can be a large time saver.

Sometimes it isn’t possible to update a result simply upon a write however. One is forced to do a full set of computations in order to get a result based on all of the data available after the write has happened. This does not mean that you can’t use summary tables with pre-computed results however. You might need to run a separate task to update the summary table on a scheduled basis with a cron job, allowing reports requested to be slightly lagging real-time data. If this is a trade-off that you can afford, scheduled report generation can make impossible request-time level reports turn into possible time-delayed reports.

Defining Requirements

November 9, 2009

When working on a new project one of the first things you should do is gather a list of requirements of what the goals of the project actually are. It’s hard to create something if you don’t have any solid idea of what it is that you’re creating. This list of requirements doesn’t have to be hard fast – there are always changes and new requirements that pop up. But you should have a basic idea of what you’re trying to do before you try to do it at least.

The process of defining your requirements however does not stop at just the initial analysis or design phases. Requirement gathering should be a continuous process throughout development and maintenance. The agile approach that I champion requires that all parts of development be flexible, including requirement gathering. This does not mean that adding new requirements should be a process you throw in at the whim of a user however. Balancing what goes into a project and what stays out, along with prioritizing between requirements is the key to avoiding scope creep and actually releasing a successful project.

There are dozens of tools you can use to keep your project data together. For the initial requirement phase I tend to like just paper. Graph paper to be specific as I find that it is useful for keeping lines neat and drawing diagrams on. However some people use a visualization tool like viso or OmniGraffle, or maybe a spreadsheet, word processing document, or wiki. Maybe  just a plain text file or a bunch of sticky notes. Whatever you use, just make sure you can keep it together to reference later or find a way digitize it so that you have it in one easy and accessible place.

The way I like to start with is make a quick list of all of the tasks that your project should do (or does) and rank them by importance. When starting a new project you should focus on those features that are most important first. This gives you a nice launching point for the users to see what the project does, along with the ability to use it in a useful fashion early on. If you’re doing maintenance, this list gives you the ability to focus your maintenance efforts on those areas that are most important to your customers first.

Try to make sure that there is one main thing that your program does. A goal that it helps the user achieve. Generally I like a project that can be summed up in a single sentence, even if the overall use of the application can be expanded. Something like “A program for storing test results and aggregating those results into useful information.”  That definition describes generically what our program does, but you can expand it to include other goals that support that main goal such as “Store a description and process of the test itself so that you have context for the result you’re storing.” and “Organize tests into interrelated groups that relate to a common business process.”

One task that people either forget or perform incorrectly is to solicit feedback from the customer/user. Forgetting to ask the user for what they want means that you’re making assumptions that could very well turn out to be incorrect. Working for weeks or months to churn out a product that the customer doesn’t even want isn’t a very successful project. Knowing what the customer wants will make it easier to create a product that they are willing to use and that will improve their ability to achieve their goals.

However, one important thing to note is that user’s don’t know what they want. I know that seems to contradict what I said earlier about asking the users what they want. Why ask them if they don’t even know? The user’s think they know, but often they can only tell you what they can imagine. It’s up to you to not take what the user says at face value and instead think about the underlying problem they are describing, ignoring any implementation details they might have provided, and determine a way for them to achieve the real goals that they desire. Asking them how a menu should look or what buttons they want where isn’t very helpful. The user isn’t going to want to spend the time it takes or the research necessary to tell you exactly what would really be best for them. That will take research on your part and probably some demos or prototypes that you can show to the user and gather feedback.

Once you have a list of your prioritized goals you should sketch up a rough plan on how you can achieve them. If your in maintenance mode, maybe some of your goals are too difficult to fit into the current architecture. You need to come up with a plan for modifying the structure of the project so that your goals are achievable. You should try to remain flexible during this process as most of those goals are still loose and subject to change as business requirements or new data comes to light.

Its often difficult to keep in mind your base goal for a product when all of the nagging implementation details start to bog you down. If you ever find yourself in a situation where you’re working on fixing some problem and wondering how even got to this problem, or you think it’s too difficult and there must be an easier way, stop. There probably is, and you might have lost focus on your scope.