Archive for November 2009

Pig Frustrations

November 30, 2009

My desire to implement better scalability by pre-processing reports via the Grid has led me to Pig. Unfortunately, while Pig does remove some of the difficulties of writing for Hadoop (you no longer have to write all of the MapReduce jobs yourself in Java), it has many limitations.

The biggest limitation I’ve found so far is simply a lack of documentation. There are a few tutorials and a language reference, but not really all that much more. Even internally at Yahoo there are only a few more bits of documentation than what’s found at the official Apache site. I know that Pig is very new and hasn’t even hit a 1.0 release yet, but the documentation really has to improve for people to start using it more.

Even the language itself has limitations. For instance, you cannot use a FOREACH inside of a FOREACH; there’s no nested looping like that. This makes working with complex data structures quite difficult. The system seems to be based on doing simple sorting, filtering, and grouping. But if you want to group multiple times to create more advanced data structures, bags within bags within bags, and then sort the innermost bag, I haven’t found a construct for that.

I’ve also run into error messages that do not really explain what went wrong. Since there is little documentation and there are few users so far, it’s not easy to resolve some of these errors. A quick search usually turns up either the snippet of code where the error was defined or a web page listing error codes with no real descriptions to speak of, and definitely no possible solutions.

Despite those flaws, it still seems beneficial to write in Pig rather than writing manual MapReduce jobs.

It’s also possible that some of these flaws can be fixed by writing extensions in Java to call from my Pig scripts. I haven’t gathered up all of the requirements to build and compile custom functions into a jar yet, and perhaps that will be the next step. It’s just unfortunate that some of these things which I feel should be fairly basic are not included in the base package.
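For what it’s worth, the Java side of a custom function looks fairly small. Here is a minimal sketch based on Pig’s EvalFunc interface; the class is a made-up example and I haven’t actually built or run it yet:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that upper-cases a single chararray field.
public class Upper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Return null for empty or null input so Pig can skip the record.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once compiled into a jar, it should be callable from a Pig script after a REGISTER statement, like any built-in function.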

I’m still researching, however, and it’s possible I just haven’t discovered the way to do what I need. If that changes I’ll be sure to update.

Dynamic Offline Reports

November 23, 2009

Many applications have the primary concern of storing and retrieving data. The raw data by itself is often not very useful, so an additional process is put into place to turn that data into useful information. Many applications generate these reports from the data quickly at the user’s request, through either a narrow SQL select statement, in-application data processing, or both. However, in larger applications where the data is too large to hold in memory at once, the processing is too heavy, and report customization is too varied, information generation needs to be pushed to its own scalable system.

A lot of data naturally has various pieces of metadata associated with it: which user added the record, the date it was added or modified (or both), maybe the size of the record, categories, tags, keywords, or other pieces of associated data that can be used to break it up into more manageable chunks. This metadata is useful, but only if we can use it in aggregate to generate specific information about groups of items.

Sometimes generating these more specific reports is as easy as adding additional WHERE or GROUP BY clauses in SQL. However, when more advanced business rules are involved and there isn’t an easy or succinct way of extracting the information via a query, or the query returns so much data that it causes memory issues, a different approach can be taken.

For instance, in an application I am currently working on we need to generate reports based on a table with about 5 million rows. The SQL queries we use can limit the number of rows returned to perhaps a few hundred thousand for some of our larger reports. However, a lot of the data needs to be processed in application code rather than by the database itself due to some special business rules. Because of this, we end up creating a large number of objects in Java to hold these result rows. If multiple users are generating different reports, we might end up holding too many of these objects in memory at once and receive an OOM error. Also, the processing on this data can be intense enough that if the server is slammed with report requests, the entire system slows down, causing difficulties for people wanting to insert or modify data. This is the situation I am in while I contemplate offline report generation.

The basic idea is that the main application should be concerned purely with manipulating the data of the system: basic CRUD stuff such as creating new records, updating them, the rare deletions, and showing single records to the user (so they can edit or delete them). We want that part of the application to remain fast and not be affected by the purely read-only load imposed by report generation. To nullify the impact, we move reporting to its own system that reads from a separate read-only replica of our production database.

When a report request comes in to our application, we send a request to the separate reporting system, either as a web service call or maybe an RPC call. The reporting system uses its read-only copy of the data to generate the report numbers and sends them back, causing no slowdown for insertion or regular operation of the main application.
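As a rough sketch of the boundary I have in mind (the names here are hypothetical, not our actual API), the main application would only depend on a small interface like this, whichever transport ends up behind it:

import java.util.List;
import java.util.Map;

// Hypothetical boundary between the main application and the reporting system.
public interface ReportService {
    // Kick off report generation against the read-only replica; returns an id to poll with.
    String requestReport(String reportType, Map<String, String> parameters);

    // Returns null until the offline system has finished building the report rows.
    List<Map<String, String>> fetchResultRows(String reportId);
}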

Offloading the work doesn’t solve our OOM issues by itself, however, as many drivers for our database (MySQL) return ResultSet objects containing the entire result set, which might be too large to fit into memory. But since we’re working from a read-only copy anyway, we can convert the table or tables we use into flat files that can be read line by line: we perform some intermediate processing on a batch of lines, discard them, and move on to the next lines. Since our reports mostly generate statistical data over a large data set, we can also process that data set in parallel using multiple threads, or possibly multiple computers using a Hadoop cluster.
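To make the line-by-line idea concrete, here is a minimal sketch assuming a tab-separated dump of the table; the file name and the column being aggregated are made up for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FlatFileReport {
    public static void main(String[] args) throws IOException {
        // Running totals keyed by whatever field we are grouping on.
        Map<String, Integer> counts = new HashMap<String, Integer>();

        BufferedReader reader = new BufferedReader(new FileReader("results_dump.tsv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                String groupKey = fields[2]; // hypothetical column position
                Integer current = counts.get(groupKey);
                counts.put(groupKey, current == null ? 1 : current + 1);
                // Each line is folded into the totals and then discarded, so memory
                // use stays proportional to the number of groups, not the number of rows.
            }
        } finally {
            reader.close();
        }

        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " | " + entry.getValue());
        }
    }
}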

By making report generation asynchronous to our application’s general workflow, we free up the processing power and the connection pool used to handle requests. Users either poll for when the result is finished or are notified when a report is done, so we avoid the cases where we use up all of our available connections or resources processing reports. There is also the added possibility of continuously generating all possible report data on a separate machine or cluster, decreasing access time at the cost of more storage.

I’m currently researching the use of MapReduce in Hadoop for processing this flat file of report data. In addition, I’m researching a few languages that are reported to be good at concurrent processing so that I can benefit from multiple cores when generating reports from our raw data. My current focus is on Scala, Erlang, and Clojure, but I do not have much to report on those yet. If anyone has any information on those languages as far as report generation over a largish data set goes (currently 5 million rows, but the rate of growth is fairly alarming), let me know.

Elegant Designs

November 18, 2009

One of the most difficult things I have to do is come up with an elegant solution to a difficult problem. Many problems can be solved in a rough-and-tumble sort of way by tackling them with brute force. However, when you really want an application that will scale well and perform at its peak, you seek out those elegant solutions that look so obvious to anyone who reads them, but are so difficult to come up with initially.

My basic theory when coming up with a solution, or looking at improvements to a previous one, is that if it looks more complicated than you think it should be, it probably is. But how do you reduce the complexity that seems so inherent in the problem at hand? There really isn’t a single answer to that question, else I would be selling it to the highest bidder and retiring by the time I was 30. However, there are a few good ways to shift your mind so that an elegant solution can emerge more easily.

First, sometimes it is good to just step back for a moment. When you’re looking at the details of a problem intensely, you miss out on the larger picture that can help you find a better solution. Recently I had a problem that I was pounding my head against for days, partly because old requirements were restricting my progress. I found that if you can look at some business requirements and discover that perhaps they aren’t really required, you can change the entire scope of your problem so that an elegant solution can emerge. Sometimes it is not possible to change a business requirement, and even when it is, changing requirements can cause new difficulties to arise, such as migrating data. However, a confusing or complicated business rule should be changed if it is both difficult for you to design around and difficult for the users to understand.

Another method that I used on the same problem is exactly the opposite: instead of looking at it from a high level, dive down deeper. When trying to come up with a solution you might have several ideas floating around in your head. You jot them down and think a little about how they might solve your problem, but you’re still unsure. If you have these possible solutions but are not sure whether they would work, pick one and dive further into it. Try to draw it out more fully. Maybe create a prototype, or fully design a database table or series of tables. Perhaps you can get a better idea of how something will interact by drawing up the interfaces between pieces of a program or writing out an algorithm in pseudocode. Instead of viewing a problem from a high level, coming up with a solution, and immediately coding it, intermediate steps like these can help you find flaws you wouldn’t otherwise encounter until implementation time, when they’re more expensive to fix.

The best feeling in the world is taking an old design that took thousands of lines of code, or dozens of lines of generated and complicated SQL and reducing it to a few hundred lines of code and a line or two of SQL. Sometimes those types of changes can’t be made unless you change your schema or base objects, but if you’re able to change them to reduce complexity from something monolithic and unreadable to something small and concise, then you’ve discovered elegance.

Your work will be smoother, and your confidence higher if you find that elegant design.

Pomodoro: Day 1 (and 2)

November 17, 2009

After reading a book on the Pomodoro Technique and looking over some of the information on the technique’s website, I have started putting the idea into practice. What’s actually humorous is that while using a (to me) new technique for time management, I wasn’t able to fit in time to write this entry yesterday. I’m a day behind, but I put writing this higher on my list and set aside two pomodoros for it.

I won’t really describe the concept too fully here. You can find a more thorough description at the official website or by reading the Pragmatic Programmers book that was released this week. Instead, I’ll describe my experiences in the very small amount of time that I’ve been practicing it, namely yesterday and today, along with my analysis of the value I think it provides.

So yesterday I created my first “ToDo Today” list with 7 items on it, totaling an estimated 11 pomodoros. A pomodoro is basically a unit of time measuring 25 minutes plus a 5-minute break, for a grand total of 30 minutes. After I started the day I ended up adding 6 more items to my ToDo list, for a total of 13. I completed 11 of these items, though several were so small and minor that they were done outside of any pomodoro (making dinner reservations for Friday, telling my fiancé about some carpooling, answering a question, etc.). I completed a total of 7 pomodoros in the day, which comes to about 3 hours of focused work time plus a half hour or so of small, spread-out breaks.

One interesting thing I noted, and was expecting from having read up on the technique, is how little of my time was spent being productive: 3 hours of solid productivity in an 8-hour work day. I discovered that there are a few reasons for this. One is that the beginning of my day has a 15-minute meeting at 10:30 that prevents me from doing many, if any, pomodoros before it. I normally arrive at work between 9:45 and 10:00. Since the first thing I need to do is add to my activity inventory and compile a ToDo list for the day, I generally don’t have a solid 25 minutes to focus on a pomodoro. By the time the meeting is over and I’m settled again it’s around 11, giving me an hour to work before lunch.

The afternoon hours are when the real work gets done and where I have the most solid stretch of uninterrupted time. Unfortunately, one of the problems that comes up when trying to complete pomodoros is interruptions. I actually tried to write this post yesterday, but I was interrupted so often by IMs that I had to void the pomodoro to address them. In retrospect I know I could have ignored some of them until the pomodoro was done, but that became difficult when my fiancé was asking me about plans for the week and a friend of mine was pinging me constantly. It is also very distracting to have my email client constantly making noise when new mail arrives, which is around every 10 seconds due to the mailing lists I’m required to be on at work. I’ve since taken care of both distractions by creating a pomodoro away message for my IM client and disabling all of the notifications from my email and IM clients.

Even though I only completed 7 pomodoros, I still felt more productive yesterday than I have in weeks. I know that I was able to really focus on something to the exclusion of everything else at least 7 times. I was able to hack out a database design that I had been banging my head against for a week, mainly by forcing myself to focus on it and taking procrastination off the table. It’s a difficult thing to experience if you’re used to doing dozens of things at once, working alongside your train of thought.

I also feel that I’m being more thorough in my work. I’ll often go to a meeting, or someone will tell me something over IM that I’m supposed to act on at some later time, and then I forget. By forcing my work schedule and all of my work activities to be pulled from an activity inventory and a today list, I no longer have to worry about forgetting to do something, because I write everything I need to get done down on a list. If I don’t get to it today, at least it will still be waiting in my inventory for tomorrow.

I’ve never been good at time management. I’ve often scoffed at several of the self-help books and I’ll probably continue to do so in all honesty. For me, forcing myself to timebox is something that allows me to gather up my scattered thoughts and see my own accomplishments as items are removed from a list and little Xs are marked in boxes. If I can keep this up for a few months, I think I’ll be in a better place than I am today.

(Oh, I finished in one pomodoro instead of two. I guess I can mark this task as being overestimated).

Focusing

November 14, 2009

I’ve been having trouble focusing a lot recently. When faced with a difficult problem that I cannot find an obvious solution to, I tend to procrastinate or shy away from the problem. Other things suddenly become important: checking email, talking on IM, reading Slashdot. None of these things lead to a productive day.

If I’m not productive, my energy level in general goes down. I notice that when I finish a day of work where I felt like I accomplished something, my evenings are so much better. There is less stress, and I feel proud of my work and what I’ve done. Even if I didn’t finish what I set out to do, as long as I’ve made some notable progress I get this effect. But I’ve been failing to do that with a particular problem I’ve been working on recently.

I’m a regular reader of the Pragmatic Programmers. I believe that the original The Pragmatic Programmer is probably the best book for programmers around. There’s a new one, still in “beta,” called Pomodoro Technique Illustrated: The Easy Way To Do More In Less Time. It describes a fairly simple technique about timeboxing and learning to focus on focusing. The basic technique is to pick a task you want to complete, set a kitchen timer for 25 minutes, and then do that task and only that task until the timer dings. This means avoiding any distraction, from checking email to going to the bathroom or answering the phone. Since things come up that you can’t ignore (the phone sometimes, or a colleague), there’s a system for dealing with interruptions. If you complete the full 25 minutes without giving in to any interruption, internal or external, you mark an X next to your task as having completed one pomodoro. The sense of accomplishment you get at the end of the day comes from how many X’s you’ve built up.

There’s a free PragPub magazine on the Pragmatic Programmers’ website, and the November issue has an article covering the basic technique. You can try it out with just the knowledge in there. I ended up buying the book as a way to really force myself to focus, but I’m sure the basic idea is enough to get many people started.

Scalable Data Models

November 11, 2009

Most applications involve storing and retrieving some form of data. It can be posts on a blog, financial information, status updates of friends, or any number of other diverse things. Some data is fairly simple and straightforward. A blog, for instance, is not a very complicated structure to model: there’s a title, the post body, an author, a creation date, maybe even a few fancy things like tags or categories. Overall it isn’t that complicated, and the interaction between that piece of data and other data isn’t overly complex.

However, some data is complicated, and the interaction between that data and other pieces of data can get quite involved. That complication needs to be pondered quite heavily when creating your data model. You might think you know all of the data fields of each piece of functionality and write up a quick data model based on assumptions about what you’re storing. However, in large projects where usage is high, choosing the correct data model is often more about reading the data than writing it.

Because let’s face it, most data is read, not written. Every time you write some data into your data store, that same piece of data is probably read multiple times by different users or used multiple times in different reports. Writing data is infrequent compared to how often it is read. In highly scalable designs, you should account for this unbalanced ratio of reads to writes. That might mean spending a little more time on your writes, computing additional metadata in advance, so that each read already has that metadata available rather than computing it at read time.

One structure you can use to speed up reads and reduce computation is a summary table: a pre-computed set of data based on your real data table that can be pulled from during read requests rather than deriving the results from your real data on each request. For example, perhaps you have a one-to-many relationship between Foo and Bar. A single Foo can contain zero to many Bars. Each Bar has a Baz property that can be set to either 1, 2, or 3. Now imagine that for each Foo you want to list the number of Bars with a Baz of 2, and you want to display those results like so:

Foo  | #Bars with Baz of 2
==========================
Foo1 | 2
Foo2 | 4
Foo3 | 1

Your original data might look like:

Foo1 { Bar1 { Baz : 1 }, Bar2 { Baz : 2 }, Bar3 { Baz : 2 } },
Foo2 { Bar1 { Baz : 2 }, Bar2 { Baz : 2 }, Bar3 { Baz : 2 }, Bar4 { Baz : 2 } },
Foo3 { Bar1 { Baz : 3 }, Bar2 { Baz : 2 }, Bar3 { Baz : 1 } },
Foo4 { Bar1 { Baz : 1 }, Bar2 { Baz : 1 }, Bar3 { Baz : 3 } }

So to create the table above you might do something like the following Java pseudocode:

Map<Foo, Integer> dataMap = new HashMap<Foo, Integer>();
for (Foo f : getAllFoos()) {
   for (Bar b : f.getAllBars()) {
      if (b.getBaz() == 2) {
         // count one more Bar with Baz == 2 for this Foo
         Integer count = dataMap.get(f);
         dataMap.put(f, count == null ? 1 : count + 1);
      }
   }
}

for (Foo f : dataMap.keySet()) {
   System.out.println(f.name() + " | " + dataMap.get(f));
}

To generate that table your program has to go through 3 for loops. OK, you can probably remove the last for loop if you put the print statement in the inner loop above it, but most of the time you’re going to separate generating the data from displaying it into different functions, so the second iteration will be required. With n Foos and an average of m Bars each, the nested loop makes this an O(n × m) operation.

Imagine you have 200 Foo objects, each of which has on average 10 million Bar objects. Just storing the objects as tuples in your data store and compiling the report on each request is going to bog down your processing machine very quickly, especially if you receive loads in the hundreds of requests per second. However, if you pre-compute this information during your writes, you can retrieve the result at any time at a very small processor cost.

INSERT INTO Bar (parentFoo, name, baz) VALUES ('foo1', 'bar9000', 2);
UPDATE FooBaz SET bar2s = bar2s + 1 WHERE foo = 'foo1';

Now instead of selecting a count(), joining on Bar, and grouping by Foo, you can merely select from FooBaz with a possible where clause.

This is a rather contrived example, and you might not see as great a speed increase here as you would in the real world. But imagine a much more complex data model with many nested objects that requires multiple joins to generate the proper data, or even worse, multiple joins followed by application-level processing followed by additional selects followed by more processing, and you will see where pre-calculating results upon writes can be a large time saver.

Sometimes it isn’t possible to update a result simply upon a write, however. You may be forced to do a full set of computations over all of the data available after the write has happened. This does not mean you can’t use summary tables with pre-computed results; you might just need a separate task that updates the summary table on a schedule via a cron job, accepting that reports will lag slightly behind real-time data. If that is a trade-off you can afford, scheduled report generation can turn impossible request-time reports into possible time-delayed ones.
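For example, a scheduled rebuild of the FooBaz summary table from the example above might look something like this sketch: plain JDBC meant to be run from cron, with the connection details as placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class FooBazRefresh {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; assumes the MySQL driver is on the classpath.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/reports", "reportuser", "secret");
        try {
            conn.setAutoCommit(false);
            Statement stmt = conn.createStatement();
            // Throw away the old summary and rebuild it from the raw Bar rows.
            stmt.executeUpdate("DELETE FROM FooBaz");
            stmt.executeUpdate(
                "INSERT INTO FooBaz (foo, bar2s) " +
                "SELECT parentFoo, COUNT(*) FROM Bar WHERE baz = 2 GROUP BY parentFoo");
            conn.commit();
            stmt.close();
        } finally {
            conn.close();
        }
    }
}

Reads against FooBaz between runs simply see the data as of the last rebuild.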

Defining Requirements

November 9, 2009

When working on a new project, one of the first things you should do is gather a list of requirements describing what the goals of the project actually are. It’s hard to create something if you don’t have any solid idea of what it is you’re creating. This list of requirements doesn’t have to be hard and fast – there are always changes and new requirements that pop up. But you should at least have a basic idea of what you’re trying to do before you try to do it.

The process of defining your requirements does not stop at the initial analysis or design phases, however. Requirement gathering should be a continuous process throughout development and maintenance. The agile approach that I champion requires that all parts of development be flexible, including requirement gathering. This does not mean that adding new requirements should be something you throw in at the whim of a user, though. Balancing what goes into a project and what stays out, along with prioritizing between requirements, is the key to avoiding scope creep and actually releasing a successful project.

There are dozens of tools you can use to keep your project data together. For the initial requirements phase I tend to like just paper: graph paper, to be specific, as I find it useful for keeping lines neat and drawing diagrams on. However, some people use a visualization tool like Visio or OmniGraffle, or maybe a spreadsheet, word processing document, or wiki. Maybe just a plain text file or a bunch of sticky notes. Whatever you use, just make sure you can keep it together to reference later, or find a way to digitize it so that you have it in one easy and accessible place.

The way I like to start is to make a quick list of all of the tasks your project should do (or does) and rank them by importance. When starting a new project, you should focus on the most important features first. This gives users a nice launching point to see what the project does, along with the ability to use it in a useful fashion early on. If you’re doing maintenance, this list lets you focus your efforts on the areas that are most important to your customers first.

Try to make sure there is one main thing that your program does, a goal that it helps the user achieve. Generally I like a project that can be summed up in a single sentence, even if the overall use of the application can be expanded. Something like “A program for storing test results and aggregating those results into useful information.” That definition describes generically what our program does, but you can expand it to include other goals that support the main one, such as “Store a description and process of the test itself so that you have context for the result you’re storing” and “Organize tests into interrelated groups that relate to a common business process.”

One task that people either forget or perform incorrectly is soliciting feedback from the customer or user. Forgetting to ask users what they want means you’re making assumptions that could very well turn out to be incorrect. Working for weeks or months to churn out a product that the customer doesn’t even want isn’t a very successful project. Knowing what the customer wants will make it easier to create a product that they are willing to use and that improves their ability to achieve their goals.

However, one important thing to note is that users don’t know what they want. I know that seems to contradict what I said earlier about asking users what they want. Why ask them if they don’t even know? Users think they know, but often they can only tell you what they can imagine. It’s up to you not to take what the user says at face value; instead, think about the underlying problem they are describing, ignore any implementation details they might have provided, and determine a way for them to achieve the real goals they desire. Asking them how a menu should look or what buttons they want where isn’t very helpful. The user isn’t going to spend the time or do the research necessary to tell you exactly what would really be best for them. That will take research on your part, and probably some demos or prototypes that you can show to the user to gather feedback.

Once you have a list of your prioritized goals, you should sketch out a rough plan for how to achieve them. If you’re in maintenance mode, maybe some of your goals are too difficult to fit into the current architecture, and you need to come up with a plan for modifying the structure of the project so that your goals are achievable. You should try to remain flexible during this process, as most of those goals are still loose and subject to change as business requirements shift or new data comes to light.

It’s often difficult to keep your base goal for a product in mind when all of the nagging implementation details start to bog you down. If you ever find yourself working on fixing some problem and wondering how you even got to this problem, or thinking it’s too difficult and there must be an easier way, stop. There probably is, and you might have lost focus on your scope.

Be Lazy, Automate Testing

November 6, 2009

I work on a product that manages test cases, test plans, and results of software testing. QA engineers create test cases in our system and submit results, usually through some sort of automation, and we keep track of those test results and generate useful reports.

When I first started this job, the product already existed. We were in more of a maintenance mode, adding features and fixing bugs. Ironically, while we provide a product to help out QA, our team has no QA team of its own. There weren’t even any unit tests on the main product, and all testing was done sporadically and manually.

This has been improving with a recent push to have all products under continuous integration with full unit, smoke, and regression tests. However, writing up tests and executing them takes time. Currently, many of the developers are working part time as QA before we launch a new release. Unfortunately, many of them forget to use their development tools for the QA process. Namely, automation.

I can manually run through a test case in maybe 5 to 10 minutes, which isn’t too bad. But if you have 70-80 test cases to run through before a release, that can take about 2 days of work. If we release on a monthly cycle, every 30 days we lose 2 days to manual testing. And every release adds more tests which have to be performed along with the tests from previous releases.

Since we write a web app, a lot of our testing is done by clicking on things in the web UI. Front-end testing can be painful, but with the help of Selenium we can automate the process. Creating the Selenium script can take a while, but once it’s functional, later runs are fairly quick and can be executed on an automation server. So while each test takes longer the first time, because you combine the manual testing process with script creation, the end result is a slowly growing package of tests that will run automatically whenever you wish.
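As an example of what these scripts end up looking like, here’s a stripped-down sketch using the Selenium RC Java client; the host, port, URL, and locators are placeholders rather than anything from our real product:

import com.thoughtworks.selenium.DefaultSelenium;
import com.thoughtworks.selenium.Selenium;

public class LoginSmokeTest {
    public static void main(String[] args) {
        // Assumes a Selenium RC server is already running on localhost:4444.
        Selenium selenium = new DefaultSelenium(
                "localhost", 4444, "*firefox", "http://localhost:8080/");
        selenium.start();
        try {
            selenium.open("/login");
            selenium.type("id=username", "testuser");   // placeholder locators and values
            selenium.type("id=password", "testpass");
            selenium.click("id=submit");
            selenium.waitForPageToLoad("30000");
            if (!selenium.isTextPresent("Welcome")) {
                throw new AssertionError("Login page did not show the welcome text");
            }
        } finally {
            selenium.stop();
        }
    }
}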

To tie our scripts together, we use TestNG inside of a Maven project. Selenium has Java packages that let us launch Selenium straight from our TestNG tests. While TestNG is designed for running unit tests, it works for our regression tests quite well. I was even able to build a TestNG listener to automatically insert the results of each test into our product, so our product is keeping track of the results of its own tests.
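The listener itself doesn’t take much code. A minimal sketch looks something like this, with reportResult() standing in for the call into our own product’s API (that part is specific to us, so it’s just a stub here), and the listener registered in testng.xml:

import org.testng.ITestResult;
import org.testng.TestListenerAdapter;

public class ResultReportingListener extends TestListenerAdapter {
    @Override
    public void onTestSuccess(ITestResult result) {
        reportResult(result.getName(), "PASS");
    }

    @Override
    public void onTestFailure(ITestResult result) {
        reportResult(result.getName(), "FAIL");
    }

    // Stub: in our setup this posts the result to our test-management product.
    private void reportResult(String testName, String status) {
        System.out.println(testName + " -> " + status);
    }
}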

By spending the time to create these automated tests and set them up in an easy-to-execute automation package (we run the Maven tests from Hudson), we’ve greatly extended our test coverage of the product. This helps us make sure that when we add new features or fix bugs in one place, we’re less likely to break something in another. It also saves us time each month by allowing us to spend that time creating new automations for new tests rather than manually running all of the old ones along with the new ones.

Later we plan on automatically deploying the product upon check-in via Hudson. Once we have that set up, we can run our automated front-end tests on every check-in and find problems faster.

Quick History of OSes

November 4, 2009

In an effort to write more on this blog, I’m posting a note originally meant for a friend who was very confused when I started talking about Linux. It’s a lot less technical than my usual posts, and general knowledge for quite a few people. Feel free to send corrections and I’ll update this post.

There are a lot of people who don’t know about the differences between operating systems, or that choices even exist, and I like to spread that information. I’ve been a Linux advocate and user since around 1997, and I like to share information about the OS, and OSes in general, when given the chance.

Back in the day (1969), a few guys at Bell Labs, part of AT&T, developed UNIX, a multi-tasking, networked, multi-user operating system. Basically, several people could each run several tasks on the same computer, they could do so from other computers (the networked part), and the computers could talk to each other. It became the most popular platform for business and scientific work because of the way programs could interact with each other and computers could talk to each other over a network, and it was what the original Internet was built upon. It was expensive though, and ran mostly on large mainframes and minicomputers from companies like IBM and DEC that cost hundreds of thousands to millions of dollars, things normal people couldn’t afford.

Meanwhile, Xerox started working on some experimental things at Xerox PARC (Palo Alto Research Center) relating to how users interact with computers. They developed things like the mouse and the graphical user interface (GUI). Before that, people just typed commands into a text screen with a keyboard. Apple was invited down to Xerox to have a look at some of their experiments and later sort of “borrowed” them for the original Macintosh. The Mac was one of the first personal computers (PCs) bought by regular joe users for home use. It brought about the use of the mouse and a GUI instead of a command line interface, and people thought it was cool. (The original advertisement, which was only played twice, aired during the Super Bowl: http://www.youtube.com/watch?v=OYecfV3ubP8 .) The Mac was released in 1984, and the ad played off the Orwell book of the same name.

IBM wanted to compete with Apple, but they didn’t really want to spend all that much money, so they decided to build a computer by buying off-the-shelf parts from a lot of different places and putting them together. They did that not only with hardware but with software as well. Bill Gates was trying to sell IBM on BASIC, which he and Paul Allen had developed. IBM liked it but also asked if they had an OS to sell. Bill said sure, and then proceeded to buy an OS from a small company in Seattle for $50k, a deal that company probably kicked themselves for in retrospect. Microsoft licensed it to IBM but retained the rights to resell it, which set MS on the path to becoming a multi-billion dollar company.

The OS they bought became MS-DOS, and later they shoehorned a GUI on top of it with Windows 1.0 through 3.11. The rest of the MS story is pretty well known. Both MS and Apple charged quite a bit for their OSes, though less than UNIX cost. People who went to college and worked on UNIX machines were disappointed that after they left they couldn’t have UNIX on their own machines, because it was stupid expensive.

Around this time (the mid ’80s to early ’90s) there was a group of people, led by Richard Stallman, pushing for Free Software. He founded the Free Software Foundation and pushed the idea that people should have access to Free software. The ‘Free’ he talked about wasn’t so much ‘free’ as in beer, but ‘Free’ as in freedom. He wanted software that you could modify yourself if you chose and that anyone could improve upon. His push laid the groundwork for what later became known as the “open source” movement, where source code is open for anyone to change. He doesn’t like copyright law very much. He also didn’t want companies to take all this free software, make changes, and sell it back to people without releasing the source code, so he made a license called the GNU General Public License that forces people who change the code and distribute it to also distribute the source code. He also very much wanted to create a full free operating system that worked like UNIX. He called it GNU (GNU’s Not Unix) and created a lot of the utilities and programs that form the basis of an OS.

He never finished a kernel (the program at the heart of any OS; basically the software that talks to the hardware on behalf of other software). Fortunately, a Finnish student by the name of Linus Torvalds developed one, first released in 1991, that ran on the generic x86 processors found in 386s, 486s, and later Pentiums and other Intel-derivative processors that were part of IBM-compatible PCs. He released the kernel as open source, and it ended up named Linux, after him (he admits to being a bit of a narcissist). The GNU tools were added in and it became a full operating system called GNU/Linux, or more often just Linux.

The interesting thing about Linux compared to Windows or even Mac OS was that it was free, both as in freedom and as in beer. You could make changes to any of the programs as long as you released the results. This fostered a large community of hobbyist and volunteer programmers who constantly added to the OS. MS has thousands of developers creating and maintaining Windows; Linux draws on a far larger worldwide pool of contributors, though granted mostly part time and without as much financial incentive.

It became very popular in the hacker community early on, and businesses such as IBM, Google, Yahoo, Amazon, Oracle, and others started to use it as a replacement for expensive UNIX machines. Eventually, around the year 2000, Apple completely redesigned the Mac OS based on NeXTSTEP and BSD (a free UNIX). OS X replaced Mac OS 9 and was largely incompatible with previous versions, but it was a full and recognized UNIX system that uses some of the same base commands and infrastructure found in Linux.

OS X and Linux are actually fairly similar (you can run a large number of open source Linux apps on OS X just by recompiling them). OS X came with a closed UI, though, and is much more polished as a user experience. Apple still contributes parts of the software it updates back to the open source community, and Linux has benefited from that.

The usual argument against Linux has been that it’s difficult for a new, inexperienced user to get working. That was true for a long while, but it has become about as easy to use as Windows or OS X (some would argue easier than Windows). Ubuntu is a distribution of Linux that is especially popular and noted for having a good user experience. You can download and try it out at http://www.ubuntu.com/.

Linux works on a very wide range of hardware, though not everything is supported (the list of unsupported hardware keeps getting smaller). While most games and quite a bit of business software are designed to run under Windows and not Linux, there are common (free) replacements for them. There are also technologies such as Wine (Wine Is Not an Emulator) that let you run Windows software directly in Linux; some versions even let you play games. For a typical home user’s daily tasks such as web browsing (Firefox), email (Thunderbird or Evolution), IM (Pidgin), photo collections (a few options whose names escape me), and music collections (again, a few whose names escape me), there are native apps (listed in parentheses) that replace the Windows counterparts. Some of those, such as Firefox, Thunderbird, and Pidgin, are also available on Windows.

Linux is a free OS for those who are light gamers and don’t want to pay for an expensive OS, or who want the freedom to heavily modify their OS. The modifications don’t always require changing source code and recompiling; many aspects of the OS are simply more configurable, and there is a wide range of free, quality software available. It is also more resistant to viruses, which is one reason many people turn to Linux and away from Windows. If you just want to try it out, you can download an Ubuntu CD and boot off of it into a sort of trial version (a live CD), or you can download a VM such as VirtualBox (http://www.virtualbox.org/) and install Linux in there without messing up your current OS.