Archive for October 2009

Data Scalability – The Write Problem

October 19, 2009

Many serious applications on the web need to scale to some extent. For smaller projects that you make for yourself or for a small audience, a single server box hosting both the web server and the database can be enough. For larger audiences and application demands, often you’ll separate the web server from the database into different boxes. This will allow a small bit of scaling as you can utilize one machine for your processing logic and another for your data store. But what about going further than that?

So you find that your front end machine can no longer handle all of the traffic arriving. Typically you’ll start setting up your application on multiple front end machines and hide them behind a load balancer so that your traffic is split between them. Now this can introduce some new problems depending on how you architected your application (session data can become tricky), but overall it helps relieve load. In fact, if you engineered your application so as not to rely on session data too heavily, or if you use a smart load balancer that hashes on the client’s IP address so that requests from one client always go to the same server, then you can expand horizontally in a linear fashion. Excellent! But what about your data?
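To make the IP-hash idea concrete, here is a minimal sketch of the routing decision (the backend hostnames are made up, and a real load balancer configured for IP-hash balancing does this for you):

import hashlib

# Hypothetical pool of front end servers behind the load balancer.
BACKENDS = ["app1.example.com", "app2.example.com", "app3.example.com"]

def pick_backend(client_ip):
    # Hash the client IP and map it onto the backend list, so a given
    # client is always routed to the same front end server and any
    # session data stays local to that box.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(pick_backend("203.0.113.42"))  # always the same backend for this IP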

Well most people store their data in some sort of RDBMS (Relational DataBase Management System), basically a database. On the free side there are tools like MySQL, PostgreSQL, SQLite, Berkeley DB, and probably a slew of others. On the not-free side there are a lot of options, but most people end up talking about Oracle. I’m not going to talk about Oracle because it’s a very large, expensive, difficult-to-manage piece of software that does a lot but requires a lot of support. I also won’t talk about MSSQL because no one should host anything on a Windows machine (personal bias). MySQL is what I’m most familiar with and probably the most popular of the free databases, so it is where I will focus.

If you have all of these front end machines set up with load distributed between them, you’re going to run into a problem where they are all trying to access the same database and get bogged down because your DB server is under too much load. What can you do? Well, you can set up replication and create slaves! Each slave receives all of the write commands that were performed on the master and replays them locally, so it keeps an up to date¹ copy of your data. From then on you can treat your slave as read-only and send read requests to the slave instead of the master, thereby freeing up processing on your master for just writes (and critical reads, but I won’t get into that).
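In application code, the split usually looks something like this toy example (the hostnames are hypothetical, and the actual connection would use whatever MySQL driver you already have):

import random

MASTER = "db-master.example.com"
SLAVES = ["db-slave1.example.com", "db-slave2.example.com", "db-slave3.example.com"]

def host_for(sql):
    # Crude classification: anything that isn't a SELECT is treated as a
    # write and must go to the master, which replication then fans out to
    # the slaves; plain reads can go to any slave.
    if sql.lstrip().upper().startswith("SELECT"):
        return random.choice(SLAVES)
    return MASTER

print(host_for("SELECT name FROM users WHERE id = 42"))         # a slave
print(host_for("UPDATE users SET name = 'bob' WHERE id = 42"))  # the master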

You can even add lots of slaves to scale reads horizontally in a linear fashion, which is exactly what we want. But what about writes? While one of your many front end boxes can choose one of your many slave database boxes to read data from, when it comes time to insert or update any data, those requests all have to go to the master database. In a write-heavy application that can become a bottleneck. Unfortunately you cannot scale writes like you can reads. You could create another DB machine with master-master replication (each machine is a slave to the other), but that doesn’t help your write problem at all, because each master has to perform not only its own writes but also replicate the other’s, nullifying any gain you might have hoped for.

The only real solution I’ve ever heard about is sharding, which isn’t a very fun solution at all. Basically you figure out some method for separating the data, say by user_id. All user data with an ID in a certain range gets sent to one master box, and all of the data for IDs in another range goes to another box. Unfortunately, because your data is now in two separate database systems, you can’t run SQL joins across them (Oracle RAC can, I believe, but again it’s expensive and hard to maintain for the regular developer). Instead you’ll have to make your queries separately and massage the data in code, splitting your model logic across multiple places, which is error prone.
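A range-based shard router can be as simple as the following toy sketch (the shard key, ranges, and hostnames are all made up for illustration):

# Each tuple is (lowest user_id, highest user_id, master host for that range).
SHARDS = [
    (1, 1000000, "db-master-a.example.com"),
    (1000001, 2000000, "db-master-b.example.com"),
]

def shard_for(user_id):
    for low, high, host in SHARDS:
        if low <= user_id <= high:
            return host
    raise ValueError("no shard configured for user_id %d" % user_id)

print(shard_for(123456))    # db-master-a.example.com
print(shard_for(1500000))   # db-master-b.example.com

Anything that spans both ranges can no longer be a single SQL join; it has to be two queries stitched together in application code.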

I start to wonder, then, if we need to use a database at all. Is there another data storage engine that would work better? I think about projects like memcached, and I wonder if one could make a similar system but with more integrity. There are some efforts to do so with Hadoop, but they’re not quite ready, and I’m not sure about the real-time performance of anything involving Hadoop just yet.

A lot of people won’t run into the write problem with scalability. But if you do, it can be genuinely difficult to solve in an elegant fashion. I haven’t done so, but I think it’s one of the next areas of research I shall be looking into.

¹ Technically the data isn’t guaranteed to be up to date as the replication process has some lag to it. It can take a few seconds or maybe minutes depending on the data and load for replication to catch up and the data to be mirrored onto the slave. That is why any critical reads that need to know exactly what’s on the master without doubt must be made to the master server.
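If you want to check that lag before trusting a slave for a near-critical read, MySQL exposes it as Seconds_Behind_Master in SHOW SLAVE STATUS. A rough sketch using the MySQLdb driver (the hostname and credentials are hypothetical):

import MySQLdb
import MySQLdb.cursors

def slave_lag_seconds(host):
    # Returns the slave's reported lag in seconds, or None if replication
    # is stopped or broken (Seconds_Behind_Master comes back as NULL).
    conn = MySQLdb.connect(host=host, user="monitor", passwd="secret",
                           cursorclass=MySQLdb.cursors.DictCursor)
    try:
        cur = conn.cursor()
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

# Example: fall back to the master if the slave is more than 5 seconds behind.
lag = slave_lag_seconds("db-slave1.example.com")
use_master = lag is None or lag > 5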

Dynamic Continuous Integration

October 9, 2009

I have an idea for continuous integration that goes a bit beyond what I know Hudson to be capable of.

We work as a team on a project. The project is stored in SVN (like all good projects should be) and is continuously checked out by Hudson to be built on a CI server. However, Hudson is set up to just check out the latest version of trunk, polling for commits every 5 minutes. The problem is that most developers do their work on development branches, which don’t take part in the CI cycle. Those changes don’t gain the benefit of continuous integration until they are merged into trunk, which wastes a lot of developer potential.

In our development system, people create branches for each bug or feature set that they are working on. If a bug is trivially small or easy to see and fix, it might not warrant its own branch and will just be fixed on trunk. But for larger changes, a branch is cut so that one person’s development won’t adversely affect other developers. It also allows us to postpone features from our monthly release cycle. If you’re working on a feature that is only halfway complete by the time our release branch is cut and code freeze begins, you can simply not merge your change into trunk, and you won’t dirty up the release branch with half-finished code. Once you finish your change, you merge into trunk and resolve any conflicts that have come up. It’s not a perfect process, but I don’t know of anyone’s that is (git excluded).

Because of these separate branches, developers are not able to harness the utility of Hudson and our CI environment. Hudson is set up with a specific location in SVN in mind when you create a project, usually trunk. If you want to build your branch, you have to set up a new project and go through all of the configuration of your trunk project again, just with the new location. There’s no real way to just duplicate a project so you can change a setting, and even if there were, the manual steps involved would be enough that most developers wouldn’t bother. There should be a better way.

I think Hudson should support a different style of project. The project would define a standard build, like we did with trunk on our project, but would also monitor all of the folders under branches and treat each one as its own project. Whenever a change is made to one of those branches, a build would be performed, and if a new branch is created, a new project would dynamically be created for that branch. When the build is made, it would be deployed according to the staging instructions of the original project, but to a dynamically generated VM, or to one drawn from a pool of VMs with modifiable settings for things like ports and server names. This would let us do continuous deployment on each of our development branches: if a QA tester finds a problem with a new feature a developer is working on in a branch, the developer can quickly provide a fix, have it automatically go through the full set of continuous integration tests, and have it deployed to the staging environment for the tester to try again.
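A very rough sketch of the branch-watching half of that idea, written as an external script rather than a built-in Hudson feature: poll the branches folder in SVN and clone a template job for any branch that doesn’t have one yet. It assumes Hudson’s remote API supports copying a job via createItem?mode=copy, and all of the URLs and job names are hypothetical; a real version would also have to rewrite the copied job’s SVN location and clean up jobs for deleted branches.

import json
import subprocess
import urllib.parse
import urllib.request

SVN_BRANCHES = "http://svn.example.com/project/branches/"
HUDSON = "http://hudson.example.com"
TEMPLATE_JOB = "project-trunk"

def list_branches():
    # "svn ls" on the branches directory gives one folder name per branch.
    out = subprocess.run(["svn", "ls", SVN_BRANCHES],
                         capture_output=True, text=True, check=True).stdout
    return [line.rstrip("/") for line in out.splitlines() if line.strip()]

def existing_jobs():
    # Hudson lists its jobs at /api/json.
    with urllib.request.urlopen(HUDSON + "/api/json") as resp:
        return {job["name"] for job in json.load(resp)["jobs"]}

def create_branch_job(branch):
    # An empty POST to createItem with mode=copy clones the template job.
    query = urllib.parse.urlencode(
        {"name": "project-" + branch, "mode": "copy", "from": TEMPLATE_JOB})
    urllib.request.urlopen(HUDSON + "/createItem?" + query, data=b"")

if __name__ == "__main__":
    jobs = existing_jobs()
    for branch in list_branches():
        if "project-" + branch not in jobs:
            create_branch_job(branch)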

This fast turnaround between bug reporting, bug fixing, and fix deployment could greatly enhance productivity and let us try out different solutions faster. With a fast feedback loop, we can work directly with the QA department or our customers to solve the issue they’re having and let them try the fix right away, in a fashion that doesn’t affect anyone else’s bugs or the problems they are working on.

In essence, dynamic continuous integration would broaden the usage of our continuous integration servers to cover all development, even on separate branches, without the need to manually create (or forget to create) projects for them. I think that if we built such functionality into Hudson, along with some helper tools for managing large pools of servers for fast staging environments, we could create stronger bonds between developers and testers, and between developers and their customers, through a shorter feedback loop and faster turnaround of problems.