6/29/2006

Land - Water - Air

Till yesterday I had no plans whatsoever for the 4th of July long weekend. Well, it is not really a long weekend for me, and I am working on Sat and Sun mornings, which is another reason why I didn't make any plans.

But things just started falling into place yesterday afternoon. If everything works out as planned, then on Saturday morning I will be joining my friend David and his GF on their boat, and we plan on doing some jet skiing and tubing. I cannot swim very well, but with a life jacket and a lifeguard on board, I don't mind risking a little bit.

Talking about risky business, the craziest thing on my mind this weekend is going skydiving in Minneapolis. I have been postponing it for way too long. My roomies did it in 2003, and ever since they have been trying to persuade me to go for it. I think the moment has finally come. I have asked a few friends here at Mayo if they would like to join me for the dive; the answer has been negative so far. But I have made up my mind, and I am going to do it, with company or alone.

Taking off finally

I have been delaying my India trip for quite some time, almost 3 years now. But I finally bought my ticket last week, and yesterday I confirmed my flight connections in India. Now it is time to start packing my bags. One weird thing: even though I am going back home, I don't have the excitement I had when I went back the first time.

1/18/2006

Fibers

Fibers were added to Windows (in NT 3.51 SP3, IIRC) because some customers (not just SQL Server) believed that they could improve the performance of their server applications if they had more control over their threading environment.

But why on earth did these customers want fibers?

Well, it's all about scalability, especially on MP (multiprocessor) systems. On a multitasking system, it's easy to forget that a single processor can only do one thing at a time.

The ramifications of this are actually quite profound. If you've got two tasks currently running on your system, then your operating system will have to switch between them. That switch is called a context switch, and it can be expensive (just for the sake of argument, let's say a context switch takes 5000 instructions). In a context switch, the operating system has to (at a minimum):

  1. Enter Kernel Mode
  2. Save all the old thread's registers.
  3. Acquire the dispatch spinlock.
  4. Determine the next thread to run (if the next thread is in another process, this can get expensive)
  5. Release the dispatch spinlock.
  6. Swap the old thread's kernel state with the new thread's kernel state.
  7. Restore the new thread's registers.
  8. Leave Kernel Mode
That's a fair amount of work to perform (not outrageous, but not trivial).

The OS won't do this unless it has to. In general, three things will cause the OS to perform a context switch (there are others, like page faults, but these are the big ones):

  1. When your thread goes to sleep (either by calling Sleep() or by calling WaitFor[Single|Multiple]Object[s])
  2. When your thread calls SwitchToThread() or Sleep(0) (a special case of the Sleep() API that is identical to SwitchToThread())
  3. When your thread's quantum elapses.
A thread's quantum is essentially the amount of time that the OS will dedicate to a thread if there's another thread in the system that can also run. A quantum is something like 5-10 ticks on a workstation and 10-15 on a server, and each tick is typically somewhere between 10 and 15 milliseconds, depending on the platform. In general, your thread will get its full quantum unless there is a higher-priority runnable thread in the system (please note: this is a grotesque simplification, but it's sufficient for the purposes of this discussion).
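To make those figures concrete, here is a quick back-of-the-envelope calculation using the rough numbers above (the tick and quantum counts are the post's estimates, not exact values for any particular Windows version); the sketch is illustrative Python, not anything Win32-specific:

```python
def quantum_ms(ticks_per_quantum, ms_per_tick):
    """Approximate length of a scheduling quantum in milliseconds."""
    return ticks_per_quantum * ms_per_tick

# Workstation: ~6 ticks/quantum at ~10 ms/tick
workstation = quantum_ms(6, 10)   # 60 ms
# Server: ~12 ticks/quantum at ~15 ms/tick
server = quantum_ms(12, 15)       # 180 ms
print(workstation, server)  # 60 180
```

So an undisturbed thread gets tens to hundreds of milliseconds of CPU at a stretch, which is an eternity compared to a single context switch.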

The thing is, for a highly scalable application, context switches are BAD. They represent CPU time that the application could be spending working for the customer, but which is instead spent doing what is essentially OS bookkeeping. So a highly scalable application REALLY wants to reduce the number of context switches. If you ever have a service that's performing poorly, one of the first things to look for is the number of context switches/second - if it's high (for some value of high), then there's invariably a scalability issue in the application that needs to be addressed.

So why fibers? Because for highly scalable applications, you want each of your threads to get its full quantum - in other words, you want the only reason for a context switch to be reason #3 above.

Remember the first cause of context switches: Calling WaitFor*Object. What that means is that if you call EnterCriticalSection on a critical section with contention, then you're highly likely to cause a context switch. The same thing happens when you wait for an I/O to complete, etc. You absolutely want to avoid calling any Win32 APIs that might block under the covers.
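One common mitigation is to probe a lock without blocking; Win32 has TryEnterCriticalSection for exactly this. The sketch below shows the same non-blocking-probe idea in Python, purely as an illustrative analogy (threading.Lock is not the Win32 API):

```python
import threading

lock = threading.Lock()
lock.acquire()  # simulate another task currently holding the lock

# Non-blocking probe, similar in spirit to Win32's TryEnterCriticalSection:
# if the lock is contended, we find out immediately instead of blocking
# (and paying for a context switch while we wait).
got_it = lock.acquire(blocking=False)
print(got_it)  # False: the lock was contended, but we never blocked

lock.release()
```

On contention, the caller can go do other useful work and probe again later, rather than handing its quantum back to the OS.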

So fibers were created to resolve this issue. A fiber switch effectively removes steps 1, 3, 5, and 8 from the context switch steps above: switching from one fiber to another just saves the old register state and restores the new register state. It's up to the application to determine which fiber runs next, but the application can make its own choices. As a result, a server application could have a dozen or more "tasks" running on each thread, and it would radically reduce its context switch overhead, because saving and restoring registers is significantly faster than a full context switch. The other thing fibers allow is the ability to avoid the dispatcher spinlock (see John Vert's comment about context switches being serialized across all processors). Any global lock hurts your scalability, and fibers allow an application to avoid one of the global locks in the system.
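The real facility is the Win32 C API (ConvertThreadToFiber, CreateFiber, SwitchToFiber), but the core idea - the application, not the kernel, decides which task runs next - can be sketched portably. Here is a minimal sketch using Python generators as stand-ins for fibers; the names and structure are illustrative, not the Win32 API:

```python
# Fiber-style cooperative switching: each "fiber" runs until it
# voluntarily yields, and no kernel context switch is involved.
log = []

def fiber(name, steps):
    for i in range(steps):
        log.append(f"{name}{i}")
        yield              # hand control back, like SwitchToFiber

a, b = fiber("A", 2), fiber("B", 2)
for f in (a, b, a, b):     # the application, not the OS, picks who runs next
    next(f)

print(log)  # ['A0', 'B0', 'A1', 'B1']
```

The switch order is hard-coded here; a real application would make that decision dynamically, which is exactly the scheduling burden discussed below.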

Ok, so why have fibers remained obscure?

They've remained obscure first because of the reasons Raymond mentioned in his fibers caveat - using fibers is an all-or-nothing thing, and it's not possible to use fibers from a shared library. Some of the idiosyncrasies of the fiber APIs have been resolved in the current versions of Windows.

They're also HARD to deal with - you essentially have to write your own scheduler.
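"Write your own scheduler" sounds abstract, but even the simplest version means maintaining a run queue and a dispatch loop yourself. A hypothetical round-robin scheduler, sketched with Python generators standing in for fibers (an illustrative analogy, not the real fiber APIs):

```python
from collections import deque

def round_robin(tasks):
    """A toy user-mode scheduler: run each task until it yields,
    then rotate it to the back of the queue until all finish."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run until the task yields
            queue.append(task)  # still alive: requeue it
        except StopIteration:
            pass                # task finished: drop it

log = []

def worker(name, steps):
    for i in range(steps):
        log.append((name, i))
        yield

round_robin([worker("A", 3), worker("B", 1)])
print(log)  # [('A', 0), ('B', 0), ('A', 1), ('A', 2)]
```

Even this toy has to handle task completion and fairness; a production scheduler also needs priorities, wakeup conditions, and a story for blocking, which is where the real difficulty lies.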

Raymond also left off a couple of other gotchas: for example, if you're using fibers to improve your app's scalability, you can't call ANY Win32 APIs that might block (including filesystem APIs), because all the Win32 blocking APIs have thread affinity (not surprisingly :)). So if you're running 20 fibers on a single thread, when any of the fibers blocks, your thread blocks (however, the fibers can be run from another thread, because fibers don't have thread affinity; so if you have a spare thread around, that thread can run the fibers).
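The "one blocking task stalls them all" problem is easy to demonstrate with Python generators used as an illustrative stand-in for fibers: everything shares one thread, so a blocking call inside any task freezes the whole dispatch loop (time.sleep here stands in for any blocking call, such as file I/O):

```python
import time

log = []

def blocker():
    time.sleep(0.2)   # stand-in for a blocking call (file I/O, a lock, ...)
    log.append("blocker done")
    yield

def quick():
    log.append("quick done")
    yield

start = time.monotonic()
for task in (blocker(), quick()):
    next(task)
elapsed = time.monotonic() - start

# quick() could not run until blocker() returned: the whole thread stalled
# for at least the 0.2 s the blocking call took.
print(log)  # ['blocker done', 'quick done']
```

This is why fiber-based designs must route every potentially blocking operation through asynchronous mechanisms instead of calling it directly.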

The other reason that fibers have remained obscure is more fundamental. It has to do with Moore's law.

Back when fibers were first implemented, CPUs were a lot slower. Those 5000 instructions for the context switch (again, this is just a guess) took .05 milliseconds (assuming one cycle per instruction) to execute on a 100MHz machine (which would be a pretty fast machine in 1995). On a 2GHz machine, that .05 becomes .0025 milliseconds - more than an order of magnitude smaller. The raw cost of a context switch has gone down dramatically. In addition, there has been a significant amount of work in the base operating system to increase the scalability of the dispatcher spinlock - nowadays, the overhead of the dispatcher lock is essentially nonexistent on many MP machines (you start to see contention issues only on machines with a large number of CPUs, for some value of "large").
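The arithmetic behind that claim (remembering that the 5000-instruction figure is the post's guess, and one instruction per cycle is a simplification):

```python
INSTRUCTIONS = 5000  # the post's rough guess for one context switch

def switch_cost_ms(clock_hz):
    """Context-switch cost in milliseconds, assuming one instruction/cycle."""
    return INSTRUCTIONS / clock_hz * 1000

cost_1995 = switch_cost_ms(100e6)  # 100 MHz machine -> 0.05 ms
cost_2005 = switch_cost_ms(2e9)    # 2 GHz machine   -> 0.0025 ms
print(cost_1995, cost_2005, cost_1995 / cost_2005)
```

The ratio works out to 20x, so the raw instruction cost of a switch shrank by more than an order of magnitude over that decade.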

But there's another aspect of performance that has gone up dramatically, and that's the cost of blowing the CPU cache.

As processors have gotten smarter, the performance of the CPU cache has become more and more critical to their speed - because main memory is painfully slow compared to the speed of the processor, if you're not getting your data from the CPU's cache, you're paying a huge hidden penalty. And fibers don't fix this cost - when you switch from one fiber to another, you're going to blow the CPU cache.

Nowadays, the cost of blowing the cache has leveled the playing field between OS context switches and fibers - these days, you don't get nearly the benefit from fibers that you did ten years ago.

This isn't to say that fibers won't become useful in the future; they might. But they're no longer as useful as they were.

Btw, it's important to note that fibers aren't the ONLY solution to the thread quantization issue mentioned above. I/O completion ports can also be used to limit context switches - the built-in Win32 thread pool uses them (that's also what I used in my earlier post about thread pools). In fact, the recommendation is that instead of spending your time rewriting your app to use fibers (and it IS a rewrite), it's better to rearchitect your app to use a "minimal context" model - instead of maintaining the state of your server on the stack, maintain it in a small data structure, and have that structure drive a small one-thread-per-CPU state machine. You'll still have the issue of unexpected blocking points (you call malloc, and malloc blocks accessing the heap critical section), but that issue exists regardless of how your app is architected.
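What a "minimal context" model looks like in practice: each request's state lives in a small record rather than on a blocked thread's stack, and one loop (one per CPU in the real design) advances whichever requests are runnable. A hypothetical sketch - the state names and fields are made up for illustration:

```python
from dataclasses import dataclass

# Each in-flight request is a small record, not a parked thread stack.
@dataclass
class Request:
    state: str          # "read" -> "process" -> "reply" -> "done"
    data: str = ""

def step(req):
    """Advance one request by one state transition."""
    if req.state == "read":
        req.data = "payload"        # stand-in for a completed network read
        req.state = "process"
    elif req.state == "process":
        req.data = req.data.upper() # stand-in for the real work
        req.state = "reply"
    elif req.state == "reply":
        req.state = "done"          # stand-in for sending the response

# One dispatch loop drives all requests; no per-request thread exists.
requests = [Request("read"), Request("process", "hi")]
while any(r.state != "done" for r in requests):
    for r in requests:
        if r.state != "done":
            step(r)

print([(r.state, r.data) for r in requests])
```

In a real server, each transition would be kicked off by an I/O completion rather than polled in a loop, but the key property is the same: thousands of requests can be in flight with only a handful of threads.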

If you're designing a scalable application, you need to architect it to minimize the number of context switches, so it's critical that you not add unnecessary ones (like queuing a request to a worker thread and then blocking on the request, which forces the OS to switch to the worker and then back to the original thread).
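The queue-then-block anti-pattern is easy to spot in code. A hypothetical sketch (handle and the threading setup are invented for illustration): handing the request to a worker and immediately joining buys nothing over doing the work inline, but forces the OS to switch to the worker and back:

```python
import threading

def handle(request):
    return request * 2   # stand-in for the real work

# Anti-pattern: queue the request to a worker thread, then block on it.
result = {}
worker = threading.Thread(target=lambda: result.update(v=handle(21)))
worker.start()
worker.join()            # forces a switch to the worker, then back to us
queued = result["v"]

# Same answer, no extra context switches: just do the work inline.
inline = handle(21)
print(queued, inline)  # 42 42
```

Handing work to another thread only pays off when the caller has something else to do in the meantime; otherwise the round trip is pure overhead.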

--
From one of the internet discussion forums

11/16/2005

A Glimpse of First Snow

Yeah, winter has truly arrived... and following are a few pics of it.