The Windows 7 chkdsk crash – and how would Linux have fared?

Stories of the Windows 7 showstopper bug with a chkdsk crash have been circulating. Newly appointed Windows President Steven Sinofsky was pretty conversational on the issue, and has now posted his notes on how crash reports get handled. Had it not been Microsoft, how would other vendors have managed?

Unfortunately, Steven’s latest blog post is very long and hard to read, but if you persevere, you’ll learn a few things. A short recap of the team’s actions:

  • They looked at the crash telemetry data sent to Microsoft on any crash, checking if chkdsk.exe was running at the time of the crash.
  • They ran automated tests for two days on 40 machines, trying to reproduce the bug.
  • They looked for non-responsiveness in manual tests.
  • They asked for support from Microsoft employees globally, getting a few hundred more test configurations to run.
  • They kept reading blogs, forums etc. in hopes of finding a reproducible crash scenario.
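
The first step on that list is, at heart, a filter over crash reports. Here is a minimal sketch in Python of what such a check might look like. The report format and the field names (`machine_id`, `processes`) are made up for illustration; they are not Microsoft's actual telemetry schema.

```python
from collections import Counter

# Hypothetical crash reports; the field names are illustrative assumptions,
# not a real telemetry format.
SAMPLE_REPORTS = [
    {"machine_id": "a1", "processes": ["explorer.exe", "chkdsk.exe"]},
    {"machine_id": "b2", "processes": ["explorer.exe", "winword.exe"]},
    {"machine_id": "c3", "processes": ["CHKDSK.EXE", "svchost.exe"]},
]


def crashes_with_process(reports, process_name):
    """Return the reports in which process_name was running at crash time."""
    wanted = process_name.lower()
    return [r for r in reports if wanted in (p.lower() for p in r["processes"])]


def process_frequency(reports):
    """Count in how many crash reports each process appears (case-insensitive)."""
    counts = Counter()
    for report in reports:
        # A set ensures a process is counted once per report.
        counts.update({p.lower() for p in report["processes"]})
    return counts


suspects = crashes_with_process(SAMPLE_REPORTS, "chkdsk.exe")
print(len(suspects), "of", len(SAMPLE_REPORTS), "crashes had chkdsk.exe running")
```

With millions of reports the interesting number is the relative one: whether crashes with chkdsk.exe running are more common than chance would suggest.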

Without going too deep, I think we all have to admit that Microsoft did some pretty serious things to look for the crash.

What if this had happened on a Linux box?

No doubt this could happen just as well on a Mac or a Linux system. It might be less likely for various reasons, but it could happen – even if somebody wrote the perfect application, hardware is never 100% reliable.

For me, the interesting thought experiment is this: How would a distributed development team with no central funding organization have managed such a scenario? Would the troubleshooting telemetry be available and where would such a highly critical mass of data be stored? Would somebody have had the necessary hardware to run all those automated tests? How about the raw manpower?

Both Windows and Linux communities have loads and loads of experienced users. The gigantic advantage of the open source world is that those users are able and empowered to fix their own problems, something you certainly cannot say for many Windows issues. That produces a massive developer potential that is simply great at pushing the software forward. But then, I quote Linus Torvalds from his recent interview in Linux Magazine (about Microsoft GPL code submissions):

‘I agree that it’s driven by selfish reasons, but that’s how all open source code gets written! We all “scratch our own itches”.’

If that is how all open source code gets written, how would a crash bug so rarely encountered ever get fixed – unless the crash happened to hit a user with a sufficient skill set to fix it?

Mr. Torvalds probably didn’t mean his words that literally: Of course the open source world has plenty of unselfishness and organizations who get paid for the software and provide support in return. But still, that’s vastly different from a Microsoft-like behemoth who owns the software, with all its profits and problems.

I certainly do not want to position the chkdsk issue as an argument for the operating system holy war. I have never experienced a Linux crash that would have required broad community help to resolve, and I openly admit my inability to judge the issue. Linux is a magnificent open source project with lots of great developers, and might well be able to resolve many problems like this.

But it still keeps puzzling me: Were Linux suddenly deployed on a few hundred million home desktops in the hands of novice users, would somebody care to scratch their itches? Were a large-scale troubleshooting effort like this required, how would it get organized?

If you have experience in resolving issues like these on non-Windows platforms and are willing to share the stories, please do so in the comments section.

August 10, 2009 · Jouni Heikniemi · 18 Comments
Posted in: General

18 Responses

  1. Jaba - August 10, 2009

    Similar stuff does happen in the OSS world all the time. But it tends to happen in alpha releases or in distributions that are targeted toward developers and curious testers, such as Debian Testing/Unstable, Fedora and Gentoo's unstable branch.

    Since Linux distributions are pretty much the same – kernel, GNU stuff, X.org, some window manager(s), basic software such as GIMP and OO.o – finding a bug in one distribution probably helps other distributions, too. Responsible distribution developers report bugs upstream (for example, a Gentoo developer should report a Gnome bug to Gnome's Bugzilla).

    And yes, all source code gets written for selfish reasons. But so does testing: most likely I will not test some scientific application, but if a rumour tells me some version of e2fsck could eat my precious data, I would test the patch somewhere immediately.

    Add to this the fact that new distribution versions are being released all the time. Ubuntu gets a new release every six months, and so do many, many other distributions. All of them get a fair amount of testing from their developers and users. This constant release cycle makes it easier to avoid major changes between releases, so bug hunting stays easier.

    OK, there might be a new audio subsystem, such as PulseAudio, in some new distribution version. There might be a next-generation desktop environment, like KDE 4. Or some new security features, like PolicyKit. But "hey, let's rebuild this tried and true software stack from scratch for our next distribution release" rarely, if ever, happens.

    From some points of view, the variety of Linux distributions is a weakness. Hardware and software vendors tend not to like it.

    But let's speculate that Linux would get deployed on a few hundred million home desktops overnight. Of course that would be a support hell, but on the other hand it would be extremely unlikely that a single distribution would be installed for everyone. More likely the French would be using Mandriva, the Chinese would be using Red Flag Linux, in the USA they would use Red Hat, around Europe (open)SuSE or Ubuntu (or one of its variants), in Africa they would use Ubuntu (or in the case of OLPC, Sugar), and netbook users would use Ubuntu Netbook Remix, Eeebuntu or similar. Owners of old hardware would use some light distribution, such as Zenwalk, Damn Small Linux, Vector Linux or Puppy Linux.

    Those who would like to pay for their OS and support would use SuSE or Xandros.

    The number of distributions is nowadays pretty amazing, as you can see from Distrowatch: http://distrowatch.com/

    Each and every one of those distributions has a basic infrastructure (forums, IRC channels, bug tracker, mailing lists etc.) in place. Sure, their quality and activity vary, but if you stay on the safe side and use one of the major distributions, the service level is pretty good. The number of developers is quite high, too – around 1500 people are working on just the Linux kernel itself. Not all of them are very active, of course, but still that's the ballpark figure between two kernel releases.

    So in the case of 500 million novices flooding the Internet with their questions, the front-line helpdesk for them would be the distribution they use. Then the distro developers would decide if a) the problem exists between keyboard and chair, b) the problem is actually in the distribution or c) the problem is an actual bug in some application, in which case the bug should be filed upstream.

    Severe filesystem/fsck level bugs are rare in the Linux world, but this "annoying as hell if it hits you and a bitch to debug" bug in KDE 4 still lurks around: http://bugs.kde.org/show_bug.cgi?id=171685. No one seems to have any idea what's going on, and symptoms/workarounds are different for everyone. Let's see when and how this issue gets resolved.

    And about the steps Microsoft took and their equivalents in the OSS world:

    "They looked at the crash telemetry data sent to Microsoft on any crash, checking if chkdsk.exe was running at the time of the crash." –> in bug reports, strace, dmesg, lspci etc. output is usually requested.

    "They ran automated tests for two days and on 40 machines, trying to reproduce the bug." –> major software, such as MySQL or filesystems, does have automated tests in place. Also, distributions tend to test packages for regressions; for example, Gentoo has Tinderbox and openSUSE has its own cloud-like build service.

    "They looked for non-responsiveness in manual tests." –> sure, that's what nerds do during nocturnal hours.

    "They asked for support from Microsoft employees globally, getting a few hundred more test configurations to run." –> in the case of a severe bug, all the distributions would take part in testing, unless the bug was found in some very new version.

    "They kept reading blogs, forums etc. in hopes of finding a reproducible crash scenario." –> that's what nerds do during nocturnal hours.

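    Gathering the strace/dmesg/lspci kind of output mentioned above is easy to script. A rough sketch in Python, assuming nothing more than that the listed tools may or may not be installed; the command list and the function name `collect_diagnostics` are only illustrative, not any distribution's actual reporting tool:

```python
import shutil
import subprocess

# Commands commonly requested in Linux bug reports; the list is an example only.
DEFAULT_COMMANDS = [["uname", "-a"], ["dmesg"], ["lspci"]]


def collect_diagnostics(commands=DEFAULT_COMMANDS, timeout=10):
    """Run each command and return {command line: output or a short note}."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        # Skip tools that are not installed instead of failing the whole run.
        if shutil.which(cmd[0]) is None:
            results[key] = "(not available on this system)"
            continue
        try:
            done = subprocess.run(
                cmd, capture_output=True, text=True, errors="replace", timeout=timeout
            )
        except (subprocess.TimeoutExpired, OSError) as exc:
            results[key] = f"(failed: {exc})"
            continue
        if done.returncode == 0:
            results[key] = done.stdout
        else:
            # For example, dmesg may be restricted to root on some systems.
            results[key] = f"(failed: {done.stderr.strip()})"
    return results


if __name__ == "__main__":
    for name, output in collect_diagnostics().items():
        print(f"== {name} ==\n{output}")
```

    The output still has to be attached to a bug report by a human, which is exactly the step novice users tend to skip.
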
  2. Jouni Heikniemi - August 10, 2009

    Jaba, thanks for the extensive writeup on your thoughts. A few things I want to comment on:

    The majority of bugs get ironed out in test builds / betas in the Windows world as well. Usually well-defined, easy-to-repro bugs are a non-issue for any development team, and releases can be held until critical issues are fixed. The bugs that occur sporadically in production (post-release), just like the KDE one you linked to, are the true test of bug-slaying ability.

    Without denying the merits and achievements of all Linux testers, I still find alpha-branch testing wholly different from actual production use and bugs found therein. And as the user base broadens, the testing power of an active minority is severely diminished – no matter how accomplished the gurus running Linux are, they just don't have the breadth of configurations and hardware that the end user base has.

    Likewise for using strace, dmesg and so on. Extremely valid measures for debugging among expert users and developers, but as such, those tools aren't very usable from an end user's perspective.

    First, once you get beyond the realm of developer-minded people, end users do not report bugs. Second, even if they did, they don't attach reports – things have to get far more automated to provide sufficient data for analysing millions and millions of crashes afterwards like Microsoft did.

    I'm not doubting the technical debugging ability of Linux developers for a second. My concerns focus around this: How easy is it to organize people around a mythical bug like KDE#171685 and drive the fixing process until you're done? How will you motivate people into the drudgery of a bug hunt, perhaps week after week? Of the 1500 free-willed kernel developers, how many will jump on an issue that doesn't concern them personally?

    That said, Microsoft has had its dark moments. Despite having customers pay for Windows, support and bug fix rates have been abysmal at times, and the level of transparency has been on par with a lump of coal. Closed source certainly doesn't automatically make things better, but I am impressed with the recent developments in Redmond.

    And I certainly hope Linux will be able to avoid the painful periods Windows went through – much of large-scale error tracking technology just wasn't there at the time the number of Windows installations exploded. Linux doesn't have to go through the same steps.

  3. Jaba - August 11, 2009

    Oh. Forgot to mention that at least Gnome and KDE have an automatic bug report tool built in, which gathers the needed information and sends it to Bugzilla (with optional comments).

    Gnome's Bug Buddy:

    http://library.gnome.org/users/user-guide/stable/feedback-bugs.html.en

    http://www.builderau.com.au/i/g/gnome216/Bug Buddy.png

    KDE's Dr. Konqi:

    http://farm4.static.flickr.com/3638/3400802859_619a0afe03.jpg?v=0

  4. Jaba - August 11, 2009

    Hmm. I posted a LONG comment here this morning, but it seemed to disappear during posting. Let's try again; I'll split the comment in two parts this time.

    "Without denying the merits and achievements of all Linux testers, I still find alpha-branch testing wholly different from actual production use and bugs found therein. And as the user base broadens, the testing power of an active minority is severely diminished – no matter how accomplished gurus run Linux, they just don’t have the breadth of configurations and hardware that the end user base has."

    Kernel guru Alan Cox was asked about this back at LinuxDay 2003 in Helsinki. "How do you test your driver code?" – "I'll upload my code to an ftp server and see what happens."

    So yes, in some (most?) cases hardware support will be tested by the other developers and end-users. But the times are changing, and doing so very fast. Gone are the days when practically all the hardware drivers were hacked together by volunteers and private persons.

    Some examples:

    – A couple of years back USB support was not so good, especially when the hardware you used was not behaving. For example, back in 2003 I had a 128M USB memory stick, which did not do some standard USB handshake when plugged in, but expected to receive 64 bytes of data before showing itself to the OS.

    http://www.mail-archive.com/linux-usb-devel@lists.sourceforge.net/msg11159.html

    Then the USB stack got rewritten for kernel 2.6.x and things have been a lot better ever since.

    Quick jump to this day: Linux was the first operating system to get support for USB 3.0. The actual code came from Intel.

    – Intel has opened practically all of their drivers and is working on Linux support VERY actively. Examples of their community enthusiasm include http://www.lesswatts.org/ and http://intellinuxwireless.org/ (the latter isn't even required any more … :-))

    – Ever since AMD bought ATI, their driver support has improved dramatically.

    – Nvidia has been a great supporter of Linux for many years already, although their driver is not open and loading a massive binary blob as a kernel module makes debugging more difficult every now and then.

    – HP is providing excellent driver support for Linux. Practically all of their printers (old and new) are supported in Linux, and that shows. A couple of years back configuring a printer wasn't so easy in Linux. Nowadays I just plug my USB printer into my laptop and everything gets configured automatically. And at work a network laser printer will be found and be usable in seconds.

    – Nokia is working on Linux very actively in the embedded/mobile hardware field. Examples include their Internet Tablet and Maemo interface, and their web browser found in most Symbian phones, which is WebKit-based, which in turn is based on KDE's khtml.

    – If you buy a new router, media center or a NAS, chances are it's running Linux.

  5. Jaba - August 11, 2009

    What simplifies hardware/driver testing dramatically is that vendors do not each provide their own driver for every single product. That would be very silly, since in the real world there are not very many chipsets and chips to be supported. Most network cards are based on Intel or Broadcom chipsets, most motherboards are based on VIA, Intel or some other major vendor's chipsets, and most graphics cards are from Intel, Nvidia or ATI.

    This is one of my pet peeves about Windows – how the hell do they make sure something's working in Windows, because there must be millions of different driver versions around. In Linux the driver support is in the kernel itself; just go to kernel.org and see the drivers. Luckily the end user doesn't have to worry about that nowadays, since the stock kernel in major distributions is usually very good and up to date.

    My general feeling is that when it comes to hardware support, it is already very good. BUT regardless of the OS you use, you should upgrade the BIOS/firmware as soon as you get your new tech toy.

    Case in point: Dell Latitude series laptops shipped with a very broken BIOS (revision A03), at least in 2007. Several things were not working at all. Some things worked in Linux, some things worked in Windows. One of the things not working at all in Windows was Bluetooth support. One of my colleagues was wondering if his laptop was equipped with Bluetooth at all. I popped my beloved Kubuntu live CD into his laptop's CD drive and behold, Bluetooth worked out of the box.

    The other things not working so well with that A03 BIOS, regardless of the OS: (un)docking the laptop froze everything at least 50% of the time, WLAN saw the access points but could not connect to them, connecting the laptop to a video projector sometimes worked and sometimes resulted in a 320×240(!) or 640×480 resolution, and suspend/hibernate/resume were not working at all.

    Upgrading the BIOS to a newer revision fixed most of the issues, and the current revision I've installed (A14) is almost perfect; hiccups are extremely rare.

    "I’m not doubting the technical debugging ability of Linux developers for a second. My concerns focus around this: How easy it is to organize people around a mythical bug like KDE#171685 and drive the fixing process until you’re done?"

    Organizing people to do something boring and/or too challenging is never easy. There are ways, some more effective than others. Gentoo has a "bug Saturday" once a month. That means a day full of bug-squashing, hosted on its own IRC channel. Trivial bugs ("Hey, there's a typo") will get resolved in almost real time, more complex ones will get discussed and analyzed. Those bug days are quite popular and sometimes the results are very good.

    Then there are live meetings. Length can be anything from "let's meet in a pub for 2 hours and hack away" to "let's fly to Canary Islands and code for a full week" (http://dot.kde.org/2009/08/06/free-desktop-communities-come-together-gran-canaria-desktop-summit – OK, that was a conference, but if the meeting was at all like PHP Conference or ApacheCon, people DO code there quite a lot together.)

    "How will you motivate people into the drudgery of a bug hunt perhaps weeks after weeks?"

    The most active bug squashers are the ones who don't need too much motivating, or at least it seems to be so. :-) Maybe they love whatever they do, I don't know. Also, things like voting help with prioritizing: surely a bug with the most "most severe bug" votes gets more attention than the others.

    "Of the 1500 free-willed kernel developers, how many will jump on a issue that doesn’t concern them personally?"

    Maybe not too many; people tend to have their own special fields. Some are experts in filesystems, others are good at schedulers, others do something else.

    And quite a lot of the code is coming from the (hardware) companies themselves. Here are some kernel 2.6.30 statistics:

    http://lwn.net/Articles/334721/

    and the changelog between 2.6.29 and 2.6.30 (warning: a couple of megabytes of plain text):

    http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.30

    * * *

    The install base of Linux is growing steadily and can probably be counted in tens of millions already. Just the latest Fedora version gets downloaded several million times, and I bet Ubuntu with its derivatives is easily more popular than Fedora. Also, a huge number of Debian users are out there – actually Ubuntu is just a prettified snapshot of the Debian testing branch.

    The actual Joe Sixpack is yet to be seen, though. Let's see if Android and upcoming Chrome OS from Google can make a difference.

    Anyway, we have come a long way from the command line and archaic X interfaces, and for bonus points, installing some decent Linux distro for granny (so she can surf the web, see some photos and listen to some music) can actually reduce the number of support calls needed. No sudden popups from some anti-virus software, no spyware, no sudden advertisements from some "free" software, no "Hey! I need an update!" from some single application, since the package manager is updating the whole system. That takes a surprising amount of support burden away.

  6. Jouni Heikniemi - August 11, 2009

    Sorry, but Akismet considered your posts to be potentially spam and sent them over to the approval queue. I was just slow in approving them. I'll get back to answering them at some point :-)

  7. Jaba - August 11, 2009

    Ok, take your time. :-) Poor spammers, they're going through some rough times….

  8. bungle - August 11, 2009

    Well, one might ask: if Windows were open source, how long would it have taken for someone to supply a patch for the chkdsk problem? Microsoft fixed this bug because it was widely discussed on the internet. They do not fix all bugs this way, and there are many cases in history where public pressure on Microsoft was needed to get things fixed.

  9. Jaba - August 13, 2009

    Thanks for making me think about this issue. :-) One more point I would like to make.

    A Linux distribution contains at least hundreds of packages. Everything is built on top of the Unix philosophy: lots of small utils, each and every one of those doing one thing and doing it well. Exceptions are software like OpenOffice.org and to some extent the newer desktop environments, such as KDE 4, even though KDE 4 is very modular.

    My fairly stock Kubuntu 9.04 installation has 1346 packages installed at the moment, a total of around 4 gigabytes (-600 megabytes of backup archives). In addition to the stock install I have only installed a couple of games (Battle for Wesnoth taking about 600 megabytes…), some network/sysadmin utils and a couple of utils related to photo management.

    So _most_ of the software is relatively easy to keep clean and debug. Shells, basic utilities, media players, CD/DVD/Blu-ray burning software, printing software and the like are not very big in size. In case DVD burning says FAIL, it's fairly easy to point out what to blame, and bug reports are easy to assign to the correct maintainers.

    But of course this approach has a dark side, too. This awkward KDE 4 keyboard bug is an example. Is the reason in the X.org event loop? Is the Qt toolkit to blame? Or is there a bug in the KDE libraries? How do you organize people across different projects to work together? Perhaps via IRC meetings? Or some combination of wiki, bug tracker, and/or web forums? Some live meeting?

    (I read the bug one more time and started to think the bug must be related to X.org event loop; someone mentioned that Awesome window manager is doing the same, and both KDE 4 and Awesome are using XCB [ http://xcb.freedesktop.org/ ] which is not tested in the real world very much compared to old xlib. Also one friend of mine who's developing mobile platforms told me they've seen a bug like this where X.org event loop suddenly gets the focus from all the key strokes)

  10. Jaba - August 13, 2009

    Gosh, once again Akismet considers my reply as spam, I think…

  11. Jouni Heikniemi - August 13, 2009

    Yes, it did exactly that. I'm not sure why it hates you. This time it didn't even go into the approval queue, it was downright marked as spam. Perhaps you write too many comments. Approved the comment nonetheless.

  12. Jaba - August 13, 2009

    Akismet must have been reading net.nyt while I was still active there and considers me a Linux fanboy.

  13. Jouni Heikniemi - August 13, 2009

    I think much of this, particularly Jaba's latter posts, revolves around the question of a single point of responsibility versus the distributed package model for Linux.

    Regarding Bungle's comment: If Windows were open source and the bug had been easy to spot, it would certainly have been fixed fast. I agree, and I also agree that Microsoft hasn't always been very responsive in fixing some bugs. In this case, however, no bug was found even through extensive efforts, and I think the point is more about "How would the search have worked, had this been a Linux bug?".

    Returning to Jaba's thoughts around the package platform, I think the same is reflected. Easily pinpointed flaws tend to get fixed promptly, but muddy issues with vague symptoms (non-repro crashes are the prime example), possibly between components, are hard. They're notoriously hard even for Microsoft, where product groups don't necessarily agree on whose side of the fence the problem really is, but even more so for entities who have no common authority to finally resolve the issue.

    I believe the bug resolution methodology for Linux works, but it will get strained once the number of strange phenomena increases. And it will increase, as the diversity of hardware rises. Yet another aspect is software creation: Right now, most newbie programmers work on Windows. Once they start working on Linux, Linux will see the same problems regarding backward compatibility, bug support etc. as Windows has experienced. (Although I think Microsoft has gone even a bit too far, but that's another discussion – a hefty amount of compat work will be necessary in any case.)

    The current download numbers for Linux distributions are not, at least in my opinion, reflective of the problems that occur when non-professional users start using the product en masse. That, exactly, is the point where questions of unified support turn relevant. Whether or not commercial Linux vendors turn up to fill the void remains to be seen.

    And yeah, I'm hoping for the best. For both Linux and Windows. Microsoft needs to get its act together after some of the past blunders, and the open source world needs to get ready for the big consumer break.

  14. Jaba - August 13, 2009

    Hardware diversity rises? Linux already runs on dozens of processor architectures and can be run on anything from a watch to gaming consoles to the largest supercomputers on Earth. It also supports very old hardware, if that's necessary. Mobile platforms are also very well supported, not to forget the embedded market.

    It's not the hardware support I'm worried about, that's pretty much a solved problem.

    Software backward compatibility sure might be an issue, depending on what tools and languages are used. Most of the time Linux/Unix as a coding environment tends to be almost dull, if you use the standard libraries etc., they do not constantly change.

  15. Jouni Heikniemi - August 14, 2009

    Processors and different embedded devices as platforms are not the sort of hardware that causes problems. When somebody decides to run Linux on a PlayStation or DVR, he will make every effort to test the system.

    The real diversity in connected peripherals comes with the consumers, who plug into the computer whatever stuff they happen to find at the local supermarket – or may have found a decade ago.

    A high percentage of Windows bluescreens are caused by driver bugs. No doubt a share of those could be avoided by even better encapsulation (as WVDDM does from Vista onwards), but any kernel mode code has a crash potential.

    That said, I'm not claiming hardware issues would be the main problem for Linux; it's just that they are a considerable factor with any OS. Apple's approach of keeping the hardware business to themselves is a great way to avoid it, but lacks the openness of the PC world that both Linux and Windows represent.

  16. Jaba - August 14, 2009

    Yep, sometimes I'm wondering how today's PC hardware works at all – meaning how it even passes the BIOS POST. It's amazing one can usually just buy some mobo, some CPU, some RAM, GPU and other stuff, slap them together and start using the new computer.

    But this comes down to my earlier point about chipsets and drivers; hardware diversity is (partly) just an illusion. Some random peripheral is usually made from commonly used chipsets and other parts. Quality of parts differs, of course, but sound cards, network cards, wlan adapters, usb memory sticks etc. etc. are not THAT different. Same shit, different logo.

    If Windows finally starts to be 100% 64-bit compatible, the next big test will be seen if and when netbook vendors start to ship more ARM-based netbooks. Will Microsoft let it go and let people run Linux, or will Microsoft somehow make an ARM port of Windows?

  17. Jouni Heikniemi - August 19, 2009

    Also worth checking out: http://www.reddit.com/r/programming/comments/9btf7/fyi_not_so_funny_microsoft_bug_that_hosed_our/

  18. Rosio Harriet - October 9, 2020

    I have to convey my appreciation for your kindness in supporting individuals who need help in this particular field. Your dedication to getting the message across was certainly significant and has allowed people like me to realize their dreams. Your insightful key points mean a great deal to a person like me, and even more to my colleagues. Many thanks from all of us.
