PowerShell Basics #3: Manipulating data in text files

To continue my series of PowerShell Basics posts, I’ll cover some basic features around manipulating text file data.

First off, the problem is really two separate problems: Handling text file IO and manipulating the contents. PowerShell provides decent tools for both, and since everything is just .NET objects, you can run any .NET string manipulations you desire.

Seeing what’s in there: get-content

First, you usually want to see what’s in a file. The type command familiar from DOS still works; it’s now an alias for a cmdlet called Get-Content. Using it is exactly as straightforward as you’d expect: type my.txt shows the contents of my.txt.

But again, “shows” in a very loose sense. Get-Content returns objects, namely the strings that make up the lines of the file. It is the PowerShell UI that actually shows these, and you can manipulate them how you want. For example, to know the number of lines in a file, just take a count:

PS D:\temp> (type my.txt).Count
3

Naturally, this also allows you line-by-line access to the file.

PS D:\temp> $lines = type my.txt
PS D:\temp> $lines[0]
Foo

If you just need the first few lines, you can use the –TotalCount parameter, abbreviated as –t,  as a head (for you Unix-minded) or top (for you SQL people) operator.

PS D:\temp> type my.txt -t 2
Foo
Bar

Note that the –TotalCount parameter is really just an optimization: you could always get the first two lines by slicing the resulting array with an expression like (type my.txt)[0..1] (array indexes start at zero), but that can be really slow with a half-gig log file. With –TotalCount, you only load the necessary lines; with array slicing, you load them all.
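If you want to see the two approaches side by side, here is a minimal sketch. It assumes a throwaway file called gc-demo.txt in your temp directory (a name I made up for the example) standing in for my.txt.

```powershell
# Assumption: gc-demo.txt is a scratch file standing in for my.txt.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-demo.txt'
Set-Content -Path $path -Value 'Foo', 'Bar', 'Baz'

# -TotalCount stops reading after two lines.
$head = @(Get-Content -Path $path -TotalCount 2)

# Without it, the whole file is read into an array, which is then sliced.
$slice = @(Get-Content -Path $path)[0..1]

Remove-Item -Path $path
```

Both variables end up holding the same two lines; the difference only shows in how much of the file had to be read to get there.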

Writing stuff: set-content, add-content

Typically, your scripts may want to write data into a text file. There are a few approaches to this. First of all, if you just want to output a few lines of text to a file, use the Set-Content cmdlet, which really has no equivalent in the cmd.exe world. It takes a file name and a group of objects, then writes those objects into the file. In the example below, we just pass it an implicitly created string array.

PS D:\temp> Set-Content animals.txt "cat", "dog", "giraffe"
PS D:\temp> type animals.txt
cat
dog
giraffe

If you want to add new lines to the end of an existing file, we have Add-Content which works the same way, but appends new lines. Therefore:

PS D:\temp> Add-Content animals.txt "elephant", "cow"
PS D:\temp> type animals.txt
cat
dog
giraffe
elephant
cow

You can also use piping and array operations to edit the files. For example, to truncate a text file to just its four last lines (an operation perhaps valid for a log file), you could do:

PS D:\temp> (type animals.txt)[-4..-1] | set-content animals.txt
PS D:\temp> type animals.txt
dog
giraffe
elephant
cow

As you see, the cat got cut out. The “-4..-1” thing is again a feature of PowerShell array indexing: negative indices refer to the elements of an array from the end of the array, i.e. [-4..-1] means “a range of elements from the fourth-last to the last one”.
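The truncation trick can be wrapped into a small reusable function. The sketch below is only one way to do it: Limit-FileTail is a hypothetical name of my own, and the scratch file is there just to make the example self-contained. As far as I can tell, the parentheses in the one-liner above matter because they make PowerShell finish reading the file before Set-Content starts overwriting it; the function gets the same effect by reading into a variable first.

```powershell
# Hypothetical helper: keep only the last $Count lines of a text file.
function Limit-FileTail {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [ValidateRange(1, 2147483647)][int]$Count = 4
    )
    # Read everything first, so the file is closed before it gets overwritten.
    $lines = @(Get-Content -Path $Path)
    if ($lines.Count -gt $Count) {
        $first = $lines.Count - $Count
        $lines[$first..($lines.Count - 1)] | Set-Content -Path $Path
    }
}

# Assumption: a scratch copy of the animals file in the temp directory.
$log = Join-Path ([System.IO.Path]::GetTempPath()) 'animals-demo.txt'
Set-Content -Path $log -Value 'cat', 'dog', 'giraffe', 'elephant', 'cow'
Limit-FileTail -Path $log -Count 4
$kept = @(Get-Content -Path $log)
Remove-Item -Path $log
```

Files that already have four lines or fewer are left untouched.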

Sorting the data

When you need the data sorted, you want the Sort-Object cmdlet. It can do a whole lot more for you as well, but for sorting text, very simple things suffice. To get the items into dictionary order, just pipe the whole thing to sort (an alias to Sort-Object).

PS D:\temp> type animals.txt | sort
cow
dog
elephant
giraffe

If you need descending order, just apply –descending (or –desc for short). Also, PowerShell supports culture-specific sorting and defaults to your thread culture. For example, if you need sorting by German rules, add a –culture de-DE. To force sorting by the US English rules, –culture en-US does the trick.

Also, since Sort-Object actually sorts the objects (strings), you can also sort by any property of the System.String class. The most likely application for this is sorting by string length:

PS D:\temp> type animals.txt | sort Length
cow
dog
giraffe
elephant
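A short sketch of the sorting options on an in-memory array (the same four animals as in the file). The tie-breaking script block is my own addition: when two strings have the same length, I would not rely on their relative order unless you name a second sort key.

```powershell
$animals = 'dog', 'giraffe', 'elephant', 'cow'

# Reverse dictionary order.
$descending = $animals | Sort-Object -Descending

# Sort by length first; the script block { $_ } breaks ties alphabetically.
$byLength = $animals | Sort-Object -Property Length, { $_ }
```

Here $descending starts with giraffe and ends with cow, and $byLength lists cow and dog before the longer names.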

The variable assignment syntax

There is one more trick to learn here. While the previous syntax is reasonably clean, there are a few scenarios where an even more terse syntax makes sense. PowerShell allows you to access the contents of text files as if they were variables. The catch is that it requires you to use the full path.

Anyway, you want to get the count of lines in your hosts file? Just type

${c:\windows\system32\drivers\etc\hosts}.Count

and you’re set.

What if you need to load the animals, append a few plants and throw the stuff into living.txt?

PS D:\temp> $living = ${d:\temp\animals.txt}
PS D:\temp> $living += "rose"
PS D:\temp> $living += "pineapple"
PS D:\temp> ${d:\temp\living.txt} = $living
PS D:\temp> type living.txt
dog
giraffe
elephant
cow
rose
pineapple

There’s a lot more you can do. I’ll get to regular expressions, grouping and whatnot quite soon, so stay tuned!

February 5, 2010  Tags:   Posted in: .NET  No Comments

SQL Server Management Studio 2008 and table re-creation

“Saving changes is not permitted. The changes that you have made require the following tables to be dropped and re-created.”… What? I was just changing the null constraint on a column, why are you talking about dropping a table?

When it hits you the first time, you’re probably confused. The fix itself is easy: Just open Tools / Options and select Designers / Table and Database Designers from the tree, then uncheck “Prevent saving changes that require table re-creation”.

But they didn’t add this warning for nothing. KB956176 describes the phenomenon and the risks involved. There are two major things:

First, when a table gets recreated, that is literally what happens. If you look at the operation in SQL Profiler, you see that a new temporary copy of the table gets created, all the data is transferred by an INSERT INTO … SELECT statement, the original table is dropped and the new one is then renamed back to the original name. If you have loads of data, this will take a while. Oh, and you had some indexes? They’ll get recreated and repopulated, too.

Second, if you use SQL Server Change Tracking, deleting and recreating the table wipes out the change tracking information. Depending on how heavily your application relies on CT, the effect may range from a minor nuisance to a total disaster.

When do I need a table rebuild?

Surprisingly often. Although you can add new columns to the end of the table, adding them anywhere else in the table structure (such as between other columns) requires a rebuild. If you change the Identity specification (“autonumbering”) of a field, that warrants a rebuild. Perhaps the most common case is when you change the data type or the null allowance on a column.

So yes, it happens more often than you think.

Why hasn’t this been popping out for years already?

Because older versions of SQL Server Management Studio (and before that, Enterprise Manager) allowed you to save changes that transparently went on to re-create the tables. It was only in SQL Server 2008 that this behavior changed.

And why? Well, before 2008 it was only a slowdown and a (potentially long) service break while the table was rebuilt. With Change Tracking added, the risk of actual data loss is now tangible. Hence the change in policy.

January 29, 2010  Tags:   Posted in: Misc. programming  No Comments

PowerShell Basics #2: Dir for power users

You would think the standard dir command in cmd.exe does its job adequately, but it’s still surprising how much more functionality you can pack into it. Here’s an introduction to power-using Get-ChildItem, also known by the aliases of dir and ls.

Leveraging FileInfo and DirectoryInfo

For starters, it is important to remember that dir returns objects. Namely, it returns objects of the abstract base type System.IO.FileSystemInfo. The concrete types are FileInfo and DirectoryInfo for files and directories, respectively. Most of the time you will not care about these type issues, though – both concrete types have the same key properties, and you can use them to filter the objects.

So, I can get all my PDC 2009 slides into an array variable by typing $decks = dir D:\Slides\Pdc2009.

Now which of those beasts exceed 10 megs in size?

PS D:\temp> $decks | where { $_.Length -gt 10MB }

    Directory: D:\slides\pdc2009

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---        17.12.2009     12:49   19985726 VTL04 - Rx, Reactive Extensions for .NET.pptx

Umm, ok. Did any directories leak in?

PS D:\temp> $decks | where { $_ -is "System.IO.DirectoryInfo" }

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----        17.12.2009     13:34            curl
d----         28.1.2010     11:24            workshops

Now that we know the object model, let's look into what the various arguments in Get-ChildItem can actually do.

Multiple targets

The traditional dir was limited to listing one directory at a time. So much for getting a single list of all the files in your drives’ root directories. With PowerShell, this is no longer a problem. The –Path parameter (which you almost never refer to by name, since it is the first positional argument) takes an array of strings:

PS D:\temp> dir MyDir, MySecondDir

    Directory: D:\temp\MyDir

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         28.1.2010     11:39         22 Demo.txt

    Directory: D:\temp\MySecondDir

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         28.1.2010     11:47      18588 Contents.html

Although these files are shown as two separate listings, it’s all just visual. If you do $files = dir MyDir, MySecondDir, you get back an array of FileSystemInfos (of two elements in the case above). The fact that the directory header gets printed into the list doesn’t mean they would be separated in the actual array – it’s just the output formatter that does the trick.
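To check that the result really is one flat array, you can try something like the following sketch. The two directories are created under the temp directory just for the demonstration; their names mirror the listing above.

```powershell
# Assumption: two scratch directories with one file each.
$base   = Join-Path ([System.IO.Path]::GetTempPath()) ('multi-demo-' + [guid]::NewGuid())
$first  = Join-Path $base 'MyDir'
$second = Join-Path $base 'MySecondDir'
New-Item -ItemType Directory -Path $first, $second | Out-Null
Set-Content -Path (Join-Path $first 'Demo.txt') -Value 'demo'
Set-Content -Path (Join-Path $second 'Contents.html') -Value '<html></html>'

# One call, two targets, one array.
$files = @(Get-ChildItem -Path $first, $second)
$names = @($files | ForEach-Object { $_.Name } | Sort-Object)

# Each FileInfo still knows where it came from.
$parents = @($files | ForEach-Object { $_.Directory.Name } | Sort-Object)

Remove-Item -Path $base -Recurse
```

The Directory property is what the formatter uses to print those "Directory:" headers, so you can group by it yourself if you need the separation back.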

Recursion and filtering

Dir has always supported recursion, but supporting recursion and filtering side-by-side can sometimes be a bit troublesome. For example, the traditional dir had a /s switch for subdirectory recursion: dir C:\ /s listed all the files on your C drive.

It also allowed you a rudimentary approach to file filtering. You could do dir c:\windows\*.exe /s to list all the exe files under the Windows directory and its subdirectories, but the syntax was somewhat unintuitive. Behind the scenes, dir splits the path argument into two elements: where and what to search for (C:\Windows and *.exe, respectively). While handy, it makes certain scenarios hard.

For example, listing all the exe and zip files would require two listings. And what if you needed directory wildcarding, say “Find me all the jpg files in directories matching C:\Images\2008*”?

PowerShell resolves much of this by keeping the two concepts apart from one another. With PowerShell, you have –recurse (usually abbreviated as –r) for recursion, and the path argument specifies the locations you need to search files in. When you need filename filtering, you apply –include (or just –i), which takes an array of wildcard-enhanced masks.

So you want a list of all executable and zip files under your program files directories, even if you had a separate Program Files (x86) as you usually do on 64-bit systems? Do a dir –r "C:\Program Files*" –i *.exe, *.zip.

You also have the liberty of excluding stuff. Say, you didn't want exe and zip files that are actually installers? Throw in a -exclude setup.exe,installer.exe. Of course, feel free to abbreviate to -ex and use wildcards.
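Here is the same combination on a small scratch tree, so you can try it without crawling through Program Files. The directory layout and file names are invented for the sketch.

```powershell
# Assumption: a scratch tree with a few empty-ish files.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ('gci-demo-' + [guid]::NewGuid())
$sub  = Join-Path $root 'sub'
New-Item -ItemType Directory -Path $sub | Out-Null
foreach ($name in 'app.exe', 'notes.txt') {
    Set-Content -Path (Join-Path $root $name) -Value 'x'
}
foreach ($name in 'data.zip', 'setup.exe') {
    Set-Content -Path (Join-Path $sub $name) -Value 'x'
}

# Recurse, keep exe and zip files, drop the installer.
$found = Get-ChildItem -Path $root -Recurse -Include *.exe, *.zip -Exclude setup.exe
$names = @($found | ForEach-Object { $_.Name } | Sort-Object)

Remove-Item -Path $root -Recurse
```

Only app.exe and data.zip survive: notes.txt fails the include masks and setup.exe is knocked out by the exclusion.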

Filtering, more speed

The suggestion I gave above for using –include is sound advice, and works every time. However, there is one drawback to it: it can be a bit slow. Enter the –filter parameter, which takes a file mask (like *.exe), and performs the filtering at the file system level, producing the output far faster.

The reasons for this dual parameter set are beyond the scope of this article, but suffice to say that –filter is a faster but more limited option. In practice, you’ll probably encounter the fact that a filter only takes one filtering expression. If you need several, use the –include parameter. Also, you cannot represent an –exclude as a –filter.

A syntax shortcut helps in using -filter: it is the second positional parameter in a Get-ChildItem call. The first one is naturally the path. Therefore, you may omit the parameter name and simply type dir C:\Windows *.exe to get all the .exe files in your Windows directory. If you type dir C:\Windows\*.exe (notice the extra backslash), you’re not passing a filter, but just one path. That’s fine, but usually a bit slower – it generally performs equally to specifying the file mask as an –include parameter.

Of course, you’re unlikely to spot the difference in practice unless you have a very large number of files. When you do, the filter will often save the day.
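The three spellings can be compared on a scratch directory like this. It is only a sketch of the equivalence; with three files you will not measure any speed difference, so wrap the calls in Measure-Command against a big directory of your own if you want numbers.

```powershell
# Assumption: a scratch directory with two exe files and one text file.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ('filter-demo-' + [guid]::NewGuid())
New-Item -ItemType Directory -Path $root | Out-Null
foreach ($name in 'a.exe', 'b.exe', 'c.txt') {
    Set-Content -Path (Join-Path $root $name) -Value 'x'
}

# Filtering done by the file system provider.
$viaFilter = @(Get-ChildItem -Path $root -Filter *.exe)

# Same thing, relying on -Filter being the second positional parameter.
$viaPosition = @(Get-ChildItem $root *.exe)

# Filtering done by PowerShell; note the wildcard at the end of the path.
$viaInclude = @(Get-ChildItem -Path (Join-Path $root '*') -Include *.exe)

Remove-Item -Path $root -Recurse
```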

Filtering with regular expressions

The Get-ChildItem cmdlet doesn’t support regular expressions, but you can use them by piping the resultant objects through an appropriate filtering expression. For example, in order to find out the slide decks that have a dual vowel (“aa”, “ee”, and so on) in their names, you might type:

dir d:\slides\pdc2009 | where { $_.Name -match '([aeiouy])\1' }

This would net you a set of presentations on IIS, toolkits and various deep dives.
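The same expression works on plain strings, which makes it easy to test a pattern before pointing it at a directory. The deck titles below are made up. One thing worth knowing: -match ignores case by default, so "II" in "IIS" counts as a dual vowel; -cmatch is the case-sensitive variant.

```powershell
# Made-up deck titles standing in for real file names.
$names = 'IIS overview.pptx', 'Toolkit tour.pptx', 'Deep dive.pptx', 'Rx.pptx', 'Azure.pptx'

# Case-insensitive: matches II, oo and ee.
$doubled = @($names | Where-Object { $_ -match '([aeiouy])\1' })

# Case-sensitive: the upper-case II no longer matches the lower-case class.
$strict = @($names | Where-Object { $_ -cmatch '([aeiouy])\1' })
```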

Just the names?

In the cmd.exe times, we used to do dir /b to just print out the names of the files (/b stands for bare). The Get-ChildItem cmdlet has this one too, and it’s called –name (or just –n). While you’ll probably want to work with the objects when in PowerShell, interfacing with non-object-oriented tools often requires passing the path strings around.

There is one more thing here. –name produces a list of names relative to the root directory of the directory listing, which may contain paths like “System32\azroles.dll” for a directory listing that originated from the C:\Windows directory. While this is often fine, you may also need the full paths. For this, use pipelines to convert the objects into their full paths.

dir c:\windows –r | foreach { $_.FullName }
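A small sketch of the difference between the two outputs, again on a scratch tree whose names I borrowed from the example above:

```powershell
# Assumption: a scratch tree with one subdirectory and one file in it.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ('name-demo-' + [guid]::NewGuid())
$sub  = Join-Path $root 'System32'
New-Item -ItemType Directory -Path $sub | Out-Null
Set-Content -Path (Join-Path $sub 'azroles.dll') -Value 'x'

# Plain strings, relative to the directory being listed.
$relative = @(Get-ChildItem -Path $root -Recurse -Name)

# Full paths, extracted from the objects.
$full = @(Get-ChildItem -Path $root -Recurse | ForEach-Object { $_.FullName })

Remove-Item -Path $root -Recurse
```

Note that both lists contain the subdirectory itself as well as the file inside it.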

I think that’s about enough for one of the most basic commands, dir. Enjoy!

January 28, 2010  Tags:   Posted in: .NET, Windows IT  No Comments

PowerShell Basics #1: Reading and parsing CSV

I will be giving a talk on the topic of “PowerShell for Developers” at TechDays 2010 in Helsinki, Finland. As a warm-up to my presentation, I will be publishing a series of blog posts on various aspects of PowerShell. My viewpoint is specifically one of developer utility: What can PowerShell do to make a dev’s life easier?

I want to start with something that touches on data. Often, developers receive data in Excel format – usually in order to then import it into a database. Reading data from Excel is somewhat painful, but fortunately, Excel allows for easy saving to the CSV format. PowerShell, on the other hand, provides for several quite easy manipulations.

Simple imports with Import-Csv

Let’s start with a simple CSV file, customers.csv:

ID,Name,Country
1,John,United States
2,Beatrice,Germany
3,Jouni,Finland
4,Marcel,France

Turning this into objects in PowerShell is very straightforward:

PS D:\temp> import-csv customers.csv 

ID Name     Country
-- ----     -------
1  John     United States
2  Beatrice Germany
3  Jouni    Finland
4  Marcel   France

As you can see, the first line in the text file gets parsed as a header, specifying the names on the PowerShell objects. The import doesn’t have a notion of strong typing; therefore, all the properties are imported as pure text. Often this is enough, but if it isn’t, look below…

Headerlessness and other cultures

There are a few scenarios where this won’t work. For example, if your CSV doesn’t have headers, you would get objects with column names such as “1”, “John” and “United States”. Lacking headers, you can supply them as a parameter:

import-csv .\customers.csv -header ID,Name,Country

That was easy (but don’t do it when your data has headers, or you end up duplicating them).
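To see the header parameter in action without touching your own data, here is a sketch that writes a headerless file into the temp directory first (the file name is invented):

```powershell
# Assumption: a scratch CSV file without a header line.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'customers-noheader.csv'
Set-Content -Path $path -Value '1,John,United States', '2,Beatrice,Germany'

# Supply the column names ourselves.
$rows = @(Import-Csv -Path $path -Header ID, Name, Country)

Remove-Item -Path $path
```

Both lines come back as data rows, and the ID values are still strings, as with any CSV import.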

Well then, perhaps you live in a region where the field separator isn’t the usual comma? This is no problem to PowerShell, either:

PS D:\temp> type customers-fi.csv
ID;Name;Country
1;John;United States
2;Beatrice;Germany
PS D:\temp> import-csv .\customers-fi.csv -delimiter ';'

ID Name     Country
-- ----     -------
1  John     United States
2  Beatrice Germany

If you know the file originated from your current culture, you can just dispense with the delimiter specification and type import-csv –useCulture customers-fi.csv. That will pick up the delimiter (the list separator) from your Windows regional settings.
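If you are curious which separator your session would use, you can ask the culture object directly. The sketch below builds a file with whatever that separator happens to be and reads it back; the file name is made up, and I am assuming the separator is an ordinary single character such as a comma or a semicolon.

```powershell
# The list separator of the current culture, e.g. ',' or ';'.
$sep = (Get-Culture).TextInfo.ListSeparator

# Assumption: a scratch file written with that separator.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'customers-culture.csv'
Set-Content -Path $path -Value ('ID', 'Name', 'Country' -join $sep), ('1', 'John', 'United States' -join $sep)

# No -Delimiter needed; the culture supplies it.
$rows = @(Import-Csv -Path $path -UseCulture)

Remove-Item -Path $path
```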

When your CSV ain’t a file…

Often you get your CSV data in a file. Occasionally, you might download it through HTTP, or even pull it from a database. No matter how, you may end up with an array of strings that contains your CSV data. The Import-Csv cmdlet reads a file, but if you need to parse the data from another source, use ConvertFrom-Csv.

PS D:\temp> $csv = 'ID,Name,Country
>> 1,John,United States
>> 2,Beatrice,Germany'
>>
PS D:\temp> ConvertFrom-Csv $csv

ID Name     Country
-- ----     -------
1  John     United States
2  Beatrice Germany

As far as the culture switches go, everything discussed above also applies to ConvertFrom-Csv.
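For instance, with the data already sitting in an array of strings, the delimiter and header parameters can be used like this (a minimal sketch with inline sample data):

```powershell
# Semicolon-separated lines, as they might arrive from an HTTP download.
$lines = 'ID;Name;Country', '1;John;United States', '2;Beatrice;Germany'
$rows = @($lines | ConvertFrom-Csv -Delimiter ';')

# A single headerless line, with the column names supplied by hand.
$noHeader = @('3,Jouni,Finland' | ConvertFrom-Csv -Header ID, Name, Country)
```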

How about CSV content oddities?

There are some uses of CSV that veer away from the normal, safe path. The first and a reasonably common scenario is having the field delimiter in the data, something that is usually handled by quoting the field. Of course, up next is the scenario where a field contains the quotation mark.

And finally, there is the really controversial aspect of having a newline in a CSV field. Many parsers struggle with this, and in fact, the correct behavior isn’t exactly clear. Of course, for practical purposes, anything that makes Excel exports work correctly is usually good. But let’s look at an example that contains all of these anomalies (the original content for this is shown in the Excel screenshot to the right).

Number,String,Multiline,Date
512,"Comma,Rocks","Line 1
Line 2",15.1.2010 15:14
57,"Cool ""quotes""","First
Second",7.1.2010 9:33

The data is split across five lines, but actually contains two records and a header. This alone is somewhat controversial, given CSV’s starting point of one line per record. Anyway, it often still needs to be parsed, and PowerShell does a good job here:

PS D:\temp> Import-Csv .\oddstrings.csv

Number String        Multiline     Date
------ ------        ---------     ----
512    Comma,Rocks   Line 1...     15.1.2010 15:14
57     Cool "quotes" First...      7.1.2010 9:33

There is one key thing to notice. Import-Csv works, because it treats the data source as a single whole. However, ConvertFrom-Csv misparses the multiline fields, as it handles the input line-by-line.

PS D:\temp> type .\oddstrings.csv | ConvertFrom-Csv

Number  String          Multiline Date
------  ------          --------- ----
512     Comma,Rocks     Line 1
Line 2" 15.1.2010 15:14
57      Cool "quotes"   First
Second" 7.1.2010 9:33
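The table view truncates the multiline fields to "Line 1...", so it is worth looking at the properties themselves to confirm that Import-Csv kept both lines. A sketch that recreates the file in the temp directory (under a made-up name) and inspects the result:

```powershell
# Assumption: a scratch copy of the odd CSV content shown above.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'oddstrings-demo.csv'
$csv = @'
Number,String,Multiline,Date
512,"Comma,Rocks","Line 1
Line 2",15.1.2010 15:14
57,"Cool ""quotes""","First
Second",7.1.2010 9:33
'@
Set-Content -Path $path -Value $csv

$rows = @(Import-Csv -Path $path)

# Split the embedded newline, whichever line-ending style the file ended up with.
$parts = @($rows[0].Multiline -split "\r?\n")

Remove-Item -Path $path
```

The quoted comma and the doubled quotation marks come through as plain characters in the String property.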

Strong typing, then?

For this, there are no pre-cooked solutions. But once your data is imported correctly, separators, multiline fields and all, it’s rather easy to just typecast the stuff, providing you input validation at the same time. Consider this CSV file of event descriptions:

StartsAt,Title,Venue
9.3.2010,TechDays 2010,Helsinki
15.2.2010,Mobile World Congress,Barcelona

Next, you want to filter the data to show just the events occurring within the next 30 days. For this, you'll want the datetimes parsed into System.DateTime objects.

PS D:\temp> $events = Import-Csv event.csv | foreach {
  New-Object PSObject -prop @{
    StartsAt = [DateTime]::Parse($_.StartsAt);
    Title = $_.Title;
    Venue = $_.Venue
  }
}
PS D:\temp> $events                            

StartsAt          Venue     Title
--------          -----     -----
9.3.2010 0:00:00  Helsinki  TechDays 2010
15.2.2010 0:00:00 Barcelona Mobile World Congress

Now, filtering the list is a snap.

PS D:\temp> $events | where { $_.StartsAt -lt (get-date).AddDays(30) }

StartsAt          Venue     Title
--------          -----     -----
15.2.2010 0:00:00 Barcelona Mobile World Congress

One more thing to notice here: In the example above, it worked because the date format (“15.2.2010”) was in Finnish and I happened to run the PowerShell in a thread with the Finnish culture. However, if your data happens to come from a culture different than your UI, you need to pass in the correct culture specifier. For example, to parse a date in the US locale, use the following:

$usCulture = [System.Globalization.CultureInfo]::CreateSpecificCulture("en-US")
[DateTime]::Parse("04/07/2009", $usCulture)

Note that specifying the culture explicitly is always a good idea if you plan on saving the script and reusing it later. Although you might pay attention to the parsing details at the time of writing, it is quite conceivable for someone to run the script later on under a different culture. As dates are prone to silent misparsing (for example, is 04/07/2009 7th April or 4th July?), you could end up manipulating incorrect data.

In addition to dates, you’ll want to look at cultures when parsing decimal numbers and such. Remember: the –UseCulture switch on Import-Csv only applies to the separator settings, not the optional parsing phase.
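If you want the typecasting step to behave the same no matter who runs it, one option is to spell out the expected format as well as the culture. The sketch below uses DateTime.ParseExact with the invariant culture; that is my substitution for the plain Parse call above, not the only way to do it, and the 'd.M.yyyy' format string is an assumption about how the dates in the file are written.

```powershell
$invariant = [System.Globalization.CultureInfo]::InvariantCulture

# Inline sample data, same as the event file above.
$lines = 'StartsAt,Title,Venue',
         '9.3.2010,TechDays 2010,Helsinki',
         '15.2.2010,Mobile World Congress,Barcelona'

$events = @($lines | ConvertFrom-Csv | ForEach-Object {
    New-Object PSObject -Property @{
        # Day.Month.Year, regardless of the culture of the current session.
        StartsAt = [DateTime]::ParseExact($_.StartsAt, 'd.M.yyyy', $invariant)
        Title    = $_.Title
        Venue    = $_.Venue
    }
})

# Real DateTime values sort chronologically, not alphabetically.
$sorted = @($events | Sort-Object StartsAt)
```

A date that does not fit the format throws an exception instead of being silently misread, which doubles as input validation.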

Enjoy!

January 22, 2010  Tags: ,   Posted in: .NET  No Comments

Creating custom types in PowerShell, revisited for v2

In June, I blogged on creating custom types within PowerShell. In PowerShell v2, released with Windows 7 and Windows Server 2008 R2, things are a bit easier.

Now, the syntax for adding properties is far nicer, as you can pass a hashtable of property values to New-Object. Namely, the previous example of enumerating the applications run at startup could be more cleanly written as:

$runkey = Get-Item 'HKLM:\Software\Microsoft\Windows\CurrentVersion\Run'
$values = Get-ItemProperty $runkey.PSPath
foreach ($app in $runkey.Property) {
    $result = New-Object PSObject –prop @{
       Application = $app;
       Path = $values.$app
    } 
    Write-Output $result
}

Compared to the syntax of using separate add-member calls or even using the select-object shortcut as described earlier, this is much cleaner. Of course, it can also be compressed on a single line if so desired.
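The same pattern works for any data, not just the registry. In the sketch below, a plain hashtable with invented application names and paths stands in for the Run key. Since a hashtable makes no promise about the order of its keys, I pipe the objects through Select-Object when the column order matters.

```powershell
# Hypothetical data standing in for the registry values.
$startup = @{ Editor = 'C:\Tools\edit.exe'; Sync = 'C:\Tools\sync.exe' }

$result = @(foreach ($app in $startup.Keys) {
    New-Object PSObject -Property @{
        Application = $app
        Path        = $startup[$app]
    }
})

# Fix both the row order and the column order for display.
$ordered = @($result | Sort-Object Application | Select-Object Application, Path)
```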

January 20, 2010  Tags:   Posted in: .NET  2 Comments

VSPaste is great – but vile

VSPaste is a Live Writer plugin I have been using to help in pasting code segments from Visual Studio to my blog. It’s been working great – until I noticed it’s polluting my blog with hidden links to itself.

VSPaste allows me to copy code in Visual Studio and just hit the paste button. And hey presto, code appears on my blog, with syntax highlighting and all. I was happy all along, until I noticed that each of the pasted segments actually has a hidden link to the plugin’s home page.

When I paste code like this:

static void Main(string[] args)
{
    int foo = 5;
}

I actually get a fragment of HTML like this:

<pre class="code"><span style="color: blue">static void </span>  Main(<span style="color: blue">string</span>[] args)
{     <span style="color: blue">int </span>foo = 5;
}</pre>
<a href="http://11011.net/software/vspaste"></a>

The hidden link with no text doesn’t appear every time, but often it does.

Why and why not?

Why: Google Juice, of course. Having lots of links pointing at the site is a good way to improve your ranking, although perhaps not in practice: search engines have become quite a bit more intelligent in weeding out fake links, so the actual effect is somewhat more dubious.

Why not: First off, I don’t, as a matter of principle, want to have uninvited links in my blog content. Second, what would happen if search engines actually tracked the empty link and then, due to whatever circumstances, deemed that page spam, illegal or whatever? Such measures also tend to negatively affect the rank of pages linking to it, possibly even totally removing them from search results.

This sort of risk could potentially wipe out all the blogs using VSPaste from the search engine results – a massive loss to those who make their living out of blogging. Although unlikely to happen in practice, it shows the problem in a harsh light. Actually, it is exactly this sort of scenario that has led to the practice of having rel=”nofollow” on links in blog comments.

Plug-in authors should be careful not to mess with the content they help in producing.

PS. In order to work against the effect of all the hidden links, this article doesn’t contain a link to the VSPaste plugin.

January 16, 2010   Posted in: General  No Comments

OpenOffice.org and Microsoft Office: A serious threat for the empire?

The blog world is abuzz on a Microsoft job posting. The US subsidiary looking for a “Linux and Open Office Compete Lead” – and a team of 13 people – seems to signal a meaty victory for the OS crowd, as it implies Microsoft is taking OpenOffice seriously. Or does it?

As I pointed out in the comments of one of the longer posts on the subject, a dozen people really isn’t that much when you consider the fact that the Office business grinds money at $15 billion a year. But I do agree that it’s a change: It’s a public admission that customers actually have valid alternatives and that the situation warrants discussion, also from Microsoft’s end.

Fair enough. Were OOo and Linux powered by companies, Microsoft would probably buy them out of the market. The beauty of Open Source is that it cannot be bought away or controlled. It cannot be stopped by simply turning a few shareholders rich. Instead, market superpowers must fight the OS threat by investing their capital more constructively: making their own solutions better and learning to justify their cost structure. All this advances the state of things by far more than behind-the-scenes stock trading.

But does this imply a serious threat to Microsoft Office? I don’t think so. OOo will steal market share, that’s for sure. But my personal opinion is that it’s not competitive with Microsoft’s offering yet, and its true long-term TCO still remains to be measured. Still, without a grain of doubt, it’s extremely important to have competition.

As for Linux on the server-side, it’s a whole lot different. The openness of the Linux Server paradigm has already catalyzed a change in Windows Server and driven Windows Azure towards a more platform-agnostic model of thought. With the amount of serious large-scale backers Linux has, it is no longer bound to the typical limitations of an OS project. I expect the starting decade to be great.

So what does this Compete Lead hiring mean? I’d say it means that Microsoft is slowly getting rid of its arrogance. Focusing a bunch of people to actually think about the competitive losses and improve on what they do is exactly what a responsible business should do. Is it a win for the OS movement? In a recognition sense, maybe. Businesswise, I think those 13 people aren’t going to make the Linux and OpenOffice.org march any easier.

As for the future, I’m hoping that we’ll see similar hires for Google’s cloud offering in both the infrastructure and app suite segments. Meanwhile, happy new year!

January 5, 2010  Tags: ,   Posted in: General  No Comments

What is an "open" API anyway? (case YouTube / TotLol)

TotLol is a membership-based site that aggregates YouTube content for kids. What’s interesting is its background story and how it went from being ad-based to almost non-existent to membership-based.

The author’s version of the story is interesting. Harshly compressed: He claims to have created a service that was one of the first on YouTube APIs. Then Google gradually and suspiciously changed the Terms of Use to cut out the business from TotLol. The author claims this is because Google wanted to steal his idea.

So what’s going on?

Is the story true or false? Tinfoil hats on? Hard to tell.

However, the story does carry an important message: An API may be technically solid, but business conditions can still wreck a perfectly good app. As long as the terms of use remain as vague as they often are, broadly co-operative offerings such as content aggregation are a risk.

Some have considered this a story of Google’s evilness. I wouldn’t go that far. But it’s certainly a reminder: A great service company may be an abysmal platform company. The mental model for providing stable platforms for building business is vastly different from providing hip and cool services for users.

A platform is a turtle, services can be rabbits

Somehow, in my head all of this adds up to the general discussion on speed of change and business agility. For example, Microsoft is very much stuck on supporting IE 6 on Windows XP, even up to the forthcoming years when it will be even more massively outdated for the web. That blows, but it’s one part of a strategy that has helped business thrive on the Microsoft platform.

Licensing terms, pricing and product structures have changed, but slowly enough to keep most clients on board. Upgrades are offered and sometimes even required, but in spite of that, the Microsoft platform keeps rolling on. It does so for equipment manufacturers, software companies, training consultants and everybody else. While the Redmond-based economy certainly has its flaws, it’s quite an achievement to actually have that sort of a critical mass – and to have had it for so many years.

Let it be said out loud: Assuming Google actually did all that maliciously, Microsoft could have done the same, particularly in the past years. I’m not discussing the relative evilness of these two companies. There is a marked difference in the service/platform orientation though, and I expect it to play more and more of a role as all the cloud hoopla really hits the mainstream.

December 30, 2009  Tags: ,   Posted in: Web  No Comments

ReaderWriterLockSlim performance

A while ago I blogged about the performance of various thread synchronization primitives. Due to the insufficient accuracy of my memory cells, I left ReaderWriterLockSlim out of the comparison. Let that be fixed here and now.

The comparison method is still the same, and I have amended the previous post with the results of the Slim version. To summarize:

The Slim version performs significantly better, at approximately 34% of the time it takes for the older version ReaderWriterLock. Below is a duplication of the table containing the relative execution times from my other post. So, the ReaderWriterLockSlim beats its “full” sibling hands down, but is still considerably slower than using Interlocked, and loses somewhat to the Monitor.

Method                     Execution time
------                     --------------
Non-locking                             1
lock statement / Monitor               18
ReaderWriterLock                       93
ReaderWriterLockSlim                   32
Interlocked                             8

Also: The Slim version exhibits the same characteristics re lock type as the ReaderWriterLock: Acquiring a Reader lock takes the same time as acquiring a Writer one. Acquiring an Upgradeable reader lock is also equally fast, but upgrading takes roughly the same time as acquiring a full lock, putting a Read+Upgrade cycle at approximately 65 in the table above.
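For reference, this is what a Read+Upgrade cycle looks like in terms of the API. The sketch is written in PowerShell against the .NET class directly and only demonstrates the call sequence on a single thread; it says nothing about timing, and the counter variable is a made-up stand-in for whatever state the lock would protect.

```powershell
$rw = New-Object System.Threading.ReaderWriterLockSlim
$counter = 0

# Step 1: take the upgradeable read lock to inspect the shared state.
$rw.EnterUpgradeableReadLock()
try {
    $heldAsReader = $rw.IsUpgradeableReadLockHeld
    if ($counter -eq 0) {
        # Step 2: upgrade to a write lock only when a change is needed.
        $rw.EnterWriteLock()
        try {
            $heldAsWriter = $rw.IsWriteLockHeld
            $counter = 1
        }
        finally {
            $rw.ExitWriteLock()
        }
    }
}
finally {
    $rw.ExitUpgradeableReadLock()
}

$stillHeld = $rw.IsUpgradeableReadLockHeld -or $rw.IsWriteLockHeld
$rw.Dispose()
```

The try/finally pairs are there so the lock is released even if the guarded code throws.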

I strongly urge everyone to read the performance notes in the previous post before making conclusions based on these numbers. The fact that the ReaderWriterLock is slower than a lock statement doesn’t mean you should use lock statements in your real-world apps. For example, the benefit of allowing multiple simultaneous readers might well offset the slight impact of acquiring the lock.

December 29, 2009   Posted in: General  One Comment

UTF-8 preamble is a problem when you concatenate files

You’re just changing a couple of words in an XML file with Notepad. Your data modifications are guaranteed to be valid by schema. That couldn’t possibly break anything, could it?

<insert the ugly buzzer sound>

It quite likely couldn’t, unless you were editing an XML file that happened to be using UTF-8. Because while Notepad certainly looks like a very innocent, raw data text editor, it really isn’t when it comes down to UTF-8 encoding.

Files encoded in UTF-8 can contain a byte-order mark (BOM), also known as a preamble or a signature. It consists of the bytes 0xEF, 0xBB and 0xBF right at the start of the file, and identifies the encoding of the text file. If you ever see “ï»¿”, it’s the usual visual interpretation of an unparsed BOM, although other character sets can lead to other kinds of misrepresentations.

Why is this a problem?

Normally, it’s not. Most modern UTF-aware consumers (XML parsers, text editors etc.) understand the BOM just fine, although some problems exist particularly in Unix environments. But if files get concatenated together as binary, the BOM gets embedded in the middle of the file – turning into just normal data.

So, we had a strange application that somebody had written a long time ago. It created XML files by concatenating together various strings and XML files. The files were pushed into the ASP.NET Response stream by simple Response.Writes and Response.WriteFiles.

At this point, you probably guessed the rest. Somebody went ahead and edited one of the XML files (changing those classic “just two words”) that got added through Response.WriteFile, which is a binary operation… And boom, you have invalid data in your XML file. In this case, the file had always before been edited in a text editor that didn’t add the preamble, but Notepad did.
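The effect is easy to reproduce at the byte level. The sketch below is not the original application; it just glues byte arrays together the way a binary file write would, using invented element names, and then looks for the stray character in the result.

```powershell
$utf8Bom = New-Object System.Text.UTF8Encoding($true)   # emits a preamble
$utf8Raw = New-Object System.Text.UTF8Encoding($false)  # never emits one

# A fragment as it would sit on disk after being saved with a BOM.
[byte[]]$fragment = $utf8Bom.GetPreamble() + $utf8Bom.GetBytes('<item/>')

# Byte-level concatenation: the three BOM bytes end up in the middle.
[byte[]]$response = $utf8Raw.GetBytes('<root>') + $fragment + $utf8Raw.GetBytes('</root>')

# Decoding the whole thing turns the embedded BOM into the character U+FEFF.
$text = $utf8Raw.GetString($response)
$bomIndex = $text.IndexOf([char]0xFEFF)
```

Here $bomIndex points right after the opening root tag, which is where the consumer of the XML sees unexpected data.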

Removing the BOM

It’s really as trivial as just removing the first three bytes of the file, but unless you happen to have tools for that at your disposal, paste the stuff into an editor that does not add the BOM. Alternatively, use a more sophisticated editor that allows you to choose if you want a preamble or not.

For example, in Visual Studio, you can just choose File > Save As, then drop down the Save button and choose “Save with Encoding”. After that, you’ll have a dialog with lots of options, including “Unicode (UTF-8 without signature)” as well as a “Unicode (UTF-8 with signature)” one.

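If you ever need to do this in your own code, the .NET UTF8Encoding class has a constructor that lets you choose whether or not to emit the BOM, and you can pass the resulting encoding to a StreamWriter. A StreamWriter created without an explicit encoding writes UTF-8 with no BOM, while the shared Encoding.UTF8 instance does emit one. Since readers such as StreamReader detect and skip the BOM by default, BOMs get removed by just reading data in and then writing it back out with the default writer.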
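A minimal sketch of that round trip, done from PowerShell with the .NET file APIs. It uses UTF8Encoding, whose constructor argument controls the preamble, and a scratch file in the temp directory (the name is invented).

```powershell
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-demo.xml'
$withBom    = New-Object System.Text.UTF8Encoding($true)
$withoutBom = New-Object System.Text.UTF8Encoding($false)

# Simulate a file saved by an editor that adds the preamble.
[System.IO.File]::WriteAllText($path, '<a/>', $withBom)
$before = [System.IO.File]::ReadAllBytes($path)

# Reading detects and drops the BOM; writing with $withoutBom keeps it out.
$text = [System.IO.File]::ReadAllText($path)
[System.IO.File]::WriteAllText($path, $text, $withoutBom)
$after = [System.IO.File]::ReadAllBytes($path)

Remove-Item -Path $path
```

The file shrinks by exactly the three preamble bytes, and the content itself is unchanged.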

December 21, 2009  Tags:   Posted in: .NET, Misc. programming  4 Comments