ProperCase for C#

For some perverse reason, I HAD to try to write the best propercasing algorithm on Earth. This one does all of the following:

jouni heikniemi -> Jouni Heikniemi
jouni von lederhosen -> Jouni von Lederhosen
THE EYE OF THE TIGER -> The Eye of the Tiger
1250 MHZ -> 1250 MHz
RoNaLD MCDoNaLD, USa -> Ronald McDonald, USA

Enough babble, the code is up next.

using System;
using System.Text.RegularExpressions;

// The members below are meant to live inside a class of your choosing.

// CONFIGURATION:
// The following words will always be in lower case (except at the start of the string)
static string[] lowerCaseWords = { "of", "the", "and", "or", "a", "an", "von" };

// The following prefixes will cause their next character to be uppercased
// Note: Keep the first character uppercase when defining these; all else must be in lowercase
static string[] upperCasePrefixes = { "Mc", "O'" };

// The following words will always be presented in the case they have here.
static string[] fixedCaseWords = { "USA", "NATO", "MHz" };

/// <summary>
/// Converts the given string into ProperCase.
/// </summary>
/// <param name="original">The original string, e.g. "THE EYE OF THE TIGER"</param>
/// <returns>The string converted into ProperCase, e.g. "The Eye of the Tiger"</returns>
public static string ProperCase(string original) {
    if (original == null || original.Length == 0) return "";

    // Run the original through the massage word-by-word
    string result =
        Regex.Replace(original.ToLower(), @"\b(\w+)\b", new MatchEvaluator(HandleSingleWord));

    // Always uppercase the first character
    return Char.ToUpper(result[0]) + (result.Length > 1 ? result.Substring(1) : "");
}

// This helper method properizes (sp?) the case of a single word (regex match)
// NOTE: The input is in all lowercase as forced by the ProperCase method.
private static string HandleSingleWord(Match m) {
    string word = m.Groups[1].Value;

    // Is this word defined as all-lowercase?
    foreach (string lcw in lowerCaseWords)
        if (word == lcw)
            return word;

    // Is this word defined as a fixed-case word? (case-insensitive comparison)
    foreach (string fcw in fixedCaseWords)
        if (String.Compare(word, fcw, true) == 0)
            return fcw;

    // Ok, this is a normal word; uppercase the first letter
    if (word.Length == 1)
        return Char.ToUpper(word[0]).ToString();
    word = Char.ToUpper(word[0]) + word.Substring(1);

    // Check if this word starts with one of the uppercasing prefixes
    // Note: Only one of the uppercasing prefixes is applied
    foreach (string ucPrefix in upperCasePrefixes)
        if (word.StartsWith(ucPrefix) && word.Length > ucPrefix.Length)
            return word.Substring(0, ucPrefix.Length) +
                Char.ToUpper(word[ucPrefix.Length]) +
                (word.Length > ucPrefix.Length + 1
                    ? word.Substring(ucPrefix.Length + 1)
                    : "");

    return word;
}

Afterwards, I spotted a tiny programming error. I don't think it's going to be seen in any production application, but it can produce a slightly wrong result in a certain situation. Can you spot it?

October 3, 2004 · Jouni Heikniemi · 16 Comments
Posted in: .NET

16 Responses

  1. Jouni - October 3, 2004

    The answer to the quiz above (think before you read!):
    The error actually makes the "O'" uppercase prefix unnecessary. The regex anchor \b also matches at the apostrophe (it sits on a boundary between a word character and a non-word character), so "O'Neill" is actually handled as two separate words. That doesn't really matter, since the N will get uppercased anyway (it's at the start of a word). However, if you come up with really contrived examples such as the oh-so-useful string "o'a", you'll note it's cased as "O'a", while it should by definition be "O'A". Same with "O'the" and so on.
    It can be fixed by making the word split algorithm more robust – either complicate the regex or build the code on string splitting. I promise to come up with a fix if you show me a practical situation where the bug above can bite you. :-)
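The word splitting described in this answer can be seen in isolation with a minimal sketch like the one below. The class WordSplitDemo and the helper SplitWords are hypothetical names made up for illustration; only the pattern \b(\w+)\b and the lowercasing step come from the post.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original code.
public static class WordSplitDemo
{
    // Returns the "words" the post's pattern would hand to HandleSingleWord.
    public static string[] SplitWords(string input)
    {
        return Regex.Matches(input.ToLower(), @"\b(\w+)\b")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        // The apostrophe is not a \w character, so it ends one word and starts another.
        Console.WriteLine(string.Join(" | ", SplitWords("O'Neill")));  // o | neill
        Console.WriteLine(string.Join(" | ", SplitWords("o'a")));      // o | a
    }
}
```

Because "a" arrives as a word of its own, it hits the lowercase-word list before the "O'" prefix rule is ever consulted, which is why "o'a" comes out as "O'a".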

  2. Kenneth Falck - October 3, 2004

    It doesn't handle sentences starting with a UNIX command name (first letter must be in lowercase)!
    :-P

  3. Jouni - October 4, 2004

    "Unix and proper case" is an oxymoron anyway. :-)

  4. Rahul Guha - October 13, 2004

    Very useful … thanks

  5. David - October 17, 2004

    '"THE MATRIX"' becomes '"the Matrix"' instead of '"The Matrix"' as one would expect.
    Apart from that, wonderful work!

  6. Name - January 1, 2005

    Great work, but the bug you mentioned above causes the function to produce Ain'T, Don'T, Devil'S, etc. :)

  7. Jouni - January 1, 2005

    Good point. I never had a test case for that functionality. Suppose I need to fix that one soonish…

  8. Jay Turpin - January 29, 2005

    Try changing the regex from \w+ to [\w\']+
    That seemed to work for me.
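A rough sketch of what this suggestion changes, comparing the two patterns side by side. The class ApostropheSplitDemo and the helper SplitWords are hypothetical names used only for illustration, and the sketch assumes the widened pattern is dropped into the same \b(...)\b wrapper the post uses. Inside a character class, ' and \' mean the same thing.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original code.
public static class ApostropheSplitDemo
{
    // Lists the "words" a given pattern extracts from the lowercased input.
    public static string[] SplitWords(string input, string pattern)
    {
        return Regex.Matches(input.ToLowerInvariant(), pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string original = @"\b(\w+)\b";     // pattern from the post
        const string widened  = @"\b([\w']+)\b";  // pattern suggested in this comment

        Console.WriteLine(string.Join(" | ", SplitWords("Don't", original)));   // don | t
        Console.WriteLine(string.Join(" | ", SplitWords("Don't", widened)));    // don't
        Console.WriteLine(string.Join(" | ", SplitWords("O'Neill", widened)));  // o'neill
    }
}
```

With the widened pattern, "o'neill" reaches HandleSingleWord as a single word, so the "O'" prefix rule becomes the thing that uppercases the N, and "don't" no longer ends in a one-letter word that gets capitalized.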

  9. chris - September 8, 2005

    can someone handle obrien??

  10. CUTTER - February 2, 2006

    Looks good to me thanks for the code!

  11. Anonymous - March 21, 2006

    Thank you – very well written.

  12. Leonard Lee - July 7, 2006

    Thank you so much! Well-written code.

  13. Eliezer - April 12, 2007

    "U.S.A." doesn't get handled properly – it turns into "U.S.a." Any ideas how to tweak this?

  14. afterburn - July 10, 2007

    Error when a possessive is used, as in "Johnson's".

  15. Anonymous - January 12, 2008

    Could this make your algorithm more efficient?
    http://yyyz.net/CSharpCode/ProperCase.aspx

  16. bungle - January 4, 2011

    Nobody has perfected this so far, as far as I know. Jouni's code works OK in most cases, but it also has these fringe cases.

    Here are some good tries:
    http://search.cpan.org/~doom/Text-Capitalize-1.3/
    http://camendesign.com/code/title-case
    http://individed.com/code/to-title-case/
    http://ejohn.org/blog/title-capitalization-in-javascript/
    http://msdn.microsoft.com/en-us/library/cd7w43ec(v=vs.80).aspx
    http://msdn.microsoft.com/en-us/library/system.globalization.textinfo.totitlecase.aspx
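For comparison, the last link above points at the framework's own TextInfo.ToTitleCase. A minimal sketch of calling it follows; the class TitleCaseDemo and the helper ToTitle are hypothetical names, and the invariant culture is an assumption made to keep the example independent of machine settings.

```csharp
using System;
using System.Globalization;

// Hypothetical demo class, not part of the original code.
public static class TitleCaseDemo
{
    // Lowercases first, because ToTitleCase treats all-uppercase words
    // as acronyms and leaves them untouched.
    public static string ToTitle(string input)
    {
        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;
        return textInfo.ToTitleCase(textInfo.ToLower(input));
    }

    public static void Main()
    {
        Console.WriteLine(ToTitle("THE EYE OF THE TIGER"));  // The Eye Of The Tiger
    }
}
```

Unlike the code in this post, ToTitleCase capitalizes every word, so "of" and "the" are not kept in lower case and there is no list of fixed-case words such as "MHz".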
