Article 6DPAT CodeSOD: A Case of Conversion

CodeSOD: A Case of Conversion

by
Remy Porter
from The Daily WTF on (#6DPAT)

Mattijs's co-worker was responsible for an HTML-to-PDF converter that was used internally. While TRWTF is always going to be PDFs, there were some odd choices used in the converter, starting with the function named ConvertTagsToLowerCase.

private static string ConvertTagsToLowerCase(string RichText){ //remove any double tags which are not in SupportedDoubleTags /> char[] array = new char[RichText.Length]; int arrayIndex = 0; bool inside = false; StringBuilder sb = new StringBuilder(); for (int i = 0; i < RichText.Length; i++) { char let = RichText[i]; if (let == '<') { inside = true; //clear stringbuilder sb.Remove(0, sb.ToString().Length); continue; } if (let == '>') { inside = false; array[arrayIndex] = '<'; arrayIndex++; string HTMLInside = sb.ToString().ToLower(); for (int j = 0; j < HTMLInside.Length; j++) { array[arrayIndex] = HTMLInside[j]; arrayIndex++; } array[arrayIndex] = '>'; arrayIndex++; continue; } if (!inside) { array[arrayIndex] = let; arrayIndex++; } else { //append to the string inside sb.Append(let); } } return new string(array, 0, arrayIndex);}

So, the first question is: why do we need to do this? While HTML tags are by convention lower case, it doesn't (and shouldn't) really matter. But let's set that aside, and walk through the code.

It's easiest to understand this code by going from the bottom up. If they're "inside" (of an HTML tag), they append the character to a string builder. Otherwise, they copy the character into our output array.

When we encounter an <, we are now "inside" an HTML tag, so we clear the string builder. We do this by Removeing every character out to its length, instead of using the Clear method, which is a strange choice clearly driven by not reading the docs.

Finally, when we encounter an >, we are no longer inside an HTML tag. We turn the string builder into a string, and then lowercase it. Now, we copy all the text from that string back into our output array... one character at a time.

And then, of course, as the cherry on top of this, is the misleading comment: remove any double tags which are not in supported tags. I assume this is meant to be marked as a TODO, but I have no idea. It certainly isn't something happening here. Maybe it's a prerequisite?

What puzzles me about this code is that they use both a string builder and an array of characters. The obvious "solution" with their knowledge is to just convert the string to an array, modify each character in place, and then return the new string. That's at least simpler.

If this code is necessary at all, it's because something they're using to parse HTML is itself broken, and expecting a specific casing for tags. But the code itself is a terrible way to solve this problem.

otter-icon.png [Advertisement] Otter - Provision your servers automatically without ever needing to log-in to a command prompt. Get started today!
External Content
Source RSS or Atom Feed
Feed Location http://syndication.thedailywtf.com/TheDailyWtf
Feed Title The Daily WTF
Feed Link http://thedailywtf.com/
Reply 0 comments