CodeSOD: Merge the Files

Remy Porter

from The Daily WTF on 2024-02-20 06:30 (#6JRJE)

XML is, arguably, an overspecified language. Every aspect of XML has a standard to interact with it or transform it or manipulate it, and that standard is also defined in XML. Each specification related to XML fits together into a soup that does all the things and solves every problem you could possibly have.

Though Owe had a problem that didn't quite map to the XML specification(s). Specifically, he needed to parse absolutely broken XML files.

bool Sorter::Work()
{
    if(this->__disposed)
        throw gcnew ObjectDisposedException("Object has been disposed");

    if(this->_splitFiles)
    {
        List<Document^>^ docs = gcnew List<Document^>();
        for each(FileInfo ^file in this->_sourceDir->GetFiles("*.xml"))
        {
            XElement ^xml = XElement::Load(file->FullName);
            xml->Save(file->FullName);
            long count = 0;
            for each(XElement^ rec in xml->Elements("REC"))
            {
                if(rec->Attribute("NAME")->Value == this->_mainLevel)
                    count++;
            }
            if(count < 2)
                continue;

            StreamReader ^reader = gcnew StreamReader(file->OpenRead());
            StringBuilder ^sb = gcnew StringBuilder("<FILE NAME=\"blah\">");
            bool first = true;
            bool added = false;
            Regex ^isRecOrFld = gcnew Regex("^\\s+\\<[REC|FLD].*$");
            Regex ^isEndOfRecOrFld = gcnew Regex("^\\s+\\<\\/[REC|FLD].*$");
            Regex ^isMainLevelRec = gcnew Regex("^\\s+\\<REC NAME=\\\""+this->_mainLevel+"\\\".*$");
            while(!reader->EndOfStream)
            {
                String ^line = reader->ReadLine();
                if(!isRecOrFld->IsMatch(line) && !isEndOfRecOrFld->IsMatch(line))
                    continue;
                if(isMainLevelRec->IsMatch(line) && !String::IsNullOrEmpty(sb->ToString()) && !first)
                {
                    sb->AppendLine("</FILE>");
                    XElement^ xml = XElement::Parse(sb->ToString());
                    String ^key = String::Empty;
                    for each(XElement ^rec in xml->Elements("REC"))
                    {
                        key = this->findKey(rec);
                        if(!String::IsNullOrEmpty(key))
                            break;
                    }
                    docs->Add(gcnew Document(key, gcnew XElement("container", xml)));
                    sb = gcnew StringBuilder("<FILE NAME=\"blah\">");
                    first = true;
                    added = true;
                }
                sb->AppendLine(line);
                if(first && !added)
                    first = false;
                if(added)
                    added = false;
            }
            delete reader;
            file->Delete();
        }
        int i = 1;
        for each(Document ^doc in docs)
        {
            XElement ^splitted = doc->GetData()->Element("FILE");
            splitted->Save(Path::Combine(this->_sourceDir->FullName, this->_docPrefix + "_" + i++ + ".xml"));
            delete splitted;
        }
        delete docs;
    }

    List<Document^>^ docs = gcnew List<Document^>();
    for each(FileInfo ^file in this->_sourceDir->GetFiles(String::Format("{0}*.xml", this->_docPrefix)))
    {
        XElement ^xml = XElement::Load(file->FullName);
        String ^key = findKey(xml->Element("REC")); // will always be first element in document order
        Document ^doc = gcnew Document(key, gcnew XElement("data", xml));
        docs->Add(doc);
        file->Delete();
    }
    List<Document^>^ sorted = MergeSorter::MergeSort(docs);
    XElement ^sortedMergedXml = gcnew XElement("FILE", gcnew XAttribute("NAME", "MergedStuff"));
    for each(Document ^doc in sorted)
    {
        sortedMergedXml->Add(doc->GetData()->Element("FILE")->Elements("REC"));
    }
    sortedMergedXml->Save(Path::Combine(this->_sourceDir->FullName, String::Format("{0}_mergedAndSorted.xml", this->_docPrefix)));
    // returning a sane value
    return true;
}

This is C++/CLI, the .NET dialect of C++, so the odd ^ sigil is a handle to a garbage-collected object.

There's a lot going on here. The purpose of this function is to possibly split some pre-merged XML files into separate XML files, and then take a set of XML files and merge them back together (properly sorted).

So we start by confirming that this object hasn't been disposed, and throwing an exception if it has. Then we try and split.

To do this, we search the directory for "*.xml", and then we... load the file and then save the file? The belief about this code is that it corrects the whitespace, because later on we require some whitespace. That belief holds up better than it deserves to: by default, XElement::Load throws away insignificant whitespace and Save writes the document back out indented, so the round trip leaves each child element on its own line with leading whitespace. It shouldn't be necessary, but this is a world where it makes the code work for reasons that are best not thought about.

Owe writes, to the preceding developers: "Thanks guys, I really appreciate this!"

Now, since we're iterating across an entire directory of XML files, some of the files have been pre-merged (and need to be unmerged), and others haven't been merged at all. How do we tell them apart? We find every element named "REC", and check if its "NAME" attribute is equivalent to our _mainLevel value. If there are at least two such elements, we know that this file has been pre-merged and thus needs to be unmerged.

Owe writes: "Thanks guys, I really appreciate this!"
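The pre-merged check, and the unmerging it triggers, can be sketched in standard C++ rather than C++/CLI. This is only an illustration under assumptions: Rec, isPreMerged, and splitAtMainLevel are hypothetical names, the file is modeled as a flat list of top-level record names, and no real XML is parsed.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a <REC NAME="..."> element; only NAME matters here.
struct Rec {
    std::string name;
};

// A file counts as pre-merged when it holds two or more main-level records.
bool isPreMerged(const std::vector<Rec>& recs, const std::string& mainLevel) {
    auto n = std::count_if(recs.begin(), recs.end(),
                           [&](const Rec& r) { return r.name == mainLevel; });
    return n >= 2;
}

// Start a new group at every main-level record. Records that come before the
// first main-level record end up in a leading group of their own.
std::vector<std::vector<Rec>> splitAtMainLevel(const std::vector<Rec>& recs,
                                               const std::string& mainLevel) {
    std::vector<std::vector<Rec>> groups;
    for (const Rec& r : recs) {
        if (groups.empty() || r.name == mainLevel) {
            groups.emplace_back();
        }
        groups.back().push_back(r);
    }
    return groups;
}
```

Each group would then become one of the split files.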

And then we get into the dreaded parse-XML-with-regex phase. This is done because the XML files aren't actually valid XML, so it's a mix of string operations and regex matches to try to interpret the data. And remember that whitespace that we thought we required back when we wrote the documents out? Well, here's why: our regexes are matching on leading whitespace.

Owe writes: "Thanks guys, I really appreciate this!"
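To see what those line filters actually accept, here is a small sketch using std::regex in standard C++. It rests on a few assumptions: matchesLoose mirrors the shape of isRecOrFld (with the redundant backslash before < dropped), matchesStrict is a hypothetical tighter variant that is not in the original code, and .NET's regex engine is assumed to agree with std::regex on patterns this simple.

```cpp
#include <regex>
#include <string>

// Shaped like the article's isRecOrFld. [REC|FLD] is a character class, so it
// accepts any tag whose first character is one of R, E, C, F, L, D or '|'.
bool matchesLoose(const std::string& line) {
    static const std::regex loose(R"re(^\s+<[REC|FLD].*$)re");
    return std::regex_match(line, loose);
}

// Hypothetical stricter variant: a group with alternation, followed by a word
// boundary so that longer tag names such as RECORD are rejected.
bool matchesStrict(const std::string& line) {
    static const std::regex strict(R"re(^\s+<(REC|FLD)\b.*$)re");
    return std::regex_match(line, strict);
}
```

With these, an indented `<DATA>` line passes the loose filter but not the strict one, and a `<REC ...>` line with no leading whitespace passes neither, which is why the indentation matters so much.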

Once we've constructed all the documents in memory, we can then dump them out to a new set of files. And then, once that's done, we can reopen those files, because now the merging happens. Here we find all the "REC" elements and build new XML documents based off of them. Then a MergeSorter::MergeSort function actually does the merging, and honestly, I dread to think about what that looks like.

The merge sorter sorts the documents, but we actually want to output one document with the elements in that sorted order, so we create one last XML document, iterate across all our sorted document fragments, and then inject the "REC" elements into the output.

Owe writes: "Thanks guys, I really appreciate this!"
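Stripped of the file handling, that final step amounts to "sort the documents by key, then concatenate their records." A minimal sketch in standard C++ follows. Document here is a hypothetical struct (a key plus a list of record strings), mergeSorted is a made-up name, std::stable_sort stands in for MergeSorter::MergeSort, whose implementation we never see, and plain string comparison of keys is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the article's Document: a sort key plus the
// records that belong to it.
struct Document {
    std::string key;
    std::vector<std::string> records;
};

// Sort the documents by key, then append every document's records, in that
// order, to a single output list.
std::vector<std::string> mergeSorted(std::vector<Document> docs) {
    std::stable_sort(docs.begin(), docs.end(),
                     [](const Document& a, const Document& b) { return a.key < b.key; });
    std::vector<std::string> merged;
    for (const Document& doc : docs) {
        merged.insert(merged.end(), doc.records.begin(), doc.records.end());
    }
    return merged;
}
```

The records inside each document keep their original relative order; only the documents themselves are reordered.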

While the code and the entire process here are terrible, the core WTF is the requirement that "we need to store our XML with the elements sorted in a specific order". That's not what XML is for. But obviously, they don't know what XML is for, since they're doing things in their documents that can't successfully be parsed by an XML parser. Or, perhaps more accurately, they couldn't figure out how to parse it as XML, hence the regexes and string munging.

Were the documents sensible, this whole thing could probably have been solved with some fairly straightforward (by XML standards) XQuery/XSLT operations. Instead, we have this. Thanks guys, I really appreciate this.

