Why Your Concordance DAT File Won't Parse: The þ Delimiter Encoding Trap

Source: DEV Community
If you work with eDiscovery load files, you've probably hit this: you receive a Concordance DAT file, try to import it into your review platform, and it fails silently or produces garbled data. No useful error message. Just broken records.

There's a good chance the problem is the þ character.

What makes Concordance DAT files special

Concordance DAT is one of the most common load file formats in eDiscovery. Its default delimiters include two unusual characters:

- Text qualifier (quote character): þ (thorn, Unicode U+00FE)
- Newline substitute inside field values: ® (registered sign, Unicode U+00AE)

The default field separator is the control character DC4 (U+0014, often displayed as ¶), which is a single byte, 14, in both CP1252 and UTF-8.

These characters were chosen decades ago specifically because they almost never appear in actual document metadata. Smart choice, until encoding enters the picture.

The trap: þ has two different byte representations

Here's where things break. The þ character is encoded differently depending on whether the file is CP1252 (Windows-1252) or UTF-8:

Encoding   þ bytes   ® bytes
CP1252     FE        AE
UTF-8      C3 BE     C2 AE

CP1252 uses a single byte for each character. UTF-8 uses two bytes. If your parser ass
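
The byte values in the table above can be used to guess a file's encoding before parsing. The following is a minimal Python sketch, not a definitive implementation: the function name `guess_dat_encoding` is hypothetical, and it assumes the file is either CP1252 or UTF-8 (UTF-16 is not handled). It relies on the fact that the byte FE can never occur in well-formed UTF-8.

```python
# þ (U+00FE) as it appears on disk in each encoding.
THORN_CP1252 = b"\xfe"       # single byte in CP1252
THORN_UTF8 = b"\xc3\xbe"     # two-byte sequence in UTF-8
UTF8_BOM = b"\xef\xbb\xbf"   # optional marker at the start of UTF-8 files


def guess_dat_encoding(data: bytes) -> str:
    """Return 'cp1252', 'utf-8', or 'unknown' based on how þ is encoded.

    Assumption: the file is CP1252 or UTF-8. This is a heuristic sketch,
    not a general-purpose encoding detector.
    """
    if data.startswith(UTF8_BOM):
        return "utf-8"
    # A bare FE byte is invalid in UTF-8, so its presence rules UTF-8 out.
    if THORN_CP1252 in data:
        return "cp1252"
    if THORN_UTF8 in data:
        return "utf-8"
    # No þ found at all: the bytes alone do not tell us.
    return "unknown"


if __name__ == "__main__":
    header = "\u00feDOCID\u00fe"  # þDOCIDþ
    print(guess_dat_encoding(header.encode("cp1252")))
    print(guess_dat_encoding(header.encode("utf-8")))
```

In practice you would read the first few kilobytes of the file in binary mode, pass them to the function, and then open the file with the returned encoding. Decoding with the wrong one either raises an error (FE read as UTF-8) or silently turns each þ into the two characters Ã¾ (C3 BE read as CP1252).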