Skip to main content

Fix UTF-8 BOM, Smart Quotes, and Encoding Errors in CSV Files

9 min readPipeSheets Team

Your CSV opens fine in Excel. The columns look correct. You upload it to QuickBooks, Amazon, Shopify, or wherever it's going — and it fails. The error message says something cryptic like "unexpected character at line 1" or "invalid encoding," or worse, the upload succeeds but every accented letter is replaced with garbled symbols. You've hit an encoding problem, and they're the most frustrating CSV bugs because they're invisible until something breaks.

What Encoding Actually Is

Every character in a text file is stored as a sequence of bytes. The encoding is the agreement between writer and reader about which bytes mean which characters. UTF-8 is the modern standard — it can represent every character in every language using a variable number of bytes per character. But there are still other encodings in use: Windows-1252 (Western European), ISO-8859-1 (Latin-1), UTF-16 (used by some Windows tools), and so on.

When a file written in one encoding is read as if it were another, characters above the basic ASCII range (anything beyond standard English letters, numbers, and punctuation) get interpreted incorrectly. The result is mojibake — garbled text where accented characters become two-or-three-character symbols.

What mojibake looks like

Common mojibake patterns:

  • é (e-acute) → é — UTF-8 file read as Windows-1252
  • ñ (n-tilde) → ñ
  • ü (u-umlaut) → ü
  • € (euro sign) → €
  • ' (right single quote) → ’
  • — (em-dash) → —

If you see any of these patterns in your data, you have a UTF-8 file being misread as Windows-1252 (the most common direction). The fix is to re-save the file with the correct encoding declared.

The UTF-8 BOM Problem

A BOM (byte order mark) is three invisible bytes (EF BB BF in hex) at the start of a UTF-8 file. Microsoft tools (Excel, Notepad) add a BOM when saving as "CSV UTF-8." The BOM helps Windows tools detect the encoding, but it breaks parsers that aren't expecting it. The first column header gets prefixed with an invisible character, which means "Date" becomes "\uFEFFDate" — and any parser that expects exactly "Date" rejects the row.

Where the BOM causes problems

Common BOM-related failures:

  • Amazon Seller Central — fails with "Invalid header" on row 1
  • Shopify product import — fails with "Required column missing"
  • Python csv.DictReader — first column key is "\uFEFFDate" instead of "Date"
  • QuickBooks Online — silently mis-parses the first column
  • Most database COPY/import commands — fail with parse errors

How to remove the BOM

In VS Code: open the file, click the encoding indicator in the bottom-right status bar (it'll say "UTF-8 with BOM"), select "Save with Encoding," and choose "UTF-8" (without BOM). In Notepad++: Encoding menu > Convert to UTF-8 (not UTF-8-BOM). In Sublime Text: File > Save with Encoding > UTF-8. Excel doesn't have a no-BOM option built in; you need to round-trip through another editor.

Smart Quotes and Em-Dashes

Microsoft Word and other word processors automatically convert straight quotes (") into smart quotes (" and "), and hyphens into em-dashes (—). Paste any text from Word into a spreadsheet cell and those substitutions come along. The smart characters are valid Unicode, but many CSV consumers reject or mis-parse them.

Why smart quotes break CSVs

CSV parsers use straight double quotes to quote fields that contain commas or newlines. A smart double quote (") looks similar but is a different character (U+201D vs U+0022). Parsers don't recognize it as a quote, so a field like "Smith, John" (with smart quotes) gets split into two fields at the comma.

Wrong (smart quotes don't quote the field):
Name,Email
“Smith, John”,j@x.com
  parses as: Name=“Smith / John”=j@x.com

Right (straight quotes):
Name,Email
"Smith, John",j@x.com
  parses as: Name=Smith, John / Email=j@x.com

Other characters to watch for

Common Word-isms that break CSVs:

  • Smart double quotes: “ ” → replace with "
  • Smart single quotes: ‘ ’ → replace with '
  • Em-dash: — → replace with - or --
  • En-dash: – → replace with -
  • Ellipsis character: … → replace with ...
  • Non-breaking space (U+00A0) → replace with regular space
  • Soft hyphen (U+00AD) → invisible, just delete
  • Trademark symbol: ™ → ™ may or may not be allowed depending on system

Line Ending Differences

Windows uses \r\n (carriage return + newline) to end lines. Mac and Linux use just \n. Old-style Mac (pre-OS X) used just \r. Most modern tools auto-detect, but some don't — and a file with mixed line endings (one of the few things a hand-edited file can produce) will baffle most parsers.

Symptoms of line-ending issues

All rows appear concatenated into one giant row. Or every other row is empty. Or the parser reports "unexpected end of file" mid-way through. The fix: re-save the file with consistent line endings. In VS Code, click the line-ending indicator (CRLF or LF) in the bottom-right and switch to the one your target system expects.

How to Detect Encoding Problems

Eyeball test

Open the file in a text editor (not Excel — Excel hides encoding issues). Scroll through. Look for any of the mojibake patterns above. Look for unfamiliar characters where you expected letters or punctuation. Check the first few characters — if your editor shows a BOM marker or a strange character before the first column header, you have a BOM.

Hex editor test

For a definitive check, open the file in a hex editor (HxD on Windows, Hex Fiend on Mac, or VS Code with a hex extension). The first three bytes will be EF BB BF if there's a BOM. Look for individual bytes above 7F (0x7F) — these are non-ASCII and may be the source of encoding mismatches.

Round-trip test

Open the file in your spreadsheet tool, save it, then open the saved file in a text editor. Compare to the original. If accented characters now look different, your tool changed the encoding during round-trip — note which encoding it's writing.

How to Fix Encoding Problems

Step 1: Identify the source encoding

If your file came from Excel on Windows, it's probably either Windows-1252 or UTF-8 with BOM. If from Mac Excel, probably UTF-8 (no BOM) but sometimes Mac CSV (old encoding). If from a database export, usually UTF-8. When in doubt, try opening the file in VS Code with different encodings (View > Editor Encoding > Reopen with Encoding) until the characters render correctly.

Step 2: Re-save as UTF-8 without BOM

Once the file renders correctly in your text editor, save it as UTF-8 (no BOM). This is the modern standard and what most systems expect. The exceptions: some older Windows tools require UTF-8 with BOM; some non-English-language tools require Windows-1252 or the language-specific code page.

Step 3: Replace smart characters

Run a find-and-replace pass on smart quotes (both single and double), em-dashes, ellipsis, and non-breaking spaces. Most text editors let you find by Unicode character or by paste-the-character. If you do this a lot, save the find-replace pattern as a macro or build a pipeline that does it automatically.

Step 4: Verify

Re-open the saved file in your text editor. Check the first bytes (no BOM). Spot-check accented characters and any rows that previously had Word-style punctuation. Try a small import to verify the target system accepts the file.

PipeSheets handles encoding normalization automatically. Upload a file in any common encoding, choose a cleanup that includes the "Standardize special characters" step, and the output is clean UTF-8 with ASCII-equivalent punctuation. No hex editor required.

Encoding by Source: Where Each Problem Comes From

Excel on Windows

Excel has two CSV save options: "CSV (Comma delimited)" saves as Windows-1252 (or your local code page). "CSV UTF-8" saves as UTF-8 with BOM. Neither is what most modern systems want (UTF-8 without BOM). Workaround: save as CSV UTF-8, then re-save in a text editor without the BOM.

Google Sheets

Downloads CSVs as UTF-8 without BOM by default. This is the right default — if you have a choice between Excel and Sheets for exporting, Sheets is encoding-friendlier.

Database exports

Postgres, MySQL, and SQL Server all default to UTF-8 (or your server's locale). The encoding is determined by the connection's client encoding, not the database. If you get mojibake from a database export, check the client connection encoding before re-running.

Email and copy-paste

Pasting data from email or web pages into a spreadsheet is the most common source of smart quotes and em-dashes. The pasted text retains whatever rich formatting was in the source. Always paste as plain text (Ctrl+Shift+V or Cmd+Shift+V) when bringing data into a CSV.

Prevention

Build a pre-flight check into every CSV workflow: re-encode to UTF-8 without BOM, strip smart characters, normalize line endings. If you're using a programmatic pipeline (Python, Node, etc.), libraries like Python's `chardet` can auto-detect the source encoding, and `unicodedata` can normalize Unicode forms. For non-technical workflows, build a one-pipeline cleanup once and run it on every incoming file.

Try the automated solution

PipeSheets can fix these issues automatically. Clean your first file free.

Clean Your CSV