[AccessD] CSV with no header

stuart at lexacorp.com.pg stuart at lexacorp.com.pg
Tue Oct 11 19:26:19 CDT 2005


On 11 Oct 2005 at 19:12, John Colby wrote:

> In the end, I don't get to choose.  This is a government export for PlanD of
> the Medicare system (or something like that).  The gov says "these fields,
> in this format".  The only problem is that they simply say "csv with no
> header".  What does that mean exactly?
> 

I would supply:
Plain ASCII text (i.e. no Unicode).
Each record terminated with CRLF.
Fields separated by commas.
Quotes around text fields if any field is likely to include commas. Otherwise I would 
not quote the text. 
Do not include an initial line containing the field names.
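
A minimal sketch of those rules, assuming Python and its standard csv module. 
The function name and the sample records are made up for illustration; they 
are not taken from the government spec.

```python
import csv
import io

def write_headerless_csv(rows):
    """Return CSV text: comma-separated, CRLF-terminated, no header line."""
    buf = io.StringIO(newline="")  # leave the writer's CRLF untouched
    writer = csv.writer(
        buf,
        delimiter=",",
        quoting=csv.QUOTE_MINIMAL,  # quote only the fields that need it
        lineterminator="\r\n",
    )
    writer.writerows(rows)  # data rows only, no field-name row
    text = buf.getvalue()
    text.encode("ascii")  # raises UnicodeEncodeError if any data is non-ASCII
    return text

# Hypothetical sample records.
sample = [
    [1001, "Smith", "12 High St"],
    [1002, "Jones", "4 Elm Rd, Unit 2"],
]
output = write_headerless_csv(sample)
```

Only the second address gets quotes, because it is the only field containing a 
comma. When writing to a real file, opening it with newline="" and 
encoding="ascii" should give the same result.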

Note that there is no actual specification for a CSV file.
Here's what you will find at http://www.wotsit.org, which is the best place I know to get 
the specifications for just about any file format you can think of.

<quote>
-------------------------------------------------------------------
CSV file format
-------------------------------------------------------------------

The CSV format is supposedly one of the most 'standard' interchange
formats between data based programs.  Almost all data based applications
seem to be able to export CSV formatted data, and almost all have
a way to import it.

However, the format is anything BUT standard.  There are quite a
few variations that are very important to understand.

-------------------------------------------------------------------
DATA ENCAPSULATION
-------------------------------------------------------------------
Data is either 'naked' (without encapsulating doublequotes), or quoted.
Quoted data is used to protect imbedded carriage returns, imbedded
commas, odd characters and of course, the quote character itself.  
Quotes that are 'data' are doubled up.  

-------------------------------------------------------------------
UNQUOTED ENCAPSULATION
-------------------------------------------------------------------
Data that does NOT contain newlines, carriage returns, commas or
quotes (or ASCII data below 0x20 or above 0x7f) stands on its
own:

  data <1234> followed by <The Big Ol' Bear>  

becomes:

  1234,The Big Ol' Bear{CR}


-------------------------------------------------------------------
QUOTED ENCAPSULATION
-------------------------------------------------------------------
Data that contains double quotes, commas, returns or other odd
characters outside the 7-bit ASCII character set is quoted.  

  data <1234 Harrington St, Northwest> followed by <Suite 17 Stop 3>

becomes:

  "1234 Harrington St, Northwest",Suite 17 Stop 3

Note that the second chunk of data does not have the protection
of the quotes: it doesn't need it, having no odd characters within.
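
The quoting decision described so far could be sketched roughly as follows 
(Python, for illustration only; encode_field and encode_record are 
hypothetical helper names, not part of any library):

```python
def encode_field(value):
    """Encode one value as a CSV field, quoting only when required."""
    text = str(value)
    needs_quotes = any(
        # comma, double quote, CR, LF, or anything outside 0x20..0x7f
        ch in ',"\r\n' or not (0x20 <= ord(ch) <= 0x7F)
        for ch in text
    )
    if not needs_quotes:
        return text  # 'naked' data stands on its own
    # Quotes that are data are doubled up, then the field is wrapped.
    return '"' + text.replace('"', '""') + '"'

def encode_record(values):
    """Join encoded fields with commas (record terminator not included)."""
    return ",".join(encode_field(v) for v in values)

naked = encode_record([1234, "The Big Ol' Bear"])
mixed = encode_record(["1234 Harrington St, Northwest", "Suite 17 Stop 3"])
```

Here naked comes out with no quotes at all, and mixed quotes only the first 
field, matching the two examples above.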


-------------------------------------------------------------------
QUOTED ENCAPSULATION OF DOUBLE QUOTE
-------------------------------------------------------------------
Data that contains double quotes is a special case, and oddly 
interpreted by all nature of commercial programs.  Consider the 
data:

   <1234 West "Q" St.>

It should become:

   "1234 West ""Q"" St."	<<< RIGHT WAY

Where each internal DATA quote is doubled up.  However, some 
programs (such as Paradox) don't do so nicely.  They would represent
the data as:

   "1234 West "Q" St.",....	<<< WRONG WAY

Where they feel that a doublequote with no comma following is part of
the data. This turns out to be rather bogus, as it can be painful under
various common circumstances, e.g.:

   <1234 West "Q" St. (for "Quantum", or "Quality")>

becomes

   "1234 West "Q" St. (for "Quantum", or "Quality")",

Which of course is entirely ambiguous as to the placement of data in 
the field.
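
For comparison, a short sketch (assuming Python's standard csv module) of how 
doubling the data quotes keeps that last example unambiguous. The second field, 
"Suite 17", is made up to show where the field boundary falls.

```python
import csv
import io

address = '1234 West "Q" St. (for "Quantum", or "Quality")'

# Write one record; the csv module doubles embedded quotes by default.
buf = io.StringIO(newline="")
csv.writer(buf, lineterminator="\r\n").writerow([address, "Suite 17"])
line = buf.getvalue()

# Read it back; the doubled quotes collapse to single data quotes.
parsed = next(csv.reader(io.StringIO(line, newline="")))
```

The written line contains ""Q"", ""Quantum"" and ""Quality"", and reading it 
back yields exactly two fields with the original address intact.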


-------------------------------------------------------------------
QUOTED ENCAPSULATION OF CARRIAGE RETURNS
-------------------------------------------------------------------
Oftentimes data has imbedded carriage returns.  YOU MUST MAKE SURE
YOUR ROUTINES HANDLE THIS CORRECTLY, as it is a VERY common case:

   <Thomas Aquinus, Esq.{CR}
    Prosecutor for the Pope{CR}
    St. Luke's Dungeon{CR}
    Somewhere in Italy>

becomes

   "Thomas Aquinus, Esq.{CR}
    Prosecutor for the Pope{CR}
    St. Luke's Dungeon{CR}
    Somewhere in Italy"

What makes this so blasted difficult is that at the 'outer level', 
most CSV parsing code is trying to evaluate LINES, not fields. However,
the imbedded CR louses up the logic.  OPTIMALLY, you should code 
your CSV reader to have a 'field-by-field' read logic.  If this is
impossible (or awkward, as in PERL), you may want to have conditional
logic looking for a trailing and unmatched doublequote.
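
A rough sketch of such 'field-by-field' read logic, written in Python for 
illustration. parse_csv is a hypothetical name, and the sketch assumes the 
doubled-quote convention and CR, LF or CRLF record terminators. It walks the 
text one character at a time instead of splitting it into lines first.

```python
def parse_csv(text):
    """Parse CSV text character by character into a list of records."""
    records, record, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote is one data quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and line breaks are data here
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            record.append("".join(field))
            field = []
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single record terminator
            record.append("".join(field))
            records.append(record)
            record, field = [], []
        else:
            field.append(ch)
        i += 1
    if field or record:
        record.append("".join(field))  # last record had no terminator
        records.append(record)
    return records

text = (
    '"Thomas Aquinus, Esq.\rProsecutor for the Pope\r'
    'St. Luke\'s Dungeon\rSomewhere in Italy",Rome\r\n'
    '1234,"West ""Q"" St."\r\n'
)
records = parse_csv(text)
```

The four-line address stays in a single field because the record terminator is 
only recognised outside quotes. In practice a library reader, such as Python's 
csv.reader on a file opened with newline="", does the same job.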

-------------------------------------------------------------------
EMPTY DATA CONVENTIONS
-------------------------------------------------------------------
When data is 'empty' (either the empty string, or the numeric
value '0', or FALSE if boolean), you have the option of either
writing out:

   ""

or writing nothing at all.  Therefore, it is common to see CSV files
that look like both of these examples:

  "","Thos.","","Aquinus","Esq.","Pros.forPope","","Somewhere..."
  ,Thos.,,Aquinus,Esq.,Pros.forPope,,Somewhere...

They're both equivalent, and do not violate the spirit of the CSV
standard.
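
A quick check of that equivalence, assuming Python's standard csv module and a 
shortened version of the two lines above:

```python
import csv

quoted_line = '"","Thos.","","Aquinus"'
naked_line = ",Thos.,,Aquinus"

# csv.reader accepts any iterable of lines, so a one-item list is enough.
quoted_fields = next(csv.reader([quoted_line]))
naked_fields = next(csv.reader([naked_line]))
```

Both calls give the same four fields, with the empty ones read as empty 
strings. Note that a reader working this way cannot tell a written-out "" from 
a field that was left blank.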

-------------------------------------------------------------------
REMOVING PADDING
-------------------------------------------------------------------
Though not strictly required, it is conventional to remove empty 
space padding before and after data.  Likewise, it is conventional
to remove empty padding when READING data [on the off chance that
someone forgot to remove it when writing the data out!]
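
One possible way to strip padding on read, sketched in Python. read_stripped 
is a hypothetical helper; it assumes that spaces inside quoted fields are 
padding too, which may not be true for every data source.

```python
import csv

def read_stripped(lines):
    """Yield records with leading and trailing spaces removed per field."""
    for record in csv.reader(lines):
        yield [field.strip(" ") for field in record]

rows = list(read_stripped(["  1234 , The Big Ol' Bear  ,x "]))
```

Only spaces at the ends of each field are removed; spaces between words are 
kept.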



-------------------------------------------------------------------
THE FIELD-DESCRIPTION HEADER
-------------------------------------------------------------------
Surprisingly there is absolutely no part of the standard that either
calls for the first record to consist of the names of the fields
in the data following, or, that identifies the first record uniquely
as such should it exist!  

However, it is both common AND EXPECTED that the first line should
consist of the names of the fields for the records that follow.
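
When a file has no header record, the reader has to be told the field names 
some other way. A small sketch, assuming Python's csv.DictReader; the field 
names and data here are invented for illustration.

```python
import csv
import io

# Hypothetical field names, supplied by the reader rather than the file.
FIELD_NAMES = ["member_id", "last_name", "street"]

data = io.StringIO(
    '1001,Smith,12 High St\r\n1002,Jones,"4 Elm Rd, Unit 2"\r\n',
    newline="",
)

# With fieldnames given, DictReader treats the first line as data.
records = list(csv.DictReader(data, fieldnames=FIELD_NAMES))
```

Because fieldnames is passed explicitly, the first line is returned as a record 
instead of being consumed as a header.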


-------------------------------------------------------------------
VARIABLY FORMATTED CSVs
-------------------------------------------------------------------
Some programmers have abandoned the very precept of a regularly
formatted CSV file and have added their own twists.  Groupwise
CSV export is one of these programs.  They "enhance" the CSV
output format so that the first field of the record determines
the structure of the rest of the record.  This is hopelessly 
hard to cope with.  DON'T DO IT IF YOU WANT TO REMAIN ALIVE.


This document was created in a fit of disgust with the complete
lack of CSV documentation found on the Internet.

Author: Robert J. Lynch
rlynch at lynchmarks.com
Copyright (c) 2001
</quote>






