A data format is the arrangement of data fields. It defines how data is encoded for storage in a computer file, and how it is displayed and processed. A good data researcher should know at least the following formats:
What is CSV?
CSV (Comma-Separated Values), sometimes called Character-Separated Values or comma-delimited files, is a plain-text file format that is often used for exchanging data between different applications, including Microsoft Excel and SQL databases.
A CSV file has a fairly simple structure, which is a list of data separated by commas. Although the comma character is mostly used to separate (or delimit) data, sometimes other characters are used, like semicolons.
For example, a high school keeps its students' contact information as follows. Each row represents exactly one student. In this example, the column 'Name' is called a primary key: its value is unique to each row, so it identifies the row.
CSV Table A - Student Contact
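As a rough illustration of such a table, the sketch below reads a few made-up rows with Python's standard csv module and checks whether a column can serve as a primary key. Every value, the column names other than 'Name', and the helper names read_rows and is_primary_key are assumptions for illustration only.

```python
import csv
import io

# Hypothetical stand-in for the student contact table; all values are invented.
STUDENT_CONTACT_CSV = """Name,Age,Email,Phone
Alice,15,alice@example.com,555-0101
Bob,16,bob@example.com,555-0102
Carol,15,carol@example.com,555-0103
"""

def read_rows(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def is_primary_key(rows, column):
    """Return True if every row has a distinct value in the given column."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

contacts = read_rows(STUDENT_CONTACT_CSV)
print(contacts[0]["Email"])               # alice@example.com
print(is_primary_key(contacts, "Name"))   # True
print(is_primary_key(contacts, "Age"))    # False: Alice and Carol share an age
```

Passing delimiter=";" to read_rows handles semicolon-separated files in the same way. Note that csv returns every field as a string, so numeric columns such as Age need an explicit conversion.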
Here is another example, for students' daily attendance information. It is a simple table listing who attended a class on a particular date. However, as no column is unique on its own, this CSV has no primary key.
CSV Table B - Student Attendance
From the examples above, it can be seen that a CSV file is easy to create and read, but it lacks the flexibility to represent "1-to-many" information within a single table. Sometimes, to keep the file size down, it is better to split the data and store it in different CSV tables.
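To make the "1-to-many" point concrete, here is a minimal sketch that keeps contact and attendance data in two separate CSV tables and links them through the Name column. The rows, the column names and the function attendance_by_student are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data; the two tables are linked by the "Name" column.
CONTACT_CSV = """Name,Email
Alice,alice@example.com
Bob,bob@example.com
"""

ATTENDANCE_CSV = """Date,Name
2024-01-08,Alice
2024-01-08,Bob
2024-01-09,Alice
"""

def load(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def attendance_by_student(contact_rows, attendance_rows):
    """Group attendance dates under each student (one student, many dates)."""
    dates = defaultdict(list)
    for row in attendance_rows:
        dates[row["Name"]].append(row["Date"])
    return {
        row["Name"]: {"Email": row["Email"], "Dates": dates.get(row["Name"], [])}
        for row in contact_rows
    }

merged = attendance_by_student(load(CONTACT_CSV), load(ATTENDANCE_CSV))
print(merged["Alice"]["Dates"])   # ['2024-01-08', '2024-01-09']
print(merged["Bob"]["Dates"])     # ['2024-01-08']
```

The merged result is a nested structure that no single flat CSV row can hold, which is the kind of data the hierarchical formats in the following sections represent directly.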
What is XML?
XML, which stands for eXtensible Markup Language, is a textual data format with Unicode support for different human languages. It was first published by the World Wide Web Consortium (W3C) in 1998 for data exchange between web services. Because of its flexibility in defining a data schema, many applications and documents now adopt XML syntax for system configuration files and internet communication protocols, including RSS, Atom, REST APIs, the Microsoft .NET Framework, Apple's iWork, etc.
Here are some key terms used in the XML specification:
- Tag: A tag is a markup construct that begins with < and ends with >. There are three types of tags:
  - start-tag: e.g. <body>
  - end-tag: e.g. </body>
  - empty-element tag: e.g. <break />
- Element: An element is a document component that either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start-tag and end-tag are the element's content. An element's content may contain other elements, which are called child elements. e.g. <price>100.0</price>
- Attribute: An attribute is a name-value pair that exists within a start-tag or empty-element tag. e.g. <img src="earth.jpg" alt="Earth" />
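The terms above map directly onto what a parser exposes. The following sketch uses Python's standard xml.etree.ElementTree on a small made-up snippet (the wrapping item element is an assumption added so that the document has a single root).

```python
import xml.etree.ElementTree as ET

# A small made-up document that uses a start-tag/end-tag pair,
# an empty-element tag and two attributes.
SNIPPET = '<item><price>100.0</price><img src="earth.jpg" alt="Earth" /></item>'

root = ET.fromstring(SNIPPET)

price = root.find("price")   # element with a start-tag and an end-tag
image = root.find("img")     # empty-element tag carrying two attributes

print(root.tag)                        # item
print([child.tag for child in root])   # ['price', 'img']
print(price.text)                      # 100.0 (the element's content, as a string)
print(image.attrib["src"])             # earth.jpg
print(image.text)                      # None: an empty element has no content
```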
Using our previous example of student contact and attendance, we can put all the information in a single XML file.
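One possible layout is sketched below as a Python string and read back with xml.etree.ElementTree. The element names, the name attribute and all values are assumptions chosen for illustration, not a prescribed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: each student element nests its own attendance dates.
STUDENTS_XML = """
<students>
  <student name="Alice">
    <age>15</age>
    <email>alice@example.com</email>
    <attendance>
      <date>2024-01-08</date>
      <date>2024-01-09</date>
    </attendance>
  </student>
  <student name="Bob">
    <age>16</age>
    <email>bob@example.com</email>
    <attendance>
      <date>2024-01-08</date>
    </attendance>
  </student>
</students>
""".strip()

def attendance_dates(root):
    """Map each student's name to the dates found under <attendance>."""
    return {
        student.get("name"): [d.text for d in student.findall("attendance/date")]
        for student in root.findall("student")
    }

root = ET.fromstring(STUDENTS_XML)
print(attendance_dates(root))
# {'Alice': ['2024-01-08', '2024-01-09'], 'Bob': ['2024-01-08']}
```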
The above is formatted with whitespace between tags to make it easier for humans to read. For machine processing, it is equally valid to write the XML string on a single line.
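A minimal sketch of that compaction, assuming the whitespace between tags carries no meaning (which does not hold for mixed-content documents such as XHTML). The function name compact and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

PRETTY_XML = """<student name="Alice">
  <age>15</age>
  <email>alice@example.com</email>
</student>"""

def compact(xml_text):
    """Re-serialize XML on one line, dropping whitespace-only text between tags."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    return ET.tostring(root, encoding="unicode")

one_line = compact(PRETTY_XML)
print(one_line)
# <student name="Alice"><age>15</age><email>alice@example.com</email></student>
```

Both forms parse to the same elements and text; only the layout differs.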
As seen in the example, XML supports a flexible, hierarchical representation of data. This flexibility, however, makes file parsing more computationally expensive than for CSV. Also, because tag names (e.g. Age, Email, Phone) are repeated for each record, an XML file is usually larger than a CSV file that contains the same amount of information.
What is JSON?
JSON stands for JavaScript Object Notation. JSON files use the .json file extension.
JSON data is normally represented as attribute-value pairs. It has the following basic data types:
- Number: a signed decimal number that may contain a fractional part and may use exponential E notation
- String: a sequence of Unicode characters delimited by double quotation marks
- Boolean: either true or false
- Array: an ordered list of values, which may be of any type. Arrays use square bracket notation with comma-separated elements
- Object: a collection of key-value pairs where the keys are strings. Each key should be unique within an object. Objects are delimited with curly brackets and use commas to separate each pair, while within each pair the colon ':' character separates the key from its value
- Null: an empty value, written as null
An example of the syntax is as follows:
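The sketch below embeds a made-up JSON document in a Python string and parses it with the standard json module, so that the type of each value can be inspected. The keys and values are invented for illustration.

```python
import json

# Made-up document exercising number, string, boolean, array and object.
TEXT = """
{
  "name": "Alice",
  "age": 15,
  "gpa": 3.8e0,
  "enrolled": true,
  "subjects": ["Math", "Physics"],
  "contact": {"email": "alice@example.com"}
}
"""

data = json.loads(TEXT)
for key, value in data.items():
    print(key, type(value).__name__)
# name str
# age int
# gpa float
# enrolled bool
# subjects list
# contact dict
```

Python's json module maps JSON numbers to int or float depending on whether a fraction or exponent is present, arrays to list and objects to dict.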
Once JSON data has been parsed, a value can easily be accessed by its key. A simple Python example is as follows:
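A minimal sketch of key-based access with the standard json module; the record and its keys are hypothetical.

```python
import json

record = json.loads(
    '{"name": "Alice", "contact": {"email": "alice@example.com"}, '
    '"subjects": ["Math", "Physics"]}'
)

print(record["name"])               # Alice
print(record["contact"]["email"])   # alice@example.com (nested object)
print(record["subjects"][1])        # Physics (arrays are indexed from 0)
print(record.get("phone", "n/a"))   # n/a (.get avoids a KeyError for a missing key)
```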
Using the example of student contact and attendance above, we can also put all the information in a single JSON file.
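One way such a file could look is sketched below; the keys, the nesting and all values are assumptions for illustration. Each student object carries its own attendance array, so the "1-to-many" relationship needs no second table.

```python
import json

STUDENTS_JSON = """
{
  "students": [
    {
      "name": "Alice",
      "age": 15,
      "email": "alice@example.com",
      "attendance": ["2024-01-08", "2024-01-09"]
    },
    {
      "name": "Bob",
      "age": 16,
      "email": "bob@example.com",
      "attendance": ["2024-01-08"]
    }
  ]
}
"""

def present_on(students, date):
    """Return the names of students whose attendance list contains the date."""
    return [s["name"] for s in students if date in s["attendance"]]

students = json.loads(STUDENTS_JSON)["students"]
print(present_on(students, "2024-01-08"))   # ['Alice', 'Bob']
print(present_on(students, "2024-01-09"))   # ['Alice']
```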
Similarly, for machine processing, it is also valid to write a JSON string on a single line.
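With Python's json module, the difference between the readable and the single-line form comes down to the indent and separators arguments of json.dumps. The record below is made up.

```python
import json

student = {"name": "Alice", "age": 15, "attendance": ["2024-01-08", "2024-01-09"]}

pretty = json.dumps(student, indent=2)                 # multi-line, for humans
compact = json.dumps(student, separators=(",", ":"))   # one line, no extra spaces

print(compact)
# {"name":"Alice","age":15,"attendance":["2024-01-08","2024-01-09"]}
print(json.loads(pretty) == json.loads(compact))       # True: same data either way
```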
As with XML, a JSON file is usually larger than the equivalent CSV because keys (e.g. Email, Address, etc.) are repeated for each record. Comparing JSON with XML, a JSON file should generally be smaller, as each value carries only one "tag" (its key), while XML needs both a start-tag and an end-tag.
What is YAML?
YAML Ain't Markup Language (YAML) has increased in popularity over the past few years. It is often used as a format for configuration files (e.g. Docker), but its object serialization abilities also make it a viable replacement for XML and JSON.
Let's take a look at a sample YAML file:
YAML expresses its data hierarchy through outline-style indentation. The file starts with three dashes (---), which indicate the start of a new YAML document. YAML supports multiple documents in one stream, and compliant parsers will recognize each set of dashes as the beginning of a new one. A valid YAML file should also follow these rules:
- Whitespace indentation (tab characters NOT allowed) is used for denoting structure
- Comments begin with # , which can start anywhere in a line and continue until the end of the line.
- An associative array entry is represented in the form of key: value with one entry per line
- List items are denoted by - with one item per line.
- Strings are ordinarily unquoted, but they may be enclosed in double quotes " or single quotes ' .
- Multiple documents within a single stream are separated by ---
- Repeated nodes are initially denoted by & and thereafter referenced with * .
Unlike JSON, which can only represent data in a hierarchical structure where each child node has a single parent, YAML supports a simple relational scheme that allows identical data to be referenced from two or more points in the tree rather than entered repeatedly. Interested readers can refer to the official documentation (https://yaml.org) for more advanced usage.
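The rules above, including repeated nodes, can be tried out with a YAML parser. The sketch below assumes the third-party PyYAML package is installed (it is not part of the Python standard library) and uses a made-up two-document stream; the helper load_documents and every key and value are hypothetical.

```python
try:
    import yaml  # PyYAML, third-party; assumed installed (pip install pyyaml)
except ImportError:
    yaml = None

# Made-up stream with two documents; all keys and values are illustrative.
STREAM = """\
---
# "&school" names the node so that "*school" can reuse it further down
school: &school
  name: Example High School
  city: Springfield
students:
  - name: Alice
    school: *school
  - name: Bob
    school: *school
---
note: "a second document in the same stream"
"""

def load_documents(text):
    """Parse every document in a YAML stream into plain Python objects."""
    if yaml is None:
        raise RuntimeError("PyYAML is not installed")
    return list(yaml.safe_load_all(text))

if yaml is not None:
    first, second = load_documents(STREAM)
    alice, bob = first["students"]
    print(alice["school"]["name"])            # Example High School
    print(alice["school"] is bob["school"])   # True: both aliases share one object
    print(second["note"])                     # a second document in the same stream
```

Because both aliases resolve to the same node, the school record is written once in the file and shared by every student that refers to it.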
Using the example of student contact and attendance above, we can create a single YAML file as follows:
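One possible layout is sketched below and loaded with the third-party PyYAML package, which is assumed to be installed. The keys, values and the helper load_students are hypothetical. The dates are quoted on purpose: PyYAML would otherwise convert an unquoted 2024-01-08 into a date object rather than a string.

```python
try:
    import yaml  # PyYAML, third-party; assumed installed (pip install pyyaml)
except ImportError:
    yaml = None

# Hypothetical student file; indentation alone defines the nesting.
STUDENTS_YAML = """\
---
students:
  - name: Alice
    age: 15
    email: alice@example.com
    attendance:
      - "2024-01-08"
      - "2024-01-09"
  - name: Bob
    age: 16
    email: bob@example.com
    attendance:
      - "2024-01-08"
"""

def load_students(text):
    """Return the list of student mappings from the YAML text."""
    if yaml is None:
        raise RuntimeError("PyYAML is not installed")
    return yaml.safe_load(text)["students"]

if yaml is not None:
    for student in load_students(STUDENTS_YAML):
        print(student["name"], student["age"], student["attendance"])
```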
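Unlike XML and JSON, YAML in its usual block style cannot be compressed by stripping whitespace, because the indentation itself defines the structure. YAML does offer a JSON-like flow style, with curly and square brackets, that can be written on a single line, but this gives up the readability that makes YAML attractive.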
In the comparison below, we rank the data formats from 1 to 4, where 1 is the best/most desired.
| Criterion | CSV | XML | JSON | YAML |
|---|---|---|---|---|
| Ease of File Creation | 1 | 3 | 2 | 4 |
| Data Parsing Speed | 1 | 3 | 2 | 4 |
| Flexibility of Data Structure | 4 | 2 | 2 | 1 |
Each data format has its own strengths; which one to use depends on the actual use case (e.g. read/write frequency, how the data is stored, etc.). A good data analytics platform (like ALGOGENE) weighs the pros and cons, and usually derives a hybrid solution to handle different situations.