admin

4 Common Data Formats You Should Know



A data format is the arrangement of data fields. It defines how data is encoded for storage in a computer file and how it is displayed and processed. A good data researcher should know at least these four:

  • CSV
  • XML
  • JSON
  • YAML

What is CSV?

CSV stands for Comma-Separated Values; such files are sometimes called character-separated or comma-delimited files. It is a plain-text format that is often used for exchanging data between different applications, including Microsoft Excel and SQL databases.

A CSV file has a fairly simple structure: each line is one record, and the values within a record are separated by commas. Although the comma is the most common separator (or delimiter), other characters, such as semicolons, are sometimes used.

For example, a high school might store its students' contact information as follows. Each row represents exactly one student. In this example, the 'Name' column acts as the primary key: its value is different in every row, so it uniquely identifies each student.

Name,Age,Email,Phone,Address
Amy,16,amy@xxx.com,1234-5678,123 Dummy Street
Bob,14,bob@xxx.com,1234-5679,124 Dummy Street
Clarence,13,clarence@xxx.com,1234-5680,125 Dummy Street
David,16,david@xxx.com,1234-5681,126 Dummy Street
Eva,15,eva@xxx.com,1234-5682,127 Dummy Street
Frankie,15,frankie@xxx.com,1234-5683,128 Dummy Street
...

CSV Table A - Student Contact
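
As a minimal sketch of how a file like Table A might be read, Python's standard csv module can turn each row into a dictionary keyed by the header line. The inline text below stands in for a file on disk, and the variable names are illustrative assumptions rather than part of the original example.

```python
import csv
import io

# Inline copy of the first rows of Table A; a real program would open a file instead.
STUDENT_CSV = """Name,Age,Email,Phone,Address
Amy,16,amy@xxx.com,1234-5678,123 Dummy Street
Bob,14,bob@xxx.com,1234-5679,124 Dummy Street
Clarence,13,clarence@xxx.com,1234-5680,125 Dummy Street
"""

# DictReader takes the column names from the first line.
# For a semicolon-separated file, pass delimiter=";" instead.
reader = csv.DictReader(io.StringIO(STUDENT_CSV), delimiter=",")

# Index the rows by the primary key column, Name.
students = {row["Name"]: row for row in reader}

# Every CSV value is read as a string, so numbers need explicit conversion.
bob_age = int(students["Bob"]["Age"])
print(students["Bob"]["Email"], bob_age)
```

Indexing by Name only works because Name is unique in this table; a table without a primary key would need a list of rows instead.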


Here is another example, this time for students' daily attendance. It is a simple table that lists who attended class on a particular date. However, because neither column is unique on its own, this CSV has no single-column primary key.

Date,Attendance
2021-05-01,Amy
2021-05-01,Bob
2021-05-01,Clarence
2021-05-01,David
2021-05-01,Eva
2021-05-01,Frankie
2021-05-02,Bob
2021-05-02,Clarence
2021-05-02,David
2021-05-02,Eva
2021-05-02,Frankie
2021-05-03,Bob
...

CSV Table B - Student Attendance


From the above examples, it can be seen that a CSV file is easy to create and read, but it lacks the flexibility to hold "1-to-many" information, such as one student with many attendance dates, in a single table. To keep file sizes down, it is sometimes better to split the data and store it in separate CSV tables.
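
One way to picture the "1-to-many" relationship is to join the two tables in code. The sketch below uses shortened inline copies of Tables A and B and groups the attendance dates under each student's name; names such as attendance_by_name are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Shortened stand-ins for Table A (contacts) and Table B (attendance).
CONTACT_CSV = """Name,Age
Amy,16
Bob,14
"""

ATTENDANCE_CSV = """Date,Attendance
2021-05-01,Amy
2021-05-01,Bob
2021-05-02,Bob
2021-05-03,Bob
"""

# Group the "many" side (dates) by the value it shares with Table A (the name).
attendance_by_name = defaultdict(list)
for row in csv.DictReader(io.StringIO(ATTENDANCE_CSV)):
    attendance_by_name[row["Attendance"]].append(row["Date"])

# Attach the grouped dates to each contact record.
students = []
for row in csv.DictReader(io.StringIO(CONTACT_CSV)):
    record = dict(row)
    record["Attendance"] = attendance_by_name.get(row["Name"], [])
    students.append(record)

for student in students:
    print(student["Name"], student["Attendance"])
```

The nested list under "Attendance" is exactly the kind of structure a single flat CSV row cannot express, which is why the data lives in two tables.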


What is XML?

XML, which stands for eXtensible Markup Language, is a textual data format with Unicode support for different human languages. It was first published by the World Wide Web Consortium (W3C) in 1998 and is widely used for data exchange between web services. Because of its flexibility in defining a data schema, many applications and document formats have adopted XML syntax for system configuration files and internet communication protocols, including RSS, Atom, some REST APIs, the Microsoft .NET Framework, and Apple's iWork.

Here are some key terms used in the XML specification:

  • Tag: A tag is a markup construct that begins with < and ends with >. There are three types of tag:
    • start-tag: e.g. <body>
    • end-tag: e.g. </body>
    • empty-element tag: e.g. <break />
  • Element: An element is a document component that begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start-tag and end-tag are the element's content. An element's content may contain other elements, which are called child elements. e.g. <price>100.0</price>
  • Attribute: An attribute consists of a name-value pair that exists within a start-tag or empty-element tag. e.g. <img src="earth.jpg" alt="Earth" />
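
The terms above map quite directly onto what an XML parser exposes. The following sketch uses Python's standard xml.etree.ElementTree module on a tiny made-up document; the item wrapper element and the currency attribute are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# A small document with a start-tag/end-tag pair and an empty-element tag.
doc = '<item><price currency="USD">100.0</price><img src="earth.jpg" alt="Earth" /></item>'

root = ET.fromstring(doc)

# Child elements are found by iterating over the parent element.
children = [child.tag for child in root]

price = root.find("price")
print(price.tag)     # the element name taken from the start-tag
print(price.text)    # the content between the start-tag and the end-tag
print(price.attrib)  # the attributes of the start-tag, as a dictionary

# An empty-element tag has attributes but no content.
img = root.find("img")
print(img.text, img.get("src"), img.get("alt"))
```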

Using our previous example of student contact and attendance, we can put all the information in a single XML file.

<class>
    <student> 
        <Name>Amy</Name>
        <Age>16</Age>
        <Email>amy@xxx.com</Email>
        <Phone>1234-5678</Phone>
        <Address>123 Dummy Street</Address>
        <Attendance> 
            <Date>2021-05-01</Date>
        </Attendance> 
    </student>
    <student> 
        <Name>Bob</Name>
        <Age>14</Age>
        <Email>bob@xxx.com</Email>
        <Phone>1234-5679</Phone>
        <Address>124 Dummy Street</Address>
        <Attendance> 
            <Date>2021-05-01</Date>
            <Date>2021-05-02</Date>
            <Date>2021-05-03</Date>
        </Attendance> 
    </student>
    <student> 
        <Name>Clarence</Name>
        <Age>13</Age>
        <Email>clarence@xxx.com</Email>
        <Phone>1234-5680</Phone>
        <Address>125 Dummy Street</Address>
        <Attendance> 
            <Date>2021-05-01</Date>
            <Date>2021-05-02</Date>
        </Attendance> 
    </student>
    <student> 
        <Name>David</Name>
        <Age>16</Age>
        <Email>david@xxx.com</Email>
        <Phone>1234-5681</Phone>
        <Address>126 Dummy Street</Address>
        <Attendance> 
            <Date>2021-05-01</Date>
            <Date>2021-05-02</Date>
        </Attendance> 
    </student>
    <student> 
        <Name>Eva</Name>
        <Age>15</Age>
        <Email>eva@xxx.com</Email>
        <Phone>1234-5682</Phone>
        <Address>127 Dummy Street</Address>
        <Attendance> 
            <Date>2021-05-01</Date>
            <Date>2021-05-02</Date>
        </Attendance> 
    </student>
    <student> 
        <Name>Frankie</Name>
        <Age>15</Age>
        <Email>frankie@xxx.com</Email>
        <Phone>1234-5683</Phone>
        <Address>128 Dummy Street</Address>
        <Attendance> 
            <Date>2021-05-01</Date>
            <Date>2021-05-02</Date>
        </Attendance> 
    </student>
</class>

The above is formatted with whitespace between tags for easier human reading. For machine processing, it is equally valid to write the whole XML document on a single line.

<class><student><Name>Amy</Name><Age>16</Age><Email>amy@xxx.com</Email><Phone>1234-5678</Phone><Address>123 Dummy Street</Address><Attendance><Date>2021-05-01</Date></Attendance></student><student><Name>Bob</Name><Age>14</Age><Email>bob@xxx.com</Email><Phone>1234-5679</Phone><Address>124 Dummy Street</Address><Attendance><Date>2021-05-01</Date><Date>2021-05-02</Date><Date>2021-05-03</Date></Attendance></student><student><Name>Clarence</Name><Age>13</Age><Email>clarence@xxx.com</Email><Phone>1234-5680</Phone><Address>125 Dummy Street</Address><Attendance><Date>2021-05-01</Date><Date>2021-05-02</Date></Attendance></student><student><Name>David</Name><Age>16</Age><Email>david@xxx.com</Email><Phone>1234-5681</Phone><Address>126 Dummy Street</Address><Attendance><Date>2021-05-01</Date><Date>2021-05-02</Date></Attendance></student><student><Name>Eva</Name><Age>15</Age><Email>eva@xxx.com</Email><Phone>1234-5682</Phone><Address>127 Dummy Street</Address><Attendance><Date>2021-05-01</Date><Date>2021-05-02</Date></Attendance></student><student><Name>Frankie</Name><Age>15</Age><Email>frankie@xxx.com</Email><Phone>1234-5683</Phone><Address>128 Dummy Street</Address><Attendance><Date>2021-05-01</Date><Date>2021-05-02</Date></Attendance></student></class>

As seen in the example, XML can represent dynamic data in a hierarchical structure. This flexibility, however, makes an XML file more expensive to parse than a CSV file. Also, because the tag names (e.g. Age, Email, Phone) are repeated for every record, an XML file is usually larger than a CSV file that contains the same information.
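
To give a feel for what parsing the hierarchy involves, here is a minimal sketch that walks a shortened copy of the class document with Python's standard xml.etree.ElementTree module and collects each student's attendance dates. The variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the class document, keeping only two students.
CLASS_XML = """
<class>
    <student>
        <Name>Amy</Name>
        <Age>16</Age>
        <Attendance>
            <Date>2021-05-01</Date>
        </Attendance>
    </student>
    <student>
        <Name>Bob</Name>
        <Age>14</Age>
        <Attendance>
            <Date>2021-05-01</Date>
            <Date>2021-05-02</Date>
            <Date>2021-05-03</Date>
        </Attendance>
    </student>
</class>
"""

root = ET.fromstring(CLASS_XML.strip())

# Walk the tree: one <student> per record, any number of <Date> children each.
attendance = {}
for student in root.findall("student"):
    name = student.findtext("Name")
    dates = [date.text for date in student.findall("Attendance/Date")]
    attendance[name] = dates

print(attendance)
```

Unlike a CSV reader, which can split each line independently, the parser has to track nesting before any value can be attributed to a student.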


What is JSON?

JavaScript Object Notation (JSON) is an open standard file format that was derived from JavaScript in the early 2000s. Since its standardization in 2013, JSON has become a language-independent data format, and many programming languages (e.g. Python, Java, C++) offer built-in modules or libraries to generate and parse JSON data. JSON is also used for data exchange by many service providers, browsers, servers, web applications, libraries, frameworks, and APIs, including the Google Search API and the Facebook API.

A JSON file uses the .json file extension.

JSON data is normally presented as key-value pairs. It has the following basic data types:

  • Number: a signed decimal number that may contain a fractional part and may use exponential E notation
  • String: a sequence of Unicode characters delimited by double quotation marks
  • Boolean: either true or false
  • Null: an empty value, written as null
  • Array: an ordered list of elements whose values may be of any type. Arrays use square bracket notation with comma-separated elements
  • Object: a collection of key-value pairs in which the keys must be strings and should be unique within the object. Objects are delimited with curly brackets and use commas to separate each pair, while within each pair the colon ':' character separates the key from its value

An example of the syntax is as follows:

{   
    "shop":"ABC", 
    "product":"A basket of fruits", 
    "price":123.4, 
    "isForSell": true,
    "fruits":["apple","banana","orange"]
}

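Once JSON data has been parsed into a native structure, a value can be easily accessed by its key. A simple Python example follows, with the same data written directly as a Python dictionary (note that Python spells the boolean as True):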

data = {   
    "shop":"ABC", 
    "product":"A basket of fruits", 
    "price":123.4, 
    "isForSell": True,
    "fruits":["apple","banana","orange"]
}

p = data["price"]
print("price=",p)
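
When the data arrives as JSON text rather than as a Python literal, the standard json module does the parsing. The sketch below shows how each JSON type is mapped to a Python type; the extra "discount" key with a null value is an assumption added to show how null is handled.

```python
import json

# JSON text as it might arrive from a file or an API response.
text = (
    '{"shop":"ABC","product":"A basket of fruits","price":123.4,'
    '"isForSell":true,"fruits":["apple","banana","orange"],"discount":null}'
)

# json.loads converts the text into native Python objects.
data = json.loads(text)

# Record which Python type each JSON value was converted to.
types = {key: type(value).__name__ for key, value in data.items()}
print(types)

# Values are then accessed by key (objects) or by position (arrays).
print(data["price"], data["fruits"][1])
```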

Using the example of student contact and attendance above, we can also put all the information in a single JSON file.

[{   
    "Name":"Amy", 
    "Age":16,
    "Email":"amy@xxx.com", 
    "Phone":"1234-5678", 
    "Address":"123 Dummy Street", 
    "Attendance": ["2021-05-01"]
}, {   
    "Name":"Bob", 
    "Age":14,
    "Email":"bob@xxx.com", 
    "Phone":"1234-5679", 
    "Address":"124 Dummy Street", 
    "Attendance": ["2021-05-01","2021-05-02","2021-05-03"]
}, {   
    "Name":"Clarence", 
    "Age":13,
    "Email":"clarence@xxx.com", 
    "Phone":"1234-5680", 
    "Address":"125 Dummy Street", 
    "Attendance": ["2021-05-01","2021-05-02"]
}, {   
    "Name":"David", 
    "Age":16,
    "Email":"david@xxx.com", 
    "Phone":"1234-5681", 
    "Address":"126 Dummy Street", 
    "Attendance": ["2021-05-01","2021-05-02"]
}, {   
    "Name":"Eva", 
    "Age":15,
    "Email":"eva@xxx.com", 
    "Phone":"1234-5682", 
    "Address":"127 Dummy Street", 
    "Attendance": ["2021-05-01","2021-05-02"]
}, {   
    "Name":"Frankie", 
    "Age":15,
    "Email":"frankie@xxx.com", 
    "Phone":"1234-5683", 
    "Address":"128 Dummy Street", 
    "Attendance": ["2021-05-01","2021-05-02"]
}]

Similarly, for machine processing, it is also valid to write a JSON document on a single line.

[{"Name":"Amy","Age":16,"Email":"amy@xxx.com","Phone":"1234-5678","Address":"123 Dummy Street","Attendance":["2021-05-01"]},{"Name":"Bob","Age":14,"Email":"bob@xxx.com","Phone":"1234-5679","Address":"124 Dummy Street","Attendance":["2021-05-01","2021-05-02","2021-05-03"]},{"Name":"Clarence","Age":13,"Email":"clarence@xxx.com","Phone":"1234-5680","Address":"125 Dummy Street","Attendance":["2021-05-01","2021-05-02"]},{"Name":"David","Age":16,"Email":"david@xxx.com","Phone":"1234-5681","Address":"126 Dummy Street","Attendance":["2021-05-01","2021-05-02"]},{"Name":"Eva","Age":15,"Email":"eva@xxx.com","Phone":"1234-5682","Address":"127 Dummy Street","Attendance":["2021-05-01","2021-05-02"]},{"Name":"Frankie","Age":15,"Email":"frankie@xxx.com","Phone":"1234-5683","Address":"128 Dummy Street","Attendance":["2021-05-01","2021-05-02"]}]
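
In practice the single-line form is rarely typed by hand. As a sketch, Python's json.dumps can produce both the indented and the single-line layout from the same data; the shortened student list here is an assumption made to keep the example small.

```python
import json

# Shortened student list with two records.
students = [
    {"Name": "Amy", "Age": 16, "Attendance": ["2021-05-01"]},
    {"Name": "Bob", "Age": 14, "Attendance": ["2021-05-01", "2021-05-02", "2021-05-03"]},
]

# Human-readable layout: one key per line, indented by four spaces.
pretty = json.dumps(students, indent=4)

# Single-line layout: no line breaks and no spaces after "," or ":".
compact = json.dumps(students, separators=(",", ":"))

# Both texts parse back to the same data; only the whitespace differs.
same_data = json.loads(pretty) == json.loads(compact)

print(compact)
print(len(pretty), len(compact), same_data)
```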

Similar to XML, a JSON file is usually larger than a CSV file holding the same data because the keys (e.g. Email, Address) are repeated in every record. Comparing JSON with XML, a JSON file should generally be smaller, since each value is labelled by a single key, whereas XML wraps it in both a start-tag and an end-tag.
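
The size difference can be checked with a rough experiment. The sketch below serializes the same two contact records as CSV, single-line JSON, and single-line XML using only Python's standard library, then compares the byte counts. It is an illustration on a tiny sample, not a benchmark, and the exact numbers will vary with the data.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

FIELDS = ["Name", "Age", "Email", "Phone", "Address"]
records = [
    {"Name": "Amy", "Age": 16, "Email": "amy@xxx.com",
     "Phone": "1234-5678", "Address": "123 Dummy Street"},
    {"Name": "Bob", "Age": 14, "Email": "bob@xxx.com",
     "Phone": "1234-5679", "Address": "124 Dummy Street"},
]

# CSV: the field names appear once, in the header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: every record repeats each key once.
json_text = json.dumps(records, separators=(",", ":"))

# XML: every record repeats each name twice (start-tag and end-tag).
root = ET.Element("class")
for record in records:
    student = ET.SubElement(root, "student")
    for field in FIELDS:
        ET.SubElement(student, field).text = str(record[field])
xml_text = ET.tostring(root, encoding="unicode")

sizes = {
    "CSV": len(csv_text.encode("utf-8")),
    "JSON": len(json_text.encode("utf-8")),
    "XML": len(xml_text.encode("utf-8")),
}
print(sizes)
```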


What is YAML?

YAML Ain't Markup Language (YAML) has grown in popularity over the past few years. It is often used as a format for configuration files (e.g. for Docker Compose), but its object serialization abilities make it a viable replacement for XML and JSON.

Let's take a look at a sample YAML file:

--- #this is document 1
- Product:  
    shop: "ABC"
    product: "A basket of fruits"
    price: 123.4
    isForSell: true
    fruits: 
      - "apple"
      - "banana"
      - "orange"
---   #this is document 2
- Product:
    shop: "XYZ"
    product: "A basket of fruits"
    price: 50.0
    isForSell: true
    fruits: 
      - "banana"
      - "pineapple"

YAML expresses its data hierarchy through outline indentation. The sample file starts with three dashes (---), which mark the start of a new YAML document. YAML supports multiple documents in one file, and compliant parsers will recognize each set of dashes as the beginning of a new one. A valid YAML file should also follow these rules:

  • Whitespace indentation (tab characters NOT allowed) is used for denoting structure
  • Comments begin with # , which can start anywhere in a line and continue until the end of the line.
  • An associative array entry is represented in the form of key: value with one entry per line
  • List items are denoted by - with one item per line.
  • Strings are ordinarily unquoted, but they may be enclosed in double quotes " or single quotes ' .
  • Multiple documents within a single stream are separated by ---
  • Repeated nodes are initially denoted by & and thereafter referenced with * .

Unlike JSON, which can only represent data as a hierarchy in which each child node has a single parent, YAML supports a simple relational scheme that allows identical data to be referenced from two or more points in the tree rather than entered repeatedly. Interested readers can refer to the official documentation (https://yaml.org) for more advanced usage.
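
The limitation on the JSON side can be demonstrated with Python's standard json module. In this sketch, two students share one contact record in memory; the second student, Andy, and the household variable are hypothetical. After a round trip through JSON the record has been duplicated, whereas YAML's anchor (&) and alias (*) syntax, shown in the comment, can express the sharing in the text itself.

```python
import json

# Hypothetical data: two students who share one contact record.
household = {"Address": "123 Dummy Street", "Phone": "1234-5678"}
students = {"Amy": household, "Andy": household}

# In memory, both keys point at the very same object.
shared_before = students["Amy"] is students["Andy"]

# JSON has no reference syntax, so the record is written out twice.
text = json.dumps(students)
copies_in_text = text.count("123 Dummy Street")

# After parsing, the two records are equal but no longer the same object.
restored = json.loads(text)
shared_after = restored["Amy"] is restored["Andy"]

print(shared_before, copies_in_text, shared_after)

# The equivalent YAML can state the record once and refer back to it:
#
#   Amy: &home
#     Address: "123 Dummy Street"
#     Phone: "1234-5678"
#   Andy: *home
```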


Using the example of student contact and attendance above, we can create a single YAML file as follows:

---
- Student:
    Name: "Amy"
    Age: 16
    Email: "amy@xxx.com"
    Phone: "1234-5678"
    Address: "123 Dummy Street"
    Attendance: 
      - "2021-05-01"
- Student:
    Name: "Bob"
    Age: 14
    Email: "bob@xxx.com"
    Phone: "1234-5679"
    Address: "124 Dummy Street"
    Attendance: 
      - "2021-05-01"
      - "2021-05-02"
      - "2021-05-03"
- Student:
    Name: "Clarence"
    Age: 13
    Email: "clarence@xxx.com"
    Phone: "1234-5680"
    Address: "125 Dummy Street"
    Attendance: 
      - "2021-05-01"
      - "2021-05-02"
- Student:
    Name: "David"
    Age: 16
    Email: "david@xxx.com"
    Phone: "1234-5681"
    Address: "126 Dummy Street"
    Attendance: 
      - "2021-05-01"
      - "2021-05-02"
- Student:
    Name: "Eva"
    Age: 15
    Email: "eva@xxx.com"
    Phone: "1234-5682"
    Address: "127 Dummy Street"
    Attendance: 
      - "2021-05-01"
      - "2021-05-02"
- Student:
    Name: "Frankie"
    Age: 15
    Email: "frankie@xxx.com"
    Phone: "1234-5683"
    Address: "128 Dummy Street"
    Attendance: 
      - "2021-05-01"
      - "2021-05-02"

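Unlike the XML and JSON examples, this block-style YAML cannot be collapsed onto a single line, because its structure depends on line breaks and indentation. (YAML also has a flow style that uses JSON-like brackets and braces and does not have this restriction.)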


Summary

In the comparison below, we rank the data formats from 1 to 4 on each criterion, where 1 is the best or most desirable.

Criterion                      CSV  XML  JSON  YAML
Ease of File Creation            1    3     2     4
Readability                      1    3     2     4
Data Parsing Speed               1    3     2     4
File Size                        1    4     3     2
Flexibility of Data Structure    4    2     2     1

Each data format has its own strengths, and which one to use depends on the actual use case (e.g. read/write frequency, how the data is stored). A good data analytics platform (like ALGOGENE) weighs the pros and cons and usually arrives at a hybrid solution to handle different situations.



 
Gupta
As far as I know, YAML is mostly used as config file. Any systems/applications that used YAML to store raw data???