Creating a Database: Converting a Spreadsheet to a Relational Database (Part 2)

Yohan Jeong
8 min readAug 28, 2020

--

Part 2: Creating a database and tables in MySQL, and Importing files into tables

Photo by Caspar Camille Rubin on Unsplash

A Relational Database Management System (RDMS) is a program that allows us to create, update, and manage a relational database. Structured Query Language (SQL) is a programming language used to communicate with data stored in the RDMS. The SQL skill for using a RDMA is required for many data-related positions these days. In the social media forums like Quora or Reddit, there are many people who search for a public database for practicing their SQL querying skills. However, although there are many public data sets in a single spreadsheet, there are not many public databases online. Even if you found a data set for the topic you have interest in, the format of the data is usually just one spreadsheet, not a database for most cases. Therefore, it will be very useful to know how to convert a data set in one spreadsheet to a relational database. Knowing the process of this conversion can give us many chances to practice SQL querying skills with a variety of databases.

This is the second article of a two part series. The goal of this series is to show how to create a relational database for SQL. The whole process is to convert a data in one spreadsheet to a relational database for SQL. In the previous article, the Entity Relationship Diagram (ERD) for Sample Sales Data was created. The snapshot for the original data set and the ERD created in the previous practice are as follows:

The Original Data Set: 25 Columns and 2924 Rows with Headers
The ERD for the Sample Sales Data

What I am going to do in this post is to practically create a relational database using MySQL based on the data model created in the previous article.

The data sets and the sql scripts used for this practice are here.

Creating a Database

The first thing to do is to create a database. I name the database sample_sales_data. The syntax for creating a database is CREATE DATABASE database_name. Therefore, write the statement for creating a database with the name of sample_sales_data as follows:

CREATE DATABASE sample_sales_data;

In order to keep creating tables in this database, write the statementUSE sample_sales_data;.

Creating Tables

The statement for creating a table in SQL is CREATE TABLE table_name (...);. For each row in the bracket (…), column_name data_type constraints, should be included. The SQL code for creating the tables for this practice is as follows:

CREATE TABLE Territories (
TerritoryID INT NOT NULL AUTO_INCREMENT,
Territory VARCHAR(5) NOT NULL,
PRIMARY KEY (TerritoryID)
);
CREATE TABLE Countries (
CountryID INT NOT NULL AUTO_INCREMENT,
Country VARCHAR(20) NOT NULL,
TerritoryID INT,
PRIMARY KEY (CountryID),
FOREIGN KEY (TerritoryID)
REFERENCES Territories (TerritoryID)
ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE Cities (
CityID INT NOT NULL AUTO_INCREMENT,
City VARCHAR(20) NOT NULL,
State VARCHAR(20),
CountryID INT,
PRIMARY KEY (CityID),
FOREIGN KEY (CountryID)
REFERENCES Countries (CountryID)
ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE Customers (
CustomerID INT NOT NULL AUTO_INCREMENT,
CustomerName VARCHAR(50) NOT NULL,
ContactLastName VARCHAR(20),
ContactFirstName VARCHAR(20),
Phone VARCHAR(25),
AddressLine1 VARCHAR(60),
AddressLine2 VARCHAR(20),
PostalCode VARCHAR(15),
CityID INT,
PRIMARY KEY (CustomerID),
FOREIGN KEY (CityID)
REFERENCES Cities (CityID)
ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE Status (
StatusID INT NOT NULL AUTO_INCREMENT,
Status VARCHAR(15),
PRIMARY KEY (StatusID)
);
CREATE TABLE Orders (
OrderID INT NOT NULL AUTO_INCREMENT,
OrderNumber INT(5),
OrderDate DATE,
Quarter INT,
Month INT,
Year INT,
StatusID INT,
CustomerID INT,
PRIMARY KEY (OrderID),
FOREIGN KEY (StatusID)
REFERENCES Status (StatusID)
ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (CustomerID)
REFERENCES Customers (CustomerID)
ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE ProductLines (
ProductLineID INT NOT NULL AUTO_INCREMENT,
ProductLine VARCHAR(20),
PRIMARY KEY (ProductLineID)
);
CREATE TABLE Products (
ProductID INT NOT NULL AUTO_INCREMENT,
ProductCode CHAR(9),
MSRP INT,
ProductLineID INT,
PRIMARY KEY (ProductID),
FOREIGN KEY (ProductLineID)
REFERENCES ProductLines (ProductlineID)
ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE DealSizes (
SizeID INT NOT NULL AUTO_INCREMENT,
DealSize VARCHAR(6),
PRIMARY KEY (SizeID)
);
CREATE TABLE Orders_Products (
OrderID INT NOT NULL,
ProductID INT NOT NULL,
Quantity INT,
PriceEach FLOAT,
OrderLineNumber INT,
Sales FLOAT,
SizeID INT,
PRIMARY KEY (OrderID , ProductID),
FOREIGN KEY (OrderID)
REFERENCES Orders (OrderID)
ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (ProductID)
REFERENCES Products (ProductID)
ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (SizeID)
REFERENCES DealSizes (SizeID)
ON DELETE CASCADE ON UPDATE CASCADE
);

Ten tables are created as the ERD suggests. Basically, most of the tables are similar to each other in terms of the shape and format except for the number of columns and whether the table has foreign keys or not. Therefore, let’s first look at Countries.

CREATE TABLE Countries (
CountryID INT NOT NULL AUTO_INCREMENT,
Country VARCHAR(20) NOT NULL,
TerritoryID INT,
PRIMARY KEY (CountryID),
FOREIGN KEY (TerritoryID)
REFERENCES Territories (TerritoryID)
ON DELETE CASCADE ON UPDATE CASCADE
);
  • Countries is the table_name.
  • CountryID, Country, TerritoryID are the column_names.
  • INT and VARCHAR() are the data_types. CountryID and TerritoryID have INT data type, and Country has VARCHAR data type. The explanation for the data types in SQL is here.
  • NOT NULL and AUTO_INCREMENT are the constraints for columns. Since CountryID and Country do not have null values, the NOT NULL constraint is added. The AUTO_INCREMENT constraint is to allow a unique number to be generated automatically when a new record is inserted into a table.
  • CountryID is the primary key for this table. A primary key constraint uniquely identifies each record in a table. Since CountryID has unique values and identifies the country name, it is set to be the primary key for this table. The primary key constraint can be added directly as follows:CountryID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,. Or it can be added separately as follows:
CountryID INT NOT NULL AUTO_INCREMENT, 
...
PRIMARY KEY (CountryID),
  • TerritoryID is the foreign key. A foreign key is a column to refer to a primary key in another table and is used to link two tables. TerritoryID is basically the primary key for Territories, and Countries refers to this TerritoryID to indicate which territory a country belongs to. The foreign key constraint can be added separately after the table is created using ALTER TABLE table_name ADD statement. Or like the example above, it can be added at the last part of the CREATE TABLE statement, and the syntax is as follows:
FOREIGN KEY (column_name) 
REFERENCES referenced_table(referenced_column)
  • In this example, the referenced table is Territories. Since the foreign key refers to the primary key in the referenced table, it can be affected when rows in the referenced table are deleted or updated. ON DELETE CASCADE ON UPDATE CASCADE is one of the options for what to do for the rows in the referencing table when deleting or updating the rows in the referenced table.

A primary key and foreign key are generally different. But there is a special case when the primary key and foreign key have the same columns like Orders_Products.

CREATE TABLE Orders_Products (
OrderID INT NOT NULL,
ProductID INT NOT NULL,
...
PRIMARY KEY (OrderID , ProductID),
FOREIGN KEY (OrderID)
REFERENCES Orders (OrderID)
ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (ProductID)
REFERENCES Products (ProductID)
ON DELETE CASCADE ON UPDATE CASCADE,
...
);

Inserting Records into Tables

The SQL statement for inserting new records in a table is INSERT INTO and its basic syntax is as follows:

INSERT INTO table_name(column1,column2,column3,…) VALUES (value1,value2,value3,…);

If we insert each row of records one by one, we can use the code above. However, the tables for Orders, Products, and Orders_Products have hundreds or thousands of records. In this case, it is much more convenient to create a file which contains all the records for a table and make MySQL automatically import the file into the table.

Creating Files to be Imported into Tables

There are several formats of files which can be imported in MySQL. The file format used in this practice is comma-separated values (CSV). In the CSV format, each column is separated by a comma(,).

When the tables were created, the AUTO_INCREMENT constraints were added for the primary keys (the Orders_Products is exceptional). Therefore, when we create files to be imported, the columns for the primary keys should not be included. For example, the table for Countries has three columns: CountryID, Country, and TerritoryID. Since CountryID is the primary key and has the AUTO_INCREMENT constraint, it should be excluded in the file.

Assuming the table for Territory is already created, what is in the file for Countries should be as follows:

The steps to create the file of these records from the original spreadsheet are as follows:

  • In Excel, copy and paste the columns for Country and Territory from the original data set into a new sheet.
  • Remove duplicates.
  • Create a column for TerritoryID using VLOOKUP function or the mix of INDEX and MATCH functions.
  • Remove the column for Territory.
  • Save the file in the CSV format.

Importing Files

The MySQL basic syntax for importing a file here is as follows:

mysql> LOAD DATA INFILE ‘.../file_name.csv’ 
INTO TABLE table_name
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’ (column_1,column_2,...,column_n);

Each field and line in the CSV file is basically separated by , and \n, respectively. However, simply following the code above for Customers has a problem since some information in Customers such as CustomerName includes , in its text and MySQL recognizes it as the signal to terminate the field. For this case, we can add OPTIONALLY ENCLOSED BY ‘“’ next to FIELDS TERMINATED BY ‘,’ as follows:

mysql> LOAD DATA INFILE ‘.../customers.csv’ 
INTO TABLE Customers
FIELDS TERMINATED BY ‘,’
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY ‘\n’ (CustomerName,ContactLastName,
ContactFirstName,Phone,AddressLine1,AddressLine2, PostalCode,
CityID);

Following the steps from the previous section to create the imported files automatically adds double quotes outside of the texts which include , as follows:

Therefore, the statement OPTIONALLY ENCLOSED BY ‘“’ orders MySQL to not recognize the , inside the text as the signal for terminating the field.

Conclusion

Through the series of the articles, I show the practice of creating a relational database. Especially, a data set stored in a spreadsheet was used, and it was converted to a relational database using MySQL. The goal of the articles is to show you the entire process of converting a data set in a single spreadsheet to a relational database so that they can have a chance to obtain a variety of databases and to practice their SQL querying skills with different databases.

--

--

Yohan Jeong

Business Analyst at Samsung Electronics America. PhD in Economics from University of California, Davis.