Getting started with Amazon RDS + Python + Postgres
Create, connect, and interact with a remote PostgreSQL database using Amazon RDS and Python
This article will guide you through creating a database instance with Amazon Relational Database Service (RDS) and interacting with the remote database server using only Python.
Step 1: Launch Your AWS Database Instance
First, we need an AWS account. Go to the AWS portal and create a new account, or use an existing one.
What is Amazon RDS?
Amazon Relational Database Service (RDS) is a managed service for setting up relational databases in the cloud. Each cloud database we provision is called a database instance. So, first, click on Services and select Database to navigate to the RDS Dashboard:
AWS Services Console. Select RDS for launching a new database instance.
Next, to create a new database, click on “Create database”. Database instances are flexible, and you can configure them to meet your needs. You can choose from a range of engine types, including Amazon Aurora, MySQL, PostgreSQL, and others. Beyond the engine type, you can also configure the instance type and size, allocate storage, and adjust settings such as monitoring and backups.
Now, select PostgreSQL as the engine to run the database on. In this post, we use the Free tier template. With the free tier, we get a small db.t2.micro instance with 1 vCPU and 1 GiB of memory.
Next, in the Settings section specify your database instance information: instance name and the master user credentials to log in to your instance. You may also specify a port you want to use to access the database. For PostgreSQL installations, the default port is 5432.
If you want to give your database instance a public IP address (so devices outside the VPC can connect to it), choose Yes for Publicly accessible in the Connectivity section. I chose Yes so that my database is reachable from my local machine.
Next, proceed to the Additional configuration section and specify the name of the database that Amazon RDS should create along with the instance. For this tutorial, I am going to create a database named titanic.
Remember: a database name and a database instance name are two different names. If you do NOT specify a database name, Amazon RDS does NOT create a database when it creates the database instance.
Finally, click on Create database. It may take a few minutes to launch the database. The new database instance appears on the Amazon RDS console, and you may connect to it once its state changes to Available.
The DB instance on the Amazon RDS console
Step 2: Connect to the PostgreSQL database
Once the database is created, we can interact with it. Here, we connect to our database from a Jupyter notebook using the Python package psycopg2:
Note, you don’t even need to install Postgres on your local machine. You only need to install the psycopg2 package by typing in your terminal: pip install psycopg2
We open a connection to Postgres using the connect() method, passing five parameters to psycopg2. The first is the host, which specifies where your database is running. In our case it is a remote RDS instance, so the host is the Endpoint value from the Amazon RDS console (see the screenshot below). Next, we pass the username and password. You may start with the master user credentials you created earlier; later, we will add a new user as a way to improve security and separation of data. Finally, we pass the database name (which is different from the database instance name!). This is the name you entered in Additional configuration, and in my case it was titanic.
Amazon RDS console for a test database instance
Last but not least, we need to create a cursor object to interact with and execute commands on the Postgres database.
Step 3: Interacting with the database from Python
Once we are connected to our database, it’s time to add some data. For this tutorial, I took the Titanic dataset from Kaggle (train.csv). After some pre-processing, I split the file into three CSV files: passengers.csv, tripInfo.csv, and survival.csv. The link to my repo with the processed dataset and notebook is at the end of the article.
3.2. Creating tables
Well, now I am going to copy data from the CSV files into the database, but first I need to create three tables that fit my dataset. For this purpose, we can use the CREATE TABLE command. Pay attention: the columns should be in the same order as in the CSV files, with their respective types.
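A minimal sketch of the table creation step is below. The column names and types are hypothetical, based on a plausible split of the Titanic dataset; adjust them to match the columns in your own CSV files.

```python
# Hypothetical schemas for the three CSV files -- adjust to your own data.
CREATE_TABLES = """
CREATE TABLE IF NOT EXISTS passengers (
    passenger_id INTEGER PRIMARY KEY,
    name         TEXT,
    sex          TEXT,
    age          REAL
);
CREATE TABLE IF NOT EXISTS trip_info (
    passenger_id INTEGER PRIMARY KEY,
    pclass       INTEGER,
    fare         REAL,
    embarked     TEXT
);
CREATE TABLE IF NOT EXISTS survival (
    passenger_id INTEGER PRIMARY KEY,
    survived     INTEGER
);
"""

def create_tables(connection):
    """Create the three tables, then commit the transaction."""
    with connection.cursor() as cursor:
        cursor.execute(CREATE_TABLES)
    connection.commit()  # persist the changes
```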
Remember, we need to commit our changes by running connection.commit() at the end of every transaction. Finally, you can always query the INFORMATION_SCHEMA.COLUMNS table to display the schemas of all public tables; in particular, we are interested in the column names and data types. In Python, we can also fetch this remote data using the pandas method pandas.read_sql().
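A short sketch of that schema check, assuming pandas is installed and a connection is already open:

```python
import pandas as pd

# List column names and data types of all tables in the public schema.
SCHEMA_QUERY = """
SELECT table_name, column_name, data_type
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
"""

def show_schemas(connection):
    # pandas.read_sql executes the query and returns the result as a DataFrame
    return pd.read_sql(SCHEMA_QUERY, connection)
```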
A sample output after executing the SQL query.
3.3. Loading data into a Postgres table from CSV
Once we have our tables, we want to copy data from CSV files. Fortunately, it is super simple with psycopg2 using the copy_from() method:
Loading data into a Postgres table from CSV
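A minimal sketch of the loading step, assuming each CSV file has a header row and its columns match the corresponding table (the file and table names in the usage comments are the hypothetical ones from the table sketch above):

```python
def load_csv(connection, csv_path, table_name):
    """Copy rows from a CSV file into an existing table using copy_from()."""
    with open(csv_path, "r") as f:
        next(f)  # skip the header row; copy_from expects raw data rows
        with connection.cursor() as cursor:
            cursor.copy_from(f, table_name, sep=",")
    connection.commit()

# Usage:
# load_csv(conn, "passengers.csv", "passengers")
# load_csv(conn, "tripInfo.csv", "trip_info")
# load_csv(conn, "survival.csv", "survival")
```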
We can also create a new user and grant privileges to manage our database:
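For example, a read-only user could be created like this. The user name and password below are hypothetical; choose your own, and widen or narrow the granted privileges to fit your needs.

```python
# Hypothetical read-only user -- pick your own name and password.
CREATE_USER_SQL = """
CREATE USER titanic_reader WITH PASSWORD 'a-strong-password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO titanic_reader;
"""

def create_read_only_user(connection):
    """Create a user that may only read the public tables."""
    with connection.cursor() as cursor:
        cursor.execute(CREATE_USER_SQL)
    connection.commit()
```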
3.4. Execute queries
Now, our remote database is ready, and we can start fetching data from our tables using pandas.read_sql(). Let’s say we are interested in the survival rate for each passenger class. Below is a simple SQL query with an INNER JOIN to do that:
It is very convenient to use pandas to execute SQL, because it reads the query result directly into a DataFrame:
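A sketch of that query, assuming the hypothetical trip_info and survival tables from earlier, joined on passenger_id:

```python
import pandas as pd

# Average of the 0/1 survived flag per class = survival rate per class.
SURVIVAL_RATE_QUERY = """
SELECT t.pclass,
       AVG(s.survived::float) AS survival_rate
FROM trip_info AS t
INNER JOIN survival AS s
    ON t.passenger_id = s.passenger_id
GROUP BY t.pclass
ORDER BY t.pclass;
"""

def survival_rate_by_class(connection):
    # read_sql executes the query and returns the result as a DataFrame
    return pd.read_sql(SURVIVAL_RATE_QUERY, connection)
```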
The survival rate for each passenger’s class
Delete the database Instance
Last notes. It is good practice to delete AWS database instances you are no longer using; otherwise, you will keep getting charged for them. Go back to the AWS RDS console, select your instance, and click Delete.
Delete the database Instance
My repo with processed dataset and notebook is here.