Creating an API using a Kaggle dataset in FastAPI

Additional features include pagination and rate limiting


FastAPI is a web framework in Python used to quickly create production-level APIs. It is the third most popular back-end framework in Python, behind Django and Flask, but it is quickly gaining popularity in the development community and currently has 50K+ stars on GitHub.

In this post we will create an API using a sample dataset from Kaggle, built with Postgres, SQLAlchemy, pandas, Postman and FastAPI. To start, set up a virtual environment and activate it. Then create a requirements.txt file in the root directory of your project and paste the following contents; these are the packages we will use for API development.

alembic==1.8.1
anyio==3.6.2
click==8.1.3
colorama==0.4.5
fastapi==0.85.1
fastapi-pagination==0.10.0
greenlet==1.1.3.post0
h11==0.14.0
idna==3.4
importlib-metadata==5.0.0
importlib-resources==5.10.0
Mako==1.2.3
MarkupSafe==2.1.1
numpy==1.23.4
pandas==1.5.1
psycopg2==2.9.4
pydantic==1.10.2
python-dateutil==2.8.2
python-multipart==0.0.5
pytz==2022.5
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.42
starlette==0.20.4
typing-extensions==4.4.0
uvicorn==0.19.0
zipp==3.9.0

The next obvious step is to install these packages (the exact commands are shown below). After this is done, we need a dataset on Kaggle to populate our data. I'll be using this dataset: https://www.kaggle.com/datasets/shiddharthsaran/dragon-ball-dataset. It contains character names and data from the Dragon Ball Z universe, a very popular Japanese anime.
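If you are starting from scratch, a typical setup looks like this (a sketch assuming a Unix-like shell; on Windows the activation command is venv\Scripts\activate):

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt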

We will perform some analysis and data cleaning on the dataset if required. For instance, we want to avoid null values in any of the columns; if any are present, we would replace them with suitable values using the pandas library. Luckily, the dataset we have does not need cleaning, as there are no null values. To verify this, create a folder called 'data', extract the downloaded dataset into it and rename the file to 'data.csv'. Let us check for null values across all the columns in the dataset.

import pandas as pd

df = pd.read_csv('data/data.csv')

print(df.isnull().sum())
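Running this script prints the null count per column; with this dataset the output looks along these lines (column names as they appear in the csv, all counts zero):

Character             0
Power_Level           0
Saga_or_Movie         0
Dragon_Ball_Series    0
dtype: int64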

As the output above shows, every column reports 0, which means there are no null entries; that makes our life a bit easier. Next, we will set up the database and models. Create a file called 'config.py' and paste the following snippet.

DATABASE_USERNAME = 'postgres'
DATABASE_PASSWORD = 'pass12345'
DATABASE_HOST = '127.0.0.1'
DATABASE_NAME = 'dbz-api'

This file holds the database variables. I assume you have Postgres and pgAdmin installed; if not, you will need to install them before reading further. Open the pgAdmin tool and create a database called 'dbz-api' (or create it from the command line, as shown below).
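If you prefer the terminal over pgAdmin, the same database can be created with psql (this sketch assumes the default 'postgres' superuser from config.py; the double quotes are needed because of the hyphen in the name):

psql -U postgres -h 127.0.0.1 -c 'CREATE DATABASE "dbz-api";'

Now create a 'db.py' file with the following contents.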

from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

import config

DATABASE_USERNAME = config.DATABASE_USERNAME
DATABASE_PASSWORD = config.DATABASE_PASSWORD
DATABASE_HOST = config.DATABASE_HOST
DATABASE_NAME = config.DATABASE_NAME

SQLALCHEMY_DATABASE_URL = f"postgresql://{DATABASE_USERNAME}:{DATABASE_PASSWORD}@{DATABASE_HOST}/{DATABASE_NAME}"

engine = create_engine(SQLALCHEMY_DATABASE_URL)

SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

Base = declarative_base()
metadata = MetaData()

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

This is the standard file we use for creating the database engine, session and metadata with SQLAlchemy in FastAPI applications. Next, we will create models and schemas for our database tables. We already know the column names from the data analysis we performed earlier: the dataset has four columns named Character, Power_Level, Saga_or_Movie and Dragon_Ball_Series, and we will design the columns of our model accordingly. Create a folder named 'api' and add four files inside it - models.py, router.py, schema.py, and services.py. Open the models.py file and save the given snippet.

from sqlalchemy import Column, String, Integer

from db import Base

class ApiData(Base):
    __tablename__ = "api_data"

    id = Column(Integer, primary_key=True, autoincrement=True)
    character = Column(String(100))
    power_level = Column(String(100))
    saga_or_movie = Column(String(100))
    dragon_ball_series = Column(String(100))

In this file we have defined the column names and their data types based on the data analysis of our dataset. Open the schema.py file in your favorite editor and paste the following code.

from typing import Optional
from pydantic import BaseModel

class ApiDataBase(BaseModel):
    id: Optional[int]
    character: str
    power_level: str

    class Config:
        orm_mode = True

This is similar to serialization in Django: we are defining the fields of the model we want to expose in our API responses. Often you will want only a subset of the columns defined in your model; here we've skipped the saga_or_movie and dragon_ball_series fields, but they can easily be added to the schema, as sketched below.
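For example, a variant that exposes every column could look like this (ApiDataFull is a hypothetical name used only for illustration, not elsewhere in this post):

class ApiDataFull(BaseModel):
    # same base fields as ApiDataBase
    id: Optional[int]
    character: str
    power_level: str
    # the two columns we skipped in ApiDataBase
    saga_or_movie: str
    dragon_ball_series: str

    class Config:
        orm_mode = True

Now that we have the schema ready, it is time to automatically create the tables in the database we connected to earlier, using Alembic. Create a migrations folder by typing the following command in the terminal.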

alembic init alembic

This creates a migrations folder named 'alembic' inside the project. Inside it you will find an env.py file. Open this file and make the changes shown below: import our models and database configuration, and add support for automatic table creation through the metadata.

from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool
import config as dbConfig

from alembic import context
from db import Base

from api.models import ApiData

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
fileConfig(config.config_file_name)

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = Base.metadata

# other values from the config, defined by the needs of env.py,
# can be acquired:
# my_important_option = config.get_main_option("my_important_option")
# ... etc.

def get_url():
    db_user = dbConfig.DATABASE_USERNAME
    db_password = dbConfig.DATABASE_PASSWORD
    db_host = dbConfig.DATABASE_HOST
    db_name = dbConfig.DATABASE_NAME
    return f"postgresql://{db_user}:{db_password}@{db_host}/{db_name}"

def run_migrations_offline():
    """Run migrations in 'offline' mode.

    This configures the context with just a URL
    and not an Engine, though an Engine is acceptable
    here as well.  By skipping the Engine creation
    we don't even need a DBAPI to be available.

    Calls to context.execute() here emit the given string to the
    script output.

    """
    url = get_url()
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online():
    """Run migrations in 'online' mode.

    In this scenario we need to create an Engine
    and associate a connection with the context.

    """
    configuration = config.get_section(config.config_ini_section)
    configuration["sqlalchemy.url"] = get_url()
    connectable = engine_from_config(
        configuration, prefix="sqlalchemy.", poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )

        with context.begin_transaction():
            context.run_migrations()

if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()

Once done with this step, we are ready to generate the migrations and apply them. Type the following two commands in sequence.

alembic revision --autogenerate -m "Added api data table"
alembic upgrade head

The first command generates the migration file and its entries, while the second applies it and creates the actual table. We will now revisit the 'api' folder and work on creating routes and services for our app. Open the 'router.py' file and paste the following code; we will walk through it in a moment.

from fastapi import APIRouter, Depends, status
from sqlalchemy.orm import Session
import pandas as pd
from fastapi_pagination import Page, Params, paginate

import db

from . import services
from .schema import ApiDataBase

router = APIRouter(
    tags=["DBZ"],
    prefix='/api'
)


@router.get('/', status_code=status.HTTP_200_OK, response_model=Page[ApiDataBase])
async def api_data(params: Params = Depends(), database: Session = Depends(db.get_db)):
    result = services.get_api_data(database)
    return paginate(result, params)


@router.post('/populate', status_code=status.HTTP_201_CREATED)
async def create_api_data(database: Session = Depends(db.get_db)):
    df = pd.read_csv('data/data.csv')
    for index, data in df.iterrows():
        result = services.create_new_data(data['Character'], data['Power_Level'], data['Saga_or_Movie'], data['Dragon_Ball_Series'], database)
    return {
        'message': 'Data populated successfully'
    }

This is the finalized version of the router.py file. We have prefixed this router with '/api', so all the routes it defines are accessible under localhost:8000/api, e.g. localhost:8000/api?page=2&size=20 for paginated results. Pagination is added using the fastapi-pagination package: https://github.com/uriyyo/fastapi-pagination. The functions that fetch data from the database and hand it to the route handlers live in the 'services.py' file.

from .models import ApiData

def create_new_data(character, power_level, saga_or_movie, dragon_ball_series, database) -> ApiData:

    new_data = ApiData(character=character, 
                        power_level=power_level,
                        saga_or_movie=saga_or_movie,
                        dragon_ball_series=dragon_ball_series)
    database.add(new_data)
    database.commit()
    database.refresh(new_data)
    return new_data


def get_api_data(database):
    data = database.query(ApiData).all()
    return data

To populate data from the csv file we have created an end-point called '/populate'. Hitting this end-point with a POST request triggers the csv file read, and a database entry is added for each row. The data is read with pandas, and the column values of each row are passed to the create_new_data function defined in services.py, which creates the entries in the database and commits them. We can delete this end-point once the table has been populated, since it is only required once in our case.
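For example, once the server is running (we set it up on port 8000 later in this post), the table can be populated with a single request from the terminal or from Postman:

curl -X POST http://localhost:8000/api/populate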

Let us now take a moment to talk more about the GET request we have for our data.

@router.get('/', status_code=status.HTTP_200_OK, response_model=Page[ApiDataBase])
async def api_data(params: Params = Depends(), database: Session = Depends(db.get_db)):
    result = services.get_api_data(database)
    return paginate(result, params)

For more information on pagination, please refer to the GitHub page of the package we used: github.com/uriyyo/fastapi-pagination. In short, the route returns a Page instance as its response model, which reads the 'page' and 'size' URL params and returns the corresponding slice of items. In the image below I've tested the GET end-point with pagination params passed.

[Image: Postman response for the paginated GET end-point]
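If you don't have Postman handy, the same request can be made with curl; the response envelope looks roughly like this (the field names come from fastapi-pagination's Page model, and the item values below are purely illustrative):

curl "http://localhost:8000/api/?page=2&size=20"

{
  "items": [
    {"id": ..., "character": "...", "power_level": "..."},
    ...
  ],
  "total": ...,
  "page": 2,
  "size": 20
}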

Time to create our 'main.py' file and import the router module into it. The main file is responsible for running the FastAPI application with the uvicorn server on localhost.

from fastapi import FastAPI, Request
import uvicorn
from api import router as api_router

import db

app = FastAPI(title="DBZ API",
    docs_url="/docs",
    version="0.0.1")

@app.get("/")
async def main(request: Request):
    return {
        'message': 'DBZ API'
    }

app.include_router(api_router.router)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
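With the __main__ block above, the server can be started by running the file directly; during development you may prefer uvicorn's auto-reload (both commands assume you are in the project root and the file is named main.py):

python main.py
uvicorn main:app --reload --port 8000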

With this, our API is ready and accessible on port 8000 on localhost. FastAPI also ships automatic documentation, which you can access by hitting the '/docs' URL we specified in the settings. That concludes this blog post; please post any queries in the comments. Thanks for the read.