Building a Data Lake with AWS

Hi Peepz,

It's officially Day 3 of the #DevOpsAllStarsChallenge.

#DevOpsAllStarsChallenge is a challenge that aims to equip DevOps engineers with the in-demand skills required of them.

Project Architecture Diagram

Project Overview

A data lake is a centralized storage system for large volumes of data, including structured, semi-structured, and unstructured data. Data lakes store data in its original format and can power analytics, machine learning, and other business intelligence workloads.

In this post I will walk you through how to leverage AWS services like AWS Glue and Amazon Athena to catalog and query data stored in an S3 bucket, and how to write the processed results back to S3.

Project Structure

nba-data-lake/
├── policies/
│ ├── iam-role.json
├── src/
│ ├── setup_nba_data_lake.py
├── .env
├── .gitignore
├── cleanup.py
├── README.md
└── requirements.txt

Prerequisites

Before you begin, make sure you have the following:

1. Sportsdata.io Account

Sign up for a SportsData.io account and acquire your NBA API key by following their onboarding instructions.

2. AWS Account

Set up your AWS account:

  • Create a free-tier AWS account.

  • Log in to the AWS Management Console.

  • Create a unique IAM user with programmatic access.

  • Generate an Access Key and Secret Key for authentication.

  • Initialize the AWS CLI with your credentials:

aws configure

3. Python Environment

Ensure Python 3.7 or higher is installed, then install the project dependencies:

python3 --version
pip --version
pip install -r requirements.txt

If Python is not installed, download it from the official website.

4. Environment Variables

Configure the .env file with the following values:

AWS_BUCKET_NAME=your_bucket_name
SPORTS_DATA_API_KEY=your_api_key
NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players
AWS_REGION=your_region
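
The repo most likely loads these values with a library such as python-dotenv, but the format is simple enough to parse by hand. Here is a minimal, illustrative loader (my own sketch, not the repo's actual code) that turns the file above into a dictionary:

```python
from pathlib import Path

def load_env(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and malformed entries.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```

With the .env above in place, `load_env(".env")["AWS_BUCKET_NAME"]` returns your bucket name.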

Walkthrough

Clone the repository:

git clone https://github.com/Kasnoma/NBA-Data-Lake.git
cd nba-data-lake

Run this command to execute the script:

python3 src/setup_nba_data_lake.py
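
Under the hood, the setup script presumably creates the S3 bucket, uploads the raw JSON, and registers the Glue database and table via boto3. One piece worth sketching is how a Glue table schema can be derived from a sample player record; this is an illustrative helper under that assumption, not the repo's actual code:

```python
def infer_glue_columns(record):
    """Map a sample JSON record's Python value types to Glue/Athena column types."""
    type_map = {str: "string", int: "int", float: "double", bool: "boolean"}
    return [
        # Fall back to "string" for types not in the map (e.g. None, nested objects).
        {"Name": key, "Type": type_map.get(type(value), "string")}
        for key, value in record.items()
    ]
```

The resulting list matches the `Columns` shape that Glue's `create_table` API expects inside `StorageDescriptor`.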

Verify that the resources have been created in your AWS management console:

  • S3 Bucket: Confirm the creation of a bucket containing the raw NBA data in JSON format under the folder raw-data/nba_player_data.json.

  • Glue Database and Table: Ensure the database (glue_nba_data_lake) and table (nba_players) exist in AWS Glue.

Query Data in Amazon Athena:

Navigate to Amazon Athena in the AWS Management Console, and run the following SQL query:

SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';
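
If you want to run variations of this query programmatically (for example via boto3's Athena client), it helps to build the SQL from a parameter instead of hard-coding the position. A small hypothetical helper, assuming the table name `nba_players` from above:

```python
def players_by_position(position, table="nba_players"):
    """Build the Athena SQL for players at a given position.

    Athena string literals use single quotes, so any embedded
    single quote in the value is doubled to escape it.
    """
    safe = position.replace("'", "''")
    return (
        f"SELECT FirstName, LastName, Position, Team "
        f"FROM {table} WHERE Position = '{safe}';"
    )
```

Calling `players_by_position("SG")` gives you the same query shape for shooting guards.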

Conclusion

Transform raw sports data into actionable insights with just a few steps. Feel free to experiment with additional queries and integrate the data lake with advanced analytics tools such as Amazon QuickSight for visualizations.

Check out my post on Medium.

Let’s connect: LinkedIn | Medium