Analyzing Netflix Movies and TV Shows Dataset – A Simple Data Science Project
This project explores a dataset of Netflix titles to uncover insights about content type, release trends, and popular genres. It's a great beginner data science project using Python and pandas.
Tools Used: Python, pandas, matplotlib, seaborn
Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load the Dataset
Download the dataset from Kaggle: “Netflix Movies and TV Shows”
df = pd.read_csv("netflix_titles.csv")
df.head()
Step 3: Basic Information
df.info()
df.isnull().sum()
Fill missing values in 'country' or 'director' if needed.
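One way to handle those gaps, as a minimal sketch: replace missing entries with an explicit "Unknown" label so they still appear in counts and plots. The helper name `fill_missing_labels` and the tiny sample frame are made up for illustration; with the real data you would pass `df` instead.

```python
import pandas as pd

def fill_missing_labels(frame, columns=("country", "director"), label="Unknown"):
    """Return a copy of frame with missing values in the given columns replaced by label."""
    filled = frame.copy()
    for column in columns:
        filled[column] = filled[column].fillna(label)
    return filled

# Hypothetical stand-in for a few rows of netflix_titles.csv
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, None],
    "country": ["India", None, "Japan"],
})

cleaned = fill_missing_labels(sample)
print(cleaned.isnull().sum())
```

Working on a copy keeps the original frame untouched, so you can still check how many values were missing before the fill.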
Step 4: Data Cleaning
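# Strip stray whitespace first; dates that still cannot be parsed become NaT instead of raising
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')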
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
Step 5: Data Visualization
Content Type Count
sns.countplot(data=df, x='type', palette='pastel')
plt.title("Content Type Distribution")
plt.show()
Top 10 Countries by Number of Titles
top_countries = df['country'].value_counts().head(10)
top_countries.plot(kind='barh', color='coral')
plt.title("Top 10 Countries with Netflix Content")
plt.show()
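Note that `value_counts()` on the raw column treats a co-production string such as "United States, India" as its own category. Assuming the `country` column stores multiple countries as a comma-separated string, a rough sketch of counting each country separately could look like this. The helper name `count_individual_values` and the sample values are illustrative only.

```python
import pandas as pd

def count_individual_values(column, top_n=10):
    """Count each comma-separated entry on its own and return the top_n most frequent."""
    return (
        column.dropna()                  # skip rows with no value at all
        .str.split(",")                  # "A, B" -> ["A", " B"]
        .explode()                       # one row per list element
        .str.strip()                     # remove the space left after each comma
        .loc[lambda s: s != ""]          # drop empty pieces from trailing commas
        .value_counts()
        .head(top_n)
    )

# Hypothetical sample; with the real data, pass df['country']
sample_countries = pd.Series(
    ["United States, India", "India", None, "United States, Canada"]
)
country_counts = count_individual_values(sample_countries)
print(country_counts)
```

The result is a Series, so it can be plotted the same way as `top_countries` with `.plot(kind='barh')`.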
New Titles Added by Year
df['year_added'].dropna().astype(int).value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title("Content Added by Year")
plt.xlabel("Year")
plt.ylabel("Number of Titles")
plt.show()
Step 6: Genre and Duration Analysis
Most Common Genres
df['listed_in'].value_counts().head(10)
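The line above counts whole genre combinations (for example "Dramas, International Movies") rather than individual genres. If `listed_in` holds genres separated by ", ", one possible way to count single genres is to build indicator columns with `str.get_dummies` and sum them. This is a sketch under that assumption; `genre_counts` is a made-up helper name and the sample values are illustrative.

```python
import pandas as pd

def genre_counts(listed_in, sep=", "):
    """Count how many titles carry each individual genre label."""
    # One 0/1 column per genre, one row per title
    indicators = listed_in.dropna().str.get_dummies(sep=sep)
    return indicators.sum().sort_values(ascending=False)

# Hypothetical sample; with the real data, pass df['listed_in']
sample_genres = pd.Series([
    "Dramas, International Movies",
    "Comedies",
    "Dramas, Comedies",
])
counts = genre_counts(sample_genres)
print(counts.head(10))
```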
Movie Duration Distribution
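# Copy the subset so adding a column does not trigger SettingWithCopyWarning
df_movies = df[df['type'] == 'Movie'].copy()
# Strip the " min" suffix; rows with a missing duration become NaN instead of raising
df_movies['duration'] = pd.to_numeric(df_movies['duration'].str.replace(' min', ''), errors='coerce')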
df_movies['duration'].plot(kind='hist', bins=20, color='purple')
plt.title("Movie Duration Distribution")
plt.xlabel("Minutes")
plt.show()
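A histogram gives a visual impression of where durations cluster. To put a number on it, you could compute the share of movies whose runtime falls inside a given range. The sketch below uses a hypothetical helper, `share_in_range`, and invented sample durations; with the real data you would pass `df_movies['duration']`.

```python
import pandas as pd

def share_in_range(minutes, low=80, high=120):
    """Return the fraction of non-missing durations between low and high (inclusive)."""
    valid = minutes.dropna()
    return float(valid.between(low, high).mean())

# Hypothetical durations in minutes, including one missing value
sample_minutes = pd.Series([45, 85, 90, 100, 118, 150, None, 95])
share = share_in_range(sample_minutes)
print(f"{share:.0%} of sample movies run 80 to 120 minutes")
```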
Step 7: Conclusion
- Most of the content on Netflix is movies.
- The United States contributes the most to Netflix's catalog.
- Content uploads have grown significantly over recent years.
- Most movies are between 80 and 120 minutes long.
What's Next?
- Create visual dashboards using Plotly or Power BI.
- Perform sentiment analysis on Netflix show descriptions.
- Apply clustering to group similar titles based on metadata.