Twitter Dog Dataset Analytics and Data Wrangling Report

THE CHALLENGE

There are dog lovers all over the world, and lately there has been a trend where people post pictures of their dogs and other dog lovers rate them. WeRateDogs is a popular Twitter account with over 9.2 million followers that posts and rates pictures of amazing dogs. It is a non-profit that sources professional dog ratings.

Screenshot_20220604-124808~2.png

My job was to use a dataset obtained from the WeRateDogs Twitter archive to show my data wrangling skills. This is the second project in the Udacity data analysis Nanodegree program. Let’s get into it.

While going through the WeRateDogs Twitter profile for this project, I became curious about which breed of dog is the most popular. Dogs at different stages are posted, so naturally I also wanted to know which dog stage is the most common and most popular, and whether there is any relationship between retweet counts and favorite counts. The dog ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10: 11/10, 12/10, 13/10, and so on. That made me curious about the most popular numerator rating for the dogs.

Screenshot_20220604-124909~2.png

THE PROCESS

The main objective of this project was to carry out the data wrangling process (gathering, assessing, and cleaning) on the WeRateDogs dataset from the Twitter archive, after which I stored, analyzed, and visualized the wrangled data.

STEPS OF THE ANALYSIS

• GATHERING THE DATA: Three datasets were used in this project.

FIRST DATASET: The first dataset was downloaded as a CSV file and read using the pandas read_csv function.

SECOND DATASET: The second dataset was downloaded programmatically using the Requests library.

THIRD DATASET: The last dataset was downloaded from the Twitter API using the Tweepy library.
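The three gathering routes could look roughly like the sketch below. It is an illustration rather than the code from my notebook: the column names, sample rows, and the tab-separated and line-delimited JSON layouts are assumptions, and small in-memory samples stand in for the real files and API responses so the example runs without network access.

```python
import io
import json

import pandas as pd

# 1) Archive: a CSV read with pandas.read_csv.
#    A tiny made-up sample stands in for the real archive file.
archive_csv = io.StringIO(
    "tweet_id,timestamp,text\n"
    "1,2017-08-01 16:23:56 +0000,This is a good dog. 13/10\n"
    "2,2017-08-01 00:17:27 +0000,Another good dog. 12/10\n"
)
archive = pd.read_csv(archive_csv)

# 2) Programmatic download: in a real run these bytes would come from
#    requests.get(url).content, where url is wherever the file is hosted.
#    The sample assumes a tab-separated file.
downloaded_bytes = (
    b"tweet_id\tp1\tp1_dog\n"
    b"1\tgolden_retriever\tTrue\n"
    b"2\tbagel\tFalse\n"
)
predictions = pd.read_csv(io.BytesIO(downloaded_bytes), sep="\t")

# 3) API data: assume each tweet fetched through Tweepy was saved as one
#    JSON object per line. The counts below are made up.
tweet_json_lines = [
    '{"id": 1, "retweet_count": 100, "favorite_count": 500}',
    '{"id": 2, "retweet_count": 80, "favorite_count": 300}',
]
records = [json.loads(line) for line in tweet_json_lines]
api_data = pd.DataFrame(records).rename(columns={"id": "tweet_id"})

print(archive.shape, predictions.shape, api_data.shape)
```

Renaming "id" to "tweet_id" gives all three tables a shared key, which makes the later merge straightforward.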

• ASSESSING THE DATA: The datasets were assessed visually using Google Sheets and programmatically using pandas functions. Nine quality issues and two tidiness issues were found.
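A programmatic assessment with pandas might include checks like the ones in this minimal sketch. The sample rows and column names are made up for illustration and are not taken from the real archive.

```python
import pandas as pd

# Made-up sample rows; the column names are assumptions.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 3],
        "name": ["Charlie", "a", "None", "None"],
        "rating_denominator": [10, 10, 10, 10],
        "in_reply_to_status_id": [None, None, 5.0, 5.0],
    }
)

df.info()                                     # dtypes and non-null counts per column
missing_per_column = df.isnull().sum()        # missing values in each column
duplicate_rows = int(df.duplicated().sum())   # rows that repeat an earlier row exactly
name_counts = df["name"].value_counts()       # surfaces odd values such as "a" or "None"

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(name_counts)
```

Checks like these complement the visual pass: the spreadsheet view is good for spotting strange values, while the counts show how widespread they are.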

QUALITY ISSUES

• Most rows that have a False p1 value were not dog related.

• p1, p2, and p3 should be in categorical format.

• The image num column should be removed, as it is not important to the analysis because only one image URL is in the dataset.

• Timestamp should be in datetime format.

• Some expanded URL entries in the CSV file contain more than one URL.

• Drop columns that do not have at least 1000 non-null values.

• There are missing rows.

• There are dogs whose names are not registered, but adverbs and verbs were assigned as names instead of None.

• Convert None strings to NaN.

• There are duplicated columns after merging all datasets.
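Several of these quality issues could be addressed along the lines of the sketch below. The column names, the sample rows, and the lowercase-name rule are assumptions made for illustration, and the non-null threshold is scaled down from 1000 to 2 to suit the tiny sample.

```python
import numpy as np
import pandas as pd

# Made-up sample; the column names are assumptions.
df = pd.DataFrame(
    {
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
        ],
        "p1": ["golden_retriever", "pembroke", "golden_retriever"],
        "name": ["Charlie", "quite", "None"],
        "img_num": [1, 1, 2],
        "retweeted_status_id": [np.nan, np.nan, 8.0],
    }
)

# Timestamp strings -> datetime values.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Prediction labels -> categorical dtype.
df["p1"] = df["p1"].astype("category")

# Drop the image number column.
df = df.drop(columns=["img_num"])

# Drop columns with too few non-null values (1000 in the project, 2 here).
min_non_null = 2
df = df.dropna(axis="columns", thresh=min_non_null)

# Treat the string "None" and all-lowercase words (assumed to be verbs or
# adverbs picked up by mistake) as missing names.
not_a_name = df["name"].eq("None") | df["name"].str.islower().eq(True)
df["name"] = df["name"].mask(not_a_name)

print(df.dtypes)
print(df["name"].tolist())
```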

TIDINESS ISSUES

• floofer, doggo, puppo, and pupper are in separate columns, but they should be in a single column.

• All datasets should be merged together.
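The two tidiness fixes could be sketched as follows. This is a minimal illustration with made-up rows: the "dog_stage" column name, the "None" placeholder strings, and the inner join on "tweet_id" are assumptions, not necessarily what the notebook does.

```python
import pandas as pd

# Made-up samples; the column names are assumptions.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "doggo": ["doggo", "None", "None"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "pupper", "None"],
        "puppo": ["None", "None", "None"],
    }
)
predictions = pd.DataFrame({"tweet_id": [1, 2, 3], "p1": ["pug", "pembroke", "chow"]})
api_data = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweet_count": [10, 20, 30], "favorite_count": [50, 60, 70]}
)

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join the stage labels present in a row, or return None if there are none."""
    stages = [value for value in row if value != "None"]
    return ",".join(stages) if stages else None


# Collapse the four stage columns into a single column.
archive["dog_stage"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)

# Merge the three datasets on the shared tweet identifier.
merged = archive.merge(predictions, on="tweet_id", how="inner").merge(
    api_data, on="tweet_id", how="inner"
)
print(merged)
```

Joining the labels with a comma keeps tweets that mention more than one stage instead of silently picking one.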

• CLEANING THE DATA: The dataset was cleaned using the define, code, and test method, and all the quality and tidiness issues noted were cleaned as stated below:

• Most rows that have a False p1 value were not dog related: the rows were removed.

• p1, p2, and p3 should be in categorical format: the columns’ data type was converted to categorical.

• Remove image num, as it is not important to the analysis because only one image URL is in the dataset: the image num column was removed.

• Timestamp should be in datetime format: the timestamp column was converted to the datetime data type.

• Some expanded URL entries contain more than one URL in one column: the expanded URL column was split into separate columns.

• Rows that have retweet values were dropped because we don’t want retweeted data.

• Columns that do not have at least 1000 non-null values should be dropped: the columns were dropped because most of their data is not available, which would affect the analysis.

• Missing rows should be dropped: the rows with missing values were dropped.

• There are dogs whose names are not registered, but verbs and adverbs were assigned instead of None: there are irregularities in the names of dogs that are not registered.

• INSIGHTS AND VISUALIZATION: After gathering, assessing, and cleaning the datasets, I proceeded to answer the questions asked earlier.

• WHICH BREED OF DOG IS THE MOST POPULAR? There are over one hundred dog breeds in the dataset, and the top fifty most popular are shown in the figure below:

Screenshot_20220605-234540~2.png

The top three dog breeds are:

• Golden Retriever

• Labrador Retriever

• Pembroke

The next time you are choosing a dog breed, you know the top three to go for.
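A ranking like this can be produced by counting tweets per predicted breed. The sketch below is a minimal illustration on made-up rows. It assumes popularity is measured by the number of tweets and that a boolean "p1_dog" column marks whether the top prediction is a dog breed; the notebook may measure popularity differently.

```python
import pandas as pd

# Made-up sample; "p1" is the predicted label, "p1_dog" whether it is a dog breed.
df = pd.DataFrame(
    {
        "p1": [
            "golden_retriever",
            "labrador_retriever",
            "golden_retriever",
            "bagel",
            "pembroke",
            "golden_retriever",
            "labrador_retriever",
        ],
        "p1_dog": [True, True, True, False, True, True, True],
    }
)

# Keep only predictions that are dog breeds, then count tweets per breed.
breed_counts = df.loc[df["p1_dog"], "p1"].value_counts()

# value_counts sorts in descending order, so head(3) gives the top three.
top_breeds = breed_counts.head(3)
print(top_breeds)
```

With matplotlib installed, calling top_breeds.plot(kind="barh") would draw the counts as a horizontal bar chart.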

DOG STAGES: The most common dog stages are highlighted in the figure below.

Screenshot_20220605-234900~2.png

*According to Dogtionary:

DOGGO: A big pupper, usually older.
FLOOFER: Dogs that have lots of fur.
PUPPER: A doggo that is inexperienced, unfamiliar, or in any way unprepared for the responsibilities associated with being a doggo.
PUPPO: A transitional phase between doggo and pupper.*

This analysis shows that dog lovers tend to go for inexperienced and untrained dogs so that they can train them to their liking and taste.

To get the best out of dogs the next time you are dog hunting, go for a Golden Retriever, Labrador Retriever, or Pembroke that is still in the pupper stage.

RELATIONSHIP BETWEEN RETWEET COUNTS AND FAVORITE COUNTS

Screenshot_20220606-000626~2.png

After removing outliers from the dataset, there is a positive correlation between the retweet count and the favorite count of dog tweets; that is, as retweets go up, likes also go up.
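The outlier removal and correlation check could be sketched as below. The 1.5 × IQR rule, the within_iqr helper, and the sample counts are assumptions chosen for illustration, since the outlier method is not specified above.

```python
import pandas as pd

# Made-up counts with one extreme tweet at the end.
df = pd.DataFrame(
    {
        "retweet_count": [100, 150, 200, 250, 300, 350, 400, 50000],
        "favorite_count": [400, 500, 800, 900, 1200, 1300, 1600, 90000],
    }
)


def within_iqr(series, k=1.5):
    """Return a boolean mask of values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Keep rows that are not outliers in either column.
mask = within_iqr(df["retweet_count"]) & within_iqr(df["favorite_count"])
trimmed = df[mask]

# Pearson correlation between the two counts.
r = trimmed["retweet_count"].corr(trimmed["favorite_count"])
print(len(trimmed), "rows kept; correlation:", round(r, 3))
```

A value of r close to 1 corresponds to the pattern described above: tweets with more retweets also tend to have more likes.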

WHAT NUMBER IS THE MOST POPULAR NUMERATOR RATING OF DOGS?

Screenshot_20220606-000626~3.png

Twelve is the most popular rating numerator for dogs.
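Finding the most frequent numerator comes down to a frequency count. Here is a minimal sketch on made-up ratings; the "rating_numerator" name is an assumption.

```python
import pandas as pd

# Made-up numerators for illustration.
ratings = pd.Series([12, 13, 12, 11, 10, 12, 13, 14], name="rating_numerator")

# Count how often each numerator appears.
numerator_counts = ratings.value_counts()

# idxmax returns the numerator with the highest count.
most_common = numerator_counts.idxmax()
print("most common numerator:", most_common)
```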

CONCLUSION

I hope this beautiful picture of a pupper Golden Retriever makes your day and that, after reading this article, you have more insight into which dog to choose the next time you are dog shopping. Lastly, here is the link to my notebook repository for your perusal: github.com/Monsurat-Onabajo/Data-Wrangling-..

Screenshot_20220606-000706~2.png

We Rate Dogs: A Data Analytics And Data Wrangling Project On We Rate Dog Twitter Data Archive