Let’s explore the dataset.
We will explore the employee earnings reports for the City of Boston for the years 2018 to 2022. The datasets contain valuable information about the earnings of city employees, and we will analyze and visualize key aspects.
Loading the data for each year, paying close attention to encoding issues to ensure accurate interpretation
Data Overview
Before delving into analysis, let’s get a sense of the structure of each dataset. We’ll start with a glimpse of the first few rows and examine the basic information.
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 23605 entries, 0 to 23604
Data columns (total 12 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 NAME 23605 non-null object
1 DEPARTMENT_NAME 23605 non-null object
2 TITLE 23605 non-null object
3 REGULAR 21939 non-null object
4 RETRO 13643 non-null object
5 OTHER 13351 non-null object
6 OVERTIME 6826 non-null object
7 INJURED 1312 non-null object
8 DETAIL 2355 non-null object
9 QUINN/EDUCATION INCENTIVE 1351 non-null object
10 TOTAL EARNINGS 23605 non-null object
11 POSTAL 23605 non-null object
dtypes: object(12)
memory usage: 2.2+ MB
None
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 23312 entries, 0 to 23311
Data columns (total 12 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 NAME 23312 non-null object
1 DEPARTMENT_NAME 23312 non-null object
2 TITLE 23312 non-null object
3 REGULAR 23312 non-null object
4 RETRO 23312 non-null object
5 OTHER 23312 non-null object
6 OVERTIME 23312 non-null object
7 INJURED 23312 non-null object
8 DETAIL 23312 non-null object
9 QUINN/EDUCATION INCENTIVE 23312 non-null object
10 TOTAL EARNINGS 23312 non-null object
11 POSTAL 23312 non-null int64
dtypes: int64(1), object(11)
memory usage: 2.1+ MB
None
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 21858 entries, 0 to 21857
Data columns (total 12 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 NAME 21858 non-null object
1 DEPARTMENT_NAME 21858 non-null object
2 TITLE 21858 non-null object
3 REGULAR 21263 non-null object
4 RETRO 342 non-null object
5 OTHER 13482 non-null object
6 OVERTIME 6781 non-null object
7 INJURED 1420 non-null object
8 DETAIL 2101 non-null object
9 QUINN / EDUCATION INCENTIVE 1383 non-null object
10 TOTAL EARNINGS 21858 non-null object
11 POSTAL 21765 non-null object
dtypes: object(12)
memory usage: 2.0+ MB
None
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 12 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 NAME 22546 non-null object
1 DEPARTMENT_NAME 22546 non-null object
2 TITLE 22546 non-null object
3 REGULAR 21908 non-null object
4 RETRO 402 non-null object
5 OTHER 14129 non-null object
6 OVERTIME 6846 non-null object
7 INJURED 1456 non-null object
8 DETAIL 2059 non-null object
9 QUINN_EDUCATION_INCENTIVE 1386 non-null object
10 TOTAL_GROSS 22546 non-null object
11 POSTAL 22546 non-null object
dtypes: object(12)
memory usage: 2.1+ MB
None
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 23204 entries, 0 to 23203
Data columns (total 12 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 NAME 23204 non-null object
1 DEPARTMENT_NAME 23204 non-null object
2 TITLE 23204 non-null object
3 REGULAR 22604 non-null object
4 RETRO 3092 non-null object
5 OTHER 15826 non-null object
6 OVERTIME 6812 non-null object
7 INJURED 1221 non-null object
8 DETAIL 2116 non-null object
9 QUINN_EDUCATION 1369 non-null object
10 TOTAL_ GROSS 23204 non-null object
11 POSTAL 23204 non-null object
dtypes: object(12)
memory usage: 2.1+ MB
None
From the output, we observe that each dataset contains several columns, including NAME, DEPARTMENT_NAME, TITLE, and various components of earnings such as REGULAR, RETRO, OTHER, OVERTIME, and more.
Now lets combine the all dataset and check the null values
NAME 6DEPARTMENT_NAME 6TITLE 6REGULAR 48080RETRO 97394OTHER 71225OVERTIME 94047INJURED 110542DETAIL 108001QUINN/EDUCATION INCENTIVE 113180TOTAL EARNINGS 67614POSTAL 99 REGULAR 69956 RETRO 90877 OTHER 77737 OVERTIME 84438 INJURED 89799 DETAIL 89118 QUINN/EDUCATION INCENTIVE 91219 QUINN / EDUCATION INCENTIVE 113148 TOTAL EARNINGS 92673QUINN_EDUCATION_INCENTIVE 113145TOTAL_GROSS 91985QUINN_EDUCATION 113162TOTAL_ GROSS 91327dty
From the output, we observe varying degrees of missing data across different columns, highlighting potential areas for further investigation.
Conclusion
We’ve loaded, examined, and combined Boston employee earnings data from 2018 to 2022. The next steps in our journey involve a deeper dive into the datasets, addressing missing values, and uncovering meaningful insights. Stay tuned for upcoming posts where we conduct in-depth analyses and visualize trends within this rich dataset.