To receive proper care and achieve better health outcomes, patients must get to their appointments on time. As a company that aims to remove transportation as a barrier to healthcare, Roundtrip must keep its rideshares reliable and punctual. To ensure on-time rides, the main characteristics of late rides need to be identified and explained.
Multiple linear regression was chosen for this task. To confirm that a multiple linear regression was appropriate for the relationship between ETA and the independent variables, independent variables that were highly correlated with one another were removed, and the residuals were verified to be normally distributed. Additionally, the data showed linear relationships between ETA and the chosen independent variables. Next, multiple combinations of categorical and continuous independent variables were tested to find the highest possible adjusted R-squared.
Chart 1. Normal Distribution of Residuals for this model
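Adjusted R-squared is the selection criterion mentioned above because, unlike plain R-squared, it penalizes every additional predictor. The original analysis was run in R; the Python sketch below only illustrates the standard formula, and the sample sizes, predictor counts, and R-squared values in it are hypothetical, not figures from the Roundtrip data.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Standard formula: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical comparison: the larger model has a slightly higher plain
# R-squared, but it uses 20 more predictors to get there.
small_model = adjusted_r_squared(0.200, n_obs=1000, n_predictors=5)
large_model = adjusted_r_squared(0.201, n_obs=1000, n_predictors=25)

# The penalty outweighs the tiny gain, so the smaller model is preferred.
print(round(small_model, 4), round(large_model, 4))
```

Comparing candidate variable combinations on this quantity, rather than on plain R-squared, avoids rewarding a model simply for having more terms.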
The adjusted R-squared of the final ETA regression was 0.1899, meaning that about 18.99% of the variation in ETA is explained by the model. This model included Geo Coordinate ID, Rideshare Type, Weekday Name, and the quadratic functional form of Military Hour. Geo Coordinate ID ranges from 1 to 8 and indicates the regional location of the ride. The quadratic form of Military Hour was included to capture the hypothesized relationship in which ETA is higher early in the morning and late at night and lower during the middle of the day. Almost all of the categorical and continuous independent variables were found to be statistically significant (see Chart 2 below). The only exceptions were the Weekday Name categories S and T. Because the Weekday Name variable classified each day by its first letter, Saturday and Sunday were lumped into one category, as were Tuesday and Thursday. As a result, the estimates for these categories were skewed, and the regression in R could not separate the individual effects of Saturday, Sunday, Tuesday, and Thursday on ETA.
Chart 2. Regression Output from R
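The Weekday Name problem described above is easy to reproduce. The following Python sketch (not the original R code; the function name and the `key_len` parameter are made up for illustration) groups weekday names by a prefix and shows which days collide when only the first letter is kept.

```python
from collections import defaultdict

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def group_by_prefix(names, key_len=1):
    """Group names by their first `key_len` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:key_len]].append(name)
    return dict(groups)


# First-letter coding: two categories each absorb two different days.
first_letter = group_by_prefix(WEEKDAYS, key_len=1)
collisions = {k: v for k, v in first_letter.items() if len(v) > 1}
print(collisions)
# {'T': ['Tuesday', 'Thursday'], 'S': ['Saturday', 'Sunday']}

# A three-letter prefix keeps all seven days distinct.
three_letter = group_by_prefix(WEEKDAYS, key_len=3)
print(len(three_letter))  # 7
```

With first-letter coding the model sees only five weekday categories, so any coefficient for S or T is a blend of two days. A three-letter abbreviation or the full name would be one way to keep the days separate.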
Because the dataset included more than 27,500 rides with a wide variety of characteristics, this level of explanatory power was considered moderately high. The regression performed well on this dataset; however, it generalized poorly to Roundtrip’s larger 3-year ride reports dataset. To avoid overfitting the model to the smaller dataset, the focus shifted to performing regression analysis directly on the 3-year dataset.
The 3-year ride reports dataset includes more specific information about the ride, the patient, and the driver. It also includes demographic data showing differences in population, density, and homeownership for the locations where patients are picked up or dropped off. As with the ETA regression, all assumptions required for a multiple linear regression were tested and verified in RStudio. However, the dependent variable in this model changed from ETA to minutes late to pick up the patient. As a result, the relationship between ride characteristics and lateness could be observed directly.
At first, the direct cause of lateness could not be identified because the dataset included rides ranging anywhere from early to late. To isolate the specific characteristics influencing lateness, the data was segmented into ten tiles. Analysis was then performed only on Tile 1 and Tile 10 to reveal the contrast between very late rides and on-time or early rides.
After the regressions were run separately, the cause of lateness became much clearer. First, the Tile 10 regression resulted in a negative R-squared, indicating that the multiple linear regression fit worse than a horizontal line. This confirms the hypothesis that the characteristics chosen as potential causes of lateness had no significant relationship to on-time rides. The Tile 1 regression, however, told an entirely different story. It showed statistically significant relationships between minutes late to pick up the patient and the following variables: trip vehicle type of "taxi" or "sedan no car seat", trip reason of psychiatric or COVID positive, military hour, and multiple home ownership, density, and population tiles.
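The segmentation step can be sketched in a few lines of Python. This is an illustration, not the original procedure: it assumes the tiles are equal-sized groups ordered by minutes late, with Tile 1 holding the latest rides and Tile 10 the earliest, which is how the tiles are described in this analysis. The function name and sample values are hypothetical.

```python
def assign_tiles(minutes_late, n_tiles=10):
    """Return one tile number (1..n_tiles) per ride.

    Assumption: tile 1 holds the latest rides and tile n_tiles the
    earliest, with rides split into groups of (nearly) equal size.
    """
    n = len(minutes_late)
    # Ride indices ordered from most late to least late (or earliest).
    order = sorted(range(n), key=lambda i: minutes_late[i], reverse=True)
    tiles = [0] * n
    for rank, idx in enumerate(order):
        tiles[idx] = rank * n_tiles // n + 1
    return tiles


# Hypothetical data: 20 rides, from 10 minutes early to 9 minutes late.
sample = list(range(-10, 10))
tiles = assign_tiles(sample)

# Rides falling into the two extreme tiles, as used for the regressions.
tile_1 = [m for m, t in zip(sample, tiles) if t == 1]
tile_10 = [m for m, t in zip(sample, tiles) if t == 10]
print(tile_1, tile_10)  # [8, 9] [-10, -9]
```

Fitting separate models on the two extreme tiles then contrasts the latest rides with the earliest ones without the middle of the distribution diluting the signal.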
Chart 3. Example Relationship of Military Hour and Minutes Late
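The statement that a negative R-squared means a fit worse than a horizontal line follows from the definition R² = 1 − SS_res / SS_tot, where the horizontal line at the mean scores exactly zero. The Python sketch below illustrates this with made-up numbers. Note that plain in-sample R-squared from a least-squares fit with an intercept cannot be negative, so a negative value typically comes from adjusted R-squared or from evaluating on different data; the write-up does not specify which applies to Tile 10.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, measured against the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up values, chosen only to show the sign of the result.
actual = [1.0, 2.0, 3.0, 4.0]

horizontal_line = [2.5, 2.5, 2.5, 2.5]   # always predict the mean
poor_model = [4.0, 3.0, 2.0, 1.0]        # predictions run the wrong way

print(r_squared(actual, horizontal_line))  # 0.0
print(r_squared(actual, poor_model))       # -3.0
```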
The Tile 1 regression resulted in an adjusted R-squared of about 0.0891, which means it has low explanatory power. Although these models produced very low R-squared values, the regressions were successful in identifying problem characteristics associated with late rides. Once these characteristics were recognized, the average minutes late associated with each of them was verified through Power BI analysis. In each case, a variable found to be statistically significant in the regressions also showed a very high average minutes late in Power BI.
After this analysis, the next step was to determine how to apply these findings to facilitate more on-time rides. To that end, a ranking system was developed using the 14 characteristics that caused the latest rides.
To establish the ranking system, the ride counts and average minutes late for each of these characteristics were analyzed. From this, a five-point ranking system was established that could then be extended to all characteristics causing late rides.
Chart 4. Ranking System with Error
Each characteristic a ride possesses receives a rank from 1 to 5, and the ranks of all the characteristics are then averaged to produce the ride score. In the chart above, Range shows the average predicted minutes late, while Error shows the possible variation in minutes late for the pick-up time. The navigation center or patient will then receive a scorecard for the ride, shown below.
Chart 5. Ride scorecard
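The scoring step described above, in which each characteristic receives a rank from 1 to 5 and the ranks are averaged, can be sketched as follows. This is a minimal Python illustration, not Roundtrip's implementation: the function name, the characteristic labels, and the rank values are all hypothetical, and the direction of the scale (whether 5 or 1 marks the latest rides) is not stated in the write-up.

```python
def ride_score(characteristic_ranks):
    """Average the 1-5 ranks of a ride's characteristics into one score."""
    ranks = list(characteristic_ranks.values())
    if not ranks:
        raise ValueError("a ride needs at least one ranked characteristic")
    for rank in ranks:
        if not 1 <= rank <= 5:
            raise ValueError(f"rank {rank} is outside the 1-5 scale")
    return sum(ranks) / len(ranks)


# Hypothetical ride: labels and ranks are illustrative only.
example_ride = {
    "trip_vehicle_type": 4,
    "trip_reason": 5,
    "military_hour": 3,
}

print(ride_score(example_ride))  # 4.0
```

A plain average weights every characteristic equally, which keeps the score easy to explain on a scorecard.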
Since implementation, this ride score has predicted minutes late about 45% of the time. Ultimately, the addition of this scorecard will enable Roundtrip to book patients’ trips earlier to ensure a reliable pick-up time. Furthermore, this scoring system will help guarantee punctual service, thereby improving riders’ access to timely healthcare and supporting better health outcomes.