
If you want to practice building the models and visualizations yourself, we’ll be using the following R packages:

In R programming, predictive models are extremely useful for forecasting future outcomes and estimating metrics that are impractical to measure. For example, data scientists could use predictive models to forecast crop yields based on rainfall and temperature, or to determine whether patients with certain traits are more likely to react badly to a new medication.

Before we talk about linear regression specifically, let’s remind ourselves what a typical data science workflow might look like. A lot of the time, we’ll start with a question we want to answer, and do something like the following:

1. Collect some data relevant to the problem (more is almost always better).
2. Clean, augment, and preprocess the data into a convenient form, if needed.
3. Conduct an exploratory analysis of the data to get a better sense of it.
4. Using what you find as a guide, construct a model of some aspect of the data.
5. Use the model to answer the question you started with, and validate your results.

Linear regression is one of the simplest and most common supervised machine learning algorithms that data scientists use for predictive modeling. In this post, we’ll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure.

We’ll use R in this blog post to explore this data set and learn the basics of linear regression. If you’re new to learning the R language, we recommend our R Fundamentals and R Programming: Intermediate courses from our R Data Analyst path. It will also help to have some very basic statistics knowledge, but if you know what a mean and standard deviation are, you’ll be able to follow along.
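
To make the workflow described in this section a little more concrete, here is a minimal sketch of how its steps might look in R for the cherry tree problem. It assumes the data set in question is R's built-in `trees` data (columns `Girth`, `Height`, and `Volume`), which this section does not name explicitly. The single-predictor model and the `new_tree` value are illustrative choices only, not necessarily the model the post goes on to build.

```r
# Assumption: the cherry tree data is R's built-in `trees` data set
# (31 felled black cherry trees; columns Girth, Height, Volume).
data(trees)

# Steps 1-2: collect and clean. The built-in data needs little work,
# so we only confirm that there are no missing values.
stopifnot(!anyNA(trees))

# Step 3: explore. Summary statistics and pairwise correlations.
print(summary(trees))
print(cor(trees))

# Step 4: model. Regress volume on girth, which is easier to measure.
fit <- lm(Volume ~ Girth, data = trees)

# Step 5: answer and validate. Predict volume for a hypothetical tree
# and check how much of the variance in volume the model explains.
new_tree <- data.frame(Girth = 15)  # hypothetical girth value
predicted_volume <- predict(fit, newdata = new_tree)
r_squared <- summary(fit)$r.squared
print(predicted_volume)
print(r_squared)
```

Everything here uses base R (`lm()`, `predict()`, `summary()`), so the sketch should run in a fresh session without installing anything.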