Regression
Regression is a way of understanding the empirical relationship between variables, where the relationship is expressed in the form of an estimated linear equation developed from data. A demand curve, for example, would be such a line, relating the quantity bought and the price. Often, regression is a way of testing hypotheses about the relationship between 2 or more variables(one is called dependent variable, others are called independent variables). Multiple regression is a way of testing for these two way relationships when the effects of other variables is being held constant. Note: even though we call the variables dependent and independent, regression does not test for causality between the two variables. It tests for association between them (similar to correlation). The cause-effect relationship (eg the directions of causality) between two variables is possible to infer from theory, or from patterns of complex statistical results.
Regression analysis essentially displays the scatter of data between two variables. The regression line is essentially the best fitting straight line through the scatter.
If we were doing a cost analysis and trying to determine fixed costs and marginal costs we could use regression. So, if we had data on each year of operations, or each month of operations, or each day we would use those data to specify the following equation:
Total costs = A + b (Volume of output)
A would be an estimate of what costs would be if volume were 0. This is fixed costs! The coefficient “b” is the estimated change in costs when we add (or subtract) one unit of output—this is the marginal costs. Average costs can easily be calculated in the raw data.
If we had some other variable that changes and might help explain why the cost/volume relationship might “shift” during the data period we are studying, then we could add it to the regression. For example if we had a monthly data set on costs and volumes we might want to note (and control for) the fact that the last 6 data points were from a time when we were open in the evenings (and the other data points were from times when we were not open in the evenings). So our regression is going to be
Total costs = A + b (V) + c (evening open) this =1 for the last 6 months, and = 0 otherwise)
So, A and B still mean what we said, though their values may change a bit with the new model. The coefficient “c” tells us how our total costs change (per month, per day or per week depending on what our data is) when we are open evenings, compared to our costs when we are not open in the evenings.
Regression coefficient estimates have important interpretations in economics. In the above example of the cost regression, we might have an estimated equation
Total monthly cost = 50000 + 100(V) + 20 (evening open)
the coefficient estimate “b” is interpreted at the change in total cost when we increase the volume (V) by one unit (this might be the number of clients we saw each day. This is called variable (or marginal) cost. The 50000 is the estimate of fixed costs (which we incur independent of the volume we produce.
In the case where we might have estimated a demand curve, such as
Quantity sold = A + b (price) + c (household income in 000s) + d(competitor’s price)
We could estimate it from data and get:
Quantity sold = 500 – 15 (price) + 20 (income ) + 30 (competitors price)
- What this means is that if our price was zero, and income were zero and the competitors price was zero we would sell 500 units (silly, but it tells us where the demand curve crosses the horizontal axis (eg where out price is zero)
- It says that if we increase our price by $1, we’d sell 15 fewer units of the product (other stuff like income and competitors prices staying constant)
- it says that if household income were 1000 higher on average, we could expect to sell 20 more units (other things held constant, like our price and competitor prices. Is this a normal or an inferior good?
- it says that if competitors lowered their prices by $1 we would sell 30 fewer units (other things the same. Are they a substitute or a complement?
Excel does regression. Look under tools to see if you can add in the “data analysis” add in. If you have it, find it under the Data tab. You can do descriptive analyses and other things with “data analysis”, but scroll down to regression. It will ask you to highlight the column of data that represents the dependent variable. Usually it is best to highlight the name of the variable and all the data in the column. Then it will ask to designate the independent variable, and you do the same thing. And then make sure to check the box that says “include the data labels” (because you highlighted the data labels too). If you have 2 or more independent variables you can include them in the model. You do this by putting all these variables in adjacent columns, and highlighting all of them in one fell swoop. Note, excel will not do regression if cells are missing data, or if there is a non numeric value in a cell (a comma, etc.). You will get some message when you push the regression button to run the model, and you’ll have to locate the problem, and possibly through away one of the observations.
i did a simple regression on excel and have attached it below. The data set below was used to do a regression to understand the factors associated with the size of hospital bills across a bunch of patients (eg the variable called “ charges “). I ran a regression analysis to test three relationships:
- Does age matter to the size of bill
- Does the category of the age matter to the size of the bill
- Does severity of the diagnosis/procedure matter to the size of the bill
basically what i did in excel was to go to the data analysis, regression page–and key in the cell location of the dependent variable (in this case, hospital charges for 39 patients) which variation in i was trying to explain by 3 independent variables — age, age category, severity. I had to put these variables in adjacent columns and key in the cell locations of these three things. In the regressions they came out as three unnamed variables since i neglected to select the column heading.
I explain the results on the sheet showing what excel produced as results—much of which is not important at this stage of the game.
You could use the data set to create a different model , say one that used only patient severity as a independent variable to explain charges.
| charges | age | age category | Severity | dr code# | Female=1 | admit | disch |
| 8,254 | 57 | 2 | 2 | 730 | 1 | 1/1/2004 | 1/3/2004 |
| 24,655 | 43 | 1 | 4 | 730 | 1 | 1/1/2004 | 1/9/2004 |
| 27,234 | 81 | 3 | 4 | 730 | 0 | 1/2/2004 | 1/13/2004 |
| 21,345 | 56 | 2 | 3 | 730 | 0 | 1/9/2004 | 1/14/2004 |
| 2,417 | 17 | 1 | 1 | 730 | 1 | 1/3/2004 | 1/4/2004 |
| 5,420 | 61 | 2 | 1 | 730 | 1 | 1/4/2004 | 1/6/2004 |
| 18,823 | -61 | 2 | 2 | 730 | 1 | 1/6/1944 | 1/12/2004 |
| 20,280 | 61 | 2 | 3 | 730 | 1 | 1/6/2004 | 1/11/2004 |
| 4,360 | 44 | 1 | 1 | 730 | 0 | 1/2/2004 | 1/5/2004 |
| 22,382 | 90 | 3 | 3 | 730 | 1 | 1/2/2004 | 1/6/2004 |
| 12,673 | 39 | 1 | 3 | 730 | 1 | 1/4/2004 | 1/10/2004 |
| 22,632 | 70 | 3 | 4 | 730 | 1 | 1/3/2004 | 1/11/2004 |
| 22,642 | 77 | 3 | 4 | 730 | 0 | 1/3/2004 | 1/13/2004 |
| 14,111 | 85 | 3 | 2 | 730 | 0 | 1/5/2004 | 1/11/2004 |
| 9,763 | 52 | 2 | 2 | 730 | 1 | 1/6/2004 | 1/13/2004 |
| 13,343 | 65 | 2 | 2 | 730 | 0 | 1/7/2004 | 1/11/2004 |
| 4,886 | 54 | 2 | 1 | 730 | 1 | 1/4/2004 | 1/7/2004 |
| 22,712 | 87 | 3 | 3 | 730 | 0 | 1/4/2004 | 1/14/2004 |
| 7,194 | 50 | 2 | 2 | 730 | 1 | 1/3/2004 | 1/7/2004 |
| 24,809 | 73 | 3 | 3 | 730 | 0 | 1/3/2004 | 1/15/2004 |
| 9,405 | 62 | 2 | 1 | 730 | 1 | 1/2/2004 | 1/7/2004 |
| 9,990 | 63 | 2 | 1 | 499 | 1 | 1/2/2004 | 1/6/2004 |
| 24,042 | 67 | 3 | 3 | 499 | 1 | 1/1/2004 | 1/20/2004 |
| 17,591 | 68 | 3 | 4 | 499 | 0 | 1/2/2004 | 1/10/2004 |
| 10,864 | 85 | 3 | 2 | 499 | 0 | 1/3/2004 | 1/9/2004 |
| 3,535 | 20 | 1 | 2 | 499 | 1 | 1/2/2004 | 1/3/2003 |
| 6,042 | 61 | 2 | 1 | 499 | 0 | 1/4/2004 | 1/6/2004 |
| 11,908 | 59 | 2 | 1 | 499 | 0 | 1/4/2004 | 1/10/2004 |
| 24,121 | 86 | 3 | 44 | 499 | 0 | 1/5/2004 | 1/21/2004 |
| 15,600 | 72 | 3 | 3 | 499 | 1 | 1/5/2004 | 1/11/2004 |
| 25,561 | 92 | 3 | 4 | 499 | 0 | 1/4/2004 | 1/19/2004 |
| 2,499 | 39 | 1 | 1 | 499 | 0 | 1/6/2004 | 1/7/2004 |
| 12,423 | 69 | 3 | 3 | 499 | 1 | 1/6/2004 | 1/9/2004 |
| 24,980 | 71 | 3 | 4 | 499 | 1 | 1/7/2004 | 1/19/2004 |
| 19,873 | 59 | 2 | 3 | 499 | 0 | 1/8/2004 | 1/22/2004 |
| 21,311 | 92 | 3 | 4 | 499 | 1 | 1/6/2004 | 1/12/2004 |
| 15,969 | 60 | 2 | 3 | 499 | 1 | 1/5/2004 | 1/11/2004 |
| 16,574 | 72 | 3 | 3 | 499 | 0 | 1/7/2004 | 1/13/2004 |
| 24,214 | 89 | 3 | 3 | 499 | 0 | 1/7/2004 | 1/19/2004 |
| SUMMARY OUTPUT | ||||||||||||||
| this result was obtained by entering Y data (dependent variable as a2:a40. And, the independent or right hand side variables were entered as b2:d40. | ||||||||||||||
| Regression Statistics | there are 39 observations (patients). The last panel of excel output contains the basic hypothesis testing results.
Charges (the billed amount) is definitely related |
|||||||||||||
| Multiple R | 0.665009 | to the categorical age variable (the second x variable). We know this because the p value <.05. Each increment in the independent variable Age Cat is associated with an increase in Charges by $6740. Charges are not associated with the other independent variable (p is not < .05 for each. The first panel of results tells us about the overall regression model. The R squared statistic tells us that the three independent variables (age, age cat, and severity) explain about 44% of the variation in charges across the patients. The F test (second output table) tells us that the P value is <.05 (actually, p=.000121)and this model is explaining a significant amount of the variations in patient charges. | ||||||||||||
| R Square | 0.442236 | |||||||||||||
| Adjusted R Square | 0.394428 | |||||||||||||
| Standard Error | 5977.152 | |||||||||||||
| Observations | 39 | |||||||||||||
| about 44% of the variation in charges across these patients. | ||||||||||||||
| df | SS | MS | F | Significance F | ||||||||||
| Regression | 3 | 9.91E+08 | 3.3E+08 | 9.250197 | 0.000121 | |||||||||
| Residual | 35 | 1.25E+09 | 35726344 | |||||||||||
| Total | 38 | 2.24E+09 | ||||||||||||
| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |||||||
| Intercept | 373.5686 | 3227.067 | 0.115761 | 0.908504 | -6177.73 | 6924.863 | -6177.73 | 6924.863 | ||||||
| X Variable 1 | -17.8091 | 47.40916 | -0.37565 | 0.709446 | -114.055 | 78.4366 | -114.055 | 78.4366 | ||||||
| X Variable 2 | 6740.121 | 1766.655 | 3.815188 | 0.000531 | 3153.621 | 10326.62 | 3153.621 | 10326.62 | ||||||
| X Variable 3 | 198.3377 | 148.8665 | 1.33232 | 0.191365 | -103.877 | 500.5528 | -103.877 | 500.5528 | ||||||
I could have experimented with other regression models and tried to see if the doctor mattered in charge levels, or if gender mattered.