stat: influence in regression analysis



rajeshj
8th November 2005, 07:32 AM
hello,
I have a new problem today in another area, statistics. I have data for the parameters Y, X1, X2, X3, X4, ... with n observations each. Y is said to depend on the Xi, and the Xi are said to be independent among themselves, such that

Y(1) = a1*X1(1) + a2*X2(1) + a3*X3(1) + ...
Y(2) = a1*X1(2) + a2*X2(2) + a3*X3(2) + ...
...
Y(n) = a1*X1(n) + a2*X2(n) + a3*X3(n) + ...

My question is: how can I find out which Xi has the greater influence on Y? To repeat, I have plenty of Xi, and I want a statistical analysis that tells me which Xi explain the variability in Y to the greatest extent. I hope that is clear. An early reply would be appreciated.
Thank you for making me continue,
Rajesh

deba
9th November 2005, 03:45 AM
Hi,

I suggest you try multiple regression analysis. After regressing the dependent parameter (Y) on the independent parameters (the Xi), you will be able to obtain a correlation matrix like this:
VARIABLE       X1        X2        X3         Y
X1         1.0000   -0.0849    0.5633   -0.7176
X2        -0.0849    1.0000    0.2339    0.0028
X3         0.5633    0.2339    1.0000    0.3381
Y         -0.7176    0.0028    0.3381    1.0000

Now, the independent parameter with the highest correlation with Y (consider the magnitude only) has the maximum effect on Y, and similarly for the others. Going by this in the present example, X1 affects Y most, followed by X3 and then X2.
Well, I think this might answer your question to some extent. However, I must caution you to also check this against any other sources that you are aware of or come across.

best wishes,
deba
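
The ranking described above can be reproduced in a few lines. This is only a sketch using NumPy (an assumption, the thread itself mentions MATLAB), and the helper name `rank_by_correlation` is made up for illustration. It takes the matrix posted above and sorts the Xi by the magnitude of their correlation with Y.

```python
import numpy as np

def rank_by_correlation(R, names):
    """Sort the predictors by |correlation with Y|.

    R is a full correlation matrix whose last row/column is Y.
    From raw data (rows = observations, last column = Y) it could be
    obtained with np.corrcoef(data, rowvar=False).
    """
    r_with_y = R[:-1, -1]                      # correlations of each Xi with Y
    order = np.argsort(-np.abs(r_with_y))      # largest magnitude first
    return [(names[j], float(r_with_y[j])) for j in order]

# The matrix posted above (X1, X2, X3, Y)
names = ["X1", "X2", "X3", "Y"]
R = np.array([
    [ 1.0000, -0.0849, 0.5633, -0.7176],
    [-0.0849,  1.0000, 0.2339,  0.0028],
    [ 0.5633,  0.2339, 1.0000,  0.3381],
    [-0.7176,  0.0028, 0.3381,  1.0000],
])

ranking = rank_by_correlation(R, names)
for name, r in ranking:
    print(f"{name}: r = {r:+.4f}")
```

This gives the order X1, X3, X2, matching the reading above. It only looks at each Xi on its own, so it says nothing about how the Xi overlap with each other.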

rajeshj
9th November 2005, 05:00 AM
hello deba,
I actually sorted the independent variables on the basis of correlation. But does a significant correlation yield a good regression fit? I have as many as 140 variables significant at the 95% level, so among those I want to select a few that make a good regression. I think you implied the partial correlation between the two, X and Y, and I think that an X with significant correlation may not prove vital in regression. Is that correct? If you know this area well, please tell me a few more details. Are there any established methods for such a measure? Thanks for the reply.
Rajesh
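
A small made-up example may help with the question of whether a strongly correlated X can still be unimportant in the regression. The sketch below uses NumPy and simulated data (both assumptions, nothing here comes from the actual data set): X2 is nearly a copy of X1, and Y depends on X1 only. The helper `residualize` is a hypothetical name.

```python
import numpy as np

def residualize(v, z):
    """Residuals of v after regressing it on z (with an intercept)."""
    A = np.column_stack([np.ones(len(v)), z])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

# Simulated data, for illustration only
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)        # X2 is nearly a copy of X1
y = 2.0 * x1 + 0.5 * rng.normal(size=n)   # Y depends on X1 only

# Simple correlation of X2 with Y (large, because X2 tracks X1)
r_marginal = np.corrcoef(x2, y)[0, 1]

# Partial correlation of X2 with Y once X1 is accounted for
r_partial = np.corrcoef(residualize(x2, x1), residualize(y, x1))[0, 1]

# Multiple regression of Y on both X1 and X2
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"corr(X2, Y)              = {r_marginal:+.3f}")
print(f"partial corr(X2, Y | X1) = {r_partial:+.3f}")
print(f"regression coefficients (intercept, X1, X2) = {np.round(coef, 3)}")
```

In this setup X2 has a large simple correlation with Y, but its partial correlation and its regression coefficient are both close to zero, because everything it explains is already explained by X1.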

rocksea
10th November 2005, 05:18 AM
I think there are statistical ways to analyze and quantify the "influence".
For example, Cook's distance:
"Cook's distance is a metric for deciding whether a particular point alone affects
regression estimates much. After a regression is run one can consider for each
data point how far it is from the means of the independent variables and the
dependent variable. If it is far from the means of the independent variables it
may be very influential and one can consider whether the regression results are
similar without it."

Reference for Cook's distance:
Draper NR, Smith H. Applied Regression Analysis (3rd edition). New York: Wiley 1998.

I am not familiar with this area; maybe someone else or Google can help.
There should be other methods for measuring influence too.
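
For anyone who wants to try it, here is a rough sketch of Cook's distance using the usual textbook formula D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2, where e_i is the residual, h_ii the leverage, p the number of fitted coefficients and s^2 the residual variance. It assumes NumPy and ordinary least squares with an intercept; the function name and the data are made up. Note that this measures the influence of individual observations (rows), not of the Xi themselves.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every observation of an OLS fit with intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]                              # number of fitted coefficients
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    Q, _ = np.linalg.qr(A)                      # reduced QR of the design matrix
    h = np.sum(Q**2, axis=1)                    # leverages = diagonal of hat matrix
    s2 = resid @ resid / (n - p)                # residual variance estimate
    return (resid**2 / (p * s2)) * h / (1.0 - h) ** 2

# Made-up data: 30 points near a line plus one planted odd point
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=30)
x = np.append(x, 3.0)    # far from the other x values
y = np.append(y, 0.0)    # and far off the line

d = cooks_distance(x.reshape(-1, 1), y)
most_influential = int(np.argmax(d))
print("most influential observation:", most_influential)
```

The planted point (index 30) comes out with by far the largest distance, so one would then check whether the regression results change much without it.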

deba
10th November 2005, 10:33 AM
Hi, Rajesh.
Yes, you are correct regarding the partial correlation. I am not very familiar with advanced statistical techniques, but with my present knowledge I can say that you are correct in saying that "an X with significant correlation may not prove vital in regression", given your present problem, i.e., 140 independent variables at 95% significance.

However, putting aside the problem of such a large number of variables, I can suggest two statistical techniques that may be of use to you.
1) Sensitivity Analysis, 2) Saliency Analysis.

In the first case, you first carry out a normal multiple regression.
Then you take one parameter and increase/decrease its values by a fixed percentage, keeping the values of all other parameters as they are, and carry out the regression again. Next you increase/decrease the values of that same variable by another fixed percentage and proceed as above. After carrying out each regression, you need to look at statistical parameters such as correlation and standard deviation/RMS error (look at the overall statistics, not the partial ones).
You may have to repeat this for each independent parameter. Finally, when you look at all the statistics together and compare them with the regression statistics obtained in the first instance, you will be able to see how sensitive the dependent variable is to a variation in the values of each independent variable. This will give you a clue to the significance of one parameter for the dependent variable compared with the others.

But given 140 variables, I think this will be really cumbersome and time-consuming.

The second method, Saliency Analysis, is "similar" to Sensitivity Analysis in approach, but of course not the same. I am sorry, I do not know much about how to proceed with this particular analysis. You may find some material on it through a Google search.

Hope this helps you to some extent. I will try to find out more about this and will post again once I have something useful.

All the best,
deba
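
A sketch of a one-at-a-time sensitivity check follows, with one deliberate change from the steps above: the regression is fitted once, and each Xi is then perturbed with the fitted coefficients held fixed, looking at how much the predicted Y moves. The reason for this variant is that with ordinary least squares, multiplying a whole column by a constant and refitting only rescales that coefficient, so the overall fit statistics come out the same. NumPy, the function names and the data are all assumptions for illustration.

```python
import numpy as np

def fit(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, RSS)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)

def one_at_a_time_sensitivity(X, y, pct=0.10):
    """Mean absolute change in predicted Y when one column is raised by pct."""
    coef, _ = fit(X, y)
    n, k = X.shape
    base = np.column_stack([np.ones(n), X]) @ coef
    change = np.empty(k)
    for j in range(k):
        Xp = X.copy()
        Xp[:, j] *= 1.0 + pct                  # perturb only variable j
        pred = np.column_stack([np.ones(n), Xp]) @ coef
        change[j] = np.mean(np.abs(pred - base))
    return change

# Made-up data: Y depends strongly on X1, weakly on X2, not at all on X3
rng = np.random.default_rng(3)
n = 200
X = 10.0 + rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=n)

sens = one_at_a_time_sensitivity(X, y)
print("mean |change in predicted Y| per variable:", np.round(sens, 3))

# Rescaling a column and refitting leaves the residual sum of squares unchanged
X_scaled = X.copy()
X_scaled[:, 0] *= 1.10
_, rss_before = fit(X, y)
_, rss_after = fit(X_scaled, y)
print("RSS before and after rescale + refit:", rss_before, rss_after)
```

With 140 variables the loop is still cheap, since the model is fitted only once. The result depends on the units and typical size of each Xi, so it is a rough guide rather than a formal test.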

rajeshj
11th November 2005, 07:14 AM
Dear Deba and Roxy,
I read the comments from both of you and will try these. Meanwhile, I found the standard approach for what I really need: stepwise regression, or elimination, to drop insignificant regressors. So I have managed to identify the method, though it is still very complicated. There is a toolbox, I mean an m-file for MATLAB, that does it very well, but I could not get hold of it. If you have such a file, let me know. In any case, thank you for spending time on this.
with regards,
rajesh
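
In case the MATLAB file stays out of reach, the forward half of the idea is short enough to sketch by hand. The version below is written with NumPy (an assumption) and is forward selection only, not full stepwise regression with removal steps. The function names are made up, and the cut-off `f_to_enter=4.0` is just an assumed rule-of-thumb value for the partial F statistic, not something taken from the thread.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of the least-squares fit of y on A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_selection(X, y, f_to_enter=4.0):
    """Add, one at a time, the column that lowers the RSS most.

    Stops when the partial F statistic of the best remaining candidate
    falls below f_to_enter (an assumed threshold).
    """
    n, k = X.shape
    ones = np.ones((n, 1))
    selected = []
    current_rss = rss(ones, y)                 # intercept-only model
    while len(selected) < k:
        best_j, best_rss = None, current_rss
        for j in range(k):
            if j in selected:
                continue
            A = np.hstack([ones, X[:, selected + [j]]])
            r = rss(A, y)
            if r < best_rss:
                best_j, best_rss = j, r
        if best_j is None:
            break
        df_resid = n - (len(selected) + 2)     # intercept + selected + candidate
        f_stat = (current_rss - best_rss) / (best_rss / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best_j)
        current_rss = best_rss
    return selected

# Made-up data: only columns 2 and 5 actually drive Y
rng = np.random.default_rng(4)
n, k = 300, 8
X = rng.normal(size=(n, k))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + rng.normal(size=n)

selected = forward_selection(X, y)
print("columns in order of entry:", selected)
```

Columns 2 and 5 enter first here. A fixed F cut-off can still let in an occasional irrelevant column by chance, which is one reason the textbook stepwise procedures also include a removal step.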

Mozza
16th November 2005, 03:06 PM
rajeshj,

With 140 Xs, it sounds like you might want to consider a data reduction technique, such as principal component analysis. You could regress your Y against the first few principal components, or use only those Xs with the highest loadings in the most important PCs.

Also, collinearity between your Xs might screw up any inferences you draw from your regression model: i.e. you might include a term in the regression that has an apparent relationship to your Y only because it is related to another X.

I am trying to figure the collinearity problem out for my own data analysis. If you find anything useful, let me know.

Cheers, Mozza

rajeshj
17th November 2005, 04:49 AM
Dear Mozza,
When I say 140 variables, that counts a few parameters across various seasons, various levels, etc., so there are actually fewer parameters. If you have 'Statistical Methods' by Rudolf Freund and William Wilson, please refer to page 354 for variable selection. There they explain forward selection, backward elimination and stepwise methods for variable selection. But what you said is good, to find the loadings in the leading principal components. How can these loadings be found? I don't know. If you know, please explain a bit, such as what software to use and the principle behind it.
thank you for your suggestions
Rajesh
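
On the question of how the loadings can be found, one common route is a singular value decomposition of the standardized X matrix. The sketch below assumes NumPy and made-up data, and by "loadings" it means the eigenvector weights of each X in each component (some software scales them differently). The function name `pca` is hypothetical.

```python
import numpy as np

def pca(X):
    """PCA of standardized columns via SVD.

    Returns (loadings, scores, explained), where column m of `loadings`
    holds the weight of every X in principal component m.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each X
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)                    # share of variance per PC
    loadings = Vt.T
    scores = Z @ loadings                              # the PCs themselves
    return loadings, scores, explained

# Made-up data: X1 and X2 share a common factor, X3 and X4 are unrelated noise
rng = np.random.default_rng(5)
n = 300
f = rng.normal(size=n)
X = np.column_stack([
    f + 0.1 * rng.normal(size=n),
    f + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
y = 2.0 * f + rng.normal(scale=0.5, size=n)

loadings, scores, explained = pca(X)
top_two = set(np.argsort(-np.abs(loadings[:, 0]))[:2].tolist())
print("variance explained per PC:", np.round(explained, 3))
print("Xs with the highest loading on PC1 (0-based):", sorted(top_two))

# Regress Y on the first m principal components instead of on all the Xs
m = 2
A = np.column_stack([np.ones(n), scores[:, :m]])
pcr_coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("PC regression coefficients:", np.round(pcr_coef, 3))
```

The first component here loads almost entirely on the two related columns, which is the kind of pattern one would look for when picking a few Xs out of many.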

rajeshj
17th November 2005, 04:51 AM
dear mozza,
In the same book I mentioned above, they discuss variance inflation factors for checking multicollinearity. Refer to page 350.
thanks
rajesh
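
For reference, a variance inflation factor is usually defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on all the other Xs. A minimal sketch with NumPy and made-up data (both assumptions, as is the function name `vif`) could look like this:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Made-up data: X2 is almost a copy of X1, X3 is unrelated
rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

v = vif(X)
print("VIF per column:", np.round(v, 2))
```

The two near-duplicate columns get very large values and the unrelated one stays close to 1. The same numbers can also be read off the diagonal of the inverse of the correlation matrix of the Xs.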