Normalize variables

Keithr Member Posts: 10 Contributor II
Hi,

I'm trying to create a predictive model for churn. Some of the variables I'm using are the percentage change in sales from month to month. To control outliers on the positive side (e.g. a 200% increase in sales from month-5 to month-4), I set a cap at 3 (300%). On the negative side (i.e. a drop in sales), the most a customer can drop is -1 (-100%), but I have many of these cases. My distribution is roughly normal except for these customers, whose spike at -1 gives me a bimodal distribution.
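For readers following along, here is a minimal sketch of the capped month-over-month change described above. The function name `capped_pct_change` and the sample sales figures are hypothetical, and the sketch assumes sales are never negative, so -1 is the natural floor.

```python
def capped_pct_change(previous, current, cap=3.0):
    """Fractional change from `previous` to `current`, capped at `cap` (3.0 = 300%)."""
    if previous <= 0:
        # The change is undefined when the earlier month had no sales.
        raise ValueError("previous sales must be positive")
    change = (current - previous) / previous
    # Only the positive side needs a cap; the floor of -1 occurs naturally.
    return min(change, cap)


# Hypothetical monthly sales for one customer, oldest first.
sales = [100.0, 300.0, 1500.0, 750.0, 0.0]
changes = [capped_pct_change(p, c) for p, c in zip(sales, sales[1:])]
print(changes)  # [2.0, 3.0, -0.5, -1.0]
```

A customer who stops buying altogether lands exactly on -1, which is why those cases pile up into a second mode.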

Is there any calculation I can do with this variable to normalize the distribution including the -1 (-100%) instances? Or if there is no way to do this, any other suggestions would be great.

Thanks in advance for your help.

Keith

Answers

  • reports01 Member Posts: 23 Maven
    Why don't you cluster them? Say:

    Cluster 1: Outliers negative
    Cluster 2: Normal decrease
    Cluster 3: stable
    ...
    ...
    ...
    Cluster n: Extreme growth

  • Keithr Member Posts: 10 Contributor II
    Thanks.

    This will work as long as I make the interval range for the "Outlier negatives" smaller than the other bins. In other words, to avoid including too many instances in the "large drop" bin, I'd have to set its range from -100% to, let's say, -90%, while the other bins would have a much larger range (e.g. -89% to -40%).

    Statistically speaking, is it OK to have bins with different ranges like that?

    Keith
  • Keithr Member Posts: 10 Contributor II
    When I binned the sales variables to normalize the distribution, I had to use few bins so that the -100% cases would not overwhelm the other bins. I then used the CR&T learner against some of those variables and my accuracy actually decreased. It seems that CR&T, at least, does OK with the original bimodal distributions.
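
The unequal-width bins discussed in this thread could be sketched roughly as follows. Only the -90% and -40% edges come from the posts above; the remaining edges, the labels, and the function name `bin_change` are assumptions made purely for illustration.

```python
from bisect import bisect_left

# Upper edges of each bin (inclusive). Only -0.90 and -0.40 come from the
# thread; the other edges and all labels are hypothetical.
EDGES = [-0.90, -0.40, -0.10, 0.10, 1.00]
LABELS = ["large drop", "drop", "slight drop", "stable", "growth", "extreme growth"]


def bin_change(change):
    """Map a fractional sales change to a label using unequal-width bins."""
    # bisect_left makes each upper edge inclusive, so -0.90 is still a "large drop".
    return LABELS[bisect_left(EDGES, change)]


for value in (-1.0, -0.89, 0.0, 3.0):
    print(value, "->", bin_change(value))
```

The narrow first bin isolates the customers at or near -100%, while the wider bins cover the rest of the range.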