Likelihood Model

Viewing 6 posts - 1 through 6 (of 6 total)


    I would like to calculate the likelihood of Win and Lose for combinations of factors, based on the training data I have collected. I am working on a Naive Bayes model, and I have two questions:

    1) The training data has only 1.7% (41/2373) of cases in the Lose category. Is that enough to build a model?
    2) Multiplying the factor probabilities gives a very small percentage for the likelihood of Lose. Since every factor on which we lose is important to consider, can I add the results instead of multiplying them before calculating the likelihood percentage? Adding gives results that look reasonable.



    Don Turnblade


    I do a lot of likelihood modeling for InfoSec work. I have some suggestions that may spark some useful thoughts. My notes are over focused on InfoSec/Risk Management concerns but may give you ideas to assist with odds based model building.

    Ideally, one would use half of the data to fit the model and the other half to test it. The data might be split by state: fit on the present state, then test the projection against the next state. This is one area where random splits of data sets may not help you test whether the models built have good protective capability.
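    A minimal sketch of such a split by time order, assuming each record carries a sortable date. The `date` and `outcome` field names and the sample rows are made up for illustration.

```python
def chronological_split(records, fit_fraction=0.5):
    """Split records into (fit, test) by time order rather than at random.

    `records` is assumed to be a list of dicts with a sortable "date" key;
    that field name is hypothetical and stands for whatever marks the state.
    """
    ordered = sorted(records, key=lambda r: r["date"])
    cut = int(len(ordered) * fit_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up sample: four outcomes recorded over four months.
sample = [
    {"date": "2015-03-01", "outcome": "win"},
    {"date": "2015-01-01", "outcome": "win"},
    {"date": "2015-04-01", "outcome": "lose"},
    {"date": "2015-02-01", "outcome": "win"},
]
fit_half, test_half = chronological_split(sample)
```

    The model is fit on `fit_half` only, and its projections are then checked against `test_half`, which it has never seen.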

    Not all odds are correlated, and Function1(Odds1)*Function2(Odds2) can lead to several combinations of terms. In my area, Poisson modeling of odds helps build these models:

    L = -ln(1-Odds_Failure)
    Function1 = m1*L1 + b1
    Function2 = m2*L2 + b2
    Function1*Function2 = m1*m2*L1*L2 + m1*b2*L1 + m2*b1*L2 + b1*b2

    Testing against these models is easy to do with a log transformation of the sample data, because ln(Function1*Function2) = ln(Function1) + ln(Function2). Log transformations let you use linear correlation testing to identify whether multiplication is a good fit.
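    The transform and the two identities can be checked numerically. This is only a sketch: the odds (0.10, 0.25) and the coefficients m1, b1, m2, b2 are hypothetical values picked for illustration.

```python
import math

def poisson_rate(odds_failure):
    """L = -ln(1 - Odds_Failure), valid for 0 <= Odds_Failure < 1."""
    return -math.log(1.0 - odds_failure)

def linear(m, b, rate):
    """Function = m*L + b."""
    return m * rate + b

# Hypothetical odds and coefficients, chosen only for illustration.
L1, L2 = poisson_rate(0.10), poisson_rate(0.25)
m1, b1, m2, b2 = 2.0, 0.5, 1.5, 0.2

f1, f2 = linear(m1, b1, L1), linear(m2, b2, L2)
expanded = m1 * m2 * L1 * L2 + m1 * b2 * L1 + m2 * b1 * L2 + b1 * b2

# The product matches the four-term expansion, and its log is a plain sum.
product_matches = math.isclose(f1 * f2, expanded)
log_is_additive = math.isclose(math.log(f1 * f2), math.log(f1) + math.log(f2))
```

    Because the log of the product is a sum, a multiplicative model becomes a linear one after the transform, which is what makes linear correlation testing usable here.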

    Once the data is roughly linear in a graphical sense, either naturally or after a simple transform such as ln(x), test it for linear correlation.

    Short review of linear correlation testing using spreadsheet functions:

    Quantity                              Var1 (x)                      Var2 (y)
    N, count                              count(x)                      count(y)
    Ave, average                          average(x)                    average(y)
    Vxx, Vyy                              sumproduct(x,x)/N-Ave_x^2     sumproduct(y,y)/N-Ave_y^2
    Vyx                                   sumproduct(y,x)/N-Ave_x*Ave_y
    F0 (Fdist hypothesis test)            (N-2)*Vyx^2/(Vxx*Vyy-Vyx^2)
    Test, 95% confidence (1=correlated)   if(F0>=tinv(.05,N-1)^2,1,0)

    M, slope                              if(Test=1,Vyx/Vxx,0)
    B, intercept                          Ave_y - M*Ave_x
    S, model sigma                        if(Test=1,sqrt(N/(N-2)*(Vyy-M*Vyx)),sqrt(N/(N-1)*Vyy))
    CI95, 95% confidence interval         if(Test=1,tinv(.025,N-2),tinv(.025,N-1))*S

    If Test fails, then at 95% confidence random noise is as good an explanation as a linear model, and no line should be used.
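    For readers without a spreadsheet at hand, the same rows can be sketched in Python. Two things here are assumptions rather than part of the recipe above: the function name `linear_fit_test`, and the default critical value, which uses the large-N normal quantile as a stand-in for the spreadsheet's tinv(.05, ...). For small N, pass the t value from a table as `t_crit`. The CI95 row is left out because it is simply a t value times S.

```python
import math
from statistics import NormalDist

def linear_fit_test(x, y, t_crit=None):
    """Mirror the spreadsheet rows: Vxx, Vyy, Vyx, F0, Test, M, B, S."""
    n = len(x)
    if n < 3 or len(y) != n:
        raise ValueError("need two equal-length columns of at least 3 points")
    ave_x, ave_y = sum(x) / n, sum(y) / n
    vxx = sum(a * a for a in x) / n - ave_x ** 2
    vyy = sum(b * b for b in y) / n - ave_y ** 2
    vyx = sum(a * b for a, b in zip(x, y)) / n - ave_x * ave_y
    if vxx <= 0:
        raise ValueError("x must not be constant")
    if t_crit is None:
        # Assumption: large-N normal stand-in for the two-tailed 95% t value.
        t_crit = NormalDist().inv_cdf(0.975)
    unexplained = vxx * vyy - vyx ** 2
    if unexplained <= 0:
        # Perfectly linear data (or constant y, where there is nothing to fit).
        f0 = math.inf if vyx != 0 else 0.0
    else:
        f0 = (n - 2) * vyx ** 2 / unexplained
    correlated = f0 >= t_crit ** 2
    slope = vyx / vxx if correlated else 0.0
    intercept = ave_y - slope * ave_x
    if correlated:
        sigma = math.sqrt(max(0.0, n / (n - 2) * (vyy - slope * vyx)))
    else:
        sigma = math.sqrt(max(0.0, n / (n - 1) * vyy))
    return {"F0": f0, "Test": int(correlated), "M": slope,
            "B": intercept, "S": sigma}


# Made-up data: y is close to 2*x + 1 with a little noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.05, 8.95, 11.1, 12.9, 15.05, 16.95]
fit = linear_fit_test(xs, ys)
```

    When `fit["Test"]` is 0, the slope is forced to zero and S falls back to the plain sample sigma of y, exactly as the if(...) formulas do.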

    Note: Both full factorial and partial factorial test tables remove any correlation between variables that is caused by the test design rather than by the data itself. Thus, linear tests between each variable and the outcome can be statistically independent. Each variable can be tested against the outcome and fit. The modeled change is then removed from the outcome, and testing for correlation with the other variables can begin again.
    Naturally, this is easier to do with a good statistical package, but it can be done by hand with a large spreadsheet, especially if the work is needed but only sweat equity funding is available. ("I can make a better one if I get the right tools" is not a bad internal sales pitch.)

    Testing lots of possible models:
    Once linearized, Partial Factorial Testing of Models:
    Partial factorial arrays can build test cases for single-factor and multiplied probabilities very nicely, and can also comment on correlation with sample data. Cases in such arrays can naturally test Odds1 + Odds2 + Odds3 against Odds1*Odds2 + Odds1*Odds3 + Odds2*Odds3 and even Odds1*Odds2*Odds3. In effect, the test case combinations help you select among models for odds.
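    As a rough sketch of how such test cases line up against the candidate models, the snippet below builds a full two-level array for three odds and evaluates the additive, pairwise and triple-product terms for each row. The low/high levels (0.1 and 0.3) are hypothetical, and a partial factorial array would keep only a chosen subset of these rows.

```python
from itertools import combinations, product
from math import prod

def candidate_terms(odds_row):
    """Return (additive, pairwise, triple) model terms for one test case."""
    additive = sum(odds_row)                                      # Odds1 + Odds2 + Odds3
    pairwise = sum(a * b for a, b in combinations(odds_row, 2))   # sum of two-way products
    triple = prod(odds_row)                                       # Odds1*Odds2*Odds3
    return additive, pairwise, triple

# Hypothetical low/high levels for Odds1, Odds2, Odds3.
levels = [(0.1, 0.3), (0.1, 0.3), (0.1, 0.3)]
test_cases = list(product(*levels))      # full 2^3 array, 8 rows
terms = [candidate_terms(row) for row in test_cases]
```

    Each column of `terms` can then be tested against the observed outcome to see which form of the model the sample data supports.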

    In my case, I face significant uncertainties in input data and wish to look at models in a three-level state. Sometimes I have a credible average and sigma on inputs. Other times I have human expert interview data, which needs special treatment to avoid center bias, extreme distortions, self-reporting shame/honor bias and simple human mismatch with statistical or odds-based thinking. Trinary full factorial arrays help in that case.

    Forced choice grading of combinations proposed by a partial factorial array can also help considerably when looking for fundamental relationships between factors.

    I had a group of InfoSec experts rate the risk of a system facing the internet on a forced-choice scale from 1 to 32 (only one choice is allowed), 1 being the best protection and 32 being the worst protection in their experience.

    V1: Web Server: IIS/Apache
    V2: Database: MS SQL/MySQL
    V3: Application Server: Windows/Linux
    V4: Framework: .Net/Python
    V5: Deployment: All on one system/Split into separate systems
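    Five two-level variables give 2^5 = 32 combinations in a full factorial array, which lines up with a 1 to 32 forced-choice scale. A minimal sketch of enumerating them follows. The dictionary keys are labels invented here, and a partial factorial array would present only a subset of these rows to the raters.

```python
from itertools import product

# Two levels per variable, taken from the list above; key names are illustrative.
factors = {
    "web_server": ("IIS", "Apache"),
    "database": ("MS SQL", "MySQL"),
    "app_server": ("Windows", "Linux"),
    "framework": (".Net", "Python"),
    "deployment": ("one system", "separate systems"),
}

# Every combination of levels, one dict per candidate system configuration.
configurations = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]
```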

    This addressed a vexing case of risk management that was not process and policy dependent. Combined with attacker behavior models and Business Continuity down time estimates, it gave IT Audit a model for estimating risk on developed platforms that included approximate modeled feedback from InfoSec staff. The result was improved credibility in risk ranking and a self-teaching model for project managers.


    Don Turnblade


    The best advice I have when using statistical modeling is that if it fails a hypothesis test you do not have to think much. But, if it passes, that is when it is time to start thinking.

    Rule-of-thumb approximation for the 95% confidence interval: tinv(0.025,N-1)*sqrt(Odds*(1-Odds)/N) = sqrt(Odds*(1-Odds))*sqrt(5/(N-3))

    Your sample size N is 2373.
    Thus, worst case, 1/2*sqrt(5/(2373-3)) = 1.15%

    Your average is 1.7% so it is plausible to think you have enough samples to be 95% sure that random dice sampling could not cause your average. 1.7% +/- 1.1%
    Now, it is the time to start thinking carefully. If random dice is not the cause of this average, what is?


    Don Turnblade

    Curve fit of tinv(.025,N-1)*sqrt(Odds*(1-Odds)/N) = sqrt(Odds*(1-Odds)/N)*sqrt(5*N/(N-3))

    This reduces to sqrt(Odds*(1-Odds))*sqrt(5/(N-3)). Using tinv(.025,N-1)*sqrt(Odds*(1-Odds)/N) is a reasonable approximation of dice behavior for N>=4. Generally, below about N=20, one can directly compute the probability distribution for dice using Binomial statistics.
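    The reduced form is easy to compute directly. The sketch below also compares it with a large-N limit, on the assumption that tinv is the spreadsheet's two-tailed function, so that tinv(.025, ...) approaches the normal quantile at 1 - 0.025/2 (about 2.24, close to sqrt(5)). Both function names are made up for this example.

```python
import math
from statistics import NormalDist

def rule_of_thumb_half_width(odds, n):
    """sqrt(Odds*(1-Odds)) * sqrt(5/(N-3)), intended for N >= 4."""
    return math.sqrt(odds * (1.0 - odds)) * math.sqrt(5.0 / (n - 3))

def normal_limit_half_width(odds, n, tail=0.025):
    """Large-N stand-in for tinv(tail, N-1)*sqrt(Odds*(1-Odds)/N).

    Assumes tinv is two-tailed, so the matching normal quantile
    is taken at 1 - tail/2.
    """
    z = NormalDist().inv_cdf(1.0 - tail / 2.0)
    return z * math.sqrt(odds * (1.0 - odds) / n)

# Worst case (Odds = 1/2) for a sample of 1000 tries.
approx = rule_of_thumb_half_width(0.5, 1000)
limit = normal_limit_half_width(0.5, 1000)
```

    For large N the two values agree to within a fraction of a percent, which is what makes the curve fit convenient when no t table is available.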

    Odds(x_successes) = N!/(x!*(N-x)!)*Odds_success^x*(1-Odds_success)^(N-x)

    A decent approximation of N!/(x!*(N-x)!) for 0 < x < N:
    = (12*x/(12*x+1))*(12*(N-x)/(12*(N-x)+1))*((12*N+1)/(12*N))*(N/(N-x))^N*((N-x)/x)^x*sqrt(N/(2*PI*x*(N-x)))

    This is based on Stirling's approximation, n! = sqrt(2*PI*n)*(n/e)^n*(1+1/(12*n)).
    It allows nearly direct computation of binomial odds even when a spreadsheet cannot handle numbers much larger than fact(139), i.e. 139!.
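    A quick way to see how good the approximation is: code it up and compare it with an exact value. This is a sketch only. `choose_stirling` is a name invented here, and it uses the standard form of Stirling's formula with sqrt(2*PI*n) in the numerator. The powers can still overflow floating point when N is large and x is near N/2.

```python
import math

def choose_stirling(n, x):
    """Approximate N!/(x!*(N-x)!) for 0 < x < N via Stirling's formula."""
    if not 0 < x < n:
        raise ValueError("requires 0 < x < N")
    # Ratio of the three (1 + 1/(12*k)) correction factors.
    correction = (
        (12 * x / (12 * x + 1))
        * (12 * (n - x) / (12 * (n - x) + 1))
        * ((12 * n + 1) / (12 * n))
    )
    return (
        correction
        * (n / (n - x)) ** n
        * ((n - x) / x) ** x
        * math.sqrt(n / (2 * math.pi * x * (n - x)))
    )

# Compare against the exact integer value for a moderate case.
exact = math.comb(50, 10)
approx = choose_stirling(50, 10)
relative_error = abs(approx - exact) / exact
```

    In Python the exact value is available through math.comb, so the approximation matters mainly when porting the calculation back to a spreadsheet.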

    OK, so your spreadsheet would have 2373 +1 rows of such computations in it. But it then could directly estimate your 95% confidence interval.


    Don Turnblade

    I was trying to build you a more precise model of the confidence interval around your measurement, 41/2373. However, the closer I look at it, the less hope I have for this measurement. I believe you have enough confidence for Marketing, which only requires a 66% confidence interval to make an inference. Business benefits from the 95% confidence interval, as this is the cutoff at which a firm can legally guarantee something. Depending on how the confidence interval is modeled, this number either meets or fails the 95% confidence interval. When it comes to Engineering modeling of reliability to protect human life, 99.9% confidence quickly appears, and this would cleanly fail.


    Don Turnblade

    41/2367 = 1.73%

    Exact Binomial computation:
    97.98% confidence is 54/2367 or 2.28%

    Student’s T approximation of Binomial:
    tinv(.05,2367-1)/(2*sqrt(2367)) = 2.02%

    Correctly done approximation of Student’s T for Binomial:
    sqrt(5/(2367-3))/2 = 2.30%

    From each of these, 41/2367 or 1.73% is not enough to be 95% confident that binomial success/failure “dice” effects could not cause this result at random.

    Binomial Statistics for “dice” odds:
    1 = (1)^N = (Odds_win + (1-Odds_win))^N = sum over Num_win from 0 to N of N!/(Num_win!*(N-Num_win)!)*Odds_win^Num_win*(1-Odds_win)^(N-Num_win)

    Sum from 0 to X_wins until the cumulative probability is at or near 97.5%.

    Odds          Wins   Cumulative Odds
    0.017808975   51     94.692966%
    0.013981322   52     96.091098%
    0.01076459    53     97.167558%
    0.008130949   54     97.980652%
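    The running sum can also be done without factorials by stepping from one term to the next. The sketch below assumes the table was built with N = 2367 tries and Odds_win = 41/2367. The function names are made up for this example.

```python
def binomial_table(n, p):
    """Yield (wins, odds, cumulative_odds) for wins = 0..n, with 0 < p < 1."""
    odds = (1.0 - p) ** n            # probability of zero wins
    cumulative = 0.0
    for wins in range(n + 1):
        cumulative += odds
        yield wins, odds, cumulative
        # Recurrence: P(wins+1) = P(wins) * (n-wins)/(wins+1) * p/(1-p)
        odds *= (n - wins) / (wins + 1) * p / (1.0 - p)

def first_wins_reaching(n, p, target=0.975):
    """Smallest number of wins whose cumulative odds reach the target."""
    for wins, _, cumulative in binomial_table(n, p):
        if cumulative >= target:
            return wins
    return n

tries, observed = 2367, 41
rows = {w: (o, c) for w, o, c in binomial_table(tries, observed / tries)}
upper = first_wins_reaching(tries, observed / tries)
```

    Under those assumptions, `rows[51]` through `rows[54]` should reproduce the table rows, and `upper` is the first win count at which the cumulative odds pass 97.5%.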

    54/2367 = 2.28%: with 95% confidence, 2367 tries could produce up to 54 wins purely at random. With 41 wins out of 2367 tries, one cannot be 95% confident that the result was not created by luck of the dice alone rather than by a genuine effect being measured.
