Assistance with a probability distribution problem

Ilriyas · May 2015

At the risk of glazing everyone's eyes over, my dad was wrestling with a problem involving statistics and was wondering if anyone here had a more recent grasp on these kinds of problems.

He's been trying to determine if random numbers fit to an expected distribution. He's aware of things like Chi Square which would be really easy to apply if the data was expected to be uniform (flat). The challenge is the distribution. It's more like the difference of two distributions.

Imagine you have a uniform distribution of numbers between 0 and 1 million. Your random numbers are uniformly distributed here with a condition. The numbers must have at least 1 '9' in them. So how would you determine if your random numbers fit this distribution? Is there a test, or a method of synthesizing expected values?

Poor dad has exhausted his Google-foo and I hate seeing smoke coming out of his ears.

Personally I have no clue, he actually wrote the above. For some reason he insisted on writing it from my viewpoint.

Thanks in advance,

-Ilriyas

PirateNinja · May 2015

Chi Square is very useful because it predicts the commonality between a theoretical model and actual data. I'm a tad confused if he has a model or just data. But let's set the Chi Square distribution aside.

I /think/ what he is saying through you is that he only has a raw set of data, and he wants to see if he can reverse engineer it in to a distribution. I did this type of thing back when I had to create simulations. This is often times guess work based on the source or nature of the data since there are so many distrubutions used with random numbers (normal, poisson, gamma, lognormal, whatever). The way I would then prove it is by simulating random number generation through the distribution I guessed it was, and then analyze the difference between the simulated numbers and the original data. That's how you test the accuracy of a simulation with known outcomes.

Can you, or he via you, give us an idea of where the raw data comes from and what exactly it represents? With that I could give you a list of distributions the numbers are likely to have been generated from.

Ilriyas · May 2015

The data is generated by a very strong random number generator that has already passed a lot of statistical tests. This is then used to generate random digits 0..9. A ChiSq on the individual digits confirms the distribution appears uniform.

The second part of the problem is to look at the distribution individual digits in larger numbers 0..N for some large N where there is at least one '9'. Things like mean and variance look good but a ChiSq of this rejects uniformity as you would expect given that some values are being discarded. The problem is really to model the expected distribution when N is large.

Not sure if there is a way to model this, or to subtract one distribution from another, or add several distributions together, or if there is another test that would be better.

PirateNinja · May 2015

By definition a uniform or rectangular distribution gives the exact same odds of any number to be generated.
So if we take a step back and simplify this then there are one of three scenarios I can think of:
1. This is not a uniform distribution
2. Your data set is too small and even though it appears to not be uniform, it is not
3. The actual programmatic generator of this data is altering the outcome /after/ the number is generated to avoid certain outcomes.

Even if I knew the nature of the problem at hand, it wouldn't be prudent to speculate.

This is the problem with reverse engineering. Even if you find a distribution and simulate it to give you very similar results, you won't know exactly what is happening unless you can peak at the source.

PirateNinja · May 2015

On that note, when it comes to discrete simulations you can certainly generate values randomly with distribution A and then use that value through a second distribution B to generate new values. However, I can't possibly think how to subtract one distribution from another.

Knowing nothing else about the problem at hand, I think you will have to speculate to the best of your ability because reverse engineering could very well be impossible. At least to my knowledge. If you ask a math PhD, they are going to ask you for the data and say give me a week.

PirateNinja · May 2015

For guesswork I think this would help a lot:

http://gamedev.stackexchange.com/questions/12638/weighted-random-distribution

Hero · May 2015

i love this community...you did indeed manage to cross my eyes, however once i managed to uncross them(partially) it was one of those "things" that gets the grey matter flowing. also f**king awesome that there was a knowledgeable response, period. awesomesauce on every post.

shwaip · May 2015

You can still do the chi-square test, you just need to only do it over the domain of the random variable (that is, only over the numbers that have a 9 in them)

PirateNinja · May 2015

@shwaip is far more authoritative than me. I appreciate the love though @XGPHero . @Ilriyas don't hesitate to let us know what additional info or guidance that can be given. It's just a difficult question because there is so little background.

Ilriyas · May 2015

My Dad made himself an account, he's just waiting on it to be approved and I won't have to serve as a go-between. (Hopefully he'll get involved in the community too cause that would be kinda cool)

alberio · May 2015

Hi,

Let me try to explain this again a bit better as I wasn't clear.

Background

I was analyzing a moderate sized sample (6,000 ish) of random 6-tuples. These were generated using a secure random number generator to select 6 random integers X from the sequential set 0 to B. There was also a requirement for some content restriction, specifically that each 6-tuple had a least one element with the numbers 0 to R. And of course 0 < R < B.

The SRNG is supposed to be FIPS 140 software tested but doing things like picking random bits from PRNGs is notoriously dodgy so I needed to do a sanity check test and ChiSq seemed to be a good choice.

There are a number of applications for this, such as random passwords that have content rules like there must be at least 1 digit or special character. I have a different application but this is easier to explain.

Without the content requirement the distribution is uniform and flat and the ChiSq is easy as the expected value of each 0 to B is a simple average.

By throwing away the 6-tuples with values R+1 to B the expected values for each X in this range should be lower than the simple average.

Problem

I'm not trying to reverse engineer from the data. Rather, I'm trying to find the theoretical expected values of this restricted distribution so I can calculate the ChiSq so I can evaluate the observed values.

Also since B > 60 , I can't just generate the full set of tuples and just count it because I'm into trillions of tuples and hundreds of trillions of integers. I might be able to do it with a brand new empty hard disk but then it doesn't generalize. (Rendering the integers as printable characters reduces the space needed for this but that's just an implementation trick and it doesn't work when B gets much bigger than about 90ish. And it certainly doesn't work on tuples any larger than 6).

Current Thoughts

Now that I've recovered access to a few more probability and stats brain cells, I think the following will work.

I can take the unrestricted size of the space B^6 (where ^ is exponentiation) and figure out the restricted part then use that to reduce the expect values of R+1 through B.

Because it effects each digit the totals size of the space of the tuples after applying the restriction is 6 * R * B^5

Therefore the total size of the rejected space is B^6 - (6 * R * B^5) which simplifies to B^5(B-6R)

Now I think if I start with the expected values of 0 to B as being the simple average, then I can just adjust the R+1 to B expected values downward by the number of tuples rejected / (B-R) and I should have it.

Does this sound reasonable?

alberio · May 2015

BTW some of these brain cells have been packed away in the back for 30+ years. It hurts growing them back

alberio · May 2015

Please excuse the noobiness, Ilriyas just told me that I should have added @shwaip and @PirateNinja

alberio · May 2015

I think I got it. Another couple of days, a little less sinus cold, and couple more coffees later ...

Simplify! So wanting the expected values for this distribution I visualized it in small scale with easy numbers I could do in my head. So much easier to visualize it as bins or a 2d graph.

B = full set of 12 integers 0, .., 11 used to generate tuples
R = sub set of 10 integers 0, .., 9 used to reject unwanted tuples
N = 3 the number of integers in each random tuple/vector/list

In a totally flat and perfectly random distribution of B in 3-grams there are B^3 or 1728 tuples. The expected values for each integer in B is B^(N-1) or 144.

If I further restrict the results by rejecting all tuples that contain only integers in R. The expected values for each integer in R is B^(N-1)-R^(N-1) or 144-100=44.

Now all that's left is to scale the expected values from a sample size of B^N to the actual sample size.

That's what I meant about subtracting out two distributions.

@shwaip @PirateNinja @Ilriyas

Took me long enough.

shwaip · May 2015

Glad you got it sorted - my brain was pretty fried from my own sinus cold!

Assistance with a probability distribution problem

Comments