Logistic Distribution
The logistic distribution is a continuous distribution with CDF

$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where $\mu$ is the location parameter and $\gamma > 0$ is the scale parameter.
[Figure: the logistic distribution and its CDF]
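To make the definition concrete, here is a minimal Python sketch (the function name `logistic_cdf` and its default parameters are our own, not from the text):

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """F(x) = 1 / (1 + exp(-(x - mu) / gamma))."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

# The CDF is 1/2 at the location parameter mu, and the curve is the
# familiar S-shaped sigmoid, steeper for smaller gamma.
print(logistic_cdf(0.0))         # 0.5
print(logistic_cdf(3.0, mu=1))   # ~0.88
```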
Real Value to Probability
Suppose you want to calculate how likely a sample $x$ belongs to the positive class, i.e. the probability $p = P(Y = 1 \mid x)$.

For some probability $p$ of an event, the odds of the event are defined as

$$\frac{p}{1-p}$$

where $p \in (0, 1)$.

We further map the odds to the log-odds (logit), which ranges over all real numbers:

$$\operatorname{logit}(p) = \log \frac{p}{1-p}$$

Finally, we can let the logit be a linear function of the input:

$$\log \frac{p}{1-p} = w \cdot x$$

Solve for $p$:

$$p = \frac{e^{w \cdot x}}{1 + e^{w \cdot x}} = \frac{1}{1 + e^{-w \cdot x}}$$

which is exactly the logistic CDF above with $\mu = 0$ and $\gamma = 1$. This is the logistic regression model.
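Numerically, the logit and the sigmoid are inverses of each other; the sketch below checks this round trip (the toy weights `w` and input `x` are invented for illustration):

```python
import math

def log_odds(p):
    """Map a probability p in (0, 1) to its log-odds (logit)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Invert the logit: solve log(p / (1 - p)) = z for p."""
    return 1.0 / (1.0 + math.exp(-z))

# logit and sigmoid undo each other.
p = 0.8
assert abs(sigmoid(log_odds(p)) - p) < 1e-12

# In logistic regression the logit is modeled as a linear function w . x,
# so P(Y=1 | x) = sigmoid(w . x).
w, x = [0.5, -1.2], [2.0, 1.0]
z = sum(wi * xi for wi, xi in zip(w, x))
print(sigmoid(z))  # probability that x belongs to the positive class
```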
Chapter 6.2: Maximum Entropy Model
References
https://repository.upenn.edu/cgi/viewcontent.cgi?article=1083&context=ircs_reports
Idea
When a distribution cannot be fully determined by the given evidence, we make the guess that maximizes the entropy of the distribution, subject to the evidence as constraints.
Solving for the maximum entropy model is actually a constrained optimization problem: we would like to maximize the entropy of our model while satisfying the constraint that the model must match our observations.
This is a discriminative model, i.e. it estimates $P(y \mid x)$ directly.
Set Up for Target Function
We let the entropy of our (conditional) model $P(y \mid x)$ be

$$H(P) = -\sum_{x, y} \tilde{P}(x) \, P(y \mid x) \log P(y \mid x)$$

where $\tilde{P}(x)$ is the empirical distribution of $x$ in the training set.
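A small sketch of this conditional entropy (the dictionary-based representation is our own choice); note that a uniform conditional distribution attains the maximum, $\log 2$ when there are two possible outputs:

```python
import math

def conditional_entropy(p_tilde_x, p_y_given_x):
    """H(P) = - sum_{x,y} ptilde(x) * P(y|x) * log P(y|x).

    p_tilde_x:   dict x -> empirical probability of x
    p_y_given_x: dict (x, y) -> model probability P(y|x)
    """
    h = 0.0
    for (x, y), p in p_y_given_x.items():
        if p > 0.0:
            h -= p_tilde_x[x] * p * math.log(p)
    return h

p_tilde = {"x1": 0.5, "x2": 0.5}
uniform = {("x1", "a"): 0.5, ("x1", "b"): 0.5,
           ("x2", "a"): 0.5, ("x2", "b"): 0.5}
print(conditional_entropy(p_tilde, uniform))  # ~0.693 = log 2
```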
Set Up for Constraints
First, for an input sentence $x$ and an output $y$, we define a binary feature function $f(x, y)$ that equals $1$ when the pair $(x, y)$ satisfies some fact and $0$ otherwise.

For a problem, we may have multiple feature functions $f_1, f_2, \ldots, f_n$, one per fact we want the model to respect.
For example, consider filling in the following blank:
I will <BLANK> the piano when I get home.
In the training set, we see many different phrases. For example, ‘The musician plays the piano’, ‘The shop sells the piano’, etc. We may define the following feature functions:

$$f_1(x, y) = \begin{cases} 1 & \text{if the subject of } x \text{ is a person and } y = \text{‘play’} \\ 0 & \text{otherwise} \end{cases}$$

$$f_2(x, y) = \begin{cases} 1 & \text{if the subject of } x \text{ is a shop and } y = \text{‘sell’} \\ 0 & \text{otherwise} \end{cases}$$
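In code, such binary feature functions might look as follows; the dictionary keys `subject_is_person` and `subject_is_shop` are made-up stand-ins for a real feature extractor:

```python
def f1(x, y):
    """1 if the subject of sentence x is a person and y == 'play', else 0."""
    return 1 if x["subject_is_person"] and y == "play" else 0

def f2(x, y):
    """1 if the subject of sentence x is a shop and y == 'sell', else 0."""
    return 1 if x["subject_is_shop"] and y == "sell" else 0
```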
It is then easy to see that the number of sentences in the training set where ‘play’ follows a person is $\sum_{i=1}^{N} f_1(x_i, y_i)$, where $N$ is the number of training samples.

Therefore, the probability of observing this fact in the training set, i.e. the empirical expectation of $f_1$, is

$$E_{\tilde{P}}(f_1) = \sum_{x, y} \tilde{P}(x, y) \, f_1(x, y) = \frac{1}{N} \sum_{i=1}^{N} f_1(x_i, y_i)$$
We define our model's (estimated) expectation of a feature $f$ as

$$E_P(f) = \sum_{x, y} \tilde{P}(x) \, P(y \mid x) \, f(x, y)$$

Note that we factorize $P(x, y) \approx \tilde{P}(x) \, P(y \mid x)$: the model only estimates the conditional $P(y \mid x)$, so we substitute the empirical $\tilde{P}(x)$ for the true marginal.

If our model is correct, we must have the model expectation match the empirical one. Therefore, in our model we require, for each feature function $f_i$,

$$E_P(f_i) = E_{\tilde{P}}(f_i)$$

These equalities are the constraints of our optimization problem.
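Continuing the toy example, here is a sketch of the two expectations (the six-sample `data` list is invented for illustration; it reuses `f1` and `f2` from the sketch above). The constraint asks that the two printed numbers agree once the model is trained:

```python
person = {"subject_is_person": True,  "subject_is_shop": False}
shop   = {"subject_is_person": False, "subject_is_shop": True}

# Toy training set of (x, y) pairs.
data = [(person, "play"), (person, "play"), (person, "sell"),
        (shop, "sell"), (shop, "sell"), (shop, "play")]

def empirical_expectation(f, data):
    """E_ptilde[f] = sum_{x,y} ptilde(x,y) f(x,y) = (1/N) sum_i f(x_i, y_i)."""
    return sum(f(x, y) for x, y in data) / len(data)

def model_expectation(f, data, p, ys):
    """E_P[f] = sum_x ptilde(x) sum_y P(y|x) f(x,y); ptilde(x) is realized
    by averaging over the x's as they occur in the data."""
    return sum(p(x, y) * f(x, y) for x, _ in data for y in ys) / len(data)

ys = ["play", "sell"]
uniform = lambda x, y: 1.0 / len(ys)
print(empirical_expectation(f1, data))           # 1/3
print(model_expectation(f1, data, uniform, ys))  # 1/4: constraint violated
```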
Optimization Problem
We need to optimize the entropy of our model subject to the constraints derived above:

$$\begin{aligned} \max_{P} \quad & H(P) = -\sum_{x, y} \tilde{P}(x) \, P(y \mid x) \log P(y \mid x) \\ \text{s.t.} \quad & E_P(f_i) = E_{\tilde{P}}(f_i), \qquad i = 1, \ldots, n \\ & \sum_{y} P(y \mid x) = 1 \end{aligned}$$
Lagrange Function
We rewrite the above problem in the standard form for Lagrangian optimization by minimizing $-H(P)$ instead of maximizing $H(P)$.

We define the Lagrange function

$$L(P, w) = -H(P) + w_0 \Bigl( 1 - \sum_{y} P(y \mid x) \Bigr) + \sum_{i=1}^{n} w_i \bigl( E_{\tilde{P}}(f_i) - E_P(f_i) \bigr)$$

and solve the following primal problem, or equivalently (the problem is convex, so duality holds) its dual:

$$\min_{P} \max_{w} L(P, w) \quad \Longleftrightarrow \quad \max_{w} \min_{P} L(P, w)$$
See the reference listed at the top of this chapter for a detailed explanation of why we can do this.
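Concretely (a sketch of the key step; the full argument is in the reference), the inner minimization sets the derivative of $L$ with respect to each value $P(y \mid x)$ to zero:

$$\frac{\partial L}{\partial P(y \mid x)} = \tilde{P}(x) \bigl( \log P(y \mid x) + 1 \bigr) - w_0 - \sum_{i=1}^{n} w_i \, \tilde{P}(x) \, f_i(x, y) = 0$$

$$\Longrightarrow \quad P(y \mid x) = \exp \Bigl( \sum_{i=1}^{n} w_i f_i(x, y) \Bigr) \cdot \exp \Bigl( \frac{w_0}{\tilde{P}(x)} - 1 \Bigr)$$

The second factor does not depend on $y$, so the normalization constraint $\sum_y P(y \mid x) = 1$ fixes it to $1 / Z_w(x)$.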
The solution tells us that our model must take the form

$$P_w(y \mid x) = \frac{1}{Z_w(x)} \exp \Bigl( \sum_{i=1}^{n} w_i f_i(x, y) \Bigr)$$

where

$$Z_w(x) = \sum_{y} \exp \Bigl( \sum_{i=1}^{n} w_i f_i(x, y) \Bigr)$$

is the normalization factor that makes the probabilities sum to $1$.
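A sketch of this model form in Python, reusing the toy feature functions and data from above (the name `p_w` and the zero initial weights are our own):

```python
import math

def p_w(x, y, ys, weights, features):
    """P_w(y | x) = exp(sum_i w_i f_i(x, y)) / Z_w(x)."""
    def score(cand):
        return math.exp(sum(w * f(x, cand) for w, f in zip(weights, features)))
    return score(y) / sum(score(cand) for cand in ys)

# With all weights at zero, every output gets the same score, so the model
# is uniform over ys -- the maximum-entropy guess before any constraints bite.
print(p_w(person, "play", ys, [0.0, 0.0], [f1, f2]))  # 0.5
```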
The only problem left now is finding the optimal parameters $w^* = \arg\max_w \Psi(w)$, where $\Psi(w) = \min_P L(P, w)$ is the dual function (maximizing it is equivalent to maximum-likelihood estimation of $P_w$), which we leave to the almighty gradient-descent method.
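A minimal training sketch under the same toy setup, reusing the helpers above. It does gradient ascent on the log-likelihood, whose gradient for each $w_i$ is exactly $E_{\tilde{P}}(f_i) - E_P(f_i)$, so the constraints hold at the optimum; the learning rate and step count are arbitrary choices:

```python
def train(data, ys, features, lr=0.5, steps=500):
    """Gradient ascent on the log-likelihood of the maxent model."""
    w = [0.0] * len(features)
    for _ in range(steps):
        grad = [empirical_expectation(f, data)
                - model_expectation(f, data,
                                    lambda x, y: p_w(x, y, ys, w, features), ys)
                for f in features]
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

w = train(data, ys, [f1, f2])
print(w)  # ~[log 2, log 2]: model expectations now match the empirical ones
```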