
SLM Chapter 6 Logistic Regression

Logistic Distribution

$$
F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}
$$

$$
f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma\left(1 + e^{-(x-\mu)/\gamma}\right)^2}
$$

where μ controls the center of the distribution, and γ controls its sharpness (smaller γ means a sharper curve).

*Figure: the logistic distribution density and its CDF.*
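As a minimal sketch, the CDF and density above can be computed directly with NumPy (`mu` and `gamma` are the location and scale parameters from the formulas):

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    # F(x) = 1 / (1 + exp(-(x - mu) / gamma))
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    # f(x) = exp(-(x - mu) / gamma) / (gamma * (1 + exp(-(x - mu) / gamma))**2)
    z = np.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

x = np.linspace(-6, 6, 7)
print(logistic_cdf(x))   # increases from ~0 to ~1
print(logistic_pdf(x))   # peaks at x = mu
```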

Real Value to Probability

Suppose you want to calculate how likely a sample $x$ belongs to class $k$. You calculate $z = w \cdot x + b$ for some weight $w$ and bias $b$. The problem is that $z$ lies in $(-\infty, +\infty)$, while we need a value in $(0, 1)$. Here is the solution.

For some probability $p \in (0, 1)$, we calculate its odds as

$$
\mathrm{odds}(p) = \frac{p}{1 - p}
$$

where $\mathrm{odds}(p) \in (0, +\infty)$. In real life, when we describe, let’s say, the chance of winning a lottery, we may say it’s ‘1 in 100000’. This is equivalent to saying that the probability $p = 1/100000$ of winning the lottery satisfies:

$$
\mathrm{odds}(p) = \frac{\text{chance of winning}}{\text{chance of losing}} = \frac{1}{100000 - 1}
$$

We further map $(0, +\infty)$ to $(-\infty, +\infty)$ by taking the logarithm, resulting in the logit function

$$
\mathrm{logit}(p) = \log\frac{p}{1 - p}
$$

Finally, we can let

$$
\mathrm{logit}(p) = w \cdot x + b
$$

Solving for $p$, we get

$$
p = \frac{1}{1 + e^{-(w \cdot x + b)}}
$$
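As a minimal sketch of this mapping (the weights, bias, and sample below are arbitrary illustrative values), the logit/sigmoid round trip looks like this in NumPy:

```python
import numpy as np

def sigmoid(z):
    # maps a real-valued score z in (-inf, +inf) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # inverse of the sigmoid: the log-odds of p
    return np.log(p / (1.0 - p))

w, b = np.array([0.5, -1.2]), 0.3    # arbitrary weights and bias
x = np.array([2.0, 1.0])             # an arbitrary sample

z = w @ x + b                        # real-valued score
p = sigmoid(z)                       # probability that x belongs to the class
print(p, np.isclose(logit(p), z))    # logit(sigmoid(z)) recovers z
```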

Chapter 6.2: Maximum Entropy Model

References

https://repository.upenn.edu/cgi/viewcontent.cgi?article=1083&context=ircs_reports

Idea

When a distribution cannot be fully determined by the given evidence, we make a guess that maximizes the entropy of the distribution, subject to the evidence as constraints.

Solving for the maximum entropy model is actually a constrained optimization problem: we would like to maximize the entropy of our model while satisfying the constraint that the model must match our observations.

This is a discriminative model, i.e. it directly estimates $p(y \mid x)$ (as opposed to a generative model; see Chapter 1 for a comparison).

Set Up for Target Function

We let the entropy of $P(Y \mid X)$ be:

$$
H(P) = E_x\left[H(Y \mid X = x)\right] = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \ln p(y \mid x)
$$

where $\tilde{p}(x)$ is the empirical probability of $x$ (obtained by counting in the training set).

H(P) is then our maximization target.
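As a minimal sketch, assuming $\tilde{p}(x)$ and $p(y \mid x)$ are stored as small probability tables, the conditional entropy above can be computed like this:

```python
import numpy as np

def conditional_entropy(p_x, p_y_given_x):
    """H(P) = -sum_{x,y} p~(x) p(y|x) ln p(y|x).

    p_x:         shape (n_x,)      empirical probabilities of each x
    p_y_given_x: shape (n_x, n_y)  model probabilities, rows sum to 1
    """
    p = p_y_given_x
    return -np.sum(p_x[:, None] * p * np.log(p))

# toy example: 2 inputs, 3 labels, uniform model -> entropy equals ln(3)
p_x = np.array([0.4, 0.6])
p_y_given_x = np.full((2, 3), 1.0 / 3.0)
print(conditional_entropy(p_x, p_y_given_x), np.log(3))
```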

Set Up for Constraints

First, for an input sentence $x$ and label $y$, we define a feature function $f(x, y)$ as

$$
f(x, y) =
\begin{cases}
1 & \text{if } x \text{ and } y \text{ satisfy some assertion} \\
0 & \text{otherwise}
\end{cases}
$$

For a problem, we may have multiple feature functions f1,f2,,fn.

For example, consider filling in the following blank

I will <BLANK> the piano when I get home.

In the training set, we see many different phrases. For example, ‘The musician plays the piano’, ‘The shop sells the piano’, etc. We may define the following feature functions:

$$
f_1(x, y) =
\begin{cases}
1 & \text{if the word 'play' follows a person in } x \text{ and } y = \text{'play'} \\
0 & \text{otherwise}
\end{cases}
$$

$$
f_2(x, y) =
\begin{cases}
1 & \text{if the word 'sell' follows a place name in } x \text{ and } y = \text{'sell'} \\
0 & \text{otherwise}
\end{cases}
$$
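A feature function is just a binary indicator on an $(x, y)$ pair. Below is a rough sketch of $f_1$ and $f_2$; the word lists and the crude pattern matching are hypothetical stand-ins for whatever preprocessing a real system would use:

```python
# Each training example is a (sentence, label) pair; a feature function is a
# binary indicator on that pair. PERSON_WORDS / PLACE_WORDS and the naive
# adjacency check are hypothetical simplifications for illustration only.
PERSON_WORDS = {"musician", "pianist"}
PLACE_WORDS = {"shop", "store"}

def verb_after(sentence, subjects, verb_stem):
    # True if a word starting with `verb_stem` directly follows a word
    # from `subjects` in the sentence.
    words = sentence.lower().split()
    return any(prev in subjects and cur.startswith(verb_stem)
               for prev, cur in zip(words, words[1:]))

def f1(x, y):
    # 1 if the word 'play' follows a person in x and y = 'play'
    return int(verb_after(x, PERSON_WORDS, "play") and y == "play")

def f2(x, y):
    # 1 if the word 'sell' follows a place name in x and y = 'sell'
    return int(verb_after(x, PLACE_WORDS, "sell") and y == "sell")

print(f1("The musician plays the piano", "play"))  # 1
print(f2("The shop sells the piano", "play"))      # 0
```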

It is then easy to see that the number of sentences where ‘play’ follows a person is $\sum_{x,y} f_1(x, y)$, where the sum runs over the training examples.

Therefore, the probability of observing such a fact in the training set is

$$
E_{\tilde{p}(x,y)}\left[f_1(x, y)\right] = \sum_{x,y} \tilde{p}(x, y)\, f_1(x, y)
$$

We define our model (estimate) as $p(y \mid x)$. Under our model, the probability of making such an observation is

$$
E_{p(x,y)}\left[f_1(x, y)\right] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_1(x, y)
$$

Note that we factorize $p(x, y)$ as $\tilde{p}(x)\, p(y \mid x)$, using the empirical $\tilde{p}(x)$ in place of the unknown $p(x)$.

If our model is correct, we must have $E_{\tilde{p}(x,y)}\left[f_1(x, y)\right] = E_{p(x,y)}\left[f_1(x, y)\right]$.

Therefore, in our model $p(y \mid x)$, for every feature function $f_i(x, y)$ we require that $E_p(f_i)$ equals $E_{\tilde{p}}(f_i)$, i.e. our model (estimate) matches our observations of the data. These equalities are the constraints on our model (together with the constraint that the estimated distribution sums to 1).
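As a minimal sketch of the constraint, assuming $x$ and $y$ take a handful of discrete values so that $\tilde{p}(x, y)$, $p(y \mid x)$, and one feature function can be stored as small tables, the two expectations can be compared like this:

```python
import numpy as np

# Toy setup: 2 possible inputs x, 3 possible labels y, one feature function f
# given as a 0/1 table of shape (n_x, n_y).
f = np.array([[1, 0, 0],
              [0, 1, 0]])

p_xy_emp = np.array([[0.3, 0.1, 0.1],     # empirical joint p~(x, y), sums to 1
                     [0.1, 0.3, 0.1]])
p_x_emp = p_xy_emp.sum(axis=1)            # empirical marginal p~(x)
p_y_given_x = np.array([[0.6, 0.2, 0.2],  # model p(y | x), rows sum to 1
                        [0.2, 0.6, 0.2]])

E_emp = np.sum(p_xy_emp * f)                          # E_{p~}(f)
E_model = np.sum(p_x_emp[:, None] * p_y_given_x * f)  # E_p(f)

# The maximum entropy constraint requires these two to be equal;
# the toy numbers above happen to satisfy it.
print(E_emp, E_model)
```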

Optimization Problem

We need to maximize the entropy of our model $H(P)$ subject to the constraints. The optimization problem can be expressed as follows:

$$
\begin{aligned}
\max_{P} \quad & H(P) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \ln p(y \mid x) \\
\text{s.t.} \quad & E_p(f_i) = E_{\tilde{p}}(f_i), \quad i = 1, 2, \dots, n \\
& \sum_{y} P(y \mid x) = 1
\end{aligned}
$$

Lagrange Function

We rewrite the above problem in the standard form for Lagrangian optimization, flipping the sign of the objective to make it a minimization:

$$
\begin{aligned}
\min_{P} \quad & -H(P) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \ln p(y \mid x) \\
\text{s.t.} \quad & E_p(f_i) - E_{\tilde{p}}(f_i) = 0, \quad i = 1, 2, \dots, n \\
& \sum_{y} P(y \mid x) = 1
\end{aligned}
$$

We define the Lagrange function $L(P, w)$ as:

$$
L(P, w) = -H(P) + w_0\left(1 - \sum_{y} P(y \mid x)\right) + \sum_{i=1}^{n} w_i \left(E_{\tilde{p}}(f_i) - E_p(f_i)\right)
$$

and solve the following

$$
\frac{\partial L(P, w)}{\partial P(y \mid x)} = 0
$$

See the reference linked above for a detailed explanation of why we do this.

The solution tells us that our model $P_w(y \mid x)$, parametrized by $w$, will have the form

$$
P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\left(\sum_{i} w_i f_i(x, y)\right)
$$

where $Z_w(x)$ is a normalization term calculated as

$$
Z_w(x) = \sum_{y} \exp\left(\sum_{i} w_i f_i(x, y)\right)
$$
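As a minimal sketch, assuming the feature values $f_i(x, y)$ for one fixed input $x$ are precomputed into a table with one row per candidate label, $P_w(y \mid x)$ and $Z_w(x)$ can be computed as a softmax over feature scores:

```python
import numpy as np

def maxent_probs(feature_table, w):
    """P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w(x).

    feature_table: shape (n_labels, n_features), entry [y, i] = f_i(x, y)
                   for one fixed input x
    w:             shape (n_features,) feature weights
    """
    scores = feature_table @ w            # sum_i w_i f_i(x, y) for each y
    scores -= scores.max()                # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # divide by Z_w(x)

# toy example: 3 candidate labels, 2 feature functions
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])
w = np.array([2.0, 0.5])
print(maxent_probs(features, w))          # probabilities over the 3 labels
```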

The only problem left now is finding the optimal parameter $w$, i.e. the $w$ that maximizes the likelihood of the training data under $P_w(y \mid x)$:

$$
w^* = \arg\max_{w} \prod_{(x, y)} P_w(y \mid x)
$$

which we leave to the almighty gradient method (gradient ascent on the log-likelihood, or equivalently gradient descent on its negative).
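As a rough sketch of that last step (assuming the same table encoding as above, with `features[x, y, i]` $= f_i(x, y)$ and an empirical joint $\tilde{p}(x, y)$), gradient ascent can use the standard fact that the gradient of the log-likelihood with respect to $w_i$ is $E_{\tilde{p}}(f_i) - E_{P_w}(f_i)$:

```python
import numpy as np

def train_maxent(features, p_xy_emp, lr=0.5, n_steps=500):
    """Gradient ascent on the log-likelihood of a maximum entropy model.

    features:  shape (n_x, n_y, n_feat), features[x, y, i] = f_i(x, y)
    p_xy_emp:  shape (n_x, n_y), empirical joint distribution p~(x, y)
    """
    n_x, n_y, n_feat = features.shape
    p_x_emp = p_xy_emp.sum(axis=1)                     # empirical marginal p~(x)
    w = np.zeros(n_feat)
    for _ in range(n_steps):
        scores = features @ w                          # (n_x, n_y)
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        p_model = np.exp(scores)
        p_model /= p_model.sum(axis=1, keepdims=True)  # P_w(y | x)
        # gradient of the log-likelihood: E_{p~}(f) - E_{P_w}(f)
        e_emp = np.einsum('xy,xyi->i', p_xy_emp, features)
        e_model = np.einsum('x,xy,xyi->i', p_x_emp, p_model, features)
        w += lr * (e_emp - e_model)
    return w

# toy data: 2 inputs, 2 labels, 1 feature that fires only for (x=0, y=0)
feats = np.zeros((2, 2, 1))
feats[0, 0, 0] = 1.0
p_emp = np.array([[0.4, 0.1], [0.25, 0.25]])
print(train_maxent(feats, p_emp))   # weight that matches the feature expectation
```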

This post is licensed under CC BY 4.0 by the author.