

Probability Distributions I

B.R. Martin , in Statistics for Physical Science, 2012

3.3.2 Marginal and Conditional Distributions

We may also be interested in the density function of a subset of the variables. This is called the marginal density function f_M and, in general, for a subset x_i (i = 1, 2, …, m < n) of the variables, it is given by integrating the joint density function over all the variables other than x_1, x_2, …, x_m. Thus,

(3.20a)  f_M(x_1, x_2, \ldots, x_m) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f(x_1, \ldots, x_m, x_{m+1}, \ldots, x_n) \prod_{i=m+1}^{n} dx_i .

In the case of two variables, we may be interested in the density function of x regardless of the value of y, or the density function of y regardless of x. For example, the failure rate of a resistor may be a function of its operating temperature and the voltage across it, but in some circumstances we might be interested in just the dependence on the former. In these cases (3.20a) becomes

(3.20b)  f_M(x) = \int_{-\infty}^{+\infty} f(x, y)\, dy \quad \text{and} \quad f_M(y) = \int_{-\infty}^{+\infty} f(x, y)\, dx .

These density functions correspond to the normalized histograms obtained by projecting a scatter plot of x and y onto one of the axes. This is illustrated in Fig. 3.4, again using the data of Fig. 1.3(b).

FIGURE 3.4. Normalized histograms obtained by projecting the data of Fig. 1.3(b) onto the x and y axes, together with the corresponding marginal probability density functions f M ( x ) and f M ( y ) .

We can also define the multivariate conditional density function of the random variables x_i (i = 1, 2, …, m < n) by

(3.21)  f_C(x_1, x_2, \ldots, x_m \mid x_{m+1}, x_{m+2}, \ldots, x_n) \equiv \frac{f(x_1, x_2, \ldots, x_n)}{f_M(x_{m+1}, x_{m+2}, \ldots, x_n)} .

Again, if we consider the case of two variables x and y, the probability for y to be in the interval ( y , y + d y ) with any x (event B), given that x is in the interval ( x , x + d x ) with any y (event A), is

P[B \mid A] = \frac{P[A \cap B]}{P[A]} = \frac{f(x, y)\, dx\, dy}{f_M(x)\, dx} ,

where f_M(x) is the marginal density function for x. The conditional density function for y given x is thus

(3.22a)  f_C(y \mid x) = \frac{f(x, y)}{f_M(x)} = \frac{f(x, y)}{\int_{-\infty}^{+\infty} f(x, y)\, dy} .

This is the density function of the single random variable y where x is treated as a constant. It corresponds to projecting the events in a band dx centered at some value x onto the y axis and renormalizing the resulting density so that it is unity when integrated over y. The form of f C ( y | x ) will therefore vary as different values of x are chosen.

The conditional density function for x given y is obtained from (3.22a) by interchanging x and y, so that

(3.22b)  f_C(x \mid y) = \frac{f(x, y)}{f_M(y)} = \frac{f(x, y)}{\int_{-\infty}^{+\infty} f(x, y)\, dx} ,

and combining these two equations gives

(3.22c)  f_C(x \mid y) = \frac{f_C(y \mid x)\, f_M(x)}{f_M(y)} ,

which is Bayes' theorem for continuous variables.

We can use these definitions to generalize the law of total probability (2.6b) to the case of continuous variables. Using conditional and marginal density functions we have

(3.23)  f(x, y) = f_C(y \mid x)\, f_M(x) = f_C(x \mid y)\, f_M(y) ,

so the marginal density functions may be written as

f_M(y) = \int_{-\infty}^{+\infty} f_C(y \mid x)\, f_M(x)\, dx

and

f_M(x) = \int_{-\infty}^{+\infty} f_C(x \mid y)\, f_M(y)\, dy .

With more than one random variable we also have to consider the question of statistical independence (by analogy with the work of Chapter 2). If the random variables may be split into groups such that their joint density function is expressible as a product of marginal density functions of the form

f(x_1, x_2, \ldots, x_n) = f_{1M}(x_1, x_2, \ldots, x_i)\, f_{2M}(x_{i+1}, x_{i+2}, \ldots, x_k) \cdots f_{nM}(x_{l+1}, x_{l+2}, \ldots, x_n) ,

then the sets of variables

(x_1, x_2, \ldots, x_i);\ (x_{i+1}, x_{i+2}, \ldots, x_k);\ \ldots;\ (x_{l+1}, x_{l+2}, \ldots, x_n) ,

are said to be statistically independent, or independently distributed. So two random variables x and y are independently distributed if

(3.24)  f(x, y) = f_M(x)\, f_M(y) .

It follows from (3.22) that in this case the conditional density function of one variate does not depend on knowledge about the other variate.

EXAMPLE 3.7

The joint mass function for two discrete variables x and y is given by

f(x, y) = \begin{cases} k(2x + 3y), & 0 \le x \le 3,\ 0 \le y \le 2 \\ 0, & \text{otherwise}, \end{cases}

where k is a constant. Find: (a) the value of k, (b) P[x ≥ 2, y ≤ 1], and (c) the marginal density of x.

The mass function may be tabulated as below.

         y = 0   y = 1   y = 2   total
x = 0      0      3k      6k      9k
x = 1      2k     5k      8k      15k
x = 2      4k     7k      10k     21k
x = 3      6k     9k      12k     27k
total      12k    24k     36k     72k

(a)

The normalization condition is \sum_{x,y} f(x, y) = 1, so 72k = 1 and hence k = 1/72.

(b)

P[x \ge 2, y \le 1] = P[x = 2, y = 1] + P[x = 2, y = 0] + P[x = 3, y = 1] + P[x = 3, y = 0] = 7k + 4k + 9k + 6k = 26k = 13/36 .

(c)

The marginal probability of x is

P[x] = \sum_y P[x, y] = \begin{cases} 9k = 3/24, & x = 0 \\ 15k = 5/24, & x = 1 \\ 21k = 7/24, & x = 2 \\ 27k = 9/24, & x = 3 \end{cases}
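The bookkeeping of Example 3.7 can be verified with a short script; the exact-fraction arithmetic below is only a sketch of the tabulation, not part of the original text:

```python
# Exact-fraction check of Example 3.7: f(x, y) = k(2x + 3y), x = 0..3, y = 0..2.
from fractions import Fraction

f = {(x, y): Fraction(2 * x + 3 * y) for x in range(4) for y in range(3)}

# (a) Normalization: the entries sum to 72, so k = 1/72.
total = sum(f.values())
k = Fraction(1, 1) / total

# (b) P[x >= 2, y <= 1] = 26k = 13/36.
p = k * sum(v for (x, y), v in f.items() if x >= 2 and y <= 1)

# (c) Marginal mass function of x: sum each row over y.
marginal_x = {x: k * sum(f[(x, y)] for y in range(3)) for x in range(4)}
print(total, p, marginal_x)
```

The marginal values agree with the piecewise display above (9k = 1/8, 15k = 5/24, 21k = 7/24, 27k = 3/8).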

EXAMPLE 3.8

If f ( x , y ) is the joint density function of two continuous random variables x and y, defined by

f(x, y) = \begin{cases} e^{-(x+y)}, & x, y \ge 0 \\ 0, & \text{otherwise}, \end{cases}

find their conditional distribution.

From (3.22b),

f_C(x \mid y) = \frac{f(x, y)}{f_M(y)} ,

where the marginal density of y is given from (3.20b) as

f_M(y) = \int_0^{\infty} f(x, y)\, dx = e^{-y} \left[ -e^{-x} \right]_0^{\infty} = e^{-y} .

Thus

f_C(x \mid y) = \frac{e^{-(x+y)}}{e^{-y}} = e^{-x} .
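The calculation can be checked numerically, for instance by truncating the infinite integration range (the truncation point and evaluation points below are arbitrary choices):

```python
# Numerical check of Example 3.8: with f(x, y) = exp(-(x + y)) for x, y >= 0,
# the marginal of y is exp(-y) and the conditional density reduces to exp(-x).
import math

def f(x, y):
    return math.exp(-(x + y)) if x >= 0 and y >= 0 else 0.0

def trapz(g, a, b, n=20000):
    """Simple trapezoidal rule on a finite range."""
    h = (b - a) / n
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))

y = 0.7
f_M = trapz(lambda x: f(x, y), 0.0, 40.0)   # truncate the infinite range at 40
f_C = f(1.3, y) / f_M
print(f_M, math.exp(-0.7))                  # both close to exp(-0.7)
print(f_C, math.exp(-1.3))                  # both close to exp(-1.3)
```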

URL: https://www.sciencedirect.com/science/article/pii/B9780123877604000032

Conceptual Econometrics Using R

Kris Boudt , ... Eric Zivot , in Handbook of Statistics, 2019

6 BIP and GAS MGARCH models

The estimated covariance updating equation of MGARCH models is mostly used as a filter to predict the conditional covariance based on the observed return series. The MGARCH models discussed so far tend to use the same filter irrespective of the shape of the distribution function. This typically leads to a large spike in the conditional covariance prediction following an extreme return realization. In order to dampen the effect of outliers on covariance predictions, Boudt and Croux (2010) and Boudt et al. (2013) recommend using robust MGARCH filters that have the property of Bounded Innovation Propagation (BIP). The resulting model is called BIP-MGARCH, for which they propose robust M-estimators under the assumption of elliptical innovations. The BIP-BEKK model corresponding to the BEKK model in (5) is given by:

(63)  H_t = CC' + \sum_{k=1}^{K} \sum_{j=1}^{q} A_{jk} \tilde{\varepsilon}_{t-j} \tilde{\varepsilon}_{t-j}' A_{jk}' + \sum_{k=1}^{K} \sum_{j=1}^{p} B_{jk} H_{t-j} B_{jk}' ,

with

(64)  \tilde{\varepsilon}_t = \varepsilon_t \sqrt{w\!\left(\varepsilon_t' H_t^{-1} \varepsilon_t\right)} .

The weight function w(⋅) must be such that the effect of ɛ t on H t is bounded. Boudt and Croux (2010) use the following weight function

(65)  w(z) = \begin{cases} 1, & z \le c_1 \\ 1 - \left(1 - c_1/z\right)^3, & c_1 < z \le c_2 \\ \left(c_2/z\right)\left(1 - \left(1 - c_1/c_2\right)^3\right), & \text{otherwise}. \end{cases}

They set the parameters c_1 and c_2 equal to the 99% and 99.9% quantiles of the distribution of the squared Mahalanobis distance (MD) \varepsilon_t' H_t^{-1} \varepsilon_t. The left panel of Fig. 1 shows this weight function when the conditional distribution of \varepsilon_t is bivariate normal (top plot) or Student t_4 (bottom plot). Note that only the observations with an extremely large MD are downweighted and that the weighting depends on the distributional assumption. In the right panel we also plot the function w(z)z, which is of interest since, if z is the squared MD of \varepsilon_t, then w(z)z is the squared MD of \tilde{\varepsilon}_t. Note that the downweighting is such that the function w(z)z is nondecreasing and bounded by w(c_2)c_2. The smoothness of the weight function is needed to avoid numerical problems in the parameter estimation.
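The weight function in (65) is straightforward to implement. The sketch below assumes bivariate Gaussian innovations, for which the squared Mahalanobis distance is chi-squared with 2 degrees of freedom, i.e., exponential with mean 2, so the quantiles c_1 and c_2 have the closed form -2 log(1 - p):

```python
# Sketch of the BIP weight function in Eq. (65) under a bivariate Gaussian
# assumption for the innovations (squared MD ~ chi-squared with 2 df).
import math

c1 = -2.0 * math.log(1.0 - 0.99)    # 99% quantile of chi2(2), about 9.21
c2 = -2.0 * math.log(1.0 - 0.999)   # 99.9% quantile of chi2(2), about 13.82

def w(z):
    """Smooth BIP weight: 1 for typical z, decaying like c2/z for extreme z."""
    if z <= c1:
        return 1.0
    if z <= c2:
        return 1.0 - (1.0 - c1 / z) ** 3
    return (c2 / z) * (1.0 - (1.0 - c1 / c2) ** 3)

# For z > c2, w(z)*z is constant, so w(z)*z is nondecreasing and bounded by w(c2)*c2.
```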


FIGURE 1. Plot of the functions w(z) and w(z)z used in the bivariate BIP-BEKK model. The parameters c_1 and c_2 are equal to the 99% and 99.9% quantiles of the squared MD under Gaussian (upper panel) and Student t_4 (lower panel) innovations.

The BIP-BEKK model can be seen as an ad hoc robustification of the BEKK filters. Similar BIP-DCC filters were proposed in Boudt et al. (2013). An elegant alternative that takes the shape of the distribution function into account is the class of Generalized Autoregressive Score (GAS) and Dynamic Conditional Score (DCS) models, proposed by Creal et al. (2013) and Harvey (2013) at Vrije Universiteit Amsterdam and Cambridge University, respectively. They developed a general framework for specifying the time-varying parameters of a conditional distribution function. The key feature of their framework is that the score of the conditional density function is used as the driver of the time variation in the parameters, making it possible to obtain the likelihood in closed form through a standard prediction error decomposition.

More formally, suppose that the variable of interest is r_t with conditional density function f(·). The conditional density depends on a vector of time-varying parameters denoted by \theta_t \in \Theta \subseteq \mathbb{R}^J. It typically contains the unique elements of the conditional covariance matrix H_t, but may also include location and shape parameters, among others. Usually, the parameter space of \theta_t is restricted by various conditions, such as the requirement of positive definiteness for the conditional covariance. The standard solution under the GAS framework is to work with parameter transformations such that the parameter of interest \theta_t stays in the parameter space. More precisely, when the unrestricted parameter vector is denoted by \tilde{\theta}_t \in \mathbb{R}^J, the GAS model uses a link function \Lambda(·) to map the unrestricted parameter \tilde{\theta}_t into the parameter of interest \theta_t. The evolution of the time-varying parameter vector \tilde{\theta}_t is driven by the scaled score of the conditional density function, defined as:

s_t = S_t \frac{\partial \log f(r_t \mid \theta_t)}{\partial \tilde{\theta}_t} ,

where the matrix S_t is a J × J positive definite scaling matrix known at time t. The quantity s_t indicates the direction in which to update the parameter vector from \theta_t to \theta_{t+1}, acting as a steepest-ascent step for improving the model's local fit given the current parameter position. This updating procedure resembles the well-known Newton-Raphson algorithm.

The updating equation for the unconstrained parameter vector is given by a linear function of the score, together with an autoregressive component:

(66)  \tilde{\theta}_{t+1} = \kappa + A s_t + B \tilde{\theta}_t ,

where \kappa is a vector and A and B are matrices of coefficients of the proper dimensions.

A general implementation of univariate and multivariate GAS models can be found in the R package GAS (Ardia et al., 2018b; Catania et al., 2017). As a simple illustration, we show in code Snippet 3 how one can define a multivariate GAS model in the GAS package and estimate it. Table 4 provides the methods and functions available for working with the model, which include forecasting, filtering, simulation, and inference, among others. We refer the reader to the documentation of the package for more details.


Snippet 3. GAS example.

Table 4. Multivariate GAS model functions and methods

Functions/methods Description Input classes
MultiGASSpec Model specification NA
MultiGASFit Model estimation 1
getFilteredParameters 1-step-ahead filtering 1
ConfidenceBands Confidence bands for the filtered parameters 1
getMoments Extract conditional moments 1, 3, 5
getForecast Extract parameter forecast 3, 5
LogScore Extract log scores 3, 5
MultiGASFor 1- to h-ahead forecasts 2
MultiGASSim Simulation 2
residuals Conditional mean equation residuals 2
coef Coefficients of model 2
show Summary of output 3,4,5
summary Summary of output 2
MultiGASRoll Rolling estimation/forecasting 1

Note: The table provides a list of the methods and functions for working with multivariate GAS models in the GAS package. The input classes are as follows: 1   =   mGASSpec, 2   =   mGASFit, 3   =   mGASFor, 4   =   mGASSim, 5   =   mGASRoll.

The approach of specifying the time variation in all the distribution parameters jointly as a function of the conditional score is unfortunately not feasible in large-scale applications due to the curse of dimensionality. Creal et al. (2011) acknowledge this shortcoming and propose using a time-varying copula specification in order to model the variances separately from the correlations. As in the DCC model of Engle (2002), it is then straightforward to combine the conditional variances and correlations into an estimate of the conditional covariance matrix H_t. We illustrate this copula approach next for a GAS variance model assuming a Student t marginal distribution, together with a GAS correlation model under the assumption of a bivariate t-copula specification.

In terms of modeling the conditional variance dynamics, we focus here on the case where a Student t distribution is assumed. Several GAS models exist under this framework. The Beta-t-EGARCH model introduced by Harvey et al. (2008) uses the exponential function as the link function. The Beta-t-GARCH model of Harvey et al. (2008) and the t-GAS model of Creal et al. (2013) use no transformation. The latter leads to a GAS volatility model that is close to the GARCH(1,1) model and for which estimates are published on https://vlab.stern.nyu.edu. Under this model, the conditional variance for asset i with zero mean and \nu_i degrees of freedom is given by:

(67)  h_{ii,t+1} = \omega_i + \alpha_i \frac{\nu_i + 3}{\nu_i} \left( \frac{\nu_i + 1}{\nu_i - 2 + \epsilon_{i,t}^2 / h_{ii,t}}\, \epsilon_{i,t}^2 - h_{ii,t} \right) + \beta_i h_{ii,t} .

Note that if \nu_i = \infty, the GAS-t volatility model collapses to a traditional GARCH model and the score has a quadratic impact on the conditional variance. This can be seen as well in Fig. 2, where we show the scaled score for various values of \nu_i. Note that the more fat-tailed the return distribution is, the more the effect of extreme returns on the future variance is dampened by downweighting when \nu_i is small. This is desired since, in the case of a fat-tailed distribution, a large value of \epsilon_{i,t}/h_{ii,t}^{1/2} may well be a tail realization and thus does not necessitate a substantial increase in the conditional variance.
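The limiting behavior of the update in (67) can be checked numerically; in the sketch below the parameter values (omega, alpha, beta, nu) are illustrative assumptions, not estimates from the chapter:

```python
# A minimal numerical sketch of the score-driven variance update in Eq. (67).
def tgas_update(h, eps, omega, alpha, beta, nu):
    """One step of the t-GAS variance recursion for a single asset."""
    # Scaled score: an extreme eps**2/h is downweighted when nu is small.
    score = (nu + 3.0) / nu * ((nu + 1.0) / (nu - 2.0 + eps ** 2 / h) * eps ** 2 - h)
    return omega + alpha * score + beta * h

# For nu -> infinity the recursion approaches the GARCH(1,1) update
# omega + alpha * (eps**2 - h) + beta * h; for small nu the same shock
# raises the predicted variance by less.
print(tgas_update(1.0, 3.0, 0.05, 0.1, 0.85, 1e9))   # close to 1.7 (GARCH limit)
print(tgas_update(1.0, 3.0, 0.05, 0.1, 0.85, 5.0))   # about 1.46: shock downweighted
```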


Fig. 2. Scaled score used as driver in the t-GAS conditional variance model for various values of ν i when h ii,t   =   1.

We now discuss the GAS correlation model under the assumption of a bivariate t-copula specification. Besides the use of a hyperspherical coordinate transformation, Creal et al. (2011) also discuss the approach of decomposing the correlation matrix R_t as \Delta_t^{-1} Q_t \Delta_t^{-1}, where Q_t is a symmetric positive definite matrix and \Delta_t is a diagonal matrix whose nonzero elements equal the square roots of the diagonal elements of Q_t. They then use the GAS framework in (66) to obtain a score-driven calibration of the time variation in vech(Q_t). The score takes the shape of the Student t copula into account, together with the surprise in the realized correlations compared to the predicted ones. We refer to Creal et al. (2011) for the general specification and limit ourselves here to summarizing their discussion of the bivariate case with fixed unit variances, correlation parameter \rho_t, and a Gaussian copula. The scaled score is then given by:

(68)  s_t = y_{1,t} y_{2,t} - \rho_t - \frac{\rho_t}{1 + \rho_t^2}\left( y_{1,t}^2 + y_{2,t}^2 - 2 \right) .

The first term increases the conditional correlation when the realized correlation exceeds the conditional correlation, while the second term attenuates this correlation increase in case of a large dispersion in the input vector. In fact, the correlation signal of (1, 1) is much stronger than that of (1/4, 4), even though their cross product is the same. Boudt et al. (2012) present similar expressions for the scaled score in the case of the t-copula. The most important change is that, as in the univariate case, the input data are then also downweighted when they have a large Mahalanobis distance. The more fat-tailed the t-copula is, the larger the downweighting, in order to compensate for the fact that large deviations of realized correlations from predicted ones may be tail events and thus do not necessarily imply changes in the conditional correlation.

The application of the GAS framework to create MGARCH models is still an active field of research, as can be seen on the overview website at http://www.gasmodel.com/. A large number of authors have worked out the GAS dynamics for more flexible distribution functions, like the univariate skewed t distribution (see, e.g., Harvey and Sucarrat (2014) and Ardia et al. (2018a)), the generalized hyperbolic skewed t distribution (Lucas et al., 2014), the Wishart distribution (Gorgi et al., 2019), and general finite mixtures of distributions (Catania, 2016). Others have generalized the GAS specification to account for regime switches (Boudt et al., 2012; Catania, 2018). In terms of attempts at obtaining GAS models that are feasible in high dimensions, we refer the reader to Boudt et al. (2012) and Lucas et al. (2017) for an analysis assuming (block) equicorrelation, and to Creal et al. (2011, 2014) for the use of dynamic factors under the GAS framework.

URL: https://www.sciencedirect.com/science/article/pii/S016971611930001X

Classical and Quantum Information Theory

Dan C. Marinescu , Gabriela M. Marinescu , in Classical and Quantum Information, 2012

Proof.

We assume that the discrete random variable, X, has a probability distribution, p_X(x), and call 𝒳 the range of values of X. We base our estimation of X on the random variable, Y, which is related to X by the conditional density function, p_{Y|X}(y|x); once we know Y, we estimate X as a function of Y, X̃ = g(Y). Clearly, X, Y, and X̃ form a Markov chain, X → Y → X̃. Now we introduce a binary random variable, A, defined as

A = \begin{cases} 1 & \text{if } \tilde{X} \ne X , \\ 0 & \text{if } \tilde{X} = X . \end{cases}

We apply the chain rule for entropy and write

H(A, X | Y) = H(X | Y) + H(A | X, Y) = H(X | Y) .

Indeed, H (A|X, Y) = 0 as A is completely determined once we know X and Y. We apply again the chain rule for entropy to express H (A,X|Y) differently:

H(A, X | Y) = H(A | Y) + H(X | A, Y) .

A is a binary random variable, and Prob(A = 0) = 1 − perr and Prob(A = 1) = perr . Thus, H (A) = H(perr ); also, H(A|Y) ≤ H(A), since conditioning reduces the entropy. Now,

H(X | A, Y) = Prob(A = 0) H(X | Y, A = 0) + Prob(A = 1) H(X | Y, A = 1) .

We recognize that H(X | Y, A = 0) = 0, because X = X̃ when A = 0. Let |𝒳| denote the cardinality of the target set of X; given A = 1, X takes at most |𝒳| − 1 values, so H(X | Y, A = 1) ≤ log(|𝒳| − 1), and we conclude that

H(A | Y) + H(X | A, Y) ≤ H(p_err) + (1 − p_err) · 0 + p_err log(|𝒳| − 1) = H(p_err) + p_err log(|𝒳| − 1) .

Combining the two expressions for H(A, X | Y), we obtain the Fano inequality:

H(p_err) + p_err log(|𝒳| − 1) ≥ H(X | Y) .

But H(p_err) ≤ 1, and |𝒳|, the cardinality of the set of sample values of the random variable X, is expected to be large, so |𝒳| − 1 ≈ |𝒳|; therefore, weaker forms of Fano's inequality are

1 + p_err log|𝒳| ≥ H(X | Y)  or  p_err ≥ (H(X | Y) − 1) / log|𝒳| .
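As a quick sanity check, the inequality can be evaluated on a simple symmetric channel, for which it in fact holds with equality when the estimate is X̃ = Y; the channel and parameter values below are illustrative choices, not from the text:

```python
# Fano's inequality on a symmetric channel where it holds with equality.
# X is uniform on m symbols; Y = X with probability 1 - q, otherwise Y is
# uniform over the other m - 1 symbols; the estimate is X_hat = Y.
import math

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

m, q = 8, 0.1
p_err = q                                     # Pr{X_hat != X} when X_hat = Y
H_X_given_Y = h2(q) + q * math.log2(m - 1)    # exact H(X|Y) for this channel
fano_bound = h2(p_err) + p_err * math.log2(m - 1)
print(fano_bound >= H_X_given_Y - 1e-12)      # True: the bound holds, tightly here
```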

URL: https://www.sciencedirect.com/science/article/pii/B9780123838742000035

Poisson Processes

Mark A. Pinsky , Samuel Karlin , in An Introduction to Stochastic Modeling (Fourth Edition), 2011

Problems

5.4.1

Let W_1, W_2, … be the event times in a Poisson process {X(t); t ≥ 0} of rate λ. Suppose it is known that X(1) = n. For k ≤ n, what is the conditional density function of W_1, …, W_{k−1}, W_{k+1}, …, W_n, given that W_k = w?

5.4.2

Let {N (t); t ≥ 0} be a Poisson process of rate λ, representing the arrival process of customers entering a store. Each customer spends a duration in the store that is a random variable with cumulative distribution function G. The customer durations are independent of each other and of the arrival process. Let X (t) denote the number of customers remaining in the store at time t, and let Y (t) be the number of customers who have arrived and departed by time t. Determine the joint distribution of X (t) and Y (t).

5.4.3

Let W 1, W 2,… be the waiting times in a Poisson process {X (t); t ≥ 0} of rate λ. Under the condition that X (1) = 3, determine the joint distribution of U = W 1/W 2 and V = (1 − W 3)/(1 − W 2).

5.4.4

Let W 1, W 2, … be the waiting times in a Poisson process {X (t); t ≥ 0} of rate λ. Independent of the process, let Z 1, Z 2, … be independent and identically distributed random variables with common probability density function f (x), 0 < x < ∞. Determine Pr{Z > z}, where

Z = \min\{ W_1 + Z_1, W_2 + Z_2, \ldots \} .

5.4.5

Let W 1, W 2, … be the waiting times in a Poisson process {N (t); t ≥ 0} of rate λ. Determine the limiting distribution of W 1, under the condition that N (t) = n as n → ∞ and t → ∞ in such a way that n/t = β > 0.

5.4.6

Customers arrive at a service facility according to a Poisson process of rate λ customers/hour. Let X (t) be the number of customers that have arrived up to time t. Let W 1, W 2, … be the successive arrival times of the customers.

(a)

Determine the conditional mean E [W 1| X (t) = 2].

(b)

Determine the conditional mean E [W 3| X (t) = 5].

(c)

Determine the conditional probability density function for W 2, given that X (t) = 5.

5.4.7

Let W 1, W 2, … be the event times in a Poisson process {X (t); t ≥ 0} of rate λ, and let f (w) be an arbitrary function. Verify that

E\left[ \sum_{i=1}^{X(t)} f(W_i) \right] = \lambda \int_0^t f(w)\, dw .

5.4.8

Electrical pulses with independent and identically distributed random amplitudes ξ1, ξ2, … arrive at a detector at random times W 1, W 2, … according to a Poisson process of rate λ. The detector output θ k (t) for the k th pulse at time t is

\theta_k(t) = \begin{cases} 0 & \text{for } t < W_k , \\ \xi_k \exp\{-\alpha(t - W_k)\} & \text{for } t \ge W_k . \end{cases}

That is, the amplitude impressed on the detector when the pulse arrives is ξ k , and its effect thereafter decays exponentially at rate α. Assume that the detector is additive, so that if N (t) pulses arrive during the time interval [0, t], then the output at time t is

Z(t) = \sum_{k=1}^{N(t)} \theta_k(t) .

Determine the mean output E [Z (t)] assuming N (0) = 0. Assume that the amplitudes ξ1, ξ2, … are independent of the arrival times W 1, W 2,….
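A Monte Carlo sketch of this setup (with illustrative values for λ, α, t, and exponentially distributed amplitudes) can be compared against the closed-form mean E[Z(t)] = λ E[ξ](1 − e^{−αt})/α, which follows from Problem 5.4.7 applied to f(w) = e^{−α(t−w)} together with the independence of the amplitudes:

```python
# Monte Carlo sketch for the shot-noise mean (illustrative parameter values).
import math, random

random.seed(1)
lam, alpha, t, mean_amp = 2.0, 0.5, 4.0, 1.0   # rate, decay, horizon, E[xi]

def poisson(mean):
    """Knuth's method for a Poisson random variate."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def simulate_Z():
    # Given N(t) = n, the arrival times are i.i.d. Uniform(0, t).
    z = 0.0
    for _ in range(poisson(lam * t)):
        w_k = random.uniform(0.0, t)
        xi_k = random.expovariate(1.0 / mean_amp)
        z += xi_k * math.exp(-alpha * (t - w_k))
    return z

n_sim = 20000
est = sum(simulate_Z() for _ in range(n_sim)) / n_sim
exact = lam * mean_amp * (1.0 - math.exp(-alpha * t)) / alpha
print(est, exact)   # both close to 3.459
```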

5.4.9

Customers arrive at a service facility according to a Poisson process of rate λ customers per hour. Let N (t) be the number of customers that have arrived up to time t, and let W 1, W 2, … be the successive arrival times of the customers. Determine the expected value of the product of the waiting times up to time t. (Assume that W 1 W 2 ··· WN (t) = 1 when N (t) = 0.)

5.4.10

Compare and contrast the example immediately following Theorem 5.7, the shot noise process of Section 5.4.1, and the model of Problem 4.8. Can you formulate a general process of which these three examples are special cases?

5.4.11

Computer Challenge Let U 0, U 1, … be independent random variables, each uniformly distributed on the interval (0, 1). Define a stochastic process {Sn } recursively by setting

S_0 = 0 \quad \text{and} \quad S_{n+1} = U_n (1 + S_n) \quad \text{for } n \ge 0 .

(This is an example of a discrete-time, continuous-state Markov process.) When n becomes large, the distribution of S_n approaches that of a random variable S = S_∞, and S must have the same probability distribution as U(1 + S), where U and S are independent. We write this in the form

S \stackrel{d}{=} U(1 + S) ,

from which it is easy to determine that E[S] = 1, Var[S] = 1/2, and even (the Laplace transform)

E\left[ e^{-\theta S} \right] = \exp\left\{ -\int_0^{\theta} \frac{1 - e^{-u}}{u}\, du \right\} , \quad \theta > 0 .

The probability density function f (s) satisfies

f(s) = 0 \ \text{for}\ s \le 0 , \qquad \frac{df}{ds} = -\frac{1}{s}\, f(s - 1) \ \text{for}\ s > 0 .

What is the 99th percentile of the distribution of S? (Note: Consider the shot noise process of Section 5.4.1. When the Poisson process has rate λ = 1 and the impulse response function is the exponential h(x) = exp{−x}, then the shot noise I(t) has, in the limit for large t, the same distribution as S.)
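One way to attack the Computer Challenge is direct simulation of the recursion; the sketch below estimates the percentile empirically (the iteration count and sample size are arbitrary illustrative choices):

```python
# Simulation sketch for the Computer Challenge: S_{n+1} = U_n * (1 + S_n).
import random

random.seed(7)

def sample_S(n_iter=40):
    # The recursion forgets its starting point geometrically fast, so a
    # modest number of iterations approximates S = S_infinity well.
    s = 0.0
    for _ in range(n_iter):
        s = random.random() * (1.0 + s)
    return s

samples = sorted(sample_S() for _ in range(50000))
mean = sum(samples) / len(samples)        # should be near E[S] = 1
p99 = samples[int(0.99 * len(samples))]   # empirical 99th percentile
print(mean, p99)
```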

URL: https://www.sciencedirect.com/science/article/pii/B9780123814166000058

Conditional Probability and Conditional Expectation

Sheldon M. Ross , in Introduction to Probability Models (Twelfth Edition), 2019

3.3 The Continuous Case

If X and Y have a joint probability density function f ( x , y ) , then the conditional probability density function of X, given that Y = y , is defined for all values of y such that f Y ( y ) > 0 , by

f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)}

To motivate this definition, multiply the left side by dx and the right side by ( d x d y ) / d y to get

f_{X|Y}(x \mid y)\, dx = \frac{f(x, y)\, dx\, dy}{f_Y(y)\, dy} \approx \frac{P\{x \le X \le x + dx,\; y \le Y \le y + dy\}}{P\{y \le Y \le y + dy\}} = P\{x \le X \le x + dx \mid y \le Y \le y + dy\}

In other words, for small values dx and dy, f X | Y ( x | y ) d x is approximately the conditional probability that X is between x and x + d x given that Y is between y and y + d y .

The conditional expectation of X, given that Y = y , is defined for all values of y such that f Y ( y ) > 0 , by

E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X|Y}(x \mid y)\, dx

Example 3.5

Suppose the joint density of X and Y is given by

f(x, y) = \begin{cases} 6xy(2 - x - y), & 0 < x < 1,\ 0 < y < 1 \\ 0, & \text{otherwise} \end{cases}

Compute the conditional expectation of X given that Y = y , where 0 < y < 1 .

Solution:  We first compute the conditional density

f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)} = \frac{6xy(2 - x - y)}{\int_0^1 6xy(2 - x - y)\, dx} = \frac{6xy(2 - x - y)}{y(4 - 3y)} = \frac{6x(2 - x - y)}{4 - 3y}

Hence,

E[X \mid Y = y] = \frac{\int_0^1 6x^2(2 - x - y)\, dx}{4 - 3y} = \frac{2(2 - y) - \frac{3}{2}}{4 - 3y} = \frac{5 - 4y}{8 - 6y}
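The algebra can be verified numerically; the quadrature below is a sketch with an arbitrarily chosen y:

```python
# Numerical check of Example 3.5 at y = 0.3 (an arbitrary interior point).
def trapz(g, a, b, n=10000):
    h = (b - a) / n
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))

y = 0.3
joint = lambda x: 6.0 * x * y * (2.0 - x - y)
f_y = trapz(joint, 0.0, 1.0)                      # marginal: y*(4 - 3y) = 0.93
e_x = trapz(lambda x: x * joint(x), 0.0, 1.0) / f_y
print(e_x, (5 - 4 * y) / (8 - 6 * y))             # both close to 0.6129
```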

Example 3.6 The t-Distribution

If Z and Y are independent, with Z having a standard normal distribution and Y having a chi-squared distribution with n degrees of freedom, then the random variable T defined by

T = \frac{Z}{\sqrt{Y/n}} = \sqrt{n}\, \frac{Z}{\sqrt{Y}}

is said to be a t-random variable with n degrees of freedom. To compute its density function, we first derive the conditional density of T given that Y = y. Because Z and Y are independent, the conditional distribution of T given that Y = y is the distribution of \sqrt{n/y}\, Z, which is normal with mean 0 and variance n/y. Hence, the conditional density function of T given that Y = y is

f_{T|Y}(t \mid y) = \frac{1}{\sqrt{2\pi n / y}}\, e^{-t^2 y / 2n} = \frac{y^{1/2}}{\sqrt{2\pi n}}\, e^{-t^2 y / 2n} , \quad -\infty < t < \infty

Using the preceding, along with the following formula for the chi-squared density derived in Exercise 87 of Chapter 2,

f_Y(y) = \frac{e^{-y/2}\, y^{n/2 - 1}}{2^{n/2}\, \Gamma(n/2)} , \quad y > 0

we obtain the density function of T:

f_T(t) = \int_0^{\infty} f_{T,Y}(t, y)\, dy = \int_0^{\infty} f_{T|Y}(t \mid y)\, f_Y(y)\, dy

With

K = \sqrt{\pi n}\, 2^{(n+1)/2}\, \Gamma(n/2) , \qquad c = \frac{t^2 + n}{2n} = \frac{1}{2}\left(1 + \frac{t^2}{n}\right)

this yields

f_T(t) = \frac{1}{K} \int_0^{\infty} e^{-cy}\, y^{(n-1)/2}\, dy = \frac{c^{-(n+1)/2}}{K} \int_0^{\infty} e^{-x}\, x^{(n-1)/2}\, dx \quad (\text{by letting } x = cy) = \frac{c^{-(n+1)/2}}{K}\, \Gamma\!\left(\frac{n+1}{2}\right) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{\pi n}\, \Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} , \quad -\infty < t < \infty

Example 3.7

The joint density of X and Y is given by

f(x, y) = \begin{cases} \frac{1}{2}\, y\, e^{-xy}, & 0 < x < \infty,\ 0 < y < 2 \\ 0, & \text{otherwise} \end{cases}

What is E[e^{X/2} \mid Y = 1]?

Solution:  The conditional density of X, given that Y = 1 , is given by

f_{X|Y}(x \mid 1) = \frac{f(x, 1)}{f_Y(1)} = \frac{\frac{1}{2} e^{-x}}{\int_0^{\infty} \frac{1}{2} e^{-x}\, dx} = e^{-x}

Hence, by Proposition 2.1,

E\left[ e^{X/2} \mid Y = 1 \right] = \int_0^{\infty} e^{x/2}\, f_{X|Y}(x \mid 1)\, dx = \int_0^{\infty} e^{x/2}\, e^{-x}\, dx = \int_0^{\infty} e^{-x/2}\, dx = 2

Example 3.8

Let X 1 and X 2 be independent exponential random variables with rates μ 1 and μ 2 . Find the conditional density of X 1 given that X 1 + X 2 = t .

Solution:  To begin, let us first note that if f ( x , y ) is the joint density of X , Y , then the joint density of X and X + Y is

f_{X, X+Y}(x, t) = f(x, t - x)

which is easily seen by noting that the Jacobian of the transformation

g_1(x, y) = x , \qquad g_2(x, y) = x + y

is equal to 1.

Applying the preceding to our example yields

f_{X_1 | X_1 + X_2}(x \mid t) = \frac{f_{X_1, X_1 + X_2}(x, t)}{f_{X_1 + X_2}(t)} = \frac{\mu_1 e^{-\mu_1 x}\, \mu_2 e^{-\mu_2 (t - x)}}{f_{X_1 + X_2}(t)} = C e^{-(\mu_1 - \mu_2) x} , \quad 0 \le x \le t ,

where

C = \frac{\mu_1 \mu_2\, e^{-\mu_2 t}}{f_{X_1 + X_2}(t)}

Now, if μ 1 = μ 2 , then

f_{X_1 | X_1 + X_2}(x \mid t) = C , \quad 0 \le x \le t

yielding that C = 1 / t and that X 1 given X 1 + X 2 = t is uniformly distributed on ( 0 , t ) . On the other hand, if μ 1 μ 2 , then we use

1 = \int_0^t f_{X_1 | X_1 + X_2}(x \mid t)\, dx = \frac{C}{\mu_1 - \mu_2} \left(1 - e^{-(\mu_1 - \mu_2) t}\right)

to obtain

C = \frac{\mu_1 - \mu_2}{1 - e^{-(\mu_1 - \mu_2) t}}

thus yielding the result:

f_{X_1 | X_1 + X_2}(x \mid t) = \frac{(\mu_1 - \mu_2)\, e^{-(\mu_1 - \mu_2) x}}{1 - e^{-(\mu_1 - \mu_2) t}} , \quad 0 \le x \le t

An interesting byproduct of our analysis is that

f_{X_1 + X_2}(t) = \frac{\mu_1 \mu_2\, e^{-\mu_2 t}}{C} = \begin{cases} \mu^2 t\, e^{-\mu t} , & \text{if } \mu_1 = \mu_2 = \mu \\ \dfrac{\mu_1 \mu_2 \left(e^{-\mu_2 t} - e^{-\mu_1 t}\right)}{\mu_1 - \mu_2} , & \text{if } \mu_1 \ne \mu_2 \end{cases}
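The byproduct density can be checked directly against the convolution integral of the two exponential densities (the rates and t below are arbitrary illustrative values):

```python
# Check of f_{X1+X2}(t) against the convolution integral, mu1 != mu2 case.
import math

def trapz(g, a, b, n=20000):
    h = (b - a) / n
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))

mu1, mu2, t = 2.0, 0.5, 3.0
conv = trapz(lambda x: mu1 * math.exp(-mu1 * x) * mu2 * math.exp(-mu2 * (t - x)),
             0.0, t)
closed = mu1 * mu2 * (math.exp(-mu2 * t) - math.exp(-mu1 * t)) / (mu1 - mu2)
print(conv, closed)   # the two agree
```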

URL: https://www.sciencedirect.com/science/article/pii/B9780128143469000081

Meta-Analysis and Latent Variable Models for Binary Data

Jian Qing Shi , in Handbook of Latent Variable and Related Models, 2007

2.3 Implementation

The basic idea is to treat the latent variables β = (β_1, …, β_K) in (6) as missing data and use an EM algorithm. Let Z be the collection of the K vectors z_i; the full log-likelihood for θ given (Z, β) is

(10)  L(Z, \beta; \theta) = \sum_{i=1}^{K} \left[ \log\{p(z_i \mid \alpha_i, \beta_i)\} + \log\{\phi(\beta_i; \mu_\beta, \tau_\beta)\} \right] ,

where p(z_i | α_i, β_i) is given by (5) with (α, β) replaced by (α_i, β_i). The EM algorithm involves calculating the expectation of the above full log-likelihood conditional on the current estimate of θ and the observations Z in the E-step, and then updating the estimate of θ by maximising this conditional expectation in the M-step. The details are discussed in the rest of this section.

2.3.1 MCMC-EM: E-step

In the E-step of the (r + 1)th iteration, we need to calculate the following conditional expectation:

(11)  L(\theta \mid \theta^{(r)}) = E\left[ L(Z, \beta; \theta) \mid Z, \theta^{(r)} \right] = \sum_{i=1}^{K} \left\{ E\left[ \log p(z_i \mid \alpha_i, \beta_i) \mid Z, \theta^{(r)} \right] + E\left[ \log \phi(\beta_i; \mu_\beta, \tau_\beta) \mid Z, \theta^{(r)} \right] \right\} ,

where θ^(r) is the current estimate after the rth iteration. The expectation is calculated with respect to β. As there is no analytical form for Eq. (11), we use the MCMC-EM algorithm; see, for example, Wei and Tanner (1990) and Booth and Hobert (1999).

To do this, we generate A random vectors {β^a, a = 1, …, A} from the conditional distribution p(β | Z, θ^(r)) and approximate (11) by

(12)  L_A(\theta \mid \theta^{(r)}) = \frac{1}{A} \sum_{a=1}^{A} L(Z, \beta^a; \theta) .

We will discuss how to update θ by maximising this approximate log-likelihood in Section 2.3.2. First, we discuss how to generate the random vectors.

Given (Z, θ), the latent variables β_i are conditionally independent for i = 1, …, K. We can therefore generate a random number for each component individually. For each component, the conditional density function is

(13)  p(\beta_i \mid z_i, \theta^{(r)}) \propto p(z_i \mid \alpha_i, \beta_i)\, \phi(\beta_i; \mu_\beta, \tau_\beta) ,

where p(z_i | α_i^{(r)}, β_i) is given by (5). We can use the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) to generate a random variate β_i from this density function. Carlin and Louis (2000) discuss the details of the Metropolis-Hastings algorithm, and Shi and Copas (2004) give the details of how to define a transition density and calculate the acceptance probability for a similar problem.
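Such a sampler can be sketched as a random-walk Metropolis-Hastings step. Since Eq. (5) is not reproduced in this excerpt, the logistic-binomial likelihood, the data (z, n), and the values of α, μ_β, and τ_β below are all stand-in assumptions for illustration:

```python
# Random-walk Metropolis-Hastings sketch for drawing beta_i from Eq. (13).
# The likelihood and all parameter values here are stand-in assumptions.
import math, random

random.seed(0)
mu_beta, tau_beta = 0.0, 1.0     # prior mean and standard deviation of beta
alpha, z, n = 0.2, 7, 20         # stand-in data: z successes out of n trials

def log_target(beta):
    """Unnormalized log of p(z | alpha, beta) * phi(beta; mu_beta, tau_beta)."""
    p = 1.0 / (1.0 + math.exp(-(alpha + beta)))          # stand-in link
    log_lik = z * math.log(p) + (n - z) * math.log(1.0 - p)
    log_prior = -0.5 * ((beta - mu_beta) / tau_beta) ** 2
    return log_lik + log_prior

beta, draws = 0.0, []
for _ in range(5000):
    prop = beta + random.gauss(0.0, 0.5)                 # symmetric proposal
    log_accept = log_target(prop) - log_target(beta)
    if log_accept >= 0.0 or random.random() < math.exp(log_accept):
        beta = prop                                      # accept the move
    draws.append(beta)
# After discarding a burn-in period, the draws approximate p(beta_i | z_i, theta).
```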

2.3.2 MCMC-EM: M-step

In the M-step, we need to update θ by maximising the conditional likelihood (12). Since the unknown parameters (μ_β, τ_β) are involved only in the second term of the full log-likelihood function (10), the calculation of their maximum likelihood estimates is rather simple: we estimate (μ_β, τ_β) by maximising the objective function

\sum_{i=1}^{K} \log\{\phi(\beta_i; \mu_\beta, \tau_\beta)\} .

This is equivalent to having observed {β_1, …, β_K} from the normal distribution N(μ_β, τ_β) and calculating the maximum likelihood estimates of μ_β and τ_β. The estimates are simply the sample mean and the sample variance:

\bar{\beta} = \frac{1}{K} \sum_i \beta_i \quad \text{and} \quad V_\beta = \frac{1}{K} \sum_i (\beta_i - \bar{\beta})^2 .

Here, β̄ and V_β are sufficient statistics for the parameters (μ_β, τ_β). In the (r + 1)th iteration, the M-step updates μ_β and τ_β by the conditional expectations E(β̄ | Z, θ^(r)) and E(V_β | Z, θ^(r)). It is not possible to obtain an analytic form for these conditional expectations. We therefore use the random vectors {β^1, …, β^A} generated from the conditional distribution p(β | Z, θ^(r)) in the E-step and approximate the conditional expectations by their sample means:

( β ¯ 1 + ⋯ + β ¯ A ) / A and ( V β 1 + ⋯ + V β A ) / A ,

where β ¯ a and V β a are the sample mean and sample variance of β a = ( β 1 a , … , β K a ) , respectively.
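A sketch of this M-step update, assuming the A Monte Carlo samples are stored as the rows of an array (names are illustrative; since τβ is a standard deviation, the averaged variance is square-rooted at the end):

```python
import numpy as np

def update_mu_tau(beta_draws):
    # M-step update for (mu_beta, tau_beta) from MCMC draws.
    # beta_draws has shape (A, K): A Monte Carlo samples of the K latent betas.
    bar = beta_draws.mean(axis=1)                          # per-sample means  bar(beta)^a
    var = ((beta_draws - bar[:, None]) ** 2).mean(axis=1)  # per-sample variances V_beta^a
    mu_new = bar.mean()             # average of the A batch means
    tau_new = np.sqrt(var.mean())   # average the variances, then take the square root
    return mu_new, tau_new
```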

The parameter α i is updated by maximising L i = log { p ( z i | α i , β i ) } . There is no analytic solution, so we use Newton's method to approximate the estimate of α i in the M-step. To simplify the notation, we omit the index i here. We update α by the subiteration

α = α 0 − L ˙ ( α 0 ) / L ¨ ( α 0 ) ,

where α 0 is the current estimate of α, and L ˙ ( α 0 ) and L ¨ ( α 0 ) are the first two derivatives of L with respect to α for the ith component. Bear in mind that we actually need to maximise the conditional expectation of L = log { p ( z | α , β ) } over β. This can be approximated using the random numbers generated in the MCMC E-step:

L ¯ = 1/A Σ a = 1 A log { p ( z | α , β a ) } .

The Newton method works very effectively as it is a univariate problem.
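A sketch of this Newton subiteration for the binomial-logistic component likelihood, with the first two derivatives ( Σ j [ z j − m j p j ] and − Σ j m j p j ( 1 − p j ) , where p j is the logistic probability) averaged over the A draws to approximate the conditional expectation; names are illustrative:

```python
import numpy as np

def newton_alpha(alpha0, x, z, m, beta_draws, n_steps=20, tol=1e-8):
    # Newton subiteration alpha <- alpha - Ldot/Lddot, where the derivatives of
    # log p(z | alpha, beta) are averaged over the MCMC draws beta^1,...,beta^A.
    alpha = alpha0
    for _ in range(n_steps):
        eta = alpha + np.outer(beta_draws, x)               # shape (A, n)
        p = 1.0 / (1.0 + np.exp(-eta))
        ldot = np.mean(np.sum(z - m * p, axis=1))           # averaged first derivative
        lddot = np.mean(np.sum(-m * p * (1 - p), axis=1))   # averaged second derivative
        step = ldot / lddot
        alpha -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return alpha
```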

2.3.3 Average MCMC-EM algorithm and standard errors

As discussed in Shi and Copas (2002), the estimates given by the MCMC-EM algorithm converge to the true maximum likelihood estimates when A is sufficiently large. The Monte Carlo error is mainly determined by the sample size A, so reducing it requires a sufficiently large A (see, e.g., Booth and Hobert, 1999). An alternative is the average MCMC-EM algorithm proposed in Shi and Copas (2002): instead of increasing A, the average of the estimates collected in the iterations after 'burn-in' is used. The batch mean θ ¯ ( r ) is defined as the sample mean of { θ ( r − J + 1 ) , … , θ ( r ) } , where θ ( r ) is the estimate obtained in the rth iteration. The estimate θ ¯ ( r ) is roughly equivalent to the one calculated by the MCMC-EM algorithm with Monte Carlo sample size J A, but the former is more efficient and much easier to implement.
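The batch-mean averaging can be sketched in a few lines; `history` and `J` are illustrative names for the stored post-burn-in iterates and the batch size:

```python
import numpy as np

def batch_mean(history, J):
    # Average MCMC-EM estimate: the mean of the last J iterates
    # theta^(r-J+1), ..., theta^(r) stored in `history`.
    h = np.asarray(history[-J:])
    return h.mean(axis=0)
```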

The standard error of θ ˆ can be calculated quite easily for the MCMC-EM algorithm. The related observed information matrix can be calculated by the method of Louis (1982):

I ( θ ) = − E { L ¨ ( Z , β ) | Z } − E { L ˙ ( Z , β ) L ˙ T ( Z , β ) | Z } ,

where L ˙ and L ¨ are the first two derivatives of the full log-likelihood (10) with respect to θ. The expectation is taken with respect to β, and can be approximated using the random samples generated in the MCMC E-step. In each iteration, we calculate

I ( θ | β ) = − L ¨ ( Z , β ) − L ˙ ( Z , β ) L ˙ T ( Z , β ) ,

evaluated at β = β a for each sample β a . The observed information matrix I ( θ ) is then approximated by the sample mean of { I ( θ | β a ) , a = 1 , … , A } .

The final estimate of the information matrix can be calculated by the average 'batch mean' for average MCMC-EM algorithm. The variance of the estimates can be calculated from the inverse of the information matrix.

For the model discussed in this section, the estimates of the α i 's and of (μβ, τβ) are updated separately in the M-step, so their standard errors can also be calculated separately. The expressions for the first two derivatives related to α are (the index i is omitted to simplify the notation)

L ˙ ( α | β ) = Σ j = 0 n [ z j − m j exp ( α + β x j ) / ( 1 + exp ( α + β x j ) ) ] , L ¨ ( α | β ) = − Σ j = 0 n [ m j exp ( α + β x j ) / ( 1 + exp ( α + β x j ) ) 2 ] ,

with the related quantities evaluated for the ith component. The variance of α is therefore calculated by

[ 1/A Σ a = 1 A { − L ¨ ( α | β a ) − L ˙ 2 ( α | β a ) } ] − 1 .

The first two derivatives for (μβ, τβ) are given by

L ˙ ( μ β , τ β | β ) = ( K ( β ¯ − μ β ) / τ β 2 , − K / τ β + Σ i = 1 K ( β i − μ β ) 2 / τ β 3 ) T , L ¨ ( μ β , τ β | β ) = ( − K / τ β 2 , − 2 K ( β ¯ − μ β ) / τ β 3 ; − 2 K ( β ¯ − μ β ) / τ β 3 , K / τ β 2 − 3 Σ i = 1 K ( β i − μ β ) 2 / τ β 4 ) ,

where β ¯ is the average of {β1,…,β K }. The observed information matrix for (μβ, τβ) is calculated by

1/A Σ a = 1 A [ − L ¨ ( μ β , τ β | β a ) − L ˙ ( μ β , τ β | β a ) L ˙ T ( μ β , τ β | β a ) ] .

The final estimate of the information matrix can be approximated by the batch mean. The covariance matrix of (μβ, τβ) is the inverse of the information matrix.

URL: https://www.sciencedirect.com/science/article/pii/B978044452044950015X

Goodness-of-fit test for interest rate models: An approach based on empirical processes

Abelardo Monsalve-Cobis , ... Manuel Febrero-Bande , in Computational Statistics & Data Analysis, 2011

2.1 Estimation of the model

There exists an extensive literature on estimating the drift and volatility functions of Model (2); for detailed reviews of the various methods, see Prakasa-Rao (1999) and Iacus (2008). In practice, all continuous-time models are observed in discrete time. Therefore, to estimate the parameters of the continuous-time model (2), the discrete model (5) is considered. Given the Markovian nature of the process r t , which the discrete observations inherit, the likelihood function L n ( θ ) of the discrete model can be written as

(6) L n ( θ ) = ∏ i = 0 n − 1 p θ ( Δ , r t i + 1 | r t i ) p θ ( r t 0 ) ,

where p θ ( Δ , r t i + 1 | r t i ) denotes the transition density or conditional density function, and p θ ( r t 0 ) is the density of the initial value, with r t 0 = r 0 (initial condition). If the number of observations increases with time, the relative weight of p θ ( r t 0 ) in the likelihood function L n ( θ ) decreases, and it can then be assumed that p θ ( r t 0 ) = 1 . In the following we denote r 0 by r t 0 . Let

(7) ℓ n ( θ ) = log L n ( θ ) = Σ i = 0 n − 1 log p θ ( Δ , r t i + 1 | r t i ) + log p θ ( r t 0 ) = Σ i = 0 n − 1 ℓ i ( θ ) + log p θ ( r t 0 ) ,

be the logarithm of the likelihood function, where ℓ i ( θ ) = log p θ ( Δ , r t i + 1 | r t i ) , for i = 0 , … , n − 1 .

The maximum likelihood estimator (MLE) is θ ˆ = arg max θ ℓ n ( θ ) . If the parametric model that generates the observations { r t i } i is known, then the maximum likelihood method is the natural choice. However, except in particular cases (e.g., the Vasicek (VAS)–Ornstein–Uhlenbeck or CIR diffusion models), an explicit form of the transition density function is not available. This is the case with the CKLS model, among others.
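For one of the particular cases just mentioned, the Vasicek (Ornstein–Uhlenbeck) model dr = κ(θ − r)dt + σ dW, the transition density is Gaussian with known conditional mean and variance, so the log-likelihood (7) can be evaluated exactly. A minimal sketch under that model (illustrative function name; the initial density is set to 1 as above):

```python
import numpy as np

def vasicek_loglik(kappa, theta, sigma, r, dt):
    # Exact log-likelihood of Vasicek observations r[0..n] at spacing dt,
    # using the Gaussian transition density
    #   r_{t_{i+1}} | r_{t_i} ~ N(theta + (r_{t_i} - theta) e^{-kappa dt},
    #                             sigma^2 (1 - e^{-2 kappa dt}) / (2 kappa)).
    e = np.exp(-kappa * dt)
    mean = theta + (r[:-1] - theta) * e               # conditional mean
    var = sigma ** 2 * (1.0 - e ** 2) / (2.0 * kappa) # conditional variance
    resid = r[1:] - mean
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * resid ** 2 / var)
```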

A technique commonly used to estimate Model (2) is to proceed as if the observations came from a Gaussian distribution, with mean equal to the drift of the model and standard deviation equal to the volatility function, and then to obtain the maximum likelihood estimator. This method is efficient if the discretisation step Δ is sufficiently small (see Fan and Zhang, 2003), but the estimation can be significantly biased when Δ is large. To reduce the estimation bias, Aït-Sahalia (1999) suggests approximating the transition density function using Hermite polynomials. By applying this approximation, the author obtains a "root- n "-consistent maximum likelihood estimator for a diffusion process (see Aït-Sahalia, 2002).
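The Gaussian pseudo-likelihood idea can be sketched for, e.g., the CKLS specification dr = (a + b r)dt + σ r^γ dW: the Euler discretisation gives conditional mean r i + (a + b r i )Δ and conditional standard deviation σ r i ^γ √Δ. This is the small-Δ approximation described above, not the Hermite expansion of Aït-Sahalia; names are illustrative.

```python
import numpy as np

def euler_pseudo_loglik(params, r, dt):
    # Gaussian (Euler) pseudo log-likelihood for the CKLS model
    #   dr = (a + b r) dt + sigma r^gamma dW.
    # Valid only as an approximation for small dt.
    a, b, sigma, gamma = params
    mean = r[:-1] + (a + b * r[:-1]) * dt        # drift-implied conditional mean
    sd = sigma * r[:-1] ** gamma * np.sqrt(dt)   # volatility-implied conditional sd
    resid = r[1:] - mean
    return np.sum(-0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * (resid / sd) ** 2)
```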

This approximation is crucial to our goodness-of-fit method, and in what follows the maximum likelihood method suggested by Aït-Sahalia (2002) will be applied. The implementation of this estimator is reasonably straightforward and meets the requirements needed to develop the proposal in the next section.

URL: https://www.sciencedirect.com/science/article/pii/S0167947311002052


Source: https://www.sciencedirect.com/topics/mathematics/conditional-density-function
