Mit ‘R’ getaggte Artikel

Small timer for R

Dienstag, 09. Dezember 2008

My coworker told me that pie charts never make sense. To prove him wrong I proudly present a fantastic R function that reminds you to spend not too much time on coffee drinking, rss feed reading or whatever else is stealing your time:


GeSHi Error: GeSHi could not find the language r (using path /www/htdocs/w006f92c/blog/wp-content/plugins/codecolorer/lib/geshi/) (code 2)

So timer(someNumber) will result in the following graphical output and the red part will take approximate someNumber seconds to fill the whole pie:

Animated Timer

You are still reading? Hmmm… then here also is the code that produced the animated gif above:


GeSHi Error: GeSHi could not find the language r (using path /www/htdocs/w006f92c/blog/wp-content/plugins/codecolorer/lib/geshi/) (code 2)

Execute the code above in R. Then open a shell and produce an animated gif out of the one hundred png files with ImageMagick:

convert -delay 20 -loop 0 timer*.png animatedtimer.gif

P.S.: My wife tells me that she likes pie charts because of the pie… :-)

Vortrag in Dortmund

Montag, 18. August 2008

Wir haben in der letzten Woche auf der

useR

in Dortmund einen Vortrag gehalten. Einen Vortrag an den man sich erinnern wird… – vielleicht nicht wegen dem Inhalt, sondern weil sich Bernds vorbereiteter Computer im Talk vor uns ausgeschaltet hat. Von unseren 15 Minuten Vortragszeit warteten wir 10 Minuten darauf, dass der Computer startete. Emule, Antivir, Windows Sicherheitscenter, diverse Mediacenter und andere Programme mussten sich erst starten und mit Informationen versorgen, bevor wir endlich mit Folien weitermachen konnten.

Trotz allem haben wir zu drei Leuten neuen, vielversprechenden Kontakt bekommen und inhaltlich war der Vortrag trotz allem gut. An uns ran kamen von der Präsentation sonst nur zwei Vorträge: In dem einen wurde nach 3 Minuten Einleitung einfach ein Präsentationsvideo abgespielt und im anderen Vortrag hatte der Vortragende sich so erkältet, dass er die Folien ohne Kommentar durchging.

Btw.: Wo kann man in Dortmund sinnvoll übernachten? Ich habe 70€/Nacht für ein Hotel mit dreckigen Handtüchern und ohne WLAN bezahlt! (Wenn ich es selber zahlen müsste, hätte ich sonst in der Uni übernachtet – Ich konnte nicht bei meinem Kollegen übernachten, da wegen Wasserschäden schon seine Freundin notdürftig bei ihm untergekommen war…) Das Hostel in München für 20€/Nacht war in jeder Hinsicht (nagut, es gab kein Frühstück) um Meilen angenehmer…

Projekt vorbei

Montag, 14. Juli 2008

Das große Projekt bei uns am Institut ist nun vorbei.

Unser GUI

Vorstellen werden wir das Framework des GUI-Teilprojektes unter dem Titel “Towards a Java Framework for Rapid Development of Graphical User Interfaces for Statistical Applications based on R” auf der UseR im August.

Jemand Interesse?

Die Hilfe

Kein Spaß zu Googlen

Sonntag, 13. April 2008

Es gibt Namen, die sind schwer zu googlen. Wer zum Beispiel die Sprache R mit .net benutzen möchte, hat es nicht einfach, Informationen zu finden. Hier mal Googles Ergebnisse zu: R und .net.

Viel Werbung und Suche könnte man sich sparen, wenn man andere Namen gewählt hätte… ;-)

Btw.: Ob es eine weitere einfache Möglichkeit außer über DCOM gibt, konnte ich bisher nicht herausfinden…

Random Correlation Matrices

Dienstag, 19. Februar 2008

One of my colleagues has implemented a new method for evaluating the cumulative distribution function of a multivariate normal distribution and wanted to compare the speed of his method with that of randomized quasi-Monte Carlo methods. A few days ago, while we were going to lunch, he asked me how to generate random correlation matrices, because the speed of his method strongly depends on the correlation matrix and he wanted to have some sort of average.

But what is a random correlation matrix?

Let's first give a characterization of correlation matrices.

It is well known that for a matrix $C:=(c_{i,j})_{1\leq i,j\leq n}\in\ensuremath{\mathbb{R}}\xspace ^{n\times n}$ there exist (multivariate normal distributed) random variables $X,Y$ with


if and only if

  1. $-1\leq c_{i,j}\leq 1$ for all $i,j\in\{0,\ldots,n\}$,
  2. $c_{i,i}= 1$ for all $i\in\{0,\ldots,n\}$,
  3. $C$ is symmetric (therefore all eigenvalues $\lambda_1,\ldots,\lambda_n$ of $C$ are real)
  4. and all eigenvalues of $C$ are greater or equal to zero.

But what is the right notion of randomness for these matrices? For example let's look at the orthogonal matrices. In many numerical applications we need uniformly distributed random orthogonal matrices in terms of the Haar measure.

Unfortunately in our case there is no clear, natural notion of randomness. :-(

Method 1 - Try and Error: We generate a matrix fulfilling no. 1, 2 and 3 of the characterization (these matrices are called pseudo correlation matrices) by generating independent pseudo-random numbers uniformly distributed between -1 and 1 for the entries $c_{i,j}=c_{j,i}$, $1\leq i<j\leq n$.

If this random symmetric matrix is positive semidefinite (i.e. all eigenvalues of $C$ are greater or equal to zero), we have the desired result. Otherwise we try again. Here is the corresponding R code:

random.pseudo.correlation.matrix <-
function(n) {
  a <- diag(n)
  for(i in 1:(n-1)) {
    for(j in (i+1):n) {
      a[i,j] <- a[j,i] <- runif(1,-1,1)
    }
  }
  return(a)
}

random.correlation.matrix.try.and.error <-
function(n) {
  repeat {
    a <- random.pseudo.correlation.matrix(n)
    if (min(eigen(a)$values)>=0) return(a)
  }
}

This approach is only reasonable for very small dimensions (try it with $n=6,7,8$).

Method 2 - Lift the Diagonal:

We denote by $I$ the identity matrix. If $C$ has the eigenvalues $\lambda_1\leq\ldots\leq\lambda_n$ then $(C+a\cdot I)$ has the eigenvalues $\lambda_1+a\leq\ldots\leq\lambda_n+a$ since $x$ is a solution of $\det(C-x\cdot I)=0$ if and only if $x+a$ is a solution of $\det(C+a\cdot I-x\cdot I)=\det(C-(x-a)\cdot I)=0$.

So we start again with a pseudo correlation matrix $C$, but instead of retrying if $C$ has negative eigen values, we lift the diagonal by $\lambda_1$ and obtain $C+\lambda_1\cdot I$, which is always positive semidefinite. After dividing by $1+\lambda_1$ we have a correlation matrix which is ``some kind of random''. ;-)

Unfortunately the diagonal is accentuated and the smallest eigen value is always zero. The second problem we could avoid by adding $\lambda_1+b$ where $b$ is some random number, but the first remains.

make.positive.semi.definite <-
function(a, offset=0) {
  (a + (diag(dim(a)[1]) * (abs(min(eigen(a)$values))+offset))) /
    (1+(abs(min(eigen(a)$values))+offset))
}

random.correlation.matrix.lift.diagonal <-
function(n, offset=0) {
  a <- random.pseudo.correlation.matrix(n)
  make.positive.semi.definite(offset)
}

Method 3 - Gramian matrix - my favorite: Holmes [Holmes1991] discusses two principal methods for generating random correlation matrices. One of them is to generate $n$ independent pseudo-random vectors $t_1,\ldots,t_n$ distributed uniformly on the unit sphere $S^{n-1}$ in $\ensuremath{\mathbb{R}}\xspace ^n$ and to use the Gram matrix $T^tT$, where $T:=(t_1,\ldots,t_n)$ has $t_i$ as $i$-th column and $T^t$ is the transpose of $T$.

To create the $t_i$ in R, we load the package mvtnorm, generate $\tau_i\sim\mathcal{N}(0,I)$ and set $t_i:=\tau_i/\vert\vert\tau_i\vert\vert$:

random.correlation.matrix.sphere <-
function(n) {
  require("mvtnorm")
  t <- rmvnorm(n,rep(0,n),diag(n))
  for (i in 1:n) {
    t[i,] <- t[i,]/sqrt(t(t[i,])%*%t[i,])
  }
  t%*%t(t)
}

Conclusion: There are futher methods (like e.g. to generate a random spectrum and then construct the correlation matrix), which are not as easy to implement as the ones described above. But just as the three given methods, they all are unsatisfactory in some way because we don't really know how random correlation matrices should be distributed.

For my colleague an average of calculation time does only make sense when he knows which kind of correlation matrices occurs in the applications. He decided to describe and compare the different cases individually.

But does it perhaps make sense to use random correlation matrices as test cases or are the special cases more important? For example, random correlation matrices generated with method 1 and 3 are only singular with probability zero.

Any critique, comments, suggestions or questions are welcome!

And for the next time: Given a correlation matrix C. How do we generate tuples of pseudo-random numbers following a given multivariate distribution with correlation matrix C?

Bibliography

Holmes1991
Holmes, R. B. 1991.
On random correlation matrices.
Siam J. Matrix Anal. Appl., Vol. 12 No. 2: 239-272.

Solve double integrals with R numerically

Mittwoch, 06. Februar 2008

The goal of this post is to show how to evaluate in R terms like

\begin{displaymath }\int_a^b\int_c^d f(x,y)\; dx\; dy \end{displaymath}

with
\begin{displaymath}a,b,c,d\in[\infty,\infty]=:\overline{\ensuremath{\mathbb{R}}\...mathbb{R}}\xspace ^2\rightarrow\ensuremath{\mathbb{R}}\xspace .\end{displaymath}

The R function “integrate”:
We use the R function integrate and cite some arguments from its help:
integrate(f, lower, upper, …)

f: an R function taking a numeric first argument and
returning a numeric vector of the same length.
Returning a non-finite element will generate an error.

lower, upper: the limits of integration. Can be infinite.

…: additional arguments to be passed to ‘f’.


There are further arguments, but see ?integrate for the complete and correct signature of this function.

Examples:

\begin{displaymath}\int_0^3x^2\;dx = 9\end{displaymath}

> integrate(function(x){x^2},0,3)
9 with absolute error < 1e-13

\begin{displaymath}\int_0^{2\pi}\sin(x)\;dx = 0\end{displaymath}

> integrate(sin,0,2*pi)
2.032977e-16 with absolute error < 4.4e-14


\begin{displaymath}\int_1^{\infty}\frac{1}{x^2}\;dx = 1\end{displaymath}

> integrate(function(x){1/x^2},1,Inf)
1 with absolute error < 1.1e-14

Double integral:
To evaluate
\begin{displaymath}\int_a^b\int_c^d f(x,y)\; dx\; dy\end{displaymath}

we define:
\begin{displaymath}g:\;y \mapsto \int_c^d f(x,y)\; dx\end{displaymath}

g <- function(y) {integrate(function(x) {f(x,y)},c,d)$value}

and then integrate over g:

\begin{displaymath}\int_a^b\int_c^d f(x,y)\; dx\; dy=\int_a^b g(y)\; dy\end{displaymath}
integrate(Vectorize(g),a,b)$value

Example:
$f(x,y)=x^2\cdot y,\;a=c=0,\;b=d=1$,
\begin{displaymath}\int_0^1\int_0^1 x^2\cdot y\; dx\; dy=...\end{displaymath}

R-Code:
> f <- function(x,y) { x^2*y }
> g <- (function(y) {integrate(function(x)
{f(x,y)},0,1)$value})
> integrate(Vectorize(g),0,1)
0.1666667 with absolute error < 1.9e-15

Warning
Not all functions are suited: “Like all numerical integration routines, these evaluate the function on a finite set of points. If the function is approximately constant (in particular, zero) over nearly all its range it is possible that the result and error estimate may be seriously wrong” (from the integrate help – see ?integrate).

To demonstrate this, we integrate over the indicator function of ]1,3[:

plot

> f<-function(x){ifelse(x<1,0,ifelse(x<3,1,0))}
> integrate(f,-20,20)
2 with absolute error < 2.2e-15
> integrate(f,-30,30)
0 with absolute error < 0

On the other hand the last integral

\begin{displaymath}\int_0^\infty\left(\int_{-\infty}^\infty\binom{m}{j}[\Phi(b...\d t\right)f_\nu(s)\d s\end{displaymath}

we were interested in was no problem to evaluate… ;-)

Since LaTeX2HTML2Wordpress is a stressful conversion, I defer the theoretical background indefinitely.