structure.tex

\documentclass[a4paper,10pt]{article}

\usepackage{ucs}
\usepackage[utf8]{inputenc}
%\usepackage{babel}
\usepackage{fontenc}
\usepackage{graphicx}
\usepackage{rotating}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}


\usepackage[dvips]{hyperref}

\date{06/05/2014}

\begin{document}
 
\title{gmGeostats: an R package for massive, parallelized geostatistical 
analysis}

\maketitle

\section*{Content}

\paragraph{abstract}
A multivariate geostatistics R package is desired, that is able to deal 
with large data sets, massive interpolation/simulation grids 
and flexible variogram/spatial dependence structures, taking 
all the profit possible of parallelization capabilities. 
Apart of containing state of the art methods, it should be the framework
to develop, test and disseminate our own geostatistical methods.
It will be highly compatible/interfaced with our other packages (systemstats, 
compositions, gmDatabase and gmGeometallurgy), and specially adapted
to the needs of modern and future geometallurgical analysis,
regional geochemistry, and other main research topics at HIF.
Eventually, should have graphical interfaces with QGIS, Rcmdr/RKWard, 
self-standing GUI 
(as an extra package?)


\begin{sidewaystable}
\begin{tabular}{lllll}
\hline
 model &  feature & structural tool & interpolation & simulation \\
\hline
Gaussian & continuous & covariance & Simple coKriging  & Turning Bands (TB)   \\
         &            & covariogram & Ordinary coKriging & TB + ApproxChol  \\
         &    & residual covariance & Universal coKriging  & TB + several  \\
        &   & MAF and rank-&   & \\
        &   & defficient covs & ScoK, OcoK, UcoK  & TB \\
\hline
transGaussian & continuous & covariance & ScoK & Turning Bands (TB)   \\
\hline
transition probs & discrete & transiograms &   & sequential indicator \\
indicator    &   & indicator covariance & IScoK & \, cosimulation \\
  & & & & plurigaussian sim. \\
\hline
multipoint  & discrete  & training image & E-type from & MPS \\  
        & any & unsaturated GLM & \, simulations & MPS \\
\hline
\end{tabular}
\end{sidewaystable}

\section{Data input, structure and spatial manipulation}

Typical geostatistical software manage data sets with 2D and 3D spatial 
coordinates. In most of their applications, one data set is used for all steps 
of the geostatistical analysis. Most of the times, each variable is considered 
separated, even though some of them can be observed at the same locations. Two 
exceptions are notable. The first is MPS, where a training image must also be 
provided, i.e. as a sort of ``training set''. The second is collocated cokriging 
(a misnome), where a secondary variable densely sampled is used to interpolate a 
less measured one. 

On the other hand, recent work of this group has shown how important is to 
consider that it is not always appropriate to consider the data as cartesian 
products of separate, univariate variables. Some sets of variables form natural 
groups, that should be analysed and interpreted together (parts in a 
composition, mass fractions in a particle size distribution, disjunctive 
indicators of facies, dip and strike of a direction, summaries of MLA images, 
etc). These sets of jointly defined variables are called here \emph{layers}.

In this framework, we consider necessary to manage data of the following 
characteristics:
\begin{itemize}
 \item a project might contain several layers of information, and should be 
georeferenced (1D to 4D geographic space: borehole, plane, volume, 
volume$\times$ time);
 \item each \textbf{layer} can contain any number of collocated variables 
forming a system (a composition, a direction, a particle size distribution, a 
facies, etc): though missings should be allowed, in principle a layer should be 
(almost) fully observed; 
 \item each layer might indicate the need of a \textbf{scaling function} 
(univariate or multi\-variate) and give its direct and inverse transformation 
function (analytical function, like an alr/ilr; Hermite polynomials for Gaussian 
anamorphosis; flow specification for multivariate Gaussian anamorphosis; 
isofactorial model?)
 \item it is necessary to distinguish between ``\textbf{conditioning set}'' (the 
data set that will provide the conditioning points in interpolation/simulation) 
and the ``\textbf{training set}'' (the data set that will provide the spatial 
structure information, e.g. the training image in MPS); note that, up to 
irruption of MPS, most geostatistical practice did not distinguish them;
 \item training sets might have relative geographic coordinates, but the scale 
and projection should be compatible with that of the data;
 \item for the moment, the range of foreseen applications suggest that it should 
be possible that each layer has its training set, or that each project has a 
unique multi-layer training set.
\end{itemize}

Further aspects of spatial data storage and manipulation that are often 
required, typical of GIS and similar software, are:
\begin{description}
 \item[Domaining] the geographic space can be partitioned in disjoint domains; 
contacts between domains might be ``soft'' (information can go throught them, 
data in both domains might contribute to interpolations-simulations) or ``hard'' 
(information should not go through the contact, interpolations-simulations 
should only use data on one side of the contact). Domains are usually large 
bodies, not necessarily continuous. These might be identified in several 
alternative ways: a wireframe surface (vector paradigm) or an facies-like 
indicator (raster paradigm).
 \item[Blocking] the geographic space can be partitioned in disjoint blocks for 
which some averaged quantity is desired. These are mostly small, regular, 
continuous bodies, hence more easily identified as the interior of a grid. 
Blocks must often be discretized in a moderate number of internal 
points-segments-pixels-voxels. 
\end{description}
Other spatial manipulations: rotation, erosion, dillation, global geometric 
anistotropy compensation.


\section{Structural tool estimation}

Up to now, most geostatistical sofwtare can manage only one type of structural 
tool: (cross-)variograms. MPS software is, of course, prepared to manage 
training images. However, the requirements on the kinds of data to manipulate 
and the new methods that we want to implement set these as too limiting. The 
following kinds of pairs [training data/structural function] should be 
available.

\begin{itemize}
 \item the data used should be the training data;
 \item all layers enter the calculations after being applied the scaling 
function;
 \item it should be possible to use a domaining to restrict the pairs of 
locations $(x_i,x_j)$ to enter the calculations only if they are in the same 
domain, or restrict the calculations to one single subdomain.
\end{itemize}


\paragraph{[real data/scalar-variogram]} The basic tool of classical two-point 
geostats. The user should be able to select which variable treat $Z$, define a 
polar grid of lag distances $h$ to search for pairs of data, and estimate the 
variogram
$$
\hat{\gamma}_Z(h) = \frac{1}{2N(h)} \sum_{i,j}^{N(h)} (z(x_i)-z(x_j))^2.
$$


\paragraph{[real data/scalar-covariance]} Also a common tool of classical 
two-point geostats. The user should be able to select which variable treat $Z$, 
give its mean value $\mu$, define a polar grid of lag distances $h$ to search 
for pairs of data, and estimate the covariance function
$$
\hat{C}_Z(h) = \frac{1}{N(h)} \sum_{i}^{N(h)} (z(x_i)-\mu)^2.
$$


\paragraph{[real vector data/matrix-variogram]} The basic tool of multivariate 
two-point geostats. The user should be able to select which layer(s) to treat, 
jointly in a column-vector $\mathbf{Z}$, define a polar grid of lag distances 
$h$ to search for pairs of data, and estimate the variogram
$$
\hat{\boldsymbol{\gamma}}_{\bf Z}(h) = \frac{1}{2N(h)} \sum_{i,j}^{N(h)} 
[\mathbf{z}(x_i)-\mathbf{z}(x_j)]\cdot [\mathbf{z}(x_i)-\mathbf{z}(x_j)]^t;
$$
actual calculations might be better done by individual coordinates, to allow for 
some missing values in the vectors.

\paragraph{[compositional data/variation-variogram]} The proposed tool of 
compositional two-point geostats. The user should be able to select a 
compositioanl layer to treat, jointly in a column-vector $\mathbf{Z}$, define a 
polar grid of lag distances $h$ to search for pairs of data, and estimate the 
variogram
$$
\hat{\boldsymbol{\tau}}_{\bf Z}(h) = [\hat{\tau}_{kl}(h)], \quad \quad
\hat{\tau}_{kl}(h)=\frac{1}{2N_{kl}} \sum_{i,j}^{N_{kl}} \left(\ln 
\frac{z_k(x_i)}{z_l(x_i)}-\ln \frac{z_k(x_j)}{z_l(x_j)}\right)^2.
$$


\paragraph{[categorical data/indicator-variogram]} The classical tool of 
indicator two-point geostats. The user should be able to select which 
categorical variable treat $Z$, define a polar grid of lag distances $h$ to 
search for pairs of data, and estimate the variogram
$$
\hat{\boldsymbol{\gamma}}_{Z}(h) = [\hat{\gamma}_{kl}(h)], \quad \quad
\hat{\gamma}_{kl}(h)= \frac{1}{2N_{kl}(h)} \sum_{i,j}^{N_{kl}} 
(J_k(x_i)-J_l(x_j))^2;
$$
where the disjunctive indicators show whether we are in one category $K_k$ or 
not:
$$
J_k(x_i) = \begin{cases}
 1 & Z(x_i)=K_k  \\
 0 & otherwise
           \end{cases}.
$$

\paragraph{[categorical data/transiogram]}

\paragraph{[categorical training image/MPS search list]}

\paragraph{[multilayer training image/MPS search list]}  ???


\section{Structural tool fitting/modelling}

At the step of fitting/modelling, the following pairs of [structural 
function/model] should be offered:

\paragraph{[scalar-variogram or covariance/linear model of regionalization]} The 
typical combination of valid simple models, 
$$
\gamma(h|\boldsymbol{\theta}) = \sum_{a=1}^A c_a \cdot 
\rho_a(h|\boldsymbol{\theta}_a)
$$
with those unitary variograms $\rho_a(h|\boldsymbol{\theta}_a)$ included in 
table XXXX. The parameter vector $\boldsymbol{\theta}_a$ will include range and 
anisotropy, and can 
include other accessory scalars (periodicity and smoothness). Up to our present 
knowledge, periodicity and range must share the same anisotropy structure, and 
smoothness cannot be anisotropic. 

\paragraph{[matrix-variogram or covariance/linear model of coregionalization]} 
Also valid for compositions and indicators, as in the preceding case, it is a 
combination of 
valid simple models, 
$$
\gamma(h|\boldsymbol{\theta}) = \sum_{a=1}^A \mathbf{C}_a \cdot 
\rho_a(h|\boldsymbol{\theta}_a),
$$
where each sill matrix $\mathbf{C}_a$ can be:
\begin{itemize}
 \item a covariance matrix (full-rank or rank-deficient positive definite 
matrix) for real vector-valued layers
 \item a variation matrix for compositional layers (i.e. for 
$\boldsymbol{\tau}_{\bf Z}(h)$), possibly rank-deficient as well
 \item a covariance matrix with at least one singular eigenvalue, for indicators
\end{itemize}
Individual unitary variograms $\rho_a(h|\boldsymbol{\theta}_a)$ satisfy the same 
conditions as for the linear model of regionalization.

\paragraph{[directional-covariance/complex regionalization]} Wackernagel book, 
paper from Sandra de Iaco

\paragraph{[transiogram/Markov transition matrix]} Thesis Enayat

\paragraph{[categorical MPS search list/saturated model]} 

\paragraph{[MPS search list/unsaturated JointSim model]} paper Strategic Mine 
Planning


\section{Interpolation}

\paragraph{Methods included} the co/kriging following methods should be 
included:
\begin{itemize}
 \item simple (co)kriging
 \item ordinary (co)kriging
 \item universal (co)kriging
 \item (co)kriging with a trend
\end{itemize}


\paragraph{Compositional particularities} for a compositional layer, one should 
be able to manipulate missing values 


\paragraph{Interpolation model specification}


\paragraph{How to specify the interpolation grid}


\section{simulation}


\end{document}