Mastodawn

Zhian N. Kamvar Sep 12, 2024

Fun fact for #RStats: as of last month, it's been 10 years since @hadleywickham's "Tidy Data" paper was published in #JStatSoft

https://www.jstatsoft.org/article/view/v059i10

Tidy Data by Hadley Wickham

A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.
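
To make the "each variable is a column, each observation is a row" rule concrete, here is a minimal sketch in plain Python. It is not code from the paper: the table, its values, and the helper name `tidy` are hypothetical and chosen only for illustration. The messy table stores one variable (the treatment) in its column headers; tidying moves it into a column of its own.

```python
# Hypothetical "messy" table: the treatment variable is hidden in the column names.
messy = [
    {"name": "person_1", "treatment_a": 4, "treatment_b": 7},
    {"name": "person_2", "treatment_a": 9, "treatment_b": 2},
    {"name": "person_3", "treatment_a": 5, "treatment_b": 1},
]


def tidy(rows, id_col, value_cols, var_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    out = []
    for row in rows:
        for col in value_cols:
            out.append({id_col: row[id_col], var_name: col, value_name: row[col]})
    return out


# Each resulting row is one observation: (name, treatment, result).
tidy_rows = tidy(
    messy,
    id_col="name",
    value_cols=["treatment_a", "treatment_b"],
    var_name="treatment",
    value_name="result",
)

for row in tidy_rows:
    print(row)
```

In the tidy form, filtering by treatment or grouping by person needs no knowledge of how many treatment columns the original table had.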

Achim Zeileis Apr 6, 2023

New in #jstatsoft: #rstats pkg intRinsic by Francesco Denti

Model-based estimation of the intrinsic dimension of a dataset

https://doi.org/10.18637/jss.v106.i09

intRinsic: An R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset by Francesco Denti

This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.
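
The idea behind the two nearest neighbors estimator can be sketched in a few lines of plain Python. This is not the intRinsic API (the package is written in R); the function name `two_nn_mle` and the simulated data are assumptions made for illustration. The sketch relies on the standard result that, under the model, the ratio of each point's second to first nearest-neighbor distance follows a Pareto distribution whose shape parameter is the intrinsic dimension, so the maximum likelihood estimate is n divided by the sum of the log ratios.

```python
import math
import random


def two_nn_mle(points):
    """Maximum likelihood estimate of the intrinsic dimension from the
    ratios of second to first nearest-neighbor distances (brute force)."""
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf  # first and second nearest-neighbor distances
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        log_ratios.append(math.log(r2 / r1))
    return len(points) / sum(log_ratios)


# Simulated data (an assumption for this sketch): a 2-dimensional plane
# embedded in 3-dimensional space, so the intrinsic dimension is 2.
random.seed(0)
points = []
for _ in range(400):
    u, v = random.random(), random.random()
    points.append((u, v, u + v))

estimate = two_nn_mle(points)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

The estimate should land near 2 even though every point has three coordinates, which is the sense in which the intrinsic dimension differs from the number of columns.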

Achim Zeileis Mar 23, 2023

Started publishing volume 106 of #jstatsoft #rstats #glmnet

https://doi.org/10.18637/jss.v106.i01

Elastic Net Regularization Paths for All Generalized Linear Models by J. Kenneth Tay, Balasubramanian Narasimhan, Trevor Hastie

The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
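
As a rough illustration of what a "regularization path" is, the sketch below fits the elastic net for least squares by cyclic coordinate descent over a decreasing sequence of penalties, reusing each solution as the starting point for the next. It is not glmnet's implementation and omits the optimizations a production solver would use; it covers only the Gaussian case without an intercept, and the function names and simulated data are assumptions. The objective is (1/2n)·RSS + λ[α‖β‖₁ + (1−α)/2·‖β‖₂²].

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_path(X, y, lambdas, alpha=0.5, n_sweeps=100):
    """Coefficients for each lambda, computed with warm starts."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    path = []
    for lam in lambdas:  # lambdas are assumed to be in decreasing order
        for _ in range(n_sweeps):
            for j in range(p):
                # Average product of feature j with the partial residual
                # (the residual computed without feature j's contribution).
                z = sum(
                    X[i][j]
                    * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                    for i in range(n)
                ) / n
                beta[j] = soft_threshold(z, lam * alpha) / (
                    col_sq[j] + lam * (1 - alpha)
                )
        path.append(list(beta))
    return path


# Simulated data (an assumption): only the first two features matter.
random.seed(1)
n, p, alpha = 50, 4, 0.5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * row[0] - row[1] + random.gauss(0, 0.1) for row in X]

# Smallest penalty at which every coefficient is zero.
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n)) / n) for j in range(p)) / alpha
lambdas = [lam_max * 0.5 ** k for k in range(5)]

path = elastic_net_path(X, y, lambdas, alpha=alpha)
for lam, beta in zip(lambdas, path):
    print(f"lambda={lam:.4f}", [round(b, 3) for b in beta])
```

Reading the printed rows from top to bottom shows coefficients entering the model as the penalty shrinks, which is the path the abstract refers to.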

Achim Zeileis Nov 14, 2022

@fabitmart @rstats @latex @vincentab In addition to the nice package web page, there is also an introductory paper, published in the Journal of Statistical Software #jstatsoft

https://doi.org/10.18637/jss.v103.i01

modelsummary: Data and Model Summaries in R by Vincent Arel-Bundock

modelsummary is a package to summarize data and statistical models in R. It supports over one hundred types of models out-of-the-box, and allows users to report the results of those models side-by-side in a table, or in coefficient plots. It makes it easy to execute common tasks such as computing robust standard errors, adding significance stars, and manipulating coefficient and model labels. Beyond model summaries, the package also includes a suite of tools to produce highly flexible data summary tables, such as dataset overviews, correlation matrices, (multi-level) cross-tabulations, and balance tables (also known as

Achim Zeileis Nov 2, 2022

Hi Mastodon! Day 1 of #movember for me.

Short bio:
- Professor of #statistics, Uni Innsbruck
- Co-Editor-in-Chief, Journal of Statistical Software #jstatsoft
- Software developer, especially in #rstats (zoo, sandwich, party #partykit, #colorspace, R/exams #rexams, ...)