## Center of Statistical Research Lecture Series, No. 3: New Advances in Big Data Analysis Methods

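Topic: New Advances in Big Data Analysis Methods

Host: Professor Huazhen Lin (林华珍)

Time: Thursday, June 1, 2017, 2:30-5:30 p.m.

Venue: Conference Room 402B, Hongyuan Building (弘远楼)

Organizers: Center of Statistical Research, School of Statistics, Office of Scientific Research

Speaker 1: Associate Professor Tao Yu (余涛), National University of Singapore (2:30-3:30 p.m.)

Speaker biography: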

Dr. Tao Yu received his B.S. degree in Mathematics and his M.S. degree in Probability & Statistics from Nankai University in 2001 and 2004, respectively. He obtained his Ph.D. degree from the University of Wisconsin-Madison in 2009. He was an assistant professor in the Department of Statistics and Applied Probability (DSAP) at the National University of Singapore (NUS) from September 2009 to December 2016, and is now an associate professor in DSAP at NUS.

His research interests are statistical modelling of brain imaging data, theory and application of semi- and non-parametric likelihood methods, shape-constrained inference in non-parametric models, statistical modelling of high-throughput gene data, and density estimation in multiple-sample data.

He has published about five papers in top international statistical journals, such as Biometrika, Journal of the American Statistical Association, The Annals of Applied Statistics, and The Annals of Statistics.

Title: Using a monotone single-index model to stabilize the propensity score in missing data problems and causal inference

Abstract:

The augmented inverse weighting method is one of the most popular methods for estimating the mean of the response in causal inference and missing data problems. An important component of this method is the propensity score. Popular parametric models for the propensity score include the logistic, probit, and complementary log-log models. A common feature of these models is that the propensity score is a monotonic function of a linear combination of the explanatory variables. To avoid the need to choose a model, we model the propensity score via a semiparametric single-index model, in which the score is an unknown monotonic nondecreasing function of the given single index. Under this new model, the augmented inverse weighting estimator of the mean of the response is asymptotically linear, semiparametrically efficient, and more robust than existing estimators. Moreover, we have made a surprising observation. The inverse probability weighting and augmented inverse weighting estimators based on a correctly specified parametric model may have worse performance than their counterparts based on a nonparametric model. A heuristic explanation of this phenomenon is provided. A real-data example is used to illustrate the proposed methods.
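
As a rough illustration of the two estimators the abstract compares, the sketch below computes the inverse probability weighting (IPW) and augmented inverse weighting estimates of a response mean when some responses are missing. It is not the speaker's method: the propensity scores `pi` and the outcome-model predictions `m` are simply taken as given inputs, whereas the talk estimates the propensity score through a monotone single-index model. The function names `ipw_mean` and `aipw_mean` are hypothetical.

```python
def ipw_mean(y, r, pi):
    """IPW estimate of E[Y]: average of R_i * Y_i / pi_i.

    y  : responses (entries with r == 0 are ignored and may be None)
    r  : 1 if the response is observed, 0 if it is missing
    pi : propensity scores P(R = 1 | X), assumed given and strictly positive
    """
    total = 0.0
    for yi, ri, pi_i in zip(y, r, pi):
        if ri == 1:
            total += yi / pi_i
    return total / len(y)


def aipw_mean(y, r, pi, m):
    """Augmented estimate of E[Y].

    Averages R_i * Y_i / pi_i - (R_i - pi_i) / pi_i * m_i, where m_i is an
    outcome-model prediction of Y_i (assumed given here).
    """
    total = 0.0
    for yi, ri, pi_i, mi in zip(y, r, pi, m):
        weighted = yi / pi_i if ri == 1 else 0.0
        total += weighted - (ri - pi_i) / pi_i * mi
    return total / len(y)


# Toy data: the third response is missing.
y = [1.0, 2.0, None, 4.0]
r = [1, 1, 0, 1]
pi = [0.5, 0.8, 0.5, 1.0]
m = [1.0, 2.0, 3.0, 4.0]  # outcome model that happens to fit the observed cases exactly

print(ipw_mean(y, r, pi))
print(aipw_mean(y, r, pi, m))
```

In this toy input the outcome predictions match every observed response, so the augmentation term makes the second estimate equal to the average of `m`. Small propensity scores inflate the weights `1 / pi_i`, which is the instability the talk's title refers to.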

Speaker 2: Dr. Bingqing Lin (林炳清), Shenzhen University (3:30-4:30 p.m.)

Speaker biography:

Bingqing Lin is currently an Assistant Professor at the College of Mathematics and Statistics, Shenzhen University. He received his Ph.D. degree from Nanyang Technological University (NTU), Singapore, in 2014 and then worked as a postdoctoral scholar at the School of Biological Sciences at NTU. His current research interests include variable selection and post-selective inference in massive datasets, bioinformatics, and machine learning.

Title:

Stability of Differential Expression Analysis in RNA Sequencing Data

Abstract:

As RNA-seq becomes the assay of choice for measuring gene expression levels, differential expression (DE) analysis has received extensive attention from researchers. To date, evaluations of DE methods have focused mostly on validity. Yet another important aspect of DE methods, stability, is often overlooked. In this study, we empirically show the need to assess the stability of DE methods and propose a stability metric, called Area Under the Correlation curve (AUCOR), that generates perturbed datasets from a mixture distribution and combines information on the similarities between the sets of features selected from these perturbed datasets and from the original dataset. Empirical results support that AUCOR can effectively rank DE methods in terms of stability for given RNA-seq datasets. In addition, we explore how biological or technical factors from the experiment and data analysis affect the stability of DE methods.
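
The abstract does not give the definition of AUCOR, so the sketch below does not attempt to reproduce it. It only illustrates the general idea of comparing the set of features selected on the original dataset with the sets selected on perturbed datasets, using Jaccard similarity as an assumed stand-in similarity measure. The function names and gene labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of feature names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty selections agree perfectly
    return len(a & b) / len(a | b)


def mean_similarity(original_selected, perturbed_selected):
    """Average similarity between the original selection and each perturbed selection."""
    sims = [jaccard(original_selected, s) for s in perturbed_selected]
    return sum(sims) / len(sims)


# Genes called differentially expressed on the original data ...
original = ["g1", "g2", "g3"]
# ... and on three perturbed versions of the data (made-up labels).
perturbed = [["g1", "g2", "g3"], ["g2", "g3", "g4"], ["g5"]]

print(mean_similarity(original, perturbed))
```

A method whose selections change little under perturbation scores close to 1 on this stand-in measure, and a method whose selections change a lot scores close to 0.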

Speaker 3: Associate Professor Wei Zhong (钟威), Xiamen University (4:30-5:30 p.m.)

Speaker biography:

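Wei Zhong is an associate professor and doctoral supervisor at the Wang Yanan Institute for Studies in Economics and the Department of Statistics, School of Economics, Xiamen University, and serves as assistant to the dean of the School of Economics. He graduated from Beijing Normal University with a degree in statistics in 2008 and received his Ph.D. in statistics from the Department of Statistics at Pennsylvania State University in 2012. He joined Xiamen University the same year and received an exceptional early promotion to associate professor in 2014. His research interests are mainly statistical methods for high-dimensional data analysis and their applications, nonparametric statistics, and econometrics. In 2014 he was selected for the Fujian Province training program for outstanding young research talents in universities, and in 2016 he won first prize in the Xiamen University Teaching Skills Competition and English Teaching Competition. He has led one National Natural Science Foundation of China Young Scientists Fund project and one General Program project, and has published more than 10 papers in leading international statistics and econometrics journals, including Journal of the American Statistical Association, Annals of Statistics, Journal of Business and Economic Statistics, Annals of Applied Statistics, and Statistica Sinica.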

Personal homepage: http://wzhongwise.weebly.com/

Title:

A Lack-of-Fit Test with Screening in Sufficient Dimension Reduction

Abstract:

It is of fundamental importance to infer how the conditional mean of the response varies with the predictors. Sufficient dimension reduction techniques reduce the dimension by identifying a minimal set of linear combinations of the original predictors without loss of information. This paper is concerned with testing whether a given small number of linear combinations of the original ultrahigh-dimensional covariates is sufficient to characterize the conditional mean of the response. We first introduce a novel consistent lack-of-fit test statistic for the case where the dimensionality of the covariates is moderate. The proposed test is shown to be $n$-consistent under the null hypothesis and root-$n$-consistent under the alternative hypothesis. A bootstrap procedure is also developed to approximate p-values, and its consistency has been theoretically studied. To deal with ultrahigh dimensionality, we introduce a two-stage lack-of-fit test with screening (LOFTS) procedure based on a data-splitting strategy. The data are randomly partitioned into two equal halves. In the first stage, we apply martingale difference correlation-based screening to one half of the data and select a moderate set of covariates. In the second stage, we perform the proposed test based on the selected covariates using the second half of the data. The data-splitting strategy is crucial for eliminating the effect of spurious correlations and avoiding the inflation of Type-I error rates. We also demonstrate the effectiveness of our two-stage test procedure through comprehensive simulations and two real-data applications.
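
The data-splitting structure described in the abstract can be sketched as follows. This is only a skeleton under stated assumptions, not the LOFTS procedure itself: absolute Pearson correlation is swapped in for the martingale difference correlation as the screening utility, and the second-stage lack-of-fit test is left as a caller-supplied function. All function names are hypothetical.

```python
import random


def split_in_halves(n, seed=0):
    """Randomly partition row indices 0..n-1 into two halves."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]


def abs_pearson(x, y):
    """Absolute Pearson correlation (stand-in screening utility, an assumption)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def screen(X, y, rows, keep):
    """Stage 1: rank covariates on the given rows and keep the top `keep` columns."""
    ys = [y[i] for i in rows]
    scores = [abs_pearson([X[i][j] for i in rows], ys) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]


def two_stage(X, y, keep, second_stage_test, seed=0):
    """Screen on one half, then run the supplied test on the other half only."""
    first, second = split_in_halves(len(y), seed)
    selected = screen(X, y, first, keep)
    X2 = [[X[i][j] for j in selected] for i in second]
    y2 = [y[i] for i in second]
    return selected, second_stage_test(X2, y2)


# Toy usage: only column 2 drives the response; the "test" just reports its input shape.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(40)]
y = [3 * row[2] + 0.1 * rng.gauss(0, 1) for row in X]
selected, result = two_stage(X, y, keep=2,
                             second_stage_test=lambda X2, y2: (len(X2), len(X2[0])))
print(selected, result)
```

Because the covariates are chosen using only the first half, the second-half data used for testing played no part in the selection, which is the point the abstract makes about spurious correlations and Type-I error inflation.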