Comparison to Traditional Statistical Analysis

Pre-existing NIAAA-sponsored alcohol-related databases may not be fully exploited because existing statistical software cannot adequately represent non-linear causal relationships. Moreover, state, county, and national agencies, as well as public and private hospitals, have limited resources for combating alcohol-related problems. Such groups need to make better use of existing databases in order to allocate those resources effectively. A number of such databases are currently available. For example, the Alcohol Epidemiologic Data Directory (June 1993) lists more than 27 national health and alcohol data sets that help characterize the nature of alcohol problems in society. Our project used the results of the National Longitudinal Survey - Youth (NLSY) study conducted by NORC at the University of Chicago. The NLSY database was created and distributed by the Center for Human Resource Research and was provided to us by the NIAAA.

One important use of databases such as the NLSY is to identify longitudinal patterns of alcohol-related symptoms and administrative strategies, as well as medical and psychiatric conditions, that effectively predict patient outcomes. Such information is invaluable to:

clinicians for deciding which sequences of treatment procedures are most effective,
administrators for deciding how to improve the quality of patient care at reduced costs, and
national, state, and county agencies for evaluation and monitoring purposes.

The interpretation of a database containing alcohol-related information is limited by the quality and quantity of the data. However, if appropriate prior knowledge and expectations are introduced directly into the statistical model used to analyze the database, improved statistical inferences can be obtained from the same data set. This rather non-traditional perspective on statistical inference is now possible given recent advances in the field of econometrics.

Traditional Statistical Analysis

The traditional approach to an alcohol disorder classification problem uses a multiple linear regression model. In this setting the model is asked to predict a dependent variable whose values are integer codes for categories (e.g., a Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) diagnosis where 1 = dependence and abuse, 2 = abuse, 3 = dependence only, 4 = none). Unfortunately, a categorical dependent variable of this kind is not consistent with the fundamental assumption of multiple linear regression, which is that the dependent measure is a continuous linear function of the independent variables perturbed by Gaussian noise. Thus, traditional multiple linear regression is not only incapable of representing prior knowledge about potential logical causal relationships, it also builds incorrect prior knowledge into the outcome prediction. These observations suggest that standard multiple linear regression methods may not be the most effective tools for addressing generic alcohol classification problems or other database analysis problems.
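The effect of treating category codes as a continuous measure can be illustrated with a small sketch. The data below are hypothetical toy values, not NLSY records, and the helper name `fit_line` is introduced here only for illustration. The same four categories are fit twice by ordinary least squares, once under the coding given above and once with two of the codes swapped; because the codes carry no real numeric meaning, the fitted slope and the decoded prediction both change.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef[0], coef[1]

# Hypothetical toy data: one predictor and a four-category diagnosis.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
labels = ["both", "both", "abuse", "abuse", "dep", "dep", "none", "none"]

# Two equally arbitrary integer codings of the same categories.
coding_a = {"both": 1, "abuse": 2, "dep": 3, "none": 4}
coding_b = {"both": 1, "abuse": 3, "dep": 2, "none": 4}

def predict_category(coding, x_new):
    """Fit on the coded labels, round the prediction, and decode it."""
    y = [coding[label] for label in labels]
    intercept, slope = fit_line(x, y)
    code = int(np.rint(intercept + slope * x_new))
    decode = {v: k for k, v in coding.items()}
    return slope, decode[code]

slope_a, cat_a = predict_category(coding_a, 2.0)
slope_b, cat_b = predict_category(coding_b, 2.0)
print(slope_a, cat_a)
print(slope_b, cat_b)
```

The two runs describe identical data, yet they disagree about the predicted category at the same input, which is the sense in which the integer coding injects incorrect prior knowledge.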

A classical direct solution to part of this problem with multiple linear regression is log-linear or categorical data analysis. Categorical data analysis explicitly allows the dependent measure to be restricted to a small number of values whose similarity relations are determined by the data rather than by potentially incorrect assumptions about the nature of the dependent measure. Only a relatively small number of researchers in alcohol- and drug-related fields (or other fields) have used categorical data analysis methods. This is because directly estimating all relevant conditional probabilities requires more data than is usually available, even in large data sets with thousands of records.
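The data requirement can be made concrete with a rough count. The sketch below assumes d binary state variables and an outcome with m categories, and counts the free conditional probabilities in a fully saturated model (one distribution over outcomes for every joint setting of the state variables). The function name and the example values of d are illustrative assumptions.

```python
def n_conditional_probabilities(d, m):
    """Free parameters of a saturated model: (m - 1) outcome
    probabilities for each of the 2**d settings of d binary variables."""
    return (2 ** d) * (m - 1)

counts = {d: n_conditional_probabilities(d, m=4) for d in (5, 10, 20)}
for d, count in counts.items():
    print(f"d = {d:2d}: {count} conditional probabilities")
```

With only 20 binary state variables and a four-category outcome, the saturated model already has over three million free probabilities, far more than a data set of a few thousand records can estimate directly.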

One very popular method of statistical analysis is linear factor analysis, which uses an underlying linear model to discover useful second-order statistics in a given data set. Unfortunately, linear factor analysis also requires the assumption that a stimulus can be represented as a list of continuous random variables, which may not be consistent with the problem under evaluation. Cluster and nonlinear factor analyses improve on linear factor analysis in that they do not require the assumptions of linearity and Gaussian noise. Unfortunately, all factor analysis methods (linear and nonlinear) are usually based upon a "leap of faith" that the resulting data analyses will yield some interpretable structure.

Structured (constrained) multiple linear regression path analysis and linear confirmatory factor analysis methods have proven useful for addressing the "leap of faith" problem in a principled manner. Such techniques incorporate prior knowledge about the relationships between state variables and outcomes directly into the statistical model. Structured linear regression path analysis methods can then be used to "confirm" whether a given set of prior knowledge is consistent with a given data set. One major problem with structured linear regression path analysis, however, is that classical methods of statistical inference assume that the data were actually generated by the assumed probability model. In other words, the model must fit the data before reliable statistical inferences can be made. It is well known that highly structured statistical models are likely to be misspecified (i.e., incorrect) in a variety of ways. A second major problem with conventional structured linear regression path analysis is that the dependent measure (outcome variable) is assumed to be a continuous Gaussian random variable, as in standard multiple linear regression.

Constrained Categorical Regression Modeling

Martingale Research Corporation has used the structured Constrained Categorical Regression (CCR) modeling approach for predicting the value of an outcome variable from a collection of assertion state variables. Unlike traditional categorical (log-linear) data analysis, considerable structure in the statistical model was assumed during the development of candidate CCR models. We developed explicit formulas for incorporating a database of "heuristic logical rules" directly into the statistical model. The capability to include the user's "expert knowledge" about potential relationships (in our case, that of a nationally recognized alcohol research consultant) explicitly in the causal model is a key feature of this approach. Moreover, the CCR approach combines that knowledge with rigorous statistical theory to provide an "intelligent" statistical analysis tool for performing causal modeling and making statistical predictions. In support of the actual alcohol causal model development, we also developed statistical tests for confirming or disconfirming the user's intuitions about which logical causal relationships are relevant or irrelevant with respect to specific subpopulations.

The assumption of a highly structured statistical model implies the inevitable presence of some degree of model misspecification. From the perspective of classical theories of statistical inference, reliable statistical inferences simply cannot be made with such "imperfect" statistical models. In order to make reliable statistical inferences using the highly structured CCR model, we therefore exploited recent research findings from the field of econometrics. These new econometric methods are specifically designed to handle the problem of making correct statistical inferences in the presence of model misspecification, so the resulting inferences remain reliable even when the CCR model is misspecified.
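The text does not name the specific econometric results it relies on. As an illustration only, one widely used result of this kind is the quasi-maximum-likelihood "sandwich" (robust) covariance estimator, sketched below under the assumption that the model assigns a probability p(o | s; θ) to each observed outcome o_i given its state vector s_i. The symbols θ, A_n, B_n, and g_i are notation introduced here, not taken from the report.

```latex
\hat{\theta}_n = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p(o_i \mid s_i; \theta),
\qquad
g_i = \nabla_{\theta} \log p(o_i \mid s_i; \hat{\theta}_n),

A_n = -\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} \log p(o_i \mid s_i; \hat{\theta}_n),
\qquad
B_n = \frac{1}{n} \sum_{i=1}^{n} g_i \, g_i^{\top},

\widehat{\mathrm{Cov}}(\hat{\theta}_n) = \frac{1}{n} \, A_n^{-1} B_n A_n^{-1}.
```

When the model is correctly specified, A_n and B_n estimate the same matrix and the expression reduces to the classical covariance A_n^{-1}/n; when it is not, the two differ, and the sandwich form is the one that remains valid for tests and confidence intervals.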

In our project, we focused on a particular general class of alcohol-related outcome prediction problems. Such problems consist of a set of assertion state variables [s(1), ..., s(d)], where each state variable represents an assertion which can be either true or false. In addition, we had a categorical outcome variable (using the DSM-IV alcohol dependence diagnostics) which could take on only one of a small number of values [o(1), ..., o(m)]. In order to improve the quality of statistical inferences, we utilized mechanisms for introducing prior knowledge in the form of logical causal relationships directly into the statistical model. An example of such a causal relationship is shown in Figure 2.

Both state variables and outcome variables can be represented in the CCR Modeling System. For example: (i) a discrete measure such as Standard Metropolitan Statistical Area (SMSA) can be represented by a set of assertion state variables (e.g., in Central City, Central City not known, not Central City, not in SMSA or not applicable); (ii) a continuous measure such as age can be represented by a set of assertion state variables (e.g., age < 21, 22 < age < 50, 51 < age < 70, age > 71); and (iii) longitudinal information can be represented by specifying when information about a particular state variable was collected (e.g., frequency of religious attendance at age 15). Similarly, outcome variables can be represented by multiple targets, e.g., DSM-IV - DIAG4: 1 - Dependence and Abuse, 2 - Dependence Only, 3 - Abuse Only, or 4 - None.
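The encoding described above can be sketched as follows. This is not the CCR Modeling System's actual interface; the function names, the assertion names, and the grouping of the SMSA levels are assumptions made for illustration. The age bins use inclusive edges on integer ages, which is also an assumption: the text writes them with strict inequalities, and inclusive edges are used here so that every age falls into exactly one bin.

```python
def encode_age(age):
    """Map an integer age to four mutually exclusive assertions.
    Inclusive bin edges are an assumption (see the lead-in)."""
    return {
        "Age_21_Or_Under": age <= 21,
        "Age_22_To_50": 22 <= age <= 50,
        "Age_51_To_70": 51 <= age <= 70,
        "Age_71_Or_Over": age >= 71,
    }

def encode_category(name, value, levels):
    """Map a discrete measure to one true/false assertion per level."""
    if value not in levels:
        raise ValueError(f"unknown level for {name}: {value!r}")
    return {f"{name}_{level}": value == level for level in levels}

# Hypothetical level names for the SMSA measure.
SMSA_LEVELS = (
    "In_Central_City",
    "Central_City_Not_Known",
    "Not_Central_City",
    "Not_In_SMSA_Or_Not_Applicable",
)

# One record becomes a flat dictionary of assertion state variables.
record = {}
record.update(encode_age(35))
record.update(encode_category("SMSA", "Not_Central_City", SMSA_LEVELS))
print(record)
```

Each measure contributes a group of assertions in which exactly one is true, so a record is reduced to the list of true/false state variables [s(1), ..., s(d)] that the model expects.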

CAUSAL RELATIONSHIP RULE 1:
IF Female AND Religious_Attendance_Infrequent_At_15 AND Number_Years_Poverty_3 AND Income_Decrease_Age_15 AND Family_Members_0-4
THEN CAUSAL for DSM-IV = 1 (No Alcohol Problem)
     CAUSAL for DSM-IV - DIAG4 = 3 (Alcohol Abuse).

Figure 2. Logical causal rules represent the relationships between the chosen input variables and the selected outcomes.

Incorporating such potential heuristics (rules of thumb) directly into a statistical model in an appropriate manner can tremendously increase the power of the statistical analysis since the statistical analysis can focus upon the expert knowledge and hints provided by the user. The CCR modeling approach we developed does exactly this.
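The explicit CCR formulas are not reproduced in this section, so the following is only a minimal sketch of the general idea, assuming an ordinary multinomial logit (softmax) model in which each rule adds a weight to the log-odds of the outcomes it names, and only when all of its conditions are true. The function names, the weight, and the biases are hypothetical, and for simplicity only the DIAG4 = 3 (Alcohol Abuse) consequent of Rule 1 is used.

```python
import math

# Rule 1 from Figure 2: conditions that must all hold, and the
# DIAG4 outcome code the rule is allowed to influence.
RULE_1 = {
    "conditions": (
        "Female",
        "Religious_Attendance_Infrequent_At_15",
        "Number_Years_Poverty_3",
        "Income_Decrease_Age_15",
        "Family_Members_0-4",
    ),
    "outcomes": (3,),
}

def rule_fires(rule, record):
    """A rule fires only if every one of its assertions is true."""
    return all(record.get(name, False) for name in rule["conditions"])

def outcome_probabilities(record, rules, weights, biases):
    """Softmax over outcome codes; a firing rule shifts only the
    logits of the outcomes it names."""
    logits = dict(biases)
    for rule, weight in zip(rules, weights):
        if rule_fires(rule, record):
            for outcome in rule["outcomes"]:
                logits[outcome] += weight
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {o: math.exp(v - top) for o, v in logits.items()}
    total = sum(exps.values())
    return {o: e / total for o, e in exps.items()}

biases = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}  # hypothetical values
weights = [1.0]                             # hypothetical rule weight

matching = {name: True for name in RULE_1["conditions"]}
non_matching = dict(matching, Female=False)

p_match = outcome_probabilities(matching, [RULE_1], weights, biases)
p_other = outcome_probabilities(non_matching, [RULE_1], weights, biases)
print(p_match)
print(p_other)
```

In a sketch like this the rule weights would be estimated from the data, so a rule the data do not support ends up with a weight near zero; this mirrors, in a simplified way, the idea of confirming or disconfirming the user's intuitions.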