# Constrained Categorical Regression Modeling

With the increase in information flow faced by companies in today's business world, many formerly exotic technologies such as database mining and statistical modeling have become vital to maintaining a corporation's competitiveness. However, traditional techniques for understanding and exploiting corporate and other databases are limited in capability, difficult to employ, and difficult to validate. In general, the goal of any database exploitation effort is to extract and analyze information that is currently or potentially important to the operation of the enterprise. Many of the existing tools available are simply variations of standard "cross-tab" tools, in one or more dimensions, that allow the user to obtain summary views of their data by cross-tabulating the data along the selected dimensions (e.g. spreadsheet programs). However, more powerful forms of data exploitation are now becoming available that allow the user to peer more deeply into structural and causal relationships that may exist in the data.

This data exploitation effort often takes the form of constructing a causal model that relates certain items of information (for example, in a corporate database) to certain outcome conditions. An example of a simple causal model is as follows:

```
IF Customer Income is >$25,000/year AND
Number in Household is > 6 people, AND
NOT Residence in rural location,
THEN (CAUSAL) Probability of High_Monthly_Phone_Bill is Increased.
```

Note the resemblance to a so-called "expert system", a technology that has been utilized in certain well-defined applications within the last decade. A causal model, however, is much more powerful because the relationships expressed can be much more complex than simple IF-THEN rules, can exploit the statistics of the information from which the model was constructed, and can inherently represent probabilistic or "fuzzy" outcomes. These kinds of capabilities make causal modeling well suited to handling real-world problems, where the available data are almost never clear-cut, complete, or unambiguous.
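
To make the contrast with a plain IF-THEN rule concrete, the example rule above can be sketched as a binary feature feeding a logistic model, so that the outcome is a probability rather than a yes/no conclusion. This is only an illustrative sketch, not the CCR implementation: the function names, the `baseline_logit` and `rule_weight` parameters, and their numeric values are all hypothetical and are not taken from any fitted model.

```python
import math


def rule_fires(income_per_year, household_size, rural):
    """Return 1 if all three conditions of the example rule hold, else 0."""
    return int(income_per_year > 25_000 and household_size > 6 and not rural)


def prob_high_bill(income_per_year, household_size, rural,
                   baseline_logit=-1.0, rule_weight=1.5):
    """Probability of a high monthly phone bill under a logistic model.

    baseline_logit and rule_weight are made-up illustrative values; in a
    real model they would be estimated from data.
    """
    z = baseline_logit + rule_weight * rule_fires(
        income_per_year, household_size, rural)
    return 1.0 / (1.0 + math.exp(-z))


# The rule raises the probability when it fires, instead of asserting a
# hard conclusion.
p_urban = prob_high_bill(30_000, 7, rural=False)
p_rural = prob_high_bill(30_000, 7, rural=True)
```

With a positive `rule_weight`, a household that satisfies the rule receives a higher predicted probability than one that does not, which mirrors the "Probability ... is Increased" wording of the rule.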

To address this need for improved approaches to intelligent information exploitation, Martingale Research has developed an advanced technique for causal modeling called Constrained Categorical Regression (CCR). The technology incorporates recent advances in neural networks research and advanced statistical methods to provide a fast, powerful, and flexible capability for constructing causal models from real-world data. Unlike other neural network paradigms, which provide a "black-box" solution that is not easily understandable, the CCR technology supports the construction of fully transparent, easily understandable causal models at all stages of system development.

This technology was developed on a project sponsored by the National Institutes of Health, National Institute on Alcohol Abuse and Alcoholism (NIAAA). The project used the CCR modeling methodology to analyze information from certain national databases containing information on alcohol dependence and abuse among a representative group of subjects. The results from this project, while proving extremely useful for the alcohol-research field, also demonstrated the CCR technique's applicability to many areas within the social science and health care fields. Moreover, it is a statistical approach that can be used effectively by any organization that has a set of data (i.e., a database) and a need to understand relationships within that data. Examples of these types of applications include understanding customer databases, market research analysis, operations research for process re-engineering, and analysis of maintenance databases. In fact, wherever information is available but not clearly understood or exploited, the CCR modeling technique can be used to gain new insights into underlying causal relationships. This makes it a powerful tool for analyzing trends or effects or making outcome predictions for a wide variety of applications.

The CCR modeling technique provides an important base for further advances Martingale Research has made in the area of specialized statistical modeling and analysis. It is one of a number of novel statistical methods Martingale Research is integrating into its upcoming statistical software package.

There are a number of benefits when using the CCR modeling approach for database mining and analysis. In particular, other data analysis and data mining technologies fall short when dealing with real-world, noisy, conflicting, or incomplete data. Since the CCR modeling technology is specifically designed to deal with situations where all conditions and relationships may not be known, it provides tools to estimate predictive confidence that allow the user to apply 'sanity tests' to any modeling results.

## Unique Benefits

- Reliable inferences when the model does not fit the data:
  - Confidence intervals and statistical tests are reliable under many conditions where the model does not fit the data (i.e., the model is misspecified). In contrast, classical statistical theory assumes that the model fits the data, although in many cases it does not.
- Comparison of totally different models:
  - Supports model selection statistical tests for deciding which of two statistical models best fits a given data set. The two models may be nested, non-nested, or overlapping. In addition, the two models may be misspecified or correctly specified. By contrast, classical statistical theory (Wilks' generalized likelihood ratio test) can only handle the case where the models are nested and the full model is correctly specified.
- Automatic statistical analysis selection for categorical, continuous, or mixed models:
  - The user can specify the types of predictor or outcome variables (e.g., categorical, continuous) and the CCR program automatically determines the correct statistical model to create (e.g., logistic regression, multinomial logit regression, simple linear regression, multiple linear regression, or the appropriate combination of these). A user can also specify multiple outcome variables for any combination of categorical and continuous types. Moreover, all statistical analyses allow selective weighting of observations.
- Mechanisms for handling ill-conditioned data sets:
  - When a data set is ill-conditioned with respect to statistical analysis, a Bayesian ridge regression option is available for forcing the data to become better conditioned with respect to the model. The assumptions that are involved in this operation are readily interpretable.
- Improved stepwise regression:
  - Stepwise regression options are available for model development that use advanced model selection statistical tests, which remain reliable in the presence of model misspecification. Other stepwise procedures typically use only Wilks' generalized likelihood ratio test for model selection and thus may make incorrect inferences in the course of model development.
- Continuous outcome covariance options:
  - For continuous outcomes (e.g., linear regression), the user can specify assumptions so that all continuous outcome variables have either a common variance, a unique variance, or share a covariance matrix.
- Rulebase representation scheme:
  - Each parameter in the CCR Modeling methodology is associated with a rule that links a combination of predictor variables with a combination of output variables. Rules can be easily chosen in such a way as to implement logistic regression, multinomial logit regression, simple linear regression, or multiple linear regression models. Rules are especially useful for recoding predictor variables.
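
As a rough illustration of the ill-conditioned data item in the list above, the following sketch shows how a ridge penalty can improve the conditioning of a least-squares problem with two nearly collinear predictors. It uses ordinary closed-form ridge regression for a linear model with NumPy, which is an assumption made for illustration and not necessarily how the CCR software implements its Bayesian ridge option. The function name `ridge_fit`, the synthetic data, and the penalty value are all hypothetical.

```python
import numpy as np


def ridge_fit(X, y, penalty):
    """Closed-form ridge estimate: solve (X'X + penalty * I) w = X'y.

    Returns the coefficient vector and the condition number of the
    penalized Gram matrix. A penalty of 0 gives ordinary least squares.
    """
    n_features = X.shape[1]
    gram = X.T @ X + penalty * np.eye(n_features)
    weights = np.linalg.solve(gram, X.T @ y)
    return weights, np.linalg.cond(gram)


# Synthetic data (illustrative only): the second predictor is almost a
# copy of the first, so the unpenalized problem is ill-conditioned.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-4 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=200)

w_plain, cond_plain = ridge_fit(X, y, penalty=0.0)
w_ridge, cond_ridge = ridge_fit(X, y, penalty=1.0)

print("condition number without penalty:", cond_plain)
print("condition number with penalty:   ", cond_ridge)
print("ridge coefficients:", w_ridge)
```

In the standard Bayesian reading of ridge regression, the penalty corresponds to a zero-mean Gaussian prior on the coefficients, which is one sense in which the assumptions behind such an option can be stated plainly. A larger penalty expresses a stronger prior belief that the coefficients are small.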

## Applications

A wide variety of application areas can benefit from applying the CCR Modeling system for data exploitation. Some of these application areas are shown below:

### Corporate Information Analysis

- Customer profiling, customer database analysis
- Employee profiles, training, and productivity
- Loss analysis
- Risk management and exposure analysis
- Credit risk analysis

### Health and Social Science Fields

- Epidemiological database analysis
- Health care alternatives and impacts assessment
- Clinical diagnostic aids
- Statistical inferencing from sociological and demographic databases

### Logistics and Information Management

- Fleet Service Record database mining (DOD and civilian transportation)
- Technical training effectiveness
- Collections database causal assessment

### Financial Market Analysis

- Equity Market selection
- Commodity trades effectiveness
- Investment modeling
- Risk analysis
- Trading system development

### Law Enforcement

- Crime and offender profiling
- Resource allocation and optimization
- Database filtering

### Insurance

- Risk assessment
- Customer profiling, customer database analysis
- Rate analysis

## Additional Information

The following links provide additional technical background for our CCR technology:

- Comparison to Traditional Statistical Analysis
  - A more in-depth explanation of the ways in which constrained categorical regression modeling differs from and improves on traditional statistical analysis.
- Pruning a Softmax Neural Network Using Principled Optimal Brain Damage
  - World Congress on Neural Networks. Abstract
- Data Modeling Using Constrained Categorical Regression
  - Artificial Neural Networks in Engineering Conference. Abstract
- Using Constrained Categorical Regression to Identify Structural Relationships in Epidemiological Data
  - Conference on Simulation in the Medical Sciences. Abstract