SUGI 25, Paper 69-25
PROC FREQ: It's More Than Counts
Richard Severino, The Queen's Medical Center, Honolulu, HI
Beginning Tutorials
ABSTRACT
The FREQ procedure can be used for more than just obtaining a simple frequency distribution or a 2-way crosstabulation. Multidimensional tables can be analyzed using PROC FREQ. There are many options which control what statistical test is performed as well as what output is produced. Some of the tests require that the data satisfy certain conditions. Some options produce a set of results from which one must select appropriately for the situation at hand. Which of the results produced by using the CHISQ option should one use? What is the WEIGHT statement for? Why would one create an output data set with the OUT= option? This paper (a beginning tutorial) will answer these questions as many of the options available in PROC FREQ are reviewed.
INTRODUCTION
The name alone might lead anyone to think that the primary use of PROC FREQ is to generate tables of frequencies. According to the SAS® documentation, "the FREQ procedure produces one-way to n-way frequency and crosstabulation tables". In the second edition of The Little SAS® Book, Delwiche and Slaughter state that the most obvious reason for using PROC FREQ is to create tables showing the distribution of categorical data values. In fact, PROC FREQ is more than just a procedure for counting and cross tabulating. PROC FREQ is capable of producing test statistics and other statistical measures in order to analyze categorical data based on the cell frequencies in 2-way or higher tables.
There are quite a few options one can use in PROC FREQ, and the output often includes additional information the user did not request or expect. A first-time user trying to obtain a simple chi-square test statistic from a 2-way table may be surprised to see that the CHISQ option gives them more than just the Pearson Chi-Square. What are the different statistical tests and measures available in PROC FREQ? Can the output be controlled? Can you eliminate the unwanted or inappropriate test statistics? These are some of the questions that this paper will address.
OVERVIEW
The general syntax for PROC FREQ is:
PROC FREQ options;
   BY variable-list;
   TABLES requests / options;
   WEIGHT variable;
   OUTPUT ...
   TEST options;
with the last statement, TEST, being a new addition in version 7. As the options are discussed, any that are new with version 7 and not available in version 6.12 will be identified.
The only required statement is the PROC FREQ statement itself, which will produce a one-way frequency table for each variable in the data set. For example, suppose we are using a data set consisting of the coffee data in chapter 4 of The Little SAS Book. The data consists of two variables: the type of coffee ordered and the window it was ordered from. If we run the following code:
proc freq; run;
then the resulting output would look like that in Output 1.
Output 1. Default output for PROC FREQ
                       Coffee Data
           Output when running: PROC FREQ; RUN;

                                    Cumulative  Cumulative
COFFEE        Frequency   Percent    Frequency     Percent
----------------------------------------------------------
cap                   6      20.7            6        20.7
esp                   8      27.6           14        48.3
ice                   4      13.8           18        62.1
kon                  11      37.9           29       100.0

                  Frequency Missing = 1

                                    Cumulative  Cumulative
WINDOW        Frequency   Percent    Frequency     Percent
----------------------------------------------------------
d                    13      43.3           13        43.3
w                    17      56.7           30       100.0
It is best to use the TABLES statement to specify the variables for which a frequency distribution or crosstabulation is desired. Failing to do so will result in a frequency distribution which lists all the unique values of any continuous variables in the data set as well as the categorical ones. It is also good practice to include the DATA= option, especially when using multiple data sets. More than one TABLES statement can be used in PROC FREQ, and more than one table request can be made on each TABLES statement.
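As an illustration of these recommendations, the sketch below names the data set explicitly and uses two TABLES statements, the first of which carries two requests. The data set name COFFEE is an assumption made for this example; the variable names are those shown in Output 1.

```sas
/* COFFEE is a hypothetical data set name for the coffee data */
proc freq data=coffee;
  tables coffee window;     /* two one-way tables from a single TABLES statement */
  tables coffee*window;     /* a second TABLES statement requesting a two-way table */
run;
```

Listing only COFFEE and WINDOW keeps any continuous variables that might be in the data set out of the output.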
We can divide all of the statements and options available in PROC FREQ into three primary categories:
1. Controlling the frequency or crosstabulation output as far as content and appearance is concerned
2. Requesting statistical tests or measures
3. Writing tables and results to SAS data sets.
I will begin by addressing one-way tables. Those readers already familiar with one-way tables and the options that can be used with them may wish to skip to the section on two-way and higher tables.
ONE-WAY TABLES
The simplest output from PROC FREQ is a one-way frequency table which lists the unique values of the variable, a count of the number of observations at each value, the percent this count represents, a cumulative count and a cumulative percent.
Suppose that we have data on the pain level experienced 24 hours after one of 3 different surgical procedures used to repair a hernia is performed. The data consists of 3 variables: the medical center where the procedure was performed, the procedure performed and the level of pain (none, tolerable, intolerable) reported by the patient 24 hours later. The data is shown in the following data step:
data pain;
  input site group pain;
  label site  = 'Test Center'
        group = 'Procedure'
        pain  = 'Pain Level';
  cards;
1 1 2
1 2 0
1 2 1
. . .
3 3 1
;
run;
To obtain a frequency distribution of the pain level, we would run the following:
proc freq data=pain;
  tables pain;
run;
which would result in the one-way table in Output 2.
Output 2. One-way Frequency for Pain Data.
                        Pain Data

                        Pain Level

                                    Cumulative  Cumulative
PAIN          Frequency   Percent    Frequency     Percent
----------------------------------------------------------
0                     6      22.2            6        22.2
1                    15      55.6           21        77.8
2                     6      22.2           27       100.0
In the column with the heading 'PAIN' we see the 3 unique values for pain in the data: 0, 1 and 2. The 'Frequency' column shows the number of observations for each value of PAIN and the 'Percent' column shows what percent of the data have each value of PAIN. The 'Cumulative Frequency' and 'Cumulative Percent' are simply running totals. For PAIN = 1, the cumulative frequency of 21 means that 21 cases have a PAIN value of 1 or 0. The corresponding cumulative percent of 77.8% is obtained by dividing the cumulative frequency, 21, by the total count, 27, and multiplying by 100.
The cumulative frequency and percent are most meaningful if the variable is at least ordinal. In this example, the variable PAIN is ordinal, i.e. the pain level increases as the value increases, and it makes sense to say "21 cases, or 77.8% of the sample, reported a pain level of 1 or less".
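The running totals can be reproduced by hand. The following is only a sketch: it re-enters the three counts from Output 2, and the data set name CUMCHECK and the variable names CUMFREQ and CUMPCT are made up for the illustration.

```sas
data cumcheck;                     /* hypothetical data set name */
  input pain count;
  cumfreq + count;                 /* sum statement: running total of the counts */
  cumpct = 100 * cumfreq / 27;     /* 27 is the total number of patients in Output 2 */
  cards;
0 6
1 15
2 6
;
run;

proc print data=cumcheck;
run;
```

For PAIN = 1 the running total is 6 + 15 = 21, and 100 * 21 / 27 rounds to the 77.8 reported in the table.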
If we use PROC FORMAT to define a format for the variable PAIN, we can then include a FORMAT statement in PROC FREQ, as in the statements below, which result in the output shown in Output 3.
proc format;
  value pain 0='None'
             1='Tolerable'
             2='Intolerable';
run;

proc freq data=pain;
  tables pain;
  format pain pain.;
run;
Output 3. Using the FORMAT statement in PROC FREQ
             Pain Data - with FORMAT statement

                        Pain Level

                                    Cumulative  Cumulative
PAIN          Frequency   Percent    Frequency     Percent
----------------------------------------------------------
None                  6      22.2            6        22.2
Tolerable            15      55.6           21        77.8
Intolerable           6      22.2           27       100.0
Notice that the numeric values of PAIN have been replaced with some meaningful description.
Now let's say that the variable GROUP is coded such that Procedures A, B and C are coded as 2, 1, and 3 respectively. A format can be defined and used in PROC FREQ to produce the one-way table in Output 4.
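The format for GROUP is not listed in the paper. A minimal sketch that is consistent with the coding just described, and with the format name GROUP used in later examples, might look like this:

```sas
proc format;
  /* coding taken from the text: 1 = Procedure B, 2 = Procedure A, 3 = Procedure C */
  value group 1='B'
              2='A'
              3='C';
run;

proc freq data=pain;
  tables group;
  format group group.;
run;
```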
Output 4. Using the FORMAT statement in PROC FREQ
             Pain Data - with FORMAT statement

                        Procedure

                                    Cumulative  Cumulative
GROUP         Frequency   Percent    Frequency     Percent
----------------------------------------------------------
B                     9      33.3            9        33.3
A                     9      33.3           18        66.7
C                     9      33.3           27       100.0
Notice that procedure 'B' appears before procedure 'A'. By default, PROC FREQ orders the data according to the unformatted values of the variable, and since the unformatted value of GROUP for procedure 'A' is 2, it comes after 1, which is the unformatted value for procedure 'B'. We can rectify the situation by using the option ORDER=FORMATTED in the PROC FREQ statement.
Output 5 shows the one-way table obtained from the following code:
proc freq data=pain order=formatted;
  tables group;
  format group group.;
run;
Output 5. Using ORDER=FORMATTED
                Pain Data - ORDER=FORMATTED

                        Procedure

                                    Cumulative  Cumulative
GROUP         Frequency   Percent    Frequency     Percent
----------------------------------------------------------
A                     9      33.3            9        33.3
B                     9      33.3           18        66.7
C                     9      33.3           27       100.0
Assume that the differences between procedures A, B and C are qualitative rather than quantitative. The variable GROUP is therefore a nominal variable whose values have no natural order, and the cumulative frequencies and percents produced by PROC FREQ are rather meaningless. Including the option NOCUM on the TABLES statement will eliminate the cumulative frequency and percent from the table as shown in Output 6.
proc freq data=pain order=formatted;
  title1 'Pain Data';
  title2 '';
  tables group / nocum;
  format group group.;
run;
Output 6. Cumulative Frequency and Percent Eliminated from the Output: NOCUM
  Pain Data - using NOCUM Option

            Procedure

GROUP         Frequency   Percent
---------------------------------
A                     9      33.3
B                     9      33.3
C                     9      33.3
Notice the absence of the cumulative frequency and cumulative percent. Output 7 shows that we can also eliminate the percent column by including the option NOPERCENT in the TABLES statement:
proc freq data=pain order=formatted;
  tables group / nocum nopercent;
  format group group.;
run;
Output 7. Result of using NOCUM and NOPERCENT
Pain Data - Using NOCUM and NOPERCENT

       Procedure

GROUP         Frequency
-----------------------
A                     9
B                     9
C                     9
In the Pain data example, we have the actual data from which to create the one-way frequency tables. That is, the data above consists of the response for each patient in each group at each site.
The WEIGHT statement is used when we already have the counts. For example, suppose we are told that in a study to estimate the prevalence of asthma, 300 people from 3 different cities were examined with the following results: of the 100 people examined in each city, 35, 40 and 25 were found to have asthma in Los Angeles, New York and St. Louis respectively. The following data step reads the summarized data, and then, using the WEIGHT statement in PROC FREQ, we produce a one-way table (Output 8) for the overall sample prevalence of asthma in the 3 cities combined.
data asthma;
  input city asthma count;
  cards;
1 1 35
1 0 65
2 1 40
2 0 60
3 1 25
3 0 75
;
run;

proc format;
  value asth 0='No Asthma'
             1='Asthma';
run;

proc freq data=asthma;
  weight count;
  tables asthma;
  format asthma asth.;
run;
Output 8. One-way Table Produced Using the WEIGHT Statement
           Asthma Data - Using WEIGHT statement

                                    Cumulative  Cumulative
ASTHMA        Frequency   Percent    Frequency     Percent
----------------------------------------------------------
No Asthma           200      66.7          200        66.7
Asthma              100      33.3          300       100.0
Of the options available on the TABLES statement, only NOCUM, NOPERCENT, NOPRINT and OUT= have any effect on one-way tables.
The NOPRINT option may be used if we want to use the frequencies or percents from the one-way table as input to a data step, macro or other SAS program for customized processing.
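A short sketch of that pattern, using the pain data from earlier: the table itself is not printed, but the counts and percents are written to a data set that later steps can read. The output data set name PAINFREQ is an assumption made for the example.

```sas
proc freq data=pain;
  tables pain / noprint out=painfreq;   /* PAINFREQ is a hypothetical name */
run;

/* the OUT= data set holds one observation per level of PAIN */
proc print data=painfreq;
run;
```

The OUT= data set contains the table variable together with the variables COUNT and PERCENT, which is what makes it convenient as input to a subsequent data step.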
The options for statistical tests or measures do not produce any output for one-way tables. However, one might be interested in computing a 95% confidence interval for a single proportion from a one-way table. Consider the data in Example 13.1 of Common Statistical Methods for Clinical Research with SAS® Examples by Glenn Walker. The data is from a study in which 25 patients with genital warts are given a combination of treatments which resulted in 14 patients being cured. The standard treatment has a known cure rate of 40%, and so the question is whether the success rate of the combination of treatments is consistent with that of the standard treatment. We can compute a 95% confidence interval for the proportion of successful treatments based on the study and then see if the interval includes .4 (40%) or not. To do this we will make use of the WEIGHT statement and the OUT= option in PROC FREQ. The DATA step, PROC FREQ statements and resulting output follow.
data ex13_1;
  input cured $ count;
  cards;
YES 14
NO 11
;
run;

proc freq data=ex13_1;
  weight count;
  tables cured / nocum out=curedci;
run;
The option NOCUM has suppressed the printing of the cumulative frequencies and percents, which are not necessary as there are only two categories. The OUT=CUREDCI option has written the table as shown to a SAS data set named CUREDCI. The data set CUREDCI has 3 variables (CURED, COUNT and PERCENT) and 2 observations (one for NO, one for YES).
Output 9. One-way Table for Data from Glenn Walker's Example 13.1
     Data from Example 13.1

CURED         Frequency   Percent
---------------------------------
NO                   11      44.0
YES                  14      56.0
The following data step uses this output data set to compute an approximate 95% confidence interval for the cure rate.
data curedci;
  set curedci;
  P = PERCENT/100;
  N = COUNT + LAG(COUNT);
  LB = P - (1.96*SQRT(P*(1-P)/N));
  * reset lower bound to 0 if <0;
  IF LB < 0 THEN LB = 0;
  UB = P + (1.96*SQRT(P*(1-P)/N));
  * reset upper bound to 1 if >1;
  IF UB > 1 THEN UB = 1;
  if _N_ = 2;
  label p  = 'Proportion Cured'
        LB = 'Lower Bound'
        UB = 'Upper Bound';
run;
Output 10 shows the results of the above data step when printed using PROC PRINT. While it is not within the scope of this paper to explain the details of the data step, the code is included here for illustrative purposes. From the printout we see that a 40% cure rate is included in the 95% confidence interval, and so we would conclude that there is no evidence that the combination of treatments is any different from the standard treatment.
Output 10. Printout of Approximate 95% Confidence Interval for a Single Proportion
Approximate 95% Confidence Interval For Proportion Cured

Proportion       Lower       Upper
     Cured       Bound       Bound
      0.56     0.36542     0.75458
In version 8 of SAS, the BINOMIAL option on the TABLES statement will yield similar results to those in Output 10 without any extra steps. Output 11 shows the results of running the following PROC FREQ in version 8. Note that the EXACT statement has also been included. This is available in version 8 and is used here to obtain the exact test in addition to the test using the normal approximation.
proc freq data=ex13_1 order=data;
  weight count;
  EXACT BINOMIAL;
  tables cured / nocum BINOMIAL (P=0.4);
run;
Notice that the TABLES statement above includes (P=0.4) in addition to BINOMIAL. This specifies that the null hypothesis to be tested is H0: p=0.4. If (P=value) is omitted then the null hypothesis tested will be H0: p=0.5.
Output 11. Results with BINOMIAL Option and EXACT statement
            The FREQ Procedure

cured         Frequency   Percent
---------------------------------
YES                  14     56.00
NO                   11     44.00

   Binomial Proportion for cured = YES
   -----------------------------------
   Proportion (P)                0.5600
   ASE                           0.0993
   95% Lower Conf Limit          0.3654
   95% Upper Conf Limit          0.7546

   Exact Conf Limits
   95% Lower Conf Limit          0.3493
   95% Upper Conf Limit          0.7560

   Test of H0: Proportion = 0.4

   ASE under H0                  0.0980
   Z                             1.6330
   One-sided Pr > Z              0.0512
   Two-sided Pr > |Z|            0.1025

   Exact Test
   One-sided Pr >= P             0.0778
   Two-sided = 2 * One-sided     0.1556
There are 2 sets of confidence limits given in Output 11. The first set, identical to the limits in Output 10, is an approximation based on the normal distribution. The next set, labeled Exact Conf Limits, is based on the binomial distribution.
Results based on exact tests are especially desirable when the sample size is small.
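The asymptotic quantities in Output 11 can be checked by hand. The data step below is a sketch, not part of the original example: it assumes the usual normal approximation for a single proportion, and every variable name in it is made up for the illustration.

```sas
data _null_;
  n    = 25;                        /* patients treated */
  p    = 14 / n;                    /* observed proportion cured */
  p0   = 0.4;                       /* null value given by BINOMIAL (P=0.4) */
  ase  = sqrt(p*(1-p)/n);           /* standard error used for the confidence limits */
  ase0 = sqrt(p0*(1-p0)/n);         /* standard error computed under H0 */
  z    = (p - p0) / ase0;           /* test statistic */
  p_one = 1 - probnorm(z);          /* one-sided Pr > Z */
  p_two = 2 * p_one;                /* two-sided Pr > |Z| */
  put ase= 6.4 ase0= 6.4 z= 6.4 p_one= 6.4 p_two= 6.4;
run;
```

The values written to the log can be compared with the ASE, ASE under H0, Z and the two asymptotic p-values in Output 11. The exact limits and the exact test come from the binomial distribution and are not reproduced by this sketch.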
TWO-WAY TABLES
The 2x2 Table
The simplest crosstabulation is a 2x2 (said "2 by 2") table. It is called a 2x2 table because it is a crosstabulation of two categorical variables, each with only two categories. In a crosstabulation, one variable will be the row variable and the other will be the column variable.
The general form of a 2x2 crosstabulation is shown in Figure 1. There are 4 cells. A cell is one combination of categories formed from the column and row variables.
Figure 1. A 2x2 table
                        Column Variable
                          1        2
Row Variable      1       a        b        r1
                  2       c        d        r2
                          c1       c2       n
In Figure 1, a is the number of cases that are in category one of the column variable and category one of the row variable, while b is the number of cases in category two of the column variable and category one of the row variable. The total number of cases in category one of the row variable is r1 = a+b, while the total number of cases in the second category of the row variable is r2 = c+d. Similarly, the total number of cases in the first and second categories of the column variable are c1 = a+c and c2 = b+d respectively. The total number of cases is n = a+b+c+d = r1+r2 = c1+c2.
Consider the RESPIRE data described on page 41 of Categorical Data Analysis Using the SAS System. The variables in the data set are TREATMNT, a character variable containing values for treatment (placebo or test), RESPONSE, another character variable containing the value of response defined as respiratory improvement (y=yes, n=no), CENTER, a numeric variable containing the number of the center at which the study was performed, and COUNT, a numeric variable containing the number of observations that have the respective TREATMNT, RESPONSE and CENTER values. To obtain the crosstabulation of treatment with response shown in Output 12, we use the following PROC FREQ statements:
proc freq data=respire;
  weight count;
  tables treatmnt*response;
run;
Note that the WEIGHT statement was used and that there were no options specified on the TABLES statement.
Output 12. Respire Data: 2x2 Table of TREATMNT*RESPONSE
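TABLE OF TREATMNT BY RESPONSE

TREATMNT     RESPONSE

Frequency|
Percent  |
Row Pct  |
Col Pct  |n       |y       |  Total
---------+--------+--------+
placebo  |     52 |     38 |     90
         |  28.89 |  21.11 |  50.00
         |  57.78 |  42.22 |
         |  68.42 |  36.54 |
---------+--------+--------+
test     |     24 |     66 |     90
         |  13.33 |  36.67 |  50.00
         |  26.67 |  73.33 |
         |  31.58 |  63.46 |
---------+--------+--------+
Total          76      104      180
            42.22    57.78   100.00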
In general, to obtain a 2x2 table using PROC FREQ, simply include the statement
TABLES R*C ;
where R and C represent the row and column variables respectively.
Percent, Row Percent, Col Percent
The upper left hand corner of the table contains a legend of what the numbers inside each cell represent.
Let us consider the second cell of the table in Output 12 as a means of reviewing what is meant by percent, row percent and column percent (shown as col percent). The second cell is the combination of 'Yes Response' and 'Placebo', i.e. it represents the patients who were given placebo and responded favorably anyway. There were 38 cases (frequency) in which placebo was given as the treatment and the patient responded. Thirty-eight cases is 21.11% (percent) of the 180 cases. The row percent of 42.22% indicates that 42.22% (38/90) of the cases receiving placebo responded, and the column percent of 36.54% indicates that 36.54% (38/104) of the cases responding were given placebo.
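The three percentages differ only in their denominators. As a small worked check, with the counts copied from Output 12 and variable names invented for the example:

```sas
data _null_;
  cell   = 38;     /* placebo patients who responded */
  rowtot = 90;     /* all placebo patients */
  coltot = 104;    /* all patients who responded */
  n      = 180;    /* all patients */
  percent = 100 * cell / n;        /* share of the whole table */
  row_pct = 100 * cell / rowtot;   /* share of the row (placebo) */
  col_pct = 100 * cell / coltot;   /* share of the column (responders) */
  put percent= 6.2 row_pct= 6.2 col_pct= 6.2;
run;
```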
Suppose we are interested in comparing the response rates between treatments. In this case we may want to ignore the column percent as well as the overall percent. The printing of the percent and column percent in each cell can be suppressed by specifying the NOPERCENT and NOCOL options on the TABLES statement. Output 13 shows the result of running the following statements:
proc freq data=respire;
  weight count;
  tables treatmnt*response / nocol nopercent;
run;
Output 13. 2x2 Table with Percent and Column Percent Values Suppressed
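TABLE OF TREATMNT BY RESPONSE

TREATMNT     RESPONSE

Frequency|
Row Pct  |n       |y       |  Total
---------+--------+--------+
placebo  |     52 |     38 |     90
         |  57.78 |  42.22 |
---------+--------+--------+
test     |     24 |     66 |     90
         |  26.67 |  73.33 |
---------+--------+--------+
Total          76      104      180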
The percent and column percent have been eliminated from the output, making it much more readable. Notice also that the margin percent values have been suppressed.
If the table had been specified with the TREATMNT as the column variable then we could use the NOROW option instead of the NOCOL. The following statements
proc freq data=respire;
  weight count;
  tables response*treatmnt / norow nopercent;
run;
result in the output shown in Output 14.
Output 14. Percent and Row Percent Values Suppressed
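TABLE OF RESPONSE BY TREATMNT

RESPONSE     TREATMNT

Frequency|
Col Pct  |placebo |test    |  Total
---------+--------+--------+
n        |     52 |     24 |     76
         |  57.78 |  26.67 |
---------+--------+--------+
y        |     38 |     66 |    104
         |  42.22 |  73.33 |
---------+--------+--------+
Total          90       90      180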
Using NOCOL, NOROW and NOPERCENT on the TABLES statement allows the output to be customized to suit the needs of the situation. Of course, there may be times when a two-way table of counts is needed just to check the data (Delwiche and Slaughter), in which case the three options can be specified together in order to suppress all but the actual counts.
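For instance, a counts-only version of the RESPIRE table could be requested as follows. This is simply the earlier example with all three suppression options combined:

```sas
proc freq data=respire;
  weight count;
  /* suppress row, column and overall percents so that only counts remain */
  tables treatmnt*response / norow nocol nopercent;
run;
```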
The ORDER= Option
Just as it affected the order in which the rows of a one-way table are printed, using the ORDER= option on the PROC FREQ statement affects the order in which the columns or rows are printed. The effect is similar to that on the one-way tables except that it affects the column order as well as the row order. For example, if in the RESPIRE data the first observation had a response value of 'y' and treatmnt value of 'test', then including ORDER=DATA on the PROC FREQ statement that produced the table in Output 14 would now produce a table in which the row and column order are opposite of what they were (Output 15).
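The statements for this variation are not listed in the paper. Assuming they are the Output 14 statements with ORDER=DATA added, they would read:

```sas
proc freq data=respire order=data;   /* levels appear in the order first encountered */
  weight count;
  tables response*treatmnt / norow nopercent;
run;
```

The resulting order depends entirely on how the observations are arranged in the RESPIRE data set, which is why the text above states it as a condition.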
Output 15. 2x2 Table when ORDER=DATA is used (Compare to Output 14)
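TABLE OF RESPONSE BY TREATMNT

RESPONSE     TREATMNT

Frequency|
Col Pct  |test    |placebo |  Total
---------+--------+--------+
y        |     66 |     38 |    104
         |  73.33 |  42.22 |
---------+--------+--------+
n        |     24 |     52 |     76
         |  26.67 |  57.78 |
---------+--------+--------+
Total          90       90      180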
If ORDER=INTERNAL is used, which is the default, then regardless of the order the data is in when read by the DATA step, the rows and columns will be in ascending order (numerical or alphabetical) of the unformatted values.
ORDER=FORMATTED will cause PROC FREQ to use the formatted values, if any, to determine the order of rows or columns, and ORDER=FREQ will order the columns and rows by descending observed column and row frequencies respectively.
Use of the ORDER= option can be helpful in forcing a particular order on the table. However, as will be discussed later, the order of the rows and columns has a direct effect on the interpretation of many of the statistical measures and tests produced by other options.
More Output Control
The CUMCOL option will add cumulative column percents to the table output.
Specifying NOFREQ will suppress the printing of the cell counts, and NOPRINT will suppress printing of the table in the output. NOPRINT will not suppress the printing of any requested statistical tests or measures.
The MISSING option causes missing values to be interpreted as nonmissing and hence be included in any calculations of statistics and analysis while the MISSPRINT option will include any missing values in the table but exclude them from any calculation of statistics.
The LIST option causes the table to be printed in a list format rather than as a crosstabulation. Expected values, which will be discussed a little later, will not be printed if LIST is used. This option will be ignored if any statistical tests or measures are requested.
The SPARSE option is used to have all possible combinations of levels of the variables printed when using LIST or saved to an output data set when using OUT=. This option does not affect the printed output unless used with LIST.
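A brief sketch of the list layout for the RESPIRE table follows. SPARSE is included only to show where it goes; it makes a visible difference only when some combination of levels has a zero count.

```sas
proc freq data=respire;
  weight count;
  /* one printed line per TREATMNT/RESPONSE combination instead of a grid */
  tables treatmnt*response / list sparse;
run;
```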
The Chi-Square Test
A Chi-Square test can be used to test the null hypothesis that two categorical variables are not associated, i.e. they are independent of each other. The categorical variables need not be ordinal. In general, the Chi-Square test involves the cell frequencies and the expected frequencies. The expected frequencies are the counts we would expect if the null hypothesis of no association is true.
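Under the null hypothesis, the expected count for a cell is its row total multiplied by its column total and divided by the overall total, and the Pearson statistic adds up (observed - expected)**2 / expected over all cells. The data step below is a hand calculation offered as a sketch; it uses the RESPIRE counts from Output 12, and the array and variable names are made up for the illustration.

```sas
data _null_;
  /* observed counts: rows = placebo, test; columns = n, y (from Output 12) */
  array obs{2,2} _temporary_ (52 38 24 66);
  array r{2} _temporary_ (0 0);     /* row totals */
  array c{2} _temporary_ (0 0);     /* column totals */
  n = 0;
  do i = 1 to 2;
    do j = 1 to 2;
      r{i} = r{i} + obs{i,j};
      c{j} = c{j} + obs{i,j};
      n    = n + obs{i,j};
    end;
  end;
  chisq = 0;
  do i = 1 to 2;
    do j = 1 to 2;
      expected = r{i} * c{j} / n;   /* count expected if there is no association */
      chisq = chisq + (obs{i,j} - expected)**2 / expected;
      put i= j= expected= 6.2;
    end;
  end;
  put chisq= 8.3;                   /* Pearson Chi-Square */
run;
```

PROC FREQ prints the same expected counts when the EXPECTED option is added to the TABLES statement.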
As an example, consider again the RESPIRE data as shown in Output 15. Of the subjects receiving the 'test' treatment, 73.33% responded favorably (i.e. showed respiratory improvement) while only 42.22% of those receiving placebo responded favorably. Is the difference in response rates statistically significant? To answer the question we can perform a chi-square test by using the CHISQ option on the TABLES statement. Output 16 shows the results obtained from running the following:
proc freq data=respire;
  weight count;
  tables response*treatmnt / CHISQ norow nopercent;
run;
Only the results specific to the CHISQ option are shown in Output 16, as the resulting two-way table is already shown in Output 15. What is very noticeable is that the CHISQ option has produced more than one Chi-Square test result.
There are 4 headings in the STATISTICS results of the output. Below the Statistic heading is the name of each statistic from the associated statistical test. The degrees of freedom for each statistic are shown below the DF heading. The value of the test statistic and the p-value are printed under the Value and Prob headings respectively. The test statistics reported are the Chi-Square, the Likelihood Ratio Chi-Square, the Continuity Adjusted Chi-Square and the Mantel-Haenszel Chi-Square statistics.
The p-values for Fisher's Exact Test are also given.
Output 16. Results obtained with the CHISQ Option
STATISTICS FOR TABLE OF RESPONSE BY TREATMNT

Statistic                       DF      Value       Prob
--------------------------------------------------------
Chi-Square                       1     17.854      0.001
Likelihood Ratio Chi-Square      1     18.195      0.001
Continuity Adj. Chi-Square       1     16.602      0.001
Mantel-Haenszel Chi-Square       1     17.755      0.001
Fisher's Exact Test (Left)                         1.000
                    (Right)                     1.99E-05
                    (2-Tail)                    3.99E-05
Phi Coefficient                         0.315
Contingency Coefficient                 0.300
Cramer's V                              0.315

Sample Size = 180
The last three statistics, for which there are no degrees of freedom or pvalues printed, are measures of association: the Phi Coefficient, the Contingency Coefficient and Cramer's V.
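For a 2x2 table, all three measures can be derived from the Pearson Chi-Square and the sample size. The sketch below applies the usual textbook formulas to the values in Output 16; the variable names are invented for the example, and the first formula gives only the magnitude of phi, not its sign.

```sas
data _null_;
  chisq = 17.854;                          /* Pearson Chi-Square from Output 16 */
  n     = 180;                             /* sample size from Output 16 */
  phi    = sqrt(chisq / n);                /* magnitude of the Phi Coefficient */
  contin = sqrt(chisq / (chisq + n));      /* Contingency Coefficient */
  cramer = sqrt(chisq / (n * 1));          /* Cramer's V; min(rows-1, cols-1) = 1 here */
  put phi= 6.3 contin= 6.3 cramer= 6.3;
run;
```

Because the smaller of (rows - 1) and (columns - 1) is 1 in a 2x2 table, Cramer's V and the magnitude of phi coincide, which is why they share the same value in Output 16.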