P-STAT, Inc. Release V223.1

------------HELP ON turf------------------

TURF stands for Total Unduplicated Reach and Frequency. It is most often used in market research applications. Last updated: July 9, 2005.

For example, a file has the responses of 1000 cases on 40 items. Each item is a TV program which the respondent did or did not watch. You would like to know which group of 8 programs reaches the largest number of respondents. A respondent is "reached" by a given group of programs if they watched at least one of the programs in the group.

TURF, to find the best group, evaluates each of the 76.9 million combinations of 8 items taken from a pool of 40, and writes a file identifying the 100 best combinations. This takes about a minute on a 2.4 GHz PC.

Weighting can be applied to the cases, or to the response values, or to the items. The default criterion for being reached, one positive response, can be raised. Adding options affects speed. The above one-minute run takes 12 minutes if case weights are used, and would take about 30 minutes if the other options are used.

****************************
*  reach versus frequency  *
****************************

Reach and frequency are different measurements in TURF.

The REACH score for a combination is the number of cases that have at least N positive responses on the variables in that combination. N is the reach threshold in use, which has a default setting of one.

The FREQ score for a combination is, for the reached cases, normally the total number of positive responses on the variables in the combination. The response.weights option causes the FREQ score to be the sum of the positive response values.

If a case indeed watched all eight programs in a group, the frequency count for that group would be increased by 8, but the reach count would only be increased by 1. If the input were hours watched rather than just watched or not, the response.weights option would sum the response values and thereby measure the impact of the combination.
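The exhaustive search and the difference between the two scores can be illustrated with a minimal Python sketch. This is not P-STAT code and not the TURF implementation; the item names, the six-case dataset and the function names are hypothetical, and only the default options (threshold of one, no weights) are modelled.

```python
from itertools import combinations

# Hypothetical 0/1 responses: one row per respondent, one column per program.
ITEMS = ["p1", "p2", "p3", "p4"]
RESPONSES = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

def score(combo, rows, threshold=1):
    """Return (reach, freq) for one combination of column indexes."""
    reach = freq = 0
    for row in rows:
        hits = sum(1 for j in combo if row[j] > 0)
        if hits >= threshold:      # the case is reached
            reach += 1             # reach goes up by 1 per reached case
            freq += hits           # freq goes up by every positive response
    return reach, freq

def best_combinations(rows, size, keep=100):
    """Score every combination of `size` items; best reach first, ties by freq."""
    scored = []
    for combo in combinations(range(len(ITEMS)), size):
        reach, freq = score(combo, rows)
        scored.append((reach, freq, [ITEMS[j] for j in combo]))
    scored.sort(key=lambda t: (-t[0], -t[1]))
    return scored[:keep]

for reach, freq, names in best_combinations(RESPONSES, size=2):
    print(reach, freq, names)
```

In this toy data the pair p1-p2 has the highest frequency (5) but reaches only 3 cases, while p2-p3 reaches 4 cases with a frequency of 4, which is why the two scores are reported separately.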
**********************************
*  features of the TURF command  *
**********************************

Allows up to 210 items (i.e., variables).
Allows combinations of items up to size 60.
Allows several combination sizes to be done in one run.
Allows tens of thousands of cases.
Allows weighting of cases.
Allows weighting of items.
Allows weighting of responses; this allows the intensity of a response to be utilized.
Allows setting a reach threshold of more than one.
Allows forcing designated items into every combination.
Allows limits on how many of a set of items can be placed in the combinations to be analyzed.
Writes a result file containing the best combinations.
    The items within each combination are ordered by their importance.
Writes a template file for use by the TURF.SCORES command.
Writes (in TURF.SCORES) the reach score for each case.
Takes 3.1 seconds for 1,000 cases on 40 items, 6 at a time on a 2.4 GHz PC.
    This evaluates 3.838 million groups.
Runs so rapidly that billions of combinations can be done.
Takes 18.5 minutes for 1,000 cases on 100 items, 6 at a time on a 2.4 GHz PC.
    This evaluates 1.192 billion groups.
    Note: using all options would take about 9 hours.
Shows the percent of combinations already processed in a progress window.
Writes a detailed report when the command finishes.

*******************************
*  a simple dataset, used     *
*  in some examples that      *
*  show various TURF options  *
*******************************

The following dataset has a case identifier (case.id), a case weight (www), and the responses to five variables. The responses are zeros, ones and two twos. This dataset is used to show various TURF options. The twos are treated differently from ones only when the RESPONSE.WEIGHTS option is in use.
-----------------file ddd--------------

case.id  www  v1  v2  v3  v4  v5

   9001    1   1   1   0   0   0
   9002    1   1   1   0   0   0
   9003    1   1   1   0   0   0
   9004    1   1   1   0   0   0
   9005    1   1   1   0   0   0
   9006    1   1   1   0   0   0
   9007    1   0   1   0   0   0
   9008    1   0   0   1   0   0
   9009    1   0   0   1   0   0
   9010    1   0   0   1   0   0
   9011    1   0   0   1   0   0
   9012    1   0   0   0   1   0
   9013    1   0   0   0   1   0
   9014    1   0   0   0   1   0
   9015    3   0   0   0   0   2
   9016    3   0   0   0   0   2

*******************************
*  example 1:                 *
*  a TURF run using defaults  *
*******************************

    turf ddd [drop case.id www], size 3, reach.results rrr $
    list rrr $

This command uses PPL (P-STAT Programming Language) to drop the first two variables, leaving five items for the turf analysis. Using [keep v1 to v5] would have done the same thing.

SIZE 3 tells the command to look at the items in groups of 3. REACH.RESULTS RRR creates a file named RRR that identifies the best groups. That file is then listed.

Ten groups will be tested: v1-v2-v3, v1-v2-v4, etc. through v3-v4-v5. We are looking for the group that has at least one positive response for the largest number of cases. In other words, which group of three items REACHES the most cases?

The best combination uses variables v2, v3, and v4, which reach 14 of the 16 cases. The FREQ score for that combination is also 14. V1 is not included because it adds nothing once v2 is selected. V1-v2-v3 have the highest number of responses, but do not reach as many different cases as the group of v2-v3-v4. The twos in cases 9015 and 9016 on v5 are treated as ones.

*************************************
*  example 2:                       *
*  a TURF run using case weighting  *
*************************************

    turf ddd [drop case.id], size 3, reach.results rrr, case.weights www $
    list rrr $

Here, WWW is not dropped because it is needed in the command. We are seeking the group of three variables that has the largest WEIGHTED number of reached cases. Cases 9015 and 9016 are the only cases with a caseweight other than one, so they are the ones affected by case weights in this example.
The best group is v2-v3-v5, which reaches 17 weighted cases. V2 and V3 provide 11 cases (all of which had unit weights). Adding v4 would provide 3 more, but adding v5 increases the weighted reach count by 6, since each of those two cases has a caseweight of three on WWW. The FREQ score for that combination is also 17.

*****************************************
*  example 3:                           *
*  a TURF run using response weighting  *
*  and a threshold of more than one     *
*****************************************

    turf ddd [drop case.id www], size 3, reach.results rrr,
         response.weights, reach.threshold 2 $
    list rrr $

In examples 1 and 2, the responses to the items were treated in a zero versus nonzero manner. Using RESPONSE.WEIGHTS causes the actual response values to contribute to the reach scores. In addition, using REACH.THRESHOLD 2 causes a case to be reached only when its reach score for a given group is 2 or more.

The best group in this example is v1-v2-v5, which reaches 8 cases. Cases 9001 through 9006 were reached by a response of 1 to both v1 and v2; cases 9015 and 9016 were reached because of responses of 2 on v5. The FREQ score for that combination is 16.

**************************************
*  example 4:                        *
*  a TURF run using item weighting   *
*  and a threshold of more than one  *
**************************************

    /* create a file containing a weight value of 2 for item v3 */

    build work1, vars name:c weight;
    v3 2 $

    turf ddd [drop case.id www], size 3, reach.results rrr,
         item.weights work1, reach.threshold 2 $
    list rrr $

Normally each item has a weight of one; each has the same contribution to a reach score. It is possible, however, to make some items worth more than others, possibly reflecting, for example, differences in costs of the items.

In this example, item v3 is weighted. This is conveyed in file WORK1 whose record defines a weight of 2 for item v3. This causes responses on v3 to be worth twice what they would otherwise be worth. As in example 3, a threshold of 2 is used.
The best group in this example is v1-v2-v3, which reaches 10 cases. Cases 9001 through 9006 achieve a reach score of 2 using items v1 and v2. Cases 9008 through 9011 have reach scores of 2 because of the item weight given to v3. The FREQ score for that combination is 20.

*************************
*  general identifiers  *
*************************

TURF xxx,
    this supplies the input filename. Except for an optional weight variable, all variables are treated as analysis items. The values on the analysis items should be zeros or positive numbers. A positive value signifies a "hit".

    Cases with any missing or negative values on the analysis items are ignored. The SET.MISS.TO.ZERO identifier, described below, sets missing analysis items to zeros.

    When case weighting is being used, any cases with a missing, negative or zero value on the weight variable are also ignored.

SIZE 6,
    what size combinations to use. Required.

SIZE 4 to 7,
SIZE 6 to 3,
SIZE 4 6 8,
    One or more sizes can be done in one run. They are done in the order given; for SIZE 6 to 4, size 6 is done first. The final report shows the result for each size separately. The output files show the best results from the first size, then the second, and so forth.

    Many (20 or more) sizes can be done in a run; each size must be from 1 to 60, and there should not be any repeated sizes in a run.

    Note: some sizes cannot be run in a reasonable amount of time. Consider 40 items. Depending on number of cases and on options:

        Size  4 takes 91,390 iterations.       Seconds.
        Size  6 takes 3.8 million iterations.  Minutes.
        Size 10 takes 847 million iterations.  An hour.
        Size 15 takes 40 billion iterations.   A day.
        Size 20 takes 137 billion iterations.  A week.

    This command produced the above numbers.

        DO #j = 1, 20; PUT #j (combinations( 40,#j)); ENDDO $

    The F2 key can be used to cause a TURF command to abandon the current size being processed. It will produce the report and the output files for the sizes already completed.

REACH.THRESHOLD 2,
    optional.
    Can be fractional. This permits the user to control what constitutes a successful "reach". The default is one; if a case has a positive response on any of the items in a given combination, that case is added to the reach total for that set of items.

    Using REACH.THRESHOLD 3, for example, means a case needs a reach score of 3 or more to have been reached on a given group. Having several responses increases a case's reach score; weighting of either items or responses can also affect the reach score.

PROGRESS 5,
    optional. Controls how often the progress window or report line is updated. The default is 1, which means every million combinations. PROGRESS 0 turns it off.

SET.MISS.TO.ZERO,
    optional. If used, missing analysis values in the input file are set to zeros. If needed, this saves having to write some PPL as the file is read.

*****************************************
*  identifiers that control the makeup  *
*  of the combinations to be used       *
*****************************************

USE list-of-vars min max,
    optional. This provides a limitation on the makeup of the combinations to be tried. Of the variables whose names (or ranges) follow USE, at least MIN of them and at most MAX of them should be in every combination that will be tried. The MIN value can be zero.

    Up to 30 such USE phrases can be given. Combinations are used only if they pass the constraints in every one of the USE phrases.

    Each use of USE is followed by:

    (1) The names of the variables in the group. Ranges, like TOPPING.1 TO TOPPING.8, can be used.

    (2) The smallest number of those variables that are required. Can be zero. A combination must have AT LEAST that many of the variables in the group.

    (3) The largest number of those variables that may be used. A combination may have AT MOST that many of the variables in the group. All of the group could be used if the supplied number is equal to or larger than the size of the group.
        Therefore, using 999 is a vivid way of saying there is no upper limit for the group.

    For example:

        TURF xxx, size 8,
             use aaa bbb to ddd 1 999,
             use eee to ggg jjj to mmm 2 4,
             use yyy zzz 0 1 $

    In the above command, the only combinations that will be evaluated are those that have at least one variable from the first group, and at least two but no more than four variables from the second group, and no more than one variable from the third group.

FORCE vars,
    optional. Names or ranges of items that should be part of every combination. Suppose there are 30 items and size is 6; without FORCE, 593,775 combinations are done, because we take 30 items 6 at a time. If 2 items are forced, only 20,475 combinations will be done because the run reduces to 28 items taken 4 at a time. If size is 6 and all 6 items are forced, just that one pass will be done.

*****************************
*  identifiers for various  *
*  kinds of weighting       *
*****************************

CASE.WEIGHTS varname,
    optional. The named variable will be used as a caseweight, and not as an analysis item.

ITEM.WEIGHTS filename,
    the default is to treat all of the items the same, i.e., with weights of 1. When ITEM.WEIGHTS is used, it should be followed by the name of a P-STAT system file which itself has exactly 2 variables.

    In each record, the first variable has the name of an item being used for the TURF analysis, the second is the weight to be used for that item. The first variable is therefore character, and the second is numeric.

    The file is not required to have a record for every item. In other words, some items can be given changed weights; others can be left as is (i.e., still set to 1). The file can have names and weights for items not used in the current run; if so, they are ignored.

RESPONSE.WEIGHTS,
    the default is to store the input data as zeros or ones, with one meaning a yes.
    This option leaves the input values intact; each should be zero (no) or a positive value (not necessarily an integer) that shows the INTENSITY of a yes.

****************************
*  the REACH.RESULTS file  *
****************************

REACH.RESULTS rrr 300,
    optional output P-STAT system file. This file holds the combinations with the best REACH values. They are in descending order on REACH. Within ties on reach, the combinations are in descending order on FREQ. The item names in a combination are ordered by the reach contribution that each in turn adds.

    The default is to write the 100 best combinations for each size. If an integer like 300 follows the file name, that many are written for each size.

    Each combination will take from 1 to 5 lines, as determined by the REACH.DETAILS identifier, described below. The default is two lines: one for the item names, the second for the cumulative reach for each successive item.

    The names of the variables in the REACH.RESULTS file itself are these. Note, some (or all) of the initial 6 can be dropped by using the OMIT identifier, described below.

    (1) SIZE: the combination size.

    (2) RANK: the rank within size.

    (3) REACH: the reach value for the combination.

    (4) PCT.REACHED: the percent of usable cases reached by the combination. The usable cases are the cases with no invalid data. This includes cases with no responses whatsoever.

    (5) PCT.OF.MAX.REACH: the percent of active cases reached by the combination. An active case is a usable case that has at least one positive response; other cases cannot possibly be reached.

    (6) FREQ: the freq value for the combination.

    (7+) ITEM.1, ITEM.2, ITEM.3, etc.: These variables contain the names of the items that make up a combination. The item names are ordered by their contribution to the reach score. I.e., the name appearing under ITEM.1 is the 'best' item in the combination. If sizes 6 and 8 are both being done, the file will have item.1 through item.8.
        The results for size 6 will have blanks for item.7 and item.8.

********************************
*  the REACH.RESULTS file:     *
*  the items in a combination  *
*  are ordered by importance   *
********************************

Suppose 4 items, AA, BB, CC and DD, comprise a combination about to be written to the reach.results file. Before writing them, they are reordered so that the leftmost item is the one with the highest individual reach. The next item shown has, when paired with the leftmost item, the largest 2-item reach score, and so on.

The reordering is done in this manner. Each of the four variables is used in a size 1 pass to see which has the best standalone reach. Suppose it was CC. CC is placed in the ITEM.1 column.

Now, given CC is best, which item is next? It is the item that, when paired with CC, provides the best increase in reach. This is done by making three size 2 passes over the data, using CC-AA, CC-BB and CC-DD. Again, we take the best result. Suppose it is CC-BB. BB is therefore placed in the second position (in the ITEM.2 column).

Now we try CC-BB-AA and CC-BB-DD to see which of AA and DD should be in the ITEM.3 column. The remaining item in this stepwise procedure goes into the ITEM.4 column.

*************************************
*  the REACH.RESULTS file:          *
*  TURF can be flummoxed by small,  *
*  carefully constructed data sets  *
*************************************

It should be noted that selecting the best two items in a stepwise manner is not quite the same as selecting the best two by trying all possible pairs.

Suppose we have a file of 14 cases. Again, there are 4 items: AA, BB, CC and DD. We would like to find the 'best' two items.

    AA reaches cases 1-10,
    BB reaches cases 11-13,
    CC reaches cases 1-5 and 11-12,
    DD reaches cases 6-10 and 13-14.

The stepwise approach selects AA and, having AA in hand, adds BB to get its best two items. They have a reach of 13. A non-stepwise approach tries all combinations of size 2 and would select CC and DD.
They have a reach of 14.

The TURF command uses a stepwise procedure only in the REACH.RESULTS (and FREQ.RESULTS) reordering; otherwise all runs are done trying every possible combination of the size being analyzed.

****************************************
*  the REACH.RESULTS file:             *
*  using OMIT to drop some             *
*  (or all) of the first 6 statistics  *
****************************************

OMIT size pct.of.max.reach,
    The default is for the reach.results and freq.results files to have six numeric values before the items appear. These are:

        SIZE  RANK  REACH  PCT.REACHED  PCT.OF.MAX.REACH  FREQ

    An OMIT phrase can be used to drop any number of them, including all of them. This may reduce the number of print passes needed to list the file. One OMIT phrase applies to both results files.

    OMIT, in other words, can be used to produce a better looking listing. In LIST itself, using BLANK.MISSING would convert the dashes that represent missing in LIST output into blanks. Also, using SKIP 2 when there is one extra line helps appearances.

***************************************
*  the REACH.RESULTS file:            *
*  using REACH.DETAILS to select      *
*  which (if any) extra lines should  *
*  be written for each combination    *
***************************************

REACH.DETAILS cumulative.pct,
    When a reach.results file is written, the items within each combination are ordered by their reach contribution within the combination. This is always done. In addition, an extra line is written for each group which shows the cumulative reach as each item is added. That is the default, but it can be changed.

    As many as four extra lines are possible:

    (1) cumulative, the increasing reach as each successive item is added. This is the default.

    (2) separate, which has the additional reach provided by each successive item.

    (3) cumulative.pct, the percent of the cases reached as each item is added.

    (4) separate.pct, the additional percent of cases reached by each successive item.
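The stepwise reordering and the cumulative and separate detail lines can be sketched in Python using the 14-case AA/BB/CC/DD example described earlier. This is an illustration only, not the TURF implementation: the function names are hypothetical, weights and thresholds are ignored, and the tie-breaking rule (the item listed first wins) is an assumption, since this help text does not say how ties are resolved.

```python
from itertools import combinations

# The 14-case example: which cases each item reaches (case numbers 1..14).
REACHED_BY = {
    "AA": set(range(1, 11)),             # cases 1-10
    "BB": {11, 12, 13},                  # cases 11-13
    "CC": set(range(1, 6)) | {11, 12},   # cases 1-5 and 11-12
    "DD": set(range(6, 11)) | {13, 14},  # cases 6-10 and 13-14
}

def reach(items):
    """Unweighted reach with threshold one: size of the union of reached cases."""
    reached = set()
    for name in items:
        reached |= REACHED_BY[name]
    return len(reached)

def stepwise_order(items):
    """Order items so each next item adds the most reach to those already chosen."""
    remaining, ordered = list(items), []
    while remaining:
        # Assumption: ties go to the item listed first; the source does not say.
        best = max(remaining, key=lambda name: reach(ordered + [name]))
        ordered.append(best)
        remaining.remove(best)
    return ordered

order = stepwise_order(["AA", "BB", "CC", "DD"])
cumulative = [reach(order[: i + 1]) for i in range(len(order))]
separate = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
best_pair = max(combinations(REACHED_BY, 2), key=reach)

print(order)        # ['AA', 'BB', 'DD', 'CC']
print(cumulative)   # [10, 13, 14, 14]
print(separate)     # [10, 3, 1, 0]
print(best_pair)    # ('CC', 'DD')
```

The first two stepwise items, AA and BB, reach 13 cases, while the exhaustive pair search finds CC and DD with a reach of 14, matching the example above.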
    REACH.DETAILS can be followed by:

    (1) NONE by itself, no lines are written.

    (2) ALL by itself, 4 lines are written.

    (3) one or more of CUMULATIVE, SEPARATE, CUMULATIVE.PCT and SEPARATE.PCT. The requested lines would be written.

***************************
*  the FREQ.RESULTS file  *
***************************

FREQ.RESULTS fff 500,
    optional output P-STAT system file. This file holds the combinations with the best FREQ values. They are in descending order on FREQ. Within ties on FREQ, the rows are in descending order on REACH.

    The default is to write the 100 best combinations for each size. If an integer like 500 follows the file name, that many are written for each size.

    The item names in a combination are ordered by the freq contribution that each in turn adds.

    The FREQ.RESULTS file has the same variables as the REACH.RESULTS file.

***************************************
*  the FREQ.RESULTS file:             *
*  using FREQ.DETAILS to select       *
*  which (if any) extra lines should  *
*  be written for each combination    *
***************************************

FREQ.DETAILS cumulative.pct,
    When a freq.results file is written, the items within each combination are ordered by their freq contribution within the combination. This is always done. In addition, an extra line is written for each group which shows the cumulative freq as each item is added. That is the default, but it can be changed.

    Two extra lines are possible:

    (1) cumulative, the increasing total freq as each successive item is added. Default.

    (2) separate, which has the additional freq provided by each successive item.

    FREQ.DETAILS can be followed by:

    (1) NONE by itself, no lines are written.

    (2) ALL by itself, 2 lines are written.

    (3) one or both of CUMULATIVE and SEPARATE. The requested lines would be written.

********************************************
*  other optional output file identifiers  *
********************************************

REACH.SUMMARY qqq 200,
    optional output P-STAT system file.
    This file tells you how many combinations had each of the reach values that were found. The default is to write the 100 best reach values for each size being processed. If an integer like 200 follows the file name, that many would be written for each size.

    Each row in the reach.summary file contains:

    (1) SIZE: the combination size.
    (2) RANK: the rank within size.
    (3) REACH: a reach value.
    (4) NUMBER.OF.COMBOS: the number of combinations that have that reach value.
    (5) PCT.OF.COMBOS: the percent of the combinations that have that reach value.
    (6) LOWEST.FREQ: the lowest freq value in the combinations at that reach value.
    (7) HIGHEST.FREQ: the highest freq value in the combinations at that reach value.

FREQ.SUMMARY qqq 200,
    optional output P-STAT system file. This file tells you how many combinations had each of the freq values that were found. The default is to write the 100 best freq values for each size being processed. If an integer like 200 follows the file name, that many would be written for each size.

    Each row in the freq.summary file contains:

    (1) SIZE: the combination size.
    (2) RANK: the rank within size.
    (3) FREQ: a freq value.
    (4) NUMBER.OF.COMBOS: the number of combinations that have that freq value.
    (5) PCT.OF.COMBOS: the percent of the combinations that have that freq value.
    (6) LOWEST.REACH: the lowest reach value in the combinations at that freq value.
    (7) HIGHEST.REACH: the highest reach value in the combinations at that freq value.

FULL.OUTPUT fff 200,
    optional output P-STAT system file. This has the results of combinations in the order that they were processed. The default is to write the results of the initial 5,000 combinations that were tried, regardless of size. If an integer like 25,000 follows the file name, up to that many would be written. Such a large number should be used with caution, and only if really needed, because this can create a very large file.

    Each row in the full.output file contains:

    (1 ) the reach value for the combination.
    (2 ) the FREQ value for the combination.
    (3+) the positions of the items that make up the combination. If the USE.NAMES identifier is also used, the names of the items will be used instead of the positions. Names, however, make the file larger.

************************************************
*  passing results to the TURF.SCORES command  *
*  using a TEMPLATE file                       *
************************************************

The TURF.SCORES command needs to know the items and options to be used in the scoring. These can be supplied within the TURF.SCORES command itself. However, if you want to score the best result from a TURF run, it is easier to have TURF write a template file which TURF.SCORES can read directly. This option cannot be used when several sizes are being run.

TEMPLATE ttt,
    optional output P-STAT system file. This contains the names of the items that comprised the best combination. It also contains information about the options (like weighting) that were used.

    This file can be given to the TURF.SCORES command, to score the cases on the combination contained in the file. TURF.SCORES then writes an output file that has the reach score for each case on that combination. It is then quite easy to investigate the demographics of the reached cases.

****************************
*  a simple TEMPLATE file  *
****************************

The following shows a typical template file for a combination of 5 items, with variable www being used as a caseweight, and no other options in use.

             item      case      response   reach
    items    weights   weights   weights    threshold

    VAR3     1         www       no         1
    VAR4     1         -         -          -
    VAR5     1         -         -          -
    VAR13    1         -         -          -
    VAR23    1         -         -          -

*********************************************
*  a typical final report produced by TURF  *
*********************************************

---------TURF analysis for file work2 completed----------
| OPTIONS: none                                         |
|                                                       |
| 100 items were used in the analysis.                  |
|                                                       |
| 1,000 cases were read and used.                       |
|                                                       |
| 973 cases had at least one positive response,         |
| making that the maximum possible reach.               |
|                                                       |
| SIZE 6 evaluated 1,192,052,400 combinations:          |
| 941 was the best REACH, found in 1 combination.       |
| 1,956 was the FREQ value in that combination.         |
| 1,983 was the best FREQ in any size 6 combination.    |
|                                                       |
| The FREQ score for a combination is the count         |
| of the non-zero responses for that combination,       |
| summed over the reached cases.                        |
|                                                       |
| REACH.RESULTS file work101 has the 100                |
| combinations with the highest reach scores.           |
| The items are ordered by their REACH contribution.    |
| Cumulative reach is shown.                            |
|                                                       |
| Time: 18 minutes, 35.5 seconds.                       |
---------------------------------------------------------

**********************************************
*  processing speed: cases and combinations  *
**********************************************

There are three components whose effect on the speed of a TURF run is more or less linear:

(1) more cases. Twice the cases for the same analysis will take twice the time.

(2) more combinations to be tested. 30 items taken 6 at a time (i.e., 30,6) has 593,775 combinations, 30,7 has 2,035,800. That is 3.4 times as many combinations. Therefore, it will take 3.4 times longer to run.

(3) CPU speed. Since the data and the results are held within memory during a run, going from an 800 MHz chip to a 2.4 GHz chip should be about 3 times faster.

**************************************************
*  processing speed: effects of various options  *
**************************************************

TURF speed is also greatly affected by the options chosen. The fastest run is one that uses none of the options: for example,

    TURF INFILE, SIZE 6, REACH.RESULTS OUTFILE $

If this takes one second, how much longer do the various options take?

(1) 10 seconds if adding just CASE WEIGHTING.

(2) 28 seconds if using response weighting, item weighting or reach threshold (with or without case weighting).
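The roughly linear scaling rules and the option factors above suggest a back-of-the-envelope run-time estimate. The Python sketch below is only a rough guide under stated assumptions: it scales the benchmark quoted in this help text (3.1 seconds for 1,000 cases, 40 items, size 6, no options, on a 2.4 GHz PC), treats the 10x and 28x option factors as fixed multipliers, and ignores CPU differences. The function and dictionary names are hypothetical.

```python
from math import comb

# Benchmark quoted in this help text: 1,000 cases, 40 items, size 6,
# no options, 3.1 seconds on a 2.4 GHz PC.
BASE_SECONDS = 3.1
BASE_CASES = 1000
BASE_COMBOS = comb(40, 6)          # 3,838,380

# Slowdown factors quoted above, relative to a run with no options.
OPTION_FACTOR = {"none": 1, "case_weights": 10, "other": 28}

def estimate_seconds(items, size, cases, options="none"):
    """Scale the benchmark linearly by combinations and cases, then by options."""
    combos = comb(items, size)
    return (BASE_SECONDS * (combos / BASE_COMBOS)
            * (cases / BASE_CASES) * OPTION_FACTOR[options])

# Estimated minutes for 100 items, size 6, 1,000 cases, no options.
print(round(estimate_seconds(100, 6, 1000) / 60, 1))

# Ratio of a 30,7 run to a 30,6 run (about 3.4, as stated above).
print(round(estimate_seconds(30, 7, 1000) / estimate_seconds(30, 6, 1000), 1))
```

For the 100-item, size 6 run this gives roughly 16 minutes where the text reports 18.5, so the estimate should be read as an order-of-magnitude guide; timing a small run on the actual machine, as the text recommends for large runs, is more reliable.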
************************************
*  processing speed: output files  *
************************************

The various output files take very little extra time. A 29,7 run for 500 cases with no output files took 2.5 seconds. Adding the default sized (best 100) REACH.RESULTS or FREQ.RESULTS files made the runtime 2.7 seconds; doing both took 2.8 seconds.

Asking for the 20,000 best REACH.RESULTS results instead of the default best 100 took 3.1 seconds instead of 2.7.

The REACH.SUMMARY and FREQ.SUMMARY output files take a little more time than the REACH.RESULTS and FREQ.RESULTS files.

The FULL.OUTPUT file takes almost no extra time since there is no sort management involved.

***********************************************
*  ------how the reach scoring is done------  *
*  determining if a case has been "reached"   *
*  on a given combination of items            *
***********************************************

INPUT FILE. A TURF run is done on an input file that contains some numeric items to be analyzed. The number of analysis items (NV) can be from 1 to 210, but will often be 20 or 30 or so. The file can contain thousands of cases.

CASE.WEIGHT. The file may also have a weight variable which provides a weight for each case. If not, a weight of one is assumed for each case.

SIZE. A combination size needs to be provided. This is the number of items to be examined in each pass over the data. Suppose NV is 30 and the size is 7. A pass over the data will be done for every different combination of 7 of the 30 items, causing 2,035,800 passes. In each pass, the number of cases that have been reached by the current combination of items is counted. The goal is to identify the combination that reaches the largest number of the cases.

RESPONSE.WEIGHTS. The response values themselves can be used as weights that reflect the intensity of the response. Using this option causes the responses for each case to be placed in memory without any change.
When response.weights is not used, the responses for each case are stored in 0/1 form; 0 if the response was indeed zero, and 1 if the response was any value greater than zero. This takes much less memory space.

ITEM.WEIGHTS. The items themselves are assumed to be equally important. In other words, the default is for each of the NV items to have a weight of one in the reach scoring. Different weights can be provided for some or all of the items. These are read from a file associated with the ITEM.WEIGHTS option.

REACH.THRESHOLD. Finally, the reach threshold, which defaults to one, can be changed to 3, for example, by saying REACH.THRESHOLD 3. The threshold can be fractional, like 3.5. A case is reached when its reach score equals or exceeds the reach threshold.

Suppose we are scoring a case on a combination that consists of items V2, V5, V11 and V17. Remember, the responses are stored internally as 0 or 1 except when the RESPONSE.WEIGHTS option is in use. The reach score for a given case is:

    V2 response  times V2's item weight, plus
    V5 response  times V5's item weight, plus
    V11 response times V11's item weight, plus
    V17 response times V17's item weight.

If that score equals or exceeds the reach threshold, the case's caseweight is added to the number of cases that have been reached for that combination. One is used when there is no CASE.WEIGHTS variable.

When responses and items are unweighted and the threshold is one, a case will have been reached when it has a positive response on any item in the combination.

**********************************************
*  ------how the freq scoring is done------  *
**********************************************

The FREQ score for a combination is the sum of the freq scores of the cases that were reached. If the case.weights option is not in use, consider each case to have a weight of one. A case's freq score depends on the options in use.
(1) no use of response.weights or item.weights: Count the positive responses within a case on the variables in the combination. Multiply that count by the case's weight.

(2) response weights are used, but not item weights: Sum the positive responses within a case on the variables in the combination. Multiply that sum by the case's weight.

(3) item weights are used, but not response weights: Sum the item weights for those variables in the combination that have a positive value. Multiply that sum by the case's weight.

(4) response weights and item weights are in use: Sum the item weight times the response value for those variables in the combination that have a positive value. Multiply that sum by the case's weight.

*****************************
*  limitations on run size  *
*****************************

ITEMS: Using lots of items can be done, but only with sensible combination sizes. For example, one might look at 4 items out of 200 (64.7 million passes), but even 6 out of 200 would become excessive (82.4 billion passes).

COMBINATION SIZE: the maximum combination size is 60 items. One could look at 16 out of 24, for example, or even 40 out of 45. However, 60 out of 210 is so large that it will never finish. This P-STAT command,

    PUT( COMBINATIONS(50,7))$

would return the number of combinations that 7 out of 50 would require, for example, and may be useful in estimating the time of a prospective run.

HOW LARGE A RUN CAN BE DONE: as described above, it depends on the number of cases, the number of combinations, the options used, and the speed of the PC itself. If you are considering a large run, 10 billion combinations or more, you might try a smaller run first, get its time, and use the ratio of the combinations to estimate the time needed for the larger run.

IS 10 OUT OF 50 POSSIBLE: this is 10.27 billion combinations and would undoubtedly take quite a few hours, but is possible.
The progress meter ticks every million combinations, so you can easily
tell how long a run will take once it starts.

********************************
*  limitations on memory size  *
********************************

MEMORY CONSTRAINTS FOR INPUT: the input data must fit in memory. (We
really don't want to read the data afresh from disk for each of 50
million different combinations.) In most situations memory should not
be a problem because the data is usually stored very compactly. The
final report shows how much of the input data area was used; that
line, however, is omitted when less than 50% was needed.

MEMORY CONSTRAINTS FOR OUTPUT: the output files (except for the
FULL.OUTPUT file) are collected in memory in sort order as the run
progresses. The default sizes cause no problems. If one asks for
20,000 results in the REACH.RESULTS or FREQ.RESULTS files, the run
will be slightly slower but the file should fit.

------------HELP ON turf.scores-----------

  TURF.SCORES xin, template ttt, out xout $

  TURF.SCORES xin, items v2 v4 to v8, carry case.num, out xout $

*************************
*  command description  *
*************************

TURF.SCORES computes the REACH score on a specified combination of
items for each case of an input file. These scores are written to an
output file. The calculations are identical to those in the TURF
command.

The output file will have the items used in the scoring, the reach
score, and any "carried" variables. Carried variables are usually
variables that identify the individual cases, facilitating demographic
breakdowns of the reached cases.

The command must be given the names of the variables (items) that make
up the combination to be scored. In addition, the defaults can be
changed for various options. These are: case weighting, item
weighting, response weighting and the reach threshold. This
information can be supplied in 2 ways:

(1) by providing a TEMPLATE file (created in a TURF command).
(2) by providing the controls as part of this command.

********************************************
*  identifiers when using a TEMPLATE file  *
********************************************

TURF.SCORES file,
      this supplies the input data file. Required.

TEMPLATE file,
      this is a file, created in a TURF run, that contains the
      settings that were used, and the names of the items that made
      up the best combination. Required.

OUT file,
      name for the result file. Required.

CARRY vars,
      variables that should be carried over from the input file to
      the OUT file, even though they are not involved in the
      execution of the command. Optional.

************************************************
*  identifiers when NOT using a TEMPLATE file  *
************************************************

TURF.SCORES file,
      this supplies the input data file. Required.

ITEMS vars,
      names of the variables (items) that make up the combination to
      be scored. Ranges can be used: var2 to var5. Required.

ITEM.WEIGHTS numbers,
      weights of the items. The default is to treat all items the
      same, i.e., with weights of 1. Optional.

CASE.WEIGHTS var,
      name of a variable to be used for case weighting when computing
      the overall reach for the dataset. Optional.

RESPONSE.WEIGHTS,
      causes the actual values of the items to be used in the scoring
      instead of just zero/one. Optional.

REACH.THRESHOLD number,
      the value that a case must achieve to be 'reached'. The default
      is one. Optional.

OUT file,
      name for the result file. Required.

CARRY vars,
      variables that should be carried over from the input file to
      the OUT file, even though they are not involved in the
      execution of the command. Optional.

******************************
*  contents of the OUT file  *
******************************

The OUT file will have the following variables: the items in the
combination that was scored, the caseweight variable (if there was
one), and any CARRY variables that were requested.
In addition, two variables are added:

(1) REACH.SCORE, the score of each case for the combination.
      This is set to M1 if there are negative or missing values for
      the case in the combination items. It is also M1 if the case
      has a non-positive caseweight.

(2) REACH.CATEGORY, which indicates whether the case was reached,
      given its score and the threshold in use.
      It is M1 when reach.score is M1.
      It is zero when reach.score is less than the threshold.
      It is one when reach.score satisfies the threshold.

****************
*  an example  *
****************

The TURF documentation shows a 16 case file named ddd. The following
command reads that file and computes reach scores using items v1, v2
and v5. The response.weights option is selected, and a threshold of 2
is used. Variable case.id is carried across into out file sss.

  turf.scores ddd, items v1 v2 v5,
              response.weights, reach.threshold 2,
              carry case.id, out sss $

*******************************
*  the report produced by     *
*  running the above command  *
*******************************

--------------TURF.SCORES completed--------------
| The reach scoring was done using these items: |
|    v1 v2 v5                                   |
|                                               |
| 16 cases were read from file ddd.             |
| The reach.threshold was 2.                    |
| The threshold was met by 8 cases.             |
| The RESPONSE.WEIGHTS option was in use.       |
| 6 variables were written to file sss.         |
-------------------------------------------------

*********************************
*  the output file produced by  *
*  running the above command    *
*********************************

   case                    reach      reach
    id    v1   v2   v5     score   category

   9001    1    1    0         2          1
   9002    1    1    0         2          1
   9003    1    1    0         2          1
   9004    1    1    0         2          1
   9005    1    1    0         2          1
   9006    1    1    0         2          1
   9007    0    1    0         1          0
   9008    0    0    0         0          0
   9009    0    0    0         0          0
   9010    0    0    0         0          0
   9011    0    0    0         0          0
   9012    0    0    0         0          0
   9013    0    0    0         0          0
   9014    0    0    0         0          0
   9015    0    0    2         2          1
   9016    0    0    2         2          1

For more information: email support@pstat.com
                      phone 609-466-9200
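
The REACH.SCORE and REACH.CATEGORY rules described for the example
above can be sketched in Python. This is an illustration only, not
P-STAT code and not the program's actual implementation: the function
names reach_score and reach_category are made up here, and the missing
value M1 is represented by None as an assumption. The data rows are
the v1, v2, v5 values from the output file listing.

```python
from typing import Optional, Sequence


def reach_score(responses: Sequence[Optional[float]],
                item_weights: Optional[Sequence[float]] = None,
                use_response_weights: bool = False,
                case_weight: float = 1.0) -> Optional[float]:
    """Reach score of one case; None stands in for M1 (assumption)."""
    if case_weight <= 0:
        return None                      # non-positive caseweight
    if any(r is None or r < 0 for r in responses):
        return None                      # negative or missing response
    if item_weights is None:
        item_weights = [1.0] * len(responses)   # default item weight
    total = 0.0
    for r, w in zip(responses, item_weights):
        # Without response weights, any positive response counts as 1.
        value = r if use_response_weights else (1 if r > 0 else 0)
        total += value * w
    return total


def reach_category(score: Optional[float],
                   threshold: float = 1.0) -> Optional[int]:
    """1 if the score meets the threshold, 0 if not, None for M1."""
    if score is None:
        return None
    return 1 if score >= threshold else 0


# v1, v2, v5 for cases 9001 to 9016, copied from the listing above.
rows = [(1, 1, 0)] * 6 + [(0, 1, 0)] + [(0, 0, 0)] * 7 + [(0, 0, 2)] * 2

scores = [reach_score(r, use_response_weights=True) for r in rows]
categories = [reach_category(s, threshold=2) for s in scores]
reached = sum(c for c in categories if c is not None)

if __name__ == "__main__":
    print(reached)   # 8, matching "The threshold was met by 8 cases."
```

In this sketch, cases 9015 and 9016 are reached only because
response weighting is on: their single response of 2 on v5 would be
stored as 1 otherwise, giving a score of 1 against a threshold of 2.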