INF: One Solution for a One-to-Many Distinct Rowcount Problem (323895)

MORE INFORMATION

How MCS Identified the Source of the Problem

Upon investigation, MCS found that the problem centered on the functionality limitations in a component that interacts with SQL Server Analysis Services, specifically, the Microsoft Data Access (MDAC) component. The problem was with the version of MDAC that was running on the customer's computer. The customer ran a secured Microsoft Windows 2000 desktop. MDAC version 2.5 was installed on the customer's desktop systems. MCS built a solution by using Microsoft SQL Server 2000 Service Pack 1 (SP1) and Microsoft Analysis Services Service Pack 1 (SP1). MDAC 2.5 uses the SQL Server version 7.0 OLE DB provider for OLAP Services (MSOLAP.1), which cannot view cubes that do not take advantage of any of the newer features, such as distinct count.

Discussion

The customer used Analysis Services to feed data to a Web application that made use of the Microsoft PivotTable dynamic view component of Microsoft Office Web Components (OWC). The application presented summarized information effectively for online analysis, and provided quick responses to iterative complex analytical queries. The specific project requirements called for presenting the count of the number of people in the database, and the count of the number of people invited to each event.

The development team populated a fact table by running a query that joined person to event, and then inserted the result into the fact table. Counting the rows of the resulting fact table produced incorrect results. Because each person can potentially attend many events, joining person to event produces more rows in the result set than exist in the person table. Additionally, not every person might attend an event. Therefore, joining these tables together potentially drops rows from the result.

There are several ways to resolve the issue of dropped rows:

Use a covering query.

One method is the use of a "covering query." Each person in the database would have an entry for every event. However, generating a covering query can result in the return of millions of rows.
Use an outer join.

Another method is the use of an outer join. The people who have not attended an event would have null entries for event related information. Null values in the query are translated to the appropriate key of the dimension table. For the sake of simplicity, the value 0 is used to catch null rows. OLAP adds a corresponding row to the dimension table to make sure that OLAP does not drop rows when OLAP creates the cube. However, duplicate rows are still an issue.

A distinct count typically handles the problem of duplicate rows. In the client situation, where people were being counted, the count was the distinct number of people. If there are 100 rows in the Person table, and each person attends two events, a join between the Person and the Event tables will have a result set that contains 200 rows. The distinct count of people, however, will only show 100. Distinct count seems to be the solution for the problem. Unfortunately, this was not the case because clients that tried to access the cube could not access the cube. Presenting information from a one-to-many relationship is a common issue with OLAP. One documented approach is to create separate cubes for each relationship. For this situation, then, there is one cube for the person information and one cube for the person event information. The solution worked, although it was not ideal, and MCS and the client investigated an approach that did not require the creation and maintenance of multiple cubes.

One of the original designs for the database used a table that described each action a person attending an event might have. Summarizing the information stored in this table used the technique of creating a cross-tab query. Cross-tab queries use a searched CASE statement to determine how to handle information. Here is an example from SQL Server Books Online:
```
SELECT Year, 
    SUM(CASE Quarter WHEN 1 THEN Amount ELSE 0 END) AS Q1,
    SUM(CASE Quarter WHEN 2 THEN Amount ELSE 0 END) AS Q2,
    SUM(CASE Quarter WHEN 3 THEN Amount ELSE 0 END) AS Q3,
    SUM(CASE Quarter WHEN 4 THEN Amount ELSE 0 END) AS Q4
FROM Northwind.dbo.Pivot
GROUP BY Year
						
```
This proved to be a better solution:

A modification of the concept of creating a cross-tab query and the use of an outer join.

How MCS Found the Right Solution

The key to the solution is to understand that only the first entry must be counted. Later entries for that person are not counted. The additional entries are used only to satisfy the many portion of the relationship. Using a searched case, as was used in the cross-tab query, the first entry is assigned a value of one. The syntax for the searched case checks that the ID of the many table is the first row. This is found by running a subquery that looks for the minimum ID for that person. The program also looks for a NULL value because the row might not have a corresponding entry. By then summing the PersonCount column from the fact table, the distinct count is found; in essence, distinct count without "distinct" or "count." This solution is shown in the following code example:

SELECT
	p.PersonID
	, CASE WHEN 
(
		pe.personeventid = 
(
select min(personeventid) 
from 
PersonEvent pe1 
where 
pe1.personid = p.personid
) OR pe.personeventid is null
) 
THEN 1 
ELSE 0 
END as PersonCount
, Invited
, Attended
FROM 
	Person p
	left join PersonEvent pe on p.PersonID = pe.PersonID

Multiple distinct count support is possible by adding additional subqueries to the searched CASE statement for the facts that are not associated with any newly added relationships. For example, a table that represents attributes for people is added to the query. The subquery that checks for the first attribute entry is added to the searched case for people. For example:

CASE WHEN 
(
pe.personeventid = 
(
select min(personeventid) 
from 
PersonEvent pe1 
where 
pe1.personid = p.personid
)
		OR pe.personeventid is null
)
	AND 
(
pa.personattributeid = 
(
select min(personattributeid) 
from 
PersonAttribute pa1 
where 
pa1.personid = p.personid
)
		OR pa.personattributeid is null 
) 
THEN 1 ELSE 0 END as Person

Developers who use this technique must be aware that filtering attributes that are not associated with the fact will produce inaccurate results. You have to program the interface to include only relevant facts for a particular set of attributes.

Conclusion

This article shows that it is possible to create multiple distinct count measures by using a Transact-SQL Case statement with correlated subqueries. Additionally, this technique supports all versions of the OLE DB provider for Analysis Services.