Chapter 132. Distributed Query Processing (DQP)

132.1. History
132.2. What DQP does
132.3. Grammar, functions and example queries
132.3.1. Grammar
132.3.2. Functions
132.3.3. Example queries
132.4. Current limitations of DQP
132.5. How this version of DQP differs from previous versions
132.6. DQP related publications
Distributed Query Processing (DQP) is a component of OGSA-DAI that enables distributed queries over relational data resources exposed by OGSA-DAI servers.

132.1. History

The DQP functionality was originally designed and developed by at the University of Manchester and the University of Newcastle upon Tyne as part of the OGSA-DAI project and the MyGrid project. Previous versions of DQP have been released as a separate product from OGSA-DAI using the title OGSA-DQP. From OGSA-DAI release 3.2 the DQP functionality has been rewritten and is now simply an extension to OGSA-DAI rather than a separate product. For those familiar with previous versions of DQP, details of how this version differs can be found in Section 132.5, “How this version of DQP differs from previous versions”.

132.2. What DQP does

DQP allows the tables from multiple distributed relational databases to be queried, using SQL, as if there were multiple tables in a single database. Consider two distributed data resources. The first is called MySurvey and contains survey data of people:

Data Resource: MySurvey
Table: People
namegenderagepostcode
AlanM20AB1 2GB
BobM50EH5 2TG
CathF25JK3 4RE

The second data resource is called PostcodeStats and maps postcodes to classifications of a region type:

Data Resource: PostcodeStats
Table: PostcodeRegionType
postcoderegionType
AB1 2GBCity
FG2 5GBRural
EH5 2TGCity
JK3 4RERural

If these two distributed relational data resources could be queried as if they were two tables in a single database then useful questions could be answered. DQP allows the creation of a virtual data resource that appears to the user to contain these two tables. By default, table names in the virtual data resource are prefixed with the original resource names so the two tables in our virtual data resource become:

Data Resource: MyDQPResource
Table: MySurvey_People
namegenderagepostcode
AlanM20AB1 2GB
BobM50EH5 2TG
CathF25JK3 4RE

Data Resource: MyDQPResource
Table: PostcodeStats_PostcodeRegionType
postcoderegionType
AB1 2GBCity
FG2 5GBRural
EH5 2TGCity
JK3 4RERural

The MyDQPResource resource can be used to execute SQL queries that use both these tables. For example, to obtain the region type that Alan lives in, the SQL query is:

SELECT regionType 
FROM MySurvey_People p JOIN PostcodeStats_PostcodeRegionType r
  ON p.postcode = r.postcode
WHERE name = 'Alan'

which produces result:

| regionType |
| City       |

The SQL query to find the the average age for people living in each region type would be:

SELECT regionType, avg(age) as avgAge
FROM MySurvey_People p JOIN PostcodeStats_PostcodeRegionType r
  ON p.postcode = r.postcode
GROUP BY regionType

which produces the result:

| regionType | avgAge |
| City       | 35.0   |
| Rural      | 25.0   |

The DQP federation will contain one table for each table in each of the distributed resources. Consider the following new relational resource that we now wish to add to our federation.

Data Resource: JoesSurvey
Table: People
namegenderagepostcode
DarrenM20FG2 5GB
EdM31EH5 2TG
FionaF29FG2 5GB

This table will appear in our data federation as JoesSurvey_People. This will be an entirely different table from the MySurvey_People table. If we wish to consider these two tables to be portions of a single table then we must use a UNION operator in our SQL queries. For example, the query to obtain the average age for each region type now becomes:

SELECT regionType, avg(age) as avgAge
FROM 
  ((SELECT age, postcode FROM MySurvey_People) UNION ALL 
   (SELECT age, postcode FROM JoesSurvey_People)) as u
  JOIN PostcodeStats_PostcodeRegionType r ON u.postcode = r.postcode
GROUP BY regionType

It will be possible in the future to combine DQP with OGSA-DAI's support of views to hide the need for union operators such as this from the client. For example, the DQP resource could be wrapped in a view resource that defines a new People table as:

SELECT * FROM
((SELECT name, age, postcode FROM MySurvey_People) UNION ALL 
 (SELECT name, age, postcode FROM JoesSurvey_People))

As the time of writing the OGSA-DAI Views extension pack is not compatible with DQP. Check the OGSA-DAI website to see if a new version of the OGSA-DAI views extension pack that is compatible with DQP has been released.

The virtual federated resources that DQP provides can be configured statically at the server or dynamically by a client. Once the DQP resource has been created it can be used by clients just like any other relational resource.

132.3. Grammar, functions and example queries

This section shows DQP's SQL grammar, functions and some example SQL queries.

132.3.1. Grammar

DQP supports a subset of SQL 92. The grammar that DQP aims to support is shown below. This release of DQP does not support the whole grammar. The following features are not supported:

  • OR operator in search conditions
  • IN operator in search conditions
  • BETWEEN operator in search conditions
  • ALL, SOME and ANY comparison modifiers
  • FULL OUTER JOIN operator
  • DATE, TIME, TIMESTAMP and INTERVAL constants

  
statement
	: 	query_expression order_by? ';'? EOF 
	;
query_expression
	: 	query_term (set_op query_term)* 
	;
query_term
	:	query ( 'INTERSECT' query)*
	;
set_op	
	:	(setop='UNION' | setop='EXCEPT') 'ALL'? 
	;
query
	:	sub_query
		| 'SELECT' set_quantifier? select_list 
	  	  'FROM' table_expression 
	  	   ('WHERE' s1=search_condition)? 
	  	   ('GROUP BY' column_list)? 
	  	   ('HAVING' s2=search_condition)?
	;
set_quantifier
	:	'DISTINCT' | 'ALL';
sub_query
	:	'(' query_expression ')';

select_list
	:	'*'
	| 	derived_column (',' derived_column)*;
derived_column
	:	value_expression ('AS'? ID)?
	;
	
order_by
	:	'ORDER' 'BY' sort_specification (',' sort_specification)* ;
sort_specification
	:	column_name | INT | reserved_word_column_name
	;

reserved_word_column_name
	:	(tableid=ID'.')?
	        (s='DATE' | s='TIMESTAMP' | s='TIME' | s='INTERVAL' | s='YEAR' | 
	         s='MONTH' | s='DAY' | s='HOUR' | s='MINUTE' | s='SECOND' )	
	;
	
value_expression
	:	string_value_expression
	|	numeric_value_expression 
	;
numeric_value_expression
	: 	factor (('+'|'-') factor)* ;
	
factor	:	numeric_primary (('*'|'/') numeric_primary)*;

numeric_primary
	: 	('+'|'-')? value_expression_primary; 
value_expression_primary 
	:	'(' value_expression ')'
	| 	function
	| 	column_name
	| 	literal
	| 	sub_query
	;

literal	:	INT | FLOAT | NUMERIC | STRING | datetime | interval | 
                'NULL' | 'TRUE' | 'FALSE';
datetime
	:	('DATE' | 'TIMESTAMP' | 'TIME') STRING
	| 	(tableid=ID'.')?(s='DATE' | s='TIMESTAMP' | s='TIME')
	;
interval
	:	'INTERVAL' STRING ('YEAR' | 'MONTH' | 'DAY' | 'HOUR' | 'MINUTE' | 'SECOND')
	|	(tableid=ID'.')?(s='INTERVAL' | s='YEAR' | s='MONTH' | s='DAY' | s='HOUR' | 
	        s='MINUTE' | s='SECOND')
	;
		
function:	(name=ID) '(' value_expression? (',' value_expression)* ')' 
	| 	(name=ID) '(' '*' ')' 
	;

string_value_expression
	:	(column_name | STRING) ('||' (column_name | STRING))+
	;
	
table_expression
	:	table_reference
	;
table_reference
	:	table (',' table_reference)*
	;

join_type
	:	'RIGHT' 'OUTER'? 'JOIN' 
	|	'LEFT' 'OUTER'? 'JOIN' 
	|	'FULL' 'OUTER'? 'JOIN' 
	| 	'INNER'? 'JOIN' 
	;
table 	:	non_join_table (join_type non_join_table 'ON' search_condition)*
	;


non_join_table
	:	table_name correlation_specification? 
	| 	table_function correlation_specification 
	| 	sub_query correlation_specification 
	;
	
table_function
	:	name=ID '(' table_function_subquery? (',' table_function_subquery)* 
	        (','? table_function_param)* ')'
	;

table_function_subquery
	:	sub_query correlation_specification
	;

table_function_param
	:	search_condition
	|	value_expression
	;	

relation 
 	:	table_name 
 	|	table_function 
 	|	query 
 	;
	
search_condition
	:	boolean_factor ('OR' boolean_factor)*;
boolean_factor
	:	boolean_term ('AND' boolean_term)*
	;
boolean_term
	:	boolean_test
	|	'NOT' boolean_term ;
boolean_test
	:	boolean_primary;
boolean_primary
	:	predicate | '(' search_condition ')';	
predicate
	:	comparison_predicate | like_predicate | in_predicate | null_predicate | 
	        exists_predicate | between_predicate;
null_predicate
	:	row_value 'IS' 'NULL' 
	|	row_value 'IS' 'NOT' 'NULL' 
	;
in_predicate
	:	row_value 'NOT' 'IN' in_predicate_tail
	|	row_value 'IN' in_predicate_tail
	;
in_predicate_tail
	:	sub_query 
	| 	'(' (value_expression (',' value_expression)*) ')';
between_predicate
	:	value=row_value 'BETWEEN' btw1=row_value 'AND' btw2=row_value 
	|	value=row_value 'NOT' 'BETWEEN' btw1=row_value 'AND' btw2=row_value
	;
exists_predicate
	:	'EXISTS' sub_query;
comparison_predicate
	:	row_value ('=' | '<>' | '!=' | '<' | '>' | '>=' | '<=') 
	        ('ALL'|'SOME'|'ANY')? row_value;
like_predicate
	:	row_value 'LIKE' row_value
	|	v1=row_value 'NOT' 'LIKE' v2=row_value;

row_value
	:	value_expression | 'NULL' ;

correlation_specification
	:	('AS')? ID;	
table_name
	:	ID;
column_list
	:	(column_name | reserved_word_column_name) 
	        (',' (column_name | reserved_word_column_name))*;
column_name
	:	(tableid=ID'.')?columnid=ID
	;

ID	:	('a'..'z' | 'A'..'Z') ( ('a'..'z' | 'A'..'Z') | ('0'..'9') | '_' )* 
	|	'`' (~('\''|'\n'|'\r'|'`')) ( (~('\''|'\n'|'\r'|'`')) )* '`' ;
FLOAT	:	('0'..'9')+ '.' ('0'..'9')+ ;
INT	:	('0'..'9')+ ;
NUMERIC	:	(INT | FLOAT) 'E' INT ;
STRING	:	'"' (~('"'|'\n'|'\r'))*  '"'
	|	'\'' (~('\''|'\n'|'\r'))*  '\'';
WS	:	(' ' | '\t' | '\r' | '\n' ) {skip();} ;

132.3.2. Functions

DQP supports aggregation and scalar functions. These are extensibility points of DQP so if the function you require is not provided it is possible to write your own and deploy them into DQP, see Chapter 136, DQP for service developers.

DQP supports the aggregation functions shown in Table 132.1, “DQP aggregation functions.”. All the aggregation functions ignore null values except COUNT(*) which counts all the rows regardless of the data values they contain.

NameSupported typesDescriptionExamples
COUNTAllReturns a count of the number of rows returned.COUNT(*), COUNT(colName)
AVGShort, Integer, Long, Float, Double, BigDecimalReturns the average value of the argument.AVG(colName)
MAXShort, Integer, Long, Float, Double, BigDecimalReturns the maximum value of the argument.MAX(colName)
MINShort, Integer, Long, Float, Double, BigDecimalReturns the minimum value of the argument.MIN(colName)
SUMShort, Integer, Long, Float, Double, BigDecimalReturns the sum of the values of the argument.SUM(colName)
STDDEVShort, Integer, Long, Float, Double, BigDecimalReturns the population standard deviation.STDDEV(colName)
VARIANCEShort, Integer, Long, Float, Double, BigDecimalReturns the population standard variance.VARIANCE(colName)
MEDIANShort, Integer, Long, Float, Double, BigDecimalReturns the median of the values of the argument.MEDIAN(colName)

Table 132.1. DQP aggregation functions.


DQP supports the scalar functions shown in Table 132.2, “DQP scalar functions.”.

NameSupported signaturesDescriptionExamples
GetHourOfDay(Time)->Long, (Timestamp)->LongReturns the hour of the day.GetHourOfDay(colName)
GetDate(Timestamp)->DateExtracts the date from a timestamp.GetData(colName)
AddDays(Date, Number)->Date, (Timestamp, Number)->TimestampAdds the given number of whole days.AddDays(colName, 2)
SubDays(Date, Number)->Date, (Timestamp, Number)->TimestampSubtracts the given number of whole days.SubDays(colName, 2)
CurrentDate()->DateReturns the current date.CurrentDate()
CurrentTime()->TimeReturns the current time.CurrentTime()
DateDiff(Date,Date)->Integer, (Date,Timestamp)->Integer, (Timestamp, Timestamp)->Integer, (Timestamp,Date)->IntegerReturns the number of whole days between the two given values. Return value will be negative if the first date chronologically precedes the second date, and positive if first date is chronologically after the second date. DateDiff(colName1, colName2)

Table 132.2. DQP scalar functions.


132.3.3. Example queries

This section shows some example SQL queries that DQP supports. If you are having difficulty forming a query that DQP can execute try adapting those shown here.

Simple SELECT... FROM...WHERE... queries:

SELECT * FROM MyResource_People

SELECT name, age, gender FROM MyResource_People

SELECT name, age FROM MyResource_People WHERE age > 18 AND gender = 'F'

SELECT name, age FROM MyResource_People WHERE name LIKE 'J%'

SELECT DISTINCT age FROM MyResource_People

Implicit join queries:

SELECT * 
FROM MyResource_People p, OtherResource_PostcodesStats pc
WHERE p.postcode = pc.postcode

Explicit join queries:

SELECT * 
FROM 
  MyResource_People p 
  JOIN OtherResource_PostcodesStats pc ON p.postcode = pc.postcode

Union queries:

SELECT intT, dateTimeT
FROM 
((select intT, datetimeT from MySQLResource_mysql_types)  
  UNION ALL 
 (select intT, datetimeT from Resource2_mysql_types) 
) sub

Using aggregate functions:

SELECT COUNT(*), AVG(p.age) 
FROM 
  MyResource_People p 
  JOIN OtherResource_PostcodesStats pc ON p.postcode = pc.postcode

Using scalar functions:

SELECT GetHourOfDay(startTime) FROM MyResource_runRecord

SELECT * 
FROM MyResource_runRecord 
WHERE GetHourOfDay(startTime) > 10

Using grouping:

SELECT C.eid, MAX(A.cruisingrange) 
FROM MySQLResource_certified C, MySQLResource_aircraft A
WHERE C.aid = A.aid AND A.aid>10 
GROUP BY C.eid

SELECT C.eid, MAX(A.cruisingrange)
FROM MySQLResource_certified C, Resource2_aircraft A 
WHERE C.aid = A.aid 
GROUP BY C.eid
HAVING COUNT(C.eid) > 3

SELECT d, h, AVG(i) FROM 
 (SELECT
   GetDate(startTime) AS d, GetHourOfDay(startTime) AS h, numValues AS i 
   FROM MyResource_runRecord) AS sub 
GROUP BY d, h
ORDER BY d, h

Ordering:

SELECT name, age, gender FROM MyResource_People
ORDER BY age

Using subqueries:

SELECT DISTINCT E.ename
FROM MySQLResource_employees E 
WHERE E.salary < 
 ( SELECT MIN (F.price) 
   FROM MySQLResource_flights F
   WHERE F.origin = 'Los Angeles' AND F.destination = 'Honolulu' )

SELECT Temp.name, Temp.AvgSalary
FROM 
( SELECT A.aid, A.aname AS name, AVG (E.salary) AS AvgSalary 
  FROM MySQLResource_aircraft A, Resource2_certified C, Resource3_employees E 
  WHERE A.aid = C.aid AND C.eid = E.eid AND A.cruisingrange > 1000 
  GROUP BY A.aid, A.aname ) AS Temp 
ORDER BY Temp.name

Queryies using various combinations of supported functionality:

SELECT d, h, AVG(i) FROM
(SELECT 
  GetDate(datetimeT) AS d, GetHourOfDay(datetimeT) AS h, intT AS i 
  FROM 
  ((select intT, datetimeT from Resource1_runRecord)
   UNION ALL 
   (select intT, datetimeT from Resource2_runRecord) 
  ) sub1 
  WHERE datetimeT >= '1978-09-29' and datetimeT < '1978-11-30'
) AS sub2 
GROUP BY d, h
ORDER BY d, h

132.4. Current limitations of DQP

This section lists some of the limitations of DQP. The OGSA-DAI team hope to overcome many of these limitations in the future.
  • All the existing optimisers are essentially heuritic and do not make use of a cost model.

  • Join ordering makes use of very basic cardinality estimates. This model could be considerable improved and it an extensibility point in OGSA-DAI.

  • Many SQL operations are not pushed to the database even when it is possible to do so and are instead executed inside OGSA-DAI activities. This is true for all functions, ordering and group by operations.

  • Most join implementations store the data from one side of the join in memory. This can lead to excessive use of memory. Alternative approaches not yet implemented include writing to disk, executing a streaming merge join on sorted data streams or using a parallel hash join algorithm to distribute the load to multiple OGSA-DAI servers.

  • The errors returned to the user if the SQL query does not parse are often vague and unhelpful.

  • There is no dynamic fault tolerence so DQP fails if one of the resources it federates is unavailable.

132.5. How this version of DQP differs from previous versions

This version of DQP is a major redesign and rewrite from the previous OGSA-DQP release (titled OGSA-DQP 3.2). The main differences include:

  • Removal of the DQP Evaluator service. All the functionality previously provided by the evaluator service is now provided by new OGSA-DAI activities so evaluator services can be replaced by OGSA-DAI services with the appropriate activities installed. This greatly simplifies deployment and enables extensibility within a single framework.
  • Support for more SQL operations
    • GROUP BY and HAVING clauses
    • EXISTS predicate
    • SELECT *
  • Improved extensibility. Much of DQP can now be extended and adapted to suit a project's needs.
  • Removal of function calls to web services. If this functionality is desired DQP support for an extensible set of functions could be used to implement something similar.
  • An entirely new way of configuring DQP so your any previous DQP configurations will have to be ported to the new configuration style. We apologise for this but believe it should not be too difficult to do.
  • Parallel query processing optimisations utilising additional evaluators are not yet supported in this version.
  • DQP resources expose interface identical with any other read only relational resource. Resource properties exposing query plans are no longer supported. Query plans can be obtained by other means.
  • There is no GUI client. We are planning to develop a web based user interface.

132.6. DQP related publications

Publications describing the current DQP release.

  • Dobrzelecki, B., Krause, A., Hume, A. C., Grant, A., Antonioletti, M., Alemu, T. Y., Atkinson, M., Jackson, M. and Theocharopoulos, E. Integrating Distributed Data Sources with OGSA-DAI DQP and Views. To appear in Phil. Trans. R. Soc. A.

Publications describing the design and applications of the previous OGSA-DQP release. Note that the design and implementation details differ significantly from the current version of DQP.

  • Lynden, S., Mukherjee, A., Hume, A. C., Fernandes, A. A., Paton, N. W., Sakellariou, R., and Watson, P. 2009. The design and implementation of OGSA-DQP: A service-based distributed query processor. Future Gener. Comput. Syst. 25, 3 (Mar. 2009), 224-236. DOI= http://dx.doi.org/10.1016/j.future.2008.08.003
  • Lynden, S., Pahlevi, S. M., and Kojima, I. 2008. Service-based data integration using OGSA-DQP and OGSA-WebDB. In Proceedings of the 2008 9th IEEE/ACM international Conference on Grid Computing - Volume 00 (September 29 - October 01, 2008). International Conference on Grid Computing. IEEE Computer Society, Washington, DC, 160-167. DOI= http://dx.doi.org/10.1109/GRID.2008.4662795