Class WebLogAnalysis


  • public class WebLogAnalysis
    extends Object
    This program processes web logs and relational data. It implements the following relational query:
    
     SELECT
           r.pageURL,
           r.pageRank,
           r.avgDuration
     FROM Documents d JOIN Rankings r
                      ON d.url = r.pageURL
     WHERE CONTAINS(d.contents, [keywords])
           AND r.pageRank > [rank]
           AND NOT EXISTS
               (
                  SELECT * FROM Visits v
                  WHERE v.destURL = d.url
                        AND v.visitDate < [date]
               );
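
    The semantics of the query can be sketched with plain Java collections. This is an illustration only, not the Flink program: the class WebLogQuerySketch, the method run, the row layout as String arrays, ISO-formatted dates, and the reading of CONTAINS as "contains every keyword as a substring" are all assumptions made for the example.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, Flink-free sketch of the relational query above.
class WebLogQuerySketch {

    /** Returns "pageURL|pageRank|avgDuration" for every row the query selects. */
    static List<String> run(List<String[]> documents, // rows: {url, contents}
                            List<String[]> rankings,  // rows: {pageRank, pageURL, avgDuration}
                            List<String[]> visits,    // rows: {destURL, visitDate}, other columns omitted
                            List<String> keywords,
                            int minRank,
                            LocalDate date) {
        // Subquery of NOT EXISTS: URLs with at least one visit before the date.
        Set<String> visitedBefore = new HashSet<>();
        for (String[] v : visits) {
            if (LocalDate.parse(v[1]).isBefore(date)) {
                visitedBefore.add(v[0]);
            }
        }

        // CONTAINS predicate (assumed): the contents include every keyword.
        Set<String> matchingDocs = new HashSet<>();
        for (String[] d : documents) {
            boolean containsAll = true;
            for (String keyword : keywords) {
                if (!d[1].contains(keyword)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                matchingDocs.add(d[0]);
            }
        }

        // Join on the URL (a primary key on both sides), rank filter, anti-join.
        List<String> result = new ArrayList<>();
        for (String[] r : rankings) {
            int rank = Integer.parseInt(r[0]);
            if (rank > minRank
                    && matchingDocs.contains(r[1])
                    && !visitedBefore.contains(r[1])) {
                result.add(r[1] + "|" + rank + "|" + r[2]);
            }
        }
        return result;
    }
}
```

    Because the URL is a primary key in both Documents and Rankings, each ranking row joins with at most one document, so set membership is enough to express the join in this sketch.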
     

    Input files are plain-text CSV files that use the pipe character ('|') as the field separator. The tables referenced in the query can be generated with the WebLogDataGenerator and have the following schemas:

    
     CREATE TABLE Documents (
                    url VARCHAR(100) PRIMARY KEY,
                    contents TEXT );
    
     CREATE TABLE Rankings (
                    pageRank INT,
                    pageURL VARCHAR(100) PRIMARY KEY,
                    avgDuration INT );
    
     CREATE TABLE Visits (
                    sourceIP VARCHAR(16),
                    destURL VARCHAR(100),
                    visitDate DATE,
                    adRevenue FLOAT,
                    userAgent VARCHAR(64),
                    countryCode VARCHAR(3),
                    languageCode VARCHAR(6),
                    searchWord VARCHAR(32),
                    duration INT );
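
    A minimal sketch of reading one pipe-separated Rankings line into a typed value, assuming the column order of the schema above. RankingLineParser and Ranking are hypothetical names chosen for this example; the actual program reads its inputs through Flink's own CSV reader.

```java
import java.util.regex.Pattern;

// Hypothetical parser for one line of the Rankings file.
class RankingLineParser {

    /** Hypothetical holder for one Rankings row. */
    static final class Ranking {
        final int pageRank;
        final String pageURL;
        final int avgDuration;

        Ranking(int pageRank, String pageURL, int avgDuration) {
            this.pageRank = pageRank;
            this.pageURL = pageURL;
            this.avgDuration = avgDuration;
        }
    }

    static Ranking parse(String line) {
        // '|' is a regex metacharacter, so it has to be quoted for split().
        // The limit of -1 keeps trailing empty fields, so a line that ends
        // with a separator still parses.
        String[] fields = line.split(Pattern.quote("|"), -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected 3 fields, got: " + line);
        }
        return new Ranking(
                Integer.parseInt(fields[0].trim()),
                fields[1].trim(),
                Integer.parseInt(fields[2].trim()));
    }
}
```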
     

    Usage: WebLogAnalysis --documents <path> --ranks <path> --visits <path> --result <path>
    If no parameters are provided, the program is run with default data from WebLogData.
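
    The --key <value> convention of the usage line can be sketched in plain Java as follows. ArgsSketch is a hypothetical helper written for this illustration and is not how the class itself reads its parameters; an empty result corresponds to the case in which the program falls back to the default data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that turns "--key value" pairs into a map.
class ArgsSketch {

    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Arguments must come in --key <value> pairs");
        }
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected --key, got: " + args[i]);
            }
            // Strip the leading "--" and store the value that follows.
            params.put(args[i].substring(2), args[i + 1]);
        }
        return params;
    }
}
```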

    This example shows how to use:

    • tuple data types
    • projection and join projection
    • the CoGroup transformation for an anti-join
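
    The idea behind using a CoGroup for an anti-join can be sketched without Flink: group both inputs by key, look at the two groups for each key together, and emit the left-side elements only when the right-side group is empty. CoGroupAntiJoinSketch and antiJoin are hypothetical names, and the inputs are reduced to plain key lists for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Flink-free illustration of an anti-join expressed as a co-group.
class CoGroupAntiJoinSketch {

    /** Keeps every left element whose key has no matching element on the right. */
    static List<String> antiJoin(List<String> leftKeys, List<String> rightKeys) {
        // Grouping phase: build one group per key on each side.
        Map<String, List<String>> leftGroups = new LinkedHashMap<>();
        for (String key : leftKeys) {
            leftGroups.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
        }
        Map<String, List<String>> rightGroups = new HashMap<>();
        for (String key : rightKeys) {
            rightGroups.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
        }

        // Co-group phase: inspect both groups of a key side by side.
        // Keys that occur only on the right would yield nothing for an
        // anti-join, so iterating over the left keys is sufficient here.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : leftGroups.entrySet()) {
            List<String> rightGroup =
                    rightGroups.getOrDefault(entry.getKey(), Collections.emptyList());
            if (rightGroup.isEmpty()) {
                out.addAll(entry.getValue());
            }
        }
        return out;
    }
}
```

    In terms of the query, the left side would be the joined and filtered ranking rows keyed by URL, and the right side the visits that satisfy the date predicate.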

    Note: All Flink DataSet APIs are deprecated since Flink 1.18 and will be removed in a future Flink major version. You can still build your application with the DataSet API, but you should move to the DataStream API and/or the Table API. This class is retained for testing purposes.

    • Constructor Detail

      • WebLogAnalysis

        public WebLogAnalysis()