Use Pig and PigUnit to do Weblog Analysis as a Maven project
- Exploratory Data Analysis: Hortonworks Sandbox, Hive, HDFS, Vim, Sublime Text
- Processing & Analysis: Pig, Piggybank, DataFu
- Testing & Validation: PigUnit, JUnit, Java, Maven
- Version Control: Git/GitHub, https://github.com/johnnycaol/pig-maven.git
- Import the src folder and pom.xml file as a Maven project in Eclipse
- Install all the dependencies specified in the pom.xml file
- Decompress 2015_07_22_mktplace_shop_web_log_sample.log.gz, e.g. with gunzip
- Sessionize the web log by IP address
- Determine the average session time
- Determine unique URL visits per session (i.e. count a hit to a unique URL only once per session)
- Find the most engaged users/IPs (i.e. the IPs with the longest session times)
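As a rough illustration of the first three goals, the plain-Java sketch below sessionizes hits per IP and derives the average session length and the average number of unique URLs per session. It is not the Pig script from the repository: the class and method names are hypothetical, and the fixed time window is interpreted as an inactivity timeout (the way DataFu's Sessionize UDF treats its timeout argument), which is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and structure are illustrative, not taken from the repository.
class SessionizeSketch {

    /** One page hit: client IP, time in epoch milliseconds, requested URL. */
    static final class Hit {
        final String ip;
        final long timeMs;
        final String url;

        Hit(String ip, long timeMs, String url) {
            this.ip = ip;
            this.timeMs = timeMs;
            this.url = url;
        }
    }

    /** One session: owning IP, first and last hit time, distinct URLs seen. */
    static final class Session {
        final String ip;
        final long startMs;
        final long endMs;
        final Set<String> urls;

        Session(String ip, long startMs, long endMs, Set<String> urls) {
            this.ip = ip;
            this.startMs = startMs;
            this.endMs = endMs;
            this.urls = urls;
        }

        long lengthMs() {
            return endMs - startMs;
        }
    }

    /** Groups hits by IP; a gap longer than timeoutMs between two hits starts a new session. */
    static List<Session> sessionize(List<Hit> hits, long timeoutMs) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Hit h) -> h.ip).thenComparingLong(h -> h.timeMs));

        List<Session> sessions = new ArrayList<>();
        String ip = null;
        long start = 0L;
        long last = 0L;
        Set<String> urls = new HashSet<>();
        for (Hit h : sorted) {
            boolean startsNew = ip == null || !ip.equals(h.ip) || h.timeMs - last > timeoutMs;
            if (startsNew) {
                if (ip != null) {
                    sessions.add(new Session(ip, start, last, urls)); // close the previous session
                }
                ip = h.ip;
                start = h.timeMs;
                urls = new HashSet<>();
            }
            urls.add(h.url); // the set counts each URL once per session
            last = h.timeMs;
        }
        if (ip != null) {
            sessions.add(new Session(ip, start, last, urls));
        }
        return sessions;
    }

    /** Mean session length in seconds; 0.0 when there are no sessions. */
    static double averageSessionSeconds(List<Session> sessions) {
        return sessions.stream().mapToLong(Session::lengthMs).average().orElse(0.0) / 1000.0;
    }

    /** Mean number of distinct URLs per session; 0.0 when there are no sessions. */
    static double averageUniqueUrls(List<Session> sessions) {
        return sessions.stream().mapToInt(s -> s.urls.size()).average().orElse(0.0);
    }
}
```

With a 15-minute timeout, for instance, the call would be sessionize(hits, 15 * 60 * 1000L); the 15 minutes is only an example value. Note that a session with a single hit has length zero under this definition, which pulls the average down.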
- IP addresses can be used to uniquely identify users
- Sessionize means aggregating all page hits from an IP within a fixed time window, rather than by navigation
- Unique URL visits means counting a hit to a unique URL only once per session
- Most engaged users means the IPs with a session length greater than or equal to the 95th percentile
- If required data is missing or malformed, that log entry is ignored
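The last rule above (ignore entries with missing or malformed data) could look roughly like the following plain-Java sketch. It assumes the sample follows the AWS Classic Load Balancer access log layout (timestamp first, client ip:port third, the request in the first quoted field); that layout, the treatment of "-" as a missing URL, and all class and method names are assumptions made for illustration, not taken from the repository's Pig scripts.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical sketch; the field positions assume an ELB-style access log line.
class LogLineSketch {

    /** The fields the analysis needs: client IP, time in epoch milliseconds, URL. */
    static final class Hit {
        final String ip;
        final long timeMs;
        final String url;

        Hit(String ip, long timeMs, String url) {
            this.ip = ip;
            this.timeMs = timeMs;
            this.url = url;
        }
    }

    /** Returns the parsed hit, or empty when a required field is missing or malformed. */
    static Optional<Hit> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String s = line.trim();
        String[] head = s.split(" ", 4); // timestamp, load balancer name, client ip:port, rest
        if (head.length < 4) {
            return Optional.empty();
        }

        // The request is assumed to be the first double-quoted field: "METHOD URL PROTOCOL".
        int open = s.indexOf('"');
        int close = open < 0 ? -1 : s.indexOf('"', open + 1);
        if (close < 0) {
            return Optional.empty();
        }
        String[] request = s.substring(open + 1, close).split(" ");
        if (request.length < 2 || request[1].isEmpty() || request[1].equals("-")) {
            return Optional.empty();
        }

        // Drop the port from ip:port, if present.
        int colon = head[2].indexOf(':');
        String ip = colon < 0 ? head[2] : head[2].substring(0, colon);
        if (ip.isEmpty()) {
            return Optional.empty();
        }

        try {
            long timeMs = Instant.parse(head[0]).toEpochMilli(); // ISO-8601 UTC timestamp
            return Optional.of(new Hit(ip, timeMs, request[1]));
        } catch (DateTimeParseException e) {
            return Optional.empty(); // unparseable timestamp: ignore the entry
        }
    }
}
```

Returning an empty Optional instead of throwing keeps one bad line from aborting the whole run, which mirrors the "ignore that log entry" rule.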
- testSessionize()
- testAvgSessionLength()
- testAvgUrlCountPerSession()
- testMostEngagedUserIps()
- Run PigUnitTest.java as a JUnit test. This uses the test data in test_data.log
- To process the data provided in the WeblogChallenge repository, first REGISTER the piggybank and datafu jars in the pig scripts
- Second, run the script from the pig command line, passing the path to the log data as the input parameter, e.g.
pig -x local -f sessionize.pig -param input='../data/2015_07_22_mktplace_shop_web_log_sample.log'
- average session time: 100.28 seconds
- unique URL visits per session: 8.31
- most engaged users: Top IPs based on quantile (generated in the output)
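The quantile cut-off behind the "most engaged users" result can be sketched in plain Java as follows. This is only an illustration: it uses the nearest-rank percentile definition, which may differ from the interpolation a quantile UDF applies, and the class, method and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; nearest-rank percentile is an assumption, not the repository's method.
class EngagedUsersSketch {

    /** Length of one session in seconds, together with the IP it belongs to. */
    static final class SessionLength {
        final String ip;
        final long seconds;

        SessionLength(String ip, long seconds) {
            this.ip = ip;
            this.seconds = seconds;
        }
    }

    /** Nearest-rank percentile: the smallest value with at least p percent of values at or below it. */
    static long percentile(List<Long> values, int p) {
        if (values.isEmpty() || p < 1 || p > 100) {
            throw new IllegalArgumentException("need a non-empty list and 1 <= p <= 100");
        }
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0); // 1-based rank
        return sorted.get(rank - 1);
    }

    /** IPs that own at least one session whose length is at or above the p-th percentile. */
    static Set<String> mostEngagedIps(List<SessionLength> sessions, int p) {
        Set<String> ips = new LinkedHashSet<>(); // keeps first-seen order, no duplicates
        if (sessions.isEmpty()) {
            return ips;
        }
        List<Long> lengths = new ArrayList<>();
        for (SessionLength s : sessions) {
            lengths.add(s.seconds);
        }
        long cutoff = percentile(lengths, p);
        for (SessionLength s : sessions) {
            if (s.seconds >= cutoff) {
                ips.add(s.ip);
            }
        }
        return ips;
    }
}
```

Calling mostEngagedIps(sessions, 95) applies the 95th-percentile rule; because the comparison is "greater than or equal to", ties at the cut-off are all included.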
- IP addresses do not guarantee distinct users; a user ID or session ID would help
- The sessionize code is repeated a couple of times and could be refactored
- Explore the results with different settings, e.g. sessionize by navigation or different time intervals, and then compare the results
- Make use of other fields, e.g. user_agent to understand what device users are coming from
- Use visualization tools like Tableau or ELK to look for patterns in the data, e.g. time series patterns, sales event patterns, network issues, differences between engaged and non-engaged users, device usage comparison, most popular URL/section of the site, etc.