AWS EMR でSparkRを使って見る

AWSEMRとは、SparkやらHiveやらそれら一式を簡単に使える様にしてくれている仕組みです。

ぽちぽちっとEMRでサーバを作成。

この間10分程度

SparkRでサンプルデータを解析してみます

こちらの内容をアレンジしてみました

http://engineer.recruit-lifestyle.co.jp/techblog/2015-08-19-sparkr/

データ取得

http://stat-computing.org/dataexpo/2009/the-data.html

こちらから2001、2、3のデータをダウンロード

$ wget http://stat-computing.org/dataexpo/2009/2001.csv.bz2

unzip

$ bunzip2 2001.csv.bz2

s3にアップロード

$ aws s3 cp 2001.csv s3://samplebucket/airline/

同様に2002,2003も繰り返す

Hive

$ hive
hive> add jar /usr/lib/hive/lib/hive-contrib.jar;
Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
hive> create table airline(
		> Year STRING,
		> Month STRING,
		> DayofMonth STRING,
		> DayOfWeek STRING,
		> DepTime STRING,
		> CRSDepTime STRING,
		> ArrTime STRING,
		> CRSArrTime STRING,
		> UniqueCarrier STRING,
		> FlightNum STRING,
		> TailNum STRING,
		> ActualElapsedTime STRING,
		> CRSElapsedTime STRING,
		> AirTime STRING,
		> ArrDelay STRING,
		> DepDelay STRING,
		> Origin STRING,
		> Dest STRING,
		> Distance STRING,
		> TaxiIn STRING,
		> TaxiOut STRING,
		> Cancelled STRING,
		> CancellationCode STRING,
		> Diverted STRING,
		> CarrierDelay STRING,
		> WeatherDelay STRING,
		> NASDelay STRING,
		> SecurityDelay STRING,
		> LateAircraftDelay STRING
		> )
		> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
		> LOCATION 's3://samplebucket/airline/' tblproperties ("skip.header.line.count"="1");
hive> select * from airline limit 1;
OK
2001	1	17	3	1806	1810	1931	1934	US	375	N700��	85	84	60	-3	-4	BWI	CLT	361	5	20	0	NA	0	NA	NA	NA	NA	NA

SparkR

$ sparkR
> install.packages("magrittr")
> library(magrittr)
> hiveContext <- sparkRHive.init(sc)
> airline<-sql(hiveContext,"select * from airline")
> class(airline)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"
> airline %>%
+	 filter(airline$Origin == "JFK") %>%
+	 group_by(airline$Dest) %>%
+	 agg(count=n(airline$Dest)) %>%
+	 head
	Dest count																																		
1	IAH	1214
2	STL	2922
3	SNA	 805
4	MSP	1580
5	STT	1085
6	SAN	2723

こんな感じで簡単にできました