amazon ec2 - Join performance on AWS Elastic MapReduce running Hive


I am running a simple join query:

 select count(*) from t1 join t2 on t1.sno = t2.sno

Tables t1 and t2 have 20 million records each, and the column sno is of string data type.

The table data is imported into HDFS from Amazon S3 in RCFile format. The query took 109 s on 15 Amazon large instances, whereas it takes 42 s on SQL Server with 16 GB RAM and 16 CPU cores.

Am I missing anything? I can't understand why I am getting such slow performance on Amazon.

Some questions to help tune Hadoop performance:

  • What does IO utilization look like on your instances? Maybe large instances are not the right balance of CPU / disk / memory for the job.
  • How are your files stored? As a single file, or as many small files? Hadoop isn't so hot with many small files, even if they're combinable.
  • How many reducers did you run? You want to have about 0.9*totalReduceCapacity as the ideal.
  • How skewed is your data? If there are many records with the same key, they all go to the same reducer, and you'll have an O(n*n) upper bound in that reducer if you're not careful.
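
The reducer and skew questions above can be checked from the Hive prompt. The following is only a rough sketch, not a tuned configuration. The figure of 2 reduce slots per node is an assumption made for illustration (the real value depends on the instance type and cluster configuration), which would give 0.9 * 15 * 2 = 27 reducers. The property names are the classic Hadoop/Hive ones, so check them against the versions your cluster actually runs.

```sql
-- Skew check: list the most frequent join keys in t1 (repeat for t2).
-- A handful of keys with very large counts means those keys all land
-- on one reducer and dominate the join time.
select sno, count(*) as cnt
from t1
group by sno
order by cnt desc
limit 10;

-- Reducer count: 0.9 * totalReduceCapacity.
-- ASSUMPTION: 15 nodes with 2 reduce slots each, so 0.9 * 30 = 27.
set mapred.reduce.tasks=27;

-- Optional: ask Hive to handle heavily skewed join keys separately.
set hive.optimize.skewjoin=true;

select count(*) from t1 join t2 on t1.sno = t2.sno;
```

If the first query shows counts close to 1 for every key, skew is unlikely to be the problem and the reducer count and file layout are the better places to look.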

SQL Server might be fine with 40mm records, but wait till you have 2bn records and see how it does. It will probably just break. I'd see Hive as more of a clever wrapper for MapReduce rather than an alternative to a real database.

Also, from experience, I think having 15 c1.mediums might perform as well as the large machines, if not better. The large machines don't have the right balance of CPU/memory, honestly.

