August 21, 2020

(R) - Rewrite a for-loop with sapply

Performance is one of the important topics in programming. However, when a program runs on a schedule in the middle of the night while people are asleep, I think performance is not necessarily the top priority; the correctness of the data matters more. Recently I improved a slow part of my R code to make it run faster, and the result was quite satisfying. In this article, you will learn how to rewrite an ordinary for-loop using the sapply function in R.

The program in question checks whether any data is missing. At first, processing more than 40,000 rows with a loop took as long as 6 minutes. Rewriting it with R's sapply function improved performance dramatically, because it operates on just a single vector.

When you receive a task, you have two options. The first is to get your hands dirty and make the program output the result you want with basic tools such as a for-loop and if-else. The other is to investigate which approach performs best before you implement anything, but that depends on how much time you have.

In my opinion, you can get it working quickly and then improve the performance gradually afterward. My task includes a step that reformats a data frame: the value of the event label column (eventLabel) is split into three parts, which come back as a list, and the parts are written into three other columns according to different rules.
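To make the split step concrete, here is a minimal sketch. The sample labels are made up for illustration; the only thing taken from the article is the "[.]" pattern, which matches a literal dot.

```r
# Hypothetical sample labels in the case.dialogue.message format
labels <- c("12.3.4", "12.x.x")

# "[.]" is a regex character class, so the dot is matched literally
parts <- strsplit(labels, "[.]")

# strsplit returns a list with one character vector per label
print(parts[[1]])  # "12" "3" "4"
print(parts[[2]])  # "12" "x" "x"
```

Each piece is still a character string at this point, which is why the code in this article calls as.numeric before storing it.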

Here is the for-loop version:
# If the close message is case.x.x, set dialogue and message to 0
timer1 <- Sys.time()  # start time, used to measure the run
for (i in seq_len(nrow(data))) {
  # Split the label on literal dots, e.g. "12.3.4" becomes c("12", "3", "4")
  conversationNum <- unlist(strsplit(data[i, ]$eventLabel, "[.]"))
  data[i, ]$case <- as.numeric(conversationNum[1])

  if (conversationNum[2] != "x" && conversationNum[3] != "x") {
    data[i, ]$dialogue <- as.numeric(conversationNum[2])
    data[i, ]$message <- as.numeric(conversationNum[3])
  } else {
    data[i, ]$dialogue <- 0
    data[i, ]$message <- 0
  }
}


The for-loop version took 6.3 minutes to handle 42061 rows.
Now let's see how to make it much faster using the sapply function.

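Function:
handle.eventlabel <- function(x, position) {
  # Split the label on literal dots, e.g. "12.x.x" becomes c("12", "x", "x")
  y <- unlist(strsplit(x, "[.]"))
  # Return 0 for an "x" placeholder, otherwise the numeric value
  z <- ifelse(y[position] == "x", 0, as.numeric(y[position]))
  return(z)
}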

sapply:
data$case <- sapply(X = data$eventLabel, FUN = handle.eventlabel, position = 1)
data$dialogue <- sapply(X = data$eventLabel, FUN = handle.eventlabel, position = 2)
data$message <- sapply(X = data$eventLabel, FUN = handle.eventlabel, position = 3) 
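For readers who want to try the calls above, here is a small self-contained sketch. The three-row data frame is a made-up stand-in for the real data, and USE.NAMES = FALSE is my addition (by default sapply names each result after its input string). The helper uses a plain if/else, which behaves the same as the ifelse call for a single value.

```r
# Helper as in the article: the piece at `position`, or 0 when it is "x"
handle.eventlabel <- function(x, position) {
  y <- unlist(strsplit(x, "[.]"))
  if (y[position] == "x") 0 else as.numeric(y[position])
}

# Hypothetical sample data (assumption); the real data frame has 42061 rows
data <- data.frame(eventLabel = c("1.2.3", "4.x.x", "5.6.7"),
                   stringsAsFactors = FALSE)

# USE.NAMES = FALSE keeps each result a plain, unnamed numeric vector
data$case     <- sapply(data$eventLabel, handle.eventlabel, position = 1, USE.NAMES = FALSE)
data$dialogue <- sapply(data$eventLabel, handle.eventlabel, position = 2, USE.NAMES = FALSE)
data$message  <- sapply(data$eventLabel, handle.eventlabel, position = 3, USE.NAMES = FALSE)

print(data)
# case is 1, 4, 5; dialogue is 2, 0, 6; message is 3, 0, 7
```

One difference worth knowing: the loop zeroes both dialogue and message as soon as either piece is "x", while this helper treats each position on its own. The two agree as long as every label is either fully numeric or of the form case.x.x.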


The sapply version took well under a minute (0.0807 min, about 5 seconds) to reformat the same 42061 rows, roughly 78 times faster than the 6.3 minutes of the for-loop.
That is far quicker.
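If you would like to reproduce the comparison on your own machine, the sketch below times both approaches with system.time. The labels are synthetic and the row count (2000) is an arbitrary assumption chosen to keep the run short, so the timings will not match the figures above.

```r
handle.eventlabel <- function(x, position) {
  y <- unlist(strsplit(x, "[.]"))
  if (y[position] == "x") 0 else as.numeric(y[position])
}

# Synthetic labels (assumption): about 20% look like case.x.x, the rest are numeric
set.seed(1)
n <- 2000
labels <- ifelse(runif(n) < 0.2,
                 paste0(sample(1:50, n, replace = TRUE), ".x.x"),
                 paste(sample(1:50, n, replace = TRUE),
                       sample(1:9, n, replace = TRUE),
                       sample(1:9, n, replace = TRUE), sep = "."))
data <- data.frame(eventLabel = labels, case = NA_real_,
                   dialogue = NA_real_, message = NA_real_,
                   stringsAsFactors = FALSE)

# Row-by-row loop, following the for-loop version in this article
loop_time <- system.time({
  looped <- data
  for (i in seq_len(nrow(looped))) {
    parts <- unlist(strsplit(looped[i, ]$eventLabel, "[.]"))
    looped[i, ]$case <- as.numeric(parts[1])
    if (parts[2] != "x" && parts[3] != "x") {
      looped[i, ]$dialogue <- as.numeric(parts[2])
      looped[i, ]$message <- as.numeric(parts[3])
    } else {
      looped[i, ]$dialogue <- 0
      looped[i, ]$message <- 0
    }
  }
})

# One sapply call per output column, each working on a single vector
sapply_time <- system.time({
  applied <- data
  applied$case     <- sapply(applied$eventLabel, handle.eventlabel, position = 1, USE.NAMES = FALSE)
  applied$dialogue <- sapply(applied$eventLabel, handle.eventlabel, position = 2, USE.NAMES = FALSE)
  applied$message  <- sapply(applied$eventLabel, handle.eventlabel, position = 3, USE.NAMES = FALSE)
})

# Elapsed seconds for each approach; values depend on the machine
print(loop_time["elapsed"])
print(sapply_time["elapsed"])
```

Both versions should fill the three columns with the same values on this synthetic data; only the elapsed time differs.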

For more information about the apply family of functions, please refer to here
