Importing a WhatsApp Conversation in R

I recently saw some people on Reddit analyzing their chat conversations and wanted to try it too. You can export a WhatsApp conversation by sending it to yourself as an email. You will receive a txt file containing the whole conversation. Because it isn’t formatted in a useful way, you have to restructure it yourself. I will do that in this post and analyse the data in a second one, so this will be a bit more technical than usual. You can find the complete code here.

I used the readLines command and set the encoding to UTF-8 because the conversation is in German and contains letters like ö, ä and ü.

whatsapp = readLines('whatsapp analyse/whatsapp.txt',encoding="UTF-8")
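One detail that is easy to miss: the encoding argument of readLines only marks the strings as UTF-8, it does not convert the file. The sketch below illustrates this with a temporary file and a made-up line (both are assumptions for the example, not part of the real export). Once the strings are marked, substr and nchar count characters rather than bytes, which the position-based code in this post relies on.

```r
# Write one made-up chat line to a temporary UTF-8 file.
# \u2013 is the dash used by the export, \u00e4 is the letter "ä".
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8("08.04.15, 00:09 \u2013 Bartholom\u00e4us: brb"), tmp, useBytes = TRUE)

# Read it back; encoding = "UTF-8" marks the strings, it does not re-encode them
lines <- readLines(tmp, encoding = "UTF-8")

print(Encoding(lines))        # how R has marked the string
print(nchar(lines))           # length in characters, not bytes
print(substr(lines, 19, 20))  # first two letters of the name
```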

The format of the lines isn’t too bad; there is a clear structure. It looks like this:

08.04.15, 00:09 – Adrian: Bye

08.04.15, 00:09 – Bartholomäus: brb
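A quick way to see which lines follow this structure is a regular expression. The sketch below uses made-up example lines and assumes the date and time prefix shown above; header_pattern and is_header are names chosen for illustration only.

```r
# Hypothetical example lines; the third one continues the second message.
# \u2013 is the dash used by the export, \u00e4 is the letter "ä".
example_lines <- c(
  "08.04.15, 00:09 \u2013 Adrian: Bye",
  "08.04.15, 00:09 \u2013 Bartholom\u00e4us: brb",
  "and this is a second line of text"
)

# Two digits, dot, two digits, dot, two digits, comma, space, hh:mm
header_pattern <- "^[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}, [0-9]{2}:[0-9]{2}"

# TRUE for lines that start a new message, FALSE for everything else
is_header <- grepl(header_pattern, example_lines)
print(is_header)
```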

The problem is that some lines are just text. That happens when a message contains line breaks, which is common in longer messages. To check every single line, I wrote a loop that tests whether the line is such an extra line. If it is not, I put the different parts (date, time, author, message) into the columns of a matrix. I created two name variables to account for the different lengths of the names.

nameA="Adrian"
nameB="Bartholomäus"
extralines=c()
sorted=matrix(nrow = length(whatsapp), ncol = 4)

lastmsg=0 #row of the most recent line that started a message
for(i in seq_along(whatsapp)) {
#If dates aren't in this format: '08.04.15, 00:09', the substring numbers need to be changed.
#The condition checks for the dot and the comma
    if(substring(whatsapp[i],9,10)==", " && substring(whatsapp[i],3,3)=="."){
      #fill date and time columns
      sorted[i,1]=substr(whatsapp[i],1,8)
      sorted[i,2]=substr(whatsapp[i],11,15)
      lastmsg=i

      #names differ in length; only the first two letters are compared
      if(substr(whatsapp[i],19,20)==substr(nameA,1,2)){
         sorted[i,3]=nameA
         #adds everything after "Name: " to the message column (+2 skips colon and space)
         sorted[i,4]=substring(whatsapp[i],19+nchar(nameA)+2)
      }
      else{
        sorted[i,3]=nameB
        #adds everything after "Name: " to the message column (+2 skips colon and space)
        sorted[i,4]=substring(whatsapp[i],19+nchar(nameB)+2)
      }
    }
  #otherwise record the line in "extralines" and append its text to the most recent message,
  #so that several extra lines in a row all end up in the same message
  else{
    extralines=c(extralines,i)
    if(lastmsg>0) sorted[lastmsg,4]=paste(sorted[lastmsg,4],whatsapp[i],sep=" ")
  }
}

There are probably a lot of ways to improve this and make the script more flexible. As written, it wouldn’t work for group conversations, because it only distinguishes two names.
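One possible direction, as a rough sketch and not the approach used in this post: let a regular expression pick out the author, so that any number of names works. The function parse_chat, the third author "Clara" and the example lines are made up for illustration. The sketch assumes every message starts with the date and time prefix shown above, followed by "Name: ". Lines that don't match, including any header line without an author, are simply appended to the previous message.

```r
# Sketch: parse the lines with one regular expression instead of fixed positions.
parse_chat <- function(lines) {
  # date, time, separator (any run of non-space characters), author, message
  pattern <- "^([0-9]{2}\\.[0-9]{2}\\.[0-9]{2}), ([0-9]{2}:[0-9]{2}) [^ ]+ ([^:]+): (.*)$"
  is_header <- grepl(pattern, lines)

  # Every line belongs to the most recent header line at or before it
  group <- cumsum(is_header)
  keep <- group > 0  # drop text that appears before the first header

  # Join each header line with its continuation lines
  joined <- as.character(tapply(lines[keep], group[keep], paste, collapse = " "))

  data.frame(
    Date    = sub(pattern, "\\1", joined),
    Time    = sub(pattern, "\\2", joined),
    Author  = sub(pattern, "\\3", joined),
    Message = sub(pattern, "\\4", joined),
    stringsAsFactors = FALSE
  )
}

# Hypothetical lines with three authors and one continuation line
example_lines <- c(
  "08.04.15, 00:09 \u2013 Adrian: Bye",
  "08.04.15, 00:10 \u2013 Bartholom\u00e4us: brb",
  "see you: tomorrow",
  "08.04.15, 00:11 \u2013 Clara: ok"
)
chat <- parse_chat(example_lines)
print(chat)
```

Because the author is everything up to the first colon after the time, names that themselves contain a colon would be cut off. For ordinary names that should not matter.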

As a final step I removed the extra lines and saved the result as a data frame. Now the analysis can start.

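#keep only rows that are not extra lines; setdiff also works when "extralines" is empty
data=data.frame(sorted[setdiff(seq_len(nrow(sorted)),extralines),,drop=FALSE])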
colnames(data)=c("Date","Time","Author","Message")
write.csv2(data,file="wazapdata.csv",fileEncoding = "UTF-8")
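For the analysis it will probably help to turn the Date and Time columns into real timestamps. A minimal sketch, using a small made-up data frame in the same shape as the one above; the Timestamp column and the UTC time zone are assumptions, since the export itself carries no time zone information.

```r
# Small hypothetical data frame with the same columns as the exported one
example <- data.frame(
  Date = c("08.04.15", "09.04.15"),
  Time = c("00:09", "13:30"),
  Author = c("Adrian", "Adrian"),
  Message = c("Bye", "Hi"),
  stringsAsFactors = FALSE
)

# Combine both columns and parse them as day.month.year hour:minute
example$Timestamp <- as.POSIXct(
  paste(example$Date, example$Time),
  format = "%d.%m.%y %H:%M",
  tz = "UTC"  # assumption: the real time zone is not stored in the file
)

print(format(example$Timestamp, "%Y-%m-%d %H:%M"))
```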

I’ll create a separate post for the analysis. If you want to see a different approach to the same problem, you can find it here. (Unfortunately, I only found it after finishing my script, so mine is quite different.)