PYTHON: Creating a Facebook Chat Dataset
Sometimes when you are working with machine learning, one of the major problems you face is getting a dataset big enough to train a neural network or achieve a good regression. When you try to obtain such data, you can find websites dedicated to selling consumer information, but the prices are steep and the data is often too specific.
A good source of data that almost everyone has is their own data. In this case, we are going to use Facebook chat history to generate a dataset; the good news is that most of you have been producing this dataset full time for years.
First, get the Facebook Chat Downloader add-on for Chrome, then follow its instructions to download your chat history in HTML format.
The HTML for hundreds or thousands of messages needs to be turned into a convenient format such as a CSV (Comma-Separated Values) file, so you can manage it easily with any tool that supports CSV import, such as MATLAB.
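As a sketch of the target format, each chat message becomes one semicolon-delimited row laid out as YYYY;MM;DD;HH;MM;MESSAGE. The file name and message values below are illustrative only:

```python
import csv

# Illustrative only: write two rows in the YYYY;MM;DD;HH;MM;MESSAGE
# layout the script below produces. The file name is a placeholder.
with open("example_output.csv", "w", newline="") as f:
    writer = csv.writer(f, dialect="excel", delimiter=";")
    writer.writerow(["2016", "05", "12", "18", "43", "See you tomorrow"])
    writer.writerow(["2016", "05", "12", "18", "44", "Sure, same place"])

# Read the rows back to confirm the layout round-trips cleanly.
with open("example_output.csv", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        print(row)
```

The semicolon delimiter keeps commas inside message text from needing quoting in most cases, and MATLAB's CSV import handles it directly.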
The following Python script generates a CSV file from the HTML-exported Facebook chat history.
The script can also be downloaded through THIS LINK
# Script to make CSV files from Messenger conversations obtained through
# the Facebook Chat Downloader Add-on.
# This script is designed for huge, GB-sized files.
# Usage: python RawToCsv.py conversation_input.html conversation_output.csv
import sys
import csv

BUFFER_SIZE = 1024
# NOTE: set these to the byte sequences the add-on uses to separate
# messages and message parts in the exported HTML (the original values
# were HTML tags and were lost when this post was published).
MESSAGE_LIMITER = b""
PART_LIMITER = b""


def read_in_chunks(file_object, chunk_size=BUFFER_SIZE):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: BUFFER_SIZE."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def write_csv_file(writer, chunk):
    messages = chunk.split(MESSAGE_LIMITER)
    # Process every complete message; the last fragment may be cut
    # mid-message, so it is returned and prepended to the next chunk.
    for msg in messages[:-1]:
        parts = msg.split(PART_LIMITER)
        date, time = parts[0].split(b' ')
        date_s = date.split(b'-')
        time_s = time.split(b':')
        # Save every message line as an entry
        # Format: YYYY;MM;DD;HH;MM;MESSAGE
        for x in range(2, len(parts)):
            row = [date_s[0].decode("utf-8"), date_s[1].decode("utf-8"),
                   date_s[2].decode("utf-8"), time_s[0].decode("utf-8"),
                   time_s[1].decode("utf-8"),
                   parts[x].decode("utf-8", "ignore")]
            writer.writerow(row)
    # Return the last (possibly incomplete) part as the buffer for the
    # next cycle
    return messages[-1]


def main(argv):
    if len(argv) != 3:
        print("Usage: python RawToCsv.py input.html output.csv")
        sys.exit(-1)
    input_file_name = argv[1]
    output_file_name = argv[2]
    with open(output_file_name, 'w', newline='') as csv_file:
        msg_writer = csv.writer(csv_file, dialect='excel', delimiter=';')
        middle_buffer = b""
        with open(input_file_name, 'rb') as i_file:
            # Skip the initial head and body declarations of the export
            i_file.seek(41)
            for chunk in read_in_chunks(i_file):
                middle_buffer = write_csv_file(msg_writer,
                                               middle_buffer + chunk)


if __name__ == "__main__":
    main(sys.argv)
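Once the CSV exists, loading it back for training is straightforward. A minimal sketch, assuming a file produced by the script above (the function name `load_messages` is just an illustration, not part of the script):

```python
import csv

# Minimal sketch: read the semicolon-delimited rows back into
# ((year, month, day, hour, minute), message) tuples for later
# processing, e.g. feeding a training pipeline.
def load_messages(path):
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=";"):
            year, month, day, hour, minute, message = row
            yield (year, month, day, hour, minute), message
```

Because it is a generator, `load_messages` streams rows one at a time and stays memory-friendly even for the large exports the script is designed for.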
