PYTHON: Creating a Facebook Chat Dataset

When you are working with machine learning, one of the major problems you face is getting a dataset big enough to train a neural network or fit a good regression. If you go looking for such data, you can find websites dedicated to selling consumer information, but the prices are very high and the data is sometimes too specific.


A good source of data that almost everyone has is their own. In this case, we are going to use Facebook chat history to generate a dataset. The good news is that most of you have been working on this dataset full time for many years.

First, you need to get the Facebook Chat Downloader add-on for Chrome, then follow its instructions to download the chat history in HTML format.

The HTML for hundreds or thousands of messages needs to be turned into a convenient format such as a CSV (comma-separated values) file, so you can manage it easily with MATLAB, which has CSV import support.

The following Python script generates a CSV file from the HTML-exported Facebook chat history.

# Script to make CSV files from Messenger conversations obtained through the Facebook Chat Downloader add-on
# The input is read in small chunks, so the script also works on very large (multi-GB) files
# Usage: python RawToCsv.py conversation_input.html conversation_output.csv
import sys
import csv

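BUFFER_SIZE = 1024  # bytes read per chunk
# NOTE: both delimiters are empty here, and bytes.split() raises ValueError on an
# empty separator. Set them to the markup that separates messages, and the parts
# of a message, in your exported HTML before running the script.
MESSAGE_LIMITER = b""
PART_LIMITER = b""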


def read_in_chunks(file_object, chunk_size=BUFFER_SIZE):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: BUFFER_SIZE."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def write_csv_file(writer, chunk):
    """Write every complete message in chunk as CSV rows.

    Returns the trailing, possibly incomplete, message so the caller can
    prepend it to the next chunk."""
    messages = chunk.split(MESSAGE_LIMITER)

    # The last element may be cut off mid-message, so skip it here
    for msg in messages[:-1]:
        parts = msg.split(PART_LIMITER)
        date, time = parts[0].split(b' ')
        date_s = [field.decode("utf-8") for field in date.split(b'-')]
        time_s = [field.decode("utf-8") for field in time.split(b':')]

        # Save every message line as an entry
        # Format: YYYY;MM;DD;HH;MM;MESSAGE
        for part in parts[2:]:
            row = [date_s[0], date_s[1], date_s[2], time_s[0], time_s[1],
                   part.decode("utf-8", "ignore")]
            writer.writerow(row)

    # Return the last part of the list for buffer in the next cycle
    return messages[-1]


def main(argv):
    if len(argv) != 3:
        print("Invalid number of arguments")
        print("Usage: python RawToCsv.py conversation_input.html conversation_output.csv")
        sys.exit(1)

    input_file_name = argv[1]
    output_file_name = argv[2]

    middle_buffer = b""
    # newline='' stops the csv module from writing blank lines on Windows
    with open(input_file_name, 'rb') as i_file, \
            open(output_file_name, 'w', newline='', encoding='utf-8') as csv_file:
        msg_writer = csv.writer(csv_file, dialect='excel', delimiter=';')
        # Jump initial body and head declarations
        i_file.seek(41)
        for chunk in read_in_chunks(i_file):
            middle_buffer = write_csv_file(msg_writer, middle_buffer + chunk)

if __name__ == "__main__":
    main(sys.argv)
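
The trick that lets the script handle huge files is the carry-over buffer: a chunk can end in the middle of a message, so the last (possibly incomplete) piece is kept and glued to the front of the next chunk. Here is a minimal sketch of that idea on its own. The `|` delimiter, the `split_stream` name and the tiny chunk size are assumptions made up for this demo, not values from the real export.

```python
import io

# Hypothetical delimiter, chosen only for this demo; the real export uses
# whatever markup the downloader emits.
DELIMITER = b"|"


def split_stream(stream, chunk_size=4, delimiter=DELIMITER):
    """Yield complete records from a binary stream read in small chunks."""
    carry = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pieces = (carry + chunk).split(delimiter)
        # Every piece except the last one is followed by a delimiter,
        # so it is known to be complete.
        yield from pieces[:-1]
        # The last piece may be cut mid-record: keep it for the next round.
        carry = pieces[-1]
    if carry:
        # Trailing data that was never closed by a delimiter.
        yield carry


records = list(split_stream(io.BytesIO(b"hello|big world|bye|")))
print(records)
```

Even with 4-byte chunks, "big world" comes out in one piece, because its first half waits in `carry` until the rest arrives. The same logic also covers a multi-byte delimiter that gets split across two chunks.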

The script can also be downloaded through THIS LINK 
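
Once you have the CSV, you do not have to leave Python to take a first look at it. The sketch below counts messages per month using the column layout the script writes (YYYY;MM;DD;HH;MM;MESSAGE). The sample rows and the `messages_per_month` helper are made up for illustration, and an in-memory file stands in for the real output.

```python
import csv
import io
from collections import Counter

# Made-up sample in the script's output layout: YYYY;MM;DD;HH;MM;MESSAGE
SAMPLE = (
    "2016;03;14;09;26;hello\n"
    "2016;03;14;09;27;how are you?\n"
    '2016;04;01;18;05;"see you; later"\n'
)


def messages_per_month(csv_text_file):
    """Count rows per (year, month) in a semicolon-delimited chat CSV."""
    counts = Counter()
    reader = csv.reader(csv_text_file, delimiter=";")
    for year, month, day, hour, minute, message in reader:
        counts[(year, month)] += 1
    return counts


counts = messages_per_month(io.StringIO(SAMPLE))
for (year, month), total in sorted(counts.items()):
    print(year, month, total)
```

For a real file, replace the `io.StringIO(SAMPLE)` argument with a file opened as `open("conversation_output.csv", newline="", encoding="utf-8")`. Messages that contain a semicolon are quoted by the CSV writer, so the reader still returns them as a single field.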
