Thursday, March 28, 2024

Python pandas *.csv import and export differences

As i am working on a project involving *.csv data files using the Python programming language and pandas a march 2024 blogpost about this topic. The data gets imported in Python pandas dataframes and later also saved to *.csv files. I hope to post later more details about the resulting program (I also have plans to post the final project on my GitHub)  
When checking details i noticed differences between the imported and exported data. When reading files and writing them back i immediately noticed the difference in file length. Recognizing the problem (more about them later) was not extreme difficult.

Python pandas
Python pandas

Illustration made with help of  NightCafe AI Art Creator

Solving the issue did cost more time. I wasted a lot of time with trying the solutions provided by ChatGTP, giving feedback about the problems however it did not solve my problem. By searching on internet i learned i was not the only one with the problem. However finding a solution was more difficult. In fact there where two problems in the *.csv

Missing double quotes ""

All data in the *.csv file where strings surrounded by double quotes " . When writing the *.csv file back the double quotes where gone.

Solution: When writing using the to_csv() use the option    quoting=csv.QUOTE_ALL 

NaN (or Null) values

The *.csv file contained items as "" and "None" . During reading of the data all both these values are converted to NaN in the Python panda dataframes. NaN ( in databases known as Null ) values are missing or not assigned values. 
For my project i needed to keep "" and "None" when writing back the data. Solutions provided mostly changed both values to "" or "None".  ChatGPT could not help me, however it (or he?) knew the answer.  When entered the solution that i found the response was the explanation of my solution. 

Solution: When reading using read_csv() use the option     keep_default_na=False

From the documentation on  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html  :
By default the following values are interpreted as NaN: “ “, “#N/A”, “#N/A N/A”, “#NA”, “-1.#IND”, “-1.#QNAN”, “-NaN”, “-nan”, “1.#IND”, “1.#QNAN”, “<NA>”, “N/A”, “NA”, “NULL”, “NaN”, “None”, “n/a”, “nan”, “null “.

ChatGPT provided only solutions with the na_values options, However if keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

The final code

Important parts of Python ( pandas ) code used

import pandas as pd

my_import_df = pd.read_csv(csv_import_file, encoding='latin-1', keep_default_na=False, dtype=object)

my_export_df.to_csv( csv_export_file, quoting=csv.QUOTE_ALL, index=False)

Hoping to post something about my first pre-release version with some first functionalities of my project soon.

No comments: