Reading a CSV from a file with Python and Pandas is super simple and something that you are likely to have to do many times as a data scientist.
As an example let's read an image dataset I gathered of all the paintings I could find about the Nativity.
I will not go into detail on how I was able to gather this dataset, but if you are curious I have previously created a video showing how I gathered images using Panda data frames and BeautifulSoup.
You can find it here:
Let’s first import the pandas library so we can read our CSV:
To understand how to read CSV using pandas.read_csv, let’s use Python’s help function:
help(pandas.read_csv)Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer:Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal:str='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
Read a comma-separated values (csv) file into DataFrame.
Also supports optionally iterating or breaking of the file
Additional help can be found in the online docs for
`IO Tools <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>`_.
filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid
URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is
expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any ``os.PathLike``.
By file-like object, we refer to objects with a ``read()`` method, such as
a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect
the separator, but the Python parsing engine can, meaning the latter will
be used and automatically detect the separator by Python's builtin sniffer
tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
different from ``'\s+'`` will be interpreted as regular expressions and
will also force the use of the Python parsing engine. Note that regex
delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
delimiter : str, default ``None``
Alias for sep.
header : int, list of int, default 'infer'
Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names
are passed the behavior is identical to ``header=0`` and column
names are inferred from the first line of the file, if column
names are passed explicitly then the behavior is identical to
``header=None``. Explicitly pass ``header=0`` to be able to
replace existing names. The header can be a list of integers that
specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so ``header=0`` denotes the first line of
data rather than the first line of the file.
There are quite a lot of parameters that pandas.read_csv will accept, for many use-cases that you might encounter.
For now, let’s keep it simple and just read our CSV:
pd = pandas.read_csv("nativity_dataset.csv")
It was that easy! You will notice that I used the display function to show the panda data frame with a nice look and feel. You can use the print() function instead, but it will not look so pretty:
print(pd)Image URL ... Labels 0 https://d3d00swyhr67nd.cloudfront.net/w1200h12... ... NaN 1 https://d3d00swyhr67nd.cloudfront.net/w1200h12... ... NaN 2 https://d3d00swyhr67nd.cloudfront.net/w1200h12... ... NaN 3 https://d3d00swyhr67nd.cloudfront.net/w1200h12... ... NaN 4 https://d3d00swyhr67nd.cloudfront.net/w944h944... ... NaN .. ... ... ... 215 https://upload.wikimedia.org/wikipedia/commons... ... NaN 216 https://preview.redd.it/4tc6qepayw521.jpg?widt... ... NaN 217 https://upload.wikimedia.org/wikipedia/commons... ... NaN 218 https://upload.wikimedia.org/wikipedia/commons... ... NaN 219 https://upload.wikimedia.org/wikipedia/commons... ... NaN [220 rows x 3 columns]
You don’t always have to use the display() function. In certain conditions, you can omit it.
What happened there?¶
The python notebook called the display function for us because it was the last value in the python block.
If you don’t like the column headers you can easily change them when reading the CSV:
pd = pandas.read_csv("nativity_dataset.csv", names=["Precise Image URL", "Precise Source URL", "Precise Labels"])
It was almost what we wanted. But seems like the previous column names are still there. No worries, it is easy to fix, using header parameter.
pd = pandas.read_csv("nativity_dataset.csv", header=0, names=["Precise Image URL", "Precise Source URL", "Precise Labels"])
You will notice that we have a column without any data: (“Precise Labels”). Let’s delete it.
del pd["Precise Labels"]
The column is gone. Let’s save the panda data frame to a CSV
That was easy, right? There is a lot more that we can do with pandas and CSVs but I am sure this will help you get started.
That will be all for now. Happy Coding!