
Create NumPy array from Text file

1. Intro

NumPy provides helper functions for creating arrays from text files such as CSV and TSV. Real-world data often lives in files, so these functions can cut development and analysis time considerably.

numpy.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)

The numpy.loadtxt() function is an efficient way to load data from text files in which every row has the same number of values.
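As a quick illustration of the defaults in the signature above, here is a minimal sketch. It assumes an in-memory io.StringIO object as a stand-in for a file on disk, and the sample values are made up for the example.

```python
import io

import numpy as np

# In-memory stand-in for a text file (hypothetical sample data).
text = io.StringIO("# rainfall sample\n1.0 2.0 3.0\n4.0 5.0 6.0\n")

# With delimiter=None, values are split on whitespace, and lines
# starting with the comments character '#' are ignored.
sample = np.loadtxt(text)

print("Shape: ", sample.shape)
print("Data Type: ", sample.dtype.name)
```

Because dtype defaults to float, the values come back as float64 unless another dtype is requested.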

2. NumPy array from CSV file

We have a CSV file with Delhi rainfall data in millimeters for every month of years 2017 and 2018.

CSV file

12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

We will create a NumPy array from a CSV file using the numpy.loadtxt() method. The method takes a delimiter argument, which makes it flexible enough to handle differently delimited files.

#%%
import numpy as np

# Create an array from rain-fall.csv, keeping rainfall data in mm
array_rain_fall = np.loadtxt(fname="rain-fall.csv", delimiter=",")
print("NumPy array: \n", array_rain_fall)
print("Shape: ", array_rain_fall.shape)
print("Data Type: ", array_rain_fall.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
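The cell above expects rain-fall.csv to exist in the working directory. To try it without the file, the following sketch writes the same two rows to a temporary directory first and then loads them. The temporary path and the yearly_totals calculation are additions for illustration only.

```python
import os
import tempfile

import numpy as np

# The two rows of rainfall data shown in the CSV file above.
csv_text = (
    "12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5\n"
    "13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; any writable path would do.
    csv_path = os.path.join(tmp_dir, "rain-fall.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write(csv_text)
    rain_fall = np.loadtxt(fname=csv_path, delimiter=",")

# One total per row, i.e. per year.
yearly_totals = rain_fall.sum(axis=1)
print("Shape: ", rain_fall.shape)
print("Yearly totals (mm): ", yearly_totals)
```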

2.1 Error when rows have different column counts

When creating a NumPy array with the numpy.loadtxt() method, make sure every CSV row has the same number of columns; otherwise the call fails with an error.

Here we call numpy.loadtxt() on rain-fall-wrong.csv, whose rows have different column counts.

#%%
# Check error when different column counts in rows
array_rain_fall_wrong = np.loadtxt(
    fname="rain-fall-wrong.csv", delimiter=","
)

OUTPUT:

ValueError: Wrong number of columns at line 2
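If a file might be malformed, the error can be caught instead of stopping the program. This is a minimal sketch, assuming an in-memory io.StringIO with made-up ragged rows in place of rain-fall-wrong.csv. The exact wording of the error message differs between NumPy versions, so the code only relies on the exception type.

```python
import io

import numpy as np

# Hypothetical ragged data: the second row has one value fewer.
ragged = io.StringIO("1.0, 2.0, 3.0\n4.0, 5.0\n")

error_message = None
try:
    np.loadtxt(ragged, delimiter=",")
except ValueError as exc:
    # loadtxt raises ValueError when the column count is inconsistent.
    error_message = str(exc)
    print("Could not load data:", error_message)
```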

2.2 Skipping rows and columns in CSV

We can skip rows and columns while creating a NumPy array from a CSV file. This is useful when the CSV file contains row and column names.

To do so, we pass the skiprows and usecols arguments to the loadtxt() method.

rain-fall-row-col-names.csv file:

Year, Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec
2017, 12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5
2018, 13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5

#%%
# Skip first row and first column
array_rain_fall_named = np.loadtxt(
    fname="rain-fall-row-col-names.csv",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_named)
print("Shape: ", array_rain_fall_named.shape)
print("Data Type: ", array_rain_fall_named.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
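The long usecols tuple can also be written as a range, and the skipped Year column can be read in a separate call. The sketch below assumes an in-memory io.StringIO copy of rain-fall-row-col-names.csv. The variable names are chosen for this example only.

```python
import io

import numpy as np

# In-memory copy of rain-fall-row-col-names.csv from above.
csv_text = (
    "Year, Jan, Feb, Mar, Apr, May, Jun, July, Aug, Sep, Oct, Nov, Dec\n"
    "2017, 12.0, 12.0, 14.0, 16.0, 19.0, 12.0, 11.0, 14.0, 17.0, 19.0, 11.0, 11.5\n"
    "2018, 13.0, 11.0, 13.5, 16.7, 15.0, 11.0, 12.0, 11.0, 19.0, 18.0, 13.0, 12.5\n"
)

# range(1, 13) selects columns 1 through 12, i.e. Jan to Dec.
monthly = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=range(1, 13)
)

# Read only the first column (the years) as integers.
years = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=0, dtype=int
)

print("Shape: ", monthly.shape)
print("Years: ", years)
```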

2.3 Create NumPy array with GZipped file

Gzip is helpful for reducing the size of files, especially text files. When the file name ends in .gz, numpy.loadtxt() decompresses the file first and then processes it as usual.

This works for delimited text files with any delimiter.

#%%
# Create array from gzipped csv
array_rain_fall_zip = np.loadtxt(
    fname="rain-fall-row-col-names.csv.gz",
    delimiter=",",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_zip)
print("Shape: ", array_rain_fall_zip.shape)
print("Data Type: ", array_rain_fall_zip.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
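To experiment without an existing archive, a gzipped CSV can be created with Python's gzip module and then loaded by file name. This is a sketch under assumptions: the file name sample.csv.gz, the temporary directory, and the shortened three-month data are all hypothetical.

```python
import gzip
import os
import tempfile

import numpy as np

# Shortened sample data (three months only), for illustration.
csv_text = (
    "Year, Jan, Feb, Mar\n"
    "2017, 12.0, 12.0, 14.0\n"
    "2018, 13.0, 11.0, 13.5\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")
    # Write the text through gzip so the file on disk is compressed.
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        handle.write(csv_text)
    # The .gz extension tells loadtxt to decompress before parsing.
    zipped = np.loadtxt(gz_path, delimiter=",", skiprows=1, usecols=(1, 2, 3))

print("Shape: ", zipped.shape)
```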

3. Create NumPy array from TSV

TSV (Tab Separated Values) files store plain text in tabular form, with a tab character between values. We create a NumPy array from a TSV file by passing "\t" as the delimiter argument to the numpy.loadtxt() method.

#%%
# Create array from tsv files
array_rain_fall_tab = np.loadtxt(
    fname="rain-fall-row-col-names.tsv",
    delimiter="\t",
    skiprows=1,
    usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
)
print("NumPy array: \n", array_rain_fall_tab)
print("Shape: ", array_rain_fall_tab.shape)
print("Data Type: ", array_rain_fall_tab.dtype.name)

OUTPUT:

NumPy array: 
 [[12. 12. 14. 16. 19. 12. 11. 14. 17. 19. 11. 11.5]
 [13. 11. 13.5 16.7 15. 11. 12. 11. 19. 18. 13. 12.5]]
Shape:  (2, 12)
Data Type:  float64
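The same idea can be tried without a .tsv file on disk. The sketch below assumes an in-memory io.StringIO with a shortened, made-up three-month table. Only the delimiter differs from the CSV examples.

```python
import io

import numpy as np

# Shortened tab-separated sample (hypothetical data).
tsv_text = (
    "Year\tJan\tFeb\tMar\n"
    "2017\t12.0\t12.0\t14.0\n"
    "2018\t13.0\t11.0\t13.5\n"
)

# "\t" tells loadtxt to split each line on tab characters.
tab_sample = np.loadtxt(
    io.StringIO(tsv_text), delimiter="\t", skiprows=1, usecols=(1, 2, 3)
)

print("Shape: ", tab_sample.shape)
print("Data Type: ", tab_sample.dtype.name)
```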

4. Conclusion

In this tutorial, we learned key techniques for creating a NumPy array from data stored in plain text files such as CSV and TSV. These methods are handy both for data exploration and for developing programs.

Please download source code related to this tutorial here. You can run the Jupyter notebook for this tutorial here.
