Using files of text¶
In order to work with the content of a file, a Python program needs to access this content. The program needs access to the file and a way of reading the content into the program, whether as a whole or piecewise. The same goes for writing to a file.
In this short notebook we focus on files of text whose content is read into the program as strings. Writing works the same way: the program writes strings to the file. If you need Python’s documentation, you can find it here.
We include a section that shows the tools you can find in Python to break up strings into interesting substrings (for example words!).
Reading from a file of text¶
You need to know the name of the file (including the path relative to the folder where the program is running). Then you can just use
file_variable = open(file_name, 'r')
in order to have a variable in your program to work with the file. The 'r' means that you can use your file variable with methods that read from the file.
You can choose to read line by line to get one string per line using
line = file_variable.readline()
If you want to read all lines from a file you can use a for-loop as follows:
for line in file_variable:
    # do what you need to do with the string in line
    pass
Or, you can read all of the content in the file at once:
text = file_variable.read()
When you are done reading you should close the file:
file_variable.close()
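Python also offers the with statement, which closes the file for you when the indented block ends, even if an error occurs along the way. The following is a minimal sketch of the three ways of reading described above. To keep it self-contained it first creates a small file in a temporary folder; the name 'demo-read.txt' and its content are made up for illustration and are not among the files used in this notebook.

```python
import os
import tempfile

# Create a small example file in a temporary folder (hypothetical name).
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, 'demo-read.txt')
with open(file_name, 'w') as f:
    f.write('first line\nsecond line\n')

# Read only the first line; the string keeps its trailing newline.
with open(file_name, 'r') as f:
    first = f.readline()

# Read all lines, one string per line, with a for-loop.
lines = []
with open(file_name, 'r') as f:
    for line in f:
        lines.append(line)

# Read the whole content at once as a single string.
with open(file_name, 'r') as f:
    text = f.read()

print(repr(first))
print(lines)
print(repr(text))
```

Each with block opens the file again, so every way of reading starts from the beginning of the file.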
In order to write to a file you need to open it for writing as in
file_variable = open(file_name, 'w')
and you can use
file_variable.write('the string you want to add to the file')
When you are done writing you should close the file:
file_variable.close()
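Two details are worth knowing when writing: opening with 'w' empties a file that already exists, and write does not add a newline at the end of the string. The sketch below illustrates both, and also shows the mode 'a', which adds to the end of an existing file instead. The file name 'demo-write.txt' is an assumption made for this illustration.

```python
import os
import tempfile

# Work in a temporary folder with a hypothetical file name.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, 'demo-write.txt')

# 'w' creates the file, or empties it if it already exists.
f = open(file_name, 'w')
f.write('one\n')     # write() does not add a newline for you
f.write('two')       # so this string and the next end up on the same line
f.write(' three\n')
f.close()

# 'a' opens the file for appending: new text goes at the end.
f = open(file_name, 'a')
f.write('four\n')
f.close()

# Read the file back to check what was written.
f = open(file_name, 'r')
content = f.read()
f.close()
print(repr(content))
```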
Here are a couple of examples. To run them, the files we use need to be in the same folder as this notebook!
f = open('text-test', 'r')
txt = f.read()
f.close()
# We both print the string txt and ask for its value.
# The value shows \n wherever there is a newline; print turns each \n into a line break.
print(txt)
txt
f = open('text-test', 'r')
for line in f:
    print(line)
f.close()
This is a file of text. In this text there are some sentences organised in paragraphs. We are interested in counting the number of sentences.
This second paragraph is here to be able to have an extra line. There are sentences in this line too.
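Each line read from a file keeps its newline character at the end, so print(line) produces an empty line after every line of the file. Here is a small sketch of two common ways around this. It uses a list of made-up strings as a stand-in for the lines of a file, so that it runs without any file.

```python
# Stand-in for the strings you would get from 'for line in f'.
lines = ['first line\n', 'second line\n']

# Option 1: remove the trailing newline before using the string.
stripped = [line.rstrip('\n') for line in lines]
for line in stripped:
    print(line)

# Option 2: keep the string as it is and tell print not to add a newline.
for line in lines:
    print(line, end='')
```

The first option is usually the more convenient one when you want to go on working with the strings.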
# The file 'output-text' does not need to exist before running this code
f = open('output-text', 'w')
f.write('Adding output! \n\n')
f.write(txt)
f.write('That is it!')
f.close()
Splitting strings¶
It is often the case that you want to analyse the content of a text file. You might want to grab the sentences or the words, for example. The module for regular expressions offers a lot of help with this. Here we present an example of how to use this module.
To get access to the functions and methods in the module you need to import it:
import re
The function re.findall(pattern, text) returns a list with all the strings in the text that have the form given in the pattern. The pattern is given as a so-called regular expression.
In Python you write down a pattern as a string.
The string '[a-z]' means any letter between 'a' and 'z'.
The string '[a-z]+' means any string that uses at least one letter between 'a' and 'z'.
If you want to include capital letters you can write '[a-zA-Z]' in the pattern.
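To see the difference between these patterns, it can help to try them on a short string first. The strings below are made up for illustration.

```python
import re

sample = 'Hello World, hello again'

# '[a-z]' matches one lowercase letter at a time.
single_letters = re.findall(r'[a-z]', 'abC')

# '[a-z]+' matches runs of lowercase letters, so capital letters break the words.
lowercase_runs = re.findall(r'[a-z]+', sample)

# '[a-zA-Z]+' matches runs of letters of either case.
words = re.findall(r'[a-zA-Z]+', sample)

print(single_letters)   # ['a', 'b']
print(lowercase_runs)   # ['ello', 'orld', 'hello', 'again']
print(words)            # ['Hello', 'World', 'hello', 'again']
```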
So, here is a piece of code to get the list of words in the file we read before:
import re
re.findall(r'[a-zA-Z]+', txt)
['This',
'is',
'a',
'file',
'of',
'text',
'In',
'this',
'text',
'there',
'are',
'some',
'sentences',
'organised',
'in',
'paragraphs',
'We',
'are',
'interested',
'in',
'counting',
'the',
'number',
'of',
'sentences',
'This',
'second',
'paragraph',
'is',
'here',
'to',
'be',
'able',
'to',
'have',
'an',
'extra',
'line',
'There',
'are',
'sentences',
'in',
'this',
'line',
'too']
And now sentences! Count the number of full stops!
len(re.findall(r'.', txt))
243
This was not what we expected!
This is because ‘.’ in regular expressions stands for any character (except new lines)!
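In order to count full stops in the text we need to write \. in the pattern, so that the full stop is taken literally.
We keep the r before the pattern. In a raw string Python leaves the backslash alone and passes both characters, '\' and '.', on to the regular expression module, which reads them together as a literal full stop. Without the r the pattern still works, but recent versions of Python warn about the invalid escape sequence '\.'.
len(re.findall(r'\.', txt))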
5
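If you want the sentences themselves rather than their number, one possible approach is re.split, which cuts a string at every match of a pattern. The sketch below writes the full stop as the character class '[.]', another way of asking for a literal full stop. The string txt is a stand-in typed in from the output shown earlier, so its line breaks may differ from those in the real file.

```python
import re

# Stand-in for the content of the file, copied from the output above.
txt = ('This is a file of text. In this text there are some sentences '
       'organised in paragraphs. We are interested in counting the number '
       'of sentences.\n'
       'This second paragraph is here to be able to have an extra line. '
       'There are sentences in this line too.\n')

# Cut at each full stop together with any whitespace that follows it.
pieces = re.split(r'[.]\s*', txt)

# The text ends with a full stop, so the last piece is empty; drop empty pieces.
sentences = [s for s in pieces if s]

for s in sentences:
    print(s)
print(len(sentences))
```

This simple rule treats every full stop as the end of a sentence, so abbreviations or decimal numbers in a text would be split as well.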