Intro
MongoDB is a NoSQL database program that stores JSON-like documents with optional schemas. It is a source-available, cross-platform database and one of the most widely used NoSQL database engines. I started learning Python for several reasons. One of them is that I want to insert web-crawled datasets (text data, image URLs, and so on) into a NoSQL database.
MongoDB Installation
I highly recommend MongoDB Atlas. Since the cloud is rapidly taking over the IT industry, it is better to practice with a good cloud product like MongoDB Atlas. It offers a free tier and runs on the major cloud providers: AWS, GCP, and Azure. Details are at the following link: https://www.mongodb.com/cloud/atlas
Note - Network Access
If you follow the instructions, building a MongoDB cluster is not difficult. When setting up Network Access, though, it is a bit confusing where to click. For educational purposes, click ALLOW ACCESS FROM ANYWHERE.
After that, users can reach the MongoDB cluster in the cloud from any IP address.
Python Connecting to MongoDB Cluster
There are many ways to connect from different languages. Fortunately for Python users, the Atlas platform provides sample code.
Now, it's time to code in Python.
Python module Installation
Run the commands below in a terminal to install the modules.
$ pip install pymongo
$ pip install dnspython
$ pip install motor
If you are using Python 3, you may need pip3 instead of pip.
The details are explained at https://pypi.org/project/pymongo/
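Before moving on, it can help to confirm that the three packages are visible to the interpreter you are running. The sketch below is only one way to check, using the standard library; the helper name missing_modules is made up for this example. Note that dnspython is imported under the name dns.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# dnspython installs a module called "dns", so that is the name to look up.
print(missing_modules(["pymongo", "dns", "motor"]))
```

An empty list means everything was found; any name that is printed still needs to be installed with pip.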
With pymongo.MongoClient("your_uri"), users can reach MongoDB.
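If the user name or password contains characters such as @ or /, they need to be percent-escaped before they go into the URI. The following is a minimal sketch of building the URI that way. The helper build_atlas_uri, the user name, the password, and the host cluster0.example.mongodb.net are all placeholders invented for illustration, not values from this post.

```python
from urllib.parse import quote_plus


def build_atlas_uri(user, password, host, database="test"):
    """Return a mongodb+srv:// URI with percent-escaped credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?retryWrites=true&w=majority"
    )


# Placeholder credentials and host, for illustration only.
uri = build_atlas_uri("evan", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```

The resulting string can then be passed to pymongo.MongoClient in place of "your_uri".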
Error Handling 1 - dnspython
For me, the most difficult part was dealing with dnspython. If you run into a dnspython error, look for a solution on the Driver Compatibility page.
Error Handling 2 - Python Error Certificate Verify Failed
When you try to fetch data with a query, a Certificate Verify Failed message may pop up. In that case, see the article How To Fix Python Error Certificate Verify Failed: Unable To Get Local Issuer Certificate In Mac OS.
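Another option that sometimes resolves this error is to point PyMongo at the CA bundle shipped with the certifi package. This is an alternative to the fix in the linked article, not the same method, and it assumes certifi is installed (pip install certifi). The function name connect_with_certifi is made up for this sketch.

```python
def connect_with_certifi(uri):
    """Open a MongoClient that trusts certifi's CA bundle.

    Assumes the third-party packages pymongo and certifi are installed.
    """
    import certifi
    import pymongo

    # tlsCAFile tells PyMongo which certificate authorities to trust.
    return pymongo.MongoClient(uri, tlsCAFile=certifi.where())
```

Calling connect_with_certifi(url_path) would then replace the plain pymongo.MongoClient(url_path) call.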
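import pprint

import pymongo

# dnspython must be installed for mongodb+srv:// URIs to resolve.
# Replace <user> and <password> with your own database user's credentials.
url_path = "mongodb+srv://<user>:<password>@tafuncluster-rmc7j.gcp.mongodb.net/test?retryWrites=true&w=majority"
client = pymongo.MongoClient(url_path)

# Select the database and the collection, then print one document.
db = client['sample_dataset_R']
collection = db['iris']
pprint.pprint(collection.find_one())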
{'Petal_Length': 1.4,
'Petal_Width': 0.2,
'Sepal_Length': 5.1,
'Sepal_Width': 3.5,
'Species': 'setosa',
'_id': ObjectId('5db94dbfe6188c5426265283')}
From MongoDB to Pandas
Data scientists and data analysts mostly work with data as a DataFrame, not as JSON documents. So converting the documents to a DataFrame with Pandas is another important step. Let's try the sample code below. First, let's print one document again.
pprint.pprint(collection.find_one())
{'Petal_Length': 1.4,
'Petal_Width': 0.2,
'Sepal_Length': 5.1,
'Sepal_Width': 3.5,
'Species': 'setosa',
'_id': ObjectId('5db94dbfe6188c5426265283')}
At this point we don't need the _id field, so we will exclude that column.
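import pandas as pd

# Projection that drops the _id field from every returned document.
exclude_column = {'_id': False}
# find() returns a cursor; list() reads all documents into a list of dicts.
mong_data = list(collection.find({}, projection=exclude_column))
iris = pd.DataFrame(mong_data)
print(iris)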
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
595 6.7 3.0 5.2 2.3 virginica
596 6.3 2.5 5.0 1.9 virginica
597 6.5 3.0 5.2 2.0 virginica
598 6.2 3.4 5.4 2.3 virginica
599 5.9 3.0 5.1 1.8 virginica
[600 rows x 5 columns]
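- Why do we need list()? collection.find() returns a cursor, not the data itself. Wrapping it in list() reads every document into mong_data as a list of Python dictionaries (JSON-like records).
- From that list of records, pandas can build the DataFrame directly.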
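To see why list() matters, here is a small sketch that uses a plain Python generator as a stand-in for the cursor returned by collection.find(). The function fake_cursor and its two documents are invented for illustration; a real cursor talks to the server, but it is likewise consumed as you iterate over it.

```python
def fake_cursor():
    """Stand-in for collection.find(): yields documents lazily, one at a time."""
    yield {"Species": "setosa", "Sepal_Length": 5.1}
    yield {"Species": "virginica", "Sepal_Length": 5.9}


cursor = fake_cursor()
documents = list(cursor)  # read every document into a list of dicts
leftover = list(cursor)   # the iterator is already exhausted at this point

print(len(documents))
print(len(leftover))
```

Keeping the list in a variable such as mong_data means the records can be reused without querying again.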
Creating Database and Collection
MongoDB creates a database quickly and automatically if it does not exist, and connects to it. Let's create a database called my1stdatabase.
import pymongo
my1stdatabase = client['my1stdatabase']
my1stcollection = my1stdatabase['my1stcollection']
These two lines are enough to create the database and the collection. Note that MongoDB only creates them for real once the first document is inserted.
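Because creation is lazy, a new database does not appear in the server's database list until it holds data. A rough way to check is sketched below. The helper database_exists is made up for this example, and the last lines use a stand-in object instead of a real client so that the sketch runs offline; with a real connection you would pass client instead.

```python
from types import SimpleNamespace


def database_exists(client, name):
    """Return True if the server already lists a database called `name`."""
    return name in client.list_database_names()


# Offline stand-in for a MongoClient (hypothetical, for demonstration only).
fake_client = SimpleNamespace(list_database_names=lambda: ["admin", "local"])
exists_before_insert = database_exists(fake_client, "my1stdatabase")
print(exists_before_insert)
```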
Insert iris data into Collection
Inserting data into a collection is not difficult. We will use the iris DataFrame we already imported.
my1stcollection.insert_many(iris.to_dict('records'))
<pymongo.results.InsertManyResult at 0x116afc280>
In the MongoDB cloud console, the data now appears under the name my1stcollection instead of iris.
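One detail worth guarding against: insert_many raises a TypeError when it is given an empty list, which can happen if a DataFrame has no rows. The sketch below wraps the call in a small helper. The name insert_records is made up for this example, and fake_collection is an offline stand-in so the code runs without a server.

```python
from types import SimpleNamespace


def insert_records(collection, records):
    """Insert a list of dicts and return how many documents were written."""
    if not records:
        # insert_many rejects an empty list, so skip the call entirely.
        return 0
    result = collection.insert_many(records)
    return len(result.inserted_ids)


# Offline stand-in for a PyMongo collection (hypothetical, for demonstration only).
fake_collection = SimpleNamespace(
    insert_many=lambda docs: SimpleNamespace(inserted_ids=list(range(len(docs))))
)
inserted = insert_records(fake_collection, [{"Species": "setosa"}, {"Species": "virginica"}])
skipped = insert_records(fake_collection, [])
print(inserted, skipped)
```

With the real objects from this post, the call would be insert_records(my1stcollection, iris.to_dict('records')).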
Conclusion
In this post, you have learned that you can set up a MongoDB server very quickly, and that MongoDB is a breeze!
I like the cloud. It is very powerful when working with others. For a data scientist, building infrastructure is a real pain. Docker is powerful, but it frustrated me when I had to study concepts like networks and bridges, so I gave up. I changed my goal to finding a good cloud-based solution, and I found MongoDB Atlas. I still need to study MongoDB more, but I can already see myself becoming a big fan of it, since it can import and export unstructured data such as image URLs and text alongside relational data.
I wanted to stick to a simple example, and I’m sure that it will already be very useful to you. In future posts, we’ll see how to access the server from a remote machine, and how to fill and read the database asynchronously, for an interactive display with bokeh.
For R users, please see my other post: http://rpubs.com/Evan_Jung/r_mongodb