En Atlas mete o dataset de exemplo na túa base de datos.
Engade o enderezo IP autorizado a conectar para que poidas conectar á base de datos.
Proba a conectar dende compass coa URL de conexión que che da.
Da BBDD de proba: sample_mfix, na colección movies: https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/
#!conda install -y pymongo dnspython
Collecting package metadata (current_repodata.json): ...working... done Solving environment: ...working... done ## Package Plan ## environment location: C:\Users\diego.guizanlopez\AppData\Local\miniconda3\envs\bigdata added / updated specs: - dnspython - pymongo The following NEW packages will be INSTALLED: cffi pkgs/main/win-64::cffi-1.16.0-py38h2bbff1b_0 cryptography pkgs/main/win-64::cryptography-42.0.2-py38h89fc84f_0 dnspython pkgs/main/win-64::dnspython-2.4.2-py38haa95532_0 idna pkgs/main/win-64::idna-3.4-py38haa95532_0 pycparser pkgs/main/noarch::pycparser-2.21-pyhd3eb1b0_0 pymongo pkgs/main/win-64::pymongo-3.12.0-py38hd77b12b_0 The following packages will be SUPERSEDED by a higher-priority channel: ca-certificates conda-forge::ca-certificates-2024.2.2~ --> pkgs/main::ca-certificates-2023.12.12-haa95532_0 Downloading and Extracting Packages: ...working... done Preparing transaction: ...working... done Verifying transaction: ...working... done Executing transaction: ...working... done
==> WARNING: A newer version of conda exists. <== current version: 23.9.0 latest version: 24.1.2 Please update conda by running $ conda update -n base -c defaults conda Or to minimize the number of packages updated during conda update use conda install conda=24.1.2
Conecta a Atlas dende Python, carga a colección en Pandas. Averigua o modo máis adecuado.
from pymongo import MongoClient
import pandas as pd
#HOST="(...).mongodb.net"
HOST="z"
PORT=27017
USERNAME="x"
PASSWORD="y"
if HOST == 'localhost':
if not USERNAME:
cli_mongo = MongoClient(HOST, PORT)
else:
cli_mongo = MongoClient(HOST, PORT, USERNAME, PASSWORD)
else:
cli_mongo = MongoClient(f"mongodb+srv://{USERNAME}:{PASSWORD}@{HOST}/?retryWrites=true&w=majority&appName=Diego")
# Tras la conexión al cliente de mongo le indicamos la BBDD y la colección sample_mflix-movies y le decimos que recoja todos los datos
movies = cli_mongo['sample_mflix']['movies']
df = pd.DataFrame(movies.find({}))
A) Contar o total de películas.
# Contar las películas
len(df)
21349
B) Contar o número de películas de cada xénero.
# Número de géneros diferentes
df.explode('genres')['genres'].value_counts()
genres Drama 12385 Comedy 6532 Romance 3318 Crime 2457 Thriller 2454 Action 2381 Adventure 1900 Documentary 1834 Horror 1470 Biography 1269 Family 1249 Mystery 1139 Fantasy 1055 Sci-Fi 958 Animation 912 History 874 Music 780 War 699 Short 442 Musical 440 Sport 366 Western 242 Film-Noir 77 News 44 Talk-Show 1 Name: count, dtype: int64
C) Contar cantas películas hai por ano.
# Películas de cada año
df.groupby('year').size()
year 1896 2 1903 1 1909 1 1911 2 1913 1 .. 2009è 2 2010è 4 2011è 2 2012è 3 2014è 2 Length: 130, dtype: int64
D) Mirar se hai correlación entre imdb e rotten tomatoes.
import numpy as np
# Hace una consulta obteniendo el rating de imdb y el de tomatoes, si no tiene lo remplaza por N/A
valores=[value for value in movies.aggregate([
{
"$project": {
"_id": 0,
"tomatoes": { "$ifNull": ["$tomatoes.viewer.rating", "N/A"] },
"imdb": { "$ifNull": ["$imdb.rating", "N/A"] }
}
}]
)]
# Convertimos los valores a un DataFrame como float64 y remplazamos los N/A por np.nan
dfg=pd.DataFrame(valores)
dfg['imdb']=pd.to_numeric(dfg['imdb']).astype(np.float64)
dfg.replace('N/A',np.nan,inplace=True)
dfg.corr()
tomatoes | imdb | |
---|---|---|
tomatoes | 1.000000 | 0.405018 |
imdb | 0.405018 | 1.000000 |
E) Contar cantas películas ten cada director
# Agrupamos por director y contamos sus apariciones
pd.DataFrame(df.explode('directors')['directors'].value_counts())
count | |
---|---|
directors | |
Woody Allen | 40 |
Martin Scorsese | 32 |
Takashi Miike | 31 |
John Ford | 29 |
Sidney Lumet | 29 |
... | ... |
Mark Levin | 1 |
Jennifer Flackett | 1 |
Alex Bowen | 1 |
Jason Gileno | 1 |
Marcin Koszalka | 1 |
10843 rows × 1 columns
F) Contar en cantas películas ten participado cada actor
# Lo mismo pero con cada uno de los actores
pd.DataFrame(df.explode('cast')['cast'].value_counts())
count | |
---|---|
cast | |
Gèrard Depardieu | 67 |
Robert De Niro | 58 |
Michael Caine | 51 |
Bruce Willis | 49 |
Samuel L. Jackson | 48 |
... | ... |
Shègo Oshinari | 1 |
Charles Lampkin | 1 |
Manoel de Oliveira | 1 |
Miguel èngel Ferriz | 1 |
Mickael Gouin | 1 |
38636 rows × 1 columns
G) Contar cantas películas hai de cada idioma
# Lo mismo que los dos anteriores pero con idiomas
pd.DataFrame(df.explode('languages')['languages'].value_counts())
count | |
---|---|
languages | |
English | 14639 |
French | 2405 |
Spanish | 1600 |
German | 1377 |
Italian | 1228 |
... | ... |
Korean Sign Language | 1 |
Awadhi | 1 |
Mapudungun | 1 |
Nama | 1 |
Tigrigna | 1 |
231 rows × 1 columns
H) Cal é a media de premios das películas
# Recogemos todos los awards y los sumamos dividiendo entre el total de líneas
dfP = pd.DataFrame([dic for dic in df['awards']])
dfP['wins'].mean()
4.063047449529252
J) Amosar as películas que teñen alomenos 3 premios
# Las películas que tienen 3 wins por lo menos
df[df['awards'].str.get('wins')>=3]
_id | plot | genres | runtime | cast | num_mflix_comments | poster | title | fullplot | languages | ... | writers | awards | lastupdated | year | imdb | countries | type | tomatoes | rated | metacritic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18 | 573a1391f29313caabcd944c | Young Harry is in love and wants to marry an a... | [Drama, Romance] | 76.0 | [Greta Garbo, Lewis Stone, Gavin Gordon, Ellio... | 0 | https://m.media-amazon.com/images/M/MV5BYTc4Yj... | Romance | Young Harry is in love and wants to marry an a... | [English] | ... | [Edward Sheldon (from the play by), Bess Mered... | {'wins': 3, 'nominations': 0, 'text': 'Nominat... | 2015-08-08 00:38:18.023000000 | 1930 | {'rating': 6.0, 'votes': 379, 'id': 21310} | [USA] | movie | {'viewer': {'rating': 3.5, 'numReviews': 191, ... | APPROVED | NaN |
21 | 573a1391f29313caabcd96ae | Hildy Johnson, newspaper reporter, is engaged ... | [Comedy] | 101.0 | [Adolphe Menjou, Pat O'Brien, Mary Brian, Edwa... | 0 | https://m.media-amazon.com/images/M/MV5BMjM4ND... | The Front Page | Hildy Johnson, newspaper reporter, is engaged ... | [English] | ... | [Ben Hecht (by), Charles MacArthur (by), Bartl... | {'wins': 5, 'nominations': 0, 'text': 'Nominat... | 2015-09-06 00:10:44.853000000 | 1931 | {'rating': 6.9, 'votes': 1290, 'id': 21890} | [USA] | movie | {'viewer': {'rating': 3.2, 'numReviews': 957, ... | TV-PG | NaN |
22 | 573a1392f29313caabcd97a7 | At an all-girls boarding school, Manuela falls... | [Drama, Romance] | 87.0 | [Emilia Unda, Dorothea Wieck, Hedwig Schlichte... | 0 | https://m.media-amazon.com/images/M/MV5BMDhkNj... | Mèdchen in Uniform | German film in which a sensitive girl is sent ... | [German] | ... | [Christa Winsloe (screenplay), Friedrich Damma... | {'wins': 3, 'nominations': 0, 'text': '3 wins.'} | 2015-09-03 00:52:22.460000000 | 1931 | {'rating': 7.8, 'votes': 1486, 'id': 22183} | [Germany] | movie | {'viewer': {'rating': 4.0, 'numReviews': 667, ... | PASSED | NaN |
33 | 573a1392f29313caabcda3ce | Bea Pullman and her daughter Jessie have had a... | [Drama, Romance] | 111.0 | [Claudette Colbert, Warren William, Rochelle H... | 1 | https://m.media-amazon.com/images/M/MV5BZGExYW... | Imitation of Life | Bea Pullman and her daughter Jessie have had a... | [English] | ... | [Fannie Hurst (novel), William Hurlbut (screen... | {'wins': 4, 'nominations': 0, 'text': 'Nominat... | 2015-08-22 00:06:38.750000000 | 1934 | {'rating': 7.6, 'votes': 2530, 'id': 25301} | [USA] | movie | {'viewer': {'rating': 4.1, 'numReviews': 4477,... | APPROVED | NaN |
37 | 573a1392f29313caabcda654 | In this fictionalized biography, young Pancho ... | [Biography, Western] | 115.0 | [Wallace Beery, Leo Carrillo, Fay Wray, Donald... | 1 | https://m.media-amazon.com/images/M/MV5BZjMwZW... | Viva Villa! | In this fictionalized biography, young Pancho ... | [English] | ... | [Ben Hecht (screen play), Edgecumb Pinchon (su... | {'wins': 4, 'nominations': 4, 'text': 'Won 1 O... | 2015-08-20 00:24:08.127000000 | 1934 | {'rating': 6.6, 'votes': 781, 'id': 25948} | [USA] | movie | {'viewer': {'rating': 3.3, 'numReviews': 235, ... | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21333 | 573a13f7f29313caabde71b9 | Human nature begins where truth ends Welcome t... | [Crime, Drama, Mystery] | 126.0 | [Elaine Jin, Aaron Kwok, Jessie Li, Michael Ning] | 0 | NaN | Port of Call | Human nature begins where truth ends Welcome t... | [Cantonese] | ... | [Philip Yung] | {'wins': 3, 'nominations': 0, 'text': '3 wins.'} | 2015-08-06 00:56:04.570000000 | 2015 | {'rating': 6.7, 'votes': 25, 'id': 4417522} | [Hong Kong] | movie | NaN | NaN | NaN |
21336 | 573a13f8f29313caabde8161 | NaN | NaN | 120.0 | [Aharon Traitel, Khalifa Natour, Riki Blich, G... | 0 | NaN | Tikkun | NaN | [Hebrew, Yiddish] | ... | NaN | {'wins': 7, 'nominations': 1, 'text': '7 wins ... | 2015-09-04 00:39:57.163000000 | 2015 | {'rating': 7.0, 'votes': 23, 'id': 4482858} | [Israel] | movie | NaN | NaN | NaN |
21338 | 573a13f9f29313caabdea63e | Three different love stories, set in three con... | [Drama] | 123.0 | [Tihana Lazovic, Goran Markovic, Nives Ivankov... | 0 | https://m.media-amazon.com/images/M/MV5BMTQ4ND... | The High Sun | Three different love stories, set in three con... | [Croatian] | ... | [Dalibor Matanic] | {'wins': 8, 'nominations': 3, 'text': '8 wins ... | 2015-08-21 00:54:49.047000000 | 2015 | {'rating': 7.8, 'votes': 161, 'id': 4593108} | [Croatia, Serbia, Slovenia] | movie | NaN | NaN | NaN |
21343 | 573a13f9f29313caabdeb527 | After having left a long time ago, a humble co... | [Drama] | 97.0 | [Josè Felipe Cèrdenas, Haimer Leal, Edison Rai... | 0 | https://m.media-amazon.com/images/M/MV5BMTUwZT... | Land and Shade | After having left a long time ago, a humble co... | [Spanish] | ... | [Cèsar Augusto Acevedo] | {'wins': 4, 'nominations': 3, 'text': '4 wins ... | 2015-09-04 00:26:23.710000000 | 2015 | {'rating': 7.6, 'votes': 215, 'id': 4663992} | [Colombia, France, Netherlands, Chile, Brazil] | movie | NaN | NaN | NaN |
21345 | 573a13faf29313caabdec74f | NaN | [Drama] | 104.0 | [Val Maloku, Astrit Kabashi, Adriana Matoshi, ... | 0 | NaN | Babai | NaN | [Albanian] | ... | [Visar Morina] | {'wins': 6, 'nominations': 1, 'text': '6 wins ... | 2015-08-26 00:23:12.820000000 | 2015 | {'rating': 7.0, 'votes': 23, 'id': 4741170} | [Germany, Kosovo, Republic of Macedonia, France] | movie | NaN | NaN | NaN |
7967 rows × 22 columns
K) Contar as películas por país
# Contamos las películas por país arupando y luego seleccionando
pd.DataFrame(df.explode('countries')['countries'].value_counts())
count | |
---|---|
countries | |
USA | 10921 |
UK | 2652 |
France | 2647 |
Germany | 1494 |
Canada | 1260 |
... | ... |
Zaire | 1 |
Zimbabwe | 1 |
Faroe Islands | 1 |
Gabon | 1 |
Vanuatu | 1 |
158 rows × 1 columns