T3.2 Analizar e visualizar estatísticas de datos baixados de MongoDB¶

Conexión a MongoDB dende Python e análise de datos¶

En Atlas mete o dataset de exemplo na túa base de datos.

Engade o enderezo IP autorizado a conectar para que poidas conectar á base de datos.

Proba a conectar dende compass coa URL de conexión que che da.

Da BBDD de proba: sample_mfix, na colección movies: https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/

In [3]:
#!conda install -y pymongo dnspython
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\diego.guizanlopez\AppData\Local\miniconda3\envs\bigdata

  added / updated specs:
    - dnspython
    - pymongo


The following NEW packages will be INSTALLED:

  cffi               pkgs/main/win-64::cffi-1.16.0-py38h2bbff1b_0 
  cryptography       pkgs/main/win-64::cryptography-42.0.2-py38h89fc84f_0 
  dnspython          pkgs/main/win-64::dnspython-2.4.2-py38haa95532_0 
  idna               pkgs/main/win-64::idna-3.4-py38haa95532_0 
  pycparser          pkgs/main/noarch::pycparser-2.21-pyhd3eb1b0_0 
  pymongo            pkgs/main/win-64::pymongo-3.12.0-py38hd77b12b_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2024.2.2~ --> pkgs/main::ca-certificates-2023.12.12-haa95532_0 



Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

==> WARNING: A newer version of conda exists. <==
  current version: 23.9.0
  latest version: 24.1.2

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=24.1.2


Conecta a Atlas dende Python, carga a colección en Pandas. Averigua o modo máis adecuado.

In [4]:
from pymongo import MongoClient
import pandas as pd

#HOST="(...).mongodb.net"
HOST="z"
PORT=27017
USERNAME="x"
PASSWORD="y"

if HOST == 'localhost':
    if not USERNAME:
        cli_mongo = MongoClient(HOST, PORT)
    else:
        cli_mongo = MongoClient(HOST, PORT, USERNAME, PASSWORD)
else:
    cli_mongo = MongoClient(f"mongodb+srv://{USERNAME}:{PASSWORD}@{HOST}/?retryWrites=true&w=majority&appName=Diego")

# Tras la conexión al cliente de mongo le indicamos la BBDD y la colección sample_mflix-movies y le decimos que recoja todos los datos

movies = cli_mongo['sample_mflix']['movies']
df = pd.DataFrame(movies.find({}))

A) Contar o total de películas.

In [5]:
# Contar las películas

len(df)
Out[5]:
21349

B) Contar o número de películas de cada xénero.

In [6]:
# Número de géneros diferentes

df.explode('genres')['genres'].value_counts()
Out[6]:
genres
Drama          12385
Comedy          6532
Romance         3318
Crime           2457
Thriller        2454
Action          2381
Adventure       1900
Documentary     1834
Horror          1470
Biography       1269
Family          1249
Mystery         1139
Fantasy         1055
Sci-Fi           958
Animation        912
History          874
Music            780
War              699
Short            442
Musical          440
Sport            366
Western          242
Film-Noir         77
News              44
Talk-Show          1
Name: count, dtype: int64

C) Contar cantas películas hai por ano.

In [7]:
# Películas de cada año

df.groupby('year').size()
Out[7]:
year
1896     2
1903     1
1909     1
1911     2
1913     1
        ..
2009è    2
2010è    4
2011è    2
2012è    3
2014è    2
Length: 130, dtype: int64

D) Mirar se hai correlación entre imdb e rotten tomatoes.

In [8]:
import numpy as np


# Hace una consulta obteniendo el rating de imdb y el de tomatoes, si no tiene lo remplaza por N/A
valores=[value for value in movies.aggregate([
  {
    "$project": {
      "_id": 0,
      "tomatoes": { "$ifNull": ["$tomatoes.viewer.rating", "N/A"] },
      "imdb": { "$ifNull": ["$imdb.rating", "N/A"] }
    }
  }]
)]

# Convertimos los valores a un DataFrame como float64 y remplazamos los N/A por np.nan

dfg=pd.DataFrame(valores)
dfg['imdb']=pd.to_numeric(dfg['imdb']).astype(np.float64)
dfg.replace('N/A',np.nan,inplace=True)
dfg.corr()
Out[8]:
tomatoes imdb
tomatoes 1.000000 0.405018
imdb 0.405018 1.000000

E) Contar cantas películas ten cada director

In [9]:
# Agrupamos por director y contamos sus apariciones

pd.DataFrame(df.explode('directors')['directors'].value_counts())
Out[9]:
count
directors
Woody Allen 40
Martin Scorsese 32
Takashi Miike 31
John Ford 29
Sidney Lumet 29
... ...
Mark Levin 1
Jennifer Flackett 1
Alex Bowen 1
Jason Gileno 1
Marcin Koszalka 1

10843 rows × 1 columns

F) Contar en cantas películas ten participado cada actor

In [10]:
# Lo mismo pero con cada uno de los actores

pd.DataFrame(df.explode('cast')['cast'].value_counts())
Out[10]:
count
cast
Gèrard Depardieu 67
Robert De Niro 58
Michael Caine 51
Bruce Willis 49
Samuel L. Jackson 48
... ...
Shègo Oshinari 1
Charles Lampkin 1
Manoel de Oliveira 1
Miguel èngel Ferriz 1
Mickael Gouin 1

38636 rows × 1 columns

G) Contar cantas películas hai de cada idioma

In [11]:
# Lo mismo que los dos anteriores pero con idiomas

pd.DataFrame(df.explode('languages')['languages'].value_counts())
Out[11]:
count
languages
English 14639
French 2405
Spanish 1600
German 1377
Italian 1228
... ...
Korean Sign Language 1
Awadhi 1
Mapudungun 1
Nama 1
Tigrigna 1

231 rows × 1 columns

H) Cal é a media de premios das películas

In [16]:
# Recogemos todos los awards y los sumamos dividiendo entre el total de líneas

dfP = pd.DataFrame([dic for dic in df['awards']])
dfP['wins'].mean()
Out[16]:
4.063047449529252

J) Amosar as películas que teñen alomenos 3 premios

In [13]:
# Las películas que tienen 3 wins por lo menos

df[df['awards'].str.get('wins')>=3]
Out[13]:
_id plot genres runtime cast num_mflix_comments poster title fullplot languages ... writers awards lastupdated year imdb countries type tomatoes rated metacritic
18 573a1391f29313caabcd944c Young Harry is in love and wants to marry an a... [Drama, Romance] 76.0 [Greta Garbo, Lewis Stone, Gavin Gordon, Ellio... 0 https://m.media-amazon.com/images/M/MV5BYTc4Yj... Romance Young Harry is in love and wants to marry an a... [English] ... [Edward Sheldon (from the play by), Bess Mered... {'wins': 3, 'nominations': 0, 'text': 'Nominat... 2015-08-08 00:38:18.023000000 1930 {'rating': 6.0, 'votes': 379, 'id': 21310} [USA] movie {'viewer': {'rating': 3.5, 'numReviews': 191, ... APPROVED NaN
21 573a1391f29313caabcd96ae Hildy Johnson, newspaper reporter, is engaged ... [Comedy] 101.0 [Adolphe Menjou, Pat O'Brien, Mary Brian, Edwa... 0 https://m.media-amazon.com/images/M/MV5BMjM4ND... The Front Page Hildy Johnson, newspaper reporter, is engaged ... [English] ... [Ben Hecht (by), Charles MacArthur (by), Bartl... {'wins': 5, 'nominations': 0, 'text': 'Nominat... 2015-09-06 00:10:44.853000000 1931 {'rating': 6.9, 'votes': 1290, 'id': 21890} [USA] movie {'viewer': {'rating': 3.2, 'numReviews': 957, ... TV-PG NaN
22 573a1392f29313caabcd97a7 At an all-girls boarding school, Manuela falls... [Drama, Romance] 87.0 [Emilia Unda, Dorothea Wieck, Hedwig Schlichte... 0 https://m.media-amazon.com/images/M/MV5BMDhkNj... Mèdchen in Uniform German film in which a sensitive girl is sent ... [German] ... [Christa Winsloe (screenplay), Friedrich Damma... {'wins': 3, 'nominations': 0, 'text': '3 wins.'} 2015-09-03 00:52:22.460000000 1931 {'rating': 7.8, 'votes': 1486, 'id': 22183} [Germany] movie {'viewer': {'rating': 4.0, 'numReviews': 667, ... PASSED NaN
33 573a1392f29313caabcda3ce Bea Pullman and her daughter Jessie have had a... [Drama, Romance] 111.0 [Claudette Colbert, Warren William, Rochelle H... 1 https://m.media-amazon.com/images/M/MV5BZGExYW... Imitation of Life Bea Pullman and her daughter Jessie have had a... [English] ... [Fannie Hurst (novel), William Hurlbut (screen... {'wins': 4, 'nominations': 0, 'text': 'Nominat... 2015-08-22 00:06:38.750000000 1934 {'rating': 7.6, 'votes': 2530, 'id': 25301} [USA] movie {'viewer': {'rating': 4.1, 'numReviews': 4477,... APPROVED NaN
37 573a1392f29313caabcda654 In this fictionalized biography, young Pancho ... [Biography, Western] 115.0 [Wallace Beery, Leo Carrillo, Fay Wray, Donald... 1 https://m.media-amazon.com/images/M/MV5BZjMwZW... Viva Villa! In this fictionalized biography, young Pancho ... [English] ... [Ben Hecht (screen play), Edgecumb Pinchon (su... {'wins': 4, 'nominations': 4, 'text': 'Won 1 O... 2015-08-20 00:24:08.127000000 1934 {'rating': 6.6, 'votes': 781, 'id': 25948} [USA] movie {'viewer': {'rating': 3.3, 'numReviews': 235, ... NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21333 573a13f7f29313caabde71b9 Human nature begins where truth ends Welcome t... [Crime, Drama, Mystery] 126.0 [Elaine Jin, Aaron Kwok, Jessie Li, Michael Ning] 0 NaN Port of Call Human nature begins where truth ends Welcome t... [Cantonese] ... [Philip Yung] {'wins': 3, 'nominations': 0, 'text': '3 wins.'} 2015-08-06 00:56:04.570000000 2015 {'rating': 6.7, 'votes': 25, 'id': 4417522} [Hong Kong] movie NaN NaN NaN
21336 573a13f8f29313caabde8161 NaN NaN 120.0 [Aharon Traitel, Khalifa Natour, Riki Blich, G... 0 NaN Tikkun NaN [Hebrew, Yiddish] ... NaN {'wins': 7, 'nominations': 1, 'text': '7 wins ... 2015-09-04 00:39:57.163000000 2015 {'rating': 7.0, 'votes': 23, 'id': 4482858} [Israel] movie NaN NaN NaN
21338 573a13f9f29313caabdea63e Three different love stories, set in three con... [Drama] 123.0 [Tihana Lazovic, Goran Markovic, Nives Ivankov... 0 https://m.media-amazon.com/images/M/MV5BMTQ4ND... The High Sun Three different love stories, set in three con... [Croatian] ... [Dalibor Matanic] {'wins': 8, 'nominations': 3, 'text': '8 wins ... 2015-08-21 00:54:49.047000000 2015 {'rating': 7.8, 'votes': 161, 'id': 4593108} [Croatia, Serbia, Slovenia] movie NaN NaN NaN
21343 573a13f9f29313caabdeb527 After having left a long time ago, a humble co... [Drama] 97.0 [Josè Felipe Cèrdenas, Haimer Leal, Edison Rai... 0 https://m.media-amazon.com/images/M/MV5BMTUwZT... Land and Shade After having left a long time ago, a humble co... [Spanish] ... [Cèsar Augusto Acevedo] {'wins': 4, 'nominations': 3, 'text': '4 wins ... 2015-09-04 00:26:23.710000000 2015 {'rating': 7.6, 'votes': 215, 'id': 4663992} [Colombia, France, Netherlands, Chile, Brazil] movie NaN NaN NaN
21345 573a13faf29313caabdec74f NaN [Drama] 104.0 [Val Maloku, Astrit Kabashi, Adriana Matoshi, ... 0 NaN Babai NaN [Albanian] ... [Visar Morina] {'wins': 6, 'nominations': 1, 'text': '6 wins ... 2015-08-26 00:23:12.820000000 2015 {'rating': 7.0, 'votes': 23, 'id': 4741170} [Germany, Kosovo, Republic of Macedonia, France] movie NaN NaN NaN

7967 rows × 22 columns

K) Contar as películas por país

In [14]:
# Contamos las películas por país arupando y luego seleccionando

pd.DataFrame(df.explode('countries')['countries'].value_counts())
Out[14]:
count
countries
USA 10921
UK 2652
France 2647
Germany 1494
Canada 1260
... ...
Zaire 1
Zimbabwe 1
Faroe Islands 1
Gabon 1
Vanuatu 1

158 rows × 1 columns